References
[1] Anthropic (2024). Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku. URL https://www.anthropic.com/news/3-5-models-and-computer-use
[2] Baechler G, Sunkara S, Wang M, et al. (2024). ScreenAI: A vision-language model for UI and infographics understanding. In: Proc. of the 33rd IJCAI, pp 3058–3068. https://doi.org/10.24963/ijcai.2024/339
[3] Chen Q, Pitawela D, Zhao C, et al. (2024b). WebVLN: Vision-and-language navigation on websites. In: Proc. of the AAAI Conference on Artificial Intelligence, 38(2), pp 1165–1173. https://doi.org/10.1609/aaai.v38i2.27878
[4] Cheng K, Sun Q, Chu Y, et al. (2024). SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In: Proc. of the 62nd Annual Meeting of the ACL. ACL, pp 9313–9332. https://doi.org/10.18653/v1/2024.acl-long.505
[5] Deng S, Xu W, Sun H, et al. (2024a). Mobile-Bench: An evaluation benchmark for LLM-based mobile agents. In: Proc. of the 62nd Annual Meeting of the ACL. ACL, pp 8813–8831. https://doi.org/10.18653/v1/2024.acl-long.478
[6] Furuta H, Lee KH, Nachum O, et al. (2024). Multimodal web navigation with instruction-finetuned foundation models. In: Proc. of the 12th ICLR. URL https://openreview.net/forum?id=upKyBTClJm
[7] Gao L, Madaan A, Zhou S, et al. (2023). PAL: Program-aided language models. In: Proc. of the 40th ICML. PMLR, pp 10764–10799
[8] Guan Y, Wang D, Chu Z, et al. (2023). Intelligent Virtual Assistants with LLM-based Process Automation. In: Proc. of the 30th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining. ACM, pp 5018–5027. https://doi.org/10.1145/3637528.3671646
[9] He H, Yao W, Ma K, et al. (2024). WebVoyager: Building an end-to-end web agent with large multimodal models. In: Proc. of the 62nd Annual Meeting of the ACL. ACL, pp 6864–6890. https://doi.org/10.18653/v1/2024.acl-long.371
[10] Hong W, Wang W, Lv Q, et al. (2024). CogAgent: A visual language model for GUI agents. In: Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp 14281–14290. https://doi.org/10.1109/CVPR52733.2024.01354
[11] Kambhampati S (2024). Can Large Language Models Reason and Plan? Annals of the New York Academy of Sciences. https://doi.org/10.1111/nyas.15125
[12] Kapoor R, Butala YP, Russak M, et al. (2024). OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In: Proc. of the ECCV. Springer, pp 161–178. https://doi.org/10.1007/978-3-031-73113-6_10
[13] Koh JY, McAleer S, Fried D, et al. (2024b). Tree Search for Language Model Agents. https://doi.org/10.48550/arXiv.2407.01476
[14] Lee J, Min T, An M, et al. (2024). Benchmarking Mobile Device Control Agents across Diverse Configurations. https://doi.org/10.48550/arXiv.2404.16660
[15] Li C, Gan Z, Yang Z, et al. (2024a). Multimodal foundation models: From specialists to general-purpose assistants. Foundations and Trends in Computer Graphics and Vision, 16(1-2):1–214. https://doi.org/10.1561/0600000110
[16] Li T, Li G, Zheng J, et al. (2024b). MUG: Interactive multimodal grounding on user interfaces. In: Findings of the ACL: EACL 2024. ACL, pp 231–251
[17] Li W, Bishop W, Li A, et al. (2024c). On the effects of data scale on computer control agents. https://doi.org/10.48550/arXiv.2406.03679
[18] Liu X, Yu H, Zhang H, et al. (2024). AgentBench: Evaluating LLMs as Agents. In: Proc. of the 12th ICLR. https://openreview.net/forum?id=zAdUB0aCTQ
[19] Zheng B, Gou B, Kil J, et al. (2024a). GPT-4V(ision) is a generalist web agent, if grounded. In: Proc. of the 41st ICML. PMLR, pp 61349–61385