MacOS Agent: an efficient Computer Use Agent for macOS
BIGAI ML Group
Meet the Automation of Computer Use
We build a Computer Use Agent that redefines how you interact with macOS. It solves diverse and complex tasks across commonly used applications.
Multimodal Understanding
Interprets screen content and task requirements to provide clear progress tracking and next steps.
Long-horizon Reasoning
Executes multi-step tasks through natural interactions with the macOS interface.
Cross-App Workflows
Coordinates tasks across different applications with consistent performance and reliability.
The agent is fully open source: the code and prompts are included, with the data and training code to follow.
System Architecture
MacOS Agent is a hierarchical multi-agent system that connects human intent with application-specific actions. It has three key components:
- ComputerUse Agent: The top-level agent that interfaces with human users. It receives natural language instructions and generates high-level execution plans, which are then forwarded to the MacAgent. This agent operates at an abstract level without direct access to application controls.
- MacAgent: The central coordinator that receives plans from the ComputerUse Agent. It analyzes these plans and determines the optimal execution strategy by:
  - Identifying which app agents are needed
  - Generating executable code to orchestrate these agents
  - Managing the reactive workflow between different app agents
- App Agents: A collection of ten specialized agents that directly interface with macOS applications:
  - Document Processing: Word Agent, TextEdit Agent
  - Data & Presentations: Excel Agent, PowerPoint Agent
  - System & Navigation: Finder Agent, Browser Agent
  - Media & Communication: Preview Agent, QuickTime Agent, WeChat Agent, Calendar Agent
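The app-agent grouping above can be pictured as a small registry. The following is a minimal sketch in Python, not the project's actual code: the names `APP_AGENTS`, `all_agents`, and `category_of` are hypothetical, and the category keys simply mirror the list.

```python
# Hypothetical registry mirroring the app-agent groups listed above.
# Names and structure are illustrative assumptions, not the project's API.
APP_AGENTS: dict[str, list[str]] = {
    "Document Processing": ["Word", "TextEdit"],
    "Data & Presentations": ["Excel", "PowerPoint"],
    "System & Navigation": ["Finder", "Browser"],
    "Media & Communication": ["Preview", "QuickTime", "WeChat", "Calendar"],
}


def all_agents() -> list[str]:
    """Flatten the registry into a single list of agent names."""
    return [name for group in APP_AGENTS.values() for name in group]


def category_of(agent: str) -> str:
    """Return the category an app agent belongs to, or raise KeyError."""
    for category, names in APP_AGENTS.items():
        if agent in names:
            return category
    raise KeyError(f"unknown app agent: {agent}")
```

A coordinator could use a lookup like `category_of("Excel")` to decide which group of agents a plan step belongs to before dispatching it.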
This architecture enables sophisticated task execution through coordinated agent interactions. The MacAgent's ability to dynamically orchestrate app agents allows for complex workflows while maintaining clear separation of concerns between planning and execution.
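The split between planning and execution can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the class names mirror the components described above, but every method, the keyword-based agent selection, and the stub app agents are hypothetical stand-ins. In the described system the planning and coordination would be model-driven, and the MacAgent generates executable code rather than calling functions directly.

```python
from dataclasses import dataclass, field
from typing import Callable

# An app agent is modelled as a callable taking a plan step and a shared context.
AppAgent = Callable[[str, dict], str]


def finder_agent(step: str, context: dict) -> str:
    """Stub app agent (hypothetical): pretends to act in Finder."""
    return f"Finder handled: {step}"


def excel_agent(step: str, context: dict) -> str:
    """Stub app agent (hypothetical): reacts to the previous agent's result."""
    previous = context.get("last_result", "nothing")
    return f"Excel handled: {step} (after {previous})"


class ComputerUseAgent:
    """Top level: turns an instruction into a plan, with no app access."""

    def plan(self, instruction: str) -> list[str]:
        # Stand-in for model-generated planning: split on " then ".
        return [part.strip() for part in instruction.split(" then ")]


@dataclass
class MacAgent:
    """Coordinator: selects an app agent per step and runs the steps in order."""

    agents: dict[str, AppAgent] = field(default_factory=dict)

    def select(self, step: str) -> AppAgent:
        # Hypothetical routing rule: match the app name inside the step text.
        for name, agent in self.agents.items():
            if name.lower() in step.lower():
                return agent
        raise LookupError(f"no app agent for step: {step}")

    def execute(self, plan: list[str]) -> list[str]:
        context: dict = {}
        results = []
        for step in plan:
            result = self.select(step)(step, context)
            context["last_result"] = result  # visible to the next agent
            results.append(result)
        return results


planner = ComputerUseAgent()
coordinator = MacAgent({"Finder": finder_agent, "Excel": excel_agent})
results = coordinator.execute(
    planner.plan("open report in Finder then sum column B in Excel")
)
```

Here the planner never touches an app agent, and the coordinator never interprets the user's instruction; the shared `context` dictionary is one simple way to let a later agent react to an earlier agent's output.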
Agent Execution Traces
Citation
@article{zhang2025tongui,
  title={TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials},
  author={Zhang, Bofei and Shang, Zirui and Gao, Zhi and Zhang, Wang and Xie, Rui and Ma, Xiaojian and Yuan, Tao and Wu, Xinxiao and Zhu, Song-Chun and Li, Qing},
  journal={arXiv preprint arXiv:2504.12679},
  year={2025}
}

@article{li2025iterative,
  title={Iterative Trajectory Exploration for Multimodal Agents},
  author={Li, Pengxiang and Gao, Zhi and Zhang, Bofei and Mi, Yapeng and Ma, Xiaojian and Shi, Chenrui and Yuan, Tao and Wu, Yuwei and Jia, Yunde and Zhu, Song-Chun and others},
  journal={arXiv preprint arXiv:2504.21561},
  year={2025}
}