Salesforce AI Research has introduced GTA1, a new graphical user interface (GUI) agent that advances the state of the art in agentic human-computer interaction. Designed to operate autonomously in real operating system environments such as Linux, GTA1 addresses two critical bottlenecks in GUI agent development: ambiguous task planning and inaccurate grounding of actions. With a 45.2% task success rate on the OSWorld benchmark, GTA1 surpasses OpenAI's CUA (Computer-Using Agent), establishing a new record among open-source models.

Core challenges in GUI agents
GUI agents typically translate high-level user instructions into sequences of actions (clicks, keystrokes, or other UI interactions) while observing the UI after each action to plan the next step. However, two problems persist:
- Planning ambiguity: Several valid action sequences can accomplish the same task, leading to execution paths that vary in efficiency and reliability.
- Grounding precision: Translating abstract action proposals into precise, coordinate-level GUI interactions is particularly difficult in dynamic, high-resolution interfaces.
GTA1 introduces new mechanisms to resolve both.
Smarter planning via test-time scaling
Traditional planners commit to a single action proposal at each decision point, which limits robustness. GTA1's test-time scaling introduces a simple but effective solution: sample several candidate actions concurrently at each step, then use a multimodal judge model, typically a large language model, to evaluate them and select the most suitable one.
This technique avoids premature commitment to sub-optimal plans and lets the agent explore execution paths more thoroughly without requiring future rollouts, which are infeasible in GUI environments because actions are often irreversible. Notably, the method works with any planner and scales well as task complexity and the size of the action space increase.
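The sample-then-judge step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GTA1's implementation: `select_action`, `toy_planner`, and `toy_judge` are hypothetical names, and the planner and judge are plain stand-in functions where a real system would call multimodal models.

```python
import itertools
from typing import Callable, List, Sequence


def select_action(
    propose: Callable[[str, str], str],
    judge: Callable[[str, str, Sequence[str]], int],
    instruction: str,
    observation: str,
    num_candidates: int = 4,
) -> str:
    """Sample several candidate actions, then let a judge pick one of them."""
    candidates: List[str] = [
        propose(instruction, observation) for _ in range(num_candidates)
    ]
    # Drop duplicates (order preserved) so the judge compares distinct options.
    unique = list(dict.fromkeys(candidates))
    choice = judge(instruction, observation, unique)
    if not 0 <= choice < len(unique):
        raise ValueError(f"judge returned invalid index {choice}")
    return unique[choice]


# Hypothetical stand-ins: a real planner would sample from a model with
# temperature > 0, and a real judge would be a multimodal LLM.
_proposals = itertools.cycle(
    ['click("File")', 'click("Save")', 'press("ctrl+s")', 'click("File")']
)


def toy_planner(instruction: str, observation: str) -> str:
    return next(_proposals)


def toy_judge(instruction: str, observation: str, candidates: Sequence[str]) -> int:
    # Toy preference: pick a keyboard shortcut if one was proposed.
    for index, candidate in enumerate(candidates):
        if candidate.startswith("press"):
            return index
    return 0


best = select_action(
    toy_planner, toy_judge, "Save the document", "editor window", num_candidates=4
)
print(best)
```

Because selection happens before anything is executed, only the chosen action ever touches the interface, which is why this approach does not need to undo actions or simulate future states.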
Reinforcement learning for grounding accuracy
For GUI grounding, most previous models rely on supervised fine-tuning to predict the center of the target UI element, which limits generalization. GTA1 instead adopts a reinforcement learning (RL) framework based on Group Relative Policy Optimization (GRPO). Rather than relying on intermediate reasoning ("thinking") or bounding-box prediction, the model learns directly from click-based rewards: it is rewarded only when the predicted coordinate falls inside the correct UI element.
With this reward structure, GTA1 achieves state-of-the-art accuracy without the complexity or overhead of chain-of-thought supervision. Notably, an ablation study shows that removing auxiliary signals such as "thinking" or IoU-based box rewards actually improves grounding performance, particularly in static environments.
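To make the reward concrete, here is a minimal sketch of a click-based reward together with the group-relative advantage normalization commonly used in GRPO. The function names, the box format `(x_min, y_min, x_max, y_max)`, and the sample coordinates are assumptions for illustration; the exact reward shaping and normalization used to train GTA1 may differ.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def click_reward(point: Point, target_box: Box) -> float:
    """Return 1.0 if the predicted click lands inside the target element, else 0.0."""
    x, y = point
    x_min, y_min, x_max, y_max = target_box
    return 1.0 if (x_min <= x <= x_max and y_min <= y <= y_max) else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled clicks for one instruction.
target = (100.0, 200.0, 180.0, 240.0)
samples = [(120.0, 220.0), (90.0, 220.0), (179.0, 239.0), (300.0, 300.0)]

rewards = [click_reward(p, target) for p in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

Clicks inside the element receive a positive advantage and the others a negative one, so the policy is pushed toward any coordinate within the target rather than toward its exact center.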
Performance across benchmarks

GTA1 sets a new standard across several evaluations:
- OSWorld (task success rate): GTA1-7B reaches 45.2%, outperforming OpenAI CUA (42.9%) and Claude 3.7 (28.0%).
- ScreenSpot-Pro (grounding accuracy): GTA1-7B scores 50.1%, ahead of models like UGround-72B (34.5%).
- ScreenSpot-V2 (cross-platform grounding): GTA1-72B reaches 94.8%, nearly matching the best proprietary models.
- OSWorld-G (Linux GUI grounding): GTA1-7B reaches 67.7%, surpassing all previous open-source approaches.
These results validate the effectiveness of the planning and grounding innovations introduced in GTA1.
Additional design highlights
- Data cleaning: Misaligned annotations from datasets like Aria-UI and OS-Atlas are filtered using OmniParser to improve the fidelity of the training signal.
- Model scaling: The approach scales well across models from 7B to 72B parameters, with GTA1-7B offering the best trade-off between performance and compute.
- Judge reusability: The multimodal judge used in test-time scaling can be the same LLM used for planning, which reduces overhead.
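As an illustration of the data-cleaning idea in the list above, the sketch below keeps an annotation only when it overlaps some detected UI element closely enough. The article does not state the actual filtering criterion, so the IoU test, the 0.5 threshold, and the function names are assumptions; the detected boxes are passed in as plain tuples standing in for a detector's output, and no OmniParser API is called.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_annotations(
    annotations: Sequence[Box], detected: Sequence[Box], min_iou: float = 0.5
) -> List[Box]:
    """Keep annotations that match at least one detected element (assumed criterion)."""
    return [
        ann for ann in annotations if any(iou(ann, det) >= min_iou for det in detected)
    ]


# Hypothetical detector output and dataset annotations for one screenshot.
detected_elements = [(10.0, 10.0, 110.0, 50.0), (200.0, 300.0, 260.0, 340.0)]
dataset_boxes = [
    (12.0, 12.0, 108.0, 52.0),     # closely matches the first detected element
    (400.0, 400.0, 450.0, 430.0),  # matches nothing, treated as misaligned
    (205.0, 300.0, 265.0, 340.0),  # closely matches the second detected element
]

kept = filter_annotations(dataset_boxes, detected_elements)
print(kept)
```

Dropping annotations that match no detected element matters for click-based rewards in particular, since a misplaced target box would reward the wrong clicks.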
Conclusion
GTA1 demonstrates that robust and accurate GUI agents can be built with a modular two-stage framework, enhanced by test-time planning diversity and precise RL-based grounding. By eliminating unnecessary complexity, such as chain-of-thought reasoning in static tasks, Salesforce AI has introduced a lean and effective agent architecture that pushes the frontier of open-ended digital interaction.
Check out the Paper, Codes, 7B Model, 32B Model, and 72B Model. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has more than 2 million monthly views, illustrating its popularity among readers.
