Web automation agents have become an increasingly active objective of artificial intelligence research, in particular because of their ability to carry out human-like actions in digital environments. These agents interact with websites through graphical user interfaces (GUIs), imitating human behaviors such as clicking, typing, and navigating across web pages. This approach bypasses the need for dedicated application programming interfaces (APIs), which are often unavailable or limited in many web applications. Instead, these agents can operate universally across web domains, making them flexible tools for a wide range of tasks. The evolution of large language models (LLMs) has enabled these agents not only to interpret web content but also to reason, plan, and act with increasing sophistication. As their capabilities grow, so does the need to evaluate them on more than simple navigation. Benchmarks that once challenged early models are no longer able to measure the full extent of modern agents' capabilities.
As these web agents progress, a pressing problem emerges: their competence at handling mundane, memory-intensive, multi-step digital tasks remains insufficiently measured. Many tasks that humans perform on websites, such as retrieving data from different pages, performing calculations based on earlier inputs, or applying complex rules, demand significant cognitive effort. These are not merely navigation challenges; they test long-term memory, logic, and planning. Yet most benchmarks focus on simplified scenarios and do not reflect the kinds of digital chores people often prefer to avoid. Moreover, the limits of these benchmarks become more apparent as agents improve. Ambiguities in task instructions or inconsistencies in expected outputs begin to distort evaluations: when agents generate reasonable but slightly divergent answers, they are penalized incorrectly because of vague task definitions. These flaws make it difficult to distinguish genuine model limitations from benchmark shortcomings.
Previous efforts to evaluate web agents have centered on benchmarks such as WebArena. WebArena gained widespread adoption thanks to its reproducibility and its ability to simulate real-world websites, including Reddit, GitLab, and e-commerce platforms. It offered more than 800 tasks designed to test an agent's ability to accomplish web objectives in these environments. However, those tasks focused mainly on general navigation and did not adequately challenge more advanced agents. Other benchmarks, such as Mind2Web, GAIA, and MMInA, contributed by exploring real-world web tasks or platform-specific environments like ServiceNow, but each came with trade-offs: some lacked interactivity, others did not support reproducibility, and some were too narrowly scoped. These limitations left a gap in measuring agents' progress in areas that require complex decision-making, long-term memory, and precise data processing across multiple web pages.
Researchers from the University of Tokyo introduced WebChoreArena. This expanded framework builds on WebArena's structure but considerably raises the difficulty and complexity of the tasks. WebChoreArena offers a total of 532 newly curated tasks, distributed across the same four simulated websites. These tasks are designed to be more demanding, reflecting scenarios in which agents must engage in data aggregation, memory recall, and multi-step reasoning. Crucially, the benchmark was built to ensure full reproducibility and standardization, enabling fair comparisons between agents and avoiding the ambiguities found in earlier tools. The inclusion of diverse task types and input methods helps simulate realistic web usage and evaluates agents on a more practical and more challenging scale.
WebChoreArena classifies its tasks into four main types. One hundred and seventeen tasks fall under Massive Memory, requiring agents to extract and remember large volumes of information, such as compiling the names of all customers linked to high-value transactions. Calculation tasks, numbering 132, involve arithmetic operations such as identifying the highest-spending month from several data points. Long-Term Memory tasks number 127 and test an agent's ability to connect information across different pages, such as retrieving the pricing rules from one site and applying them on another. The remaining 65 tasks are classified as Others, covering operations such as assigning labels on GitLab that do not fit the traditional task formats. Each task also specifies its input modality: 451 tasks can be solved with any type of observation, 69 require text-only inputs, and 12 depend exclusively on image inputs.
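To make the task taxonomy and input modalities concrete, here is a minimal, hypothetical sketch in Python of how such a task record could be represented. The class, field, and enum names are illustrative assumptions for exposition, not WebChoreArena's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical labels mirroring the four task categories described above.
class TaskType(Enum):
    MASSIVE_MEMORY = "massive_memory"
    CALCULATION = "calculation"
    LONG_TERM_MEMORY = "long_term_memory"
    OTHER = "other"

# Hypothetical labels mirroring the three input modalities described above.
class InputModality(Enum):
    ANY = "any"           # 451 tasks: solvable from any observation type
    TEXT_ONLY = "text"    # 69 tasks: require textual observations
    IMAGE_ONLY = "image"  # 12 tasks: depend exclusively on image input

@dataclass
class WebChoreTask:
    task_id: int
    site: str                 # e.g. "shopping", "shopping_admin", "reddit", "gitlab"
    instruction: str          # natural-language description of the chore
    task_type: TaskType
    input_modality: InputModality
    expected_answer: str      # reference answer used during evaluation

# Example instance, paraphrasing a Massive Memory chore from the article.
task = WebChoreTask(
    task_id=1,
    site="shopping_admin",
    instruction="Compile the names of all customers linked to high-value transactions.",
    task_type=TaskType.MASSIVE_MEMORY,
    input_modality=InputModality.ANY,
    expected_answer="...",
)
```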
To evaluate the benchmark, the researchers used three major large language models: GPT-4o, Claude 3.7 Sonnet, and Gemini 2.5 Pro. These were tested in conjunction with two advanced web agents, AgentOccam and BrowserGym. The results highlighted WebChoreArena's increased difficulty compared with earlier benchmarks. GPT-4o, which had reached 42.8% accuracy on WebArena, managed only 6.8% on WebChoreArena. Claude 3.7 Sonnet and Gemini 2.5 Pro performed better, with Gemini reaching a peak accuracy of 44.9%. Even as the strongest result, this still reflects significant capability gaps when handling WebChoreArena's more complex tasks. The benchmark also proved more sensitive at detecting performance differences between models, making it a valuable tool for tracking continued progress in web agent technologies.
Several key takeaways from the research include:
- WebChoreArena comprises 532 tasks: 117 Massive Memory, 132 Calculation, 127 Long-Term Memory, and 65 Others.
- The tasks are distributed across Shopping (117), Shopping Admin (132), Reddit (91), and GitLab (127), plus 65 cross-site scenarios.
- Input types: 451 tasks can be solved with any input, 69 require text input, and 12 need image input.
- GPT-4o scored only 6.8% on WebChoreArena, compared with 42.8% on WebArena.
- Gemini 2.5 Pro achieved the highest score at 44.9%, indicating the current limits of agents on complex chore-like tasks.
- WebChoreArena provides a clearer performance gradient between models than WebArena, improving its value for comparative analysis.
- In total, 117 task templates were used to ensure diversity and reproducibility, yielding roughly 4.5 instances per template.
- The benchmark required more than 300 hours of annotation and refinement, reflecting its rigorous construction.
- Evaluations use string matching, URL matching, and HTML structure comparison to assess correctness (sketched below).
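To illustrate how those three checks might fit together, here is a minimal sketch assuming a simple dispatch over the evaluation modes the article names. The function names, signatures, and normalization choices are assumptions for illustration, not WebChoreArena's actual evaluator API.

```python
from urllib.parse import urlparse

def string_match(predicted: str, reference: str) -> bool:
    # Exact match after normalizing surrounding whitespace and case.
    return predicted.strip().lower() == reference.strip().lower()

def url_match(final_url: str, reference_url: str) -> bool:
    # Compare host and path, ignoring scheme, query, and trailing slashes.
    p, r = urlparse(final_url), urlparse(reference_url)
    return (p.netloc, p.path.rstrip("/")) == (r.netloc, r.path.rstrip("/"))

def html_structure_match(page_html: str, required_fragment: str) -> bool:
    # Crude containment check standing in for a real structural comparison,
    # which would parse the DOM and compare selected elements.
    return required_fragment in page_html

def evaluate(eval_type: str, prediction: str, reference: str) -> bool:
    # Dispatch on the task's declared evaluation type (hypothetical labels).
    checks = {
        "string": string_match,
        "url": url_match,
        "html": html_structure_match,
    }
    return checks[eval_type](prediction, reference)

# Usage example: a string-matched answer passes despite casing differences.
print(evaluate("string", "  Total: $42 ", "total: $42"))  # True
```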
In conclusion, this research highlights the gap between general navigation competence and the higher-order cognitive capabilities required for web-based chores. The newly introduced WebChoreArena is a robust, detailed benchmark designed specifically to push web agents into territory where they must rely on reasoning, memory, and logic. It replaces ambiguity with standardization, and its tasks imitate the digital chores that agents must learn to handle if they are to become genuinely useful for automating real-world activities.
Check out the Paper, GitHub page, and Project page. All credit for this research goes to the researchers on this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent venture is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a broad audience. The platform boasts more than 2 million monthly views, illustrating its popularity among readers.
