UC Berkeley Introduces CyberGym: A Real-World Cybersecurity Evaluation Framework for Assessing AI Agents on Large-Scale Vulnerabilities Across Massive Codebases

by Brenden Burgess


Cybersecurity has become an important area of interest for artificial intelligence, driven by growing dependence on large software systems and the expanding capabilities of AI tools. As threats grow in complexity, securing software systems has become more than a matter of conventional defenses; it now involves automated reasoning, vulnerability detection, and code-level understanding. Modern cybersecurity requires tools and methods that can simulate real-world scenarios, identify hidden flaws, and validate system integrity across diverse software infrastructure. In this environment, researchers have developed benchmarks and methods to systematically evaluate how well AI agents can understand, detect, and even exploit vulnerabilities, drawing parallels with human security researchers. However, bridging the gap between AI reasoning and the complexities of real-world cybersecurity remains a key challenge.

The Problem with Existing Benchmarks

A pressing problem is the lack of effective ways to assess whether AI systems are truly capable of understanding and handling security tasks under realistic conditions. Simplified benchmark tasks dominate current testing methods, and they rarely reflect the messy, layered reality of large-scale software codebases. These environments involve complex input conditions, deep code paths, and subtle vulnerabilities that demand more than surface-level inspection. Without robust evaluation methods, it is difficult to determine whether AI agents can be trusted with tasks such as vulnerability detection or exploit development. More importantly, current benchmarks do not capture the scale and nuance of vulnerabilities found in widely used software systems, leaving a critical evaluation gap.

Limitations of Current Tools

Several benchmarks have been used to evaluate cybersecurity capabilities, including Cybench and the NYU CTF Bench. These focus on Capture-the-Flag (CTF)-style tasks that offer limited complexity, generally involving small codebases and constrained test environments. Some benchmarks attempt to incorporate real-world vulnerabilities, but they often do so at a limited scale. In addition, many tools rely on synthetic test cases or curated challenge problems, which do not represent the diversity of software inputs, execution paths, and bug types found in real systems. Even specialized agents built for security analysis have been tested on benchmarks with only dozens or a few hundred tasks, well below the complexity of real threat landscapes.

Introducing CyberGym

The researchers introduced CyberGym, a large-scale benchmarking framework specifically designed to evaluate AI agents in realistic cybersecurity contexts. Developed at the University of California, Berkeley, CyberGym includes 1,507 distinct benchmark tasks drawn from real vulnerabilities found and patched across 188 major open-source software projects. These vulnerabilities were originally identified by OSS-Fuzz, a continuous fuzzing campaign maintained by Google. To ensure realism, each benchmark instance includes the complete pre-patch codebase, an executable, and a textual description of the vulnerability. Agents must generate a proof-of-concept test that reproduces the vulnerability in the unpatched version, and CyberGym judges success by whether the vulnerability is triggered in the pre-patch version and absent in the post-patch version. The benchmark places particular emphasis on proof-of-concept (PoC) generation, a task that requires agents to traverse complex code paths and synthesize inputs that satisfy specific security conditions. CyberGym is modular and containerized, allowing easy expansion and reproducibility.
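To make that success criterion concrete, here is a minimal Python sketch of the pre-/post-patch check described above. The function name, binary paths, and crash-detection heuristic are illustrative assumptions, not CyberGym's actual harness, which builds and runs OSS-Fuzz targets inside containers.

```python
import subprocess

def poc_reproduces_vulnerability(poc_path: str,
                                 pre_patch_bin: str,
                                 post_patch_bin: str,
                                 timeout_s: int = 30) -> bool:
    """Hypothetical check mirroring the stated success criterion:
    the PoC must crash the pre-patch target but not the post-patch one."""

    def crashes(binary: str) -> bool:
        # Sanitizer-instrumented fuzz targets typically exit non-zero
        # (or are killed by a signal) when the bug is triggered.
        try:
            result = subprocess.run(
                [binary, poc_path],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang is not counted as a reproduction here
        return result.returncode != 0

    return crashes(pre_patch_bin) and not crashes(post_patch_bin)
```

The key design point is the differential check: a crash alone is not enough, since only inputs that stop crashing once the fix is applied demonstrably exercise the patched vulnerability.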

CyberGym's Evaluation Levels

The CyberGym evaluation pipeline is built around four difficulty levels, each increasing the amount of input information provided. At level 0, the agent receives only the codebase, with no hint about the vulnerability. Level 1 adds a natural-language description. Level 2 provides the ground-truth proof of concept (PoC) and a crash stack trace, while level 3 includes the patch itself and the post-patch codebase. Each level adds a new layer of reasoning and complexity; at level 1, for example, agents must infer the location and context of the vulnerability from its textual description and the codebase alone. To guarantee benchmark quality, CyberGym applies filters such as checking information in patch commit messages, validating the reproducibility of the proof of concept (PoC), and removing redundancy by comparing stack traces. The final dataset includes codebases with a median of 1,117 files and 387,491 lines of code, ranging up to more than 40,000 files and 7 million lines of code. Patch sizes also vary, modifying a median of 1 file and 7 lines, but sometimes spanning 40 files and more than 3,000 lines. The vulnerabilities cover diverse crash types, with 30.4% linked to heap-buffer-overflow reads and 19.0% due to use of uninitialized values.
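The level scheme can be pictured as progressively revealing fields of a task record. The sketch below is a hypothetical Python representation of that idea; the field names and the build_agent_input helper are assumptions for illustration, not CyberGym's actual data schema or API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CyberGymTask:
    """Illustrative container for one benchmark instance (assumed schema)."""
    task_id: str
    pre_patch_repo: str                        # pre-patch codebase (always given)
    vuln_description: Optional[str] = None     # revealed at level >= 1
    ground_truth_poc: Optional[bytes] = None   # revealed at level >= 2
    crash_stack_trace: Optional[str] = None    # revealed at level >= 2
    patch_diff: Optional[str] = None           # revealed at level 3
    post_patch_repo: Optional[str] = None      # revealed at level 3

def build_agent_input(task: CyberGymTask, level: int) -> dict:
    """Assemble the information exposed to the agent at a given difficulty level."""
    visible = {"codebase": task.pre_patch_repo}
    if level >= 1:
        visible["description"] = task.vuln_description
    if level >= 2:
        visible["ground_truth_poc"] = task.ground_truth_poc
        visible["stack_trace"] = task.crash_stack_trace
    if level >= 3:
        visible["patch"] = task.patch_diff
        visible["post_patch_codebase"] = task.post_patch_repo
    return visible
```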

Experimental Results

When tested against this benchmark, existing agents showed limited success. Among the four agent frameworks evaluated, OpenHands, Codex, EnIGMA, and Cybench, the best performer was OpenHands paired with Claude 3.7 Sonnet, which reproduced only 11.9% of the target vulnerabilities. Performance dropped sharply for longer PoC inputs: success rates were highest for PoCs under 10 bytes (43.5%) and fell to roughly 8% for lengths above 100 bytes. Open-source models such as DeepSeek-V3 lagged behind, with a success rate of only 3.6%. Even specialized models fine-tuned for code reasoning, such as SWE-Gym-32B and R2E-Gym-32B, failed to generalize, scoring below 2%. Richer input information at higher difficulty levels improved performance: level 3 reached a 17.1% success rate, while level 0 achieved only 3.5%. The analysis also showed that most successful PoC reproductions occurred within 20 to 40 execution steps, while many runs that exceeded 90 steps ultimately failed. Despite these challenges, the agents discovered 15 previously unknown zero-day vulnerabilities and two that were disclosed but unpatched in real-world projects, demonstrating their latent capacity for new discovery.

Key Takeaways

  • Benchmark volume and realism: CyberGym contains 1,507 tasks derived from real, patched vulnerabilities across 188 software projects, making it the largest and most realistic benchmark of its kind.
  • Agent limitations: Even the best-performing agent-model combination reproduced only 11.9% of vulnerabilities, with many combinations scoring below 5%.
  • Difficulty scaling: Providing additional inputs, such as stack traces or patches, substantially improved performance, with level 3 tasks yielding a 17.1% success rate.
  • Sensitivity to length: Agents struggled with tasks involving long PoCs. PoCs exceeding 100 bytes, which made up 65.7% of the dataset, had the lowest success rates.
  • Discovery potential: 15 new zero-day vulnerabilities were uncovered by agent-generated PoCs, validating their potential use in real security analysis.
  • Model behavior: Most successful exploits were generated early in task execution, with diminishing returns after 80 steps.
  • Tool interactions: Agents performed better when allowed to interact with tools (for example, using "awk", "grep", or installing "xxd") and to adapt their PoCs based on execution feedback, an iterative loop sketched after this list.
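As a rough illustration of that feedback loop, the following Python sketch runs a target binary on a candidate PoC, checks for a crash, and otherwise mutates the input and retries. The mutation step and step budget are placeholders; real agents reason over sanitizer output and shell-tool inspection rather than padding the input blindly.

```python
import subprocess
from typing import Optional

def refine_poc(initial_poc: bytes, target_bin: str,
               max_steps: int = 40, timeout_s: int = 30) -> Optional[bytes]:
    """Toy PoC-refinement loop (illustrative only, not the agents' actual logic)."""
    poc = initial_poc
    for _ in range(max_steps):
        with open("poc.bin", "wb") as f:
            f.write(poc)
        try:
            result = subprocess.run([target_bin, "poc.bin"],
                                    capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return None  # treat a hang as a failed attempt
        if result.returncode != 0:
            return poc  # non-zero exit: the sanitizer reported a crash
        # A real agent would inspect result.stderr (often via grep/awk/xxd)
        # to decide how to reshape the input; this stand-in just pads it.
        poc += b"\x00" * 8
    return None
```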

Conclusion

In conclusion, this study highlights a critical point: evaluating AI in cybersecurity is not only difficult but essential for understanding its limits and capabilities. CyberGym stands out by offering a large-scale, real-world framework for doing so. The researchers addressed the question with a practical, detailed benchmark that requires agents to reason deeply over entire codebases, produce valid exploits, and adapt through iteration. The results make clear that while current agents show promise, particularly in discovering new bugs, there is still a long way to go before AI can contribute reliably to cybersecurity.


Check out the Paper, GitHub Page, and Leaderboard. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent venture is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a broad audience. The platform boasts more than 2 million monthly views, illustrating its popularity among readers.
