Introduction
As large language models (LLMs) advance across software engineering tasks – from code generation to bug fixing – performance optimization remains an elusive frontier, particularly at the repository level. To fill this gap, the authors introduce SWE-Perf, the first benchmark specifically designed to assess LLMs' ability to optimize code performance on real-world repositories.
Unlike previous benchmarks focused on correctness or function-level efficiency (for example, SWE-Bench, Mercury, EffiBench), SWE-Perf captures the complexity and contextual depth of repository-scale performance tuning. It provides a reproducible, quantitative foundation for studying and improving the performance-optimization capabilities of modern LLMs.


Why SWE-Perf Is Needed
Real-world codebases are often large, modular, and complex. Optimizing them for performance requires understanding cross-file interactions, execution paths, and computational bottlenecks – challenges that extend well beyond the scope of isolated, function-level datasets.
Today, LLMs are largely evaluated on tasks such as syntax correction or small function transformations. But in production environments, repository-wide performance tuning can deliver far greater, system-scale benefits. SWE-Perf is explicitly designed to measure LLM capabilities in such settings.


Dataset Construction
SWE-Perf is built from more than 100,000 pull requests across high-profile GitHub repositories. The final dataset covers 9 repositories and includes (a sketch of one such instance follows this list):
- 140 curated instances demonstrating measurable and stable performance improvements.
- Complete codebases before and after optimization.
- Target functions categorized for an oracle (file-level) setting and a realistic (repository-level) setting.
- Unit tests and Docker environments for reproducible execution and performance measurement.
- Expert-authored patches used as gold standards.
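To make this composition concrete, here is a minimal sketch of how a single benchmark instance could be represented in Python. The field names and example values are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PerfInstance:
    """Hypothetical representation of one SWE-Perf benchmark instance."""
    repo: str                    # e.g. "owner/project" on GitHub
    base_commit: str             # pre-optimization snapshot of the codebase
    optimized_commit: str        # post-optimization snapshot (expert patch applied)
    target_functions: List[str]  # functions exposed in the oracle (file-level) setting
    tests: List[str]             # unit tests used for correctness and timing
    expert_patch: str            # human-authored gold-standard patch (unified diff)
    docker_image: str            # environment for reproducible execution

# Illustrative usage (all values are made up):
example = PerfInstance(
    repo="example-org/example-lib",
    base_commit="abc123",
    optimized_commit="def456",
    target_functions=["pkg/module.py::slow_function"],
    tests=["tests/test_module.py::test_slow_function"],
    expert_patch="--- a/pkg/module.py\n+++ b/pkg/module.py\n...",
    docker_image="sweperf/example-lib:abc123",
)
```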
To ensure validity, each unit test must:
- Pass both before and after the patch.
- Show statistically significant runtime gains across 20 repeated runs (Mann-Whitney U test, p < 0.1).
Performance is measured as a minimum performance gain (δ), isolating improvements statistically attributable to the patch while filtering out measurement noise. A sketch of this validation step appears below.
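The following sketch shows how such a check could be implemented with SciPy. The exact statistical procedure and gain formula used by SWE-Perf may differ; the conservative gain computation here (comparing the slowest patched run against the fastest original run) is an assumption made for illustration.

```python
from scipy.stats import mannwhitneyu

def significant_speedup(original_runtimes, patched_runtimes, alpha=0.1):
    """Check whether patched runtimes are significantly lower than original ones.

    Both inputs are lists of wall-clock times from repeated test runs
    (SWE-Perf uses 20 repetitions per side).
    """
    # One-sided Mann-Whitney U test: are patched times stochastically smaller?
    _, p_value = mannwhitneyu(patched_runtimes, original_runtimes, alternative="less")
    return p_value < alpha

def conservative_gain(original_runtimes, patched_runtimes):
    """Illustrative 'minimum gain': compare the worst patched run against the
    best original run, so noise cannot inflate the improvement.
    (Assumed formula, not necessarily the one used in the paper.)"""
    best_original = min(original_runtimes)
    worst_patched = max(patched_runtimes)
    return max(0.0, (best_original - worst_patched) / best_original)

# Example with made-up timings (seconds):
orig = [1.02, 1.05, 0.99, 1.01, 1.03]
patched = [0.80, 0.82, 0.79, 0.81, 0.83]
if significant_speedup(orig, patched):
    print(f"minimum gain ≈ {conservative_gain(orig, patched):.1%}")
```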
Benchmark Settings: Oracle vs. Realistic
- Oracle setting: The model receives only the target functions and their containing files. This setting tests localized optimization skill.
- Realistic setting: The model receives the entire repository and must identify and optimize performance-critical paths on its own – a closer analog to how human engineers work. The sketch below contrasts the two inputs.
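To illustrate the difference, here is a minimal sketch of the inputs a model might receive under each setting. The structure, keys, and task wording are assumptions made for clarity, not the benchmark's actual prompt format.

```python
# Oracle setting: only the target functions and the files containing them are exposed.
oracle_input = {
    "files": {
        "pkg/module.py": "<full contents of the file containing slow_function>",
    },
    "target_functions": ["pkg/module.py::slow_function"],
    "task": "Optimize the runtime of the listed functions without breaking the tests.",
}

# Realistic setting: the whole repository is exposed; the model must first
# locate the performance-critical code before optimizing it.
realistic_input = {
    "repository_root": "/workspace/example-lib",  # full checkout, all files visible
    "task": "Improve the repository's runtime performance while keeping all unit tests passing.",
}
```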
Evaluation Metrics
SWE-Perf defines a three-level evaluation framework, reporting each metric independently:
- Apply: Can the model-generated patch be applied cleanly?
- Correctness: Does the patch preserve functional integrity (do all unit tests still pass)?
- Performance: Does the patch yield a measurable runtime improvement?
The metrics are not aggregated into a single score, allowing a more nuanced assessment of trade-offs between syntactic correctness and performance gains. A sketch of this pipeline follows.
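Below is a minimal sketch of such a three-stage evaluation loop. The helpers apply_patch, run_unit_tests, and measure_runtime are hypothetical placeholders standing in for the benchmark's Docker-based harness, and significant_speedup / conservative_gain come from the earlier sketch.

```python
def evaluate_instance(instance, model_patch):
    """Report the three SWE-Perf metrics independently for one instance."""
    results = {"apply": False, "correctness": False, "performance": 0.0}

    # Level 1: does the patch apply cleanly to the pre-optimization codebase?
    workdir = apply_patch(instance.base_commit, model_patch)  # assumed helper
    if workdir is None:
        return results
    results["apply"] = True

    # Level 2: do all unit tests still pass after the patch?
    if not run_unit_tests(workdir, instance.tests):  # assumed helper
        return results
    results["correctness"] = True

    # Level 3: is there a statistically supported runtime improvement?
    before = measure_runtime(instance.base_commit, instance.tests, repetitions=20)  # assumed helper
    after = measure_runtime(workdir, instance.tests, repetitions=20)
    if significant_speedup(before, after):
        results["performance"] = conservative_gain(before, after)
    return results
```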
Experimental Results
The benchmark evaluates several frontier LLMs under both the oracle and realistic settings:
| Model | Setting | Performance (%) |
|---|---|---|
| Claude-4-Opus | Oracle | 1.28 |
| GPT-4o | Oracle | 0.60 |
| Gemini-2.5-Pro | Oracle | 1.48 |
| Claude-3.7 (Agentless) | Realistic | 0.41 |
| Claude-3.7 (OpenHands) | Realistic | 2.26 |
| Expert (human patch) | – | 10.85 |
Notably, even the best-performing LLM configurations fall well short of human-level performance. The agent-based OpenHands method, built on Claude-3.7-Sonnet, outperforms the other configurations in the realistic setting but still lags far behind expert-crafted optimizations.
Key Observations
- Agent-based frameworks such as OpenHands are better suited to complex, multi-step optimization, outperforming direct model prompting and pipeline-based approaches such as Agentless.
- Performance degrades as the number of target functions grows – LLMs struggle with broader optimization scopes.
- LLMs show limited scalability in long-runtime scenarios, where expert patches continue to deliver performance gains.
- Patch analysis shows that LLMs focus more on low-level code structures (for example, imports and environment configuration), while experts target high-level semantic abstractions for performance tuning.
Conclusion
SWE-Perf represents a pivotal step toward measuring and improving LLMs' performance-optimization capabilities in realistic software engineering workflows. It reveals a substantial capability gap between current models and human experts, and it offers a solid foundation for future research in repository-scale performance tuning. As LLMs evolve, SWE-Perf can serve as a north star guiding them toward practical improvement of large-scale, production-ready software.
Check out the Paper, GitHub page, and Project. All credit for this research goes to the researchers of this project.
