TikTok researchers introduce SWE-Perf: the first benchmark for repository-level code performance optimization

by Brenden Burgess


Introduction

As large language models (LLMs) advance in software engineering tasks – from code generation to bug fixing – performance optimization remains an elusive frontier, particularly at the repository level. To fill this gap, researchers at TikTok introduce SWE-Perf – the first benchmark specifically designed to evaluate the ability of LLMs to optimize code performance on real-world repositories.

Unlike prior benchmarks focused on correctness or function-level efficiency (for example, SWE-Bench, Mercury, EFFIBENCH), SWE-Perf captures the complexity and contextual depth of repository-level performance tuning. It provides a reproducible, quantitative foundation for studying and improving the performance optimization capabilities of modern LLMs.

Image source: https://arxiv.org/abs/2507.12415

Why SWE-Perf is needed

Real-world codebases are often large, modular, and complex. Optimizing them for performance requires understanding cross-file interactions, execution paths, and computational bottlenecks – challenges that extend well beyond the scope of isolated, function-level datasets.

Today, LLMs are mostly evaluated on tasks such as syntax correction or small function transformations. But in production environments, repository-scale performance tuning can deliver far greater system-wide benefits. SWE-Perf is explicitly designed to measure LLM capabilities in such settings.

Image source: https://arxiv.org/abs/2507.12415

Dataset construction

SWE-Perf is built from more than 100,000 pull requests across high-profile GitHub repositories. The final dataset covers 9 repositories and includes:

  • 140 curated instances demonstrating measurable and stable performance improvements.
  • Complete codebases pre- and post-optimization.
  • Target functions categorized as Oracle (file-level) or Realistic (repository-level).
  • Unit tests and Docker environments for reproducible execution and performance measurement.
  • Expert-authored patches used as gold standards.
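
To make the dataset structure concrete, below is a minimal, hypothetical sketch of what a single SWE-Perf instance might look like. All field names and values here are invented for illustration and do not reflect the benchmark's actual schema:

    # A hypothetical SWE-Perf instance; field names are illustrative only,
    # not the benchmark's actual schema.
    instance = {
        "repo": "example-org/example-repo",          # source GitHub repository
        "base_commit": "0123abcd",                   # codebase before optimization
        "expert_patch": "expert.diff",               # expert-authored gold-standard patch
        "oracle_targets": ["pkg/core.py::process"],  # file-level target functions
        "realistic_targets": ["pkg/core.py::process",
                              "pkg/io.py::load"],    # repository-level target functions
        "performance_tests": ["tests/test_core.py::test_process_runtime"],
        "docker_image": "swe-perf/example-repo:0123abcd",  # reproducible environment
    }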

To ensure validity, each unit test must:

  1. Pass before and after the patch.
  2. Show statistically significant runtime gains over 20 repetitions (Mann-Whitney U test, p < 0.1).

Performance is measured as a minimum performance gain (δ), isolating statistically meaningful improvements attributable to the patch while filtering out measurement noise.
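
As a concrete illustration, the significance criterion can be checked with SciPy's Mann-Whitney U test. The sketch below follows the protocol described above (20 repetitions, p < 0.1); the conservative gain formula is one assumed reading of "minimum performance gain", not the benchmark's actual implementation:

    from scipy.stats import mannwhitneyu

    def is_valid_improvement(times_before, times_after, alpha=0.1):
        # times_before / times_after: runtimes (seconds) of the same unit test
        # over 20 repetitions, pre- and post-patch.
        # One-sided test: are post-patch runtimes stochastically smaller?
        _, p_value = mannwhitneyu(times_after, times_before, alternative="less")
        return p_value < alpha

    def minimum_gain(times_before, times_after):
        # Conservative gain: compare the fastest pre-patch run against the
        # slowest post-patch run, so residual noise cannot inflate the result
        # (an assumed reading of the "minimum performance gain" delta).
        return max(0.0, (min(times_before) - max(times_after)) / min(times_before))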

Benchmark settings: Oracle vs. Realistic

  • Oracle setting: The model receives only the target functions and the corresponding files. This setting tests localized optimization skills.
  • Realistic setting: The model receives the entire repository and must identify and optimize performance-critical paths on its own. This is a closer analog to how human engineers work.
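
To illustrate the difference between the two settings, here is a hypothetical sketch of how the code context handed to a model might be assembled; the function and its parameters are invented for illustration:

    from pathlib import Path

    def build_model_input(repo_root, setting, oracle_files=None):
        # Assemble the code context given to the model under each setting.
        repo = Path(repo_root)
        if setting == "oracle":
            # Oracle: only the files containing the target functions are exposed.
            files = [repo / f for f in (oracle_files or [])]
        elif setting == "realistic":
            # Realistic: the entire repository; the model must localize
            # performance-critical paths on its own.
            files = sorted(repo.rglob("*.py"))
        else:
            raise ValueError(f"unknown setting: {setting}")
        return "\n\n".join(f"# {f.relative_to(repo)}\n{f.read_text()}" for f in files)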

Evaluation metrics

SWE-Perf defines a three-tier evaluation framework, reporting each metric independently:

  1. Apply: Can the model-generated patch be applied cleanly?
  2. Correctness: Does the patch preserve functional integrity (do all unit tests pass)?
  3. Performance: Does the patch yield a measurable runtime improvement?

The metrics are not aggregated into a single score, allowing a more nuanced assessment of the trade-offs between syntactic correctness and performance gains.
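
A minimal sketch of how such a three-tier check could be wired together is shown below, using git apply and a pytest-style test command as stand-ins for the benchmark's actual harness; the function, its parameters, and the single-run timing are simplifying assumptions, not SWE-Perf's implementation:

    import subprocess
    import time

    def evaluate_patch(repo_dir, patch_file, test_cmd, baseline_runtime):
        # Report the three metrics independently, never as one aggregate score.
        report = {"apply": False, "correctness": False, "gain": 0.0}

        # 1. Apply: does the model-generated patch apply cleanly?
        check = subprocess.run(["git", "-C", repo_dir, "apply", "--check", patch_file])
        if check.returncode != 0:
            return report
        subprocess.run(["git", "-C", repo_dir, "apply", patch_file], check=True)
        report["apply"] = True

        # 2. Correctness: do all unit tests still pass after the patch?
        start = time.perf_counter()
        tests = subprocess.run(test_cmd, cwd=repo_dir)  # e.g. ["pytest", "tests/"]
        runtime = time.perf_counter() - start
        report["correctness"] = tests.returncode == 0

        # 3. Performance: runtime improvement over the pre-patch baseline
        #    (the real harness uses 20 repetitions and a statistical test).
        if report["correctness"] and runtime < baseline_runtime:
            report["gain"] = (baseline_runtime - runtime) / baseline_runtime
        return report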

Experimental results

The benchmark evaluates several top-tier LLMs in both the Oracle and Realistic settings:

Model                     Setting     Performance (%)
Claude-4-Opus             Oracle      1.28
GPT-4o                    Oracle      0.60
Gemini-2.5-Pro            Oracle      1.48
Claude-3.7 (Agentless)    Realistic   0.41
Claude-3.7 (OpenHands)    Realistic   2.26
Expert (human patch)      –           10.85

Notably, even the best-performing LLM configurations fall well short of human-level performance. The OpenHands-based agentic method, built on Claude-3.7-Sonnet, outperforms the other configurations in the Realistic setting but still lags far behind expert-crafted optimizations.

Key observations

  • Agentic frameworks such as OpenHands are better suited to complex, multi-step optimization, outperforming both direct model prompts and pipeline-based approaches such as Agentless.
  • Performance degrades as the number of target functions increases – LLMs struggle with broader optimization scopes.
  • LLMs show limited scalability in extended-runtime scenarios, where experts continue to uncover performance gains.
  • Patch analysis shows that LLMs focus more on low-level code structures (for example, imports and environment setup), while experts target high-level semantic abstractions for performance tuning.

Conclusion

SWE-Perf represents a pivotal step toward measuring and improving the performance optimization capabilities of LLMs in realistic software engineering workflows. It reveals a significant capability gap between existing models and human experts, offering a solid foundation for future research in repository-level performance tuning. As LLMs evolve, SWE-Perf can serve as a north star, guiding them toward practical, production-ready software improvement at scale.


Check out the Paper, GitHub Page, and Project. All credit for this research goes to the researchers of this project.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
