Introduction
As large language models (LLMs) advance across software engineering tasks – from code generation to bug fixing – performance optimization remains an elusive frontier, particularly at the repository level. To fill this gap, the authors introduce SWE-Perf, the first benchmark specifically designed to assess LLMs' ability to optimize code performance on real-world repositories.
Unlike previous benchmarks focused on correctness or function-level efficiency (for example, SWE-Bench, Mercury, EffiBench), SWE-Perf captures the complexity and contextual depth of repository-scale performance tuning. It provides a reproducible, quantitative foundation for studying and improving the performance-optimization capabilities of modern LLMs.


Why SWE-Perf Is Needed
Real-world codebases are often large, modular, and complex. Optimizing them for performance requires understanding cross-file interactions, execution paths, and computational bottlenecks – challenges that extend well beyond the scope of isolated, function-level datasets.
Today, LLMs are largely evaluated on tasks such as syntax correction or small function transformations. But in production environments, repository-wide performance tuning can deliver far greater, system-scale benefits. SWE-Perf is explicitly designed to measure LLM capabilities in such settings.


Dataset Construction
SWE-Perf is built from more than 100,000 pull requests across high-profile GitHub repositories. The final dataset covers 9 repositories and includes (a sketch of one such instance follows this list):
- 140 curated instances demonstrating measurable and stable performance improvements.
- Complete codebases before and after optimization.
- Target functions categorized for an oracle (file-level) setting and a realistic (repository-level) setting.
- Unit tests and Docker environments for reproducible execution and performance measurement.
- Expert-authored patches used as gold standards.
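To make this composition concrete, here is a minimal sketch of how a single benchmark instance could be represented in Python. The field names and example values are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PerfInstance:
    """Hypothetical representation of one SWE-Perf benchmark instance."""
    repo: str                    # e.g. "owner/project" on GitHub
    base_commit: str             # pre-optimization snapshot of the codebase
    optimized_commit: str        # post-optimization snapshot (expert patch applied)
    target_functions: List[str]  # functions exposed in the oracle (file-level) setting
    tests: List[str]             # unit tests used for correctness and timing
    expert_patch: str            # human-authored gold-standard patch (unified diff)
    docker_image: str            # environment for reproducible execution

# Illustrative usage (all values are made up):
example = PerfInstance(
    repo="example-org/example-lib",
    base_commit="abc123",
    optimized_commit="def456",
    target_functions=["pkg/module.py::slow_function"],
    tests=["tests/test_module.py::test_slow_function"],
    expert_patch="--- a/pkg/module.py\n+++ b/pkg/module.py\n...",
    docker_image="sweperf/example-lib:abc123",
)
```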
To ensure validity, each unit test must:
- Pass both before and after the patch.
- Show statistically significant runtime gains across 20 repeated runs (Mann-Whitney U test, p < 0.1).
Performance is measured as a minimum performance gain (δ), isolating improvements statistically attributable to the patch while filtering out measurement noise. A sketch of this validation step appears below.
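The following sketch shows how such a check could be implemented with SciPy. The exact statistical procedure and gain formula used by SWE-Perf may differ; the conservative gain computation here (comparing the slowest patched run against the fastest original run) is an assumption made for illustration.

```python
from scipy.stats import mannwhitneyu

def significant_speedup(original_runtimes, patched_runtimes, alpha=0.1):
    """Check whether patched runtimes are significantly lower than original ones.

    Both inputs are lists of wall-clock times from repeated test runs
    (SWE-Perf uses 20 repetitions per side).
    """
    # One-sided Mann-Whitney U test: are patched times stochastically smaller?
    _, p_value = mannwhitneyu(patched_runtimes, original_runtimes, alternative="less")
    return p_value < alpha

def conservative_gain(original_runtimes, patched_runtimes):
    """Illustrative 'minimum gain': compare the worst patched run against the
    best original run, so noise cannot inflate the improvement.
    (Assumed formula, not necessarily the one used in the paper.)"""
    best_original = min(original_runtimes)
    worst_patched = max(patched_runtimes)
    return max(0.0, (best_original - worst_patched) / best_original)

# Example with made-up timings (seconds):
orig = [1.02, 1.05, 0.99, 1.01, 1.03]
patched = [0.80, 0.82, 0.79, 0.81, 0.83]
if significant_speedup(orig, patched):
    print(f"minimum gain ≈ {conservative_gain(orig, patched):.1%}")
```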
Benchmark Settings: Oracle vs. Realistic
- Oracle setting: The model receives only the target functions and their containing files. This setting tests localized optimization skill.
- Realistic setting: The model receives the entire repository and must identify and optimize performance-critical paths on its own – a closer analog to how human engineers work. The sketch below contrasts the two inputs.
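To illustrate the difference, here is a minimal sketch of the inputs a model might receive under each setting. The structure, keys, and task wording are assumptions made for clarity, not the benchmark's actual prompt format.

```python
# Oracle setting: only the target functions and the files containing them are exposed.
oracle_input = {
    "files": {
        "pkg/module.py": "<full contents of the file containing slow_function>",
    },
    "target_functions": ["pkg/module.py::slow_function"],
    "task": "Optimize the runtime of the listed functions without breaking the tests.",
}

# Realistic setting: the whole repository is exposed; the model must first
# locate the performance-critical code before optimizing it.
realistic_input = {
    "repository_root": "/workspace/example-lib",  # full checkout, all files visible
    "task": "Improve the repository's runtime performance while keeping all unit tests passing.",
}
```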
Evaluation Metrics
SWE-Perf defines a three-level evaluation framework, reporting each metric independently:
- Apply: Can the model-generated patch be applied cleanly?
- Correctness: Does the patch preserve functional integrity (do all unit tests still pass)?
- Performance: Does the patch yield a measurable runtime improvement?
The metrics are not aggregated into a single score, allowing a more nuanced assessment of trade-offs between syntactic correctness and performance gains. A sketch of this pipeline follows.
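Below is a minimal sketch of such a three-stage evaluation loop. The helpers apply_patch, run_unit_tests, and measure_runtime are hypothetical placeholders standing in for the benchmark's Docker-based harness, and significant_speedup / conservative_gain come from the earlier sketch.

```python
def evaluate_instance(instance, model_patch):
    """Report the three SWE-Perf metrics independently for one instance."""
    results = {"apply": False, "correctness": False, "performance": 0.0}

    # Level 1: does the patch apply cleanly to the pre-optimization codebase?
    workdir = apply_patch(instance.base_commit, model_patch)  # assumed helper
    if workdir is None:
        return results
    results["apply"] = True

    # Level 2: do all unit tests still pass after the patch?
    if not run_unit_tests(workdir, instance.tests):  # assumed helper
        return results
    results["correctness"] = True

    # Level 3: is there a statistically supported runtime improvement?
    before = measure_runtime(instance.base_commit, instance.tests, repetitions=20)  # assumed helper
    after = measure_runtime(workdir, instance.tests, repetitions=20)
    if significant_speedup(before, after):
        results["performance"] = conservative_gain(before, after)
    return results
```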
Experimental Results
The benchmark evaluates several frontier LLMs under both the oracle and realistic settings:
| Model | Setting | Performance (%) |
|---|---|---|
| Claude-4-Opus | Oracle | 1.28 |
| GPT-4o | Oracle | 0.60 |
| Gemini-2.5-Pro | Oracle | 1.48 |
| Claude-3.7 (Agentless) | Realistic | 0.41 |
| Claude-3.7 (OpenHands) | Realistic | 2.26 |
| Expert (human patch) | – | 10.85 |
Notably, even the best-performing LLM configurations fall well short of human-level performance. The agent-based OpenHands method, built on Claude-3.7-Sonnet, outperforms the other configurations in the realistic setting but still lags far behind expert-crafted optimizations.
Key Observations
- Agent-based frameworks such as OpenHands are better suited to complex, multi-step optimization, outperforming direct model prompting and pipeline-based approaches such as Agentless.
- Performance degrades as the number of target functions grows – LLMs struggle with broader optimization scopes.
- LLMs show limited scalability in long-runtime scenarios, where expert patches continue to deliver performance gains.
- Patch analysis shows that LLMs focus more on low-level code structures (for example, imports and environment configuration), while experts target high-level semantic abstractions for performance tuning.
Conclusion
SWE-Perf represents a pivotal step toward measuring and improving LLMs' performance-optimization capabilities in realistic software engineering workflows. It reveals a substantial capability gap between current models and human experts, and it offers a solid foundation for future research in repository-scale performance tuning. As LLMs evolve, SWE-Perf can serve as a north star guiding them toward practical improvement of large-scale, production-ready software.
Check out the Paper, GitHub page, and Project. All credit for this research goes to the researchers of this project.
