Machine learning engineering (MLE) involves developing, tuning, and deploying machine learning systems, which requires iterative experimentation, model optimization, and robust handling of data pipelines. As model complexity grows, so do the challenges of orchestrating end-to-end workflows effectively. Researchers have explored automating MLE tasks with AI agents to manage these demands. Large language models (LLMs), particularly those with strong coding and problem-solving abilities, have shown potential to considerably improve this process. Their role in automating structured workflows is now being tested through rigorous benchmarks and environments designed to mimic real-world MLE scenarios.
A major obstacle to automating machine learning engineering is its inherently iterative, feedback-driven nature. Tasks such as hyperparameter tuning, model debugging, and data preprocessing cannot be solved in a single step; they require repeated modification and evaluation. Traditional evaluation tools for AI models often rely on static datasets and do not allow real-time feedback or interactive problem solving. This limitation prevents LLM agents from learning through trial and error, an essential component of mastering engineering tasks that evolve or require multiple attempts to succeed.
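To make this iterative dependence concrete, here is a generic, self-contained example (not taken from MLE-Dojo) of a hyperparameter search in which each decision depends on executing and scoring the previous candidate; the dataset and model choices are illustrative only.

```python
# Minimal illustration of why ML engineering is inherently iterative:
# each hyperparameter choice must be trained and evaluated before the
# next decision can be made.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

best_score, best_depth = -1.0, None
for depth in [2, 4, 8, 16]:                              # candidate hyperparameters
    model = RandomForestClassifier(max_depth=depth, random_state=0)
    score = cross_val_score(model, X, y, cv=3).mean()    # feedback signal from execution
    if score > best_score:                               # next step depends on that feedback
        best_score, best_depth = score, depth

print(f"best max_depth={best_depth}, cv accuracy={best_score:.3f}")
```

A static benchmark can grade only the final answer; an interactive environment is needed to expose the intermediate execution results that drive each step of a loop like this.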
Previous tools for evaluating LLMs on engineering or coding tasks have focused mainly on individual subtasks or isolated challenges. These include benchmarks such as MLAgentBench and DSBench, which rely on narrow test cases drawn from Kaggle competitions or synthetic datasets. While they cover more than basic tasks, they do not let agents execute code, debug, or interpret results in a live setting. Other environments, such as SWE-Gym, focus exclusively on software engineering and lack support for machine-learning-specific workflows. These limitations have slowed the creation of versatile, high-performing MLE agents capable of managing real-time project complexity.
Researchers from the Georgia Institute of Technology and Stanford University introduced MLE-Dojo, a framework with an interactive environment that connects LLM agents to real-world machine learning tasks derived from more than 200 Kaggle competitions. The framework covers tabular data analysis, computer vision, natural language processing, and time-series forecasting challenges. MLE-Dojo lets agents write, execute, and revise code in a sandboxed, feedback-rich setting. The goal was to reproduce the interactive cycles that human engineers follow, enabling structured learning for agents. The environment ships with pre-installed dependencies and evaluation metrics, and supports both supervised fine-tuning and reinforcement learning strategies.
MLE-Dojo's structure consists of modular components that support a wide range of MLE challenges. Each task runs in its own Docker container, providing isolation for safety and reproducibility. Agents interact with the environment through a partially observable Markov decision process, receiving observations, performing actions, and earning rewards based on performance. The environment supports five primary action types: requesting task information, validating code, executing code, retrieving the interaction history, and resetting the environment. It also provides a detailed observation space that includes datasets, execution results, and error messages. The agent receives structured feedback after each interaction, enabling step-by-step improvement. This modular setup helps maintain interoperability and simplifies adding new tasks to the system.
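As a rough illustration of this gym-style loop, the sketch below mimics the five action types and the structured observation described above. The class, method, and action names are assumptions made for exposition only and do not reflect the actual MLE-Dojo API.

```python
# Illustrative sketch of the interaction loop described above.
# Names (MLEDojoEnvSketch, step, the action strings) are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Observation:
    task_info: str = ""          # dataset description, metric, constraints
    execution_output: str = ""   # stdout / score from the last run
    error_message: str = ""      # traceback if the submitted code failed
    history: list = field(default_factory=list)

class MLEDojoEnvSketch:
    """Toy stand-in for a sandboxed, per-task environment (one container per task)."""

    ACTIONS = {"request_info", "validate_code", "execute_code", "get_history", "reset"}

    def __init__(self, task_name: str):
        self.task_name = task_name
        self.obs = Observation(task_info=f"Task: {task_name} (tabular regression, RMSE metric)")

    def step(self, action: str, code: str = "") -> tuple:
        assert action in self.ACTIONS
        reward = 0.0
        if action == "execute_code":
            try:
                exec(code, {})                    # real sandboxing omitted in this sketch
                self.obs.execution_output = "run finished"
                reward = 1.0                      # placeholder for a metric-based reward
            except Exception as e:
                self.obs.error_message = repr(e)  # structured error feedback
        self.obs.history.append(action)
        return self.obs, reward

# One iteration of the write -> execute -> read-feedback cycle:
env = MLEDojoEnvSketch("house-prices")
obs, reward = env.step("execute_code", code="print('training a baseline model')")
print(obs.execution_output or obs.error_message, reward)
```

The point of the design is that every action returns a structured observation and a reward, so an agent can revise its code based on concrete execution results rather than guessing from a static prompt.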
The evaluation covered eight frontier LLMs, including Gemini-2.5-Pro, DeepSeek-R1, o3-mini, GPT-4o, and GPT-4o-mini. Gemini-2.5-Pro obtained the highest Elo rating at 1257, followed by DeepSeek-R1 at 1137 and o3-mini at 1108. Gemini-2.5-Pro also led with a HumanRank score of 61.95%, indicating superior performance relative to human benchmarks. Models like GPT-4o-mini executed code only about 20% of the time, adopting conservative strategies, while o3-mini executed code in more than 90% of cases. Gemini-2.5-Pro maintained the lowest average failure rate across the validation and execution phases, reinforcing its robustness. Among domains, computer vision posed the biggest challenge, with most models scoring below 60 in HumanRank. Reasoning models generally produced longer outputs and maintained more consistent performance across iterations.
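For readers unfamiliar with Elo, such ratings are derived from pairwise comparisons between models. The snippet below shows a standard Elo update for one comparison; the K-factor and the paper's exact comparison protocol are not reproduced here, so the values are illustrative only.

```python
# Standard Elo update for a single pairwise comparison between two models.
# The K-factor and the example ratings are illustrative, not from the paper.
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A wins the comparison, 0.5 for a tie, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Example: the higher-rated model wins one head-to-head task comparison.
print(elo_update(1257.0, 1137.0, score_a=1.0))
```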
The research highlights the difficulty of applying LLMs to full machine learning workflows. It presents a complete solution in MLE-Dojo that enables learning through interaction, not just task completion. MLE-Dojo sets a new standard for training and evaluating autonomous MLE agents by simulating engineering environments more accurately.
Check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 90K+ ML SubReddit.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
