PoE-World + Planner Outperforms Reinforcement Learning Baselines on Montezuma's Revenge with Minimal Demonstration Data

by Brenden Burgess


The Importance of Symbolic Reasoning in World Modeling

Understanding how the world works is key to creating AI agents that can adapt to complex situations. Neural-network-based world models such as Dreamer offer flexibility, but they require massive amounts of data to learn effectively, far more than humans do. More recent methods instead use program synthesis with large language models to generate code-based world models. These are more data-efficient and can generalize well from limited input. However, their use has mostly been limited to simple domains such as text or gridworlds, because scaling to complex, dynamic environments remains a challenge: generating large, complete programs is difficult.

Limitations of Existing Programmatic World Models

Recent research has explored using programs to represent world models, often leveraging large language models to synthesize Python transition functions. Approaches such as WorldCoder and CodeWorldModels generate a single program, which limits their scalability in complex environments and their ability to handle uncertainty and partial observability. Some studies focus on high-level symbolic models for robotic planning, integrating visual input with abstract reasoning. Earlier efforts relied on restricted domain-specific languages tailored to specific benchmarks, or on conceptually related structures such as factor graphs in schema networks. Theoretical models such as AIXI also explore world modeling using Turing machines and history-based representations.

Introducing PoE-World: Modular and Probabilistic World Models

Researchers from Cornell, Cambridge, the Alan Turing Institute, and Dalhousie University present PoE-World, an approach to learning symbolic world models by combining many small LLM-synthesized programs, each capturing a specific rule of the environment. Instead of creating one large program, PoE-World builds a modular, probabilistic structure that can learn from brief demonstrations. This setup supports generalization to new situations, allowing agents to plan effectively even in complex games such as Pong and Montezuma's Revenge. Although it does not model raw pixel data, it learns from symbolic object observations and emphasizes accurate modeling over exploration for efficient decision-making.
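To make this concrete, below is a minimal, hypothetical sketch of what a single programmatic expert could look like: a small, human-readable Python rule that predicts one aspect of the next symbolic state. The object names, attributes, and the rule itself are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' code): one hypothetical "programmatic expert".
# Each expert is a small, readable Python rule that predicts one aspect of the
# next symbolic state from the current state and the chosen action.

def player_falls_under_gravity(state, action):
    """If the player is airborne and not on a ladder, its downward velocity grows."""
    player = state["player"]                 # symbolic object: a dict of attributes
    prediction = dict(player)                # start from the current attributes
    if not player["on_ground"] and not player["on_ladder"]:
        prediction["vy"] = player["vy"] + 1  # simple, interpretable gravity rule
    return prediction                        # partial prediction of the next state
```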

Architecture and Learning Mechanism of PoE-World

PoE-World models the environment as a combination of small, interpretable Python programs called programmatic experts, each responsible for a specific rule or behavior. These experts are weighted and combined to predict future states from past observations and actions. By treating features as conditionally independent and learning from the full history, the model remains modular and scalable. Hard constraints refine predictions, and experts are updated or pruned as new data is collected. The model supports planning and reinforcement learning by simulating likely future outcomes, enabling efficient decision-making. The programs are synthesized by LLMs and interpreted probabilistically, with expert weights optimized via gradient descent.
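As an illustration of the weighted, probabilistic combination described above, the following sketch fits non-negative expert weights by gradient descent so that the combined prediction matches an observed transition. It assumes a product-of-experts-style mixture over one discrete next-state feature; the actual PoE-World formulation may differ, and all names, shapes, and values here are hypothetical.

```python
# Minimal sketch, assuming a product-of-experts-style combination with weights
# learned by gradient descent. Not the authors' implementation.

import torch

def combine_experts(expert_probs: torch.Tensor, log_weights: torch.Tensor) -> torch.Tensor:
    """expert_probs: (num_experts, num_values) distributions, one per expert.
    Returns the weighted product of experts, renormalized into a distribution."""
    weights = torch.nn.functional.softplus(log_weights)         # keep weights >= 0
    log_mix = (weights[:, None] * torch.log(expert_probs + 1e-8)).sum(dim=0)
    return torch.softmax(log_mix, dim=0)                        # normalized prediction

# Toy setup: 3 experts predicting a discrete feature with 5 possible values.
num_experts, num_values = 3, 5
expert_probs = torch.rand(num_experts, num_values)
expert_probs = expert_probs / expert_probs.sum(dim=1, keepdim=True)

log_weights = torch.zeros(num_experts, requires_grad=True)
optimizer = torch.optim.Adam([log_weights], lr=0.1)
observed_value = 2                                               # index of the observed next value

for _ in range(200):
    optimizer.zero_grad()
    pred = combine_experts(expert_probs, log_weights)
    loss = -torch.log(pred[observed_value] + 1e-8)               # negative log-likelihood
    loss.backward()
    optimizer.step()
```

The design intuition is that experts that consistently explain observed transitions receive larger weights, while unhelpful experts are down-weighted (and, per the description above, eventually pruned).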

Empirical Evaluation on Atari Games

The study evaluates the agent, PoE-World + Planner, on Atari's Pong and Montezuma's Revenge, including harder, modified versions of these games. Using minimal demonstration data, the method outperforms baselines such as PPO, ReAct, and WorldCoder, particularly in low-data settings. PoE-World demonstrates strong generalization by accurately modeling game dynamics, even in altered environments, without new demonstrations. It is also the only method to consistently achieve positive scores in Montezuma's Revenge. Pre-training policies in PoE-World's simulated environment accelerates learning in the real environment. Unlike WorldCoder's limited and sometimes inaccurate models, PoE-World produces more detailed, constraint-aware representations, leading to better planning and more realistic in-game behavior.

Conclusion: Symbolic, Modular Programs for Scalable AI Planning

In conclusion, understanding how the world works is crucial for building adaptive AI agents; however, traditional deep learning models require large datasets and struggle to update flexibly from limited input. Inspired by how humans and symbolic systems recombine knowledge, the study proposes PoE-World. This method uses large language models to synthesize modular, programmatic "experts" that represent different parts of the world. These experts compose into an interpretable, symbolic world model that supports strong generalization from minimal data. Tested on Atari games such as Pong and Montezuma's Revenge, the approach demonstrates efficient planning and strong performance, even in unfamiliar scenarios. The code and demos are publicly available.


Check out the Paper, Project Page, and GitHub Page. All credit for this research goes to the researchers of this project.


