Meta AI Publishes V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning

by Brenden Burgess


Meta AI introduced V-JEPA 2, an open-source world model designed to learn from internet video and enable robust visual understanding, future-state prediction, and zero-shot planning. Building on the joint-embedding predictive architecture (JEPA), V-JEPA 2 shows how self-supervised learning from passive internet video, combined with minimal robot interaction data, can provide a modular foundation for intelligent physical agents.

Scaling self-supervised pre-training to 1 million hours of video

V-JEPA 2 is pre-trained on more than one million hours of internet-scale video combined with one million images. Using a masked visual denoising objective, the model learns to reconstruct masked spatio-temporal patches in a latent representation space. This approach avoids the inefficiency of pixel-level prediction by focusing on predictable scene dynamics while ignoring irrelevant noise.
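
To make the latent-space objective concrete, here is a minimal PyTorch-style sketch of a JEPA-style masked prediction step. The `encoder`, `predictor`, and `ema_encoder` interfaces, and the choice of an L1 regression loss, are assumptions for illustration, not Meta's released training code.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a JEPA-style masked latent prediction step (not Meta's code).
# `encoder`, `predictor`, and `ema_encoder` are assumed module interfaces; the EMA
# (target) encoder provides the regression targets in latent space.

def jepa_step(video_patches, mask, encoder, predictor, ema_encoder):
    # Encode only the visible (unmasked) patches with the online encoder.
    context = encoder(video_patches, keep=~mask)

    # Predict latent representations at the masked patch positions.
    predicted = predictor(context, target_positions=mask)

    # Targets come from an EMA copy of the encoder applied to the full clip;
    # no gradients flow through the target branch.
    with torch.no_grad():
        targets = ema_encoder(video_patches)[mask]

    # Regress predicted latents onto target latents (L1 shown as one possible choice).
    return F.l1_loss(predicted, targets)
```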

To scale JEPA pre-training to this level, Meta researchers introduced four key techniques:

  • Data scaling: built a 22M-sample dataset (VideoMix22M) from public sources such as SSv2, Kinetics, HowTo100M, YT-Temporal-1B, and ImageNet.
  • Model scaling: expanded the encoder's capacity to more than 1B parameters using ViT-g.
  • Training schedule: adopted a progressive-resolution strategy and extended pre-training to 252K iterations.
  • Spatio-temporal augmentation: trained on progressively longer, higher-resolution clips, reaching 64 frames at 384 × 384 resolution (see the sketch after this list).
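
The sketch below illustrates what such a progressive spatio-temporal curriculum could look like in code. The final 64-frame, 384 × 384 stage and the 252K-iteration total echo the list above, but the intermediate stage boundaries and settings are assumptions, not Meta's published schedule.

```python
# Illustrative curriculum for progressive spatio-temporal scaling; only the final
# stage and the 252K total match the article, the rest is assumed for illustration.
CURRICULUM = [
    # (iterations, frames_per_clip, resolution)
    (90_000, 16, 256),   # short, low-resolution clips first (assumed stage)
    (90_000, 32, 320),   # longer clips, higher resolution (assumed stage)
    (72_000, 64, 384),   # final stage matches the reported 64-frame, 384x384 setup
]

def clip_settings(iteration):
    """Return (frames, resolution) for the current training iteration."""
    budget = 0
    for iters, frames, res in CURRICULUM:
        budget += iters
        if iteration < budget:
            return frames, res
    return CURRICULUM[-1][1], CURRICULUM[-1][2]
```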

These design choices led to an average accuracy of 88.2% across six benchmark tasks, including SSv2, Diving-48, Jester, Kinetics, COIN, and ImageNet, surpassing previous baselines.

Understanding via masked representation learning

V-JEPA 2 exhibits strong motion-understanding capabilities. On the Something-Something v2 benchmark, it reaches 77.3% top-1 accuracy, outperforming models such as InternVideo and VideoMAEv2. For appearance understanding, it remains competitive with state-of-the-art image-text pre-training models such as DINOv2 and PEcoreG.

The encoder's representations were evaluated using attentive probes, verifying that self-supervised learning alone can produce transferable, domain-applicable visual features across diverse classification tasks.
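
For readers unfamiliar with attentive probes, the sketch below shows one common formulation: a learned query cross-attends over the frozen encoder's tokens, and a linear head classifies the pooled vector. The dimensions and class count are placeholder assumptions, not necessarily the paper's exact settings.

```python
import torch
import torch.nn as nn

# Sketch of an attentive probe over frozen encoder features.
# dim=1408 and num_classes=174 (SSv2) are illustrative placeholders.
class AttentiveProbe(nn.Module):
    def __init__(self, dim=1408, num_heads=16, num_classes=174):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))       # learned pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, features):          # features: (B, N_tokens, dim), from frozen encoder
        q = self.query.expand(features.size(0), -1, -1)
        pooled, _ = self.attn(q, features, features)             # cross-attention pooling
        return self.head(pooled.squeeze(1))                      # (B, num_classes)
```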

Temporal reasoning via video question answering

To assess temporal reasoning, the V-JEPA 2 encoder is aligned with a multimodal large language model and evaluated on several video question-answering tasks. Despite the absence of language supervision during pre-training, the model achieves:

  • 84.0% on PerceptionTest
  • 76.9% on TempCompass
  • 44.5% on MVP
  • 36.7% on TemporalBench
  • 40.3% on TOMATO

These results challenge the assumption that visual-language alignment requires co-training from the start, demonstrating that a pre-trained video encoder can be aligned post hoc with strong generalization.
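
A hedged sketch of what such post-hoc alignment typically involves: a small projector maps frozen video tokens into the language model's embedding space and is tuned on video-text data. The module name and dimensions below are illustrative assumptions rather than Meta's released implementation.

```python
import torch.nn as nn

# Illustrative projector for post-hoc vision-language alignment: frozen video
# tokens are mapped into the LLM's embedding space and used as prefix tokens.
# vision_dim / llm_dim are placeholder values, not the paper's exact settings.
class VideoToLLMProjector(nn.Module):
    def __init__(self, vision_dim=1408, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, video_tokens):      # (B, N, vision_dim) from the frozen encoder
        return self.proj(video_tokens)    # (B, N, llm_dim), consumed by the LLM
```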

V-JEPA 2-AC: Learning latent world models for robotic planning

A key innovation in this release is V-JEPA 2-AC, an action-conditioned variant of the pre-trained encoder. Fine-tuned using only 62 hours of unlabeled robot video from the DROID dataset, V-JEPA 2-AC learns to predict future video embeddings conditioned on robot actions and poses. The architecture is a 300M-parameter transformer with block-causal attention, trained with a teacher-forcing and rollout objective.
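
The two training signals can be sketched as follows; the `predictor(z, action, pose)` interface and the L1 loss are assumptions used for illustration, and the block-causal transformer itself is not shown.

```python
import torch
import torch.nn.functional as F

# Sketch of teacher-forcing vs. rollout objectives over encoded frames.
# latents: (T, B, D) ground-truth frame embeddings; actions/poses: (T-1, B, ...).

def teacher_forcing_loss(predictor, latents, actions, poses):
    # Each step is conditioned on the encoded ground-truth frame.
    preds = [predictor(latents[t], actions[t], poses[t]) for t in range(len(actions))]
    return F.l1_loss(torch.stack(preds), latents[1:])

def rollout_loss(predictor, latents, actions, poses):
    # The model's own predictions are fed back in autoregressively.
    z = latents[0]
    preds = []
    for t in range(len(actions)):
        z = predictor(z, actions[t], poses[t])
        preds.append(z)
    return F.l1_loss(torch.stack(preds), latents[1:])
```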

This enables zero-shot planning through model-predictive control. The model infers action sequences by minimizing the distance between imagined future states and visual goals using the Cross-Entropy Method (CEM). It succeeds at tasks such as reaching, grasping, and pick-and-place on unseen robot arms in different labs, without any reward supervision or additional data collection.
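
A minimal sketch of CEM-based planning over the latent world model is shown below; the horizon, sampling sizes, and the `world_model(z, a)` interface are illustrative assumptions rather than the released planner.

```python
import torch

# Cross-Entropy Method over action sequences: imagine rollouts in latent space,
# score them by distance to the goal embedding, refit the sampling distribution
# to the best candidates, and execute the first action of the final mean plan.
def cem_plan(world_model, z_current, z_goal, horizon=5, action_dim=7,
             samples=256, elites=32, iters=10):
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(iters):
        candidates = mean + std * torch.randn(samples, horizon, action_dim)
        costs = []
        for seq in candidates:
            z = z_current
            for a in seq:                              # imagined rollout in latent space
                z = world_model(z, a)
            costs.append(torch.norm(z - z_goal, p=1))  # distance to the visual goal
        elite_idx = torch.stack(costs).topk(elites, largest=False).indices
        elite = candidates[elite_idx]
        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-4
    return mean[0]                                     # first action of the plan
```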

Benchmarks: robust performance and planning efficiency

Compared to baselines such as Octo (behavior cloning) and Cosmos (latent diffusion world models), V-JEPA 2-AC:

  • Executes plans in ~16 seconds per step (versus 4 minutes for Cosmos).
  • Reaches a 100% success rate on reach tasks.
  • Outperforms the others on grasp and manipulation tasks across object types.

Notably, it operates using a monocular RGB camera without calibration or environment-specific fine-tuning, reinforcing the generalization capability of the learned world model.

Conclusion

Meta's V-JEPA 2 represents a significant advance in scalable self-supervised learning for physical intelligence. By decoupling observation learning from action conditioning and leveraging large-scale passive video, V-JEPA 2 demonstrates that general-purpose visual representations can be harnessed for both perception and real-world control.


Check out the Paper, the Models on Hugging Face, and the GitHub page. All credit for this research goes to the researchers of this project.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent venture is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a broad audience. The platform has more than 2 million monthly views, illustrating its popularity among readers.
