AI learns to synchronize sight and sound

by Brenden Burgess


Imagine watching a video where someone slams a door, and the AI behind the scenes instantly connects the exact moment of that sound with the visual of the door closing – without ever being told it is a door. That is the future MIT researchers and international collaborators are building, thanks to a breakthrough in machine learning that imitates the way humans intuitively connect sight and sound.

The team of researchers presented CAV-MAE Sync, an upgraded AI model that learns fine-grained connections between audio and visual data – all without human-provided labels. Potential applications range from video editing and content curation to smarter robots that better understand real-world environments.

According to Andrew Rouditchenko, an MIT researcher and co-author of the study, humans naturally process the world using both sight and sound, so the team wants AI to do the same. By integrating this kind of audiovisual understanding into tools such as large language models, it could unlock entirely new types of AI applications.

The work builds on a previous model, CAV-MAE, which could process and align visual and audio data from videos. That system learned by encoding unlabeled video clips into representations called tokens and automatically matching the corresponding audio and video signals.
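To make the idea of "tokens" concrete, here is a minimal, purely illustrative sketch of projecting video frame patches and audio-spectrogram patches into a shared embedding space. The class, patch sizes, and dimensions are assumptions for demonstration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): turning raw audio and video
# patches into token sequences a shared model can compare.
import torch
import torch.nn as nn

class ToyTokenizer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Project 16x16 RGB image patches and 16x16 spectrogram patches
        # into a shared embedding dimension.
        self.video_proj = nn.Linear(16 * 16 * 3, dim)
        self.audio_proj = nn.Linear(16 * 16, dim)

    def forward(self, video_patches, audio_patches):
        # video_patches: (batch, num_patches, 16*16*3) flattened frame patches
        # audio_patches: (batch, num_patches, 16*16) flattened spectrogram patches
        return self.video_proj(video_patches), self.audio_proj(audio_patches)

tokenizer = ToyTokenizer()
v = torch.randn(2, 196, 16 * 16 * 3)   # two clips, 196 patches per frame
a = torch.randn(2, 128, 16 * 16)       # two clips, 128 spectrogram patches
video_tokens, audio_tokens = tokenizer(v, a)
print(video_tokens.shape, audio_tokens.shape)  # (2, 196, 256) (2, 128, 256)
```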

However, the original model lacked precision: it treated long audio and video segments as a single unit, even if a particular sound – like a dog barking or a door slamming – occurred only briefly.

The new model, CAV-MAE Sync, corrects this by splitting the audio into smaller windows and mapping each window to a specific video frame. This fine-grained alignment lets the model associate a single frame with the exact sound occurring at that moment, considerably improving precision.
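The pairing idea can be sketched in a few lines: split the audio track into short windows and match each window to the video frame nearest in time. The durations and frame counts below are assumptions, not values from the paper.

```python
# Illustrative sketch of fine-grained pairing: each audio window is matched
# to the video frame whose timestamp is closest to it.
def pair_audio_windows_with_frames(clip_duration_s, num_frames, num_audio_windows):
    """Return (window_index, frame_index) pairs by nearest-time alignment."""
    frame_times = [(i + 0.5) * clip_duration_s / num_frames for i in range(num_frames)]
    window_times = [(j + 0.5) * clip_duration_s / num_audio_windows for j in range(num_audio_windows)]
    pairs = []
    for j, t in enumerate(window_times):
        # Choose the frame whose timestamp is nearest to this audio window.
        nearest_frame = min(range(num_frames), key=lambda i: abs(frame_times[i] - t))
        pairs.append((j, nearest_frame))
    return pairs

# A 10-second clip sampled as 4 video frames and 16 audio windows.
print(pair_audio_windows_with_frames(10.0, num_frames=4, num_audio_windows=16))
```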

These finer-grained windows give the model a more detailed view of time, which makes a big difference for real-world tasks such as finding the right video clip based on a sound.

CAV-MAE Sync uses a dual learning strategy to balance two objectives (a minimal sketch follows the list below):

  • A contrastive learning task that helps the model distinguish matching audiovisual pairs from mismatched ones.
  • A reconstruction task where the AI learns to recover specific content, such as finding a video based on an audio query.
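Here is a minimal sketch of how such a dual objective might be combined during training. The encoders, decoder, and the 0.5 weighting are illustrative assumptions, not details from the paper.

```python
# Illustrative dual objective (contrastive + reconstruction), not the authors'
# implementation. Embeddings here are random placeholders.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    # Matching audio/video pairs sit on the diagonal of the similarity matrix.
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    logits = audio_emb @ video_emb.t() / temperature
    targets = torch.arange(audio_emb.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def reconstruction_loss(reconstructed, original):
    # Masked-autoencoder-style objective: rebuild the original inputs.
    return F.mse_loss(reconstructed, original)

# Toy embeddings and reconstructions for a batch of 8 clips.
audio_emb, video_emb = torch.randn(8, 256), torch.randn(8, 256)
recon, target = torch.randn(8, 1024), torch.randn(8, 1024)
total_loss = contrastive_loss(audio_emb, video_emb) + 0.5 * reconstruction_loss(recon, target)
print(total_loss.item())
```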

To support these objectives, the researchers introduced special “global tokens” to improve contrastive learning and “register tokens” that help the model focus on the fine details needed for reconstruction. This extra “wiggle room” lets the model perform both tasks more effectively.
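A common way to add such dedicated tokens is to prepend learnable vectors to the patch-token sequence before it enters the transformer encoder. The sketch below shows that pattern; the token counts and the way the heads consume them are assumptions, not the paper's exact design.

```python
# Illustrative sketch of adding learnable "global" and "register" tokens to a
# token sequence; dimensions and counts are assumptions.
import torch
import torch.nn as nn

class TokenAugmenter(nn.Module):
    def __init__(self, dim=256, num_global=1, num_register=4):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global, dim))
        self.register_tokens = nn.Parameter(torch.zeros(1, num_register, dim))

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, dim)
        b = patch_tokens.size(0)
        extras = torch.cat(
            [self.global_tokens.expand(b, -1, -1),
             self.register_tokens.expand(b, -1, -1)],
            dim=1,
        )
        # Global tokens could later feed a contrastive head and register tokens
        # a reconstruction head, while patch tokens carry the clip content.
        return torch.cat([extras, patch_tokens], dim=1)

augment = TokenAugmenter()
tokens = augment(torch.randn(2, 196, 256))
print(tokens.shape)  # (2, 201, 256): 1 global + 4 register + 196 patch tokens
```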

The results speak for themselves: CAV-MAE Sync outperforms previous models, including more complex, more data-hungry systems, on video retrieval and audiovisual classification. It can identify actions such as an instrument being played or a pet making noise with remarkable precision.
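Retrieval itself is conceptually simple once embeddings exist: rank candidate clips by similarity to the audio query. This sketch uses random placeholder embeddings standing in for the model's outputs.

```python
# Illustrative audio-to-video retrieval: rank clips by cosine similarity
# to an audio query embedding.
import torch
import torch.nn.functional as F

def retrieve(query_audio_emb, video_embs, top_k=3):
    sims = F.cosine_similarity(query_audio_emb.unsqueeze(0), video_embs, dim=-1)
    scores, indices = sims.topk(top_k)
    return list(zip(indices.tolist(), scores.tolist()))

video_embs = torch.randn(100, 256)     # embeddings for 100 candidate clips
query = torch.randn(256)               # embedding of the audio query
print(retrieve(query, video_embs))     # top-3 (clip index, similarity) pairs
```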

Looking ahead, the team hopes to improve the model further by integrating even more advanced data representation techniques. They are also exploring the integration of text inputs, which could pave the way for a truly multimodal AI system – one that sees, hears, and reads.

Ultimately, this kind of technology could play a key role in developing intelligent assistants, improving accessibility tools, or powering robots that interact with humans and their environments more naturally.

Dive deeper into the research behind audiovisual learning here.
