Humans naturally learn by establishing links between sight and sound. For example, we can watch someone play cello and recognize that the movement of the cellist generates the music we hear.
A new approach developed by researchers at MIT and elsewhere improves an AI model's ability to learn in this same fashion. This could be useful in applications such as journalism and film production, where the model could help with curating multimodal content through automatic video and audio retrieval.
In the longer term, this work could be used to improve robots' ability to understand real-world environments, where auditory and visual information are often closely connected.
Building on prior work from their group, the researchers created a method that helps machine-learning models align corresponding audio and visual data from video clips without the need for human labels.
They adjusted how their original model is trained so it learns a finer-grained correspondence between a particular video frame and the audio that occurs in that moment. The researchers also made some architectural tweaks that help the system balance two distinct learning objectives, which improves performance.
Taken together, these relatively simple improvements boost the accuracy of their approach in video retrieval tasks and in classifying the action in audiovisual scenes. For instance, the new method could automatically and precisely match the sound of a door slamming with the visual of it closing in a video clip.
“We are building AI systems that can process the world like humans do, in terms of having both audio and visual information coming in at once and being able to seamlessly process both modalities. Looking forward, if we can integrate this audio-visual technology into some of the tools we use on a daily basis, it could open up a lot of new applications,” says Andrew Rouditchenko, an MIT graduate student and co-author of a paper on this research.
He is joined on the paper by lead author Edson Araujo, a graduate student at Goethe University in Germany; Yuan Gong, a former MIT postdoc; Saurabhchand Bhati, a current MIT postdoc; Samuel Thomas, Brian Kingsbury, and Leonid Karlinsky of IBM Research; Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Hilde Kuehne, professor of computer science at Goethe University and an affiliated professor at the MIT-IBM Watson AI Lab. The work will be presented at the Conference on Computer Vision and Pattern Recognition.
Syncing up
This work builds on a machine-learning method the researchers developed a few years ago, which provided an efficient way to train a multimodal model to simultaneously process audio and visual data without the need for human labels.
The researchers feed this model, called CAV-MAE, unlabeled video clips, and it encodes the visual and audio data separately into representations called tokens. Using the natural audio from the recording, the model automatically learns to map corresponding pairs of audio and visual tokens close together within its internal representation space.
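To make that idea concrete, here is a minimal PyTorch sketch of how separate audio and visual encoders can be pulled toward a shared representation space with a contrastive loss. The toy encoders, feature dimensions, and temperature value are illustrative assumptions, not the researchers' actual architecture.

```python
# Minimal sketch (not the authors' code): encode audio and video separately,
# then pull matching pairs close together in a shared space with a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAudioVisualEncoder(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, embed_dim=256):
        super().__init__()
        self.audio_encoder = nn.Linear(audio_dim, embed_dim)   # stand-in for an audio transformer
        self.video_encoder = nn.Linear(video_dim, embed_dim)   # stand-in for a visual transformer

    def forward(self, audio_feats, video_feats):
        a = F.normalize(self.audio_encoder(audio_feats), dim=-1)
        v = F.normalize(self.video_encoder(video_feats), dim=-1)
        return a, v

def contrastive_loss(a, v, temperature=0.07):
    # Matching audio/video pairs sit on the diagonal of the similarity matrix;
    # the loss pushes those entries up and mismatched pairs down.
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

encoder = ToyAudioVisualEncoder()
audio_batch = torch.randn(8, 128)   # e.g., pooled spectrogram tokens, one row per clip
video_batch = torch.randn(8, 512)   # e.g., pooled frame tokens, one row per clip
a, v = encoder(audio_batch, video_batch)
loss = contrastive_loss(a, v)
```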
They found that using two learning objectives balances the model's learning process, which enables CAV-MAE to understand the corresponding audio and visual data while improving its ability to retrieve video clips that match user queries.
But CAV-MAE treats audio and visual samples as one unit, so a 10-second video clip and the sound of a door slamming are mapped together, even if that audio event happens in just one second of the video.
In their improved model, called CAV-MAE Sync, the researchers split the audio into smaller windows before the model computes its representations of the data, so it generates separate representations that correspond to each smaller window of audio.
During training, the model learns to associate one video frame with the audio that occurs during just that frame.
“By doing that, the model learns a finer-grained correspondence, which helps with performance later when we aggregate this information,” says Araujo.
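The sketch below illustrates that general idea of chopping a clip's audio into short windows and pairing each sampled video frame with the window that overlaps it in time. The window length and frame timestamps are made-up values for illustration; this is not the authors' implementation.

```python
# A rough sketch of fine-grained pairing: instead of one audio representation
# per 10-second clip, split the audio into short windows and match each video
# frame to the window that overlaps it in time.
import torch

def split_audio_into_windows(audio, sample_rate=16000, window_sec=1.0):
    """Chop a 1-D audio waveform into fixed-length windows."""
    window_len = int(sample_rate * window_sec)
    n_windows = audio.shape[0] // window_len
    return audio[: n_windows * window_len].reshape(n_windows, window_len)

def pair_frames_with_windows(frame_times_sec, window_sec=1.0):
    """Map each video frame timestamp to the index of its overlapping audio window."""
    return [int(t // window_sec) for t in frame_times_sec]

waveform = torch.randn(10 * 16000)            # 10 seconds of audio at 16 kHz
windows = split_audio_into_windows(waveform)  # shape: (10, 16000)
frame_times = [0.5, 2.5, 4.5, 6.5, 8.5]       # timestamps of sampled video frames
pairs = pair_frames_with_windows(frame_times) # [0, 2, 4, 6, 8]
```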
They also incorporated architectural improvements that help the model to balance its two learning objectives.
Adding “wiggle room”
The model incorporates a contrastive objective, where it learns to associate similar audio and visual data, and a reconstruction objective, which aims to recover specific audio and visual data based on user queries.
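One common way to balance two such objectives is a weighted sum of their loss terms. The sketch below assumes mean-squared-error reconstruction and an arbitrary weight, which may differ from the paper's exact recipe.

```python
# A hedged sketch of balancing the two objectives as a weighted sum.
# The weight value and the reconstruction criterion are illustrative assumptions.
import torch
import torch.nn.functional as F

def joint_loss(contrastive_term, decoded, target, contrast_weight=0.1):
    # reconstruction: how well the model recovers the original audio/visual data
    reconstruction_term = F.mse_loss(decoded, target)
    # a single scalar loss lets both objectives shape the shared representations
    return contrast_weight * contrastive_term + reconstruction_term

decoded = torch.randn(8, 196, 256)   # model's reconstructed tokens (illustrative shapes)
target = torch.randn(8, 196, 256)    # original tokens the model tries to recover
loss = joint_loss(torch.tensor(1.2), decoded, target)
```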
In CAV-MAE Sync, the researchers introduced two new types of data representations, or tokens, to improve the model's learning ability.
They include dedicated “global tokens” that help with the contrastive learning objective and dedicated “register tokens” that help the model focus on important details for the reconstruction objective.
“Essentially, we add a bit more wiggle room to the model so it can perform each of these two tasks, contrastive and reconstructive, a bit more independently. That benefited overall performance,” Araujo adds.
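The sketch below illustrates the general idea of giving each objective its own learnable tokens prepended to the patch tokens. The token counts, layer sizes, and routing of the outputs to each objective are assumptions for illustration, not the released model.

```python
# Illustrative sketch (assumed details): learnable "global" tokens feed the
# contrastive objective, while "register" tokens give the reconstruction
# pathway extra capacity; both are prepended to the patch tokens.
import torch
import torch.nn as nn

class TokenAugmentedEncoder(nn.Module):
    def __init__(self, embed_dim=256, n_global=1, n_register=4, n_heads=4):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.randn(1, n_global, embed_dim))
        self.register_tokens = nn.Parameter(torch.randn(1, n_register, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.n_global = n_global
        self.n_register = n_register

    def forward(self, patch_tokens):
        b = patch_tokens.size(0)
        extra = torch.cat([self.global_tokens.expand(b, -1, -1),
                           self.register_tokens.expand(b, -1, -1)], dim=1)
        x = self.encoder(torch.cat([extra, patch_tokens], dim=1))
        # global tokens go to the contrastive head; patch tokens (processed with
        # register context) go to the reconstruction head
        global_out = x[:, : self.n_global]
        patch_out = x[:, self.n_global + self.n_register :]
        return global_out, patch_out

enc = TokenAugmentedEncoder()
patches = torch.randn(2, 196, 256)  # e.g., patch tokens from one video frame
g, p = enc(patches)                 # g for the contrastive loss, p for reconstruction
```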
While the researchers had some intuition that these enhancements would improve the performance of CAV-MAE Sync, it took a careful combination of strategies to shift the model in the direction they wanted it to go.
“Because we have multiple modalities, we need a good model for both modalities by themselves, but we also need to get them to fuse together and collaborate,” Rouditchenko explains.
In the end, their enhancements improved the model's ability to retrieve videos based on an audio query and predict the class of an audio-visual scene, like a dog barking or an instrument playing.
Its results were more accurate than their prior work, and it also performed better than more complex, state-of-the-art methods that require larger amounts of training data.
“Sometimes, very simple ideas or little patterns you see in the data have big value when applied on top of a model you are working on,” says Araujo.
In the future, the researchers want to incorporate new models that generate better data representations into CAV-MAE Sync, which could improve performance. They also want to enable their system to handle text data, which would be an important step toward generating an audiovisual large language model.
This work is funded, in part, by the German Federal Ministry of Education and Research and the MIT-IBM Watson AI Lab.
