Recent developments have shown that reinforcement learning (RL) can considerably improve the reasoning capabilities of LLMs. Building on this progress, this work aims to improve Audio LLMs, models that process audio together with text to perform tasks such as question answering. The MMAU benchmark is a widely used dataset designed to evaluate these models; it contains multiple-choice questions about sounds, speech, and music, some of which require external knowledge. A previous approach, R1-AQA, used GRPO (Group Relative Policy Optimization) to fine-tune the Qwen2-Audio model on the AVQA dataset, achieving state-of-the-art (SOTA) results on MMAU. Inspired by this, the authors applied GRPO to fine-tune Qwen2.5-Omni-7B, a newer multimodal model, further improving performance. In addition, they introduced a method to automatically generate audio QA data, leading to even better results.
Compared to methods such as SARI, which uses a more complex mix of supervised fine-tuning and RL with structured reasoning, the authors' approach is simpler, relying only on RL without explicit reasoning stages. They also ran experiments with text-only inputs to study the role of GRPO in the performance gains. Surprisingly, fine-tuning the models on text-only data produced nearly the same improvements as training with both audio and text. This observation suggests that GRPO mainly improves the model's text-based reasoning ability, which contributes considerably to its improved performance on audio QA tasks.
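To make this text-only ablation concrete, here is a minimal sketch (not the authors' actual prompt or code) of how a multiple-choice audio QA item could be turned into either a multimodal input or a text-only input. The field names, prompt wording, and example item are illustrative assumptions.

```python
# Illustrative sketch: building an audio+text prompt versus a text-only prompt
# from a multiple-choice audio QA item. Field names and wording are assumptions.

def build_prompt(item: dict, use_audio: bool = True) -> dict:
    """Return a chat-style message; omit the audio for the text-only ablation."""
    choices = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(item["choices"]))
    text = (
        f"{item['question']}\n{choices}\n"
        "Answer with the letter of the correct choice."
    )
    content = [{"type": "text", "text": text}]
    if use_audio:
        # Multimodal training attaches the audio clip; the text-only ablation
        # keeps the same question and choices but simply drops the audio.
        content.insert(0, {"type": "audio", "audio": item["audio_path"]})
    return {"role": "user", "content": content}

# Hypothetical item in the style of an MMAU/AVQA multiple-choice question.
item = {
    "audio_path": "example.wav",
    "question": "Which instrument is most prominent in the clip?",
    "choices": ["Violin", "Drums", "Piano", "Flute"],
    "answer": "C",
}

multimodal_prompt = build_prompt(item, use_audio=True)   # audio + text training
text_only_prompt = build_prompt(item, use_audio=False)   # text-only ablation
```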
Researchers from MIT CSAIL, Goethe University, IBM Research, and others present Omni-R1, a fine-tuned version of the multimodal LLM Qwen2.5-Omni trained with the GRPO reinforcement learning method. Trained on the AVQA dataset, Omni-R1 sets new state-of-the-art results on the MMAU benchmark across all audio categories. Surprisingly, a large part of the improvement stems from better text-based reasoning rather than from the audio input itself. Fine-tuning with text-only data also led to notable performance gains. In addition, the team generated large-scale audio QA datasets using ChatGPT, which further boosts accuracy. Their work highlights the significant impact of text reasoning on Audio LLM performance and promises the public release of all resources.
Omni-R1 fine-tunes the Qwen2.5-Omni model with the GRPO reinforcement learning method, using a simple prompt format that allows direct answer selection and keeps training memory-efficient enough for 48 GB GPUs. GRPO avoids a value function by comparing groups of sampled outputs, using a reward based solely on answer accuracy. To expand the training data, the researchers took audio captions produced by Qwen2-Audio and prompted ChatGPT to generate new question-answer pairs. This method produced two datasets, AVQA-GPT and VGGS-GPT, covering 40K and 182K audios respectively. Training on these automatically generated datasets improved performance further, with VGGS-GPT helping Omni-R1 reach state-of-the-art accuracy on the MMAU benchmark.
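Below is a minimal sketch of the group-relative reward idea behind GRPO as described above: several answers are sampled per question, each receives a binary accuracy reward, and advantages are computed relative to the group rather than from a learned value function. The function names, answer-parsing rule, and normalization details are illustrative assumptions, not the authors' implementation.

```python
import re
from statistics import mean, pstdev

def accuracy_reward(model_output: str, correct_letter: str) -> float:
    """1.0 if the model's chosen option letter matches the ground truth, else 0.0."""
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return 1.0 if match and match.group(1) == correct_letter else 0.0

def group_relative_advantages(outputs: list[str], correct_letter: str) -> list[float]:
    """Score a group of sampled answers to the same question and normalize each
    reward by the group mean and standard deviation. GRPO uses these
    group-relative advantages in place of a learned critic/value function."""
    rewards = [accuracy_reward(o, correct_letter) for o in outputs]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0.0:  # all answers equally right or wrong: no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled completions for one question whose correct answer is "C".
samples = ["The answer is C", "B", "C", "I think (A)"]
print(group_relative_advantages(samples, "C"))  # correct answers get positive advantage
```

With a purely accuracy-based reward like this, only the relative correctness within each sampled group drives the policy update, which is part of what keeps the approach simple and memory-efficient compared to methods that train a separate value model.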
The researchers fine-tuned Qwen2.5-Omni with GRPO on the AVQA, AVQA-GPT, and VGGS-GPT datasets. The results show notable performance gains, with the best average score of 71.3% on the MMAU Test-mini achieved with VGGS-GPT. Qwen2.5-Omni outperformed baselines, including SARI, and showed strong reasoning even without audio, suggesting a solid text-based understanding. GRPO fine-tuning improved Qwen2-Audio more significantly because of its weaker initial text reasoning. Surprisingly, fine-tuning without audio still boosted audio performance, and text-only datasets such as ARC-Easy gave comparable results. The improvements mainly stem from better text reasoning, although audio-based fine-tuning remains slightly better for optimal performance.
In conclusion, Omni-R1 is an Audio LLM developed by fine-tuning Qwen2.5-Omni with the GRPO reinforcement learning method for improved audio question answering. Omni-R1 achieves new state-of-the-art results on the MMAU benchmark across sounds, speech, music, and overall performance. Two new large-scale datasets, AVQA-GPT and VGGS-GPT, were created using automatically generated questions, further increasing the model's accuracy. Experiments show that GRPO mainly improves text-based reasoning, which contributes considerably to the performance gains. Surprisingly, fine-tuning with text only (no audio) also improved audio performance, highlighting the value of a strong base language understanding. These findings offer cost-effective strategies for developing audio-capable language models.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 95k+ ML SubReddit.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
