NVIDIA has just published Audio Flamingo 3: an open-source model advancing general audio intelligence

by Brenden Burgess


Heard of artificial general intelligence (AGI)? Meet its auditory counterpart: general audio intelligence. With Audio Flamingo 3 (AF3), NVIDIA introduces a major leap in how machines understand and reason about sound. While previous models could transcribe speech or classify audio clips, they lacked the ability to interpret audio in a context-rich, human-like way – across speech, ambient sound, and music, and over extended durations. AF3 changes this.

With Audio Flamingo 3, NVIDIA presents a fully open large audio-language model (LALM) that not only hears but also understands and reasons. Built on a five-stage training curriculum and powered by the AF-Whisper encoder, AF3 supports long audio inputs (up to 10 minutes), multi-turn multi-audio chat, on-demand reasoning, and even voice interactions. This sets a new bar for how AI systems interact with sound, bringing us closer to AGI.


Key innovations behind Audio Flamingo 3

  1. AF-Whisper: a unified audio encoder. AF3 uses AF-Whisper, a new encoder adapted from Whisper-v3. It processes speech, ambient sounds, and music with the same architecture – resolving a major limitation of previous LALMs, which used separate encoders and suffered from inconsistencies. AF-Whisper leverages audio-caption datasets, synthesized metadata, and a dense 1280-dimensional embedding space to align with text representations.
  2. Chain-of-thought audio reasoning: thinking on demand. Unlike static QA systems, AF3 is equipped with “thinking” capabilities. Using the AF-Think dataset (250K examples), the model can perform chain-of-thought reasoning when prompted, allowing it to explain its inference steps before arriving at an answer – a key step toward transparent audio AI.
  3. Multi-turn, multi-audio conversations. Thanks to the AF-Chat dataset (75K dialogues), AF3 can hold contextual conversations involving multiple audio inputs across turns. This mimics real-world interactions, where humans refer back to earlier audio cues. It also introduces voice-to-voice conversations using a streaming text-to-speech module.
  4. Long audio reasoning. AF3 is the first fully open model capable of reasoning over audio inputs up to 10 minutes long. Trained with LongAudio-XL (1.25M examples), the model supports tasks such as meeting summarization, podcast understanding, sarcasm detection, and temporal grounding.
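To make the long-audio point concrete, here is a minimal, hypothetical sketch (not NVIDIA's released code) of how a 10-minute clip might be split into fixed windows for a Whisper-style encoder that emits 1280-dimensional embeddings. The 30-second window, zero-padding, and the toy `encode_chunks` stand-in are all assumptions for illustration.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Whisper-family models expect 16 kHz audio
WINDOW_SECONDS = 30    # typical Whisper encoder window (assumed here)
EMBED_DIM = 1280       # AF-Whisper's dense embedding dimension, per the article

def chunk_audio(waveform: np.ndarray, window_s: int = WINDOW_SECONDS) -> list[np.ndarray]:
    """Split a long waveform into fixed-size windows, zero-padding the last one."""
    win = window_s * SAMPLE_RATE
    chunks = []
    for start in range(0, len(waveform), win):
        chunk = waveform[start:start + win]
        if len(chunk) < win:
            chunk = np.pad(chunk, (0, win - len(chunk)))
        chunks.append(chunk)
    return chunks

def encode_chunks(chunks: list[np.ndarray]) -> np.ndarray:
    """Toy stand-in for AF-Whisper: map each window to one 1280-d vector.
    (A real encoder emits a sequence of frame embeddings per window.)"""
    rng = np.random.default_rng(0)
    return np.stack([rng.standard_normal(EMBED_DIM) for _ in chunks])

# A 10-minute clip becomes 20 windows of 30 s each.
ten_minutes = np.zeros(10 * 60 * SAMPLE_RATE, dtype=np.float32)
windows = chunk_audio(ten_minutes)
embeddings = encode_chunks(windows)
print(len(windows), embeddings.shape)  # 20 windows, shape (20, 1280)
```

The point of the sketch is scale: at 16 kHz, ten minutes is 9.6 million samples, which a windowed encoder reduces to a short sequence of dense embeddings the language model can reason over.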

State-of-the-art benchmarks and real-world capability

AF3 surpasses both open and closed models on more than 20 benchmarks, including:

  • MMAU (avg): 73.14% (+2.14% over Qwen2.5-Omni)
  • LongAudioBench: 68.6 (GPT-4o as judge), outperforming Gemini 2.5 Pro
  • LibriSpeech (ASR): 1.57% WER, outperforming Phi-4-MM
  • ClothoAQA: 91.1% (vs. 89.2% for Qwen2.5-Omni)

These improvements are not merely marginal; they redefine what is expected of audio-language systems. AF3 also introduces benchmarking for voice chat and speech generation, achieving a generation latency of 5.94 s (vs. 14.62 s for Qwen2.5) and better speech similarity scores.
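The LibriSpeech figure above is a word error rate (WER): edit distance between the reference transcript and the model's hypothesis, divided by the reference length. As a quick refresher, a minimal implementation looks like this (a generic metric sketch, not NVIDIA's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.1667
```

A 1.57% WER means roughly 1.6 word errors per 100 reference words, which is near human-level transcription on clean read speech.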

Data pipeline: the datasets that teach audio reasoning

NVIDIA didn't just scale up compute – it redesigned the data:

  • AudioSkills-XL: 8M examples combining ambient-sound, music, and speech reasoning.
  • LongAudio-XL: covers long-form speech from audiobooks, podcasts, and meetings.
  • AF-Think: promotes short chain-of-thought-style inference.
  • AF-Chat: designed for multi-turn, multi-audio conversations.

Each dataset is fully open, along with the training code and recipes, enabling reproducibility and future research.

Open source

AF3 is not just a model drop. NVIDIA has released:

  • Model weights
  • Training recipes
  • Inference code
  • Four open datasets

This transparency makes AF3 the most accessible cutting-edge audio-language model available. It opens new research directions in auditory reasoning, low-latency audio agents, music understanding, and multimodal interaction.

Conclusion: towards general audio intelligence

Audio Flamingo 3 shows that deep audio understanding is not only possible but reproducible and open. By combining scale, new training strategies, and diverse data, NVIDIA delivers a model that listens, understands, and reasons in ways previous LALMs could not.


Check out the Paper, Code, and Model on Hugging Face. All credit for this research goes to the researchers of this project.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent venture is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a broad audience. The platform boasts more than 2 million monthly views, attesting to its popularity among readers.
