A neural codec language model

by Brenden Burgess


A team of Microsoft researchers has introduced a new AI system that can imitate a person's voice from a recording only three seconds long. The researchers trained a neural codec language model called VALL-E using discrete codes derived from an off-the-shelf neural audio codec model, and they treat text-to-speech (TTS) synthesis as a conditional language modeling task rather than as regression of a continuous signal.
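To make the idea of "TTS as conditional language modeling" more concrete, here is a minimal toy sketch in Python. It is not VALL-E's actual implementation: the function `toy_next_code_weights` is a hypothetical stand-in for a trained network, and the codebook size, phoneme inputs, and prompt codes are all assumptions chosen for illustration. The sketch only shows the general shape of the loop, in which discrete audio codes are sampled one at a time, conditioned on the text and on the codes of a short speaker prompt.

```python
import random

CODEBOOK_SIZE = 1024   # assumed number of discrete codes in the codec's codebook
EOS = CODEBOOK_SIZE    # extra token id used here to mark the end of the sequence


def toy_next_code_weights(phonemes, prompt_codes, generated):
    """Hypothetical stand-in for a trained model.

    Returns unnormalised weights over all codes plus EOS, given the
    text (phonemes), the speaker prompt codes, and the codes so far.
    """
    # Derive a deterministic pseudo-distribution from the conditioning context.
    context = len(phonemes) + sum(prompt_codes) + sum(generated)
    rng = random.Random(context)
    weights = [rng.random() for _ in range(CODEBOOK_SIZE)]
    # Make stopping gradually more likely as the output grows.
    weights.append(0.05 * len(generated))
    return weights


def generate_codes(phonemes, prompt_codes, max_len=50, seed=0):
    """Sample discrete audio codes autoregressively, one token at a time."""
    rng = random.Random(seed)
    generated = []
    while len(generated) < max_len:
        weights = toy_next_code_weights(phonemes, prompt_codes, generated)
        code = rng.choices(range(CODEBOOK_SIZE + 1), weights=weights, k=1)[0]
        if code == EOS:
            break
        generated.append(code)
    return generated


# Illustrative inputs: phonemes for the target text and codes from a short prompt.
codes = generate_codes(["HH", "AH", "L", "OW"], prompt_codes=[12, 87, 512])
print(len(codes), codes[:5])
```

In a real system, the sampled codes would then be passed to the codec's decoder to reconstruct a waveform. That step is omitted here.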

The new system builds on EnCodec, Meta's audio compression technology, which was originally intended to improve the quality of phone calls. Further work has shown that the model is capable of much more. VALL-E can not only imitate a voice but also reproduce the speaker's tone and even the acoustics of the environment in which the original recording was made. For example, if the original recording came from a phone call, the output will also sound like a phone call.

The VALL-E developers used more than 60,000 hours of recordings during the pre-training phase, which is hundreds of times more data than existing systems use. VALL-E exhibits in-context learning capabilities and can synthesize high-quality personalized speech from as little as a 3-second audio recording.

In addition to reducing the time needed to produce a new voice, VALL-E creates a much more natural-sounding synthetic voice than other models. According to the experimental results, VALL-E significantly outperforms current TTS systems in both speech naturalness and speaker similarity.

See the model demo on the website.

In the samples presented on this website, the “Speaker Prompt” column contains the speech samples used as prompts. The “Ground Truth” column contains the target text spoken by the same person as in the prompt. The “Baseline” column is an example of conventional text-to-speech synthesis. Finally, the “VALL-E” column shows the output of the new AI model.

Try the TTS service provided by QuData as an example of a conventional online text-to-speech converter. It is completely free and available on desktop and mobile devices.

Microsoft has not made VALL-E's source code public, noting that the model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. As a result, those who want to test the model themselves cannot currently do so.

See also:
An unofficial PyTorch implementation of VALL-E, based on the EnCodec tokenizer.
