AI tool generates high-quality images faster than state-of-the-art approaches

by Brenden Burgess


The ability to generate high-quality images quickly is crucial for producing realistic simulated environments that can be used to train self-driving cars to avoid unpredictable hazards, making them safer on real streets.

But the generative artificial intelligence techniques increasingly used to produce such images have drawbacks. One popular type of model, called a diffusion model, can create stunningly realistic images but is too slow and computationally intensive for many applications. On the other hand, the autoregressive models that power LLMs like ChatGPT are much faster, but they produce poorer-quality images that are often riddled with errors.

Researchers from MIT and NVIDIA have developed a new approach that brings together the best of both methods. Their hybrid image-generation tool uses an autoregressive model to quickly capture the big picture, then a small diffusion model to refine the details of the image.

Their tool, known as HART (short for hybrid autoregressive transformer), can generate images that match or exceed the quality of state-of-the-art diffusion models, but do so about nine times faster.

The generation process consumes fewer computational resources than typical diffusion models, enabling HART to run locally on a laptop or a commercial smartphone. A user only needs to enter one natural-language prompt into the HART interface to generate an image.

HART could have a wide range of applications, such as helping researchers train robots to complete complex real-world tasks and helping designers produce striking scenes for video games.

“If you are painting a landscape and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better,” says co-lead author Haotian Tang of a new paper on HART.

He is joined by co-lead author Yecheng Wu, an undergraduate student at Tsinghua University; senior author Song Han, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and a distinguished scientist at NVIDIA; as well as others at MIT, Tsinghua University, and NVIDIA. The research will be presented at the International Conference on Learning Representations.

The best of both worlds

Popular diffusion models, such as Stable Diffusion and DALL-E, are known for producing highly detailed images. These models generate images through an iterative process in which they predict some amount of random noise on every pixel, subtract the noise, then repeat the prediction-and-denoising process multiple times until they generate a new image that is completely free of noise.

Because the diffusion model denoises all the pixels in an image at every step, and there can be 30 or more steps, the process is slow and computationally expensive. But since the model gets multiple chances to correct details it got wrong, the images are high quality.
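To make the process described above concrete, here is a minimal sketch of an iterative denoising loop in Python. It is an illustrative simplification, not the code of any real diffusion model: the update rule is deliberately crude and the toy noise predictor is a placeholder standing in for a large trained network.

```python
import torch

def denoise(noise_predictor, steps=30, shape=(1, 3, 64, 64)):
    """Start from pure noise and repeatedly subtract predicted noise."""
    x = torch.randn(shape)                        # begin with random noise
    for t in reversed(range(steps)):              # standard diffusion uses 30+ steps
        predicted_noise = noise_predictor(x, t)   # predict the noise on every pixel
        x = x - predicted_noise / steps           # remove a fraction of it (simplified update)
    return x                                      # ideally, an image free of noise

# Toy stand-in; a real model would be a trained U-Net or transformer.
toy_predictor = lambda x, t: 0.1 * x
image = denoise(toy_predictor, steps=30)
```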

Autoregressive models, commonly used for predicting text, can generate images by predicting patches of an image sequentially, a few pixels at a time. They cannot go back and correct their mistakes, but the sequential prediction process is much faster than diffusion.

These models use representations called tokens to make predictions. An autoregressive model uses an autoencoder to compress raw image pixels into discrete tokens, and it reconstructs the image from the predicted tokens. While this boosts the model's speed, the information loss that occurs during compression causes errors when the model generates a new image.
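The sketch below illustrates, under simplifying assumptions, how a vector-quantizing autoencoder can snap continuous patch features to the nearest entry in a fixed codebook of discrete tokens, and why that compression discards detail. The codebook here is random rather than learned, purely for illustration; it is not HART's actual tokenizer.

```python
import torch

codebook = torch.randn(512, 16)            # 512 code vectors of dim 16 (learned in practice)

def to_discrete_tokens(features):
    """Snap each patch feature to its nearest codebook entry: one discrete token per patch."""
    dists = torch.cdist(features, codebook)
    return dists.argmin(dim=1)

def from_discrete_tokens(token_ids):
    """Rebuild features from the tokens alone; anything not in the codebook is lost."""
    return codebook[token_ids]

features = torch.randn(256, 16)            # e.g. a 16x16 grid of patch features
token_ids = to_discrete_tokens(features)
reconstruction = from_discrete_tokens(token_ids)
residual = features - reconstruction       # the fine detail quantization threw away
print(f"mean detail lost to quantization: {residual.abs().mean():.3f}")
```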

With HART, the researchers developed a hybrid approach that uses an autoregressive model to predict compressed, discrete image tokens, then a small diffusion model to predict residual tokens. The residual tokens compensate for the model's information loss by capturing details that the discrete tokens leave out.

“We can get a huge boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, such as the edges of an object, or a person's hair, eyes, or mouth. These are places where discrete tokens can make mistakes,” says Tang.

Because the diffusion model only predicts the remaining details after the autoregressive model has done its job, it can accomplish the task in eight steps, instead of the 30 or more a standard diffusion model requires to generate an entire image. This minimal overhead from the additional diffusion model allows HART to retain the speed advantage of the autoregressive model while significantly improving its ability to generate intricate image details.
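Putting the pieces together, here is a hedged sketch of the two-stage idea the article describes: an autoregressive loop first predicts discrete tokens for the big picture, then a small diffusion loop spends only a handful of steps (eight, per the article) on the residual details. The toy models and the simplified denoising update are placeholders, not HART's actual networks.

```python
import torch

def predict_tokens(ar_model, num_patches=256):
    """Autoregressively sample one discrete token per image patch."""
    tokens = []
    for _ in range(num_patches):
        logits = ar_model(tokens)                       # condition on tokens so far
        probs = torch.softmax(logits, dim=-1)
        tokens.append(int(torch.multinomial(probs, 1)))
    return tokens

def refine_residual(res_model, coarse, steps=8):
    """Few-step diffusion over the residual only (simplified update rule)."""
    r = torch.randn_like(coarse)                        # start the residual from noise
    for t in reversed(range(steps)):                    # 8 steps vs. 30+ for full diffusion
        r = r - res_model(coarse, r, t) / steps
    return coarse + r                                   # big picture + fine detail

# Toy stand-ins for the two trained networks:
toy_ar = lambda toks: torch.zeros(512)                  # uniform over a 512-token vocabulary
toy_res = lambda c, r, t: 0.1 * r
token_ids = predict_tokens(toy_ar)
coarse_features = torch.randn(256, 16)                  # decoded from token_ids in practice
refined_features = refine_residual(toy_res, coarse_features, steps=8)
```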

“The diffusion model has an easier job to do, which leads to more efficiency,” he adds.

Outperforming larger models

While developing HART, the researchers encountered challenges in effectively integrating the diffusion model so it would enhance the autoregressive model. They found that incorporating the diffusion model in the early stages of the autoregressive process led to an accumulation of errors. Instead, their final design, which applies the diffusion model to predict only residual tokens as the final step, significantly improved generation quality.

Their method, which combines an autoregressive transformer model with 700 million parameters and a lightweight diffusion model with 37 million parameters, can generate images of the same quality as those created by a diffusion model with 2 billion parameters, yet it does so about nine times faster. It uses roughly 31 percent less computation than state-of-the-art models.

In addition, because HART uses an autoregressive model to do the bulk of the work, the same type of model that powers LLMs, it is more compatible for integration with the new class of unified vision-language generative models. In the future, one could interact with such a unified model, perhaps asking it to show the intermediate steps required to assemble a piece of furniture.

“LLMs are a good interface for all kinds of models, such as multimodal models and models that can reason. This is a way to push intelligence to a new frontier. An efficient image-generation model would unlock many possibilities,” he says.

In the future, the researchers want to pursue this direction and build vision-language models on top of the HART architecture. Since HART is scalable and generalizable to multiple modalities, they also want to apply it to video-generation and audio-prediction tasks.

This research was funded, in part, by the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program, and the U.S. National Science Foundation. The GPU infrastructure for training this model was donated by NVIDIA.
