A new way to edit or generate images

by Brenden Burgess



AI image generation, which relies on neural networks to create new images from a variety of inputs, including text prompts, is projected to become a billion-dollar industry by the end of this decade. Even with today's technology, if you wanted to make a fanciful picture of, say, a friend planting a flag on Mars or flying blithely into a black hole, it could take less than a second. However, before they can perform tasks like that, image generators are typically trained on massive datasets containing millions of images, each often paired with associated text. Training these generative models can be an arduous chore that takes weeks or months, consuming vast computational resources in the process.

But what if it were possible to generate images through AI methods without using a generator at all? That real possibility, along with other intriguing ideas, was described in a research paper presented at the International Conference on Machine Learning (ICML 2025), held in Vancouver, British Columbia, earlier this summer. The paper, describing novel techniques for manipulating and generating images, was written by Lukas Lao Beyer, a graduate student researcher in MIT's Laboratory for Information and Decision Systems (LIDS); Tianhong Li, a postdoc at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); Xinlei Chen of Facebook AI Research; Sertac Karaman, an MIT professor of aeronautics and astronautics and director of LIDS; and Kaiming He, an MIT associate professor of electrical engineering and computer science.

This group effort had its origins in a class project for a graduate seminar on deep generative models that Lao Beyer took last fall. In conversations during the semester, it became apparent to both Lao Beyer and He, who taught the seminar, that this research had real potential, going far beyond the confines of a typical homework assignment. Other collaborators were soon brought into the effort.

The starting point for Lao Beyer's investigation was a June 2024 paper, written by researchers from the Technical University of Munich and the Chinese company ByteDance, which introduced a new way of representing visual information called a one-dimensional tokenizer. With this device, which is also a kind of neural network, a 256×256-pixel image can be translated into a sequence of just 32 numbers, called tokens. "I wanted to understand how such a high level of compression could be achieved, and what the tokens themselves actually represented," says Lao Beyer.
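For a sense of scale, here is the compression arithmetic implied by those figures, as a quick illustrative calculation in Python (the 12 bits per token come from the paragraph below; the comparison against raw 8-bit RGB pixels is an assumption for illustration):

```python
# Back-of-the-envelope arithmetic for the 1D tokenizer described above:
# a raw 256x256 RGB image versus a sequence of 32 twelve-bit tokens.

pixels_bits = 256 * 256 * 3 * 8   # 8-bit RGB image: 1,572,864 bits
tokens_bits = 32 * 12             # 32 tokens x 12 bits each: 384 bits

print(f"raw image:   {pixels_bits:,} bits")
print(f"token string:  {tokens_bits:,} bits")
print(f"compression factor: {pixels_bits / tokens_bits:,.0f}x")  # ~4,096x
```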

Earlier tokenizers would typically break the same image into an array of 16×16 tokens, with each token encapsulating information, in highly condensed form, that corresponds to a specific portion of the original image. The new 1D tokens can encode an image more efficiently, using far fewer tokens overall, and each token captures information about the entire image, not just a single quadrant. Moreover, each of these tokens is a 12-digit binary number, a string of 1s and 0s, allowing for 2¹² (or roughly 4,000) possibilities. "It's like a 4,000-word vocabulary that makes up an abstract, hidden language spoken by the computer," he explains. "It's not like a human language, but we can still try to find out what it means."
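A minimal sketch of how such a discrete vocabulary typically works, in the style of vector quantization. The random codebook and the 64-dimensional embedding size here are assumptions for illustration; in a real tokenizer the codebook is learned jointly with the network:

```python
import torch

vocab_size = 2 ** 12                      # 4,096 possible token values
codebook = torch.randn(vocab_size, 64)    # learned embeddings (64-dim assumed)

z = torch.randn(32, 64)                   # 32 latent vectors for one image
# Quantize: snap each latent vector to the index of its nearest codebook entry.
tokens = torch.cdist(z, codebook).argmin(dim=1)   # shape (32,), ints in [0, 4096)
print(tokens[:8])                          # the image's first few "words"
```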

This is exactly what Lao Beyer initially set out to explore, work that provided the seed for the ICML 2025 paper. The approach he took was fairly simple. If you want to know what a particular token does, Lao Beyer says, "you can just remove it, swap in a random value, and see whether there is a recognizable change in the output." Replacing one token, he found, changes the image quality, turning a low-resolution image into a high-resolution one, or vice versa. Another token affected the blurriness of the background, while yet another influenced the brightness. He also found a token related to "pose," meaning that, in an image of a robin, for instance, the bird's head might shift from right to left.
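In code, that probing experiment might look like the following sketch. The `tokenizer` and `detokenizer` objects and their shapes are assumed stand-ins for a trained 1D tokenizer pair, not the authors' actual API:

```python
import torch

def probe_token(tokenizer, detokenizer, image, position, vocab_size=4096):
    """Swap one token for a random value and measure how much the output moves."""
    tokens = tokenizer(image)                    # assumed shape (1, 32), integer codes
    baseline = detokenizer(tokens)               # reconstruction from original tokens

    perturbed = tokens.clone()
    perturbed[0, position] = torch.randint(vocab_size, (1,))  # random swap
    edited = detokenizer(perturbed)

    # A large pixel-space difference hints that this token controls a
    # visually salient attribute (resolution, blur, brightness, pose, ...).
    return (edited - baseline).abs().mean().item()
```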

"This was a never-before-seen result, as no one had observed visually identifiable changes from manipulating tokens," Lao Beyer says. The finding raised the possibility of a new approach to editing images. And the MIT group has shown, in fact, how this process can be streamlined and automated, so that tokens don't have to be modified by hand, one at a time.

He and his colleagues achieved an even more consequential result involving image generation. A system capable of generating images normally requires a tokenizer, which compresses and encodes visual data, along with a generator that can combine and arrange these compact representations in order to create novel images. The MIT researchers found a way to create images without using a generator at all. Their new approach makes use of a 1D tokenizer and a so-called detokenizer (also known as a decoder), which can reconstruct an image from a string of tokens. However, with guidance provided by an off-the-shelf neural network called CLIP, which cannot generate images on its own but can measure how well a given image matches a certain text prompt, the team was able to convert an image of a red panda, for example, into a tiger. In addition, they could create images of a tiger, or any other desired form, starting completely from scratch, from a situation in which all the tokens are initially assigned random values (and then iteratively tweaked so that the reconstructed image increasingly matches the desired text prompt).
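One plausible way to realize such a generator-free loop, sketched under assumptions: treat the 32 token latents as continuous parameters before quantization, decode them, score the result against the prompt with CLIP, and follow the gradient. The `detokenizer` and `clip_score` functions are hypothetical stand-ins (CLIP similarity could come from a package such as open_clip), and the paper's actual optimization procedure may differ:

```python
import torch

def generate(detokenizer, clip_score, prompt, steps=200, lr=0.05):
    # 32 continuous token latents, randomly initialized (64-dim assumed).
    z = torch.randn(1, 32, 64, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        image = detokenizer(z)               # decode current tokens to an image
        loss = -clip_score(image, prompt)    # CLIP can't generate, only score
        opt.zero_grad()
        loss.backward()
        opt.step()                           # nudge tokens toward the prompt

    return detokenizer(z).detach()
```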

The group showed that with this same setup, relying on a tokenizer and a detokenizer but no generator, they could also do "inpainting," that is, filling in parts of images that had somehow been blotted out. Avoiding the use of a generator for certain tasks could lead to a significant reduction in computational costs because generators, as mentioned, normally require extensive training.
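Inpainting fits the same optimization template. A hedged sketch, under the same assumed interface as above (the article does not spell out the procedure, so treating it as a search for tokens whose reconstruction agrees with the surviving pixels is an assumption):

```python
import torch

def inpaint(detokenizer, image, mask, steps=300, lr=0.05):
    """mask: 1 where pixels are known, 0 where they were erased."""
    z = torch.randn(1, 32, 64, requires_grad=True)   # latent dims assumed
    opt = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        recon = detokenizer(z)
        # Only penalize disagreement on the known pixels; the decoder's prior
        # over natural images fills in the erased region.
        loss = ((recon - image) * mask).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return detokenizer(z).detach()
```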

What might seem odd about this team's contributions, He explains, "is that we didn't invent anything new. We didn't invent the 1D tokenizer, and we didn't invent the CLIP model, either. But we did discover that new capabilities can arise when you put all these pieces together."

"This work redefines the role of tokenizers," comments Saining Xie, a computer scientist at New York University. "It shows that image tokenizers, tools usually used just to compress images, can actually do a lot more. The fact that a simple (but highly compressed) 1D tokenizer can handle tasks like inpainting or text-guided editing, without needing to train a full-blown generative model, is pretty surprising."

Zhuang Liu of Princeton University agrees, saying that the work of the MIT group "shows that we can generate and manipulate images in a much easier way than we previously thought. Basically, it demonstrates that image generation can be a byproduct of a very effective image compressor, potentially reducing the cost of generating images."

There could be many applications outside the field of computer vision, Karaman suggests. "For instance, we could consider tokenizing the actions of robots or self-driving cars in the same way, which may rapidly broaden the impact of this work."

Lao Beyer is thinking along similar lines, noting that the extreme amount of compression afforded by 1D tokenizers allows you to do "incredible things" that could be applied to other fields. For example, in the area of self-driving cars, which is one of his research interests, the tokens could represent, instead of images, the different routes that a vehicle might take.

Xie is also intrigued by the applications that may come from these innovative ideas. "There are some really cool use cases this could unlock," he says.
