
The approach presented here uses synthetic data to improve the accuracy of AI models that recognize images.
For a machine learning model to diagnose diseases in medical images, it must first be trained to do so. Training an image classification model generally requires a huge dataset with millions of examples of similar images. And this is where the problems begin.
Using real medical images is not always ethical: it can invade people's privacy, violate copyright, or rely on a dataset that is biased against a particular racial or ethnic group. To mitigate these risks, we can forgo real image data and use image generation programs instead, producing a synthetic dataset for training an image classification model. However, such methods are limited, because expertise is often needed to hand-design image generation programs that produce effective training data.
Researchers from the Massachusetts Institute of Technology, the MIT-IBM Watson AI Lab and elsewhere analyzed the problems involved in generating image datasets and proposed a different solution. Rather than developing a custom image generation program for a specific training task, they gathered a large collection of basic image generation programs publicly available on the Internet.
Their collection included 21,000 different programs capable of producing images of simple textures and colors. The programs were small, usually just a few lines of code each. The researchers used them as-is, without modification, to generate a set of images.
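To make the idea concrete, here is a minimal Python sketch of the kind of short procedural program described: a few lines of code that turn a random seed into an abstract texture. This is an illustration under our own assumptions, not one of the 21,000 programs the researchers actually collected.

```python
import numpy as np
from PIL import Image

def generate_texture(seed: int, size: int = 224) -> Image.Image:
    """Render one abstract texture from a random seed by mixing a few
    sinusoidal gradients per color channel (illustrative only)."""
    rng = np.random.default_rng(seed)
    xs, ys = np.meshgrid(np.linspace(0, 1, size), np.linspace(0, 1, size))
    channels = []
    for _ in range(3):  # one random wave pattern per RGB channel
        fx, fy = rng.uniform(1, 20, size=2)   # random spatial frequencies
        phase = rng.uniform(0, 2 * np.pi)
        channels.append(0.5 + 0.5 * np.sin(2 * np.pi * (fx * xs + fy * ys) + phase))
    rgb = (np.stack(channels, axis=-1) * 255).astype(np.uint8)
    return Image.fromarray(rgb)

# Each seed behaves like a distinct tiny "program" producing a different image.
generate_texture(0).save("texture_0.png")
```

Varying the seed yields endless variations, which is what makes such tiny programs usable as a data source.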
They used this dataset to train a computer vision model. In tests, models trained on this data classified images more accurately than other synthetically trained models, although they still fell short of models trained on real data. The researchers also found that increasing the number of image generation programs in the dataset raises the model's performance, yielding higher accuracy.
It turned out that using many programs that require no extra work is in fact better than using a small set of programs that need manual tuning. Data certainly matters, but this experiment showed that good results can be achieved without real data.
This research prompts us to rethink the pretraining process. Machine learning models are usually pretrained: they are first trained on one dataset, building up their parameters, and can then be adapted to solve other problems.
For example, a model designed to classify X-ray images can first be pretrained on a huge set of synthetically generated images, and only then trained on a much smaller dataset of real X-rays to perform its actual task. The problem with this method is that the synthetic images must match certain properties of the real images, which in turn requires extra work on the programs that generate them. This complicates model training.
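A minimal PyTorch sketch of this two-stage workflow (pretrain on synthetic data, then fine-tune on real data) might look as follows. The loaders, class counts, and hyperparameters here are placeholders invented for illustration, not details from the study.

```python
import torch
import torch.nn as nn
from torchvision import models

def fake_loader(num_batches, num_classes, batch_size=8):
    """Stand-in for a real data pipeline: yields random image tensors and
    labels. In practice the first loader would serve synthetic images and
    the second a small set of real, labeled X-rays."""
    for _ in range(num_batches):
        yield (torch.randn(batch_size, 3, 224, 224),
               torch.randint(0, num_classes, (batch_size,)))

def train(model, loader, lr):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for images, labels in loader:
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()

# Stage 1: pretrain on a large synthetic dataset (placeholder class count).
model = models.resnet18(num_classes=100)
train(model, fake_loader(num_batches=50, num_classes=100), lr=1e-3)

# Stage 2: swap the classification head and fine-tune on the much smaller
# real X-ray dataset (here, a placeholder 2 classes: normal vs. abnormal).
model.fc = nn.Linear(model.fc.in_features, 2)
train(model, fake_loader(num_batches=10, num_classes=2), lr=1e-4)
```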
Instead, the Watson AI Lab researchers used simple image generation programs, many of them gathered from the Internet. The programs had to generate images quickly, so the scientists chose ones written in a simple programming language and containing only a few snippets of code. The requirements for the generated images were also modest: they just had to look like abstract art.
These programs ran so quickly that there was no need to prepare a set of images in advance: the programs generated images and the model was trained on them immediately. This considerably simplifies the process.
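In code, this amounts to treating the generators as an infinite, on-demand data stream rather than a stored dataset. Below is a minimal sketch using PyTorch's IterableDataset; the render function is a placeholder standing in for a real procedural program.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader

class ProceduralImages(IterableDataset):
    """An 'infinite' dataset: images are synthesized on demand rather
    than read from disk."""
    def __init__(self, render, size=64):
        self.render, self.size = render, size

    def __iter__(self):
        while True:
            yield self.render(self.size)

def render(size):
    # Placeholder generator: random noise instead of a real texture program.
    return torch.rand(3, size, size)

loader = DataLoader(ProceduralImages(render), batch_size=8)
for step, batch in zip(range(100), loader):
    pass  # run one training step of the model on `batch` here
```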
The scientists used their large collection of image generation programs to pretrain computer vision models for both supervised and unsupervised image classification tasks. In supervised training, the image data is labeled; in unsupervised training, the model learns to classify images without labels.
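To illustrate the unsupervised side, here is a generic contrastive objective of the kind commonly used for label-free pretraining. The study's exact loss isn't given here, so treat this purely as a sketch of the idea.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.5):
    """Minimal InfoNCE-style objective for unsupervised pretraining: two
    augmented views of the same image (embeddings z1[i], z2[i]) should be
    similar; views of different images should not. This is a generic
    contrastive loss, not necessarily the objective the study used."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature    # pairwise cosine similarities
    targets = torch.arange(z1.size(0))    # matching pairs on the diagonal
    return F.cross_entropy(logits, targets)

# Supervised pretraining would instead minimize
# F.cross_entropy(model(images), labels) on labeled synthetic images.
```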
When they compared their pretrained models to state-of-the-art computer vision models trained on synthetic data, their models were more accurate, placing images in the correct categories more often. Although accuracy still fell short of models trained on real data, this method narrowed the performance gap between models trained on real data and those trained on synthetic data by 38%.
The research also shows that performance scales logarithmically with the number of generative programs: collect more programs, and the model performs even better. The researchers thus point to a clear way of extending their approach.
To determine which factors affect model accuracy, the researchers used each image generation program for pretraining separately. They found that the more diverse a program's images, the better the resulting model performed. They also observed that color images filling the whole canvas did the most to improve model performance.
This approach to pretraining has proven very successful. The researchers plan to apply their methods to other types of data, such as multimodal data combining text and images. They also want to further explore ways of improving image classification performance.
More details about the study can be found in the article.
