Imagine a radiologist examining a chest X-ray from a new patient. She notices the patient has swelling in the tissue but does not have an enlarged heart. Looking to speed up the diagnosis, she might use a vision-language machine-learning model to search for reports from similar patients.
But if the model mistakenly identifies reports with both conditions, the most likely diagnosis could be quite different: if a patient has tissue swelling and an enlarged heart, the condition is very likely cardiac-related, but with no enlarged heart there could be several underlying causes.
In a new study, MIT researchers have found that vision-language models are extremely likely to make such a mistake in real-world situations because they do not understand negation – words like “no” and “not” that specify what is false or absent.
“Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences,” says Kumail Alhamoud, an MIT graduate student and lead author of the study.
The researchers tested the ability of vision-language models to identify negation in image captions. The models often performed no better than a random guess. Building on those findings, the team created a dataset of images with corresponding captions that include negation words describing missing objects.
They show that retraining a vision-language model with this dataset leads to performance improvements when the model is asked to retrieve images that do not contain certain objects. It also boosts accuracy on multiple-choice question answering with negated captions.
But the researchers caution that more work is needed to address the root causes of this problem. They hope their research alerts potential users to a previously unnoticed shortcoming that could have serious implications in high-stakes settings where these models are currently being used, from determining which patients receive certain treatments to identifying product defects in manufacturing plants.
“This is a technical paper, but there are bigger issues to consider. If something as fundamental as negation is broken, we shouldn’t be using large vision/language models in many of the ways we are using them now – without intensive evaluation,” says senior author Marzyeh Ghassemi, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems.
Ghassemi and Alhamoud are joined on the paper by Shaden Alshammari, an MIT graduate student; Yonglong Tian of OpenAI; Guohao Li, a former postdoc at Oxford University; Philip H.S. Torr, a professor at Oxford; and Yoon Kim, an assistant professor of EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. The research will be presented at the Conference on Computer Vision and Pattern Recognition.
Neglecting negation
Vision-language models (VLMs) are trained using huge collections of images and corresponding captions, which they learn to encode as sets of numbers called vector representations. The models use these vectors to distinguish between different images.
A VLM uses two separate encoders, one for text and one for images, and the encoders learn to output similar vectors for an image and its corresponding text caption.
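As a rough illustration of this two-encoder setup, the sketch below scores candidate captions against an image with an open-source CLIP model from the Hugging Face transformers library; the model checkpoint, image file, and captions are examples chosen here for illustration, not necessarily those evaluated in the study.

```python
# Minimal sketch of how a two-encoder VLM scores image-text similarity,
# using an open-source CLIP model (illustrative; not the study's exact setup).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chest_xray.png")  # hypothetical example image
captions = [
    "a scan showing tissue swelling and an enlarged heart",
    "a scan showing tissue swelling but no enlarged heart",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# The image and each caption are encoded separately, then compared;
# a higher score means the caption's vector is closer to the image's vector.
print(outputs.logits_per_image.softmax(dim=-1))
```

Because both encoders map into the same vector space, tasks like retrieval and caption ranking reduce to comparing these vectors.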
“The captions express what is in the images – they are a positive label. And that is actually the whole problem. No one looks at an image of a dog jumping over a fence and captions it by saying ‘a dog jumping over a fence, with no helicopters,’” explains Ghassemi.
Because image-caption datasets don’t contain examples of negation, VLMs never learn to identify it.
To dig deeper into this problem, the researchers designed two benchmark tasks that test the ability of VLMs to understand negation.
For the first, they used a large language model (LLM) to re-caption images in an existing dataset by asking the LLM to think of related objects that are not in an image and write them into the caption. Then they tested the models by prompting them with negation words to retrieve images that contain certain objects, but not others.
For the second task, they designed multiple-choice questions that ask a VLM to select the most appropriate caption from a list of closely related options. These captions differ only by adding a reference to an object that doesn’t appear in the image or by negating an object that does appear in the image.
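To make the two benchmark setups concrete, here is a minimal sketch of how both tests can be scored once a VLM has embedded the images and captions as vectors; the function names and the use of cosine similarity are illustrative assumptions, not code released with the paper.

```python
# Illustrative scoring for the two benchmark tasks, assuming the VLM has already
# produced vector embeddings for the images and the candidate captions.
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb: torch.Tensor, image_embs: torch.Tensor, k: int = 5):
    """Task 1 (retrieval): rank images by similarity to a negated text query,
    e.g. 'a street scene with cars but no traffic lights'."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), image_embs, dim=-1)
    return sims.topk(k).indices  # indices of the best-matching images

def answer_multiple_choice(image_emb: torch.Tensor, caption_embs: torch.Tensor) -> int:
    """Task 2 (multiple choice): pick the caption whose vector best matches the image."""
    sims = F.cosine_similarity(image_emb.unsqueeze(0), caption_embs, dim=-1)
    return int(sims.argmax())
```

A model with affirmation bias tends to score queries and captions by the objects they mention while ignoring the negation words, so it may retrieve exactly the images a negated query excludes.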
The models often failed at both tasks, with image retrieval performance dropping by nearly 25 percent with negated captions. When it came to answering multiple-choice questions, the best models only achieved about 39 percent accuracy, with several models performing at or even below random chance.
One reason for this failure is a shortcut the researchers call affirmation bias – VLMs ignore negation words and focus on the objects in the images instead.
“This does not just happen for words like ‘no’ and ‘not.’ Regardless of how you express negation or exclusion, the models will simply ignore it,” Alhamoud explains.
This was consistent across every VLM they tested.
“A solvable problem”
Since VLMs aren’t typically trained on image captions with negation, the researchers developed datasets with negation words as a first step toward solving the problem.
Using a dataset of 10 million image-text caption pairs, they prompted an LLM to propose related captions that specify what is excluded from the images, yielding new captions with negation words.
They had to be especially careful that these synthetic captions still read naturally, or a VLM could fail in the real world when faced with more complex captions written by humans.
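The snippet below sketches the kind of LLM prompt that could turn an existing caption into a natural-sounding negated caption; the prompt wording, model name, and helper function are hypothetical stand-ins, not the researchers’ exact pipeline.

```python
# Hypothetical sketch of re-captioning with an LLM to add negation.
from openai import OpenAI

client = OpenAI()

def negate_caption(caption: str) -> str:
    """Ask an LLM to mention a plausible, related object that is absent from the image."""
    prompt = (
        f"Here is an image caption: '{caption}'. "
        "Think of a related object that is plausibly NOT in the image, and rewrite the "
        "caption to mention its absence in natural-sounding language. "
        "Return only the rewritten caption."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Example: negate_caption("a dog jumping over a fence")
# might return "a dog jumping over a fence, with no people nearby".
```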
They found that finetuning VLMs with their dataset led to performance gains across the board. It improved models’ image retrieval abilities by about 10 percent, while also boosting performance on the multiple-choice question answering task by about 30 percent.
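For readers curious what such finetuning involves, here is a minimal sketch of updating a CLIP-style model on image-caption batches that include negated captions; `negation_dataloader`, the base checkpoint, and the hyperparameters are placeholder assumptions rather than the paper’s training recipe.

```python
# Minimal finetuning sketch for a CLIP-style VLM on negation-augmented data.
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # placeholder base model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)  # placeholder learning rate

model.train()
for batch in negation_dataloader:  # hypothetical loader yielding processed image-caption pairs
    outputs = model(**batch, return_loss=True)  # built-in contrastive image-text loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```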
“But our solution is not perfect. We are just re-captioning datasets, a form of data augmentation. We haven’t even touched how these models work, but we hope this is a signal that it is a solvable problem and others can take our solution and improve it,” Alhamoud says.
At the same time, he hopes their work encourages more users to think carefully about the problem they want to use a VLM to solve, and to design some examples to test it on before deployment.
In the future, the researchers could expand on this work by teaching VLMs to process text and images separately, which may improve their ability to understand negation. In addition, they could develop additional datasets that include image-caption pairs for specific applications, such as health care.
