A large language model (LLM) deployed to make treatment recommendations can be tripped up by nonclinical information in patient messages, such as typos, extra white space, missing gender markers, or the use of uncertain, dramatic, and informal language, according to a study by MIT researchers.
They found that making stylistic or grammatical changes to messages increases the likelihood that an LLM will recommend that a patient self-manage their reported health condition rather than come in for an appointment, even when that patient should seek medical care.
Their analysis also revealed that these nonclinical variations in text, which mimic how people really communicate, are more likely to change a model's treatment recommendations for female patients, resulting in a higher percentage of women who were erroneously advised not to seek medical care, according to human doctors.
This work "is strong evidence that models must be audited before use in health care, a setting where they are already in use," explains Marzyeh Ghassemi, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems, and senior author of the study.
These findings indicate that LLMs take nonclinical information into account in clinical decision-making in previously unknown ways. They highlight the need for more rigorous studies of LLMs before they are deployed for high-stakes applications such as making treatment recommendations, the researchers say.
"These models are often trained and tested on medical exam questions but then used in tasks that are pretty far from that, like evaluating the severity of a clinical case. There is still so much about LLMs that we don't know," adds Abinitha Gourabathina, a graduate student and lead author of the study.
They are joined on the paper, which will be presented at the ACM Conference on Fairness, Accountability, and Transparency, by graduate student Eileen Pan and postdoc Walter Gerych.
Mixed messages
Large language models such as OpenAI's GPT-4 are being used to draft clinical notes and triage patient messages in health care facilities around the world, in an effort to streamline some tasks and help overburdened clinicians.
A growing body of work has explored the clinical reasoning capabilities of LLMs, especially from a fairness point of view, but few studies have evaluated how nonclinical information affects a model's judgment.
Interested in how gender cues affect LLM reasoning, Gourabathina ran experiments in which she swapped the gender cues in patient notes. She was surprised that formatting errors in the prompts, such as extra white space, caused meaningful changes in the LLM responses.
To explore this problem, the researchers designed a study in which they altered the model's input data by swapping or removing gender markers, adding colorful or uncertain language, or inserting extra spaces and typos into patient messages.
Each perturbation was designed to mimic text that might be written by someone in a vulnerable patient population, based on psychosocial research into how people communicate with clinicians.
For example, extra spaces and typos simulate the writing of patients with limited English proficiency or less technological aptitude, and the addition of uncertain language represents patients with health anxiety.
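As a rough illustration, the perturbations described above can be thought of as simple text transformations applied to a patient message. The sketch below is a minimal, hypothetical version of such transformations; the function names and exact edits are assumptions for illustration, not the study's actual code.

```python
import random
import re


def add_typos_and_whitespace(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Swap adjacent characters in a few words and pad with extra spaces,
    loosely simulating messages from patients with limited English
    proficiency or less comfort with technology (illustrative only)."""
    rng = random.Random(seed)
    words = text.split()
    for i, word in enumerate(words):
        if len(word) > 3 and rng.random() < rate:
            j = rng.randrange(len(word) - 1)
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return "  ".join(words)  # double spaces mimic extra white space


def add_uncertain_language(text: str) -> str:
    """Prepend and append hedging phrases, loosely representing health anxiety."""
    return "I'm not sure, but maybe... " + text + " Could this possibly be something serious?"


def remove_gender_markers(text: str) -> str:
    """Replace gendered pronouns with gender-neutral ones (a crude heuristic)."""
    swaps = {r"\bshe\b": "they", r"\bhe\b": "they",
             r"\bher\b": "their", r"\bhis\b": "their"}
    for pattern, repl in swaps.items():
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    return text
```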
"The medical datasets these models are trained on are usually cleaned and structured, and not a very realistic reflection of the patient population. We wanted to see how these very realistic changes in text could impact downstream use cases," explains Gourabathina.
They used an LLM to create perturbed copies of thousands of patient notes while ensuring the text changes were minimal and preserved all clinical data, such as medications and previous diagnoses. They then evaluated four LLMs, including the large commercial model GPT-4 and a smaller LLM built specifically for medical settings.
They prompted each LLM with three questions based on the patient note: Should the patient manage at home, should the patient come in for a clinic visit, and should a medical resource, such as a lab test, be allocated to the patient?
The researchers compared the LLM recommendations to real clinical responses.
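In outline, an evaluation of this kind pairs each note, original and perturbed, with the three triage questions and records the model's yes/no answers for comparison against the real clinical response. The snippet below sketches such a loop using the OpenAI chat API; the prompt wording, temperature, and answer parsing are assumptions for illustration, not the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TRIAGE_QUESTIONS = [
    "Should the patient manage this condition at home?",
    "Should the patient come in for a clinic visit?",
    "Should a medical resource, such as a lab test, be allocated to the patient?",
]


def triage_recommendations(patient_note: str, model: str = "gpt-4") -> list[bool]:
    """Ask the model each triage question about the note and return yes/no answers."""
    answers = []
    for question in TRIAGE_QUESTIONS:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": "You are assisting with triage. Answer only 'yes' or 'no'."},
                {"role": "user",
                 "content": f"Patient message:\n{patient_note}\n\nQuestion: {question}"},
            ],
            temperature=0,
        )
        answers.append(response.choices[0].message.content.strip().lower().startswith("yes"))
    return answers

# The answers for original vs. perturbed versions of the same note can then be
# compared against the real clinical outcome for each question.
```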
Inconsistent recommendations
They saw inconsistencies in treatment recommendations and significant disagreement among the LLMs when they were fed perturbed data. Across the board, the LLMs exhibited a 7 to 9 percent increase in self-management suggestions for all nine types of altered patient messages.
This means the LLMs were more likely to recommend that patients not seek medical care when messages contained typos or gender-neutral pronouns, for instance. The use of colorful language, such as slang or dramatic expressions, had the biggest impact.
They also found that the models made about 7 percent more errors for female patients and were more likely to recommend that female patients self-manage at home, even when the researchers removed all gender cues from the clinical context.
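The headline numbers are simple rate differences: the share of notes for which a model recommends self-management under perturbed inputs minus the share under the originals. A minimal sketch of that comparison, using made-up counts rather than the study's data, might look like this:

```python
def self_management_rate(recommendations: list[bool]) -> float:
    """Fraction of notes for which the model recommended managing at home."""
    return sum(recommendations) / len(recommendations)


# Hypothetical example: answers to "should the patient manage at home?"
# for the same 1,000 notes in original and perturbed form.
original_answers = [False] * 800 + [True] * 200    # 20% self-management
perturbed_answers = [False] * 720 + [True] * 280   # 28% self-management

shift = self_management_rate(perturbed_answers) - self_management_rate(original_answers)
print(f"Increase in self-management suggestions: {shift:.1%}")  # prints 8.0%
```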
Many of the worst outcomes, such as patients being told to self-manage when they have a serious medical condition, would likely not be captured by tests that focus on the models' overall clinical accuracy.
"In research, we tend to look at aggregated statistics, but there are a lot of things that are lost in translation. We need to look at the direction in which these errors occur: not recommending a visit when you should is much more harmful than doing the opposite," explains Gourabathina.
The inconsistencies caused by nonclinical language become even more pronounced in conversational settings where an LLM interacts with a patient, which is a common use case for patient-facing chatbots.
But in follow-up work, the researchers found that these same changes in patient messages do not affect the accuracy of human clinicians.
"In our follow-up work under review, we further find that large language models are fragile to changes that human clinicians are not," explains Ghassemi. "This is perhaps unsurprising; LLMs were not designed to prioritize patient medical care. LLMs are flexible and performant enough on average that we might think this is a good use case. But we don't want to optimize a health care system that only works well for patients in specific groups."
The researchers want to expand on this work by designing natural-language perturbations that capture other vulnerable populations and better mimic real messages. They also want to explore how LLMs infer gender from clinical text.
