3 Questions: How to help students recognize potential bias in their AI datasets

by Brenden Burgess


Each year, thousands of students take courses that teach them to deploy artificial intelligence models that can help doctors diagnose disease and determine appropriate treatments. However, many of these courses omit a key element: training students to detect flaws in the data used to develop the models.

Leo Anthony Celi, a principal research scientist at MIT's Institute for Medical Engineering and Science, a physician at Beth Israel Deaconess Medical Center, and an associate professor at Harvard Medical School, has documented these shortcomings in a new paper and hopes to persuade course developers to teach students to evaluate their data more thoroughly before incorporating it into their models. Many previous studies have found that models trained mainly on clinical data from white men do not work well when applied to people from other groups. Here, Celi describes the impact of such bias and how educators might address it in their teaching about AI models.

Q: How does bias get into these datasets, and how can these shortcomings be addressed?

A: Any problems in the data will be baked into any modeling of that data. In the past we have described instruments and devices that do not work equally well across individuals. For example, we found that pulse oximeters overestimate oxygen levels for people of color, because there were not enough people of color enrolled in the clinical trials of the devices. We remind our students that medical devices and equipment are optimized on healthy young males. They were never optimized for an 80-year-old woman with heart failure, yet we use them for those purposes. And the FDA does not require that a device work well on the diverse population we will be using it on. All they need is proof that it works on healthy subjects.
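As a rough illustration of the kind of data check this implies, the sketch below groups pulse-oximeter error (SpO2 minus an arterial blood-gas reference, SaO2) by self-reported race to see whether the device systematically overestimates for some groups. The file name and column names (spo2, sao2, race) are hypothetical placeholders, not an actual dataset or schema.

```python
# Hypothetical audit of pulse-oximeter error across demographic groups.
# File and column names are illustrative; real paired SpO2/SaO2 data are required.
import pandas as pd

readings = pd.read_csv("paired_oximetry.csv")  # one row per paired measurement
readings["bias"] = readings["spo2"] - readings["sao2"]  # positive = overestimate

summary = readings.groupby("race")["bias"].agg(["mean", "std", "count"])
print(summary)  # a consistently positive mean bias in one group is a red flag
```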

In addition, electronic health record systems were never meant to be used as building blocks for AI. These records were not designed to be a learning system, and for that reason you have to be very careful about using electronic health records. The electronic health record system needs to be replaced, but that is not going to happen anytime soon, so we have to be smarter. We have to be more creative about using the data we have now, however flawed they are, to build algorithms.

One promising avenue we are exploring is the development of a transformer model of numeric data from electronic health records, including, but not limited to, laboratory test results. Modeling the underlying relationships among laboratory tests, vital signs, and treatments can mitigate the effect of data that are missing because of the social determinants of health and the implicit biases of providers.
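As a purely illustrative sketch of that idea (not the model described here), the snippet below defines a small PyTorch transformer encoder over numeric EHR sequences with explicit observed/missing flags, trained to reconstruct randomly masked measurements from the rest. The feature count, dimensions, and masking scheme are assumptions for illustration.

```python
# Minimal sketch: a transformer encoder over numeric EHR measurements (labs,
# vitals), trained to reconstruct masked values so it learns their co-dependencies.
import torch
import torch.nn as nn

class EHRTransformer(nn.Module):
    def __init__(self, n_features: int = 32, d_model: int = 64, n_layers: int = 2):
        super().__init__()
        # Each time step carries (value, observed-flag) per feature, so missing
        # entries are represented explicitly instead of being imputed up front.
        self.input_proj = nn.Linear(2 * n_features, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_features)  # reconstruct all features

    def forward(self, values, observed):
        # values, observed: (batch, time, n_features); observed is 1 where measured
        x = torch.cat([values * observed, observed], dim=-1)
        h = self.encoder(self.input_proj(x))
        return self.head(h)

def masked_reconstruction_loss(model, values, observed, mask_frac=0.15):
    # Hide a random subset of observed measurements and predict them from the rest.
    mask = (torch.rand_like(observed) < mask_frac) & observed.bool()
    pred = model(values, observed * (~mask).float())
    return ((pred[mask] - values[mask]) ** 2).mean()
```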

Q: Why is it important for AI courses to cover the sources of potential bias? What did you find when you analyzed the content of these courses?

A: Our course at MIT started in 2016, and at some point we realized that we were encouraging people to race to build models that are overfit to some statistical measure of model performance, when in fact the data we are using are rife with problems that people are not aware of. At that point we asked ourselves: How common is this problem?

Our suspicion was that if you looked at courses whose syllabus is available online, or at online courses, none of them even bothers to tell students that they should be paranoid about the data. And it is true: when we examined the various online courses, it is all about building the model. How do you build the model? How do you visualize the data? We found that of the 11 courses we reviewed, only five included sections on bias in datasets, and only two contained any significant discussion of bias.

That said, we cannot discount the value of these courses. I have heard many stories of people who landed jobs on the strength of these online courses, but at the same time, given their influence and their impact, we really need to push hard for them to teach the right skills, because more and more people are drawn to this AI multiverse. It is important that people are truly equipped with the agency to be able to work with AI. We hope this paper will shine a spotlight on this huge gap in the way we teach AI to our students now.

Q: What kind of content should course developers incorporate?

A: One, give them a checklist of questions at the beginning. Where did these data come from? Who were the observers? Who were the doctors and nurses who collected the data? And then learn a little about the landscape of those institutions. If it is an ICU database, they need to ask who makes it to the ICU and who does not, because that already introduces a sampling selection bias. If all the minority patients never even get admitted to the ICU because they cannot reach the ICU in time, then the models will not work for them. Truly, to me, 50 percent of the course content should be understanding the data, if not more, because the modeling itself is easy once you understand the data. A minimal sketch of that kind of pre-modeling check appears below.
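To make the "who makes it into the dataset" question concrete, here is one possible pre-modeling audit: compare the demographic composition of an ICU dataset against a reference population for the hospital's catchment area. The file, column names, and reference shares are hypothetical placeholders, not a real schema.

```python
# Hypothetical sampling-selection audit for an ICU dataset.
import pandas as pd

admissions = pd.read_csv("icu_admissions.csv")  # hypothetical extract

# Reference shares for the catchment area (illustrative numbers only).
reference = {"white": 0.60, "black": 0.20, "hispanic": 0.12, "asian": 0.08}

observed = admissions["race_ethnicity"].str.lower().value_counts(normalize=True)

audit = pd.DataFrame({
    "in_icu_dataset": observed,
    "in_catchment": pd.Series(reference),
})
audit["gap"] = audit["in_icu_dataset"] - audit["in_catchment"]
print(audit.sort_values("gap"))  # large negative gaps flag under-represented groups
```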

Since 2014, the Critical Data Consortium has organized datathons (data "hackathons") around the world. At these gatherings, doctors, nurses, other health care workers, and data scientists come together to comb through databases and try to examine health and disease in the local context. Textbooks and journal papers present diseases based on observations and trials involving a narrow demographic, typically from countries with research resources.

Our main objective now, what we want to teach them, is critical thinking skills. And the main ingredient for critical thinking is bringing together people with different backgrounds.

You cannot teach critical thinking in a room full of CEOs or in a room full of doctors. The environment is just not there. When we have datathons, we don't even have to teach them how to do critical thinking. As soon as you bring the right mix of people together, not just people from different backgrounds but from different generations, you don't even have to tell them how to think critically. It just happens. The environment is right for that kind of thinking. So we now tell our participants and our students: please, please do not start building any model unless you truly understand how the data came about, which patients made it into the database, which devices were used to measure them, and whether those devices are consistently accurate across individuals.

When we hold events around the world, we encourage participants to seek out local datasets, so that they are relevant. There is resistance, because they know they will discover how bad their datasets are. We say that's fine. That is how you fix them. If you don't know how bad they are, you will keep collecting them in a very bad way, and they will be useless. You have to acknowledge that you are not going to get it right the first time, and that is perfectly fine. MIMIC (the Medical Information Mart for Intensive Care database built at Beth Israel Deaconess Medical Center) took a decade before we had a decent schema, and we only have a decent schema because people told us how bad MIMIC was.

We may not have the answers to all of these questions, but we can spark something in people that helps them realize there are so many problems in the data. I am always thrilled to read blog posts from people who attended a datathon and say that their world has changed. Now they are more excited about the field, because they realize the immense potential, but also the immense risk of harm if they do not do this properly.
