"AI consultation is like tossing a coin"! Even missing 67% of patients, Nature can’t stand it anymore
James Alex Posted from Aofei Temple
Qubit | Official account QbitAI
“Some of AI’s medical decisions are actually a coin toss.”
Kun-Hsing Yu, a data scientist at Harvard Medical School, made a surprising statement.
He also added:
Even the winning models, which reached 90% accuracy in the competition, scored only 60-70% when tested on a subset of the original dataset. That is a disastrous failure, and it surprised us.
These remarks come from a recently published article in Nature.
The article questions the reproducibility of AI in medicine and lays out the hidden risks that AI's black-box nature creates across many medical fields and scenarios.
What makes this even more noteworthy is that, despite these problems, AI is already widely used in medicine.
For example, hundreds of U.S. hospitals were already using an AI model to flag early signs of sepsis, yet in 2021 the model was found to miss 67% of sepsis patients.
So what medical risks does AI bring, and how can they be addressed?
Read on.
△ Source: Nature
Artificial intelligence makes medical treatment difficult
Let’s start with the story of how Kun-Hsing Yu, a data scientist at Harvard Medical School, discovered AI’s “coin toss”.
Doubts about using AI to diagnose and screen patients have never gone away in medicine, and Kun-Hsing Yu wanted first-hand evidence from his own research.
He chose lung cancer, one of the most common cancers; 3.5 million Americans die from the disease every year, and many of those deaths could be avoided if CT screening caught it earlier.
The field has indeed drawn plenty of attention from the machine-learning community, and in 2017 the industry held a competition on lung-cancer screening.
The event was part of Kaggle's Data Science Bowl. The organizers provided chest CT scans from 1,397 patients; participating teams developed and tested algorithms, and prizes were awarded based on accuracy. According to the official results, at least five winning models reported accuracy above 90%.
But when Kun-Hsing Yu ran another round of testing, he was shocked to find that even on a subset of the original competition data, the accuracy of these "winning" models dropped to 60-70%.
△ The model architecture shared by one contestant
The above situation is not unique.
Sayash Kapoor, a doctoral researcher at Princeton, has reported reproducibility failures and pitfalls in 329 studies across 17 fields, including medicine.
Building on that research, he and his adviser organized a workshop that attracted 600 researchers from 30 countries.
A senior researcher from Cambridge said at the workshop that he had used machine learning to predict the course of the COVID-19 pandemic, but because of problems such as biases in data from different sources and flaws in training methods, none of the models made accurate predictions. Another researcher shared that he had applied machine learning to psychology studies and could not reproduce the results.
Some participants at the workshop also pointed out a pitfall Google had run into before.
In 2008, Google began using machine learning to analyze datasets generated by user searches in order to predict influenza outbreaks, and promoted the effort widely.
In practice, however, it failed to predict the 2013 flu outbreak. Independent researchers pointed out that the model had latched onto seasonal search terms that had nothing to do with influenza. In 2015, Google stopped publishing its forecasts.
Kapoor believes that, for reproducibility, both the code and the datasets behind an AI model should be available and free of errors. The Cambridge researcher working on COVID-19 models added that data privacy, ethics, and regulatory hurdles are also central obstacles to reproducibility.
They went on to note that datasets themselves are one source of the problem. Publicly available datasets are still scarce, which makes it easy for models to learn biased judgments. In one dataset, for example, doctors prescribed more of a drug to patients of one race than another, which can lead the AI to associate the disease with race rather than with the disease itself.
Another problem is "teaching to the test" when training AI: because data are scarce, the data used to train a model can overlap with the test set. Sometimes even the teams involved are unaware of this, which leads people to be overly optimistic about a model's accuracy.
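To make this overlap concrete, here is a minimal sketch, not taken from the article, of how a team might check for patient-level leakage before reporting accuracy; the file and column names (such as patient_id) are assumptions for illustration:

```python
import pandas as pd

# Hypothetical CSV files with one row per CT scan; "patient_id" is an assumed column name.
train = pd.read_csv("train_scans.csv")
test = pd.read_csv("test_scans.csv")

# Any patient appearing in both splits means the model is "tested" on data it has already seen.
leaked = set(train["patient_id"]) & set(test["patient_id"])

if leaked:
    print(f"Data leakage: {len(leaked)} patients appear in both the training and test sets")
else:
    print("No patient-level overlap between the training and test sets")
```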
△ Sayash Kapoor
Despite these problems, AI models have already been deployed in real diagnostic settings and even put to work seeing patients in hospitals.
In 2021, a diagnostic model called the Epic Sepsis Model was found to have a serious missed-detection problem.
The model screens for sepsis, aiming to head off this systemic infection by spotting early signs of the disease in patients. But researchers at the University of Michigan Medical School analyzed the hospital records of 27,697 people and found that the model missed 67% of patients with sepsis.
The company has since made major changes to the model.
A computational biologist pointed out that part of what makes this problem hard to solve is the lack of transparency in AI models. "We deploy algorithms in practice that we don't understand and whose biases we don't know," he added.
△ The article exposing problems with the Epic Sepsis Model
What is clear is that as long as these problems remain unresolved, the business giants and start-ups involved will stay in trouble.
Last year, Google Health announced that its staff would be dispersed into other teams. Just days ago, Verily, the life-sciences subsidiary incubated by Google, was reported to be laying off about 15% of its employees.
Are there any improvement measures?
Faced with this situation, some researchers and industry insiders are working on ways to improve medical AI.
The first is to build reliable, very large datasets that cover many institutions, countries, and populations and are open to everyone.
Such databases already exist, for example the national biobanks of the United Kingdom and Japan, and the eICU Collaborative Research Database built from an ICU remote-monitoring system.
The eICU Collaborative Research Database, for instance, contains data on roughly 200,000 ICU admissions, provided jointly by Philips Healthcare and MIT's Laboratory for Computational Physiology.
To standardize what goes into such databases, common data-collection standards are needed. The Observational Medical Outcomes Partnership common data model, for example, lets healthcare organizations record information in the same way, which helps machine-learning research in healthcare.
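To illustrate the idea behind a common data model, here is a toy sketch in Python; the field names are invented for illustration and do not reproduce the actual OMOP schema:

```python
# Two hospitals store the same facts under different local field names. Each maps its
# records into one shared schema so the pooled data can be analyzed in the same way.
# (Field names are invented for illustration; they do not reproduce the real OMOP schema.)

SHARED_FIELDS = ("patient_id", "birth_year", "diagnosis_code")

def to_shared(record: dict, mapping: dict) -> dict:
    """Rename a hospital's local field names into the shared schema."""
    return {shared: record[local] for shared, local in mapping.items()}

hospital_a = {"pid": "A-001", "yob": 1962, "icd10": "A41.9"}
hospital_b = {"patient": "B-417", "birth": 1975, "dx": "A41.9"}

mapping_a = {"patient_id": "pid", "birth_year": "yob", "diagnosis_code": "icd10"}
mapping_b = {"patient_id": "patient", "birth_year": "birth", "diagnosis_code": "dx"}

pooled = [to_shared(hospital_a, mapping_a), to_shared(hospital_b, mapping_b)]
assert all(set(r) == set(SHARED_FIELDS) for r in pooled)
print(pooled)  # both records now use the same field names
```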
At the same time, patient privacy must be strictly protected; only data from patients who have given consent should be included in such databases.
The second approach is to remove redundant data, which also helps improve the quality of machine learning.
Redundant data not only lengthens training time and consumes more resources; it also tends to cause overfitting, where the trained model performs well on the training set but poorly on the test set.
In protein-structure prediction, a hot topic in the AI community, this problem has been effectively alleviated: during training, scientists removed from the test set any proteins that were too similar to those used in the training set.
Differences between patients' medical records, however, are not as clear-cut as differences between protein structures, and a database may contain many individuals whose conditions are very similar.
"We need to think carefully about what data we show the algorithm, to balance the representativeness of the data against its richness," commented Søren Brunak, a translational disease systems biologist at the University of Copenhagen.
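By analogy with the protein example, a minimal sketch of such similarity filtering for tabular patient data might look like the following; the synthetic feature matrices and the distance threshold are assumptions for illustration, and real medical data would need a domain-appropriate similarity measure:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical feature matrices: one row per patient, columns are clinical features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))
X_test = rng.normal(size=(200, 20))

# For every test patient, find the distance to the closest training patient.
nn = NearestNeighbors(n_neighbors=1).fit(X_train)
distances, _ = nn.kneighbors(X_test)

# Drop test patients that are nearly identical to someone in the training set.
# The threshold is arbitrary here and would need domain-specific tuning.
threshold = 1.0
keep = distances.ravel() > threshold
X_test_dedup = X_test[keep]

print(f"Kept {keep.sum()} of {len(X_test)} test patients after similarity filtering")
```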
A third measure is to have leading experts in the field draw up checklists that standardize the research and development steps of medical AI.
With a checklist, researchers can more easily work out what to do first and what to do next and proceed in an orderly way; they can also catch issues that might otherwise be missed, such as whether a study is retrospective or prospective, and whether the data and models match their intended use.
In fact, there are already a variety of machine learning checklists, most of which were first proposed by the "EQUATOR Network", an international initiative aimed at improving the reliability of health research.
Kapoor, the Princeton researcher mentioned above, and his team have also published a checklist of 21 questions.
Among other things, they suggest that for a model meant to predict outcomes, researchers should confirm that the training data predate the test data, so that the two datasets remain independent and do not overlap or influence each other.
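One common way to satisfy that check is a time-based split. Here is a minimal sketch; the file name, column names, and cutoff date are assumptions for illustration and are not taken from Kapoor's checklist:

```python
import pandas as pd

# Hypothetical dataset of hospital visits; "visit_date" is an assumed column name.
visits = pd.read_csv("visits.csv", parse_dates=["visit_date"])

# Train only on visits before the cutoff and test only on later ones, so the model
# never "sees the future" and the two sets cannot overlap in time.
cutoff = pd.Timestamp("2021-01-01")
train = visits[visits["visit_date"] < cutoff]
test = visits[visits["visit_date"] >= cutoff]

print(f"Training on {len(train)} visits before {cutoff.date()}, testing on {len(test)} visits after")
```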
Reference links:
[1]https://www.nature.com/articles/d41586-023-00023-2
[2]https://www.wired.com/story/machine-learning-reproducibility-crisis/
[3]https://mp.weixin.qq.com/s/TEoe3d9DYuO7DGQeEQFghA
-over-
"Artificial Intelligence" and "Smart Car" WeChat communities invite you to join!
Friends who are interested in artificial intelligence and smart cars are welcome to join the exchange group to communicate and discuss with AI practitioners, and not miss the latest industry development & technological progress.
PS. When adding friends, please be sure to note your name-company-position~
click here
Featured Posts