The AI Doctor isn’t ready to see you yet
Medical AI models that perform well on held-out data fail abysmally on independent datasets.
My Stable Diffusion prompting skills remain sub-optimal.
One of the big, long-term goals of medical research is this thing called “precision medicine.” It’s a phrase that means different things to different people. Broadly, the idea is that we could greatly improve how we promote health and treat disease by understanding how each person’s unique biological makeup contributes to their health. Rather than treating diseases as general categories — diabetes, colon cancer, depression — we make treatment decisions based on highly individual data, and thereby achieve greater rates of success.
Individual data is widely seen as the key to making precision medicine successful: if we can measure the right things at scale (genetic variants, microbiomes, proteomes, wearable sensors, etc.), then we can find patterns in that complex, individual data that predict a patient’s response to treatment. If we collect enough data about an individual, surely we should be able to mine that data to make better treatment decisions. This, of course, sounds like a problem for deep learning. Deep learning excels at finding patterns in input data that predict outcomes, which is exactly the task facing the precision medicine field.
It seems simple: gather datasets from clinical trials, train an AI model, and predict patient outcomes. A paper that came out in Science earlier this year shows how badly this can fail. A team from the Department of Psychiatry at Yale showed that AI models trained to predict the efficacy of antipsychotic medication in schizophrenia patients did extremely well when trained and validated within individual clinical trials. But these models “performed no better than chance when applied out-of-sample” to other clinical trials. Even worse, pooling data from multiple clinical trials and testing the model on a held-out trial didn’t help at all. The models overfit the training data and failed to generalize to new data.
This raises at least two important issues for AI models in medical research and practice. The first is that it is absolutely critical to establish robust performance standards. In the Yale study, the models performed very well under the standard cross-validation approach: hold out a random fraction of your data for testing and train on the rest. But accurately predicting test data that is just a random sample of your overall training dataset is not a stringent test of an AI model. This is well known, yet most studies that I see use cross-validation as the primary performance assessment, rather than out-of-sample testing on a new dataset, which requires finding more data. The broader deep learning field in medical research needs to establish and enforce such standards.
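To make the distinction concrete, here is a minimal sketch in Python with scikit-learn, run on synthetic data. It is not the Yale team’s code; the hypothetical trial_id variable simply stands in for whatever identifier separates independent clinical trials. Random k-fold splits mix patients from all trials, while leave-one-trial-out holds an entire trial back as the test set.

```python
# A toy illustration, not the Yale group's code: synthetic "trials" in which
# the outcome depends on a different, trial-specific feature in each trial.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_trials, n_per_trial, n_features = 5, 200, 20

X_parts, y_parts, group_parts = [], [], []
for t in range(n_trials):
    # Each trial gets its own shift in feature space, and its outcome is
    # driven by a feature that is uninformative in the other trials.
    shift = rng.normal(scale=2.0, size=n_features)
    X_t = rng.normal(size=(n_per_trial, n_features)) + shift
    y_t = (X_t[:, t] > shift[t]).astype(int)
    X_parts.append(X_t)
    y_parts.append(y_t)
    group_parts.append(np.full(n_per_trial, t))

X = np.vstack(X_parts)
y = np.concatenate(y_parts)
trial_id = np.concatenate(group_parts)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Standard cross-validation: random folds mix patients from every trial,
# so trial-specific shortcuts learned in training also appear in the test folds.
kfold_acc = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)

# Leave-one-trial-out: train on four trials, test on the fifth, unseen trial.
loto_acc = cross_val_score(
    model, X, y, groups=trial_id, cv=LeaveOneGroupOut()
)

print(f"random k-fold accuracy:       {kfold_acc.mean():.2f}")  # well above chance on this toy data
print(f"leave-one-trial-out accuracy: {loto_acc.mean():.2f}")   # typically near chance
```

With a grouping variable like this in hand, scikit-learn’s GroupKFold and LeaveOneGroupOut make the more honest evaluation a one-line change. The harder part, as the Yale study shows, is getting an independent trial to test on at all.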
The second issue raised by this study is that there may be far more uncontrolled variability between clinical trials than we generally appreciate, particularly for neuropsychiatric disease. Clearly the Yale models were overfit to the training data, but the fact that training models on pooled clinical trial data was not successful suggests that these trials differed from each other in significant ways that might not be captured by simply comparing protocols. As a commentary on the Yale paper puts it,
[P]redictions of machine learning models trained on data from a specific context—a population, country, setting, or time period—might rely on features that are associated but not causally related with a clinical outcome in a given study but are not predictive in other contexts.
This seems like a more serious problem for AI-guided precision medicine than overfitting. Overfitting is a known problem that can be solved. But if we can’t collect consistent data across populations and settings, what hope is there for developing models that generalize?
Schizophrenia is a notoriously difficult condition to study and treat, and so the dismal results of the Yale study may not hold for other conditions. But these questions — Is the model overfit? What are the hidden heterogeneities in the data? — should be kept front of mind each time we read a new paper describing AI models that predict disease state. A paper out this week in Nature describes a model that seems to do well at predicting tumor origin and patient prognosis from histological slides of biopsies. The prediction task differs from that of the Yale study: predicting tumor type from an image of the tumor is probably more straightforward than predicting how a schizophrenia patient will respond to medication based on medical chart data. But we should beware the illusion of generalizability in medical AI models.