One consequence of this format is information (data) leakage while the leaderboard test set is in use. Because participants are allowed to make repeated submissions, they can gradually and indirectly glean the answers even though they cannot access the target labels directly. A simple example: by changing the prediction for a single case between submissions and watching whether the reported score improves, a participant can deduce that case's correct label. As a result, accuracy on the leaderboard test set is often very different from accuracy on the final test set, and the ordering of the top teams can change dramatically.
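As a toy illustration (a hypothetical simulation, not the actual contest interface), the sketch below shows how accuracy feedback alone can reveal a hidden label: change the prediction for one case between two submissions and watch whether the reported score rises or falls.

```python
import random

# Hypothetical "leaderboard": hidden binary labels and an accuracy-only oracle.
hidden_labels = [random.randint(0, 1) for _ in range(198)]

def leaderboard_accuracy(predictions):
    """Return the only feedback a participant gets: overall accuracy."""
    correct = sum(p == y for p, y in zip(predictions, hidden_labels))
    return correct / len(hidden_labels)

# Baseline submission: predict 0 for every case.
baseline = [0] * len(hidden_labels)
baseline_acc = leaderboard_accuracy(baseline)

# Probe case 0 by flipping only its prediction and resubmitting.
probe = baseline.copy()
probe[0] = 1
probe_acc = leaderboard_accuracy(probe)

# If the score went up, the true label of case 0 must be 1; if it went down, 0.
inferred = 1 if probe_acc > baseline_acc else 0
print(inferred == hidden_labels[0])  # always True: the label leaked via the score
```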
A recent news article in Nature titled “The reproducibility issues that haunt health-care AI” described the results of a Kaggle contest whose objective was diagnosing lung cancer from chest CT scans. There were ~2100 scans divided among a training set (n=1397), a public (leaderboard) test set (n=198), and a final test set (n=506). The predictor had to classify each scan as lung cancer or not. During the open public phase of the contest, several groups were able to exceed 90% accuracy on the leaderboard test data (i.e. the best prediction from numerous submissions). However, performance on the leaderboard test set turned out to correlate only weakly with performance on the final test set (Figure 1).
One explanation for the unexpectedly weak correlation is gaming of the leaderboard as described above: the rules allowed as many as 5 submissions per day, and the prediction accuracy of each submission was reported back to the team. Such a tactic would not be possible in a real-world clinical trial, because one cannot make numerous predictions and iteratively improve based on the results. A clinical trial is like the final test dataset, in which the medical prediction algorithm makes one and only one final prediction before any results are revealed.
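A small simulation (with made-up numbers; only the two test-set sizes come from the contest) illustrates a second, subtler effect of this setup: if a team makes many submissions of equal true skill and keeps whichever one scored best on the small public test set, that best score will systematically overstate performance on the larger, untouched final test set.

```python
import random

random.seed(0)

TRUE_ACCURACY = 0.60            # assumed real skill of every candidate model
N_PUBLIC, N_FINAL = 198, 506    # public and final test-set sizes from the contest
N_SUBMISSIONS = 100             # many submissions accumulated over the contest

def observed_accuracy(n_cases):
    """Accuracy measured on n_cases when each case is correct with prob TRUE_ACCURACY."""
    return sum(random.random() < TRUE_ACCURACY for _ in range(n_cases)) / n_cases

# The team reports whichever submission scored best on the small public set...
best_public = max(observed_accuracy(N_PUBLIC) for _ in range(N_SUBMISSIONS))

# ...but the selected model still has only TRUE_ACCURACY skill on the final set.
final_score = observed_accuracy(N_FINAL)

print(f"best public score: {best_public:.3f}")  # typically well above 0.60
print(f"final test score : {final_score:.3f}")  # hovers around 0.60
```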
However, even the final test dataset in the Kaggle contests may be vulnerable to information leakage, which is defined as “the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment. Leakage is often subtle and indirect, making it hard to detect and eliminate.” One form of indirect leakage is that the training data and the final test data originate from the same or very similar data sources, such as the same hospital. Data (e.g. CT images) from another hospital may differ in important respects (e.g. the machine that acquires the images), so a predictor trained (and tested) on data from hospital A may have trouble predicting on data from hospital B.
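One common way to guard against this site-level form of leakage is to split by data source rather than by individual scan, so that no hospital contributes cases to both sides of a split. Below is a minimal sketch using scikit-learn's GroupKFold on toy data; the features, labels, and hospital IDs are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import GroupKFold

# Toy stand-ins for image-derived features, labels, and the hospital of origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))             # 300 "scans", 20 features each
y = rng.integers(0, 2, size=300)           # cancer / no-cancer labels
hospital = rng.integers(0, 3, size=300)    # which of 3 hospitals produced each scan

# GroupKFold keeps every hospital entirely on one side of each split, so the
# model is always evaluated on a hospital it has never seen during training.
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=hospital):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[test_idx])[:, 1]
    held_out = sorted({int(h) for h in hospital[test_idx]})
    print(f"held-out hospital(s) {held_out}: log-loss = {log_loss(y[test_idx], probs):.3f}")
```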
One way to avoid information leakage as much as possible is a prospective trial in which the prediction algorithm is trained and its parameters fixed before the labels of the final test dataset are known. Ideally, the finalized predictor would then be used to predict on a wide variety of new data from different sources in the prospective trial. This would ensure out-of-distribution testing, in which training and test data come from different probability distributions, and would help assess the robustness of the predictor to unfamiliar inputs.
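In code terms, the discipline of a prospective trial is that the model is serialized and frozen before any prospective labels exist, and it is scored exactly once when the outcomes become available. A minimal sketch under those assumptions (the data and file name below are placeholders):

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# --- Before the trial: train on retrospective data, then freeze the model. ---
rng = np.random.default_rng(1)
X_retro = rng.normal(size=(500, 20))
y_retro = rng.integers(0, 2, size=500)
model = LogisticRegression(max_iter=1000).fit(X_retro, y_retro)
joblib.dump(model, "frozen_predictor.joblib")   # parameters locked in from here on

# --- During the trial: predict on new cases as they arrive; labels are unknown. ---
X_new = rng.normal(size=(200, 20))              # stand-in for prospectively collected scans
frozen = joblib.load("frozen_predictor.joblib")
trial_predictions = frozen.predict_proba(X_new)[:, 1]

# --- After follow-up: outcomes are adjudicated and the model is scored exactly once. ---
y_new = rng.integers(0, 2, size=200)
print(f"one-shot prospective AUC: {roc_auc_score(y_new, trial_predictions):.3f}")
```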
In summary, the Kaggle contests provide a blatant example of how information leakage can occur and result in the overestimation of prediction accuracy from the public leaderboard results. But it is important to keep in mind that leakage can be much more subtle, and that the best way to avoid it is to finalize the predictor and make the predictions before the test data labels (e.g. cancer or not) have been compiled; in other words, run a prospective trial. Even better would be the continual updating of prediction accuracy once the method is in the field (after approval) so that the predictor can be evaluated on a wide range of populations and circumstances.
Figure 1. "The log-loss score distribution of the top 250 teams in the Kaggle Data Science Bowl Competition. The log-loss scores of the public test set and the final test set of each team were plotted. The red horizontal line indicates the log-loss of outputting the cancer probability as 0.5 for each patient. The blue horizontal line shows the log-loss of outputting cancer probability of each patient as the prevalence of cancer (0.26) in the training set" (Yu et al. Journal of Medical Internet Research, 2020).
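For reference, the two baseline lines in the figure follow directly from the definition of log-loss, assuming the test-set cancer prevalence is close to the 0.26 reported for the training set:

```python
import math

prevalence = 0.26  # cancer prevalence reported for the training set

# Red line: output probability 0.5 for every patient (independent of prevalence).
loss_half = -math.log(0.5)  # ≈ 0.693

# Blue line: output the training prevalence (0.26) for every patient, assuming
# the test set has roughly the same prevalence as the training set.
loss_prev = -(prevalence * math.log(prevalence)
              + (1 - prevalence) * math.log(1 - prevalence))  # ≈ 0.573

print(f"log-loss of constant 0.5 : {loss_half:.3f}")
print(f"log-loss of constant 0.26: {loss_prev:.3f}")
```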