The term ‘overuse’ refers to the unnecessary adoption of AI or advanced ML techniques where alternative, reliable or superior methodologies already exist. In such cases, the use of AI and ML techniques is not necessarily inappropriate or unsound, but the justification for such research is unclear or artificial: for example, a novel technique may be proposed that delivers no meaningful new answers.

Many clinical studies have employed ML techniques to achieve respectable or impressive performance, as shown by area under the curve (AUC) values between 0.80 and 0.90, or even >0.90 (Box 1). A high AUC is not necessarily a mark of quality, as the ML model might be over-fit (Fig. 1). When a traditional regression technique is applied and compared against ML algorithms, the more sophisticated ML models often offer only marginal accuracy gains, presenting a questionable trade-off between model complexity and accuracy1,2,8,9,10,11,12. Even a very high AUC is no guarantee of robustness: with an overall event rate of <1%, a model can achieve an AUC of 0.99 while all negative cases are predicted correctly and the few positive events are not.

Fig. 1: Model fitting. Given a dataset with data points (green points) and a true effect (black line), a statistical model aims to estimate the true effect. The red line exemplifies a close estimation, whereas the blue line exemplifies an overfit ML model with over-reliance on outliers. Such a model might seem to provide excellent results for this particular dataset, but fails to perform well in a different (external) dataset.

There is an important distinction between a statistically significant improvement and a clinically significant improvement in model performance. ML techniques undoubtedly offer powerful ways to deal with prediction problems involving data with nonlinear or complex, high-dimensional relationships (Table 1).
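The rare-event pitfall described above can be illustrated with a minimal pure-Python sketch (the cohort size, event rate and score distributions are invented for illustration): a model can discriminate almost perfectly by rank, yet a conventional 0.5 probability threshold still misses every event.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative
    (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
# Hypothetical cohort: 1,000 patients, <1% event rate (8 events).
labels = [1] * 8 + [0] * 992
# A poorly calibrated model that nonetheless ranks perfectly:
# every positive scores higher than every negative, but all scores sit below 0.5.
scores = [0.35 + random.uniform(-0.02, 0.02) for _ in range(8)] + \
         [0.10 + random.uniform(-0.05, 0.05) for _ in range(992)]

predictions = [int(s >= 0.5) for s in scores]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
sensitivity = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / 8

print(f"AUC: {auc(labels, scores):.2f}")   # near-perfect discrimination
print(f"Accuracy: {accuracy:.3f}")         # looks excellent
print(f"Sensitivity: {sensitivity:.1f}")   # every event is missed
```

Headline metrics alone (AUC, accuracy) therefore say little about clinical usefulness; sensitivity for the rare outcome is zero despite both looking impressive.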
By contrast, many simple medical prediction problems are inherently linear, with features that are chosen because they are known to be strong predictors, usually on the basis of prior research or mechanistic considerations. In these cases, it is unlikely that ML methods will provide a substantial improvement in discrimination2. Unlike in the engineering setting, where any improvement in performance may improve the system as a whole, modest improvements in medical prediction accuracy are unlikely to yield a difference in clinical action.

Table 1: Definitions of several key terms in machine learning

ML techniques should be evaluated against traditional statistical methodologies before they are deployed. If the objective of a study is to develop a predictive model, ML algorithms should be compared to a predefined set of traditional regression techniques in terms of the Brier score (an evaluation metric similar to the mean squared error, used to check the goodness of a predicted probability score), discrimination (or AUC) and calibration. The model should then be externally validated. The analytical methods, and the performance metrics on which they are compared, should be specified in a prospective study protocol and should go beyond overall performance, discrimination and calibration to also include metrics related to over-fitting. Conversely, some algorithms are able to say “I don’t know” when faced with unfamiliar data13, an output that is important but often underappreciated, as knowledge that a prediction is highly uncertain may itself be clinically actionable.

Box 1: Recommendations to avoid overuse and misuse of AI in clinical research
Whenever appropriate, (predefined) sensitivity analyses using traditional statistical models should be presented alongside ML models.
Protocols should be published and peer reviewed whenever possible, and the choice of model should be stated and substantiated.
All model performance parameters should be disclosed and, ideally, the dataset and analysis script should be made public.
Publications using ML algorithms should be accompanied by disclaimers about their decision-making process, and their conclusions should be carefully formulated.
Researchers should commit to developing interpretable and transparent ML algorithms that can be subjected to checks and balances.
Datasets should be inspected for sources of bias, and the necessary steps should be taken to address them.
The type of ML technique used should be chosen taking into account the type, size and dimensionality of the available dataset.
ML techniques should be avoided when dealing with very small, but readily available, convenience clinical datasets.
Clinician–researchers should aim to procure and utilize large, harmonized multicenter or international datasets with high-resolution data, if feasible.
A guideline on the choice of statistical approach, whether ML or traditional statistical techniques, would aid clinical researchers and highlight proper choices.
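As a concrete illustration of the Brier score recommended above, the following minimal sketch compares a traditional logistic regression against a more complex ML model on the same patients (all probabilities are invented for illustration; lower scores indicate better-calibrated predictions):

```python
def brier_score(labels, probs):
    """Mean squared difference between predicted probability and observed outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

# Hypothetical head-to-head comparison on five patients (1 = event occurred).
labels = [1, 0, 0, 1, 0]
logreg_probs = [0.80, 0.20, 0.10, 0.70, 0.30]  # traditional regression
ml_probs = [0.95, 0.05, 0.40, 0.60, 0.25]      # more complex ML model

print(f"Logistic regression Brier score: {brier_score(labels, logreg_probs):.3f}")
print(f"ML model Brier score:            {brier_score(labels, ml_probs):.3f}")
```

In this invented example the simpler regression attains the lower (better) Brier score, the kind of predefined head-to-head comparison the recommendations above call for before a more complex model is preferred.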