Direct-to-consumer medical machine learning and artificial intelligence applications

While regulators such as the FDA focus primarily on the risks to a user’s safety, marketing directly to consumers presents unique concerns that call for particular care.

To begin, let us distinguish two important concepts. First, we have the technical performance of an AI/ML system as given by, say, its specificity/sensitivity (or equivalently, given a prevalence level, its false positive and false negative error rates). Such metrics are often used to describe a system’s accuracy. Second, we have the beliefs/judgements of individual consumers that are formed in part by taking into account the information provided by the AI/ML system. These too can be described as false positive (believing one has a disease that is absent) or false negative (failing to believe one has a disease that is present). We can articulate the same idea probabilistically: high confidence in having a disease that is absent is a mistake in the false positive error direction, whereas low confidence in a disease that is present is a mistake in the false negative error direction⁷. But the veracity of a belief is not the same as the accuracy of an AI/ML system.

These two can substantially come apart when an AI/ML system is introduced into actual practice settings. In Box 1, we provide an example of how the divergence of personal belief and diagnostic accuracy can create major problems for DTC medical AI/ML apps. As we illustrate, even with extremely high sensitivity/specificity, the frequency of false positive judgements among consumers may still be very high. This is in part because consumers are not medical experts and are probably unable to place DTC medical AI/ML app diagnoses in context. Without reliable prior information about the likelihood of a disease’s occurrence, they are liable to identify their own posterior belief (that is, the probability that they have the disease) with the app’s diagnosis. This mistake, known as base-rate neglect, is perhaps the most common fallacy in medical decision-making. Doctors themselves often fall into the trap, but non-expert consumers will be particularly prone to it⁸,⁹. This discrepancy between personal belief and diagnostic accuracy can be further exacerbated by the way the system’s conclusion is expressed and the gravity of the disease. For example, if patients are given a conclusion with very precise language (for example, ‘83.7% chance of disease’), they may overestimate its reliability. Likewise, if the disease is very serious, even a low-probability risk can be quite daunting.

The problem of exaggerated false positive judgements may be further compounded by two factors. First, DTC medical AI/ML apps are aimed at a large and generally young demographic. Consider, for example, the user base of the Apple Watch irregular rhythm notification feature. Within such a heterogeneous, younger and overall healthy population, diseases such as AFib will be very rare. This deflates the base rate of disease, which increases the probability of a false positive judgement. Second, DTC medical AI/ML apps are marketed for quick and inexpensive use, and using them many times is effortless and instantaneous. Consider, for example, apps designed to detect skin cancer from mobile phone images of lesions on a person’s body: consumers can retest the same spot at a different distance or under different lighting (see Box 1). This further increases the probability of false positive judgements by increasing the overall number of tests performed. Thus, an app’s ability to detect the presence of disease can be excellent, while the probability that someone actually has a disease on receiving a positive diagnosis may still be very low. Failing to recognize this can lead users to assign an unreasonably high likelihood to having a disease after receiving a positive diagnosis.
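The effect of repeated testing can be made concrete with a minimal sketch. It assumes, purely for illustration, that repeated tests of a disease-free lesion are independent given the true disease state (the same simplification used in Box 1) and borrows the 95% specificity from that example. The function name is hypothetical. In practice, retests of the same lesion are probably correlated, so these figures should be read as indicative rather than as estimates for any real app.

```python
def prob_any_false_positive(specificity: float, n_tests: int) -> float:
    """Chance that at least one of n_tests on a disease-free lesion is positive.

    Assumes the tests are independent given the true disease state.
    """
    return 1.0 - specificity ** n_tests


# Illustrative value only: 95% specificity, as in the Box 1 example.
for n in (1, 2, 5, 10):
    print(f"{n:>2} tests: {prob_any_false_positive(0.95, n):.1%}")
```

Under these assumptions, a user who tests a healthy lesion five times has roughly a 23% chance of seeing at least one positive result, even though each individual test errs only 5% of the time.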

More generally, this is because for relatively rare diseases, whereas most negatives are indeed true negatives, most positives are false positives (see figure in Box 1). This is a well-known point in Bayesian decision theory, but it tends to be overlooked in the DTC medical AI/ML app discussion, and indeed, it is not something ordinary consumers can be expected to understand. For example, in a well-known study, patients in US clinics were asked to interpret diagnostic tests for several well-known diseases, including HIV and strep throat. They were asked to compare the probability of a positive test when the disease is present (sensitivity) against the probability that a person has the disease if she tests positive (posterior belief/positive predictive value). Most patients estimated these probabilities to be nearly the same¹⁰. This implies that even though a regulator such as the FDA can determine the optimal accuracy threshold (and false positive versus false negative ratio) required for approval, the observed proportion of mistakes under a system’s actual use can be quite different, which can result in unexpected costs for the system as a whole.
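The gap between sensitivity and positive predictive value can be sketched with Bayes’ rule. The snippet below uses the sensitivity and specificity from Box 1. The prevalence levels other than 1% are hypothetical values chosen only to show how the predictive values move with the base rate, and the function names are ours.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), by Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(prevalence, sensitivity, specificity):
    """P(no disease | negative test), by Bayes' rule."""
    true_neg = (1.0 - prevalence) * specificity
    false_neg = prevalence * (1.0 - sensitivity)
    return true_neg / (true_neg + false_neg)


SENSITIVITY, SPECIFICITY = 0.99, 0.95  # values from the Box 1 example

# Hypothetical prevalence levels, for illustration only.
for prevalence in (0.001, 0.01, 0.1, 0.5):
    ppv = positive_predictive_value(prevalence, SENSITIVITY, SPECIFICITY)
    npv = negative_predictive_value(prevalence, SENSITIVITY, SPECIFICITY)
    print(f"prevalence {prevalence:>6.1%}: PPV = {ppv:.1%}, NPV = {npv:.3%}")
```

The sensitivity stays fixed at 99% throughout, whereas the positive predictive value approaches it only when the disease is common. Treating the two as nearly the same is therefore most misleading precisely when the disease is rare.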

Finally, the typical consumer is risk and/or loss averse, so the cost of a positive diagnosis will loom larger in their mind than the benefit of a negative one, a well-known effect in behavioural economics and decision-making¹¹. Indeed, in the context of DTC genetic counselling, we have seen that when people learn that they are susceptible to a certain disease, they routinely overestimate their risk of contracting that disease¹². In other words, people unduly amplify low-probability adverse outcomes in their personal decision-making. Note that while the result in this example is similar to base-rate neglect, the point here is slightly different: risk and loss aversion can further increase base-rate neglect when the outcome being estimated is harmful (compare, instead, estimating a stranger’s health outcomes). This further exacerbates the discrepancy between a positive diagnosis and the veracity of a person’s belief that they have a certain disease. This issue has been recognized for other medical diagnostics, such as HIV self-testing, where researchers have called for confirmatory lab testing, which would reduce the effect of base-rate neglect, and for pre- and post-test counselling, to help patients interpret their results¹³. The discrepancy between the assessment and the personal belief may be made worse by the fact that AI/ML DTC apps, as opposed to ordinary DTC apps, typically produce verdicts using opaque algorithms that even informed users are unable to interpret and thus appropriately weigh against their prior body of evidence.

Note that from the standard regulatory perspective of risk to an individual’s safety, the typical worry when it comes to DTC medical AI/ML apps would probably be false negative, not false positive, judgements. That is, if a diagnostic device fails to identify a present disease, it may lull the patient into a false sense of security, thereby leaving the condition untreated and letting it deteriorate over time. This is indeed a risk, capable of generating both individual and social costs—for example, if people delay treatment it may be more costly to treat them later.

However, as we begin to see, a particularly important risk we face when placing DTC medical AI/ML apps into the hands of consumers is from false positive judgements: when someone falsely believes they have a disease, they will probably schedule a doctor’s visit to confirm the diagnosis. They may also see one or more specialists or request unnecessary prophylactic medication (for example, blood thinners, angiotensin converting enzyme inhibitors and so on). While each of these steps is trivial at an individual level, collectively, they can generate a substantial misuse of scarce medical resources.

Accordingly, while regulators typically focus on a device’s technical specifications (for example, specificity/sensitivity), what also matters for DTC medical AI/ML apps is the veracity of users’ beliefs and their subsequent actions. Even though a positive diagnosis from a very accurate system would not lead a rational Bayesian to a high probability that they actually have a disease, the typical user is unlikely to be a perfectly rational Bayesian and may thereby consume disproportionate healthcare resources. When we reframe the problem this way, it becomes clear that more attention should be paid to how consumers interact with medical AI/ML apps and the negative externalities on healthcare infrastructure generated by their collective behaviour. We discuss some possible approaches below.

Box 1 The (dis)value of information

Suppose that a skin cancer screening app, which takes an image (of a pigmented skin lesion) as its input and provides a binary ‘yes’/‘no’ disease prediction as its output, is marketed directly to consumers. Once the app is downloaded, a consumer is able to perform tests on any lesion on her body. Furthermore, she is able to retest any particular lesion (for example, from a different angle, at an alternative distance or under different lighting). Suppose that the underlying disease is present in 1% of skin lesions population-wide, and that this app is highly accurate, with 99% sensitivity and 95% specificity. If a patient receives a positive diagnosis, then the rational Bayesian posterior that she has the disease should be 16.7%. To make things more realistic, suppose that the patient performs five tests on a single lesion, obtaining two negative diagnoses, followed by three positive diagnoses. If we assume for the sake of illustration that the tests are independent, conditional on disease state, then following all five tests, the probability of disease will be only 0.86%. These are the rational Bayesian verdicts. The reason these numbers look strikingly low is that for medically rare diseases such as skin cancer, while most negatives are true negatives, most positives are false positives. The figure illustrates this situation after 100,000 tests. Out of 5,940 positive diagnoses, 4,950 of them (83%) are false positives. Meanwhile, out of 94,060 negative diagnoses, only 10 (0.01%) are false negatives.

Comparison of true positives to false positives with a highly accurate hypothetical AI/ML skin cancer screening app after 100,000 diagnoses.
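The counts behind this comparison can be reproduced with a short sketch. It takes the prevalence, sensitivity and specificity stated in this box and assumes that each of the 100,000 tests is performed on a separate lesion drawn from the population. The function name and the dictionary keys are hypothetical.

```python
def expected_counts(n_tests, prevalence, sensitivity, specificity):
    """Expected confusion-matrix counts for n_tests on separate lesions."""
    diseased = n_tests * prevalence
    healthy = n_tests - diseased
    tp = diseased * sensitivity   # diseased lesions flagged as positive
    fn = diseased - tp            # diseased lesions missed
    tn = healthy * specificity    # healthy lesions correctly cleared
    fp = healthy - tn             # healthy lesions flagged as positive
    counts = {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
    return {name: round(value) for name, value in counts.items()}


counts = expected_counts(100_000, prevalence=0.01, sensitivity=0.99, specificity=0.95)
positives = counts["TP"] + counts["FP"]
negatives = counts["TN"] + counts["FN"]
print(counts)
print(f"false positives among positives: {counts['FP'] / positives:.1%}")
print(f"false negatives among negatives: {counts['FN'] / negatives:.2%}")
```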

However, the situation in practice is probably quite different. Most users of AI/ML apps are not domain experts and are thereby not aware of the prevalence of disease. Moreover, they are extra sensitive to adverse health outcomes. As a result, and as previous research in judgement and decision-making indicates, they are likely to assume that the prior odds are closer to 1:1 (ref. ¹⁰), thereby putting more weight on the machine prediction than a rational Bayesian should. In this case of assuming prior odds of 1:1, the posterior probability that a person has the disease, after observing one positive diagnosis, will be approximately 95% (a roughly sixfold increase from the Bayesian answer, which is 16.7%). Meanwhile, the posterior probability that one has the disease, after observing two negative diagnoses followed by three positive ones, will be 46.2% (a roughly 54-fold increase from the rational Bayesian answer, which is 0.86%). While this is only a theoretical illustration, it points to the perils of overestimating the posterior probability of disease when AI/ML assessments are introduced into an imperfectly rational and risk-averse population.
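The comparison between the rational Bayesian and the user who assumes even prior odds can be sketched as a sequential update in odds form. As in the text, the sketch assumes that the tests are independent conditional on disease state. The function and variable names are hypothetical, and the 1:1 prior is the illustrative assumption made in this box, not a measured quantity.

```python
def posterior_after_tests(prior_prob, results, sensitivity, specificity):
    """Posterior probability of disease after a sequence of test results.

    `results` holds True for a positive and False for a negative diagnosis.
    Tests are assumed independent given the true disease state.
    """
    lr_positive = sensitivity / (1.0 - specificity)   # likelihood ratio of a positive
    lr_negative = (1.0 - sensitivity) / specificity   # likelihood ratio of a negative
    odds = prior_prob / (1.0 - prior_prob)
    for positive in results:
        odds *= lr_positive if positive else lr_negative
    return odds / (1.0 + odds)


SENSITIVITY, SPECIFICITY = 0.99, 0.95
TRUE_PRIOR = 0.01     # population prevalence from Box 1
ASSUMED_PRIOR = 0.5   # prior odds of 1:1, as under base-rate neglect

scenarios = {
    "one positive": [True],
    "two negatives, then three positives": [False, False, True, True, True],
}
for label, results in scenarios.items():
    rational = posterior_after_tests(TRUE_PRIOR, results, SENSITIVITY, SPECIFICITY)
    neglect = posterior_after_tests(ASSUMED_PRIOR, results, SENSITIVITY, SPECIFICITY)
    print(f"{label}: rational {rational:.2%}, 1:1 prior {neglect:.2%}, "
          f"ratio {neglect / rational:.1f}x")
```

Working in odds keeps each update to a single multiplication, which makes it easy to see that the two users process the test results identically and differ only in the prior they start from.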
