Machine learning relies heavily on probability theory. Managing uncertainty (that is, imperfect or incomplete information) is therefore key to machine learning (ML) projects.
Ideally, deep learning produces dependable predictions on data drawn from the same distribution the models were trained on. In practice, however, the distribution of the training data often differs from that of the data the model is applied to. For example, a 2018 study found that deep learning models trained to detect pneumonia in chest x-rays lost accuracy when they were evaluated on previously unseen data from other hospitals.
Methods such as Gaussian processes are helpful here because they report not only a prediction but also how uncertain that prediction is, which supports data analysis and decision making. For instance, an autonomous car may use this information to decide whether or not it should brake.
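To make this concrete, here is a minimal NumPy sketch of Gaussian process regression. It assumes an RBF kernel and a tiny toy dataset; the function names and hyperparameters are illustrative, not a production implementation. The key point is that the posterior variance grows far from the training data, flagging predictions the model should not be confident about.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=0.1):
    """Posterior mean and per-point variance of GP regression at x_test."""
    K = rbf_kernel(x_train, x_train) + noise**2 * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, np.diag(cov)

x_train = np.array([-2.0, -1.0, 0.0, 1.0])
y_train = np.sin(x_train)
x_test = np.array([0.5, 5.0])  # 5.0 lies far from every training point
mean, var = gp_posterior(x_train, y_train, x_test)
# var is much larger at 5.0 than at 0.5: low confidence far from the data.
```

A self-driving system could threshold on that variance: act on the prediction when the variance is low, and fall back to a conservative behavior (such as braking) when it is high.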
That said, when assessing data and making decisions, it is also important to be able to ask whether a model is certain about its output. While this is a central concern of Bayesian machine learning, deep learning models often ignore the question, leading to situations in which it is difficult to tell whether a model is making a reasonable prediction or guessing at random.
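One simple, if crude, way to ask "is the model certain?" is to compute the entropy of its predictive distribution. The sketch below, a hypothetical example rather than a complete uncertainty method, shows that a peaked softmax output carries low entropy while a near-uniform one (close to guessing at random) carries high entropy:

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy (in nats) of a categorical predictive distribution."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12))  # epsilon guards log(0)

confident = predictive_entropy([0.98, 0.01, 0.01])  # peaked: low entropy
uncertain = predictive_entropy([0.34, 0.33, 0.33])  # near-uniform: high entropy
```

Note that softmax entropy alone can be misleadingly low on out-of-distribution inputs, which is exactly why the epistemic/aleatoric distinction discussed next matters.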
There are two major types of uncertainty in deep learning: epistemic uncertainty and aleatoric uncertainty.
Epistemic uncertainty refers to what a model doesn't know because its training data was limited or unrepresentative. It arises when a model lacks sufficient data and knowledge, which usually happens when there aren't enough samples available for training. Unlike aleatoric uncertainty, it can be reduced by collecting more, and more representative, data.
The collection of observations acquired from the domain cannot be chosen without some systematic bias. While some level of bias is unavoidable, uncertainty increases when the bias and variance in the sample make it a poor representation of the task for which the model will be used.
Unfortunately, in most cases developers have little control over the sampling process and simply obtain their data from whatever database or CSV file they have access to. Complete coverage of a domain is impossible: there will always be some unobserved cases.
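The effect of those unobserved cases can be seen with a toy bootstrap ensemble, one common heuristic (among several) for estimating epistemic uncertainty. In this sketch, assumed data covers only the interval [0, 1]; the members of the ensemble agree inside that range and disagree wildly outside it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + noise, observed only on [0, 1] (limited coverage).
x = rng.uniform(0.0, 1.0, size=50)
y = 2.0 * x + rng.normal(0.0, 0.1, size=50)

def fit_predict(x_fit, y_fit, x_query):
    """Fit a degree-3 polynomial and evaluate it at x_query."""
    coeffs = np.polyfit(x_fit, y_fit, deg=3)
    return np.polyval(coeffs, x_query)

x_query = np.array([0.5, 3.0])  # 3.0 lies outside the observed range
preds = []
for _ in range(100):
    idx = rng.integers(0, len(x), size=len(x))  # bootstrap resample
    preds.append(fit_predict(x[idx], y[idx], x_query))
preds = np.array(preds)

# Disagreement across the ensemble is a rough epistemic-uncertainty signal:
# small at x = 0.5 (well covered), large at x = 3.0 (never observed).
epistemic = preds.std(axis=0)
```

Collecting data around x = 3.0 would shrink that disagreement, which is precisely what makes this kind of uncertainty epistemic rather than aleatoric.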
Aleatoric uncertainty describes the uncertainty that arises from the natural stochasticity of observations. Observations from the domain used to train a model are always incomplete and imperfect.
Aleatoric uncertainty is high when the observations themselves are noisy: repeated measurements of the same input yield different outputs. Unlike epistemic uncertainty, this type of uncertainty cannot be remedied by providing additional data.
Noise in observations occurs when the observations from the domain aren’t precise: in other words, they contain noise. “Observations,” in this instance, refers to what was measured or collected: the input as well as the expected output of a model. “Noise,” on the other hand, refers to the variability in those observations. This variability may be natural or the result of error, and it affects both the input and the output of the model.
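The claim that more data does not remove aleatoric uncertainty can be checked with a small simulation. In this assumed scenario a noisy sensor repeatedly measures one fixed quantity: our uncertainty about the *average* (an epistemic-style quantity) shrinks as readings accumulate, but the spread of individual readings (the aleatoric noise) stays put:

```python
import numpy as np

rng = np.random.default_rng(1)
true_value, noise_sd = 5.0, 0.5  # a noisy sensor reading a fixed quantity

results = {}
for n in (10, 10_000):
    samples = true_value + rng.normal(0.0, noise_sd, size=n)
    results[n] = {
        # Uncertainty about the mean shrinks as data accumulates...
        "stderr_of_mean": samples.std(ddof=1) / np.sqrt(n),
        # ...but the spread of individual readings does not: it stays
        # near the true noise level (0.5) no matter how much we collect.
        "spread": samples.std(ddof=1),
    }
```

No amount of extra sampling drives the spread toward zero; only a better sensor (less noise at the source) would.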
Since data in the real world is messy and imperfect, we should treat it with skepticism and develop systems that can navigate uncertainty.
ML models are susceptible to error, but a model can be useful despite being wrong. How useful depends on the procedure used to develop it, including how samples are selected, how training hyperparameters are chosen, and how the model’s predictions are constructed.
Hence, given the uncertainty inherent in deep learning, the goal should be to build models with good relative performance and to improve on established learning models so that they account for their margin of error.