Restoring Credibility Of Machine Learning Pipeline Output Challenged In Study Of Real World Deployments Through Blockchain Data

  • Hannah
  • November 28, 2020

All domains are going to be turned upside down by machine learning (ML). This is the consistent story we have been hearing for the past few years. Apart from practitioners and some enthusiasts, most people are not aware of the nuances of ML. ML is definitely related to Artificial Intelligence (AI); whether it is a pure subset or a closely related area depends on who you ask. The dream of general AI, machines using cognitive skills to solve previously unseen problems in any domain, gave way to an AI winter when that approach failed to yield results for forty or fifty years. The resurgence of ML turned the field around. ML became tractable as the horsepower of computers increased and much more data about different domains became available to train models. ML shifted the focus away from trying to model the whole world using data and symbolic logic, toward making predictions on narrow domains using statistical methods. Deep learning, characterized by the assembly of multiple ML layers, leads us back toward the dream of general AI; driverless cars are one example.

In general, there are three separate approaches in ML: supervised learning, semi-supervised learning, and unsupervised learning. They differ in the degree of human involvement guiding the learning process.
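To make the distinction concrete, here is a minimal sketch in plain Python (the toy data, the threshold rule, and the crude clustering loop are all hypothetical illustrations, not any particular library's API): supervised learning uses human-provided labels to find a decision boundary, while unsupervised learning must discover the groups on its own. Semi-supervised learning sits between the two, using a few labels plus many unlabeled points.

```python
# Supervised: labels guide the learner to a decision boundary.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
labels = [0, 0, 0, 1, 1, 1]          # human-provided ground truth

# Learn a threshold as the midpoint between the two class means.
mean0 = sum(p for p, l in zip(points, labels) if l == 0) / labels.count(0)
mean1 = sum(p for p, l in zip(points, labels) if l == 1) / labels.count(1)
threshold = (mean0 + mean1) / 2

def predict(x):
    return 0 if x < threshold else 1

# Unsupervised: no labels -- the learner must find structure itself,
# here two clusters found by a crude 1-D k-means loop.
centers = [points[0], points[-1]]    # naive initialisation
for _ in range(10):
    groups = ([], [])
    for p in points:
        # assign each point to its nearest center (False -> 0, True -> 1)
        groups[abs(p - centers[0]) > abs(p - centers[1])].append(p)
    centers = [sum(g) / len(g) for g in groups]
```

With these toy points the supervised threshold lands near 3.0 and the unsupervised centers converge near 1.0 and 5.0; the difference is only in whether the labels did the guiding.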

The success of ML comes from the ability of models, trained on data from a particular domain called a training set, to make predictions in real-world situations in the same domain. In any ML pipeline a number of candidate models are trained using data. At the end of the training, an essential amount of the basic structure of the domain is encoded in the model. This allows the ML model to generalize and make predictions in the real world. For example, a large number of cat videos and non-cat videos can be fed in to train a model to recognize cat videos; at the end of the training, a certain amount of cat-videoness is encoded in the successful predictors.

ML is used in many familiar systems, including movie recommendations based on viewing data and market-basket analysis, which suggests new products based on the current contents of shopping carts, to name a few. Facial recognition, skin cancer prediction from clinical images, identifying diabetic retinopathy from retinal scans, and predicting cancer from MRI scans are all in the domain of ML. Of course, recommender systems for movies are vastly different in scope and importance from those predicting skin cancer or the beginnings of retinopathy, and hence blindness.

The key idea after this training is an independent and identically distributed (iid) evaluation procedure: the predictors are tested on data drawn from the training distribution that they have not yet encountered. This evaluation is used to choose the candidate for deployment in the real world. Many candidates can perform similarly during this phase, even though there are subtle differences between them due to starting assumptions, the number of training runs, the data they were trained on, and so on.
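A minimal sketch of this gating step (the names and the toy threshold "models" are hypothetical stand-ins; a real pipeline would score trained networks with a proper metric): draw a held-out set from the same distribution as the training data, score every candidate on data it has not seen, and keep the top scorer.

```python
import random

random.seed(0)

# Hypothetical domain: the true rule is y = 1 if x > 3, observed directly here.
def sample(n):
    xs = [random.uniform(0, 6) for _ in range(n)]
    return [(x, int(x > 3)) for x in xs]

# Both sets are drawn iid from the same distribution.
train, heldout = sample(200), sample(50)
# train would be used to fit real models; here the candidates are fixed
# thresholds, standing in for models produced by different random seeds.
candidates = {f"model_{t}": (lambda x, t=t: int(x > t)) for t in (2.5, 3.0, 3.5)}

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# iid evaluation: score each candidate on held-out data, keep the best.
scores = {name: m and accuracy(m, heldout) for name, m in candidates.items()}
best = max(scores, key=scores.get)
```

The candidate whose threshold matches the true rule scores perfectly here; the point of the paper discussed below is that in practice several candidates tie at this stage.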


Ideally the iid evaluation is a proxy for the expected performance of the model. This helps separate the wheat from the chaff: the duds from the iid-optimal models. That there will be some structural misalignment between the training set and the real world is obvious. The real world is messy and chaotic; images are blurry, operators are not trained to capture pristine images, and equipment breaks down. Still, all predictors deemed equivalent at the evaluation phase should show similar defects in the real world. A paper written by three principal authors and backed by about thirty other researchers, all from Google, probes this theory to explain many high-profile failures of ML models in the real world. This includes the highly publicized Google Health fiasco, where the model did not perform well in field tests in Thailand aimed at diagnosing diabetic retinopathy from scans.

The paper notes that predictors which performed similarly during the evaluation phase did not perform equally in the real world. Uh oh: this means that the duds and the good performers could not be distinguished at the end of the pipeline. Moreover, some candidates performed better when the images were blurry, and some when the data had unusual perspectives, revealing differences in predictive ability across different presentations from the same domain. This paper is a sledgehammer taken to the process of choosing a predictor and to the current practices of implementing an ML pipeline.

The paper identifies the root cause of this behavior as underspecification in ML pipelines. Underspecification is a well-understood and well-documented phenomenon in ML; it arises when there are more unknowns than independent linear equations expressible from the training set. The first claim in the paper is that underspecification in ML pipelines is a key obstacle to reliably training models that behave as expected in deployment. The second claim is that underspecification is ubiquitous in modern applications of ML and has substantial practical implications. There is no easy cure for underspecification: all ML predictors deployed through the current pipeline are tainted to some degree.
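The linear-algebra intuition can be shown in a few lines of Python (a hypothetical toy model, not the paper's setup): with two unknown coefficients and only one training equation, many models fit the training data exactly, and they reveal their differences only on inputs outside it.

```python
# Toy linear model y = a*x1 + b*x2 with two unknowns (a, b)
# but only one training example: fewer equations than unknowns.
train_x, train_y = (1.0, 1.0), 2.0   # constraint: a*1 + b*1 = 2

# Many (a, b) pairs satisfy the training data perfectly:
solutions = [(0.0, 2.0), (1.0, 1.0), (2.0, 0.0)]
for a, b in solutions:
    assert a * train_x[0] + b * train_x[1] == train_y  # all iid-perfect

# But they disagree as soon as the input leaves the training pattern:
test_x = (1.0, 0.0)
predictions = [a * test_x[0] + b * test_x[1] for a, b in solutions]
# predictions: [0.0, 1.0, 2.0] -- equivalent in training, divergent in deployment
```

The training set simply does not contain enough information to pin the model down; which of the equivalent solutions the pipeline lands on depends on incidental choices like initialization and random seed.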

The solution is to be aware of the perils of underspecification: choose multiple predictors, subject them to stress tests using more real-world data, and pick the best performer; in other words, expand the testing regime with more real-world data. All of this points to the need for better-quality data in both the training and evaluation sets, which brings us to the increasing use of blockchains and smart contracts to implement solutions in many areas. These systems have a way of making cleaner, structured, real-world data available earlier. Access to higher-quality and more varied training data may reduce underspecification and hence create a pathway to better ML models, faster.
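A stress test of this kind can be sketched as follows (hypothetical toy models and data; the paper's actual stress tests perturb real images and subpopulations): two candidates are perfectly and equally accurate on clean iid data, but one has latched onto a spurious cue, and a stress set where the cue no longer tracks the label separates them.

```python
# Each example is (signal, spurious_cue, label); in the clean data the
# spurious cue happens to track the label perfectly.
clean = [(0.0, 0.0, 0), (0.2, 0.1, 0), (1.0, 0.9, 1), (0.8, 1.0, 1)]

def model_signal(x):    # learned the real structure of the domain
    return int(x[0] > 0.5)

def model_shortcut(x):  # learned the spurious shortcut instead
    return int(x[1] > 0.5)

def accuracy(m, data):
    return sum(m((s, c)) == y for s, c, y in data) / len(data)

# iid evaluation cannot tell the two apart:
assert accuracy(model_signal, clean) == accuracy(model_shortcut, clean) == 1.0

# Stress test: field data where the shortcut no longer holds
# (e.g. blurry images wash out the incidental cue).
stress = [(0.9, 0.1, 1), (0.1, 0.9, 0), (0.7, 0.2, 1), (0.3, 0.8, 0)]
scores = {"signal": accuracy(model_signal, stress),
          "shortcut": accuracy(model_shortcut, stress)}
# scores: {'signal': 1.0, 'shortcut': 0.0} -- the stress test separates them
```

The extra real-world data does the work that the iid evaluation could not: it supplies the missing equations that distinguish the iid-equivalent candidates.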

Another solution is to release multiple successful candidates after the gating done through iid evaluation. This allows many parallel models to be stress tested using the wisdom of the crowd. There is no longer a single winner; there are multiple winners. It looks like ML predictors are becoming more like the humans they were meant to replace.
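One way to picture this (a hypothetical sketch, not a prescription from the paper): deploy a small committee of iid-equivalent candidates, aggregate by majority vote, and treat disagreement on a given input as a flag that underspecification is biting there.

```python
# Hypothetical committee: three candidates that all passed iid evaluation,
# differing only in where they place the decision boundary.
candidates = [lambda x: int(x > 2.9),
              lambda x: int(x > 3.0),
              lambda x: int(x > 3.1)]

def vote(x):
    """Majority prediction, plus a flag for inputs where the committee splits."""
    votes = [m(x) for m in candidates]
    majority = int(sum(votes) * 2 >= len(votes))
    disagreement = len(set(votes)) > 1
    return majority, disagreement

# Far from the boundary the committee agrees; near it, the disagreement
# flag marks exactly the inputs on which the candidates were underspecified.
```

For example, `vote(1.0)` returns `(0, False)` while `vote(3.05)` returns `(1, True)`: the crowd of winners both predicts and tells you where its members diverge.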