
Data Drift and Machine Learning Model Sustainability – Analytics Insight

In many real-world applications where machine learning models are deployed in production, the data evolve over time, and models built to analyze such data can quickly become obsolete. It is therefore essential for data scientists to monitor model performance over time: is the deployed model sustainable, and is it performing consistently? When performance degrades, the usual cause is not a fault in the machine learning model itself but a change in the data: the model can no longer capture the variability of the dependent or independent variables. A shift occurs at the data level, because the distribution of the data used to train the model, called the source distribution, differs from the distribution of the newly available data, the target distribution. As a result, the relationships between input and output data can change over time, which in turn changes the unknown underlying mapping function. This gap is where the concept of data drift comes in.

Over time, a machine learning model starts to lose its predictive power, a concept known as model drift. What is generally called data drift is a change in the distribution of data used in a predictive task. There are different types of data drift such as:

Feature Drift (or covariate drift):
It happens when some previously infrequent or even unseen feature vectors become more frequent, and vice versa. However, the relationship between the features and the target remains the same. [1] Covariate Drift is defined as the case where:
Example: Temperature readings changing from degrees Fahrenheit to degrees Celsius.

Figure 1. Changes in the distribution of the feature “period” over time lead to covariate shift

Concept Drift:
It is the phenomenon where the statistical properties of the class variable (in other words, the target we want to predict) change over time. Hence, this drift concerns the change in the target variable over time. [1] Concept Drift is defined as the case where:
Example: Inventory changes over time

Figure 2. Concept drift can be detected by the divergence of model & data decision boundaries and the concomitant loss of predictive ability

Dual Drift:
This applies to a situation in which feature drift and concept drift both occur. [1] Dual Drift is defined as the case where:
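In the notation used later in this article, where t and u denote two time periods, X the features and Y the target, the three cases are commonly distinguished as sketched below. This is an assumed, widely used formalization and not a quotation from [1]:

```latex
\begin{align*}
\text{Feature (covariate) drift:}\quad & P_t(X) \neq P_u(X), \quad P_t(Y \mid X) = P_u(Y \mid X) \\
\text{Concept drift:}\quad & P_t(Y \mid X) \neq P_u(Y \mid X), \quad P_t(X) = P_u(X) \\
\text{Dual drift:}\quad & P_t(X) \neq P_u(X), \quad P_t(Y \mid X) \neq P_u(Y \mid X)
\end{align*}
```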

The following techniques have been explored in Python using the ‘scikit-multiflow’ library. Each technique is demonstrated with a simulation of how it would work on real-life data.
Adaptive Windowing (ADWIN):
ADWIN [2] (ADaptive WINdowing) is an adaptive sliding window algorithm for detecting change and keeping updated statistics about a data stream. ADWIN allows algorithms not adapted for drifting data to be resistant to this phenomenon. The general idea is to keep statistics from a window of variable size while detecting concept drift.
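The windowing idea can be illustrated without any library. The following is a minimal sketch and not scikit-multiflow's implementation: the class name `SimpleAdwin`, its parameters and the synthetic stream are assumptions made for illustration, and the bucket-based compression of the original algorithm is omitted. It keeps a window of values in [0, 1] and drops the oldest part whenever two sub-windows have means that differ by more than a Hoeffding-style bound.

```python
import math
import random
from collections import deque


class SimpleAdwin:
    """Simplified adaptive window over values in [0, 1] (illustrative only)."""

    def __init__(self, delta=0.002, min_sub_window=5):
        self.delta = delta                    # confidence parameter
        self.min_sub_window = min_sub_window  # smallest sub-window compared
        self.window = deque()
        self.total = 0.0

    def _find_cut(self):
        """Return the length of an old sub-window whose mean differs, or None."""
        n = len(self.window)
        if n < 2 * self.min_sub_window:
            return None
        log_term = math.log(4.0 * n / self.delta)
        head_sum = 0.0
        for n0, value in enumerate(self.window, start=1):
            head_sum += value
            n1 = n - n0
            if n0 < self.min_sub_window:
                continue
            if n1 < self.min_sub_window:
                break
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps_cut = math.sqrt(log_term / (2.0 * m))
            if abs(head_sum / n0 - (self.total - head_sum) / n1) > eps_cut:
                return n0
        return None

    def add_element(self, value):
        """Add one observation; return True if old data was dropped."""
        self.window.append(value)
        self.total += value
        changed = False
        cut = self._find_cut()
        while cut is not None:
            for _ in range(cut):
                self.total -= self.window.popleft()
            changed = True
            cut = self._find_cut()
        return changed


def simulate(n_before=1000, n_after=1000, seed=7):
    """Bernoulli stream whose mean jumps from 0.2 to 0.8 (synthetic data)."""
    rng = random.Random(seed)
    detector = SimpleAdwin()
    detections = []
    for i in range(n_before + n_after):
        p = 0.2 if i < n_before else 0.8
        value = 1.0 if rng.random() < p else 0.0
        if detector.add_element(value):
            detections.append(i)
    return detections, len(detector.window)


detections, window_size = simulate()
print("change detected at indices:", detections)
print("final window size:", window_size)
```

After a detection the window contains mostly post-change data, so statistics computed on it (for example its mean) track the new distribution.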
This algorithm is implemented in Python in the scikit-multiflow library, a specialized tool that can perform drift detection. Sample code to simulate concept drift and detect it using this library is as follows:

The results of the simulation conducted to detect data drift look like:

Drift Detection Method (DDM):
DDM (Drift Detection Method) [3] is a concept change detection method based on the PAC learning model premise that the learner’s error rate will decrease as the number of analysed samples increases, as long as the data distribution is stationary.
If the algorithm detects an increase in the error rate that surpasses a calculated threshold, either a change is detected or the algorithm warns the user that a change may occur in the near future; the latter state is called the warning zone.
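The thresholding described above can be sketched as follows. This is a minimal illustration and not scikit-multiflow's code: the class name `SimpleDDM`, the levels of 2 and 3 standard deviations, the minimum of 30 samples and the deterministic error stream are assumptions chosen for the example.

```python
import math


class SimpleDDM:
    """Minimal DDM-style detector over a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level  # multiples of s_min for a warning
        self.drift_level = drift_level      # multiples of s_min for a drift
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error is 1 for a misclassified sample and 0 otherwise."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n                # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)   # its standard deviation
        if self.n < self.min_samples:
            return "STABLE"
        if p + s <= self.p_min + self.s_min:    # remember the best state seen
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return "DRIFT"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "WARNING"
        return "STABLE"


def synthetic_errors(n_before=1000, n_after=1000):
    """Deterministic error stream: 10% errors, then 50% errors."""
    for i in range(n_before + n_after):
        period = 10 if i < n_before else 2
        yield 1 if i % period == period - 1 else 0


def first_alerts(detector, stream):
    """Return the first index at which each non-stable status is reported."""
    first = {}
    for i, error in enumerate(stream):
        status = detector.update(error)
        if status != "STABLE":
            first.setdefault(status, i)
    return first


print(first_alerts(SimpleDDM(), synthetic_errors()))
```

On this stream the warning zone is entered shortly after the error rate rises at index 1000, and the drift signal follows once the error rate has moved further from its recorded minimum.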
The results of the simulation conducted to detect data drift using the DDM are attached below:

Early Drift Detection Method (EDDM):
EDDM (Early Drift Detection Method) [4] aims to improve the detection rate of gradual concept drift in DDM, while keeping a good performance against abrupt concept drift.
This method works by keeping track of the average distance between two errors instead of only the error rate. For this, it is necessary to keep track of the running average distance and the running standard deviation, as well as the maximum distance and the maximum standard deviation.
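The bookkeeping described above can be sketched as follows. This is an illustrative simplification and not the library's implementation: the class name `SimpleEDDM`, the ratios 0.95 and 0.90, the minimum of 30 errors and the synthetic stream are assumptions, and the maximum is tracked as a single combined score (mean distance plus two standard deviations).

```python
import math


class SimpleEDDM:
    """Minimal EDDM-style detector based on the distance between errors."""

    def __init__(self, min_errors=30, warning_ratio=0.95, drift_ratio=0.90):
        self.min_errors = min_errors
        self.warning_ratio = warning_ratio
        self.drift_ratio = drift_ratio
        self.reset()

    def reset(self):
        self.since_last_error = 0
        self.num_errors = 0
        self.mean = 0.0        # running mean distance between errors
        self.m2 = 0.0          # running sum of squared deviations (Welford)
        self.best_score = 0.0  # largest mean + 2 * std seen so far

    def update(self, error):
        """error is 1 for a misclassified sample and 0 otherwise."""
        self.since_last_error += 1
        if not error:
            return "STABLE"
        distance = self.since_last_error
        self.since_last_error = 0
        self.num_errors += 1
        delta = distance - self.mean
        self.mean += delta / self.num_errors
        self.m2 += delta * (distance - self.mean)
        std = math.sqrt(max(0.0, self.m2) / self.num_errors)
        score = self.mean + 2.0 * std
        if self.num_errors < self.min_errors:
            return "STABLE"
        if score > self.best_score:
            self.best_score = score
            return "STABLE"
        ratio = score / self.best_score
        if ratio < self.drift_ratio:
            self.reset()
            return "DRIFT"
        if ratio < self.warning_ratio:
            return "WARNING"
        return "STABLE"


def synthetic_errors(n_before=1000, n_after=1000):
    """Deterministic error stream: an error every 10 samples, then every 2."""
    for i in range(n_before + n_after):
        period = 10 if i < n_before else 2
        yield 1 if i % period == period - 1 else 0


def first_alerts(detector, stream):
    """Return the first index at which each non-stable status is reported."""
    first = {}
    for i, error in enumerate(stream):
        status = detector.update(error)
        if status != "STABLE":
            first.setdefault(status, i)
    return first


print(first_alerts(SimpleEDDM(), synthetic_errors()))
```

On this particular stream the detector reacts some time after the change at index 1000: the errors are perfectly evenly spaced before the change, so the spread of distances first grows and raises the score before the shorter distances pull it down.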
The results of the simulation implemented to detect data drift using EDDM are attached below:

Drift Detection Method based on Hoeffding’s bounds with moving average-test (HDDM_A):
HDDM_A [5] is a drift detection method based on Hoeffding’s inequality. HDDM_A uses the average as its estimator. It receives as input a stream of real values and returns the estimated status of the stream: STABLE, WARNING or DRIFT.
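A simplified, one-sided version of this test can be sketched as follows. It is not the method of [5] in full and not the library's code: the class name `SimpleHDDMA`, the confidence values 0.001 and 0.005 and the synthetic stream are assumptions, and only increases in the mean are monitored. The sketch remembers the prefix of the stream with the lowest upper confidence bound on its mean and compares that prefix mean with the overall mean using a Hoeffding bound.

```python
import math


class SimpleHDDMA:
    """One-sided Hoeffding test on the running average of values in [0, 1]."""

    def __init__(self, drift_confidence=0.001, warning_confidence=0.005):
        self.drift_confidence = drift_confidence
        self.warning_confidence = warning_confidence
        self.reset()

    def reset(self):
        self.n = 0
        self.total = 0.0
        self.n_min = 0        # length of the reference prefix
        self.total_min = 0.0  # sum of the values in the reference prefix

    def _upper(self, total, n):
        """Mean plus a Hoeffding bound at the drift confidence level."""
        bound = math.sqrt(math.log(1.0 / self.drift_confidence) / (2.0 * n))
        return total / n + bound

    def _mean_increased(self, confidence):
        if self.n_min == self.n:
            return False
        m = 1.0 / self.n_min - 1.0 / self.n
        eps = math.sqrt(m / 2.0 * math.log(2.0 / confidence))
        return self.total / self.n - self.total_min / self.n_min >= eps

    def update(self, value):
        self.n += 1
        self.total += value
        if self.n_min == 0 or (
            self._upper(self.total_min, self.n_min)
            >= self._upper(self.total, self.n)
        ):
            self.n_min, self.total_min = self.n, self.total
        if self._mean_increased(self.drift_confidence):
            self.reset()
            return "DRIFT"
        if self._mean_increased(self.warning_confidence):
            return "WARNING"
        return "STABLE"


def synthetic_errors(n_before=1000, n_after=1000):
    """Deterministic 0/1 stream with mean 0.1, then mean 0.5."""
    for i in range(n_before + n_after):
        period = 10 if i < n_before else 2
        yield 1 if i % period == period - 1 else 0


def first_alerts(detector, stream):
    """Return the first index at which each non-stable status is reported."""
    first = {}
    for i, value in enumerate(stream):
        status = detector.update(value)
        if status != "STABLE":
            first.setdefault(status, i)
    return first


print(first_alerts(SimpleHDDMA(), synthetic_errors()))
```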
The results of the simulation conducted to detect data drift using HDDM_A are attached below:

Drift Detection Method based on Hoeffding’s bounds with moving weighted average-test (HDDM_W):
HDDM_W [5] is an online drift detection method based on McDiarmid’s bounds. HDDM_W uses the EWMA statistic as its estimator. It receives as input a stream of real predictions and returns the estimated status of the stream: STABLE, WARNING or DRIFT.
The results of the simulation conducted to detect data drift using HDDM_W are attached below:

Kolmogorov-Smirnov Windowing method for concept drift detection (KSWIN):
KSWIN (Kolmogorov-Smirnov Windowing) [6] is a concept change detection method based on the Kolmogorov-Smirnov (KS) statistical test. The KS test is a statistical test that makes no assumption about the underlying data distribution. KSWIN can monitor data or performance distributions. Note that the detector accepts one-dimensional input as an array.
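The idea of applying a KS test to a sliding window can be sketched as follows. This is a hedged illustration and not the library's implementation: the class name `SimpleKSWindow`, the window and sample sizes, and the choice to compare the newest values with the oldest values in the window are assumptions, and the decision uses the asymptotic two-sample critical value instead of an exact p-value.

```python
import math
from collections import deque


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_critical_value(n, m, alpha):
    """Asymptotic critical value of the two-sample KS test at level alpha."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) * math.sqrt((n + m) / (n * m))


class SimpleKSWindow:
    """Compare the newest and oldest values of a sliding window."""

    def __init__(self, window_size=100, stat_size=30, alpha=0.005):
        self.window = deque(maxlen=window_size)
        self.stat_size = stat_size
        self.critical = ks_critical_value(stat_size, stat_size, alpha)

    def update(self, value):
        """Add one value; return True if the two samples differ significantly."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return False
        values = list(self.window)
        oldest = values[:self.stat_size]
        newest = values[-self.stat_size:]
        if ks_statistic(oldest, newest) > self.critical:
            self.window.clear()  # start again on the new concept
            return True
        return False


def simulate(n_before=1000, n_after=1000):
    """Deterministic stream whose values shift upward by 1.0 at n_before."""
    detector = SimpleKSWindow()
    detections = []
    for i in range(n_before + n_after):
        value = (i % 10) / 10.0 + (0.0 if i < n_before else 1.0)
        if detector.update(value):
            detections.append(i)
    return detections


print("change detected at indices:", simulate())
```

Because the input is a one-dimensional stream, the same detector can be fed either a single feature or a per-sample performance value such as the prediction error.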
The results of the simulation conducted to detect data drift using KSWIN are attached below:

Page-Hinkley method for concept drift detection (PageHinkley):
This change detection method works by computing the observed values and their mean up to the current moment. Page-Hinkley does not output warning zone warnings, only change detections. The method works by means of the Page-Hinkley test [7]. In general terms, it will detect a concept drift if the observed mean at some instant is greater than a threshold value lambda.
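A minimal sketch of a Page-Hinkley style test for an increase in the mean follows. The class name `SimplePageHinkley`, the tolerance `delta`, the threshold of 50 and the synthetic stream are assumptions chosen for illustration. The sketch accumulates how far each value lies above the running mean and signals a change once that accumulated excess passes the threshold lambda.

```python
class SimplePageHinkley:
    """Page-Hinkley style test for an upward shift in the mean."""

    def __init__(self, min_samples=30, delta=0.005, threshold=50.0):
        self.min_samples = min_samples
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # lambda in the usual notation
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0

    def update(self, value):
        """Add one value; return True if a change is detected."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        # Accumulated deviation above the running mean, never below zero.
        excess = value - self.mean - self.delta
        self.cumulative = max(0.0, self.cumulative + excess)
        if self.n >= self.min_samples and self.cumulative > self.threshold:
            self.reset()
            return True
        return False


def simulate(n_before=1000, n_after=1000):
    """Deterministic 0/1 stream with mean 0.1, then mean 0.5."""
    detector = SimplePageHinkley()
    detections = []
    for i in range(n_before + n_after):
        period = 10 if i < n_before else 2
        value = 1.0 if i % period == period - 1 else 0.0
        if detector.update(value):
            detections.append(i)
    return detections


print("change detected at indices:", simulate())
```

A smaller threshold reacts sooner but raises the risk of false alarms; a larger one trades detection delay for stability.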
The results of the simulation conducted to detect data drift using Page-Hinkley are attached below:

Different methods and metrics are used depending on the type of drift that needs to be monitored. The methods for each type of drift are as follows:
Measuring Concept Drift:
The following are the proposed quantitative measures of concept drift, including the key measure, drift magnitude, which measures the distance between two concepts Pt(X,Y) and Pu(X,Y). It is computed with a distance between distributions, such as the Hellinger distance or the total variation distance:
where Z represents a vector of random variables.
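As a hedged illustration, drift magnitude between two discrete distributions over Z can be computed as below. The function names and the dictionary representation (value of Z mapped to its probability) are assumptions made for this sketch. Both distances are shown because they are distinct measures: total variation is half the sum of absolute differences, while Hellinger compares square roots of the probabilities.

```python
import math


def total_variation_distance(p, q):
    """Half the sum of absolute probability differences; lies in [0, 1]."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def hellinger_distance(p, q):
    """Distance between the square roots of the probabilities; lies in [0, 1]."""
    keys = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys
    )
    return math.sqrt(squared / 2.0)


# Hypothetical distributions of Z in time periods t and u.
p_t = {"a": 0.5, "b": 0.5}
p_u = {"a": 0.9, "b": 0.1}

print("total variation:", total_variation_distance(p_t, p_u))
print("Hellinger:", hellinger_distance(p_t, p_u))
```

Both measures are 0 for identical distributions and 1 for distributions with no common support, which makes magnitudes comparable across attribute subsets.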

Measuring Covariate Drift:
For the conditional drifts it is necessary to deal with multiple distributions, one for each value of the conditioning attributes. We address this by weighted averaging, as described:

For a given subset of the covariate attributes there will be a conditional probability distribution over the possible values of the covariate attributes for each specific class, y. The conditional marginal covariate drift is the weighted sum of the distances between each of these probability distributions from time period t to u, where the weights are the average probability of the class over the two time periods.

For each subset of the covariate attributes there will be a probability distribution over the class labels for each combination of values of those attributes, x̄, at each time period. Therefore, the Conditional Class Drift can be calculated as the weighted sum of the distances between these probability distributions, where the weights are the average probability over the two periods of the specific value for the covariate attribute subset.
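The two weighted sums described above can be sketched with a single function. This is an illustrative sketch under stated assumptions: joint distributions are given as dictionaries keyed by (x, y) pairs, the total variation distance is used as the distance, only conditioning values seen in both periods are included, and all function names are hypothetical.

```python
from collections import defaultdict


def total_variation_distance(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def split_by_condition(joint, cond_index):
    """Return the marginal of the conditioning variable and the conditionals."""
    marginal = defaultdict(float)
    for key, prob in joint.items():
        marginal[key[cond_index]] += prob
    conditionals = defaultdict(dict)
    for key, prob in joint.items():
        cond, other = key[cond_index], key[1 - cond_index]
        if marginal[cond] > 0:
            previous = conditionals[cond].get(other, 0.0)
            conditionals[cond][other] = previous + prob / marginal[cond]
    return marginal, conditionals


def conditional_drift(joint_t, joint_u, cond_index):
    """Weighted sum of distances between conditional distributions.

    cond_index=1 conditions on y (conditional covariate drift);
    cond_index=0 conditions on x (conditional class drift).
    """
    marg_t, cond_t = split_by_condition(joint_t, cond_index)
    marg_u, cond_u = split_by_condition(joint_u, cond_index)
    drift = 0.0
    for value in set(marg_t) & set(marg_u):  # assumption: seen in both periods
        weight = 0.5 * (marg_t[value] + marg_u[value])
        drift += weight * total_variation_distance(cond_t[value], cond_u[value])
    return drift


# Hypothetical joint distributions P(X, Y) for time periods t and u.
joint_t = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.25, ("b", 1): 0.25}
joint_u = {("a", 0): 0.40, ("a", 1): 0.10, ("b", 0): 0.10, ("b", 1): 0.40}

print("conditional covariate drift:", conditional_drift(joint_t, joint_u, 1))
print("conditional class drift:", conditional_drift(joint_t, joint_u, 0))
```

In this example the marginal distribution of X is identical in both periods, so all of the drift shows up in the conditional measures.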

The following figure shows how data drift can be monitored and dealt with at scale for a machine learning model in production. Implementing, as part of the pipeline, a system that retrains the model periodically after some time t, or as soon as it detects drift using one of the aforementioned methods, is the most robust way to overcome drift at the production level.
Figure 3. ML model solution flow at production with data drift being monitored
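The retraining policy described above (retrain after a fixed interval, or earlier when drift is detected) can be sketched as follows. Everything here is a toy stand-in chosen for illustration: the model simply predicts the most common recent label, drift is flagged when the error rate over a recent window exceeds a threshold, and the names `train`, `run_pipeline`, `max_age`, `window` and `error_threshold` are hypothetical. In practice any of the detectors discussed above could replace the threshold check.

```python
from collections import Counter, deque


def train(labels):
    """Toy model: always predict the most common label in the training data."""
    return Counter(labels).most_common(1)[0][0]


def run_pipeline(stream, initial_labels, max_age=500, window=50,
                 error_threshold=0.5):
    """Return (index, reason) for every retraining event."""
    model = train(initial_labels)
    recent_labels = deque(initial_labels, maxlen=window)
    recent_errors = deque(maxlen=window)
    age = 0  # samples seen since the last (re)training
    events = []
    for i, label in enumerate(stream):
        recent_errors.append(int(model != label))
        recent_labels.append(label)
        age += 1
        window_full = len(recent_errors) == window
        drift = window_full and sum(recent_errors) / window > error_threshold
        if drift or age >= max_age:
            model = train(recent_labels)  # retrain on the most recent data
            events.append((i, "drift" if drift else "schedule"))
            recent_errors.clear()
            age = 0
    return events


# Synthetic labels: the majority class flips from 0 to 1 at index 600.
stream = [0] * 600 + [1] * 600
events = run_pipeline(stream, initial_labels=[0] * 50)
print(events)
```

Combining both triggers keeps the model fresh when drift is too slow to trip a detector, while still reacting quickly to abrupt changes.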

This paper brings to the forefront the concept of data drift, an often missed dimension in setting up an ML workflow in production systems. The concept of data drift is demonstrated with simulated data in this study, and some key metrics for measuring it are discussed in detail. A high-level ML pipeline that incorporates drift-related error correction into an ML workflow is also outlined. With this work we aim to highlight that drift modelling and correction are not to be taken lightly and should be part and parcel of any automated ML pipeline in production systems.

[1] Webb, G. I., Lee, L. K., Goethals, B., & Petitjean, F. (2018). Analyzing concept drift and shift from sample data. Data Mining and Knowledge Discovery, 32(5), 1179-1199.
[2] Bifet, A., & Gavalda, R. (2007). Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 443-448. Society for Industrial and Applied Mathematics.
[3] Gama, J., Medas, P., Castillo, G., & Rodrigues, P. P. (2004). Learning with drift detection. SBIA 2004, 286-295.
[4] Baena-Garcia, M., Del Campo-Avila, J., Fidalgo, R., Bifet, A., Gavalda, R., & Morales-Bueno, R. (2006). Early drift detection method. In Fourth International Workshop on Knowledge Discovery from Data Streams.
[5] Frías-Blanco, I., del Campo-Ávila, J., Ramos-Jimenez, G., et al. (2014). Online and non-parametric drift detection methods based on Hoeffding’s bounds. IEEE Transactions on Knowledge and Data Engineering, 27(3), 810-823.
[6] Raab, C., Heusinger, M., & Schleif, F.-M. (2020). Reactive soft prototype computing for concept drift streams. Neurocomputing.
[7] Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41(1/2), 100-115.

Dr. Anish Roy Chowdhury is currently an Industry Data Science Leader at Brillio, a leading digital services organization. In previous roles he was with ABInBev as a Data Science Research Lead, working in areas such as Assortment Optimization and Reinforcement Learning. He also led several machine learning projects in the areas of Credit Risk, Logistics and Sales Forecasting. During his stint with HP Supply Chain Analytics he developed data quality solutions for logistics projects and worked on building statistical models to predict spare part demand for large format printers. Prior to HP, he had 6 years of work experience in the IT sector as a database programmer. During his stint in IT he worked on Credit Card Fraud Detection, among other analytics-related projects. He has a PhD in Mechanical Engineering (IISc Bangalore). He also holds an MS degree in Mechanical Engineering from Louisiana State University, USA. He did his undergraduate studies at NIT Durgapur, with published research in GA-Fuzzy Logic applications to medical diagnostics.
Dr. Anish is also a highly acclaimed public speaker with numerous best presentation awards from national and international conferences, and has conducted several workshops in academic institutes on R programming and MATLAB. He has several academic publications to his credit and is a chapter co-author for a Springer publication and for a best-selling Oxford University Press publication on MATLAB.

Paulami Das is a seasoned Analytics Leader with 14 years’ experience across industries. She is passionate about helping businesses tackle complex problems through Machine Learning. Over her career, Paulami has worked on several large and complex Machine Learning-centric projects around the globe.
In her current role as Head of Data Science at Brillio Technologies, she leads a team that solves some of the most challenging problems for companies across industries using AI tools and techniques. Her team is also instrumental in driving innovation by building state-of-the-art AI-based products in the areas of Natural Language Processing, Computer Vision, and Augmented Analytics.
Prior to Brillio, Paulami was the Director of Business Development at Cytel, where she helped scale the new Analytics business lines. She has also held Analytics leadership positions with JP Morgan Chase and Dell.
Paulami graduated from IIT Kanpur with a degree in Electrical Engineering. She also holds an MBA from IIM Ahmedabad.

Muskan Gupta is a final-year undergraduate student at NMIMS University, pursuing Data Science Engineering at the Mukesh Patel School of Technology, Management & Engineering. She has been interning with Brillio since May 2020 and has made key contributions to this whitepaper. Her interests lie in Machine Learning and NLP, and she has worked on projects in these areas throughout her education and during her internship.