Data Quality in Machine Learning – Finextra

We regularly see and hear phrases like “data is the lifeblood of an organisation” or “the world’s most valuable resource is no longer oil, but data”. There is no denying that data is an incredibly valuable resource. But a theme that is overlooked in many articles, or mentioned only in passing, is the importance of data quality.
Technology by itself is not a panacea. You can have any technology you like and as much data as you like, but if you don’t have high-quality data you are taking an immense risk.
This short paper starts by looking at two types of data, quantitative and qualitative, and then looks at the challenges of using this data in Machine Learning applications.

Quantitative vs Qualitative Data
Quantitative data and the results stemming from it are applauded by many as being “scientific” and more “valuable” than non-quantitative data. However, quantitative data is not without faults and limitations. Firstly, quantitative data often produces a binary result, for example a “yes” or “no” answer. This may then be used to make decisions without understanding the true meaning of that answer, which can lead to suboptimal decisions and even missed opportunities.
Secondly, many papers have been written expounding the benefits of quantitative data, and it is reasonable to assume many more will be written in the future. Sometimes we fall into the trap of believing that if something is said enough times it must be true, or at least have an element of truth. Thirdly, it is often assumed that a strong correlation is synonymous with absolute certainty. We sometimes say we have found a correlation with 95% certainty and we focus on the 95%. We forget the remaining 5%: at that threshold, roughly one in twenty sets of completely unrelated data would still appear correlated purely by chance.
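The point about the remaining 5% can be made concrete with a small simulation. The sketch below is illustrative only: the sample size of 30, the number of trials and the function names are assumptions chosen for the example, and the threshold of 0.361 is the two-sided 5% critical value of the correlation coefficient for 30 points as listed in standard statistical tables. It generates pairs of variables that are unrelated by construction and counts how often they nevertheless look "significantly" correlated.

```python
import random
from math import sqrt

# Two-sided 5% critical value of |r| for n = 30 points (28 degrees of
# freedom), as listed in standard correlation tables. Assumed for this sketch.
CRITICAL_R = 0.361


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def false_positive_rate(trials=4000, n=30, seed=42):
    """Share of trials in which two unrelated variables pass the 5% test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]  # drawn independently of xs
        if abs(pearson_r(xs, ys)) > CRITICAL_R:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"Unrelated data passing the 95% test: {rate:.1%}")
```

The printed share should land close to 5%, which is the risk that gets forgotten when only the 95% is quoted.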
Qualitative data also suffers from faults: the researcher’s bias can shape the findings, the results can be difficult or impossible to replicate, and the cost and time needed to generate the data can be considerable.
Whether you are using quantitative or qualitative data, the quality of the data is key. No matter what technology you use to slice and dice the data, rubbish data generates rubbish results.
There are many articles extolling the ability of Machine Learning or AI to improve decision making using qualitative or unstructured data as well as quantitative data.
It would make life easier for so many people if data came in nice, easy-to-use structured packages. An article by Forbes estimated that less than 20% of data is structured. The fact that so much data is unstructured makes life a little more challenging. Machine Learning models such as BERT from Google are an excellent means of extracting meaning from many unstructured data sets.

Data in Machine Learning
Machine Learning is dependent on data: quantitative, qualitative, structured and unstructured. More importantly, Machine Learning is dependent on good-quality data. The importance of data is illustrated by looking at a high-level Machine Learning process.

Step 1. Data collection
Step 2. Data annotation
Step 3. Ingest data into the model
Step 4. Train the model
Step 5. Evaluate results
Step 6. Additional classification of data / fine-tuning
Step 7. Seek additional data to enhance the model
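As a rough illustration of how the steps above fit together, here is a toy sketch in Python. Everything in it is an assumption made for the example: the texts and labels are invented, and the "model" is a simple word-count scorer rather than a real Machine Learning library. Its only purpose is to show where annotated data enters the process and how much every later step depends on it.

```python
from collections import Counter, defaultdict

# Steps 1-2: collected texts with human annotations (invented examples).
annotated = [
    ("payment arrived on time", "positive"),
    ("great service and fast payment", "positive"),
    ("helpful and fast support", "positive"),
    ("payment failed again", "negative"),
    ("slow service and rude support", "negative"),
    ("the transfer failed and support was slow", "negative"),
]
# Annotated examples kept aside for evaluation only.
held_out = [
    ("fast and helpful service", "positive"),
    ("transfer failed", "negative"),
]


def ingest(text):
    """Step 3: turn raw text into features (here, lowercase word tokens)."""
    return text.lower().split()


def train(examples):
    """Step 4: count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in examples:
        counts[label].update(ingest(text))
    return counts


def predict(model, text):
    """Pick the label whose training texts share the most words with `text`."""
    tokens = ingest(text)
    scores = {label: sum(words[t] for t in tokens) for label, words in model.items()}
    return max(scores, key=scores.get)


def evaluate(model, examples):
    """Step 5: accuracy on annotated examples the model was not trained on."""
    correct = sum(predict(model, text) == label for text, label in examples)
    return correct / len(examples)


model = train(annotated)
print("held-out accuracy:", evaluate(model, held_out))
# Steps 6-7 would loop back from here: re-annotate doubtful items or
# collect more data when the evaluation result is not good enough.
```

Changing even one label in the annotated list changes the word counts the model learns from, which is why annotation quality matters so much in this process.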

Step 2 in this process provides an example of the importance of data. Annotation of data is very expensive and time consuming, but it is also critical to the success of the Machine Learning application. One challenge that is overlooked at this stage is the variation in how those carrying out the classification understand the text. For example, if one annotator’s background leads them to use elaborated code and another’s leads them to use restricted code, they are likely to interpret the same text differently. You can try to overcome some of these challenges by having guidelines, having the data reviewed several times and then reaching a consensus. But this adds to the cost and time needed to build a production Machine Learning application.
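One common way to put a number on the variation between annotators is Cohen's kappa, which measures how often two annotators agree beyond what chance alone would produce. The sketch below is a minimal illustration, not a prescribed method: the labels, the annotator lists and the simple majority-vote rule for reaching a consensus are all assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Probability that both pick the same label if each labelled at random
    # according to their own label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def majority_label(votes):
    """Consensus label, or None when no label has a strict majority."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Invented annotations of the same six texts by two people.
annotator_1 = ["risk", "risk", "neutral", "risk", "neutral", "neutral"]
annotator_2 = ["risk", "neutral", "neutral", "risk", "neutral", "risk"]

print("kappa:", round(cohen_kappa(annotator_1, annotator_2), 2))
print("consensus:", majority_label(["risk", "risk", "neutral"]))
```

A kappa near 1 indicates the annotators read the text the same way, while a value near 0 indicates agreement no better than chance, which is a sign the guidelines need work before more money is spent on annotation.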
Another key challenge is selection bias throughout this process, or even reverse engineering data to generate desired results. The issue of bias is not new, and many approaches have been implemented to reduce selection bias, including taking care in selecting the learning model and the training data. The success of these attempts to reduce or eliminate bias is questionable in many instances.
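One simple check when selecting training data is to compare how each group is represented in the training sample with its share of the population the model will be used on. The sketch below assumes hypothetical customer groups and made-up proportions, and it catches only one form of selection bias (over- or under-representation of known groups), so it should be read as a starting point rather than a complete test.

```python
from collections import Counter


def representation_gap(sample_groups, population_shares):
    """Sample share minus population share for every group."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts[group] / total - share
        for group, share in population_shares.items()
    }


# Hypothetical figures: the real customer base is 70% retail, but the
# training sample was drawn mostly from corporate accounts.
population_shares = {"retail": 0.70, "corporate": 0.30}
training_groups = ["retail"] * 45 + ["corporate"] * 55

gaps = representation_gap(training_groups, population_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.0%} versus population")
```

A large positive or negative gap suggests the sample does not reflect the population, and that results from the model may not carry over to the under-represented group.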
In conclusion, Machine Learning offers huge potential, but before even considering which Machine Learning technology to use, attention must be paid to the quality of the data and how you can source it. It is also worth looking at the return on investment (ROI), as good-quality data is not always cheap.