Four Steps to Prepare Your Enterprise for Machine Learning
Machine learning can significantly automate generating insights from big data. Here’s how to get started.
By Ankur Garg, June 8, 2020
Machine learning (ML) implementation is often misunderstood, yet knowing the tools and processes that turn data into insights is vital. As the volume of big data grows, generating insights with traditional analytics becomes more difficult. ML can automate much of this work, which makes it a natural complement to the growth of big data, provided the supporting ML infrastructure is understood.
That means addressing the four key steps to preparing for ML:
Sourcing the data
Establishing a trusted zone or “single source of truth” (SSOT)
Establishing modeling environments
Provisioning model outputs or insights to downstream applications
Step 1: Source the Data
Data sourcing includes surveying accessible data types for inputs to the algorithm, as well as the processes and technologies needed to tap into these sources. Examples of data sources include core transactions, customer-provided information, external databases, market research data, social media, and website traffic.
Step 2: Establish a Trusted Zone
Once data is sourced, it must be curated in an SSOT, which brings the data together in one consistent place. It is important to demonstrate data validity and quality at each point where data is handled. Before data can be consumed for ML, it must be aggregated, reconciled, and validated. Key attributes of a trusted zone include:
A central repository of data, aggregated from multiple channels.
Clearly defined and documented data elements and data lineage.
Documentation of assumptions. For example, if hospital data from a previous management system conflicts with elements of the current system, perhaps the most recent data entry prevails. This assumption must be documented.
A protocol for handling unexpected exceptions. Continuing the previous example, suppose a patient has conflicting same-date entries in both systems. The stack should capture such exceptions in a business intelligence report, and the data may then be entered manually into the trusted zone.
Daily reporting that matches and reconciles counts across systems.
Architecture that expands vertically and horizontally.
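The hospital example in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a prescribed implementation: the `Entry` fields, the function names, and the source labels are all hypothetical. It applies the documented rule (the most recent entry prevails), sets aside same-date conflicts as exceptions for manual review, and runs a simple count reconciliation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    # All field names are hypothetical, chosen only for this illustration.
    patient_id: str
    entry_date: date
    value: str
    source: str  # e.g. "legacy" or "current" management system


def build_trusted_zone(entries):
    """Apply the documented assumption: the most recent entry prevails.

    Same-date entries with different values cannot be resolved by that
    rule, so they are returned as exceptions for manual review.
    """
    by_patient = {}
    for entry in entries:
        by_patient.setdefault(entry.patient_id, []).append(entry)

    trusted, exceptions = {}, []
    for patient_id, group in by_patient.items():
        latest_date = max(e.entry_date for e in group)
        latest = [e for e in group if e.entry_date == latest_date]
        if len({e.value for e in latest}) == 1:
            trusted[patient_id] = latest[0]
        else:
            # Conflicting same-date entries: report them, do not guess.
            exceptions.append((patient_id, latest))
    return trusted, exceptions


def reconcile_counts(entries, trusted, exceptions):
    """Check that every sourced patient is either trusted or flagged."""
    sourced = len({e.patient_id for e in entries})
    return sourced == len(trusted) + len(exceptions)
```

In practice the exceptions list would feed the business intelligence report, and the reconciliation check would run as part of the daily reporting.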
The data store that houses the trusted zone should be highly available and resilient to failure. Increasingly, data warehouses are hosted on cloud platforms. Cloud benefits include high availability, cost-effectiveness, and horizontal and vertical scaling. Another trend is the increasing adoption of NoSQL databases (such as MongoDB), which can provide greater flexibility and better performance for storing unstructured data than traditional relational databases.
As with all things digital, regulation and security of data are critical. Data is more intimate today, and privacy and security regulations are more complicated. The data governance team should be part of any ML implementation. Having data lineage that tracks data sourcing is necessary to ensure compliance.
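As a rough illustration of data lineage, each record can carry a log of where it came from and what was done to it. The sketch below is a minimal example under assumptions: the class names, the `record` and `origin` methods, and the system labels are hypothetical and do not correspond to any particular governance tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LineageStep:
    system: str   # where the data was sourced or processed
    action: str   # e.g. "sourced", "aggregated", "validated"
    at: datetime  # when the step happened (UTC)


@dataclass
class TrackedRecord:
    payload: dict
    lineage: list = field(default_factory=list)

    def record(self, system, action):
        """Append one lineage step and return self so calls can be chained."""
        step = LineageStep(system, action, datetime.now(timezone.utc))
        self.lineage.append(step)
        return self

    def origin(self):
        """Return the first system in the lineage, or None if untracked."""
        return self.lineage[0].system if self.lineage else None
```

A record built this way can answer where a data element came from, which is the kind of question a compliance review is likely to ask.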
Data that is collected and held must be protected. Security and risk management teams must be involved to establish and monitor best practices and to develop security breach response plans. For smaller institutions, investing in outsourced assistance is worthwhile. If cloud vendors are used, they must contractually agree that data security is their responsibility. Transmission of data between on-premises systems and the cloud must be in scope and should be carefully designed to address security risk. Encrypting data before transmission to the cloud is valuable, even when transmission occurs over a secured virtual private network.