In artificial intelligence, enterprises still not minding their data

Data is the raw material that fuels artificial intelligence and machine learning initiatives, but it actually can’t be that raw. It needs to be as accurate, timely and well-vetted as possible — or else AI will deliver erroneous or biased results. At this stage, most enterprises haven’t quite locked down the viability of the data employed within their AI efforts.

Photo: Joe McKendrick

The potential biases at the code level in AI have been well documented in works such as Cathy O’Neil’s Weapons of Math Destruction, which calls for greater transparency in the algorithms that are driving decisions on everything from creditworthiness to corporate performance.

Data needs to be looked at as well, and efforts to do so are only beginning, according to O’Reilly’s latest survey of 1,388 data scientists, executives and IT professionals on AI adoption. The survey finds that AI efforts are maturing from prototype to production, but organizational support remains an obstacle. “Data governance isn’t yet a priority,” the report’s authors, Roger Magoulas and Steve Swoyer, write. They indicate only about one-fifth of respondents “have implemented formal data governance processes and/or tools to support and complement their AI projects.”

Practitioners and executives recognize that data governance is a pressing requirement for AI, and a number do intend to put more of it in place. About one in four respondents, 26%, say their organizations will likely put formal data governance processes or tools in place over the coming year, and at least 35% expect to do so within the next three years. Even so, this means close to two-thirds of AI adopters will still lack strong data governance mechanisms.

“AI adopters—much like organizations everywhere—seem to treat data governance as an additive rather than an essential ingredient,” Magoulas and Swoyer state. They urge AI adopters to incorporate such best practices and mechanisms as data provenance, data lineage, consistent data definitions, and rich metadata management into their AI projects from the start. “Think of data governance as analogous to observability in software development,” they add. Data governance is all about ensuring transparency in the results AI delivers.
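To make those practices a little more concrete, here is a minimal sketch in Python of what recording provenance, lineage and consistent definitions alongside a dataset might look like. It is not taken from the O’Reilly report: the `DatasetRecord` class, the `fingerprint` and `lineage` helpers, and the sample loan data are all hypothetical names invented for illustration, and a real governance tool would capture far more.

```python
from dataclasses import dataclass
import hashlib


@dataclass
class DatasetRecord:
    """Hypothetical governance metadata attached to one dataset version."""
    name: str
    source: str        # provenance: where the data came from
    columns: dict      # consistent definitions: column name -> agreed meaning
    checksum: str      # content hash, so an audit can confirm what a model saw
    parents: tuple = ()  # lineage: the records this dataset was derived from


def fingerprint(rows):
    """Return a SHA-256 hex digest over the rows, in order."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def lineage(record):
    """List the names of every upstream dataset, nearest parent first."""
    names = []
    for parent in record.parents:
        names.append(parent.name)
        names.extend(lineage(parent))
    return names


# Illustrative data only: two rows of (customer id, annual income).
raw_rows = [(1, 52000), (2, None)]
clean_rows = [row for row in raw_rows if row[1] is not None]

raw = DatasetRecord(
    name="loans_raw",
    source="hypothetical CRM export",
    columns={"income": "annual gross income, USD"},
    checksum=fingerprint(raw_rows),
)
clean = DatasetRecord(
    name="loans_clean",
    source="derived: rows with missing income dropped",
    columns=raw.columns,
    checksum=fingerprint(clean_rows),
    parents=(raw,),
)

print(lineage(clean))
```

The point of the sketch is that each training dataset carries its own answer to "where did this come from, and what was done to it?", which is the kind of transparency the authors compare to observability in software.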

The O’Reilly survey also finds TensorFlow leads as the most widely adopted AI tool, cited by almost 55% of respondents. TensorFlow also was the top choice in the previous year’s survey. Python-related tools also continue to dominate the AI development scene, Magoulas and Swoyer observe. “Four of the five most popular tools for AI-related work are either Python-based or dominated by Python tools, libraries, patterns, and projects.” Along with TensorFlow, this includes scikit-learn (48%), PyTorch (36%) and Keras (34%).

In terms of machine learning techniques, supervised learning remains the most popular, used at 73% of mature AI sites. Supervised learning also led in the previous year’s survey. Deep learning follows at about 66%, and model-based methods at 60%. Multiple methods are in play at a majority of companies.
