Getting Hadoop to Jump Through AI/ML Hoops – The Next Platform

  • Lauren
  • July 22, 2021

Just a decade ago, the enterprise IT push was to make Hadoop the platform for storage and analytics. At that time, cloud hesitancy was still looming for large on-prem organizations. Hadoop, no matter how that ecosystem played out over the years, became a major source of investment with the idea that compute, analytics, and I/O could be more seamless and even cheaper.
For companies still riding that wave (and we’ve talked to plenty that still use Hadoop and several of its offshoots, including YARN and HDFS), the rise of AI/ML has forced a rethink, since integrating modern AI platforms and tools with Hadoop is not necessarily simple.
It’s not just a matter of integration, either. It’s also how Hadoop’s native file system was designed: for large, complex datasets rather than small files that need to be handled in near real time. Over the last couple of years, a number of startups have sought ways to let users keep leveraging those Hadoop investments and start dipping their toes in AI/ML waters while still using familiar HDFS, via NFS connectors, for example.
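To make the small-file point concrete, here is a back-of-the-envelope sketch in Python of how namespace metadata grows when the same volume of data is split into small files. The 150 bytes per namespace object is a commonly quoted rule of thumb and the 128 MiB block size is an assumed setting; neither figure comes from Quobyte or from this article, and the function names are hypothetical.

```python
# Assumption: roughly 150 bytes of metadata-server heap per namespace object
# (one object per file plus one per block). A rule of thumb, not a measurement.
BYTES_PER_OBJECT = 150
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size of 128 MiB


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without floating point."""
    return -(-a // b)


def namespace_objects(total_bytes: int, file_size: int) -> int:
    """Count file and block objects needed to hold total_bytes in files of file_size."""
    num_files = ceil_div(total_bytes, file_size)
    blocks_per_file = max(1, ceil_div(file_size, BLOCK_SIZE))
    return num_files * (1 + blocks_per_file)


def metadata_heap_bytes(total_bytes: int, file_size: int) -> int:
    """Estimated metadata heap for the whole data set, under the assumptions above."""
    return namespace_objects(total_bytes, file_size) * BYTES_PER_OBJECT


TIB = 1024 ** 4
large_files = metadata_heap_bytes(TIB, 1024 ** 3)   # 1 TiB stored as 1 GiB files
small_files = metadata_heap_bytes(TIB, 100 * 1024)  # 1 TiB stored as 100 KiB files

print(f"1 GiB files:   {large_files / 1024 ** 2:.1f} MiB of metadata")
print(f"100 KiB files: {small_files / 1024 ** 3:.1f} GiB of metadata")
```

Under these assumptions the same terabyte costs a few thousand times more metadata when stored as small files, which is the kind of pressure the small-file workloads described here put on a design built around large files.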

Among the companies trying to blend the old “big data” world with the new one in AI/ML is software-defined storage company Quobyte, which today described a new driver it has developed for Hadoop. The driver takes aim at HDFS, the source of limitations when it comes to implementing streaming or near-real-time analytics, where a host of new tools are available for ML on small, fast-moving data.
Quobyte, if you’ll recall, was founded by former Google infrastructure folks and two of the founders of the open source XtreemFS file system, which made its appearance in 2007.
To put Quobyte into perspective, the team has built its own drivers that hook into HDFS. One feature that set XtreemFS apart, as early users might remember, was the ability for users to control data placement. This is also an attractive feature for existing users of Ceph, who need a capability that uses metadata, policies, and other information to determine the best placement of data stored for Hadoop workloads.
Taking aim at the small-file limitations in HDFS, Quobyte’s new native driver for Hadoop is focused on mixed workloads without the heavy task of introducing a new file system.
Hadoop is not inherently bad, slow, expensive, or inflexible, of course. It’s possible to keep all the scale-out benefits that come with it and maintain the locality and sharing capabilities with other interfaces. Quobyte architected its Hadoop driver to work through the HDFS API, so storage looks the same to the system even when it is talking to Quobyte instead of HDFS. That also means application changes aren’t required.
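The idea that an application keeps working when the storage behind a common file system interface is swapped can be illustrated with a small, self-contained Python sketch. This is not Quobyte’s driver or the Hadoop API: the `FileSystem` class, the `altfs` scheme, the registry, and the in-memory backends are all hypothetical stand-ins chosen for illustration.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class FileSystem(ABC):
    """Hypothetical interface the application codes against."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class InMemoryBackend(FileSystem):
    """Toy backend; a real driver would talk to a storage cluster instead."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._files: dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def read(self, path: str) -> bytes:
        return self._files[path]


# Hypothetical registry: the URI scheme decides which backend serves a path.
REGISTRY: dict[str, FileSystem] = {
    "hdfs": InMemoryBackend("hdfs-like"),
    "altfs": InMemoryBackend("alternative-storage"),
}


def get_filesystem(uri: str) -> FileSystem:
    """Look up the backend registered for the URI's scheme."""
    return REGISTRY[urlparse(uri).scheme]


def count_lines(uri: str) -> int:
    """Application code: identical no matter which backend holds the file."""
    fs = get_filesystem(uri)
    return len(fs.read(urlparse(uri).path).splitlines())


if __name__ == "__main__":
    payload = b"event-1\nevent-2\nevent-3\n"
    for uri in ("hdfs://cluster/data/events.log", "altfs://cluster/data/events.log"):
        get_filesystem(uri).write(urlparse(uri).path, payload)
        print(uri, count_lines(uri))
```

Only the URI changes between the two calls; `count_lines` itself is untouched, which is the property the article describes when it says application changes aren’t required.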
The driver they built is more like a plugin that implements the Quobyte client and all dependencies, allowing applications to talk directly to the driver. That chatter gets translated to Quobyte’s registry and metadata backend with all talk happening via TCP.
“Today’s analytics solutions allow enterprises to extract important insight from large volumes of data, but with the increasing prevalence of AI and machine learning in data analytics applications HDFS’s batch processing limitations have been exposed,” said Björn Kolbeck, CEO of Quobyte. “By deploying Quobyte’s native Hadoop/HDFS driver, enterprises can now seamlessly share large amounts of file data with high performance across Hadoop/analytics, machine learning, and any Linux or Windows application.”
“With this announcement, Quobyte and Hadoop bring together the ability to run on any distributed commodity x86 hardware, which unlocks the benefits of Hadoop while removing the storage limitations of HDFS and NFS, ensuring these modern workloads can run efficiently on Hadoop clusters,” Kolbeck adds.