Feature stores have become a frequent topic of discussion in the machine learning community. They were first developed by the Uber Michelangelo team to support the deployment of thousands of machine learning models in production. Today, several open source and commercial feature stores have emerged, making the technology accessible to every organization. What is this technology, and why is the industry investing in it? Below, Mike Del Balso from Tecton and Willem Pienaar from Feast answer our questions and explain why feature stores are key to building machine learning models and deploying them to production to power new applications.
insideBIGDATA: What are features and why are they so important?
Mike/Willem: Essentially, features are the backbone of any ML application. A feature is data that serves as a predictive input signal to a model. Features are derived from transforming all kinds of raw data, from real-time streaming data to batch historical data. For example, let’s say a food delivery service wanted to show an expected delivery time in their app. One useful feature might be the distance from the restaurant to the delivery address. Another might be the number of incoming orders the restaurant received in the past 30 minutes.
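As a rough illustration of the two example features above, here is a minimal Python sketch. It is not taken from the interview: the function names, the dictionary keys, and the choice of great-circle (haversine) distance are all assumptions made for the example, and a real delivery service might well use road distance instead.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def orders_in_window(order_times, now, window=timedelta(minutes=30)):
    """Count orders whose timestamp falls in the interval (now - window, now]."""
    start = now - window
    return sum(1 for t in order_times if start < t <= now)


def build_features(restaurant_loc, delivery_loc, order_times, now):
    """Turn raw data into the two hypothetical model inputs (names are illustrative)."""
    return {
        "distance_km": haversine_km(*restaurant_loc, *delivery_loc),
        "orders_last_30m": orders_in_window(order_times, now),
    }
```

The first feature depends only on static locations, while the second has to be recomputed as new orders arrive, which is why fresh streaming data matters for features of this kind.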
insideBIGDATA: What is a feature store?
Mike/Willem: A feature store is a data system specific to machine learning that acts as the central hub for features across an ML project’s lifecycle. It operates the data pipelines that generate feature values, and serves those values for training and inference. It enables data scientists to build new features collaboratively, and deploy them to production quickly and reliably. In short, it brings DevOps-like principles to ML data.
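To make the idea of serving the same feature values for both training and inference more concrete, here is a toy in-memory sketch in Python. It is only an illustration under stated assumptions: the class `ToyFeatureStore` and its methods are hypothetical, they do not reflect the API of Tecton, Feast, or Michelangelo, and timestamps are plain integers for simplicity.

```python
from bisect import bisect_right
from collections import defaultdict


class ToyFeatureStore:
    """Hypothetical in-memory store keeping a timestamped history per feature and entity."""

    def __init__(self):
        # (feature_name, entity_id) -> list of (timestamp, value), kept sorted by timestamp
        self._history = defaultdict(list)

    def write(self, feature, entity_id, timestamp, value):
        """Record a feature value produced by some upstream pipeline."""
        rows = self._history[(feature, entity_id)]
        rows.append((timestamp, value))
        rows.sort(key=lambda row: row[0])

    def get_online(self, feature, entity_id):
        """Inference path: return the latest known value, or None if there is none."""
        rows = self._history.get((feature, entity_id))
        return rows[-1][1] if rows else None

    def get_as_of(self, feature, entity_id, timestamp):
        """Training path: return the value that was current at the given timestamp."""
        rows = self._history.get((feature, entity_id), [])
        idx = bisect_right([t for t, _ in rows], timestamp)
        return rows[idx - 1][1] if idx else None
```

A usage example: after `store.write("orders_last_30m", "r1", 100, 4)` and `store.write("orders_last_30m", "r1", 200, 7)`, an online lookup returns the freshest value, while `store.get_as_of("orders_last_30m", "r1", 150)` returns the value a model would have seen at time 150. Production systems typically split these two paths across separate offline and online storage; the single dictionary here is a simplification.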
insideBIGDATA: How did the concept of the feature store originate?
Mike/Willem: The industry’s first real feature store was built by the Michelangelo team at Uber. When I [Mike] first joined Uber, it was incredibly hard to get ML models to production. Getting a single model to production required complex coordination between data scientists, data engineers, ML engineers, and DevOps teams.
My team, Michelangelo, was tasked with building ML infrastructure to simplify this process of getting ML to production. We started off by focusing on models, but even after we implemented a platform for data scientists to more easily train, validate, and serve models in production, we were still having trouble. We realized that the main bottleneck was the data, and specifically building and deploying features.
insideBIGDATA: What is operational ML?
Mike/Willem: Operational ML is about running ML models in production to generate predictions in real time and to power production applications. Organizations use operational ML to build a new class of applications that deliver new customer experiences and automate business processes. Operational ML enables countless new use cases including personalized product recommendations, dynamic pricing, real-time insurance underwriting, and inventory optimization.
insideBIGDATA: Why is it so hard to build and deploy features?
Mike/Willem: For all the promise of operational ML, it is hard to do at scale. When building traditional apps, engineering teams only need to build and deploy applications. In the world of operational ML, enterprises have to deploy apps, models, and features to production.
Most enterprises can build and deploy apps efficiently. That’s the result of decades of improvement in software engineering tools and processes, culminating in today’s modern DevOps practices. But we don’t have decades of experience getting models and features to production, and we don’t have DevOps-like tooling and processes for ML. Up to now, analytics has mostly been limited to generating insights for offline human consumption. The majority of data scientists are building dashboards and offline predictions, not building systems that generate predictions with mission-critical, production SLAs.
It’s getting easier to get models to production with emerging MLOps platforms like Kubeflow. But we’re still lacking proper tooling to get features to production, and that was the motivation to build a feature store at Uber.
insideBIGDATA: What does a feature store enable data scientists to do?
Mike/Willem: Feature stores bring DevOps-like capabilities to the feature lifecycle. They enable data scientists to build a library of features collaboratively using batch, streaming, and real-time data. Data scientists can instantly serve their feature data online, without depending on another team to reimplement production pipelines. Data scientists can search and discover existing features to maximize reuse across models.
insideBIGDATA: Are all feature stores the same? What kinds of variations should we be aware of?
Mike/Willem: We’re starting to see convergence on the definition of a feature store. But there are significant differences between individual products in the feature store category. Users should educate themselves prior to selecting a specific feature store.
First, a feature store should manage the complete lifecycle of features, from transformations to online serving. More basic products only store and serve feature values, and don’t manage the transformations that generate those values. In other words, they provide a single source of truth for data, but they don’t simplify the process of building new features. Data scientists still rely on data engineering teams to manually build bespoke production pipelines.
Second, feature stores should be able to build features from batch, streaming, and real-time data. This matters because training needs historical context, while real-time inference needs fresh feature values. Some products can handle only batch and/or streaming data sources, not real-time data.
Third, feature stores should be enterprise-ready with built-in security and monitoring. And they should integrate easily with a variety of data sources and MLOps platforms.
insideBIGDATA: How does a feature store fit into the complete stack for operational ML?
Mike/Willem: It’s an exciting time in MLOps, and as operational ML stacks are still taking shape, the canonical stack doesn’t exist yet. What’s clear is that teams building machine learning to power live end-user products and experiences are moving away from monolithic ML platforms, and treating ML more like software development. This means incorporating a collection of best-in-class tools that work together to enable powerful workflows.
It will be fascinating to watch how the stack for operational ML evolves in the future. But there’s no doubt that organizations will benefit tremendously from having access to more advanced tooling that helps them get ML to production. Ultimately, organizations will build more ML-powered applications to deliver new customer experiences and automate business processes.
About the Interviewees
Mike Del Balso is Co-Founder and CEO of Tecton. Mike is focused on building next-generation data infrastructure for operational ML. Before Tecton, he was the PM lead for the Uber Michelangelo ML platform. He was also a product manager at Google where he managed the core ML systems that power Google’s Search Ads business. Before that, he worked on Google Maps. He holds a BSc in Electrical and Computer Engineering summa cum laude from the University of Toronto.
Willem Pienaar leads the Data Science Platform team at Gojek, developing the Gojek ML platform, which supports a wide variety of models and handles more than 100 million orders every month. His main focus areas are building data and ML platforms, allowing organizations to scale machine learning and drive decision making. In a previous life, he founded and sold a networking startup.