In my August 2020 article, “How to choose a cloud machine learning platform,” my first guideline for choosing a platform was, “Be close to your data.” Keeping the code near the data is necessary to keep the latency low, since the speed of light limits transmission speeds. After all, machine learning — especially deep learning — tends to go through all your data multiple times (each time through is called an epoch).
I said at the time that the ideal case for very large data sets is to build the model where the data already resides, so that no mass data transmission is needed. Several databases support that to a limited extent. The natural next question is, which databases support internal machine learning, and how do they do it? I’ll discuss those databases in alphabetical order.
Amazon Redshift is a managed, petabyte-scale data warehouse service designed to make it simple and cost-effective to analyze all of your data using your existing business intelligence tools. It is optimized for datasets ranging from a few hundred gigabytes to a petabyte or more and costs less than $1,000 per terabyte per year.
Amazon Redshift ML is designed to make it easy for SQL users to create, train, and deploy machine learning models using SQL commands. The CREATE MODEL command in Redshift SQL defines the data to use for training and the target column, then passes the data to Amazon SageMaker Autopilot for training via an encrypted Amazon S3 bucket in the same zone.
After AutoML training, Redshift ML compiles the best model and registers it as a prediction SQL function in your Redshift cluster. You can then invoke the model for inference by calling the prediction function inside a SELECT statement.
Summary: Redshift ML uses SageMaker Autopilot to automatically create prediction models from the data you specify via a SQL statement, which is extracted to an S3 bucket. The best prediction function found is registered in the Redshift cluster.
BlazingSQL is a GPU-accelerated SQL engine built on top of the RAPIDS ecosystem; it exists as an open-source project and a paid service. RAPIDS is a suite of open source software libraries and APIs, incubated by Nvidia, that uses CUDA and is based on the Apache Arrow columnar memory format. CuDF, part of RAPIDS, is a Pandas-like GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
Dask is an open-source tool that can scale Python packages to multiple machines. Dask can distribute data and computation over multiple GPUs, either in the same system or in a multi-node cluster. Dask integrates with RAPIDS cuDF, XGBoost, and RAPIDS cuML for GPU-accelerated data analytics and machine learning.
Summary: BlazingSQL can run GPU-accelerated queries on data lakes in Amazon S3, pass the resulting DataFrames to cuDF for data manipulation, and finally perform machine learning with RAPIDS XGBoost and cuML, and deep learning with PyTorch and TensorFlow.
Google Cloud BigQuery
BigQuery is Google Cloud’s managed, petabyte-scale data warehouse that lets you run analytics over vast amounts of data in near real time. BigQuery ML lets you create and execute machine learning models in BigQuery using SQL queries.
BigQuery ML supports linear regression for forecasting; binary and multi-class logistic regression for classification; K-means clustering for data segmentation; matrix factorization for creating product recommendation systems; time series for performing time-series forecasts, including anomalies, seasonality, and holidays; XGBoost classification and regression models; TensorFlow-based deep neural networks for classification and regression models; AutoML Tables; and TensorFlow model importing. You can use a model with data from multiple BigQuery datasets for training and for prediction. BigQuery ML does not extract the data from the data warehouse. You can perform feature engineering with BigQuery ML by using the TRANSFORM clause in your CREATE MODEL statement.
Summary: BigQuery ML brings much of the power of Google Cloud Machine Learning into the BigQuery data warehouse with SQL syntax, without extracting the data from the data warehouse.
IBM Db2 Warehouse
IBM Db2 Warehouse on Cloud is a managed public cloud service. You can also set up IBM Db2 Warehouse on premises with your own hardware or in a private cloud. As a data warehouse, it includes features such as in-memory data processing and columnar tables for online analytical processing. Its Netezza technology provides a robust set of analytics that are designed to efficiently bring the query to the data. A range of libraries and functions help you get to the precise insight you need.
Db2 Warehouse supports in-database machine learning in Python, R, and SQL. The IDAX module contains analytical stored procedures, including analysis of variance, association rules, data transformation, decision trees, diagnostic measures, discretization and moments, K-means clustering, k-nearest neighbors, linear regression, metadata management, naïve Bayes classification, principal component analysis, probability distributions, random sampling, regression trees, sequential patterns and rules, and both parametric and non-parametric statistics.
Summary: IBM Db2 Warehouse includes a wide set of in-database SQL analytics that includes some basic machine learning functionality, plus in-database support for R and Python.
Kinetica Streaming Data Warehouse combines historical and streaming data analysis with location intelligence and AI in a single platform, all accessible via API and SQL. Kinetica is a very fast, distributed, columnar, memory-first, GPU-accelerated database with filtering, visualization, and aggregation functionality.
Kinetica integrates machine learning models and algorithms with your data for real-time predictive analytics at scale. It allows you to streamline your data pipelines and the lifecycle of your analytics, machine learning models, and data engineering, and calculate features with streaming. Kinetica provides a full lifecycle solution for machine learning accelerated by GPUs: managed Jupyter notebooks, model training via RAPIDS, and automated model deployment and inferencing in the Kinetica platform.
Summary: Kinetica provides a full in-database lifecycle solution for machine learning accelerated by GPUs, and can calculate features from streaming data.
Microsoft SQL Server
Microsoft SQL Server Machine Learning Services supports R, Python, Java, the PREDICT T-SQL command, and the rx_Predict stored procedure in the SQL Server RDBMS, and SparkML in SQL Server Big Data Clusters. In the R and Python languages, Microsoft includes several packages and libraries for machine learning. You can store your trained models in the database or externally. Azure SQL Managed Instance supports Machine Learning Services for Python and R as a preview.
Microsoft R has extensions that allow it to process data from disk as well as in memory. SQL Server provides an extension framework so that R, Python, and Java code can use SQL Server data and functions. SQL Server Big Data Clusters run SQL Server, Spark, and HDFS in Kubernetes. When SQL Server calls Python code, it can in turn invoke Azure Machine Learning, and save the resulting model in the database for use in predictions.
Summary: Current versions of SQL Server can train and infer machine learning models in multiple programming languages.
Oracle Cloud Infrastructure (OCI) Data Science is a managed and serverless platform for data science teams to build, train, and manage machine learning models using Oracle Cloud Infrastructure. It includes Python-centric tools, libraries, and packages developed by the open source community and the Oracle Accelerated Data Science (ADS) Library, which supports the end-to-end lifecycle of predictive models:
Data acquisition, profiling, preparation, and visualization
Model training (including Oracle AutoML)
Model evaluation, explanation, and interpretation (including Oracle MLX)
Model deployment to Oracle Functions
OCI Data Science integrates with the rest of the Oracle Cloud Infrastructure stack, including Functions, Data Flow, Autonomous Data Warehouse, and Object Storage.
Models currently supported include:
ADS also supports machine learning explainability (MLX).
Summary: Oracle Cloud Infrastructure can host data science resources integrated with its data warehouse, object store, and functions, allowing for a full model development lifecycle.
Vertica Analytics Platform is a scalable columnar storage data warehouse. It runs in two modes: Enterprise, which stores data locally in the file system of nodes that make up the database, and EON, which stores data communally for all compute nodes.
Vertica uses massively parallel processing to handle petabytes of data, and does its internal machine learning with data parallelism. It has eight built-in algorithms for data preparation, three regression algorithms, four classification algorithms, two clustering algorithms, several model management functions, and the ability to import TensorFlow and PMML models trained elsewhere. Once you have fit or imported a model, you can use it for prediction. Vertica also allows user-defined extensions programmed in C++, Java, Python, or R. You use SQL syntax for both training and inference.
Summary: Vertica has a nice set of machine learning algorithms built-in, and can import TensorFlow and PMML models. It can do prediction from imported models as well as its own models.
All eight of these databases support doing machine learning internally. The exact mechanism varies, and some are more capable than others. If you have so much data that you might otherwise have to fit models on a sampled subset, however, then any of these databases might help you to build models from the full dataset without incurring serious overhead for data export.