The demand for production-quality software for mining insights from datasets across scales has exploded in the last several years. The growing size of datasets throughout industry, government, and other fields has increased the need for scalable distributed machine learning solutions that can make full use of available hardware to analyze the largest datasets. This article is intended to provide a brief introduction to just a few of the many available tools for machine learning across scales. First, we will look at interactive tools suited for exploratory data analysis on a single workstation. Next, we will consider the development of machine learning pipelines for small-to-medium datasets on a single node. Finally, we will survey some of the solutions available for leveraging cluster resources for large-scale machine learning applications.
Getting Acquainted with Your Data: GUI Machine Learning Frameworks
For users who are new to machine learning, or for those who prefer an interactive interface for preliminary data exploration, GUI-based tools like Weka and Orange are great options for quickly getting acquainted with a dataset. Both packages have facilities for loading, sampling, transforming, and visualizing data, as well as for applying and evaluating supervised and unsupervised models.
Weka in particular has an impressive selection of algorithms, while Orange has an especially intuitive, elegant interface based on a directed network model. While these tools are not suited for production-scale processing of large datasets, they represent a convenient means of guiding early decision-making in building up machine learning pipelines.
Analyzing Small-to-Medium Datasets
When it comes time to develop a codified machine learning pipeline, for datasets that can be handled by a single node, it is hard to beat the Python-based scikit-learn package. The package is well maintained and was found in a 2018 Kaggle survey to be one of the most popular machine learning packages. It contains not only an extensive array of classification, regression, and clustering algorithms, but routines for data transformation and feature extraction (including document vectorization), parameter tuning/model selection, and model validation and evaluation.
The scikit-learn package has some limited multicore (single-node) parallel capabilities through joblib. Algorithms are implemented with different parallelism strategies, so digging into the documentation or code is a good idea before trying to operationalize a machine learning pipeline on many cores. Model selection routines like GridSearchCV provide one straightforward option for parallelism when tuning hyperparameters: farming out each training process to its own process/thread. One important caveat to this convenience is that if the parameters being searched include dataset size or algorithm type, this approach may lead to workload imbalance. When data set sizes or processing requirements necessitate multiple nodes, the Dask package provides a limited set of drop-in replacements for a handful of scikit-learn‘s routines that allows them to run in distributed memory fashion.
Scaling Up: CPU and GPU Parallelism in Deep Learning Application
For larger datasets and more sophisticated applications like deep learning, we will need something more powerful than the packages we have looked at so far. PyTorch is popular among deep learning frameworks for its relative user friendliness, its deep integration with Python, and its ability to leverage GPUs. The PyTorch framework is highly optimized, utilizing mature frameworks like Intel MKL and NVIDIA’s cuDNN library under the hood. Its high-level functionality provides a low barrier to entry for commercial-grade deep learning, making it an excellent go-to package.
The TensorFlow framework is considered to be one of the industry standards in deep learning. It is notoriously powerful, notoriously flexible, and notoriously complex, at least compared with packages like scikit-learn and PyTorch, which in general will both involve shorter development times. As with PyTorch, TensorFlow can run on CPUs or GPUs, and has some limited built-in capabilities for distributed computing.
Both TensorFlow and PyTorch can be distributed over a cluster using Uber’s MPI-based Horovod framework. Using Horovod will typically require slightly more than, say, scaling up scikit-learn models with Dask, but for deep learning applications that need the power of multiple nodes, Horovod provides a stable, well-supported solution. This makes the combination of PyTorch and Horovod an especially good choice for users wishing to minimize development time while still enabling cluster-scale learning.
MLlib: A Natively Distributed Machine Learning Framework
The Apache Spark distributed computing framework, used throughout industry, includes MLlib, a machine learning library that rivals scikit-learn in its extensive range of functionality. It can be used through several popular languages. Because Spark and several Spark-compatible schedulers like Hadoop are tightly integrated into cloud services like Google Cloud and Amazon Web Services, MLlib makes it convenient to conduct preliminary development of a machine learning pipeline on a single workstation, then rapidly transition to testing in the cloud before deploying at the cluster scale.
Here is a chart listing the pertinent features of the frameworks just discussed:
Native GPU support?
Native distributed computing?
Typical use case
Preliminary data analysis
Preliminary data analysis
No (needs, e.g., Dask)
Small-to-medium datasets; complex pipelines
No (needs, e.g., Horovod)
Deep learning; medium-to-large datasets
Google Brain Team
Python, C, third-party support for others
Limited (extensible with Horovod)
Deep learning; medium-to-large datasets
Java, Scala, Python
Medium-to-large datasets; complex pipelines
No single framework has fully satisfied the demand for fast-to-develop, fast-to-run machine learning in high-performance computing environments. Successfully deploying machine learning at the cluster scale requires a careful choice of software frameworks, as well as the use of computational strategies that are consistent with hardware, data, and business constraints. While there are many more machine learning frameworks available than are mentioned in this article, the frameworks mentioned here are well-supported and robust, and will help users to succeed in their machine learning applications, from single-core to whole-cluster scales.
About the author: Alex Richert, PhD, is a Senior Scientific Programmer/Analyst at RedLine Performance Solutions. Dr. Richert has more than 10 years of experience in scientific analysis, software development, and parallel computing. Since 2019, he has been working for RedLine to support operations on the National Weather Service’s Weather and Climate Operational Supercomputing System (WCOSS), as well as prepare for operations on its successor, WCOSS2.
From Amazon to Uber, Companies Are Adopting Ray
A ‘Glut’ of Innovation Spotted in Data Science and ML Platforms
The Maturation of Data Science