How sparsification and quantization build leaner AI

Artificial Intelligence (AI) and Machine Learning (ML) are rarely out of the news. Technology vendors are busy jostling for position in the AI-ML marketplace, all keen to explain how their approach to automation can speed everything from predictive maintenance for industrial machinery to knowing what day consumers are most likely to add vegan sausages to their online shopping orders.

Much of the debate around AI itself concerns the resultant software tooling that tech vendors bring to market. We want to know more about how so-called ‘explainable’ AI actually functions and what those advancements can do for us. A key part of that explainability concentrates on AI bias and the need to ensure unconscious (or perhaps semiconscious) human thinking is not programmed into the systems we are creating.

An energy-intensive process

Some of the advancement in AI and ML has been hindered by the energy-intensive process of training and deploying the engines that run these models. This has led some organizations to rethink how they source the energy to power their platforms, with companies including AWS making the whole cloud AI environment issue a flag-waving exercise in and of itself.

Industry analysts have not been upbeat about the prospects here; many think that with each new advancement in AI hardware comes an exponential increase in the energy needed to train and run the AI, furthering the environmental impact.

Dr. Eli David is co-founder and CTO of Tel Aviv-headquartered DeepCube. A self-styled research pioneer in deep learning and neural networks, David has focused his work on advancing deep learning through supplemental software.

He created DeepCube, a software-based inference accelerator that can be deployed on top of existing hardware to restructure deep learning models throughout their training phase. Early results have shown a 10x decrease in model size, which critically decreases the computational power needed to run the model in real-world environments.

Sparsification cleans neural brain patterns

Dr. David and team say that, specifically, DeepCube’s proprietary technology mimics the human brain, which undergoes pruning during its initial training period, while it is most amenable to sparsification, i.e. quite literally the act of making something more sparse.

“Similar to the early stages of human brain development, our deep learning model experiences a mass intake of data at its earliest stages. But as training occurs, neural connections become stronger with each learned action and adapt to support continuous learning. As each connection becomes stronger, redundancies are created and overlapping connections can be removed. This is why continuously restructuring and sparsifying deep learning models during training time (and not after training is complete) is necessary,” said David.
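The idea David describes — removing weak connections progressively while training continues, so the surviving connections adapt to cover for the removed ones — can be illustrated with a minimal numpy sketch. This is a generic magnitude-pruning toy, not DeepCube's actual method; the sparsity schedule and the dummy gradient step are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights until `sparsity`
    (the fraction of zeros) is reached; return weights and mask."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights, np.ones(weights.shape, dtype=bool)
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Simulate pruning *during* training: sparsity ramps up each "epoch",
# and the surviving weights keep adapting (here, a dummy update step).
weights = rng.normal(size=(64, 64))
for sparsity in [0.25, 0.5, 0.75, 0.9]:
    weights, mask = magnitude_prune(weights, sparsity)
    # Remaining connections continue to learn; pruned ones stay at zero.
    fake_update = rng.normal(scale=0.01, size=weights.shape)
    weights = (weights - fake_update) * mask

print(f"final sparsity: {np.mean(weights == 0):.2f}")
```

Because already-pruned weights sit at exactly zero, each later pruning pass removes them first and then trims the next-weakest survivors, which is what makes the gradual in-training schedule different from a single post-training cut.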

After the AI training stage, the DeepCube software engineering team points out, the model has lost a significant amount of its ‘plasticity’. This means that neural connections cannot adapt to take over additional responsibility, which in turn means that removing connections can result in decreased accuracy.

DeepCube performs ‘optimized pruning’ on AI models to bring sparsification to bear, so that each model runs with a minimal compute and power footprint. During training, the hardware that will eventually run the model is factored into the training process, ensuring the optimization is specific and targeted rather than naive and generic. As a result, each ML workload consumes a minimal power and energy budget.

Prudent pruning provides peppier AI power

“Current methods, where attempts are made to make the deep learning model smaller post-training phase, have reportedly seen some success. However, if you prune in the earlier stages of training when the model is most receptive to restructuring and adapting, you can drastically improve results. When you conduct sparsification during training, the connections are still in the rapid learning stage and can be trained to take over the functions of removed connections,” said David.

The DeepCube team say they are now working to build more efficient AI by continuously monitoring pruning progress, mitigating any damage to output accuracy while the model is at its greatest plasticity.

The resulting AI model can therefore be lightweight, with significant speed improvement and memory reduction, which could allow for efficient deployment on intelligent edge devices (e.g. mobile devices, drones, agricultural machines, preventative maintenance systems and the like). David and team insist this approach can be key to making devices smart while reducing their environmental impact, allowing machines to make truly autonomous decisions without raising the planet’s temperature.

By decreasing the size of the model by 85-90% on average and increasing the speed by 10x, this use of sparsification allows AI models to run with less power consumption and, in theory at least, less environmental impact. This approach also allows more AI to be deployed in a smaller physical computing space, so where Internet of Things (IoT) ‘edge’ devices need to be able to function with additional smartness, this is good news.

Daniel Warner, CEO at edge-focused AI company LGN, is understandably upbeat about the need for IoT device AI to make decisions more quickly and more accurately. Think of a car driving on the motorway, he says: it doesn’t have time to connect to a datacenter before deciding whether to brake or not.

“Sparsification and simplification of models is critical to this process in two ways. Firstly, it makes smaller models that can be run on edge devices that naturally have constrained computing power — as opposed to big powerful machines in the cloud. Secondly, it allows models to make decisions faster, shaving precious milliseconds off exactly the type of time-sensitive decisions they’re trying to make. In the case of autonomous driving, those milliseconds equate directly to safety and lives saved,” said LGN’s Warner.

Sparsification plus quantization

Jason Knight, co-founder and chief product officer at machine learning modelling specialist OctoML, agrees that performance and size are especially important at the computing edge, where leveraging sparsity (through sparsification) along with quantization can reduce the bill of materials cost for a System on a Chip (SoC) by as much as a fifth.

Quantization, for the record, is the process in mathematics and digital signal processing of mapping input values from a large (often continuous) set to output values in a smaller, countable set. For Knight and his team, model performance and size are two large obstacles to model deployment both in the cloud and on the edge… and sparsification plus quantization can both help.
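That mapping is easiest to see with numbers. The sketch below shows one common, simple scheme — symmetric affine quantization of 32-bit floats to 8-bit integers, with the scale derived from the observed value range. The scheme and function names are illustrative assumptions, not OctoML's implementation.

```python
import numpy as np

def quantize_int8(x):
    """Map float values onto the int8 grid [-127, 127] using a
    single scale factor derived from the largest absolute value."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 0.25, 0.9, -0.9], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
print("int8 codes:", q)
print("max round-trip error:", np.abs(x - x_hat).max())
```

Each value now occupies one byte instead of four, at the cost of a small, bounded rounding error (at most half the scale step per value) — the size/precision trade Knight describes.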

“Network pruning to increase sparsity is a powerful way of reducing model size, and coupled with the right compiler and computational engine, we’ve seen significant performance improvements on end-to-end workloads,” said OctoML’s Knight. “Quantization helps transform deep learning models to use parameters and computations at a lower precision, and is a technique that pairs well with pruning to reduce size and increase performance. Reducing 32-bit floating point numbers to 8-bit integers, and even all the way to single-bit operations, enables another round of significant savings.”

This entire discussion points to a lower, substrate-level technology that many of us might not immediately consider the leading or key issue in AI today. But as we have seen so many times before, it is sometimes the ‘ingredient element’ that makes any product or service what it really is. After all, the PC revolution hinged on microprocessor speeds and diet colas only exist thanks to NutraSweet, right? DeepCube was actually acquired by Additively Manufactured Electronics (AME) company Nano Dimension during the writing of this analysis. Sparsification and quantization could be where the smart money is for the next chapter in AI.