This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.
A new neural network architecture designed by artificial intelligence researchers at DarwinAI and the University of Waterloo will make it possible to perform image segmentation on computing devices with limited power and compute capacity.
Segmentation is the process of determining the boundaries and areas of objects in images. We humans perform segmentation without conscious effort, but it remains a key challenge for machine learning systems. It is vital to the functionality of mobile robots, self-driving cars, and other artificial intelligence systems that must interact with and navigate the real world.
Until recently, segmentation required large, compute-intensive neural networks. This made it difficult to run these deep learning models without a connection to cloud servers.
In their latest work, the scientists at DarwinAI and the University of Waterloo have managed to create a neural network that provides near-optimal segmentation and is small enough to fit on resource-constrained devices. Called AttendSeg, the neural network is detailed in a paper that has been accepted at this year’s Conference on Computer Vision and Pattern Recognition (CVPR).
Object classification, detection, and segmentation
One of the key reasons for the growing interest in machine learning systems is the problems they can solve in computer vision. Some of the most common applications of machine learning in computer vision include image classification, object detection, and segmentation.
Image classification determines whether a certain type of object is present in an image or not. Object detection takes image classification one step further and provides the bounding box where detected objects are located.
Segmentation comes in two flavors: semantic segmentation and instance segmentation. Semantic segmentation specifies the object class of each pixel in an input image. Instance segmentation goes further and separates individual instances of each type of object. For practical purposes, the output of segmentation networks is usually presented by coloring pixels. Segmentation is by far the most complex of these three tasks.
Image classification vs object detection vs semantic segmentation (credit: codebasics)
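To make the difference between the two flavors concrete, here is a minimal sketch in plain Python. The tiny label map, the class IDs, and the `label_instances` helper are all hypothetical and only illustrate what the two outputs look like. Grouping connected pixels is a stand-in for illustration: real instance segmentation networks predict instances directly and can separate objects that touch.

```python
# Semantic map: one class ID per pixel (0 = background, 1 = "car"; IDs are made up).
semantic = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1],
]

def label_instances(semantic, target_class):
    """Give each 4-connected blob of target_class its own instance ID (1, 2, ...)."""
    rows, cols = len(semantic), len(semantic[0])
    instances = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if semantic[r][c] != target_class or instances[r][c] != 0:
                continue
            # Unvisited pixel of the target class: start a new instance and flood-fill it.
            next_id += 1
            instances[r][c] = next_id
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and semantic[ny][nx] == target_class
                            and instances[ny][nx] == 0):
                        instances[ny][nx] = next_id
                        stack.append((ny, nx))
    return instances, next_id

instances, count = label_instances(semantic, target_class=1)
for row in instances:
    print(row)
print("instances found:", count)
```

The semantic map only says which pixels are "car," while the instance map also says which car each pixel belongs to.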
The complexity of convolutional neural networks (CNN), the deep learning architecture commonly used in computer vision tasks, is usually measured in the number of parameters they have. The more parameters a neural network has, the more memory and computational power it will require.
RefineNet, a popular semantic segmentation neural network, contains more than 85 million parameters. At 4 bytes per parameter, an application using RefineNet requires at least 340 megabytes of memory just to run the neural network. And because the performance of neural networks largely depends on hardware that can perform fast matrix multiplications, the model must be loaded on the graphics card or some other parallel computing unit, where memory is scarcer than the computer’s RAM.
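The arithmetic behind that figure is easy to reproduce. The short sketch below uses a hypothetical helper, `model_memory_mb`, and counts only the weights, so it gives a lower bound; activations and framework overhead would add to it.

```python
def model_memory_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate memory needed to hold the weights, in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1_000_000

# "More than 85 million" parameters, so 85 million is a lower bound.
refinenet_params = 85_000_000
refinenet_mb = model_memory_mb(refinenet_params)  # 32-bit floats: 4 bytes each
print(f"RefineNet weights: at least {refinenet_mb:.0f} MB")
```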
Machine learning for edge devices
Due to their hardware requirements, most applications of image segmentation need an internet connection to send images to a cloud server that can run large deep learning models. The cloud connection can pose additional limits on where image segmentation can be used. For instance, if a drone or robot operates in environments where there’s no internet connection, performing image segmentation becomes a challenge. In other domains, AI agents work in sensitive environments, and sending images to the cloud is subject to privacy and security constraints. The lag caused by the roundtrip to the cloud can be prohibitive in applications that require real-time response from the machine learning models. And it is worth noting that network hardware itself consumes a lot of power, and sending a constant stream of images to the cloud can be taxing for battery-powered devices.
For all these reasons (and a few more), edge AI and tiny machine learning (TinyML) have become hot areas of interest and research both in academia and in the applied AI sector. The goal of TinyML is to create machine learning models that can run on memory- and power-constrained devices without the need for a connection to the cloud.
The architecture of AttendSeg on-device semantic segmentation neural network
With AttendSeg, the researchers at DarwinAI and the University of Waterloo tried to address the challenges of on-device semantic segmentation.
“The idea for AttendSeg was driven by both our desire to advance the field of TinyML and market needs that we have seen as DarwinAI,” Alexander Wong, co-founder at DarwinAI and Associate Professor at the University of Waterloo, told TechTalks. “There are numerous industrial applications for highly efficient edge-ready segmentation approaches, and that’s the kind of feedback along with market needs that I see that drives such research.”
The paper describes AttendSeg as “a low-precision, highly compact deep semantic segmentation network tailored for TinyML applications.”
The AttendSeg deep learning model performs semantic segmentation at an accuracy that is almost on par with RefineNet while cutting down the number of parameters to 1.19 million. Interestingly, the researchers also found that lowering the precision of the parameters from 32 bits (4 bytes) to 8 bits (1 byte) did not result in a significant performance penalty while enabling them to shrink the memory footprint of AttendSeg by a factor of four. The model requires little more than one megabyte of memory, which is small enough to fit on most edge devices.
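The numbers above can be checked with a few lines of Python. The first part reproduces the footprint arithmetic. The second part sketches one common way of storing weights in 8 bits, symmetric linear quantization. That scheme is an assumption made here for illustration, since the article does not describe how AttendSeg's low-precision weights are produced, and the helper names and sample weights are hypothetical.

```python
attendseg_params = 1_190_000  # 1.19 million parameters, per the article

fp32_mb = attendseg_params * 4 / 1_000_000  # 32-bit floats: 4 bytes each
int8_mb = attendseg_params * 1 / 1_000_000  # 8-bit integers: 1 byte each
print(f"32-bit: {fp32_mb:.2f} MB, 8-bit: {int8_mb:.2f} MB")

def quantize_int8(weights):
    """Symmetric linear quantization (assumed scheme): floats -> integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the 8-bit integers back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]  # made-up sample weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, f"max rounding error: {max_error:.5f}")
```

Each restored weight differs from the original by at most half a quantization step, which gives some intuition for why a network can tolerate lower precision.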
“[8-bit parameters] do not pose a limit in terms of generalizability of the network based on our experiments, and illustrate that low precision representation can be quite beneficial in such cases (you only have to use as much precision as needed),” Wong said.
Experiments show AttendSeg provides near-optimal semantic segmentation while cutting down the number of parameters and memory footprint.
Attention condensers for computer vision
AttendSeg leverages “attention condensers” to reduce model size without compromising performance. Self-attention mechanisms are a family of techniques that improve the efficiency of neural networks by focusing on the information that matters. Self-attention techniques have been a boon to the field of natural language processing. They have been a defining factor in the success of deep learning architectures such as Transformers. While previous architectures such as recurrent neural networks had limited capacity for handling long sequences of data, Transformers used self-attention mechanisms to expand their range. Deep learning models such as GPT-3 leverage Transformers and self-attention to churn out long strings of text that (at least superficially) maintain coherence over long spans.
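For readers who want to see what "focusing on the information that matters" looks like in practice, here is a minimal sketch of scaled dot-product self-attention, the form used in Transformers. It is not an attention condenser. To stay short, it skips the learned query, key, and value projections a real model would have and uses the raw input vectors for all three. The sample vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """tokens: list of equal-length vectors. Returns (outputs, attention weights)."""
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Output is a weighted average of all tokens.
        mixed = [sum(weights[j] * tokens[j][i] for j in range(len(tokens)))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # made-up input vectors
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one token draws on every other token. The first token puts more weight on the similar second token than on the dissimilar third.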
AI researchers have also leveraged attention mechanisms to improve the performance of convolutional neural networks. Last year, Wong and his colleagues introduced attention condensers as a very resource-efficient attention mechanism and applied them to image classifier machine learning models.
“[Attention condensers] allow for very compact deep neural network architectures that can still achieve high performance, making them very well suited for edge/TinyML applications,” Wong said.
Attention condensers improve the performance of convolutional neural networks in a memory-efficient way
Machine-driven design of neural networks
One of the key challenges of designing TinyML neural networks is finding the best-performing architecture while also adhering to the computational budget of the target device.
To address this challenge, the researchers used “generative synthesis,” a machine learning technique that creates neural network architectures based on specified goals and constraints. Basically, instead of manually fiddling with all kinds of configurations and architectures, the researchers provide a problem space to the machine learning model and let it discover the best combination.
“The machine-driven design process leveraged here (Generative Synthesis) requires the human to provide an initial design prototype and human-specified desired operational requirements (e.g., size, accuracy, etc.) and the MD design process takes over in learning from it and generating the optimal architecture design tailored around the operational requirements and task and data at hand,” Wong said.
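The general idea of searching a design space under a constraint can be shown with a toy example. The sketch below is not Generative Synthesis, whose internals the article does not describe. It is a brute-force search over a hypothetical space of three-layer convolutional stacks, and it uses parameter count as a crude placeholder for quality; a real search would train and evaluate candidates against accuracy and other operational requirements.

```python
from itertools import product

def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one standard 2D convolution layer."""
    return in_channels * out_channels * kernel * kernel + out_channels

def count_params(widths, in_channels=3):
    """Total parameters of a plain stack of 3x3 convolutions with the given widths."""
    total = 0
    for out_channels in widths:
        total += conv_params(in_channels, out_channels)
        in_channels = out_channels
    return total

def search(width_options, depth, budget):
    """Return the candidate with the most parameters that still fits the budget.

    Parameter count is only a placeholder score for this illustration.
    """
    best, best_params = None, -1
    for widths in product(width_options, repeat=depth):
        n = count_params(widths)
        if best_params < n <= budget:
            best, best_params = widths, n
    return best, best_params

best, best_params = search(width_options=(8, 16, 32), depth=3, budget=10_000)
print(best, best_params)
```

Even this toy space has 27 candidates, which hints at why hand-tuning stops being practical once the space covers layer types, attention modules, and precision choices.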
For their experiments, the researchers used machine-driven design to tune AttendSeg for Nvidia Jetson, hardware kits for robotics and edge AI applications. But AttendSeg is not limited to Jetson.
“Essentially, the AttendSeg neural network will run fast on most edge hardware compared to previously proposed networks in literature,” Wong said. “However, if you want to generate an AttendSeg that is even more tailored for a particular piece of hardware, the machine-driven design exploration approach can be used to create a new highly customized network for it.”
AttendSeg has obvious applications for autonomous drones, robots, and vehicles, where semantic segmentation is a key requirement for navigation. But on-device segmentation can have many more applications.
“This type of highly compact, highly efficient segmentation neural network can be used for a wide variety of things, ranging from manufacturing applications (e.g., parts inspection / quality assessment, robotic control), medical applications (e.g., cell analysis, tumor segmentation), satellite remote sensing applications (e.g., land cover segmentation), and mobile application (e.g., human segmentation for augmented reality),” Wong said.