Accelerating Machine Learning Means New Hardware – Electronic Design

  • Lauren
  • May 15, 2020

Machine learning (ML) is only one aspect of artificial intelligence (AI). ML itself has many branches, but those having the biggest impact now are based on neural networks (NNs). Even drilling down this far doesn’t narrow the field much, given the multitude of variations and implementations. Some work well for certain types of applications like image recognition, while others handle natural language processing or even the modification and creation of artwork. There are deep neural networks (DNNs), convolutional neural networks (CNNs), and spiking neural networks (SNNs). Some are similar, while others use significantly different approaches and training techniques. All tend to require significantly more computing power than conventional algorithms, but the results make neural networks very useful.

Though ML applications can run on lowly microcontrollers, the scope of those applications is limited by the hardware. Turning to hardware tuned or designed for NNs lets designers implement far more ambitious applications, such as self-driving cars, which depend heavily on NNs for image recognition, sensor integration, and a host of other chores. Hardware acceleration is the only way to deliver high-performance ML solutions. A microcontroller without ML hardware may be able to run an ML application that monitors the motor it’s controlling to optimize performance or implement advanced diagnostics, but it falls short when trying to analyze video in real time. Processing larger images at a faster rate is just one ML chore that places heavy demands on a system.

A plethora of solutions are being developed and delivered that provide orders of magnitude more performance to address both training and deployment. In general, deployment demands less than training, but there are no absolutes when it comes to ML. This year’s Linley Spring Processor Conference was almost exclusively about AI and ML.
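To get a sense of the scale involved, a quick back-of-the-envelope count of the multiply-accumulate (MAC) operations in a single convolution layer shows why real-time video analysis is out of reach for an unassisted microcontroller. The layer shapes below are illustrative, not taken from any particular network:

```python
def conv2d_macs(h, w, c_in, c_out, k):
    """Multiply-accumulate operations for one k-x-k convolution layer
    over an h-x-w feature map (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

# Illustrative 224x224 RGB input, one modest 3x3 layer with 64 output channels
per_layer = conv2d_macs(224, 224, 3, 64, 3)        # ~86.7 million MACs
per_frame = per_layer * 20                         # assume ~20 such layers
per_second = per_frame * 30                        # 30 frames/s video

print(f"{per_second / 1e9:.0f} billion MACs/s")    # prints "52 billion MACs/s"
```

Even this toy estimate lands at tens of billions of MACs per second, which is why dedicated accelerators, not general-purpose MCU cores, carry these workloads.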
Most of the presentations addressed high-performance hardware solutions. While many will land in the data center, a host of others will wind up on “the edge” as embedded systems.

Wafer-Scale Integration Targets Machine Learning

New architectures are making ML platforms faster; still, there’s an insatiable need for more ML computing power. On the plus side, the work is ripe for parallel computing, and cloud-based solutions can network many chips to handle very large, or very many, ML models. One way to make each node more powerful is to put more into the compute package. This is what Cerebras Systems’ Wafer Scale Engine (WSE) does: It’s built from identical chips on a wafer, but the wafer isn’t diced into individual die (Fig. 1). Instead, the connections between chips remain, making the 46,225-mm² slab of silicon the largest complete computing device, with 1.2 trillion transistors implementing 400,000 AI-optimized cores. The die has 18 GB of memory with 9 petabytes per second (PB/s) of memory bandwidth. The fabric bandwidth is 100 petabits per second (Pb/s). The chip is manufactured by TSMC using its 16-nm process.

1. Shown is Cerebras’ Wafer Scale Engine (WSE) machine-learning solution (a). It’s designed to be used as is, not broken up into individual chips. Cerebras’ WSE needs a water-cooled system to keep it running without a meltdown (b). (Source: Cerebras Systems)

Each chip is power-efficient; however, packing this much computing power into a small package generates lots of heat, so the multiple-die wafer is housed in a water-cooled system. Multiple systems fit into a standard rack with Ethernet connections, allowing very large systems to be constructed. The interconnect and computational support have been optimized to handle the sparse neural networks common to most applications.

Spiking Neural Network Hardware

Spiking neural networks (SNNs) have different characteristics than DNNs.
One advantage of SNNs is that the computational cost of learning is on par with that of deployment, whereas DNNs require lots of data and computational capability for training compared to deployment. SNNs can also handle incremental training. Furthermore, SNNs require less computational overhead because they only process neurons when they’re triggered (Fig. 2).

2. Conventional neural networks (top) evaluate all the elements in the model at each level, whereas spiking neural networks (bottom) compute only triggered events. (Source: GrAI Matter Labs)

BrainChip’s AKD1000 Neural Network SoC (NSoC) can handle both DNNs and SNNs. The architecture supports up to 80 neural processing units (NPUs); the AKD1000 has 20 NPUs (Fig. 3). A conversion complex implements a spike event converter and a data-spike event encoder that can handle multivariable digital data as well as preprocessed sensor data. The SNN support only processes non-zero events.

3. The neuron fabric in BrainChip’s AKD1000 supports spiking neural networks. A Cortex-M4 manages system resources.

The AKD1000 benefits from sparsity in both activations and weights. It supports quantizing weights and activations to 1, 2, or 4 bits, leading to a small memory footprint. NPUs communicate events over a mesh network, so model processing doesn’t require external CPU support.

Tenstorrent also targets SNN applications with its Tensix cores (Fig. 4). Each core has five single-issue RISC cores and a 4-TOPS compute engine. A packet-processing engine provides decoding/encoding and compression/decompression support along with data-transfer management.

4. Tenstorrent’s Tensix core is built with five single-issue RISC cores and a 4-TOPS compute engine.

As with most SNN platforms, Tensix cores can be used on the edge or in the data center. They provide fine-grained conditional execution that makes the system more efficient at processing SNN ML models. The system is designed to scale since it doesn’t use shared memory.
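Stepping back from the specific chips for a moment, the event-driven evaluation these SNN platforms share can be sketched in a few lines. The layer sizes and data below are invented for illustration; the point is that work scales with the number of non-zero events, not with the total input size:

```python
# Toy sketch of why event-driven (spiking) evaluation is cheaper than dense
# evaluation: only non-zero input events trigger any work.

def dense_layer(inputs, weights):
    """Dense evaluation: every input/output pair costs a MAC."""
    n_out = len(weights[0])
    return [sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
            for j in range(n_out)]

def event_driven_layer(inputs, weights):
    """Event-driven evaluation: skip inputs that never spiked (zeros)."""
    n_out = len(weights[0])
    out = [0.0] * n_out
    events = [(i, v) for i, v in enumerate(inputs) if v != 0]  # sparse events
    for i, v in events:            # cost scales with the event count,
        for j in range(n_out):     # not with the total neuron count
            out[j] += v * weights[i][j]
    return out

inputs = [0, 0, 1, 0, 0, 0, 2, 0]     # mostly silent neurons
weights = [[(i + j) % 3 for j in range(4)] for i in range(8)]
assert dense_layer(inputs, weights) == event_driven_layer(inputs, weights)
```

With eight inputs but only two events, the event-driven path touches a quarter of the weights; real sensor data is often far sparser than that.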
The Tensix architecture also doesn’t require coherency between nodes, enabling a grid of cores to be efficiently connected via its network.

GrAI Matter Labs also targets this event-driven ML approach with its NeuronFlow technology (Fig. 5). The GrAI One consists of 196 neuron cores with 1,024 neurons per core, which adds up to 200,704 neurons. A proprietary network-on-chip provides the interconnect. No external DRAM is needed. The SDK includes TensorFlow support.

5. Events and responses are handled by the network-on-chip (1). The events (2) are then multiplied by the appropriate weights (3), run through the neuron pool (4), and the results are then processed (5).

CNNs, DNNs, and More

Convolutional neural networks are very useful for certain kinds of applications like image classification. SiMaai optimized its chip (Fig. 6) for CNN workloads. By the way, sima means “edge” in Sanskrit. The chip is also ISO 26262 ASIL-B compliant, allowing it to be used in places where other chips aren’t suitable, such as automotive applications.

6. SiMaai’s SoC is ISO 26262-compliant.

An Arm Cortex provides application support, but it’s augmented with a 50-TOPS machine-learning accelerator (MLA). The MLA includes an image signal processor (ISP) and a computer-vision processor to preprocess data, allowing all aspects of the system to run on a single chip.

Flex Logix is known for its embedded FPGA technology. The company brought this expertise to the table with its nnMAX design and the InferX X1 coprocessor. The nnMAX array cluster is designed to optimize memory use for weights by implementing Winograd acceleration that handles input and output translation on the fly. As a result, the system can remain active while other solutions are busy moving weights in and out of external memory. The chip supports INT8, INT16, and BFLOAT16. Multiple models can be processed in parallel.

Groq’s Tensor Streaming Processor (TSP) chip delivers 1 peta-operation per second running at 1.25 GHz using INT8 values.
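Low-bit weights come up repeatedly with these chips, from the AKD1000’s 1-, 2-, and 4-bit weights to the InferX X1’s INT8 support. A minimal sketch of symmetric linear quantization shows the basic idea; the rounding and scale choices here are illustrative, and real vendor toolchains are considerably more sophisticated:

```python
# Minimal sketch of symmetric linear quantization, the kind of scheme that
# lets an accelerator store weights in 4 or 8 bits instead of 32-bit floats.

def quantize(weights, bits):
    """Map float weights onto signed integers of the given bit width."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

w = [0.8, -0.31, 0.05, -0.77]
q4, s = quantize(w, bits=4)
print(q4)                                        # [7, -3, 0, -7]
print([round(v, 2) for v in dequantize(q4, s)])  # [0.8, -0.34, 0.0, -0.8]
```

At 4 bits the reconstruction error is visible but small, which is why aggressive quantization trades so little accuracy for an 8x reduction in weight storage versus 32-bit floats.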
Groq’s TSP architecture enables this high level of performance by splitting the data and instruction flow (Fig. 7). The 20 horizontal data-flow superlanes are managed by the vertical SIMD instruction flow. Identical east/west sections let data flow in both directions. There are 20 superlanes with 16 SIMD units each.

7. Groq’s chip implements 20 superlanes with 16 SIMD units each. Data flows toward the outer edges of each side, while the control SIMD instructions flow up through the array, controlling massively parallel computations.

The units within the array include on-chip SRAM (MEM), vector processing engines (VXM), a matrix of MAC cores (MXM), and data reshapers (SXM).

Processors and DSPs

Special ML processors are the order of the day for many new startups, but extending existing architectures garners significant performance benefits while keeping the programming model consistent. This allows for easy integration with the rest of an application.

Cadence’s Tensilica HiFi DSP is now supported by its HiFi Neural Network library in addition to the Nature DSP library that handles vector math like FFT/FIR and IIR computations. The 8-/16-/32-bit SIMD and vector FPU (VFPU) support provides efficient support for neural networks while enabling a custom-designed DSP to include customer-specific enhancements.

CEVA’s SensPro sensor-hub DSP combines the CEVA-BX2 scalar DSP with a NeuPro AI processor and a CEVA-XM6 vision processor. The wide SIMD processor architecture is configurable to handle 1,024 8×8 MACs, 256 16×16 MACs, or dedicated 8×2 binary-neural-network (BNN) support. It can also handle 64 single-precision and 128 half-precision floating-point MACs. This translates to 3 TOPS for 8×8 network inferencing, 20 TOPS for BNN inferencing, and 400 GFLOPS of floating-point arithmetic.

The DesignWare ARC HS processor solution developed by Synopsys takes the tack of using lots of processors to address ML support.
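TOPS figures like CEVA’s follow directly from MAC counts, since a MAC is conventionally counted as two operations (a multiply plus an add). The clock frequency below is inferred from the quoted 3-TOPS figure, not an official specification:

```python
# Back-of-the-envelope check on MAC-array throughput. A MAC is usually
# counted as two operations. The 1.5-GHz clock below is an assumption
# chosen to land near the quoted 3 TOPS, not a vendor-published number.

def tops(macs_per_cycle, ghz):
    """Tera-operations per second for a MAC array at a given clock."""
    return macs_per_cycle * 2 * ghz / 1000

print(round(tops(1024, 1.5), 2))   # 1,024 MACs at ~1.5 GHz -> 3.07 TOPS
```

The same arithmetic explains the BNN figure: trading 8×8 MACs for many more 1-bit operations per cycle multiplies the headline TOPS without changing the silicon area much.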
The Synopsys approach isn’t much different from most solutions, but it’s more along the lines of conventional RISC cores and interconnects that are typically more useful for other applications.

AMD isn’t the only x86 chip producer. Via Technologies has its own x86 IP, and its Centaur Technology subsidiary is making use of that. The x86 platform is integrated with an NCore AI coprocessor, tied together by a ring (Fig. 8). The NCore utilizes a very wide SIMD architecture organized into vertical slices to provide a scalable configuration, making future designs more powerful. The chip can deliver 20 TOPS at 2.5 GHz.

8. Centaur Technology blends x86 processors with the NCore AI accelerator.

I’ve previously covered the Arm Cortex-M55 and Ethos-U55 combination. The Cortex-M55 has an enhanced instruction set that adds a vector pipeline and data path to support the new SIMD instructions. The DSP support includes features like zero-overhead loops, circular buffers, and bit-reverse addressing. Still, as with other architectures, a dedicated AI accelerator is being added to the solution, namely the Ethos-U55 micro network processor unit (microNPU). It supports 8- or 16-bit activations in the models, but internally, weights are always 8 bits. The microNPU is designed to run autonomously.

While the V in RISC-V doesn’t stand for vectors, SiFive’s latest RISC-V designs do have vector support that’s ideal for neural-network computation (Fig. 9). What makes this support interesting is that it can be dynamically configured: Vector instructions work with any vector size using vector-length and vector-type registers, and the compiler’s vectorization support takes this into account. The VI2, VI7, and VI8 platforms target application spaces all the way up through the data center.

9. SiFive’s new RISC-V designs include configurable vector support.

Extending FPGAs and GPGPUs

Xilinx’s Versal adaptive compute acceleration platform (ACAP) is more than just an FPGA (Fig. 10).
The FPGA fabric is at the center, providing low-level customization, but hard cores and an interconnect network surround it. The hard cores include Arm Cortex CPUs for application and real-time chores, along with AI and DSP engines.

10. Xilinx’s adaptive compute acceleration platform (ACAP) can incorporate AI engines to complement the FPGA fabric and hard-core Arm Cortex CPUs.

I left Nvidia to the end, as the company announced its A100 platform at the recent, virtual GPU Technology Conference (Fig. 11). This GPGPU incorporates a host of ML enhancements, including sparsity acceleration and multi-instance GPU (MIG) support. The latter provides hardware-based partitioning of GPU resources that allows more secure and more efficient operation. Large-scale implementations take advantage of the third-generation NVLink and NVSwitch technology that ties multiple devices together.

11. Nvidia’s Jensen Huang just pulled this motherboard out of the oven. It has eight A100-based GPGPUs specifically designed for AI acceleration.

The plethora of machine-learning options includes more than just the platforms outlined here. They reflect not only the many ways that ML can be accelerated, but also the variety of approaches available to developers. Simply choosing a platform can be a major task, even when one understands what kind of models will be used and their likely performance requirements. Likewise, the variants can offer performance differences that are orders of magnitude apart. System and application design has never been more exciting or more complicated. If only we had a machine-learning system to help with that.

Series: MicroPython for Embedded Systems

There are currently around 600 programming languages to choose from, so picking the one that’s right for you can be pretty difficult.
But if you’re looking for a language that’s incredibly popular, has a low barrier to entry, and can easily be integrated with a wide variety of other languages, then Python is arguably your best bet right now. Python is the second most in-demand programming language as of 2020, and it might even end up taking the crown from JavaScript one day in the near future. But while Python can be used for anything from web hosting and software development to business applications and everything in between, it can’t run on microcontrollers, which somewhat limits its capabilities.

Luckily, an intrepid programmer took care of this little issue a few years back when he came up with MicroPython. Just as its name suggests, MicroPython is a compact version of the popular programming language that was designed to work hand-in-hand with microcontrollers. In this tutorial, we’re going to teach you everything you need to know about microcontrollers and discuss the benefits of using MicroPython boards over other options. There’s quite a bit to unpack here, but before we jump into the nitty-gritty, let’s take a trip back in time and see where the idea behind MicroPython came from.

MicroPython History and Overview

The year is 2013. Damien George, at Cambridge University at the time, launches a Kickstarter campaign that promises to bring Python to microcontrollers, which would finally allow for quick and painless hardware programming. The crowdfunding campaign was a huge success from the get-go. Damien was asking for a rather modest sum of £15,000 to turn his proof of concept into a working product, complete with reference hardware. By the end of the Kickstarter, however, he had raised close to £100,000 for the project, thanks to over 1,900 backers. The project evolved over the years, and Damien’s reference hardware eventually became the Pyboard (see figure), a small electronic circuit board that runs MicroPython on the bare metal.
While you don’t necessarily need a Pyboard in order to use MicroPython, it’s one of the best and easiest boards to work with when it comes to hardware programming. The Pyboard runs MicroPython on an STMicroelectronics STM32F405RG microcontroller based on a 168-MHz Arm Cortex-M4. The board is currently available in a few different versions, with prices generally ranging between $20 and $35, though some are a bit more expensive. If you’re not looking to invest in a microcontroller just yet, you can go to MicroPython Live to play around with a board for free in an online environment.

MicroPython offers some very exciting possibilities for hardware-programming experts and beginners alike. That’s because, unlike regular Python, MicroPython can seamlessly integrate with circuits, buttons, sensors, LCD displays, and various other electronics. Not only that, but MicroPython requires far fewer resources and doesn’t have to rely on an operating system, because MicroPython itself acts as the OS for the Pyboard or any other microcontroller.

Essentially, all you need to set up a MicroPython project is compatible hardware and some coding skills. The latter may seem like a problem if you’re not a coder. But don’t worry, because MicroPython is an open-source project that’s supported by a very passionate and helpful community. If you have a specific project in mind, you can generally expect to find code libraries and tutorials from developers that can help you bring it to life. MicroPython may have started as a one-man effort, but nowadays the project is supported by a large community of programmers, hobbyists, and even major organizations like the European Space Agency, which helped fund Damien’s work.
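To see how little code a first Pyboard program takes, here’s the classic LED blinker. `pyb.LED` and `pyb.delay` are part of MicroPython’s documented `pyb` module; the fallback stub below is my addition so the same logic can be exercised on a desktop Python without the hardware:

```python
# A classic first Pyboard program: blink an LED. pyb.LED and pyb.delay are
# MicroPython's documented pyb APIs; the stub class only exists so this
# sketch also runs on desktop Python for illustration.
try:
    import pyb                      # available on a real Pyboard
except ImportError:                 # desktop fallback (illustrative stub)
    class _FakeLED:
        def __init__(self, n):
            self.n, self.toggles = n, 0
        def toggle(self):
            self.toggles += 1       # count toggles instead of lighting up
    class pyb:
        LED = _FakeLED
        @staticmethod
        def delay(ms):
            pass                    # no-op off-board

def blink(led, times=10, period_ms=500):
    """Toggle the LED the requested number of times."""
    for _ in range(times):
        led.toggle()
        pyb.delay(period_ms)
    return led

led = blink(pyb.LED(1), times=4)    # LED 1 is the red LED on a Pyboard
```

On a real board, that’s the whole program: no toolchain, no flashing cycle, just Python pasted at the prompt or saved to the board’s filesystem.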
Physical Computing

Whereas regular Python is one of the best scripting languages for software programming, MicroPython is perfect for anyone interested in physical computing. The term physical computing can refer to lots of things when used in a broad sense, but as far as this tutorial is concerned, we’re primarily going to use it to describe the interactive systems and devices created with the help of MicroPython. Regardless of which hardware-programming project you plan to tackle, you’ll need to take into account the following three main elements:

1. Input: This can be a button, sensor, or anything else that allows you to give the microcontroller a command or a signal.

2. Processing: The microcontroller itself, which processes the input and delivers one or multiple outputs.

3. Output: This can come in the form of any device that sends data from the microcontroller to another device or directly to the user.

In addition to the three elements described above, you will also need some type of power source and, depending on the project, wires that connect everything together.

What Exactly Is a Microcontroller?

A microcontroller can be described as an integrated circuit that controls a device or a system. You can look at it as the equivalent of a small computer that’s less powerful than a regular desktop machine, but much more compact. Because of their small size, microcontrollers can easily be embedded into a wide variety of systems, including air-conditioning systems, medical devices, home appliances, radios, vending machines, vehicles, and even robots, to name just a few examples. Unlike regular computers, microcontrollers don’t require an entire board of chips to get the job done. Instead, they come in the form of an all-purpose chip that contains a processing unit (CPU), memory, and one or more I/O (input/output) ports.
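The input-processing-output split described earlier maps naturally onto the classic embedded “super loop.” Here’s a minimal sketch; the button-toggles-LED behavior and the canned input samples are invented for illustration:

```python
# Minimal sketch of the input -> processing -> output cycle. The canned
# button samples stand in for real pin reads on actual hardware.

def process(pressed, led_on):
    """Processing step: a button press toggles the LED state."""
    return (not led_on) if pressed else led_on

def run(samples):
    """Drive the super loop over a sequence of button samples."""
    led_on, outputs = False, []
    for pressed in samples:                  # input: one sample per pass
        led_on = process(pressed, led_on)    # processing
        outputs.append(led_on)               # output: the LED pin state
    return outputs

print(run([True, False, True, True]))        # [True, True, False, True]
```

On real hardware the loop would run forever, reading a pin each pass instead of a list; the structure is otherwise identical.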
A microcontroller likewise doesn’t need a front-end operating system, because it comes complete with specialized software known as firmware. In the case of MicroPython boards, this software includes a small subset of the Python standard library.

How Does a Microcontroller Work?

The basic concept is similar to that of a regular computer. The microcontroller receives data via its I/O ports and processes it using its CPU. Since a microcontroller doesn’t have a permanent storage unit like a hard drive or an SSD, all of the received data is temporarily stored in the microcontroller’s built-in data memory, which you can view as the equivalent of the RAM in your computer. The processor then accesses the information and deciphers it using a set of instructions stored in the program memory. Depending on what the input requires, the microcontroller subsequently delivers one or more actions (outputs) using its I/O ports.

Microcontrollers can be found in systems of all shapes and sizes. But while a single microcontroller can handle certain small devices all by itself, it won’t be able to power more complex systems. However, multiple microcontrollers can be programmed to work in conjunction with one another to achieve that result, with each unit controlling only a specific feature or a small component of the larger system. In many cases, microcontrollers work alongside a central computer. However, they can also be programmed to communicate only with other microcontrollers, or to operate individually inside the same system.

What Are the Core Elements of a Microcontroller?

As mentioned earlier, a microcontroller has three core elements, so let’s take a closer look at each one of them.

Central Processing Unit (CPU)

The processor is the part of a computer that performs operations and executes instructions. It’s essentially the brain of a regular computer, and the same concept applies to microcontrollers, too.
Like any CPU, the microcontroller’s processor performs basic logic, arithmetic, and I/O operations, but it can also perform data-transfer operations. These come in the form of instructions, which are communicated to other components that are part of a larger system.

Memory

The memory stores data received by the CPU via the microcontroller’s I/O ports, and it comes equipped with a set of instructions used to respond to the information received from the processor. A microcontroller works with two types of memory, each of which performs a different function. Data memory stores information while instructions are being executed by the CPU. Meanwhile, program memory stores the instructions themselves. Data memory is volatile and its contents disappear once the microcontroller is no longer connected to a power source. Program memory is non-volatile and remains on the device even after its power source has been removed.

I/O Ports

The input and output ports allow the microcontroller’s CPU to interface with the outside world. I/Os are used by the microcontroller to interact not just with human users, but also with other information-processing systems. The microcontroller’s CPU gathers data via its input ports and uses output ports to send instructions or signals externally. In addition to the three core elements detailed above, microcontrollers generally rely on other components that connect to the I/O ports. These components act as peripherals that create a bridge between the processor and various devices. A few examples include the system bus, digital-to-analog converters, analog-to-digital converters, and serial ports.

Why Choose MicroPython Boards Over Other Microcontrollers?

Microcontrollers existed long before Damien George came up with the idea for MicroPython, so why not go with one of those instead? Well, there are plenty of good reasons, the most obvious of which has to be accessibility.
Regular Python is known for being extremely friendly to newcomers, and the same can be said about MicroPython. If you want to learn the basics of hardware programming and start working on your own projects as soon as possible, MicroPython is for you.

Another big advantage of MicroPython is that you can interact with it using the intuitive REPL (read-eval-print loop) environment. The REPL prompt allows you to quickly execute commands, change your code on the fly, and import scripts from the built-in filesystem. In addition to benefiting from MicroPython’s ease of use and rapid feedback, you can also take advantage of the fact that this implementation is largely compatible with normal Python. If you already know your way around Python, you won’t have any issues wrapping your head around MicroPython. If you don’t, you’ll be able to pick it up quickly because, thanks to Python’s ever-growing popularity, there are tons of resources online, which isn’t the case with many of the more obscure programming languages.

Final Thoughts

Microcontrollers have seen a massive surge in popularity these last few years, and that trend is very likely to continue for the foreseeable future. These amazing little circuit boards can be used to power all manner of devices and are increasingly being used as an educational tool in schools worldwide. In other words, now is definitely a great time to learn about microcontrollers and hardware programming in general. If you’re a beginner who doesn’t want to be overwhelmed by all the tech-heavy jargon floating around these days, we recommend you start by learning as much as you can about MicroPython, because doing so will also give you a lot of insight into microcontrollers. We hope our tutorial managed to shed a bit of light on the topic. There’s always more to learn, though, so don’t hesitate to check out the documentation section on the official MicroPython website for additional information.
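As a parting example of how approachable the language is, here’s the kind of helper you could paste straight into the MicroPython REPL or run under desktop Python unchanged. The sensor readings are invented for illustration:

```python
# Smooth noisy sensor readings with a simple moving average. Pure Python,
# so it runs identically at the MicroPython REPL and on a desktop.

def moving_average(readings, window=3):
    """Average each reading with its predecessors over a sliding window."""
    out = []
    for i in range(len(readings)):
        chunk = readings[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

adc_samples = [512, 515, 509, 530, 528, 511]     # e.g. raw 10-bit ADC reads
print([round(v, 1) for v in moving_average(adc_samples)])
# [512.0, 513.5, 512.0, 518.0, 522.3, 523.0]
```

Filters like this are everyday physical-computing glue: raw analog reads jitter, and a few lines of Python tame them before the values drive an output.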
Source: https://www.electronicdesign.com/technologies/embedded-revolution/article/21131474/accelerating-machine-learning-means-new-hardware