2.5D memory provides the bandwidth, capacity, and power efficiency needed for AI training, but comes with added complexity.
The largest AI/ML neural network models now exceed 100 billion parameters. With model size growing at roughly 10X per year over the last decade, we’re headed to trillion-parameter models in the not-too-distant future. Given the tremendous value that can be derived from AI/ML (it is mission critical to five of the six largest companies in the world by market cap), there has been a huge imperative to develop application-specific silicon for training. Those AI accelerators need extremely high memory bandwidth and capacity to keep their processing engines operating at full tilt.
High Bandwidth Memory (HBM) has evolved as the “go to” memory for AI/ML training. It uses a 2.5D/3D architecture to leapfrog the bandwidth and capacity performance of 2D memory architectures such as DDR. The “3D” refers to DRAM device stacking. In a standard DDR device, there’s typically a single DRAM die. With HBM, the memory device is a package containing multiple DRAM die. In the latest iteration of HBM, HBM2E, 12-high stacks of DRAM die are supported. With an internal 12-high die stack, an HBM2E DRAM device can achieve a memory capacity of up to 24 gigabytes (GB).
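As a quick sanity check on the capacity figure above, the short Python sketch below works backward from the 24 GB, 12-high stack to the density each DRAM die must provide. The function names are hypothetical, and the 8-high configuration at the end is an illustrative assumption rather than a figure from this article.

```python
# Back-of-the-envelope check of the HBM2E stack capacity quoted in the text.
# Function names are illustrative only.

STACK_HEIGHT = 12        # DRAM dies per HBM2E device (from the text)
STACK_CAPACITY_GB = 24   # gigabytes per HBM2E device (from the text)


def per_die_capacity_gbit(stack_capacity_gb: float, stack_height: int) -> float:
    """Capacity each die must provide, in gigabits (8 bits per byte)."""
    return stack_capacity_gb * 8 / stack_height


def stack_capacity_gb(die_capacity_gbit: float, stack_height: int) -> float:
    """Total device capacity in gigabytes for a given die density and height."""
    return die_capacity_gbit * stack_height / 8


die_gbit = per_die_capacity_gbit(STACK_CAPACITY_GB, STACK_HEIGHT)
print(f"Implied die density: {die_gbit:.0f} Gb per die")

# Assumption for illustration: the same die density in a shorter 8-high stack.
print(f"8-high stack at that density: {stack_capacity_gb(die_gbit, 8):.0f} GB")
```

The implied density works out to 16 Gb per die, which is how twelve dies reach 24 GB in a single device.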
The “2.5D” portion of the 2.5D/3D architecture derives from the interface implementation. HBM uses a “wide and slow” interface. We’ll talk about how “slow” is increasingly a misnomer, but let’s focus on the “wide.” HBM employs a 1024-bit wide data bus to connect the HBM memory device to an AI accelerator. Clock, power management, and command/address connections are also required, which pushes the number of traces needed per device to about 1,700. State-of-the-art AI accelerators may connect to four to six HBM devices, so now we’re pushing upwards of 10,000 traces.
That is far, far more than can be printed on a standard PCB. A silicon interposer is used as the intermediary to connect memory stack(s) and processor. The use of the silicon interposer is what makes this a 2.5D architecture. As with an IC, finely spaced traces can be etched in the silicon interposer to achieve the number needed for the HBM interface(s).
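To make the routing problem concrete, the sketch below tallies the approximate trace count for the four-device and six-device configurations mentioned above. It uses the article's round figure of about 1,700 traces per device, so the totals are estimates, and the names are hypothetical.

```python
# Rough trace-count estimate for an accelerator with several HBM devices.
# 1,700 traces per device is the approximate figure given in the text.

DATA_BUS_BITS = 1024       # data bus width per HBM device (from the text)
TRACES_PER_DEVICE = 1_700  # data + clock + power management + command/address


def total_traces(num_devices: int, per_device: int = TRACES_PER_DEVICE) -> int:
    """Approximate number of interposer traces for num_devices HBM stacks."""
    return num_devices * per_device


for devices in (4, 6):
    print(f"{devices} HBM devices: about {total_traces(devices):,} traces")

# Share of each interface that carries data rather than control or clocking.
data_share = DATA_BUS_BITS / TRACES_PER_DEVICE
print(f"Data lines are roughly {data_share:.0%} of each interface")
```

Six devices come to about 10,200 traces, which matches the “upwards of 10,000” figure.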
The “slow” referred to how a modest data rate could deliver high bandwidth. With first generation HBM, the interface operated at 1 gigabit per second (Gb/s) which over a 1024-bit wide bus yielded a bandwidth of 128 GB/s. With HBM2E, Rambus has pushed that maximum data rate to 4 Gb/s, which is downright fast, being well above the top-end performance of DDR4 memory. When paired with the fastest available HBM2E DRAM devices from SK hynix, rated at 3.6 Gb/s, the Rambus HBM2E interface can deliver 460 GB/s of memory bandwidth. In an architecture consisting of an accelerator and four HBM2E devices, we can achieve an enormous bandwidth of over 1.8 Terabytes per second (TB/s).
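The bandwidth figures above follow directly from data rate times bus width. The sketch below reproduces that arithmetic using the numbers in the text. The 4 Gb/s line shows what the interface maximum would yield with a DRAM of matching speed, a derived figure rather than one quoted in this article. Function names are hypothetical.

```python
# Per-device and system bandwidth from per-pin data rate and bus width.
# bandwidth (GB/s) = data rate (Gb/s per pin) * bus width (bits) / 8

BUS_WIDTH_BITS = 1024  # HBM data bus width (from the text)


def device_bandwidth_gbytes(rate_gbps: float, width_bits: int = BUS_WIDTH_BITS) -> float:
    """Bandwidth of one HBM device in gigabytes per second."""
    return rate_gbps * width_bits / 8


def system_bandwidth_tbytes(rate_gbps: float, num_devices: int) -> float:
    """Aggregate bandwidth of num_devices HBM devices in terabytes per second."""
    return device_bandwidth_gbytes(rate_gbps) * num_devices / 1000


print(f"HBM gen 1 at 1.0 Gb/s: {device_bandwidth_gbytes(1.0):.1f} GB/s")
print(f"HBM2E at 3.6 Gb/s:     {device_bandwidth_gbytes(3.6):.1f} GB/s")
# Derived, not quoted in the text: the interface maximum with a matching DRAM.
print(f"HBM2E at 4.0 Gb/s:     {device_bandwidth_gbytes(4.0):.1f} GB/s")
print(f"Four devices at 3.6:   {system_bandwidth_tbytes(3.6, 4):.2f} TB/s")
```

At 3.6 Gb/s the exact result is 460.8 GB/s per device, which the text rounds to 460 GB/s, and four such devices give about 1.84 TB/s.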
Another big benefit of this architecture’s scaling in the Z-dimension is that processor and memory can be kept in very close proximity. With the “relatively slow” data rate, power consumption is minimized. Given the deployment of AI/ML training accelerators in enterprise and hyperscale data centers, heat and power constraints are critical.
High bandwidth, high capacity, and power efficient, HBM2E memory delivers what AI/ML training needs, but of course there’s always a catch. The design trade-off with HBM is increased complexity of both the 3D devices and the 2.5D structure. The silicon interposer is an additional element that must be designed, characterized, and manufactured. Shipments of 3D stacked memory are small compared with the huge volume of traditional DDR-type memories and the manufacturing experience built up in making them. The net result is that implementation and manufacturing costs are higher for HBM2E than for a high-performance memory built using traditional manufacturing methods, such as GDDR6 DRAM.
Designers can greatly mitigate the challenges of higher complexity with their choice of IP supplier. Integrated solutions such as the HBM2E memory interface from Rambus ease implementation and provide a complete memory interface sub-system consisting of verified PHY and digital controller. Further, Rambus has extensive experience in interposer design with silicon-proven HBM2/HBM2E implementations benefiting from Rambus’ mixed-signal circuit design history, deep signal integrity/power integrity and process technology expertise, and system engineering capabilities. Rambus provides an interposer reference design to all its HBM2E customers.
The progress of AI/ML has been breathtaking, and there’s no slowing down. Improvements to every aspect of computing hardware and software will continue to be needed to keep up this scorching pace. For memory, AI/ML training demands bandwidth, capacity, and power efficiency. The Rambus HBM2E memory interface, consisting of PHY and memory controller, raises the bar with the highest performance available for AI/ML training.
Additional Resources:
Website: HBM2E Memory Interface Solution
Website: HBM2E Memory PHY
Website: HBM2E Memory Controller
Frank Ferro
Frank Ferro is senior director of product marketing for IP cores at Rambus.