AI And Machine Learning Provide Challenges And Opportunities For Data Center Architects

CEO and co-founder of Liqid Inc. Tech industry veteran with more than 20 years of experience.

Gary Burchell

Pain drives the adoption of new IT solutions. The need for inexpensive shared storage for VMware virtualization drove the proliferation of storage area networks (SANs) — the massive data storage systems that are now a staple in modern data centers. High on-prem IT costs to run applications and store the massive amounts of data created and managed by SANs drove the adoption of cloud computing, enabling organizations to economically access applications and expensive, high-value resources such as storage to power native applications.

Meet the new challenge: AI and machine learning (AI+ML). Sharing high-speed NVMe flash storage resources is no longer enough, on its own, to deliver the performance required to process data in real time for AI+ML operations. Graphics processing unit (GPU) resources that can calculate data three-dimensionally (i.e., visual computing), versus the linear, transactional calculations of traditional CPUs, have been introduced into the data center ecosystem to improve data performance and meet the needs of AI+ML.

The numbers tell the story of this ongoing data center transformation very well. Recent earnings from GPU provider NVIDIA (a Liqid partner) show the company has posted 167% year-over-year revenue growth in its data center business with AI+ML and high-performance computing deployments driving growth across industry verticals. Furthermore, last year the analyst group Gartner predicted that through 2023, GPU adoption in the data center will grow much faster compared to PC gaming: 22.53% CAGR vs. 6.69%.
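To make the gap between those two growth rates concrete, the short sketch below compounds each CAGR over a fixed horizon. The four-year horizon is an assumption chosen only for illustration (the forecast's start year is not given here), and `compound_growth` is a hypothetical helper name.

```python
def compound_growth(cagr_percent: float, years: int) -> float:
    """Return the total growth multiple after compounding a CAGR for `years` years."""
    return (1 + cagr_percent / 100) ** years


# Assumption for illustration only: a four-year forecast horizon.
YEARS = 4

data_center_multiple = compound_growth(22.53, YEARS)
pc_gaming_multiple = compound_growth(6.69, YEARS)

print(f"Data center GPU adoption: {data_center_multiple:.2f}x after {YEARS} years")
print(f"PC gaming GPU adoption:   {pc_gaming_multiple:.2f}x after {YEARS} years")
```

Under that assumed horizon, the data center figure a little more than doubles while the PC gaming figure grows by roughly 30%.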

GPUs are high-value resources with the price tags to match. NVIDIA has again increased that value and thrown down the gauntlet for its data center competitors to pick up. According to the company, its new A100 Tensor Core GPU delivers up to 20 times the performance of its previous generation of GPU, making it among the fastest accelerators in any data center environment. NVIDIA sees the A100 as the “universal accelerator” for AI+ML operations, scientific research and cloud-based visual computing.

It’s easy to predict that those organizations seeking a competitive advantage in next-gen, AI-driven computing will eagerly deploy these devices in the largest quantities that budgets will allow. Again, the money tells the story well: Gartner anticipates that augmenting data center architectures with AI will create $2.9 trillion in business value in 2021.

A Wave Of Accelerator Innovation Drives AI Adoption

With GPUs increasingly becoming table stakes for competitive data center environments with diverse AI+ML workloads, it’s important to remember GPUs are by no means the only data accelerators evolving to meet the ever-increasing data requirements associated with AI+ML.

In addition to the performance advancements enabled by the A100, NVIDIA Mellanox is now producing the high-speed ConnectX-6 SmartNIC. The ConnectX delivers a powerful, secure 25/50 Gb/s Ethernet controller. The device is designed to enable highly adaptive, software-driven management of networked hardware across the entire data center.

And that is just from NVIDIA. Intel Optane memory technology uses software to give its Intel Core-based solid-state drive (SSD) products near-memory speeds. One of the most basic, bare-metal interfaces that connects disaggregated data center devices through the CPU, PCI-Express (PCIe), has recently been updated to PCI-Express 4.0, which offers twice the performance of its preceding specification. NVMe storage protocols provide marked performance improvements over legacy protocols such as SATA or SAS for additional advancements in data speeds.
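The doubling from PCIe 3.0 to PCIe 4.0 can be checked with a little arithmetic. The sketch below assumes the published per-lane signaling rates (8 GT/s for PCIe 3.0, 16 GT/s for PCIe 4.0) and the 128b/130b line encoding both generations use. It gives theoretical one-direction bandwidth and ignores protocol overhead, so real-world throughput will be lower. The function name is hypothetical.

```python
def pcie_lane_gbytes_per_s(gigatransfers_per_s: float) -> float:
    """Theoretical one-direction bandwidth of a single PCIe lane in GB/s.

    Assumes 128b/130b encoding (128 payload bits per 130 bits on the wire),
    which applies to PCIe 3.0 and 4.0. Packet overhead is not modeled.
    """
    payload_fraction = 128 / 130
    bits_per_byte = 8
    return gigatransfers_per_s * payload_fraction / bits_per_byte


GEN3_GT_S = 8.0   # PCIe 3.0 signaling rate per lane
GEN4_GT_S = 16.0  # PCIe 4.0 signaling rate per lane
LANES = 16        # a typical x16 slot, as used by many GPUs

gen3_x16 = pcie_lane_gbytes_per_s(GEN3_GT_S) * LANES
gen4_x16 = pcie_lane_gbytes_per_s(GEN4_GT_S) * LANES

print(f"PCIe 3.0 x16: {gen3_x16:.2f} GB/s")
print(f"PCIe 4.0 x16: {gen4_x16:.2f} GB/s")
print(f"Ratio: {gen4_x16 / gen3_x16:.1f}x")
```

Because only the signaling rate changes between the two generations, the ratio comes out to two regardless of lane count.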

Tying It All Together: New Performance Requirements Call For New Architectures

While all of these new solutions are exciting for data center professionals, there is an urgent need to architect them into existing environments and to anticipate rapidly changing performance requirements. Traditional data center architectures leave GPU and other accelerator resources locked into configurations at the point of purchase. Without the ability to share these valuable resources in a networked environment, the result can be significant underutilization and waste.

Again using the example of the A100, multiple A100s can be aggregated in previously impossible volumes so that users can share GPUs in a manner better capable of supporting the data needs of AI+ML applications.

Other organizations seek to better optimize resources through NVMe- and GPU-over-Fabric (NVMe-/GPU-oF) operations, which use software to aggregate the power of these resources via high-speed Ethernet and InfiniBand networking fabrics in order to distribute NVMe and GPU capabilities more widely, across distance and beyond traditional fixed configurations.

Composable software makes it possible to share all of these resources in any volume required, in balance with other high-performance hardware. Composability also allows older accelerator devices that are already deployed to be integrated with newer disaggregated hardware. As workload requirements change, all disaggregated high-performance accelerators can be adjusted to meet uneven data workload requirements, either on demand or through automated deployment.
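The idea of composing and recomposing accelerators from a shared pool can be sketched in a few lines. This is a toy model under stated assumptions, not any vendor's API: `DevicePool`, `compose`, `release`, and the host and device names are all hypothetical, and a real composable system would also handle fabric topology, device health, and scheduling.

```python
from dataclasses import dataclass, field


@dataclass
class DevicePool:
    """Toy model of a pool of disaggregated accelerators (hypothetical)."""

    free: list[str]
    assigned: dict[str, list[str]] = field(default_factory=dict)

    def compose(self, host: str, count: int) -> list[str]:
        """Attach `count` free devices to `host`; fail if the pool is short."""
        if count > len(self.free):
            raise ValueError(f"only {len(self.free)} devices free, {count} requested")
        taken, self.free = self.free[:count], self.free[count:]
        self.assigned.setdefault(host, []).extend(taken)
        return taken

    def release(self, host: str) -> None:
        """Return all of a host's devices to the free pool."""
        self.free.extend(self.assigned.pop(host, []))


# Eight pooled GPUs shared by two hypothetical hosts.
pool = DevicePool(free=[f"gpu{i}" for i in range(8)])

pool.compose("training-node", 6)   # a heavy training job takes most of the pool
pool.compose("inference-node", 2)  # inference gets what is left

pool.release("training-node")      # training finishes, its GPUs go back to the pool
pool.compose("inference-node", 4)  # inference scales up as demand shifts

print(pool.assigned)
print(pool.free)
```

The point of the sketch is that no device is tied to a host at purchase time: the same eight GPUs serve a training-heavy mix first and an inference-heavy mix afterward.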

In these ways and others, IT users will eventually be able to aggregate a virtually unlimited number of devices across Ethernet in tandem with other accelerators. As the market matures, users will eventually be able to rent GPU resources as a service, just as the SAN enabled the movement of data to the cloud while simultaneously driving new opportunities to rent out high-performance storage resources.

The evolving ecosystem that is developing to support and efficiently share high-value, next-gen accelerator resources will help solve huge problems and create significant opportunities across industry, government, and academia. It’s just that transformative. The leaders who emerge as change takes place at a rapid clip will need to be able to effectively and efficiently optimize and adapt to the changing data environment in order to meet the performance demand associated with AI+ML operations.

Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.
