Comment The lucid ramblings and art synthesized by ChatGPT or Stable Diffusion have captured imaginations and prompted no shortage of controversy over the role generative AI will play in our futures.
As we’ve seen with CNET and Buzzfeed, executives are no less dazzled by AI’s creative potential to replace workers with profits. But one of the things that’s often missed in these conversations is the need to retrain these models regularly or risk them aging into irrelevance, particularly in rapidly evolving environments like the news.
ChatGPT, Stable Diffusion, Dall-E-2 and the majority of generative AI today are trained on large datasets and then made available as proofs of concept or exported as pre-trained models.
Let’s take Stable Diffusion as an example, as it offers a glimpse at just how misleading the scope of these models can be. Like Dall-E-2, Stable Diffusion is multi-modal. It’s made up of a collection of models that work together to turn your words into a visual representation.
But where Stable Diffusion stands out is that its pre-trained model can fit into just 4GB of Nvidia vRAM without sending the CPU into overdrive trying to churn data. This means you can run it at home on a decently powerful laptop or desktop so long as you’ve got a dedicated GPU with enough memory. The ability to run models at home has opened the eyes of many to the potential for generative AI, but while fun, pre-trained models also have a finite shelf life.
Imagine if you exposed a child to everything the world has to offer. For 18 years they absorb all the knowledge they can, but on the first day of their adult life they’re locked away in a cave and isolated from the world. Now imagine you provided that person with art supplies and asked them to draw, paint, and render images based on your prompts.
At first the images would be relatively accurate, but with each passing day isolation puts them at a greater disadvantage. As the prompts increasingly venture into unfamiliar territory, the art steadily becomes less accurate.
A pre-trained AI model isn’t much different. It’s blind to the world from the point its training is complete. This is why for generative AI to be truly useful it’s going to need to be retrained repeatedly. And herein lies the problem: while these AI models all seem magical, training them even once remains an exceptionally expensive proposition.
This makes private school look like a bargain
Calculating the cost of training is a tricky thing because there are so many variables at play. But for the purposes of this piece, we’re going to take a look at floating point accuracy, model size, and training time to help put it all in perspective.
Most AI training today is done on GPUs each with a relatively small amount of fast memory onboard. Nvidia’s A100 and H100 GPUs both sport 80GB of HBM memory, while AMD and Intel’s GPUs are now pushing 128GB. While there are other architectures out there with different memory topologies, we’re going to stick to Nvidia’s A100 because the hardware is well supported, widely available in both on-prem and cloud environments and has been running AI workloads for years at this point.
Floating point accuracy is one of the biggest factors, as it plays both into training time and how much memory the model will need. The latter also dictates how much compute is required, as each accelerator only has so much memory. Training time itself is harder to quantify because it’ll vary depending on compute density, the quantity of accelerators, the size of the dataset, the number of parameters at play and any number of other related variables.
Most models today are trained using FP32, FP16, or Bfloat16, though many industry players are now pushing FP8 calculations. As you drop down the scale, accuracy is traded for greater performance and the models tend to get smaller too. For this reason, it’s not uncommon for models to use mixed precision, which essentially involves using lower accuracy calculations for some parameters and higher accuracy for others, usually to optimize performance.
So just how big are these models? Well, with ChatGPT generating no shortage of controversy as of late, let’s take a look at GPT-3, on which the divisive AI model is based. At 175 billion parameters, GPT-3, unveiled in mid-2020, was trained on a massive cluster of Nvidia V100 GPUs using a dataset of roughly 2TB.
From what we understand, GPT-3 was trained using FP32 precision, which means four bytes per parameter. That works out to about 700GB of vRAM required just to fit the model. Today, that’d require about ten 80GB Nvidia A100s, but unless you want to wait years for it to train, you’re gonna want a few more chunks of big iron.
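The memory arithmetic above can be sketched in a few lines of Python. This is a rough illustration, not a sizing tool: the function names are our own, the bytes-per-parameter table simply reflects the width of each format, and the figure covers the weights alone. Gradients, optimizer state and activations all add to the footprint during training, which is one reason the practical GPU count sits above the bare minimum computed here.

```python
import math

# Storage width of each numeric format, in bytes per parameter.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weights_footprint_gb(num_params, precision):
    """Memory needed to hold the model weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(num_params, precision, gpu_memory_gb=80):
    """Smallest number of GPUs whose combined memory can hold the weights."""
    footprint = weights_footprint_gb(num_params, precision)
    return math.ceil(footprint / gpu_memory_gb)


GPT3_PARAMS = 175e9  # 175 billion parameters, as quoted in the article

for precision in ("fp32", "fp16", "fp8"):
    size_gb = weights_footprint_gb(GPT3_PARAMS, precision)
    gpus = min_gpus_for_weights(GPT3_PARAMS, precision)
    print(f"{precision}: {size_gb:.0f}GB of weights, at least {gpus} x 80GB GPUs")
```

At FP32 this gives 700GB and a floor of nine 80GB cards, in line with the "about ten" estimate once a little headroom is allowed.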
Engineers at Nvidia, working alongside scientists at Stanford University and Microsoft Research, estimated in a 2021 paper that it’d take 1,024 A100s 34 days to train GPT-3 on a 1.2TB dataset. To put that in perspective, that’s the equivalent of 128 AWS p4de.24xlarge instances. At $40.96 per hour apiece, and with 816 hours required to train, that’d run you in the neighborhood of $4.28 million just to train it. Running inference on the trained model is another matter entirely.
And that’s just GPT-3. Future models are expected to be an order of magnitude larger, with some speculating that GPT-4 could be as large as a trillion parameters in size. But, since we don’t have any firm details on GPT-4 just yet, we’ll look at another large language model from Nvidia.
Behold the Megatron
Nvidia’s Megatron-Turing NLG language model has 530 billion parameters, making it more than three times larger than GPT-3. According to Nvidia, it took 2,048 Nvidia A100s running in mixed precision eight weeks to train the model. Going back to our AWS example, now we’re talking about just over $14 million to train it once. It doesn’t take much of an imagination to see why retraining every week on an incrementally larger dataset could get expensive in a hurry.
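Both cloud estimates above come from the same back-of-the-envelope formula: instances times hours times the hourly rate. Here is a minimal sketch, assuming eight A100s per p4de.24xlarge instance, the $40.96 hourly rate quoted earlier, and a run that never stalls or restarts. The function name and structure are ours, for illustration only.

```python
import math

P4DE_HOURLY_RATE = 40.96  # USD per instance-hour, the figure quoted in the article


def cloud_training_cost(num_gpus, training_days, hourly_rate, gpus_per_instance=8):
    """Return (instances, hours, total_cost) for one uninterrupted training run."""
    instances = math.ceil(num_gpus / gpus_per_instance)
    hours = training_days * 24
    return instances, hours, instances * hours * hourly_rate


# (GPU count, training time in days) as quoted for each model.
runs = {
    "GPT-3": (1024, 34),
    "Megatron-Turing NLG": (2048, 8 * 7),
}

for name, (gpus, days) in runs.items():
    instances, hours, cost = cloud_training_cost(gpus, days, P4DE_HOURLY_RATE)
    print(f"{name}: {instances} instances x {hours} hours = ${cost / 1e6:.2f} million")
```

The sketch ignores storage, networking, failed runs and any negotiated discounts, so treat the totals as order-of-magnitude figures rather than quotes.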
You might ask why not train on-prem if the cloud is so expensive. This is a valid point, especially if you’re going to be retraining your model constantly, but it still requires a big upfront investment.
Using Nvidia’s Megatron-Turing NLG example from before, you’d need 256 8-GPU nodes. We’ll use Nvidia’s DGX A100 servers as an example. While the cost of these systems varies, we’ve seen pricing in the neighborhood of $175,000 apiece.
For 256 nodes the costs work out to $44.8 million, and that doesn’t consider the power and maintenance required to keep them up and running. Under full load, a 256-node cluster could draw about 1.7 megawatts. Assuming constant retraining, you’re looking at $2.2 million a year in power. Of course, in reality it should be a fair bit less than that.
Proliferation of faster accelerators and lower- and mixed-precision calculations will certainly help, but that’s assuming the models don’t continue to outpace our advances in silicon.
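The on-prem figures can be checked the same way. The sketch below reproduces the hardware bill and then works backwards from the $2.2 million annual power estimate to the electricity price it implies. That price is not stated in the article; it is derived here purely from the quoted numbers, and the function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def cluster_capex(nodes, price_per_node):
    """Upfront hardware cost in USD, excluding facilities, networking and staff."""
    return nodes * price_per_node


def annual_energy_mwh(power_mw, utilization=1.0):
    """Energy consumed over a year at a given average draw, in megawatt-hours."""
    return power_mw * HOURS_PER_YEAR * utilization


def implied_rate_per_kwh(annual_cost, energy_mwh):
    """Electricity price (USD per kWh) implied by an annual bill and energy use."""
    return annual_cost / (energy_mwh * 1000)


capex = cluster_capex(256, 175_000)
energy = annual_energy_mwh(1.7)  # full load, all year round
rate = implied_rate_per_kwh(2_200_000, energy)

print(f"Hardware: ${capex / 1e6:.1f} million")
print(f"Energy at full load: {energy:,.0f} MWh per year")
print(f"Implied electricity price: ${rate:.3f} per kWh")
print(f"Average draw per node: {1.7 * 1000 / 256:.2f} kW")
```

Lowering the `utilization` argument shows how quickly the power bill falls if the cluster isn’t running flat out all year.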
The point of diminishing returns
If we’ve learned anything about human nature, it’s that we’ll take whatever shortcuts we can if it means turning a buck. Massive natural language models like ChatGPT may be impressive, but the sheer cost to train and then retrain them will make them so impractical that only the largest companies can afford to use them to their full potential.
Businesses like Microsoft, which operate massive GPU clusters with tens of thousands of accelerators, are well positioned to do just that, so it’s no surprise the company is making massive investments in companies like OpenAI.
But as AI models and accelerators mature, models tailored to specific applications are likely to proliferate.
We’ve already seen a slew of AI art generators emerge in the wake of Dall-E. But despite failing to deliver the same degree of polish as its rivals, Stable Diffusion’s open source nature and ability to not only be deployed, but trained on consumer hardware, have made it a standout hit.
Stable Diffusion also demonstrates that AI isn’t immune to the rule of diminishing returns. Luxury cars may captivate drivers, but if they can’t afford them, they make do with their Ford or Honda. While it may lack the style or prestige of a luxury brand, it’ll still get them from point A to point B. There’s no reason to think the same won’t be true of AI adoption in the enterprise.
Ultimately, the goal isn’t perfection, it’s mediocrity. As long as the model is good enough – and costs less than having a person do it – the AI will have paid for itself. And as we’ve discussed, there are plenty of corners to cut. ®