What is AI hardware? How GPUs and TPUs give artificial intelligence algorithms a boost

Most computers and algorithms, including many artificial intelligence (AI) applications at this point, run on general-purpose circuits called central processing units, or CPUs. When certain calculations are performed often, however, computer scientists and electrical engineers design special circuits that can do the same work faster or with more accuracy. Now that AI algorithms are becoming so common and essential, specialized circuits or chips for running them are following suit.

The circuits are found in several forms and in different locations. Some offer faster creation of new AI models. They use multiple processing circuits in parallel to churn through millions, billions or even more data elements, searching for patterns and signals. These are used in the lab at the beginning of the process by AI scientists looking for the best algorithms to understand the data.

Others are being deployed at the point where the model is being used. Some smartphones and home automation systems have specialized circuits that can speed up speech recognition or other common tasks. They run the model more efficiently at the place it is being used by offering faster calculations and lower power consumption.

Scientists are also experimenting with newer designs for circuits. Some, for example, want to use analog electronics instead of the digital circuits that have dominated computers. These different forms may offer better accuracy, lower power consumption, faster training and more.

What are some examples of AI hardware?

The simplest examples of AI hardware are the graphics processing units, or GPUs, that have been redeployed to handle machine learning (ML) chores. Many ML packages have been modified to take advantage of the extensive parallelism available inside the average GPU. The same hardware that renders scenes for games can also train ML models because both jobs consist of many independent tasks that can be done at the same time.
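To illustrate the kind of work that parallel hardware handles well, here is a minimal Python sketch of a dot product split into independent chunks. The function names (`partial_dot`, `chunked`, `parallel_dot`) and the chunk size are hypothetical choices made for this example. Python threads will not actually run this arithmetic faster; the point is only the structure, in which no chunk depends on any other, which is what lets a GPU hand each piece to a separate core.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_dot(pair):
    """Dot product of one chunk; it needs no data from any other chunk."""
    xs, ys = pair
    return sum(x * y for x, y in zip(xs, ys))


def chunked(seq, size):
    """Split a sequence into consecutive pieces of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def parallel_dot(a, b, workers=4, chunk_size=250):
    """Compute a dot product by handing independent chunks to workers."""
    pairs = list(zip(chunked(a, chunk_size), chunked(b, chunk_size)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is processed on its own; only the final sum combines them.
        return sum(pool.map(partial_dot, pairs))


a = list(range(1000))
b = [2] * 1000
result = parallel_dot(a, b)
print(result)
```

The same pattern, many small multiply-and-add jobs followed by one combining step, shows up in both rendering and model training.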

Some companies have taken this same approach and extended it to focus only on ML. These newer chips, sometimes called tensor processing units (TPUs), don’t try to serve both game display and learning algorithms. They are completely optimized for AI model development and deployment.

There are also chips optimized for different parts of the machine learning pipeline. Some are better at creating the model because they can juggle large datasets. Others excel at applying the model to incoming data to see if the model can find an answer in them. The latter can be optimized to use less power and fewer resources, which makes them easier to deploy in mobile phones or other places where users will want to rely on AI but not create new models.

Additionally, basic CPUs are starting to streamline their performance for ML workloads. Traditionally, many CPUs have emphasized double-precision floating-point computations because they are used extensively in scientific research. Lately, some chips are emphasizing single-precision floating-point computations because they can be substantially faster. The newer chips trade precision for speed because scientists have found that the extra precision may not be valuable in some common machine learning tasks, where speed matters more.
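The trade-off can be made concrete with a small Python sketch that uses only the standard library. The helper `to_float32` and the sample values are assumptions made for illustration. The sketch compares how many bytes the same numbers occupy in single and double precision, and how much rounding error single precision introduces.

```python
import struct
from array import array


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


values = [0.1, 0.2, 0.3, 0.4]

# Storage: single precision needs half the bytes of double precision,
# so twice as many values fit through the same memory channel.
bytes_single = array("f", values).itemsize * len(values)
bytes_double = array("d", values).itemsize * len(values)

# Accuracy: the relative rounding error per value is very small.
max_relative_error = max(abs(to_float32(v) - v) / v for v in values)

print(bytes_single, bytes_double)
print(max_relative_error)
```

For many models, an error of this size is far below the noise already present in the training data, which is why the extra precision often goes unused.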

In all these cases, many of the cloud providers are making it possible for users to spin up and shut down multiple instances of these specialized machines. Users don’t need to invest in buying their own and can just rent them when they are training a model. In some cases, deploying multiple machines can be significantly faster, making the cloud an efficient choice.

How is AI hardware different from regular hardware?

Many of the chips designed for accelerating artificial intelligence algorithms rely on the same basic arithmetic operations as regular chips. They add, subtract, multiply and divide as before. Their biggest advantage is that they have many cores, often smaller ones, so they can process large amounts of data in parallel.

The architects of these chips usually try to tune the channels for bringing the data in and out of the chip because the size and nature of the data flows are often quite different from general-purpose computing. Regular CPUs may process many more instructions on relatively little data. AI processing chips generally work with large data volumes.

Some companies deliberately embed many very small processors in large memory arrays. Traditional computers separate the memory from the CPU; orchestrating the movement of data between the two is one of the biggest challenges for machine architects. Placing many small arithmetic units next to the memory speeds up calculations dramatically by eliminating much of the time and organization devoted to data movement.

Some companies also focus on creating special processors for particular types of AI operations. The work of creating an AI model through training is much more computationally intensive and involves more data movement and communication. When the model is built, the need for analyzing new data elements is simpler. Some companies are creating special AI inference systems that work faster and more efficiently with existing models.
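The gap between the two workloads shows up even in a toy model. The following Python sketch is a simplified illustration and not any vendor's method: the one-parameter model, learning rate and data are all assumptions chosen for the example. Training loops over the whole dataset hundreds of times, while inference is a single multiplication.

```python
def predict(w, x):
    """Inference: one multiplication per input."""
    return w * x


def train(xs, ys, epochs=200, lr=0.01):
    """Training: repeated passes over the data, adjusting w each time."""
    w = 0.0
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (predict(w, x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x

w = train(xs, ys)             # 200 passes over every sample
prediction = predict(w, 5.0)  # a single multiplication
print(w, prediction)
```

Real models have millions or billions of parameters instead of one, but the asymmetry is the same, which is why chips built only for inference can be smaller and use less power.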

Not all approaches rely on traditional arithmetic methods. Some developers are creating analog circuits that behave differently from the traditional digital circuits found in almost all CPUs. They hope to create even faster and denser chips by forgoing the digital approach and tapping into some of the raw behavior of electrical circuitry.

What are some advantages of using AI hardware?

The main advantage is speed. It is not uncommon for benchmarks to show GPUs running more than 100 or even 200 times faster than a CPU. Not all models and algorithms will speed up that much, though, and some benchmarks show gains of only 10 to 20 times. A few algorithms aren’t much faster at all.
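One common way to reason about why speedups vary so widely is Amdahl's law, which is not named in the benchmarks above and is brought in here only as an explanatory aid. It says the overall gain is limited by the share of the work that cannot be parallelized. The Python sketch below uses hypothetical fractions to show the effect.

```python
def overall_speedup(parallel_fraction, accelerator_speedup):
    """Amdahl's law: total speedup when only part of the job is accelerated."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accelerator_speedup)


# Hypothetical workloads running on hardware that is 100x faster
# on the portion of the work that can run in parallel.
mostly_parallel = overall_speedup(parallel_fraction=0.99, accelerator_speedup=100)
half_parallel = overall_speedup(parallel_fraction=0.50, accelerator_speedup=100)

print(round(mostly_parallel, 1))
print(round(half_parallel, 1))
```

Under these assumptions, a job that is 99% parallel runs about 50 times faster, while one that is only half parallel does not even double in speed.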

One advantage that is growing more important is power consumption. In the right combinations, GPUs and TPUs can use less electricity to produce the same result. While GPU and TPU cards are often big power consumers, they run so much faster that they can end up saving electricity. This is a big advantage when power costs are rising. They can also help companies produce “greener AI” by delivering the same results while using less electricity and consequently producing less CO2.
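The arithmetic behind this is simply energy equals power multiplied by time. The Python sketch below uses made-up wattages and runtimes, not measurements of any real chip, to show how a card that draws three times the power can still use far less energy if it finishes 20 times sooner.

```python
def energy_joules(power_watts, runtime_seconds):
    """Energy used by a device drawing constant power for a given time."""
    return power_watts * runtime_seconds


# Hypothetical figures chosen only to illustrate the trade-off.
cpu_energy = energy_joules(power_watts=100, runtime_seconds=1000)
gpu_energy = energy_joules(power_watts=300, runtime_seconds=1000 / 20)

print(cpu_energy, gpu_energy)
print(gpu_energy / cpu_energy)
```

With these numbers the accelerator uses 15% of the energy the CPU does. If the speedup were only 2 times, the same card would use more energy than the CPU, which is why the combination of hardware and workload matters.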

The specialized circuits can also be helpful in mobile phones or other devices that must rely upon batteries or less copious sources of electricity. Some applications, for instance, rely upon fast AI hardware for very common tasks like waiting for the “wake word” used in speech recognition.

Faster, local hardware can also eliminate the need to send data over the internet to a cloud. This can save bandwidth charges and electricity when the computation is done locally.

What are some examples of how leading companies are approaching AI hardware?

The most common forms of specialized hardware for machine learning continue to come from the companies that manufacture graphics processing units. Nvidia and AMD create many of the leading GPUs on the market, and many of these are also used to accelerate ML. While most of these chips still handle tasks like rendering computer games, some are starting to come with enhancements designed especially for AI.

Nvidia, for example, adds hardware units for mixed-precision operations that are useful for training ML models and calls these Tensor Cores. AMD is also adapting its GPUs for machine learning with an architecture it calls CDNA2. The use of AI will continue to drive these architectures for the foreseeable future.

As mentioned earlier, Google makes its own hardware for accelerating ML, called Tensor Processing Units or TPUs. The company also delivers a set of libraries and tools that simplify deploying the hardware and the models they build. Google’s TPUs are mainly available for rent through the Google Cloud platform.

Google is also adding a version of its TPU design to its Pixel phone line to accelerate any of the AI chores that the phone might be used for. These could include voice recognition, photo improvement or machine translation. Google notes that the chip is powerful enough to do much of this work locally, saving bandwidth and improving speeds because, traditionally, phones have offloaded the work to the cloud.

Many of the cloud companies like Amazon, IBM, Oracle, Vultr and Microsoft are installing these GPUs or TPUs and renting time on them. Indeed, many of the high-end GPUs are not intended for users to purchase directly because it can be more cost-effective to share them through this business model.

Amazon’s cloud computing systems are also offering a new set of chips built around the ARM architecture. The latest versions of these Graviton chips can run lower-precision arithmetic at a much faster rate, a feature that is often desirable for machine learning.

Some companies are also building simple front-end applications that help data scientists curate their data and then feed it to various AI algorithms. Google’s Colab or AutoML, Amazon’s SageMaker, Microsoft’s Machine Learning Studio and IBM’s Watson Studio are just a few examples of options that hide any specialized hardware behind an interface. These companies may or may not use specialized hardware to speed up the ML tasks and deliver them at a lower price, and the customer may never know.

How startups are tackling creating AI hardware

Dozens of startups are approaching the job of creating good AI chips. These examples are notable for their funding and market interest:

D-Matrix is creating a collection of chips that move the standard arithmetic functions to be closer to the data that’s stored in RAM cells. This architecture, which they call “in-memory computing,” promises to accelerate many AI applications by speeding up the work that comes with evaluating previously trained models. The data does not need to move as far and many of the calculations can be done in parallel.
Untether is another startup that’s mixing standard logic with memory cells to create what they call “at-memory” computing. Embedding the logic with the RAM cells produces an extremely dense — but energy efficient — system in a single card that delivers about 2 petaflops of computation. Untether calls this the “world’s highest compute density.” The system is designed to scale from small chips, perhaps for embedded or mobile systems, to larger configurations for server farms.
Graphcore calls its approach to in-memory computing the “IPU” (for Intelligence Processing Unit) and relies upon a novel three-dimensional packaging of the chips to improve processor density and limit communication times. The IPU is a large grid of thousands of what they call “IPU tiles” built with memory and computational abilities. Together, they promise to deliver 350 teraflops of computing power.
Cerebras has built a very large, wafer-scale chip that’s up to 50 times bigger than a competing GPU. They’ve used this extra silicon to pack in 850,000 cores that can train and evaluate models in parallel. They’ve coupled this with extremely high bandwidth connections to suck in data, allowing them to produce results thousands of times faster than even the best GPUs.
Celestial uses photonics — a mixture of electronics and light-based logic — to speed up communication between processing nodes. This “photonic fabric” promises to reduce the amount of energy devoted to communication by using light, allowing the entire system to lower power consumption and deliver faster results.

Is there anything that AI hardware can’t do?

For the most part, specialized hardware does not execute any special algorithms or approach training in a better way. The chips are just faster at running the algorithms. Standard hardware will find the same answers, but at a slower rate.

This equivalence doesn’t apply to chips that use analog circuitry. In general, though, the approach is similar enough that the results won’t necessarily be different, just faster.

There will be cases where it may be a mistake to trade off precision for speed by relying on single-precision computations instead of double-precision, but these may be rare and predictable. AI scientists have devoted many hours of research to understand how to best train models and, often, the algorithms converge without the extra precision.
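One of the predictable failure cases is adding many small numbers to a large running total. The Python sketch below simulates single precision with a hypothetical helper, `to_float32`, and the starting value is chosen to expose the problem: above 2 to the 24th power, a 32-bit float can no longer represent every whole number, so adding 1 has no effect.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_ones(start, count, single_precision):
    """Add 1.0 to a running total `count` times at the chosen precision."""
    total = start
    for _ in range(count):
        total = total + 1.0
        if single_precision:
            total = to_float32(total)
    return total


start = 16777216.0  # 2**24, the last point where float32 counts by ones

single = add_ones(start, 100, single_precision=True)
double = add_ones(start, 100, single_precision=False)

print(single - start)  # increments lost in single precision
print(double - start)  # increments kept in double precision
```

In single precision all 100 additions vanish, while double precision keeps every one. Training code that accumulates long sums usually guards against this, for example by keeping the running total in higher precision.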

There will also be cases where the extra power and parallelism of specialized hardware contribute little to finding the solution. When datasets are small, the advantages may not be worth the time and complexity of deploying extra hardware.
