Sponsored Feature: Arm is starting to fulfill its promise of transforming the nature of compute in the datacenter, and it is getting some big help from traditional chip makers as well as the hyperscalers and cloud builders that have massive computing requirements and that need to drive efficiency up and costs down each and every year.
To do this, nearly all of the hyperscalers and cloud builders have turned to Arm. And in a lot of cases they have created their own custom CPUs built to the standardized Arm ISA, even as they are also designing specialized accelerators crammed with vector and tensor cores to boost the performance of the matrix math at the heart of modern AI models. Thankfully, these CPUs and XPUs are also useful for doing the calculations underlying HPC simulations.
Here is the neat bit of this Cambrian evolutionary explosion for compute. Even as semiconductor manufacturing processes get more complex and more expensive as they shrink, and even as the demands on CPUs and XPUs get heavier, Arm is making it easier, faster, and cheaper to create custom silicon, which creates a virtuous feedback loop for compute that all IT shops can benefit from in different ways.
"The way we look at AI, there's a great opportunity for us to support companies that are building their own accelerators," Dermot O'Driscoll, vice president of product solutions at Arm, tells The Next Platform. "But first up there is going to be a lot of on-CPU machine learning, too. There is a lot of machine learning inference already happening on CPUs, and we are doing a lot of investment in the full stack to making sure that that on-CPU inference runs super-flawlessly and friction free on Arm platforms. There's a lot of pull for that on the endpoints and clients. As an integral part of our Kleidi initiative we are going to expand that effort out to infrastructure. We want to make sure that people understand that machine learning is not just a GPU problem."
Having a lot of inference run on the CPU, particularly to drive down latency and to simplify the security of AI models, makes sense for many of the existing workloads in the enterprise, which will have AI features and functions added to them where they are. AI is, in reality, just another set of algorithms sitting alongside the existing ones that embody the activities and transactions of a company, which is why CPUs have needed an increasing amount of vector and tensor math capability.
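What does on-CPU inference look like in practice? Here is a minimal sketch, assuming PyTorch and a hypothetical classifier standing in for an AI feature added to an existing application; on Arm servers, framework builds that pull in Kleidi-optimized kernels can accelerate this same path without application changes.

```python
# A minimal sketch of on-CPU inference, assuming PyTorch is installed; the
# model and shapes are hypothetical stand-ins for an existing enterprise service.
import torch
import torch.nn as nn

torch.set_num_threads(8)  # pin inference work to a fixed pool of CPU cores

# Hypothetical classifier standing in for an AI feature bolted onto an app.
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 16),
).eval()

batch = torch.randn(32, 512)  # a batch of feature vectors from the application

with torch.inference_mode():  # no autograd bookkeeping on the inference path
    scores = model(batch)

print(scores.shape)  # torch.Size([32, 16])
```

The point is that the model runs next to the data and the transaction logic that feed it, with no round trip to a separate accelerator tier.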
O'Driscoll estimates that somewhere between 50 and 80 organizations in the world today are designing their own discrete AI accelerators, and serving these organizations is the second pillar in the Arm AI strategy. As we noted above, just about all of the hyperscalers and cloud builders have designed their own Arm server CPUs as well as acquired them from Ampere Computing, and they are doing this to gain more control over the features in, and timing of, the CPUs they add to their server fleets and to drive down the costs of those fleets as well. At a certain scale, it makes sense to eliminate Intel, AMD, and others as middlemen; why pay for their profits? Why cede control? The Arm ISA is as popular as the X86 ISA at this point, and software runs equally well on both.
CPUs are not, however, massively parallel compute engines with huge amounts of high bandwidth memory to balance out that compute and make it useful. For more than a decade now, a GPU cluster has been able to offer at least 10X the performance of a CPU cluster on HPC and AI workloads for 3X the cost, which is why GPU acceleration took off in the late 2000s for HPC simulation and why the AI revolution adopted this new-fangled, GPU-accelerated HPC architecture as its initial platform. HPC could no longer be done affordably on all-CPU systems, and we would argue that AI as we know it today would not have been possible without GPU-accelerated systems, either technically or economically.
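To make that rule of thumb concrete, here is a back-of-the-envelope calculation; the absolute figures are illustrative placeholders, and only the 10X performance and 3X cost ratios come from the argument above.

```python
# Back-of-the-envelope price/performance comparison; baseline figures are
# normalized placeholders, only the 10X and 3X ratios come from the text.
cpu_cluster_perf = 1.0      # normalized throughput of an all-CPU cluster
cpu_cluster_cost = 1.0      # normalized cost of that cluster

gpu_cluster_perf = 10.0 * cpu_cluster_perf   # at least 10X the performance
gpu_cluster_cost = 3.0 * cpu_cluster_cost    # for roughly 3X the cost

cpu_perf_per_dollar = cpu_cluster_perf / cpu_cluster_cost
gpu_perf_per_dollar = gpu_cluster_perf / gpu_cluster_cost

print(gpu_perf_per_dollar / cpu_perf_per_dollar)  # ~3.3X better price/performance
```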
Which is why AI is driven by the Nvidia or AMD GPU or its analogs - TPUs from Google, Trainium from Amazon, RDUs from SambaNova Systems, LPUs from Groq, waferscale CS-2s from Cerebras Systems, and so forth. And these economic and technical realities, and the massive profits that Nvidia has been able to extract for its GPUs and interconnects, are why those 50 to 80 organizations are building their own AI accelerators. And even though Arm does not build a big, discrete GPU or XPU compute engine, it does have a role to play.
"Now we have a lot of companies - hyperscalers in particular - building their own accelerators and building their own CPUs, and one of the key facets is how quickly and how well those talk to each other," O'Driscoll says. "And what is the relationship between the two? So can they build custom silicon for their CPU that enables them to do a better job of talking to the accelerator, whether PCI-Express or something else."
O'Driscoll explains that there is a need for high bandwidth data paths and control plane paths between the CPU and the GPU, and that Arm is continuing to evolve what it is doing in the bus and interconnect space to help with this when a CPU is built on Arm technology. Nvidia has shown the way with its NVLink high-bandwidth ports and memory coherence protocol, and now the industry needs a standard that any CPU and any accelerator can adopt.
And that brings up the third pillar of the Arm strategy: chiplets and the interconnects that tie them together, embodied in the Chiplet System Architecture (CSA), the Compute Subsystems (CSS) intellectual property packages, and the Arm Total Design collective of companies offering EDA tools, design expertise, foundry support, and firmware and higher-level software to sit atop designs. With this approach, those who want their own custom CPUs and XPUs can get them without trying to do everything themselves, starting with a mix of components and chiplets that can be assembled into unique combinations for specific classes of HPC and AI work.
But this is only the first good example of how to lash CPUs tightly to accelerators.
"All of these companies who are building an accelerator are going to, sooner or later, find out they need compute very tightly coupled to that accelerator," explains O'Driscoll. "This is the Grace /Hopper model. There are definitely players who want HBM on the CPU, but those are not broad markets and those tend to be for classic HPC. Today, people don't want to run the HBM through the CPU mesh to the GPU. They want the GPU and the HBM to be stacked very, very closely to each other. They're spending an awful lot of money on HBM, so they want those to be tightly coupled. However, they know there's a limited capacity for HBM on the XPU, and that's why they are looking at the model that Nvidia built, which is to have a much larger store of DRAM on the CPU, very tightly coupled to the GPU, and use the HBM purely for short term latency, short term bandwidth. Everybody's going to need compute with their accelerator - today it's over PCI-Express, tomorrow it will be probably over AMBA-CHI within a package when they get to that point. And when they get to that point, the easiest way to create this compute complex is to take the CSS that we are delivering because there isn't another ecosystem offering chiplets of compute. As we build out the Arm Total Design partnerships, that's going to mean that there are actually companies who have those chiplets ready to go. So we're using CSS to enable the Arm Total Design partners to enable the chiplets, and the chiplets then can be co-packaged with the accelerator of choice of the end customer."
There are all kinds of possibilities here. Arm cores on the accelerator socket, making it self-hosted. Various homegrown superchips with multiple sockets. Arm CPUs with integrated vector and tensor accelerators. And those possibilities are precisely the point: you can co-design datacenter hardware with the software it runs, increasing efficiency and cutting costs.