Cerebras has crammed even more compute into its wafer-scale chip, in the form of a second-generation Wafer Scale Engine which has migrated to TSMC’s 7nm process node.
The Wafer Scale Engine 2 (WSE2), a chip the size of an entire wafer, has 2.6 trillion transistors (by comparison, the largest GPU on the market has around 54 billion). The WSE2 is a staggering 46,225 mm² of silicon with 40 GB of on-chip memory (up from the first generation’s 18 GB), 20 PB/s of memory bandwidth and 220 Pb/s of fabric bandwidth (up from 9 PB/s and 100 Pb/s, respectively). It is designed for AI workloads in large-scale data center and HPC applications.
Describing the WSE2 as the size of a dinner plate, Cerebras CEO Andrew Feldman told EE Times the new chip has more than double the number of processor cores – 850,000, up from the previous generation’s 400,000.
“We were able to move about 2.3X on every performance dimension, so this is a massive step forward,” he said.
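The “about 2.3X” figure can be sanity-checked against the published specs. The numbers below come from the article; the ratio calculation itself is just illustrative arithmetic:

```python
# Generation-over-generation ratios from the specs quoted above.
wse1 = {"cores": 400_000, "sram_gb": 18, "mem_bw_pbs": 9, "fabric_bw_pbs": 100}
wse2 = {"cores": 850_000, "sram_gb": 40, "mem_bw_pbs": 20, "fabric_bw_pbs": 220}

ratios = {k: wse2[k] / wse1[k] for k in wse1}
for k, r in ratios.items():
    print(f"{k}: {r:.2f}x")
# cores ~2.12x, on-chip memory ~2.22x, memory bandwidth ~2.22x, fabric ~2.20x
```

Every dimension lands between roughly 2.1X and 2.3X, consistent with the quoted figure.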
While the first-generation WSE was built on TSMC’s 16nm process node, the new device has migrated to 7nm.
“When you invent a new technology that allows you to yield the largest part ever, the question is, have you solved this problem at its root, or did you do it just for the special case at 16nm?” Feldman said. “The answer is that we solved it for the general case, now we can build wafer scale parts at any geometry.”
Moving to the 7nm node poses no additional yield problems, he said.
“Our strategy began with the assumption we would face flaws [in the crystalline structure of the silicon],” he said. “When there’s a flaw, we don’t throw away the wafer. We map around it just like the data center guys do, when there’s a flaw in one of their servers, they shut it down and they map around it… this technique is only possible if you have extreme numbers of identical units, because otherwise the cost of carrying redundant elements is very, very high. And so our yields, even at 7 nm are much, much higher than much smaller parts like the GPU.”
Beyond the 850,000 processor cores, the wafer also carries roughly 1% redundant cores specifically to cover such defects. Connections between nearest neighbors allow defective cores to be skipped with “almost no penalty,” Feldman said. He conceded that defect patterns taking out multiple adjacent cores would be harder to overcome, but added that these are “extraordinarily rare. Defects at a beautifully-run fab like TSMC are stochastically distributed, they are randomly distributed, and so our yield is extraordinarily high.”
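Feldman’s yield argument can be made concrete with a back-of-the-envelope calculation. The per-core defect rate below is an assumption for illustration (the article gives no actual figure); the point is that with independently scattered defects, the 1% spare budget dwarfs the expected defect count:

```python
import math

CORES = 850_000
SPARES = int(0.01 * CORES)      # 1% redundant cores = 8,500 spares
DEFECT_RATE = 1e-3              # assumed per-core defect probability (illustrative)

expected_defects = DEFECT_RATE * CORES                      # 850 bad cores on average
sigma = math.sqrt(CORES * DEFECT_RATE * (1 - DEFECT_RATE))  # binomial std dev, ~29

# With randomly (independently) distributed defects, the bad-core count
# concentrates tightly around its mean, so the spare budget is exceeded
# essentially never: the margin is hundreds of standard deviations.
margin = (SPARES - expected_defects) / sigma
print(f"expected defects: {expected_defects:.0f}, spares: {SPARES}, "
      f"margin: {margin:.0f} sigma")
```

Clustered defects that take out many adjacent cores at once are exactly the case this independence assumption excludes – the case Feldman calls “extraordinarily rare.”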
As well as moving to a more advanced process node, Cerebras has improved its microarchitecture, drawing on insights gained from customer deployments of the first-generation WSE. A great deal of work has also gone into making sure Cerebras’ software scales seamlessly from 400,000 to 850,000 cores.
The WSE2 is programmed “in exactly the same way you program a GPU,” Feldman said. Unmodified TensorFlow or PyTorch code can run on the WSE2 with just one additional line of code. Cerebras’ compiler allocates neural network layers to regions of compute on the chip, then creates a circuit that runs through all the layers and sends data through it.
“By doing it this way, we are able to keep a huge amount of data on the wafer, which is extremely power efficient and extraordinarily low latency,” Feldman said. “The approach that everybody else uses, building a cluster with lots of little parts, has a great deal of complexity associated with it. You have to figure out how to break up your work into lots of little parts.”
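The placement scheme Feldman describes – layers mapped to dedicated regions of the wafer, with activations streaming between them – can be sketched as a toy allocator. The proportional-to-compute heuristic and all the numbers here are assumptions for illustration, not Cerebras’ actual compiler logic:

```python
TOTAL_CORES = 850_000

def allocate_regions(layer_flops):
    """Give each layer a core region proportional to its share of total compute,
    so activations can stream through the layers pipelined across the wafer."""
    total = sum(layer_flops)
    return [round(TOTAL_CORES * f / total) for f in layer_flops]

# Hypothetical 4-layer network with relative per-layer FLOP counts.
regions = allocate_regions([2, 6, 6, 1])
print(regions)  # cores assigned to each layer's region
```

Because every layer stays resident on the wafer, activations pass directly from one region to the next rather than leaving the chip – the power and latency point Feldman makes above.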
Feldman pointed out that clusters of parts require different TensorFlow/PyTorch distributions, careful consideration of memory and bandwidth, and specialized tools. Machine learning rates and hyperparameters have to be adjusted to suit the environment.
“You have to do none of that with our system,” he said. “You can take your work as designed for one GPU and point it at our system and get this huge performance gain. You can’t take the work for one GPU and point it at 20 GPUs – you’d have to do all this extra work.”
Cerebras already has multiple announced customers in the world of supercomputing, including Argonne National Laboratory, Lawrence Livermore National Laboratory, Pittsburgh Supercomputing Center, and the supercomputing center at the University of Edinburgh. Argonne has said the first-generation WSE it deployed reduced experiment turnaround time on cancer prediction models by a factor of 300, allowing years’ worth of work to be done in months.
High-profile enterprise customers include GlaxoSmithKline, and others in the pharmaceutical and financial industries. The company is also seeing interest from hyperscale cloud companies, both domestically and internationally, Feldman said.
Cerebras currently has 300 engineers across its offices in Silicon Valley, Toronto, San Diego and Tokyo. The startup has raised more than $475 million; its most recent round, a Series E in November 2019, valued it at $2.4 billion post-money.
The WSE2, like its predecessor, will be sold in a 26-inch-tall (15U) rack-mount system. The Cerebras CS-2 incorporates the power supplies, fans and liquid cooling needed to run the WSE2.
CS-2 systems will be available in Q3 2021.
The post Cerebras Crams More Compute Into Second-Gen ‘Dinner Plate Sized’ Chip appeared first on EETimes.