Nvidia this month unexpectedly released an updated GPU roadmap with new products every year.
The new GPUs for 2024-2026 came despite customers lining up for the red-hot A100 and H100 GPUs for their AI computing needs.
Tesla was among the companies waiting to receive Nvidia GPUs and finally received a batch of 10,000 H100s to power its AI operations, CEO Elon Musk said during an earnings call last week.
Nvidia clearly is not resting on its laurels but declined to comment on its roadmap.
Industry observers suggested Nvidia could leverage chiplets, advanced packaging, and manufacturing technologies to advance chips on an unprecedented yearly basis. Also, Nvidia’s roadmap may be a placeholder, and the company doesn’t have an obligation to deliver on it.
Andrew Feldman, the CEO of Cerebras Systems, felt differently. He called Nvidia’s roadmap a “predatory pre-announcement” and said the company was using deceptive practices and its dominant position to hinder competition.
Feldman offered HPCwire his unabashed opinion of why Nvidia’s roadmap may not be realistic and why it could turn customers off.
Feldman is one of the most vocal critics of Nvidia, but he also has the pedigree as the architect of the world’s largest AI chip. He also talked about how Cerebras’ integrated chip development approach, albeit at wafer scale, is still important in a world heading toward chiplets.

HPCwire: What do you think of Nvidia’s yearly product roadmap?
Andrew Feldman: I think this is very likely a predatory pre-announce. It’s hard to say. Is the pre-announcement because they want to do it or because it helps confuse the market? I think it’s the latter.
What Cisco did – they pre-announced a three-phase program that supposedly solved world peace but never got to phase two, let alone three.
HPCwire: What was Cisco’s predatory pre-announce affair?
Feldman: In the late 90s, suddenly, there were a whole bunch of competitors that were eating Cisco’s lunch. And they couldn’t do their engineering as fast.
They put out a three-phase plan that would take five years. The whole kitchen sink got thrown in. It froze the market for a little bit and gave their engineering a chance to sort of catch up. They never delivered on all three phases, ever.
In many ways, it has been a terrible block of time for Nvidia. Stability AI said they were going to go on Intel. Amazon said that Anthropic was going to run on them. We announced a monstrous deal that would produce enough compute so it would be clear that you could build… large clusters with us.
[Nvidia’s] response, not surprising to me, in the strategy realm, is not a better product. It’s… throw sand up in the air and move your hands a lot. And you know, Nvidia was a year late with the H100.
HPCwire: It’s an interesting time… you can accelerate roadmaps with chiplets and advances in manufacturing. You can add different parts, especially SRAM and analog, which cannot scale to three nanometers.
Feldman: Companies have been making chips for a long time, and nobody has ever been able to succeed on a one-year cadence because the fabs don’t change at a one-year pace.
That means you are paying a huge amount of money to wait for masks and not getting enough time to amortize the cost of those masks. Your vendor doesn’t make money on masks; they make money on the runs.
I think of that as not designing a new chip but modifying the package. You might be able to swap chiplets at regular intervals, but remember, that means every nine months, you will piss off a customer by selling them a chip that’s out of date three months later.
If they are changing the package, it’s really a smaller lift. It puts some pressure on your software team. And it really puts pressure on your customers … every nine months, everything they bought is suddenly moved off the cutting edge in favor of some other product.
HPCwire: Cerebras has gone big, with everything integrated into one giant wafer. Others are going in another direction, decomposing integrated chips into chiplets. Why don’t you do the same?
Feldman: There are two ways to look at it. One is that they are going small, but the other is that they weren’t good enough to go big. They need more silicon, too, and they are just doing it on lots of little pieces of separate silicon.
We can put it on one piece of silicon, but they want more total silicon. And they [Nvidia] are using an 800 mm² main chip, and then they are using lots of memory chips, and then they are using IO chips. And all of that. We just went with a big chip.
I think both ways try to use more silicon area. We used it on one undiced wafer. They have broken it up into many little pieces that need to be reassembled on a motherboard or the package.
At the highest level, there is absolute agreement that you need more silicon area, and we need more transistors for these problems. Whether you do it with one big chip or lots of little chips is an implementation detail of the general idea that you need more silicon.
HPCwire: How do you look at chip design going into the future?
Feldman: We have the most memory bandwidth. We have huge amounts of IO, and I think everybody wants more.
Thinking about how to get more is hugely important. Whether it’s with chiplets, other techniques, stacking, or other innovative approaches, everybody is looking for more memory bandwidth because these problems are memory bandwidth constrained. And that is why we are faster than GPUs. But nobody is standing still.
HPCwire: How do you pack more memory in integrated versus the chiplet design approach?
Feldman: SRAM is on your main die. It’s the memory that lives next to compute. If you have a limited-size chip, like 800 mm² for the H100, every square millimeter you give to SRAM, you take away from a core. You have this dilemma: you can put more memory on chip, which is blisteringly fast, or you can have more compute.
What has been done on GPUs is that they have skinnied up the SRAM on the chip in favor of DRAM or HBM off-chip, which costs a ton. It’s a hard problem. That is why we went to wafer scale, so we could slam down a huge amount of SRAM and a huge number of cores. That’s what all these architectural decisions are about.
HPCwire: Is the advantage the bandwidth?
Feldman: That’s it. That’s how you get on and off of the chip. That’s how you power the chip. These are fundamental elements often overlooked: the package delivers power and IO.
Our decision to put everything on one wafer vastly simplified our ability to communicate across the equivalent of hundreds of GPUs. They have to put switches down, invent NVLink, and then they’ve got some of their customers that don’t buy NVLink and have to use InfiniBand or Ethernet. We move faster at 1,000th of the power, 1,000 times as fast.
[Nvidia] recognizes now that they are going to need more IO, do some chiplets, and those are going to spin at a different rate than their main processors. But they are attacking the same fundamental problem, which is: how on earth do we get more silicon to bear on the problem?
HPCwire: Chiplets seem better for technologies like analog chips, which cannot scale to the cutting edge. How do you overcome that with your integrated approach?
Feldman: Historically, there were parts on your chip, specifically SERDES (serializer/deserializer, a transceiver that converts parallel data to serial data and vice versa), that were analog. And that IP was not moving at the same speed as the rest of the CMOS design, the rest of your logic. We designed around that problem early on.
Our view was that it’s a huge problem, and it is also a huge problem that you are likely to buy SERDES from a small number of vendors, and they are extraordinarily expensive. Why don’t we design them out completely? So, instead of disaggregating them, we designed them out.
HPCwire: Where is the complexity in AI chip design – is it in learning or inferencing?
Feldman: Inference is an easy problem, except generative inference, which is a very hard problem and very memory and bandwidth-intensive. All the inference we do on images is a trivial problem.
Generative AI is a very hard inference problem. GPUs are very bad at it. And we all do it right now. But CPUs did it for a while. I think you will see a whole bunch of new parts coming out over the next 6-9-12 months that will be better at it.
But it is a very, very hard problem; it is extremely memory intensive because you are generating each token based on the previous tokens, and that is a linked problem, and you are doing that within a context. And that’s memory, memory, memory.
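To make the "memory, memory, memory" point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a transformer decoder that keeps a key/value cache for every earlier token, which is a common design but is not something Feldman names in the interview. The model dimensions (`n_layers`, `n_heads`, `head_dim`) and the 2-byte value size are hypothetical, chosen only for illustration.

```python
def kv_cache_bytes(context_tokens, n_layers, n_heads, head_dim, bytes_per_value=2):
    """Bytes of cached keys and values held for one sequence of context_tokens."""
    # Factor of 2: one key vector and one value vector per head, per layer, per token.
    per_token = 2 * n_layers * n_heads * head_dim * bytes_per_value
    return context_tokens * per_token


def bytes_read_to_generate(prompt_tokens, new_tokens, **model):
    """Total cache bytes read while generating new_tokens one token at a time."""
    total = 0
    for step in range(new_tokens):
        # Each new token attends to every token that came before it,
        # so the whole cache built so far is read again at every step.
        total += kv_cache_bytes(prompt_tokens + step, **model)
    return total


# Hypothetical model shape, not taken from the article.
model = dict(n_layers=32, n_heads=32, head_dim=128)

cache_per_token = kv_cache_bytes(1, **model)      # bytes added per token of context
cache_at_4k = kv_cache_bytes(4096, **model)       # bytes held at a 4,096-token context
traffic = bytes_read_to_generate(10, 3, **model)  # bytes read to emit 3 tokens after a 10-token prompt

print(cache_per_token, cache_at_4k, traffic)
```

Under these assumed dimensions, every generated token re-reads the entire cache, so memory traffic grows with context length even though the arithmetic per token stays small.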
HPCwire: Sparsity and keeping data closer to processing seems to be a big deal in your AI stack.
Feldman: Sparsity gives you an advantage in every step. You don’t store things you aren’t going to use. It isn’t going to provide any new information. You don’t transport bits that don’t carry information. In each of those, you can think of it as a form of compression. You compress the amount of data you need to move so you get more bang for your bandwidth. Each of those is fundamental to the way we think about the problem.
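The idea of sparsity as compression can be sketched in a few lines of Python. This is a generic illustration of skipping zeros, not a description of how Cerebras hardware implements it; the vectors, the 2-byte value size, and the 2-byte index size are all hypothetical.

```python
def to_sparse(vector):
    """Keep only the nonzero entries as (index, value) pairs."""
    return [(i, v) for i, v in enumerate(vector) if v != 0]


def sparse_dot(sparse_a, dense_b):
    """Dot product that touches only the nonzero entries of a."""
    return sum(v * dense_b[i] for i, v in sparse_a)


def bytes_moved(n_entries, bytes_per_value=2, bytes_per_index=0):
    """Bytes transferred for n_entries values, plus an optional index per value."""
    return n_entries * (bytes_per_value + bytes_per_index)


# Hypothetical data: 5 of the 8 weights are zero and carry no information.
weights = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 0.5]
activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

sparse_weights = to_sparse(weights)
dense_result = sum(w * a for w, a in zip(weights, activations))
sparse_result = sparse_dot(sparse_weights, activations)

dense_bytes = bytes_moved(len(weights))                             # every value is moved
sparse_bytes = bytes_moved(len(sparse_weights), bytes_per_index=2)  # nonzeros plus their indices

print(dense_result, sparse_result, dense_bytes, sparse_bytes)
```

Both paths give the same result, but the sparse one moves fewer bytes and does fewer multiplies. The index overhead is why this pays off only when enough of the values are zero.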
HPCwire: You are still at 7-nm. Nvidia carries a significant advantage in process. With the chip being on a wafer, does the nanometer process even matter for you?
Feldman: Our ability to put transistors down is one of humanity’s crowning achievements. That we can put transistors down at five or three nanometers is extraordinary. The gains you get are real and meaningful, and that cannot be ignored.
However, in the most recent generation, [Nvidia] didn’t come with any pricing advantage. The H100 is roughly twice [the size of] A100; it has roughly twice as many transistors. So you got twice the compute for twice the price. And that is not a big gain historically.
Your choices are to invent things like we did. And put 46,000 square millimeters of silicon. If you don’t want to invent things, you will reorganize chips at 800 mm² and smaller.
It’s like saying, ‘Oh, look, we can put two on a motherboard.’ Okay. ‘Oh look, we can tie two together with an NVLink switch and put a CPU complex.’ Okay. ‘Look, we can put a chip down and another little chiplet that helps it with IO.’ Each of those is the same but slightly different in the grand scheme of things. It’s tossing your salad differently.
HPCwire: What have you got coming up?
Feldman: I can’t share it with you right now. This industry is a treadmill. Either you are moving forward, or you are racing backward. There is all sorts of really interesting stuff that will be announced over time. Right now, we are building and selling a huge amount of [silicon].
