Inside the 100K GPU xAI Colossus Cluster that Supermicro helped build for Elon Musk

XAI Colossus Data Center Compute Hall

Today we’re releasing our tour of the xAI Colossus Supercomputer. For those who have heard stories about Elon Musk’s xAI building a massive AI supercomputer in Memphis, this is that cluster. With 100,000 NVIDIA H100 GPUs, this multibillion-dollar AI cluster is notable not only for its size, but also for the speed at which it was built: the teams stood up this gigantic cluster in just 122 days. Now we can show you the inside of the building.

Of course we have a video of this that you can find on X or on YouTube:

Normally at STH, we do everything completely independently. This one was different: Supermicro is sponsoring it, because it is by far the most expensive piece for us to produce this year. Also, some things will be blurred, or I will be deliberately vague, due to the sensitivity around building the largest AI cluster in the world. We received special approval from Elon Musk and his team to show you this.

Supermicro liquid cooled racks at xAI

The basic building block for Colossus is the Supermicro liquid-cooled rack. This comprises eight 4U servers, each with eight NVIDIA H100s, for a total of 64 GPUs per rack. Eight of these GPU servers plus a Supermicro Coolant Distribution Unit (CDU) and associated hardware make up one of the GPU compute racks.

XAI Colossus Data Center Supermicro Liquid Cooled Nodes Low Angle

These racks are arranged in groups of eight, for 512 GPUs plus networking, to provide mini-clusters within the much larger system.
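For readers keeping track of the arithmetic, here is a minimal sketch in Python of the hierarchy described above. The per-server, per-rack, and per-group figures come straight from the tour; the derived rack and group counts for 100,000 GPUs are our own back-of-the-envelope estimates, not numbers from xAI or Supermicro.

```python
# Back-of-the-envelope sketch of the Colossus GPU hierarchy described above.
# Per-rack and per-group numbers are from the article; the totals derived
# from them are our own arithmetic, not official xAI/Supermicro figures.
import math

GPUS_PER_SERVER = 8    # NVIDIA HGX H100: eight GPUs per 4U Supermicro server
SERVERS_PER_RACK = 8   # eight 4U servers plus a CDU per liquid-cooled rack
RACKS_PER_GROUP = 8    # racks are arranged in groups of eight

gpus_per_rack = GPUS_PER_SERVER * SERVERS_PER_RACK    # 64
gpus_per_group = gpus_per_rack * RACKS_PER_GROUP      # 512

total_gpus = 100_000
racks_needed = math.ceil(total_gpus / gpus_per_rack)    # ~1,563 racks
groups_needed = math.ceil(total_gpus / gpus_per_group)  # ~196 rack groups

print(f"{gpus_per_rack} GPUs per rack, {gpus_per_group} GPUs per rack group")
print(f"~{racks_needed} racks / ~{groups_needed} groups for {total_gpus:,} GPUs")
```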

XAI Colossus Data Center Supermicro 4U universal GPU liquid-cooled servers

Here, xAI uses the Supermicro 4U Universal GPU system. These are the most advanced AI servers on the market today for a number of reasons. One is the degree of liquid cooling; the other is how serviceable they are.

XAI Colossus Data Center Supermicro 4U Universal GPU Liquid-Cooled Server Close

We first saw the prototype for these systems about a year ago at Supercomputing 2023 (SC23) in Denver. We were unable to open one of these systems in Memphis because training jobs were running while we were on site. One example of that serviceability is the way the system sits on trays that can be serviced without removing systems from the rack. The 1U rack manifold brings cool liquid to, and carries heated liquid away from, each system. Quick disconnects make it fast to get the liquid cooling out of the way; last year we showed how they can be removed and installed with one hand. Once these are removed, the trays can be pulled out for maintenance.

Supermicro 4U universal GPU system for liquid-cooled NVIDIA HGX H100 and HGX H200 at SC23

Luckily, we have images of the prototype of this server, so we can show you what’s inside these systems. In addition to the 8-GPU NVIDIA HGX tray that uses custom Supermicro liquid cooling blocks, the CPU tray shows why this is a next-level design that is unparalleled in the industry.

Supermicro 4U universal GPU system for liquid-cooled NVIDIA HGX H100 and HGX H200 at SC23

The two x86 CPU liquid cooling blocks in the SC23 prototype above are fairly common. What is unique is on the right. Supermicro’s motherboard integrates the four Broadcom PCIe switches used in almost every HGX AI server today, rather than placing them on a separate board, and a custom liquid cooling block then cools those four PCIe switches. Other AI servers in the industry are built by adding liquid cooling to an air-cooled design. Supermicro’s design is liquid-cooled from the start, all from one supplier.
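To make the "liquid-cooled from the start" point concrete, here is a small illustrative sketch (our own, not Supermicro documentation) contrasting which major component groups in an HGX H100 node get cold plates in a ground-up design like this one versus a typical retrofit of an air-cooled chassis. The component list reflects what is described and pictured in this article; exact coverage in retrofit designs varies by vendor.

```python
# Illustrative comparison (our own sketch, not vendor documentation) of which
# major HGX H100 node components carry liquid cooling blocks in a ground-up
# liquid-cooled design versus a retrofitted air-cooled design.
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    liquid_cooled: bool

# Ground-up design as described above: GPUs, NVSwitches, CPUs, and the four
# board-integrated Broadcom PCIe switches all get cold plates from one supplier.
ground_up_design = [
    Component("8x NVIDIA H100 GPUs (HGX tray)", True),
    Component("NVSwitch fabric on the HGX board", True),
    Component("2x x86 CPUs", True),
    Component("4x Broadcom PCIe switches (on the motherboard)", True),
]

# Typical retrofit: cold plates are added to the hottest parts of an existing
# air-cooled design, and the rest still relies on airflow. Illustrative only.
retrofit_design = [
    Component("8x NVIDIA H100 GPUs (HGX tray)", True),
    Component("NVSwitch fabric on the HGX board", True),
    Component("2x x86 CPUs", True),
    Component("4x Broadcom PCIe switches (separate board)", False),
]

for label, design in [("Ground-up", ground_up_design), ("Retrofit", retrofit_design)]:
    covered = sum(c.liquid_cooled for c in design)
    print(f"{label}: {covered}/{len(design)} component groups on liquid cooling")
```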

Supermicro SYS 821GE TNHR NVIDIA H100 and NVSwitch liquid cooling pads

It’s analogous to cars: some are designed to run on gas first, with an EV drivetrain later mounted to the chassis, versus EVs that are designed from the ground up to be EVs. This Supermicro system is the latter, while other HGX H100 systems are the former. We’ve had hands-on experience with most of the public HGX H100/H200 platforms since launch, as well as some of the hyperscale designs. Make no mistake, there is a big gap between this Supermicro system and others, including some of Supermicro’s other designs that can be either liquid or air cooled and that we have discussed previously.
