Meta has announced two new GPU clusters that will allow the company to provide enhanced infrastructure to meet the demanding computational requirements of artificial intelligence (AI) systems.
In what it described as a “major investment in the future of Meta AI,” the company announced the addition of two 24k GPU data center-scale clusters offering increased performance and reliability for AI workloads.
These GPUs will support both Meta's current Llama 2 model and its upcoming Llama 3 model, as well as the company's broader research and development projects in generative AI and other areas.
The company described the announcement as “a step in our ambitious infrastructure roadmap,” which will see the tech giant acquire 350,000 Nvidia H100 GPUs to expand its portfolio.
Meta said the expansion project will deliver total computing power equivalent to nearly 600,000 H100s upon completion.
“As we look to the future, we recognize that what worked yesterday or today may not be sufficient for the needs of tomorrow,” the company said in a statement.
“That's why we constantly evaluate and improve every aspect of our infrastructure, from the physical and virtual layers to the software layer and beyond.”
Meta said it focused on building “end-to-end” AI systems with its latest pair of GPU clusters, emphasizing the experience of researchers and developers as a guide for production.
With high-performance network fabrics connecting 24,576 Nvidia Tensor Core H100 GPUs, the new clusters can support “larger and more complex” models than Meta's previous Research SuperCluster (RSC).
One of the new clusters was built with remote direct memory access (RDMA) over converged Ethernet (RoCE), while the other features an Nvidia Quantum-2 InfiniBand fabric, both aimed at improving network performance.
Both clusters were built using Meta's in-house open GPU hardware platform, Grand Teton, which builds on multiple generations of AI systems that integrate “power, control, compute and fabric interfaces into a single chassis for better overall performance.”
“Grand Teton allows us to build new clusters in a way designed specifically for current and future applications at Meta,” the firm said.
Generative AI also consumes data at high volumes, the company said, meaning the next generation of GPU clusters must take storage into account.
Meta's homegrown Linux-based storage solution addresses this in its latest GPU clusters, running in parallel with a version of Meta's Tectonic distributed storage solution.
Meta reported initial performance issues with these larger clusters, but said changes to its internal job scheduler helped optimize both GPU clusters to “achieve excellent and expected performance.”