Rearchitecting data centers for AI workloads

WHITE PAPER

According to some estimates, over 80% of AI projects¹ fail to achieve all their goals. There are several common reasons for this, one of them being a lack of adequate data center infrastructure. While most IT leaders realize AI can’t happen without cutting-edge infrastructure, building and optimizing an AI-ready data center is undeniably complex.

At the heart of this challenge is that AI isn’t a single process but a set of processes happening in different places, each with its own technical requirements. Before you can train an AI model, you need to collect, process and store vast amounts of data, often across a multitude of systems. Training then requires specialized hardware, such as high-end GPUs. Once the model is trained, you need low-latency, high-efficiency CPUs to generate real-time insights, all while moving data seamlessly between storage, compute and edge environments.

Each workload places unique demands on the data center. As such, optimizing data center infrastructure to support AI projects requires a fundamentally different approach compared with more traditional enterprise workloads, such as virtualization or delivery of cloud applications. This paper explores strategies for building an efficient data center that can accommodate the ever-growing demands of AI workloads.


AI is tremendously data-hungry. The data sets used to train large mainstream models such as GPT-4 or Google Gemini run into dozens of terabytes. Even smaller models, designed for specific use cases, require about 10 times as many data points as they have parameters, the variables a model learns during training. GPT-4, for example, has an estimated 1.8 trillion parameters.² Even the smallest AI models have millions of parameters. Training data can significantly burden storage infrastructure, particularly as AI models grow in scale and complexity. Moreover, training...
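As a rough illustration of the 10-to-1 rule of thumb mentioned above, the following sketch estimates how many data points a model of a given size would call for. The function name and the example parameter counts are hypothetical, and the ratio is only a heuristic, not a guarantee of model quality.

```python
def min_training_examples(num_parameters: int, ratio: int = 10) -> int:
    """Rule-of-thumb data requirement: `ratio` data points per parameter.

    The default ratio of 10 follows the heuristic quoted in the text;
    real requirements depend on the task and the model architecture.
    """
    if num_parameters <= 0:
        raise ValueError("num_parameters must be positive")
    return num_parameters * ratio


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only for illustration.
    examples = [("small model", 5_000_000), ("mid-size model", 2_000_000_000)]
    for name, params in examples:
        needed = min_training_examples(params)
        print(f"{name}: {params:,} parameters -> ~{needed:,} data points")
```

Multiplying the result by an average record size gives a first-order estimate of the storage capacity the training set will occupy.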


Step 1: Choose scalable, high-performance storage media

To meet data collection and storage demands, the AI-ready data center must use modernized storage architectures that minimize bottlenecks and support high-speed data retrieval. While hard drives remain a practical choice for low-cost archival storage, they can’t meet the performance demands of real-time data access. An optimal production environment should instead use Non-Volatile Memory Express solid-state drives (NVMe SSDs). These offer lower latency and higher throughput, with data-transfer speeds up to 35 times higher than conventional...
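The split described above, with hard drives for archival data and NVMe SSDs for data that needs real-time access, can be sketched as a simple placement policy. This is a minimal illustration: the `Dataset` fields, the tier names and the `hot_threshold` default are all hypothetical and would need tuning against real access patterns.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    reads_per_day: int        # how often the data is accessed
    latency_sensitive: bool   # True if it feeds training or real-time access


def choose_tier(ds: Dataset, hot_threshold: int = 100) -> str:
    """Place hot or latency-sensitive data on NVMe SSDs, the rest on HDDs.

    `hot_threshold` is an assumed cut-off, not a recommended value.
    """
    if ds.latency_sensitive or ds.reads_per_day >= hot_threshold:
        return "nvme_ssd"
    return "hdd_archive"


if __name__ == "__main__":
    datasets = [
        Dataset("training_shards", reads_per_day=5_000, latency_sensitive=True),
        Dataset("raw_logs_2019", reads_per_day=1, latency_sensitive=False),
    ]
    for ds in datasets:
        print(f"{ds.name} -> {choose_tier(ds)}")
```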


Model training is easily the most resource-intensive phase of the AI lifecycle. Computational demands are enormous, making raw compute power a common bottleneck in training large-scale AI models. Not only does data center infrastructure need to accommodate rapid access to vast data sets, but it also needs high-performance computing (HPC) capabilities to handle the myriad complex computations inherent in training any AI model. That said, the computational requirements for AI model training vary greatly, depending on the size and complexity of the model. For example, a small-scale model may have...
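To give a sense of how training compute scales with model size, the sketch below uses a commonly cited approximation for dense transformer models: total training compute is roughly 6 floating-point operations per parameter per training token. This approximation is not taken from the white paper, and the function name and the example sizes are hypothetical.

```python
def estimate_training_flops(num_parameters: float, num_tokens: float) -> float:
    """Approximate training compute as 6 * parameters * tokens.

    This is a rule of thumb for dense transformers (an assumption here);
    it ignores architecture details and hardware utilization.
    """
    return 6.0 * num_parameters * num_tokens


if __name__ == "__main__":
    # Hypothetical model sizes and token counts, for illustration only.
    small = estimate_training_flops(1e8, 2e9)
    large = estimate_training_flops(1e11, 2e12)
    print(f"small model: {small:.2e} FLOPs")
    print(f"large model: {large:.2e} FLOPs")
    print(f"ratio: {large / small:,.0f}x")
```

Because both factors grow with model scale, compute demand rises much faster than parameter count alone would suggest.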


Step 2: Address energy-consumption demands in the AI data center

AI model training consumes massive amounts of energy, with larger models requiring several orders of magnitude more energy than small, single-purpose models with just a few billion parameters. Also, AI training involves sustained, resource-hungry workloads, unlike the short bursts of computation used for inference. Standard CPUs aren’t energy-efficient enough, leading to wasted energy consumption over excessively long training times. AI-optimized processors have lower power-to-computation ratios and can maintain the high compute demands...
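The effect of the power-to-computation ratio can be sketched with basic arithmetic: training time is total work divided by sustained throughput, and energy is power multiplied by time. The two processor profiles below are hypothetical placeholders, not measurements of any real product.

```python
JOULES_PER_KWH = 3.6e6


def training_energy_kwh(total_flops: float,
                        sustained_flops_per_second: float,
                        power_watts: float) -> float:
    """Energy for a training run: power * (work / throughput), in kWh."""
    seconds = total_flops / sustained_flops_per_second
    return power_watts * seconds / JOULES_PER_KWH


if __name__ == "__main__":
    workload = 1e21  # hypothetical total training compute, in FLOPs

    # Hypothetical profiles: (sustained FLOP/s, power draw in watts).
    profiles = {
        "general-purpose processor": (2e12, 300.0),
        "AI-optimized processor": (4e14, 700.0),
    }
    for name, (rate, watts) in profiles.items():
        kwh = training_energy_kwh(workload, rate, watts)
        print(f"{name}: {kwh:,.0f} kWh")
```

In this sketch the faster device draws more power per second yet uses far less energy overall, because the run finishes much sooner.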


After training, an AI model is deployed in production, where its ability to analyze and generate insights is put to the test. This is known as the inference phase, where the model recognizes patterns in external data to infer conclusions and predictions. That's the heart of the value of AI, so having the infrastructure necessary for real-time inference is ultimately what translates into actual business value and model viability. Fortunately, inference isn't nearly as computationally intensive as training. However, unlike training, which is done intermittently, inference runs continuously in a...


Step 2: Right-size inference workloads for cost-efficient scaling

Step 3: Factor in data localization and regulatory demands

To be useful in real-world applications, AI inference heavily depends on the rapid movement of data between compute, storage and end-user environments. An optimized network infrastructure is essential for making that possible and, in doing so, preventing latency or bandwidth bottlenecks from disrupting user experiences. For instance, SmartNICs offer software-defined hardware acceleration to ensure faster packet handling and reduced latency, while high-speed interconnects...
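One simple way to think about right-sizing an inference deployment is to derive the number of serving instances from the expected peak request rate rather than provisioning by guesswork. The sketch below is an assumption-laden illustration: the function name, the `headroom` default and the example figures are hypothetical, and per-instance throughput would have to be measured on real hardware.

```python
import math


def replicas_needed(peak_requests_per_second: float,
                    per_replica_requests_per_second: float,
                    headroom: float = 0.2) -> int:
    """Smallest number of replicas that covers peak load plus headroom.

    `headroom` is the fraction of spare capacity kept for traffic spikes
    (an assumed default, not a recommendation).
    """
    if per_replica_requests_per_second <= 0:
        raise ValueError("per-replica throughput must be positive")
    target = peak_requests_per_second * (1.0 + headroom)
    return max(1, math.ceil(target / per_replica_requests_per_second))


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    print(replicas_needed(peak_requests_per_second=450,
                          per_replica_requests_per_second=60))
```

Re-running the calculation as traffic or model latency changes keeps capacity close to demand, which is the point of right-sizing.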


AI is radically changing how businesses operate, compete and innovate, but success hinges on having a high-performance data center. As AI workloads continue to grow in scale and complexity, integrating specialized storage, compute and efficient networking is paramount for ensuring long-term sustainability and growth. Working with the right AI infrastructure partner grants you access to the expertise, hardware and software ecosystem needed to accelerate AI adoption while ensuring cost efficiency, performance and scalability. Being at the forefront of AI data center innovation, AMD delivers...
