NetApp AFF A800 and Fujitsu Server - PRIMERGY GX2570 M5 for AI and ML Model Training Workloads


Technical Report

NetApp AFF A800 and Fujitsu Server PRIMERGY GX2570 M5 for AI and ML Model Training Workloads

David Arnette, NetApp
Takashi Oishi, Fujitsu
February 2020 | TR-4815

Abstract

This solution focuses on a scale-out architecture to deploy artificial intelligence systems with NetApp® storage systems and Fujitsu servers. The solution was validated with MLPerf v0.6 model-training benchmarks using Fujitsu GX2570 servers and a NetApp AFF A800 storage system.


1 Introduction

This solution focuses on a clustered architecture using NetApp® storage systems and Fujitsu PRIMERGY servers optimized for artificial intelligence (AI) workflows. It covers testing and validation for PRIMERGY GX2570 M5 servers and a NetApp AFF A800 storage system. In this validation, we demonstrate an efficient and cost-effective solution for high-performance distributed training with NVIDIA Tesla V100 GPUs and the enterprise-grade data management capabilities of NetApp ONTAP® cloud-connected data storage.

Target Audience

This document is intended for the following audiences:...


• Scaling from 200TB (two controllers) up to 9.6PB (24 controllers)
• NetApp ONTAP 9.5, with a complete suite of data protection and replication features for industry-leading data management

Other NetApp storage systems, such as the AFF A700, AFF A320, and AFF A220, offer lower performance and capacity options for smaller deployments at lower cost points.

Figure 1) NetApp AFF A800.

NetApp ONTAP 9

ONTAP 9 is the latest generation of storage management software from NetApp that enables businesses to modernize infrastructure and transition to a cloud-ready data center. Leveraging industry-leading...


Future-Proof Infrastructure

ONTAP 9 helps meet demanding and constantly changing business needs:

• Seamless scaling and nondisruptive operations. ONTAP supports the nondisruptive addition of capacity to existing controllers as well as to scale-out clusters. Customers can upgrade to the latest technologies such as NVMe and 32Gb FC without costly data migrations or outages.
• Cloud connection. ONTAP cloud-connected storage management software offers options for software-defined storage (ONTAP Select) and cloud-native instances (NetApp Cloud Volumes Service) in all public clouds.
• Integration...


connection with NVLink provides up to 50GBps of bandwidth. The PRIMERGY GX2570 M5 also has a 16-lane PCIe slot for every two GPUs that supports up to four low-profile interconnect cards. Using Mellanox ConnectX-5 host channel adapters (HCAs) enables remote direct memory access (RDMA) data transfer between nodes at 100Gbps over either Ethernet or InfiniBand. These HCAs also allow larger workloads to use multiple nodes with linear performance scalability.

Internal storage includes six SATA 3.0 HDDs or SSDs through PCH and four faster PCIe Gen3 4-lane NVMe SSDs. Internal storage can be used for...
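The link speeds quoted above mix gigabits per second (HCA) and gigabytes per second (NVLink), so a quick unit conversion can make them easier to compare. The sketch below is illustrative only. It assumes the 16-lane slot is PCIe Gen3 (the text states the generation only for the NVMe SSDs) and uses the standard Gen3 signaling rate of 8 GT/s per lane with 128b/130b encoding; the function names are hypothetical, and the results are theoretical peaks, not measured throughput.

```python
# Theoretical peak bandwidth comparison (decimal units, 8 bits per byte).

HCA_LINK_GBPS = 100        # from the text: RDMA link rate in gigabits/s
NVLINK_GBYTES_PER_S = 50   # from the text: NVLink connection in gigabytes/s

# Assumption (not stated in the text for the 16-lane slot): PCIe Gen3,
# which signals at 8 GT/s per lane with 128b/130b line encoding.
PCIE_GEN3_GT_PER_LANE = 8.0
PCIE_GEN3_ENCODING = 128 / 130


def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


def pcie_gen3_gbytes_per_s(lanes: int) -> float:
    """Theoretical one-direction PCIe Gen3 bandwidth for a given lane count."""
    return lanes * PCIE_GEN3_GT_PER_LANE * PCIE_GEN3_ENCODING / 8


if __name__ == "__main__":
    hca = gbps_to_gbytes_per_s(HCA_LINK_GBPS)
    slot = pcie_gen3_gbytes_per_s(16)
    nvme = pcie_gen3_gbytes_per_s(4)
    print(f"100Gbps HCA link:      {hca:.2f} GB/s")
    print(f"PCIe Gen3 x16 slot:    {slot:.2f} GB/s")
    print(f"PCIe Gen3 x4 NVMe SSD: {nvme:.2f} GB/s")
    print(f"NVLink connection:     {NVLINK_GBYTES_PER_S:.2f} GB/s")
```

Under these assumptions, a single 100Gbps port fits within the theoretical bandwidth of one x16 slot, while NVLink between GPUs remains several times faster than either.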


capabilities and allows priority flow control for lossless forwarding of Ethernet packets by allocating bandwidth to specific traffic on physical layer media. This enables simultaneous NFS storage access using the same 100GbE link, while guaranteeing bandwidth for RDMA between GPU nodes.

Automatically Build and Sustain System Infrastructure with Ansible

Ansible is a DevOps-style configuration management tool developed by Red Hat. The desired configuration of a system is defined in easily readable YAML files, including the software and hardware configuration of servers, storage, and...


Figure 5) Network topology of tested configuration.

Hardware Requirements

Table 1 lists the hardware components required to implement the solution as tested.

Table 1) Hardware requirements.

Fujitsu PRIMERGY GX2570 M5 servers
    CPU: Dual, 24-core Intel Xeon 8280
    System memory: 12 DDR4 32GB 2933MHz
    GPUs: 8 NVIDIA Tesla V100 SXM2 32GB
    Storage: 2 SATA SSDs, 2 NVMe SSDs
    Network: 2 Mellanox ConnectX-5 dual-port HCAs
    Power consumption: Max 3,656W

NetApp AFF A800 storage system
    High-availability (HA) pair, including two controllers and 48 NVMe SSDs

Cisco Nexus 3232C network switches

Software Requirements

Table 2 lists the...


Table 2) Software requirements.

Software (the Version or Other Information column did not survive in this excerpt):
    Docker container platform
    NVIDIA-Docker container
    Benchmark software:
        SSD300 v1.1 for PyTorch
        Mask R-CNN for PyTorch
        ResNet-50 v1.5 for MXNet
        Minigo for TensorFlow
    NVIDIA Peer Memory Client
    Munge
    Open MPI

4 Validation Test Plan and Results

4.1 Validation Test Plan

This solution was validated using the MLPerf v0.6 benchmark models and testing procedure. Each model was trained using one PRIMERGY GX2570 server and one NetApp AFF A800 storage system in the configuration described in Section 3.1. In addition to testing each model with a...


Figure 6) MLPerf benchmark software stack. Components shown: MLPerf Test Set, CUDA, Guest OS (Ubuntu 16.04 LTS), Docker (docker-ce, nvidia-docker2), NVIDIA Driver, GPUDirect RDMA, Host OS (Ubuntu 18.04 LTS), PRIMERGY Server.

MLPerf v0.6 testing procedures were used to produce these results. The following notes apply to the test results that follow:

• Each result was computed by executing five test runs, dropping the fastest and slowest, and then taking the mean of the remaining three runs.
• We tested each model with the recommended dataset, as noted below. Each test used a standardized dataset size and makeup in order...
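The scoring rule described in the notes above (five runs, drop the fastest and slowest, average the remaining three) can be sketched in a few lines. This is a minimal illustration, not the benchmark's own tooling: the function name `trimmed_run_mean` and the sample timings are hypothetical and are not results from this report.

```python
from statistics import mean
from typing import Sequence


def trimmed_run_mean(run_times: Sequence[float]) -> float:
    """Mean of the middle three of exactly five run times.

    The fastest (smallest) and slowest (largest) runs are discarded,
    as described in the test notes.
    """
    if len(run_times) != 5:
        raise ValueError("expected exactly five test runs")
    ordered = sorted(run_times)
    # ordered[0] is the fastest run, ordered[-1] the slowest; drop both.
    return mean(ordered[1:-1])


if __name__ == "__main__":
    # Hypothetical training times in minutes, for illustration only.
    example_runs = [101.0, 98.5, 99.0, 104.2, 100.0]
    print(trimmed_run_mean(example_runs))
```

Dropping the extremes makes the reported time less sensitive to a single unusually fast or slow run than a plain five-run average would be.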


Validation Test Results

MLPerf Model Training Results

Each MLPerf training benchmark measured the processing time required to train a model on the specified dataset to achieve the specified quality target. Table 3 shows the clock-time result for each of the models trained. Because the datasets, training parameters, and quality targets are standardized in these benchmarks, these results can be compared to other publicly available MLPerf results.

Table 3) MLPerf benchmark processing time for tested models.

Training Time Result

Note that these are unverified scores of v0.6 on the MLPerf image...
