The Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) is renowned for its science and engineering, and is rated as the second most innovative university in Europe by Reuters. Its areas of expertise include materials science, chemistry, life sciences, computer science, and biomedical engineering. High-performance computing (HPC) is a key research focus at FAU with numerous applications across its faculties, particularly HPC-related teaching and research in computer science.
FAU is a member of the German NHR-Alliance, a federation of nine tier-2 computing centers in Germany. The university’s Erlangen National High Performance Computing Center (NHR@FAU) supports HPC-based research on a national scale. This HPC project would significantly expand FAU’s research and HPC capabilities.
The use of machine learning (ML) and molecular dynamics (MD) has become increasingly important for many areas of research, including materials science, chemistry, life sciences, computer science, biomedical engineering, and linguistics. As FAU’s research in these areas grows more complex, it requires state-of-the-art HPC. To bolster the university’s scientific research capabilities, NHR@FAU sought to construct a state-of-the-art, energy- and cost-effective GPU cluster.
Business Needs/Challenges (User Pain Points)
1. Machine Learning and AI
A wide array of research is turning to machine learning and AI for scientific advances. At FAU, major research utilizing ML and AI is being done in computer science for pattern recognition, biomedical engineering, and linguistic studies in spoken word and gesture pairing.
2. Molecular Dynamics
FAU uses MD to simulate the time-dependent properties of macromolecules and of biological systems such as proteins, as well as for pharmacology research. These simulations require immense computing power, and demand for them is growing rapidly.
IEIT and MEGWARE’s Solution
IEIT and MEGWARE’s HPC solution has greatly enhanced FAU’s scientific research capabilities. The floating-point performance for model inference and training exceeded the university’s original expectations by 115%. For the ML nodes, IEIT recommended the NF5488A5 GPU server powered by NVIDIA A100 GPUs, a next-generation data center GPU with improved CUDA cores, Tensor Cores, GPU memory, and overall computing power. For the MD nodes, IEIT recommended the NF5468A5 GPU server powered by NVIDIA A40 Tensor Core GPUs. This solution maximized performance while minimizing costs. FAU had relatively low requirements for server PCIe expansion, making the IEIT NF5488A5 and NF5468A5 servers an ideal choice. Alternative server products offer more PCIe slot expansion capacity, but this is a costly feature the university would not fully utilize. Consequently, the IEIT NF5488A5 and NF5468A5 servers fulfilled all project requirements while also reducing procurement costs.
1) ML nodes supporting diverse machine learning and artificial intelligence tasks
The software and algorithms used in ML-based research have diverse computing characteristics. Different software packages make very different use of CPUs, GPUs, memory, and hard disks, and resource utilization also varies considerably from task to task. IEIT therefore recommended the NF5488A5 GPU server, which supports eight NVIDIA GPUs fully interconnected via third-generation NVLink and two AMD CPUs in a 4U chassis. It can comfortably handle computation on ML datasets and improve training efficiency.
2) MD nodes with strong computing power
Due to the complexity of molecular simulation theory, the requirements for HPC in this field are extremely high. IEIT recommended the NF5468A5 GPU server, which combines eight NVIDIA A40 Tensor Core GPUs and two AMD CPUs in a 4U chassis. It is a powerful match for an assortment of HPC MD requirements.
3) A GPU Cluster that can support a wide variety of complex scientific research
NHR@FAU has named the GPU cluster composed of these ML and MD nodes “Alex”. Alex is the core component of NHR@FAU’s HPC infrastructure for handling the rapidly growing demand for ML and MD computing resources in scientific research. The cluster is interconnected through a high-speed HDR InfiniBand network. The result is top-level general-purpose computing with excellent HPC and AI performance: the cluster runs a multitude of research-specific software with varying hardware requirements, supports massive ML datasets and molecular dynamics simulations, and improves training efficiency.
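The two node types described above can be summarized in a short Python sketch. The server models, GPU models, GPU and CPU counts, and chassis height come from the text; the `NodeType` class, the "ml"/"md" workload labels, and the helper functions are hypothetical names introduced for illustration and do not reflect how NHR@FAU actually schedules jobs on Alex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    server: str         # server model named in the text
    gpu: str            # GPU model named in the text
    gpus_per_node: int
    cpus_per_node: int
    rack_units: int     # chassis height in U


# Values taken from the node descriptions above.
ML_NODE = NodeType("NF5488A5", "NVIDIA A100", 8, 2, 4)
MD_NODE = NodeType("NF5468A5", "NVIDIA A40", 8, 2, 4)

# Hypothetical routing table: the workload labels are illustrative only.
NODE_FOR_WORKLOAD = {"ml": ML_NODE, "md": MD_NODE}


def node_for_workload(kind: str) -> NodeType:
    """Return the node type assumed to serve a workload kind ("ml" or "md")."""
    try:
        return NODE_FOR_WORKLOAD[kind.lower()]
    except KeyError:
        raise ValueError(f"unknown workload kind: {kind!r}") from None


def gpus_in_rack_space(node: NodeType, rack_units: int) -> int:
    """GPUs that fit in the given rack units, ignoring switches and cooling."""
    return (rack_units // node.rack_units) * node.gpus_per_node
```

For example, `node_for_workload("md")` returns the NF5468A5 description, and `gpus_in_rack_space(ML_NODE, 42)` estimates how many A100 GPUs would fit in a 42U rack filled only with ML nodes, an idealized figure that leaves no room for networking or power equipment.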
Client Benefits
NHR@FAU’s HPC cluster, built with IEIT GPU servers and developed by MEGWARE, runs applications such as TensorFlow, PyTorch, Quantum ESPRESSO, and VASP, as well as scientific research software such as NAMD, LAMMPS, AMBER, and GROMACS. To meet the project's HPL test requirements, IEIT's HPC application analysis team drew on its deep understanding of HPC in scientific computing to provide a performance optimization solution that delivered a 15% performance boost.