Companies increasingly look to the cloud to run high-performance computing (HPC) workloads. Some need the burst capacity that cloud HPC offers; others cannot fully utilize an on-premises supercomputer. Either way, HPC workloads involve large-scale computations to solve complex problems, requiring massive amounts of data, storage and high-speed networking distributed across many machines. Customers also need easy-to-use tools so they can spend their time on actual computation rather than on managing infrastructure.
AWS has put itself at the forefront of the cloud HPC trend with capabilities ranging from advanced job orchestration and cluster management to an enhanced compute fabric and a file system that scales to millions of IOPS. The company has recognized that many of its customers need performant, cost-effective instances optimized for specific HPC workloads. AWS recently announced the general availability of the Amazon EC2 Hpc7a instance, the successor to Hpc6a, one of its three existing HPC instance types. In this article, we review these HPC instances with a focus on the newest addition to the family, then take a deeper dive into some of the use cases for all this HPC horsepower.
Boundless HPC capacity, built from the latest technologies
The cloud removes the traditional challenges associated with on-premises clusters: fixed infrastructure capacity, technology obsolescence and high capital expenditures. Customers can quickly migrate to newer, more powerful compute instances as they become available on AWS, removing the risk of on-premises CPU clusters becoming obsolete or poorly utilized as needs change.
AWS now offers four HPC-optimized EC2 instance types. First, Hpc7g instances, based on AWS Graviton3E processors, provide low latency and high network performance for applications built on the Message Passing Interface (MPI), a standard for passing messages between processes running across a cluster. Second, Hpc6id instances target memory-bound and data-intensive workloads such as finite element analysis (more on that below) and seismic simulations. Finally, Hpc6a and the new Hpc7a instances are designed for compute-intensive applications such as computational fluid dynamics and weather prediction.
Better performance compared to the previous generation
The new Amazon EC2 Hpc7a instances feature 4th Gen AMD EPYC (Genoa) processors. Compared with Hpc6a instances, Hpc7a instances have twice the core density at up to 192 cores, 2.1 times higher memory bandwidth throughput, twice the memory capacity at 768GB and three times higher network bandwidth.
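To make the generational comparison concrete, the short Python sketch below recomputes the ratios from per-instance specifications. The figures are assumptions taken from AWS's published specs for the largest size of each family (hpc6a.48xlarge: 96 cores, 384 GiB, 100 Gbps EFA; hpc7a.96xlarge: 192 cores, 768 GiB, 300 Gbps EFA), and they agree with the ratios quoted above. The arithmetic also surfaces something the headline numbers hide: memory per core stays at 4 GiB, while network bandwidth per core rises by about 1.5x.

```python
# Assumed specs for the largest size of each family (check current AWS docs).
HPC6A = {"cores": 96, "memory_gib": 384, "efa_gbps": 100}
HPC7A = {"cores": 192, "memory_gib": 768, "efa_gbps": 300}


def generation_ratios(new, old):
    """Return new/old for every spec field the two instances share."""
    return {key: new[key] / old[key] for key in old}


def per_core(spec, field):
    """Divide a whole-instance resource evenly across its cores."""
    return spec[field] / spec["cores"]


ratios = generation_ratios(HPC7A, HPC6A)
print(ratios)  # {'cores': 2.0, 'memory_gib': 2.0, 'efa_gbps': 3.0}
print(per_core(HPC6A, "memory_gib"), per_core(HPC7A, "memory_gib"))  # 4.0 4.0
net_gain = per_core(HPC7A, "efa_gbps") / per_core(HPC6A, "efa_gbps")
print(f"EFA bandwidth per core: {net_gain:.2f}x")  # EFA bandwidth per core: 1.50x
```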
To enable fast, low-latency inter-node communications, these instances come with 300 Gbps of Elastic Fabric Adapter (EFA) network bandwidth, powered by the AWS Nitro System. Hpc7a instances also feature DDR5 memory, which provides 50% higher memory bandwidth compared to DDR4 memory to enable high-speed access to data in memory.
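The 50% DDR5 figure lines up with simple peak-bandwidth arithmetic: a DRAM channel's theoretical peak is its transfer rate times its 8-byte (64-bit) data width. This sketch assumes DDR5-4800 memory on the 4th Gen EPYC (Genoa) processors in Hpc7a and DDR4-3200 on the 3rd Gen EPYC (Milan) processors in Hpc6a, with 12 and 8 memory channels per socket respectively. Those speeds and channel counts are assumptions based on AMD's platform specifications, not figures stated in this article.

```python
def peak_bandwidth_gbs(megatransfers_per_s, channels=1, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10**9 bytes).

    Each transfer moves `bytes_per_transfer` bytes per channel, so the peak is
    rate (MT/s) * bytes * channels / 1000.
    """
    return megatransfers_per_s * bytes_per_transfer * channels / 1000


# Assumed memory configurations (verify against AMD/AWS documentation).
ddr4_channel = peak_bandwidth_gbs(3200)               # one DDR4-3200 channel
ddr5_channel = peak_bandwidth_gbs(4800)               # one DDR5-4800 channel
milan_socket = peak_bandwidth_gbs(3200, channels=8)   # 8 channels per socket
genoa_socket = peak_bandwidth_gbs(4800, channels=12)  # 12 channels per socket

print(f"per channel: {ddr4_channel:.1f} vs {ddr5_channel:.1f} GB/s "
      f"({ddr5_channel / ddr4_channel - 1:.0%} higher)")
print(f"per socket:  {milan_socket:.1f} vs {genoa_socket:.1f} GB/s "
      f"({genoa_socket / milan_socket:.2f}x)")
```

Peak figures are upper bounds; sustained throughput in real applications is lower, so the theoretical per-socket ratio of about 2.25x is best read as a ceiling alongside the 2.1x memory bandwidth figure quoted earlier, not as a replacement for it.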
Compared with Hpc6a, the new instances scale more efficiently on fewer nodes. Hpc7a comes in four sizes, with 24, 48, 96 and 192 cores; all four have the same memory capacity and interconnect speed.
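Because every Hpc7a size keeps the full memory capacity and interconnect, the resources available per core grow as the core count shrinks. The sketch below assumes AWS's size names (hpc7a.12xlarge through hpc7a.96xlarge) and the 768 GiB and 300 Gbps totals cited above; it simply divides those totals across the enabled cores.

```python
# Assumed size names and core counts; every size shares the same totals.
HPC7A_SIZES = {
    "hpc7a.12xlarge": 24,
    "hpc7a.24xlarge": 48,
    "hpc7a.48xlarge": 96,
    "hpc7a.96xlarge": 192,
}
MEMORY_GIB = 768
EFA_GBPS = 300


def per_core_resources(cores, memory_gib=MEMORY_GIB, efa_gbps=EFA_GBPS):
    """Return (GiB of memory per core, Gbps of EFA bandwidth per core)."""
    return memory_gib / cores, efa_gbps / cores


table = {name: per_core_resources(cores) for name, cores in HPC7A_SIZES.items()}
for name, (mem, net) in table.items():
    print(f"{name:15} {HPC7A_SIZES[name]:>3} cores  "
          f"{mem:5g} GiB/core  {net:7g} Gbps/core")
```

Read this way, the smaller sizes may suit workloads that are limited by memory per core, or by per-core software licensing, rather than by raw core count.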
HPC drives more efficient weather forecasting
One application for the new Hpc7a instances is weather forecasting. This goes far beyond the weather reports we check to know how to dress for the day, or to find out if it will rain during a road trip.
Two types of organizations run these forecasting workloads. The first includes the many startups that ingest weather data from the National Oceanic and Atmospheric Administration (NOAA) and similar organizations, process it and sell the outputs to customers such as insurers and weather-dependent companies. Before the cloud, this type of work was not economically feasible because it would have required purchasing an on-premises supercomputer that would sit idle 90% of the time.
The second type includes large businesses and government labs that run continuous weather models and long-term climate research workloads. These organizations typically have enough work to keep an on-premises supercomputer busy. For them, the cloud presents an economical alternative for numerical weather prediction and climate modeling.
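The 90% idle figure explains the economics for the first group. For owned hardware, the effective cost of each hour of useful work is the annual cost divided by the hours the machine is actually busy, so 10% utilization makes every useful hour ten times as expensive as it would be at full utilization. The annual cost below is a hypothetical round number chosen for illustration, not a real price.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def cost_per_busy_hour(annual_cost, utilization):
    """Effective cost of one hour of useful work on owned hardware."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    busy_hours = HOURS_PER_YEAR * utilization
    return annual_cost / busy_hours


# Hypothetical all-in yearly cost (amortized hardware, power, staff).
ANNUAL_COST = 876_000.0

full = cost_per_busy_hour(ANNUAL_COST, 1.0)
idle_90 = cost_per_busy_hour(ANNUAL_COST, 0.10)
print(f"100% busy: ${full:,.0f} per useful hour")
print(f" 10% busy: ${idle_90:,.0f} per useful hour ({idle_90 / full:.0f}x)")
```

With pay-per-use cloud capacity, a startup instead pays roughly for the busy hours alone, which is what makes the business model viable.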
DTN, an example from the first category, specializes in subscription-based analysis and delivery of real-time weather information. DTN’s sophisticated, high-resolution models require continual processing of vast amounts of data from global inputs. Using Amazon EC2 Hpc6a instances, DTN has doubled its high-resolution global weather modeling capacity, from two runs per day to four.
These instances also enable ensemble runs, which are difficult in fixed-capacity on-premises environments. An ensemble forecast is a collection of forecasts run simultaneously, each starting from slightly perturbed initial conditions (and sometimes with varied model physics). The result is a spread of outcomes from which a probability density function can be estimated, quantifying the forecast’s uncertainty and generally producing a more reliable forecast than any single run.
Solving significant challenges in the auto industry
HPC also has multiple applications in automotive design and analysis. One big challenge, for instance, is optimizing a vehicle’s aerodynamic shape for better fuel efficiency. Testing a clay model in a wind tunnel is expensive and time-consuming. With computational fluid dynamics, however, HPC instances can run thousands of design simulations to identify the best candidates, shortening the time it takes to put a physical prototype on the road.
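The screening loop can be sketched as a parameter sweep. Here `estimate_cd` is a hypothetical surrogate standing in for a full CFD run (its formula is invented purely for illustration), and drag is computed with the standard drag equation F = 0.5 · ρ · v² · Cd · A at sea-level air density; the design parameters, frontal area and speed are likewise assumptions. In a real workflow each evaluation would be an independent CFD job, which is what lets an HPC cluster work through thousands of candidates in parallel.

```python
import itertools

AIR_DENSITY = 1.225  # kg/m^3, standard sea-level air
SPEED = 30.0         # m/s (about 108 km/h), assumed cruising speed
FRONTAL_AREA = 2.2   # m^2, assumed frontal area of the vehicle


def estimate_cd(roof_angle_deg, rear_taper_deg):
    """HYPOTHETICAL drag-coefficient surrogate standing in for a CFD run."""
    return (0.25
            + 0.0004 * (roof_angle_deg - 12) ** 2
            + 0.0003 * (rear_taper_deg - 10) ** 2)


def drag_power_kw(cd):
    """Power to overcome drag, F_d * v, with F_d = 0.5 * rho * v^2 * Cd * A."""
    force_n = 0.5 * AIR_DENSITY * SPEED ** 2 * cd * FRONTAL_AREA
    return force_n * SPEED / 1000.0


def evaluate(design):
    roof, taper = design
    cd = estimate_cd(roof, taper)
    return drag_power_kw(cd), cd, design


# 25 x 21 = 525 candidate shapes; each evaluation is independent of the others.
designs = list(itertools.product(range(25), range(21)))
ranked = sorted(evaluate(d) for d in designs)
for power, cd, design in ranked[:3]:
    print(f"roof={design[0]:2d} deg, taper={design[1]:2d} deg: "
          f"Cd={cd:.4f}, {power:.2f} kW")
```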
Crash analysis relies on a different technique: finite element analysis. Automakers must perform an immense amount of this analysis, which traditionally involves crashing cars inside a dedicated crash facility, an incredibly expensive process. It is much more economical to crash virtual cars inside a computer, which is where the new AWS HPC instance shines.
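Production crash codes use explicit, nonlinear dynamic FEA over millions of elements, far beyond a short example. The core finite element workflow is the same, though: split a structure into elements, assemble each element's stiffness into a global matrix, apply boundary conditions and solve. The sketch below shows that workflow for the simplest case, a linear elastic bar fixed at one end and pulled at the other, where the exact tip displacement PL/(EA) lets us check the answer. The bar dimensions and load are illustrative assumptions.

```python
def gaussian_solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def solve_bar(length, n_elements, youngs_modulus, area, tip_load):
    """Static FEA of a bar fixed at x=0 with an axial load at the free end.

    Uses 2-node linear elements and returns the displacement of every node.
    """
    n_nodes = n_elements + 1
    k = youngs_modulus * area / (length / n_elements)  # element stiffness EA/h
    stiffness = [[0.0] * n_nodes for _ in range(n_nodes)]
    for e in range(n_elements):  # add each element's [[k, -k], [-k, k]]
        i, j = e, e + 1
        stiffness[i][i] += k
        stiffness[i][j] -= k
        stiffness[j][i] -= k
        stiffness[j][j] += k
    loads = [0.0] * n_nodes
    loads[-1] = tip_load
    # Fixed support: node 0 cannot move, so drop its row and column.
    reduced = [row[1:] for row in stiffness[1:]]
    return [0.0] + gaussian_solve(reduced, loads[1:])


# Illustrative steel bar: 2 m long, 1 cm^2 cross-section, 10 kN pull.
u = solve_bar(length=2.0, n_elements=8, youngs_modulus=200e9,
              area=1e-4, tip_load=10e3)
exact_tip = 10e3 * 2.0 / (200e9 * 1e-4)
print(f"FEA tip displacement:   {u[-1] * 1000:.4f} mm")
print(f"exact tip displacement: {exact_tip * 1000:.4f} mm")
```

For this one-dimensional linear problem the element solution matches the exact nodal displacements, which makes it a convenient sanity check; real crash models add contact, plasticity and time stepping on top of the same assembly idea.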
Comparing the performance of Hpc7a to the previous generation, Ferrari reports a 30% performance improvement for computational fluid dynamics (CFD) workloads and a 25% performance improvement for finite element analysis (FEA). These advances speak to the significant business impact of the new AWS instances.
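A percentage "performance improvement" can be read two ways, and the distinction matters when estimating wall-clock savings. If it means 30% more throughput (a 1.3x speedup), runtime falls to 1/1.3 of the baseline, about 23% less time; if it means 30% less runtime, the speedup is 1/0.7, about 1.43x. The figures above do not say which definition applies, so this sketch computes both for a hypothetical 10-hour baseline job.

```python
def runtime_if_throughput_gain(baseline_hours, gain):
    """Runtime when throughput rises by `gain` (0.30 means a 1.3x speedup)."""
    return baseline_hours / (1 + gain)


def speedup_if_time_cut(cut):
    """Speedup when runtime falls by `cut` (0.30 means 30% less time)."""
    return 1 / (1 - cut)


BASELINE_HOURS = 10.0  # hypothetical job length on the previous generation

for label, gain in (("CFD", 0.30), ("FEA", 0.25)):
    new_hours = runtime_if_throughput_gain(BASELINE_HOURS, gain)
    print(f"{label}: {gain:.0%} more throughput -> {new_hours:.2f} h "
          f"({1 - new_hours / BASELINE_HOURS:.0%} less time); "
          f"{gain:.0%} less time -> {speedup_if_time_cut(gain):.2f}x speedup")
```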
Wrapping up
As HPC workloads grow in complexity, there is an insatiable need for more compute, memory and network performance to reduce the time to complete the tasks at hand. As more customers bring HPC workloads to EC2, AWS is responding by aligning its instances with workload requirements.
When we look across the many critical use cases discussed above, the clear message is that customers need many performant, cost-effective instances optimized for each specific HPC workload. The cloud is a boon for HPC because it offers access to effectively unlimited infrastructure so customers can scale on demand yet pay for only what they use. That’s how customers can sidestep the traditional challenges of capacity, obsolescence and cost created by on-premises clusters.
AWS service levels ensure that there is elasticity from 100 instances to 1,000 instances and beyond in minutes. Eliminating job queue times and being able to scale the cluster as needed, when needed, vastly improves efficiency—and ultimately reduces customers’ time to market. The new instances AWS has rolled out will only increase these benefits for the companies poised to take advantage of them. As AI and other cutting-edge areas of tech continue to create more complex and intensive workloads, I’ll be interested to see which sectors make the most of cloud HPC’s capabilities.
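Elasticity pays off most when a job scales well. One simple way to reason about it is Amdahl's law: if a fraction p of a job parallelizes perfectly, its runtime on n instances is T1 · ((1 − p) + p/n), where T1 is the single-instance runtime. Under pay-per-use pricing, cost tracks instance-hours (n times the runtime), so a perfectly parallel job (p = 1) finishes 10x sooner on 1,000 instances than on 100 at the same cost, while any serial fraction makes the extra speed progressively more expensive. The job size and parallel fractions below are hypothetical, and the model ignores communication overhead.

```python
def amdahl_runtime(single_instance_hours, parallel_fraction, instances):
    """Amdahl's law: the serial part runs unchanged, the parallel part splits n ways."""
    serial = (1 - parallel_fraction) * single_instance_hours
    parallel = parallel_fraction * single_instance_hours / instances
    return serial + parallel


def instance_hours(single_instance_hours, parallel_fraction, instances):
    """Billable instance-hours, assuming every instance runs for the whole job."""
    return instances * amdahl_runtime(single_instance_hours, parallel_fraction,
                                      instances)


JOB_HOURS = 10_000.0  # hypothetical runtime on a single instance

for p in (1.0, 0.999):
    for n in (100, 1_000):
        hours = amdahl_runtime(JOB_HOURS, p, n)
        cost = instance_hours(JOB_HOURS, p, n)
        print(f"p={p}, n={n:>5}: {hours:8.2f} h wall clock, "
              f"{cost:10,.0f} instance-hours")
```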
Moor Insights & Strategy provides or has provided paid services to technology companies, like all tech industry research and analyst firms. These services include research, analysis, advising, consulting, benchmarking, acquisition matchmaking, and video and speaking sponsorships. The company has had or currently has paid business relationships with 8×8, Accenture, A10 Networks, Adobe, Advanced Micro Devices, Amazon, Amazon Web Services, Ambient Scientific, Ampere Computing, Analog Devices, Anuta Networks, Applied Brain Research, Applied Micro, Apstra, Arm, Aruba Networks (now HPE), Atom Computing, AT&T, Aura, Avaya Holdings, Automation Anywhere, AWS, A-10 Strategies, Bitfusion, Blaize, Box, Broadcom, C3.AI, Calix, Cadence Systems, Campfire, Cisco Systems, Clear Software, Cloudera, Clumio, Cohesity, Cognitive Systems, CompuCom, Cradlepoint, CyberArk, Dell, Dell EMC, Dell Technologies, Diablo Technologies, Dialogue Group, Digital Optics, Dreamium Labs, D-Wave, Echelon, Elastic, Ericsson, Extreme Networks, Five9, Flex, Fortinet, Foundries.io, Foxconn, Frame (now VMware), Frore Systems, Fujitsu, Gen Z Consortium, Glue Networks, GlobalFoundries, Revolve (now Google), Google Cloud, Graphcore, Groq, Hiregenics, Hotwire Global, HP Inc., Hewlett Packard Enterprise, Honeywell, Huawei Technologies, HYCU, IBM, Infinidat, Infoblox, Infosys, Inseego, IonQ, IonVR, Infiot, Intel, Interdigital, Intuit, Iron Mountain, Jabil Circuit, Juniper Networks, Keysight, Konica Minolta, Lattice Semiconductor, Lenovo, Linux Foundation, Lightbits Labs, LogicMonitor, LoRa Alliance, Luminar, MapBox, Marvell Technology, Mavenir, Marseille Inc, Mayfair Equity, MemryX, Meraki (Cisco), Merck KGaA, Mesosphere, Micron Technology, Microsoft, MiTEL, Mojo Networks, MongoDB, Movandi, Multefire Alliance, National Instruments, Neat, NetApp, Netskope, Nightwatch, NOKIA, Nortek, Novumind, NTT, NVIDIA, Nutanix, Nuvia (now Qualcomm), NXP, onsemi, ONUG, OpenStack Foundation, Oracle, Palo Alto Networks, Panasas, Peraso, Pexip, Pixelworks, Plume Design, PlusAI, Poly (formerly Plantronics), Portworx, Pure Storage, Qualcomm, Quantinuum, Rackspace, Rambus, Rayvolt E-Bikes, Red Hat, Renesas, Residio, Rigetti Computing, Ring Central, Salesforce.com, Samsung Electronics, Samsung Semi, SAP, SAS, Scale Computing, Schneider Electric, SiFive, Silver Peak (now Aruba-HPE), SkyWorks, SONY Optical Storage, Splunk, Springpath (now Cisco), Spirent, Sprint (now T-Mobile), Stratus Technologies, Symantec, Synaptics, Syniverse, Synopsys, Tanium, Telesign, TE Connectivity, TensTorrent, Tobii Technology, Teradata, T-Mobile, Treasure Data, Twitter, Unity Technologies, UiPath, Verizon Communications, VAST Data, Veeam, Ventana Micro Systems, Vidyo, Volumez, VMware, Wave Computing, Wells Fargo, Wellsmith, Xilinx, Zayo, Zebra, Zededa, Zendesk, Zoho, Zoom, and Zscaler.