# FPGAs in Supercomputers: Opportunities or Folly?

**FPGA'19 Banquet Panel** 

Deming Chen, ECE, University of Illinois at Urbana-Champaign



### Supercomputing applications

- quantum mechanics
- weather forecasting
- climate research
- oil and gas exploration
- molecular modeling
- physical simulations
  - early moments of the universe, airplane and spacecraft aerodynamics, the detonation of nuclear weapons, and nuclear fusion, etc.



#### The Summit Supercomputer

| Sponsors     | U.S. Department of Energy                                                    |
|--------------|------------------------------------------------------------------------------|
| Operators    | IBM                                                                          |
| Architecture | 9,216 POWER9 22-core CPUs<br>27,648 Nvidia Tesla V100 GPUs |
| Power        | 13 MW |
| Storage      | 250 PB                                                                       |
| Speed        | 200 petaflops (peak)                                                         |
| Purpose      | Scientific research                                                          |

Opportunity?

Engineering Letters, 16:3, EL\_16\_3\_23

### Maxwell – a 64 FPGA Supercomputer

University of Edinburgh, 2008

# Microsoft's FPGA-powered supercomputers can translate Wikipedia faster than you can blink

The world doesn't have long to wait for Microsoft's "A.I. supercomputers"; they're already here.

Paderborn University in Germany, 2018

#### • There are no FPGAs used in the top supercomputer systems yet

### Panelists

- Hal Finkel, Lead, Compiler Technology and Programming Languages, Argonne National Laboratory
- Martin Herbordt, Professor, ECE, Boston University
- Wen-Mei Hwu, Sanders III AMD Endowed Chair Professor, ECE, UIUC
- Venkata Krishnan, Principal Engineer, Intel
- Viraj Paropkari, Senior Manager, Global Data Center Marketing, Xilinx

### Questions charged for the panelists

- Is there a need to bring FPGAs into supercomputers? Why or why not?
- Are there unique applications that are specifically suitable for FPGAs for supercomputing fields?
- What are the challenges and/or major issues facing FPGAs for supporting supercomputing?
- What and where are the opportunities? Who are the stakeholders?
- Name one thing that the FPGA industry should (or should not) do in the near term to facilitate FPGA's induction into supercomputers.



### FPGAs in SUPERCOMPUTERS: OPPORTUNITY OR FOLLY

Venkata Krishnan Intel Corporation

Feb 26, 2019



# Disclaimer

Views, thoughts, and opinions expressed belong solely to the speaker, and not necessarily to the speaker's employer or any other group or individuals.

Any product information provided here is subject to change without prior notice.

Examples shown are for illustrative purposes and are to be treated as such.



### DO FPGAs BELONG IN SUPERCOMPUTERS? The answer is a YES, but...

- Not as an alternative to Xeon/GPU/ASIC, but rather as a complement to them

#### **FPGAs are relevant in Supercomputers**

- For a well-defined set of "services" (e.g. specific/targeted applications)
- Such services need to adapt quickly to algorithmic changes/data and meet certain performance, cost, and power requirements

#### One part of enabling this is making FPGAs 1<sup>st</sup> class citizens on the network

Note: There are times when FPGAs can be used in place of GPUs or help avoid ASICs. Also, the term 1<sup>st</sup> class here broadly refers to FPGAs acting autonomously without host/OS involvement. It doesn't preclude FPGAs being deployed as SmartNICs or as a NIC assist.

### FPGAs can INDEED be FIRST CLASS CITIZENS on THE NETWORK!!

#### • EP300 PLD (1984) & XC2064 FPGA (1984)

- 64 logic blocks with two 3-input LUT & 1 register

#### • Falcon Mesa (after Stratix10) in 2019

- Millions of logic blocks/registers

#### AND

- Heterogeneous System-in-Package (SIP) Integration
- In package High Bandwidth Memory (HBM2/HBM3)
- Integrated high speed transceivers (112G PAM4)
- Quad-core ARM\* SoC
- PCIe Gen4 IP, DSP, multipliers
- Support for DDR/DDR-T
- And much more...



#### A couple of examples (there are others) showing this usage in a large system follow...

### **EXAMPLE – Computing Near Sensors**

**FPGA SOC as Front End Processing Nodes** 



#### Frontend (Trigger) - Particle detectors, Radio Astronomy, Aerospace etc.

- "Filter" huge volume of data by performing compute *at point of data acquisition*
- Estimated reduction in backend nodes/fabric requirements **could be 10x-100x**
- Flexibility enables new or updated algorithms
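
A minimal Python sketch of the front-end filtering idea above, assuming a plain amplitude threshold stands in for the trigger. The synthetic Gaussian readout, the threshold value, and the name `trigger_filter` are illustrative assumptions, not part of the panel material; a real trigger would run in FPGA logic on detector-specific features.

```python
import random


def trigger_filter(samples, threshold):
    """Keep only samples whose amplitude exceeds the trigger threshold."""
    return [s for s in samples if s > threshold]


random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical readout: mostly noise, with rare large excursions.
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Hypothetical threshold, in units of the noise standard deviation.
kept = trigger_filter(samples, threshold=2.0)

# Data volume forwarded to the backend shrinks by this factor.
reduction = len(samples) / max(len(kept), 1)
print(f"kept {len(kept)} of {len(samples)} samples (~{reduction:.0f}x less data)")
```

The reduction factor depends entirely on how selective the trigger is, which is why the slide quotes a range rather than a single number.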

### EXAMPLE – DISAGGREGATED STORAGE in an HPC System

Assumes massively distributed high-performance storage across the entire system, with FPGA SoC as storage nodes (for accelerating storage services)



As a standalone (autonomous) node



#### **INTEGRATION OF STORAGE "NODE" WITH SWITCHES**

**Processor nodes** 

- Rack Savings
- Infrastructure Power & cost reduction
- Cabling cost
- Performance (acceleration) with opportunity to reconfigure based on changes to storage services



# Great. But how do we get there?

### NEED A FOUNDATIONAL INFRASTRUCTURE

**1. Basic infrastructure that integrates networking & accelerator support** 

- Modular architecture that allows ease of customization
- Familiar programming environment developed around open standards (not proprietary)

**2. Customizing infrastructure based on targeted application or services** 

- Identify core libraries & provide necessary IP (accelerator) blocks
- Integrate them on the above infrastructure for a complete solution



High-level view of a conceptual software/hardware stack

STEP 1 IS A NECESSARY STEP

STEP 2 DRIVEN BY CUSTOMER OR APPLICATION REQUIREMENTS

### Providing a "FAMILIAR" PROGRAMMING ENVIRONMENT

Extensions to OFI\*

- OFI is a low-level networking API with extensive middleware support, so applications can continue to use standard APIs
- Extending this network API with acceleration capabilities is easier than doing it the other way around (i.e. taking an acceleration API such as OpenCL and extending it with networking primitives)



- Open standard
- Can be extended for new features (e.g. acceleration)
- Defines a communication API and is agnostic to networking protocols & HW implementation
- Architected to support RDMA functionality



\* Concepts presented at 2019 ECP annual meeting & OFA workshop



## The Challenge: Balancing Performance & Flexibility



### OFI LIBFABRIC COMMUNITY

Others: rsockets, PMDK, Spark, ZeroMQ, TensorFlow, MxNET, NetIO, Intel MLSL, ...



## OFI insulates applications from fabric diversity

\* Other names and brands may be claimed as the property of others

### **HPC with FPGAs\***+

#### **Martin Herbordt**

Computer Architecture and Automated Design Laboratory, Department of Electrical and Computer Engineering, Boston University, http://www.bu.edu/caadlab

\* This work supported, in part, by Red Hat, Microsoft, the U.S. NIH and NSF, and by donations from Xilinx, Intel-Altera, and Gidel

<sup>+</sup> Thanks to Ahmed Sanaullah, Ethan Yang , Qingqing Xiong, Jiayi Sheng, Robert Munafo, Josh Stern, Tony Geng, Tianqi Wang


#### **Question 1:**

### What's the status of FPGA/HPC right now?

Method:

• Look



### **FPGA/HPC** is here!





#### Search for "FPGA" in SC Proceedings from 2001-2016 – 13 hits 😕


#### **Question 1:**

### What's the status of FPGA/HPC right now?

### Answer 1: Long way to go ... especially if we mean having teams in place at non-FPGA-specialized facilities.



### Question 2: Why FPGAs for HPC?

Answer 2:

- Beat GPUs?
- Transceivers: cheap, flexible, high-quality interconnects
- Co-location of compute and communication logic
- Flexible on-chip interconnects

"We are good at low latency. We will stay good at low latency." - keynote, FPGA 2019

"Data movement is everything" - heard at FPGA 2019



#### **Question 3:**

### What will FPGAs in HPC look like?

**Answer 3: FPGAs everywhere!** 





#### **Question 4:**

# Can ordinary computer professionals make FPGA applications cost-effective(ly)?

Method:

Try OpenCL



### Results

- CPU: 14-core 2.4GHz Intel<sup>®</sup> Xeon<sup>®</sup> E7-4908v3
- GPU: NVIDIA P100 (16nm), 3584 cores, HBM2
- FPGA: Intel<sup>®</sup> Arria<sup>®</sup> 10 (20nm), 427K ALMs, 1.5K DSP blocks

|                                               | CPU<br>(14 core) | Previous<br>FPGA OpenCL        | Our Work | GPU                   | Verilog               |
|-----------------------------------------------|------------------|--------------------------------|----------|-----------------------|-----------------------|
| Average<br>Speedup of<br>Our Work             | 1.2x             | 2.5x                           | 1.0x     | 0.3x                  | 0.9x                  |
| Highest<br>Speedup<br>Achieved by<br>Our Work | 5.9x<br>(NW) [1] | 155x<br>(Range<br>Limited) [2] | 1.0x     | 2.6x<br>(SPMV)<br>[3] | 2.1x<br>(SPMV)<br>[4] |

#### Why is 0.3x versus GPU good?

- GPU/CPU reference codes are (mostly) highly tuned by vendor, OpenCL is ours
- Apps are mostly very good on GPUs (e.g., versus CPUs)

#### We estimate a 4x increase in performance of our OpenCL designs using Intel® Stratix ® 10

- [1] S. Che et al. "Rodinia: A Benchmark Suite for Heterogeneous Computing," in IISWC, 2009.
- [2] C. Yang et al. "OpenCL for HPC with FPGAs: Case study in molecular electrostatics," in HPEC, 2017.
- [3] NVIDIA cuSparse
- [4] L. Zhuo et al. "Sparse Matrix-Vector Multiplication on FPGAs," in FPGA, 2005.
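
As a rough reading of the estimate above, the sketch below applies the quoted 4x Stratix 10 gain to the average speedups in the Results table. It assumes the gain applies uniformly across applications and that the CPU and GPU baselines stay fixed; both are simplifying assumptions, not claims from the slides.

```python
# Average speedups of the OpenCL designs ("Our Work") taken from the Results table.
avg_speedup = {"CPU (14 core)": 1.2, "GPU": 0.3}

# Estimated performance gain from moving the same designs to Stratix 10 (from the slide).
STRATIX10_SCALING = 4.0


def project(speedups, scaling):
    """Scale each speedup by a uniform factor (assumption: baselines unchanged)."""
    return {name: round(value * scaling, 2) for name, value in speedups.items()}


projected = project(avg_speedup, STRATIX10_SCALING)
for name, value in projected.items():
    print(f"projected average speedup vs {name}: {value}x")
```

Under these assumptions the OpenCL designs would move from well behind the tuned GPU codes to roughly on par with them.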

#### **Related Work:**

- A. Sanaullah et al. "Unlocking Performance-Programmability by Penetrating the Intel FPGA OpenCL Toolflow," in HPEC, 2018
- A. Sanaullah et al. "SimBSP: Enabling RTL Simulation for Intel FPGA OpenCL Kernels," in H2RC, 2018

### **Characterization of Optimizations**

- All optimizations are application "unaware" and known to the compiler/autotuning communities
- However, many are not part of OpenCL "best practices," esp. V-4 – V-6
- These account for 5x performance

Figure: Average Incremental Impact of Individual Optimizations (speedup on a log scale for code versions V-1 through V-6)



#### **Question 4:**

# Can ordinary computer professionals make FPGA applications cost-effective(ly)?

### Answer 4: This SHOULD work!

- Currently requires skill/knowledge
- Standard practices not there yet, but should be
- Should be integrated into OpenCL compilers



### Summary

FPGA/HPC is here and spreading

No reason it shouldn't extend further –

- System/Provider functions written by FPGA engineers w/ massive potential impact
- Applications written using HLS

Might take a while before many traditional HPC production apps run primarily on FPGAs



#### FPGAs for Supercomputing: Already Here and Yet So Far Away?

Hal Finkel (hfinkel@anl.gov)



#### Is there a need to bring FPGAs into supercomputers?

| Operation                         | Energy (pJ)         |
|-----------------------------------|---------------------|
| 64-bit integer operation          | 1                   |
| 64-bit floating-point operation   | 20                  |
| 256 bit on-die SRAM access        | 50                  |
| 256 bit bus transfer (short)      | 26                  |
| 256 bit bus transfer (1/2 die)    | 256                 |
| Off-die link (efficient)          | 500                 |
| 256 bit bus transfer (across die) | 1,000               |
| DRAM read/write (512 bits)        | <u>16,000</u>       |
| HDD read/write                    | O(10 <sup>6</sup> ) |

Do FPGAs perform less data movement per computation?

Courtesy Greg Astfalk (HPE) and Bill Dally (NVIDIA)
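
To make the table concrete, the sketch below expresses each operation's energy as a multiple of one 64-bit floating-point operation. The figures are copied from the table; the helper name `flops_equivalent` is an illustrative choice.

```python
# Energy per operation in picojoules, copied from the table above.
ENERGY_PJ = {
    "64-bit integer operation": 1,
    "64-bit floating-point operation": 20,
    "256 bit on-die SRAM access": 50,
    "256 bit bus transfer (short)": 26,
    "256 bit bus transfer (1/2 die)": 256,
    "Off-die link (efficient)": 500,
    "256 bit bus transfer (across die)": 1_000,
    "DRAM read/write (512 bits)": 16_000,
}

FLOP_PJ = ENERGY_PJ["64-bit floating-point operation"]


def flops_equivalent(operation):
    """How many 64-bit floating-point operations cost the same energy."""
    return ENERGY_PJ[operation] / FLOP_PJ


for operation, pj in ENERGY_PJ.items():
    print(f"{operation:36s} {pj:>7,} pJ  = {flops_equivalent(operation):8.2f} flops")
```

By these numbers, a single 512-bit DRAM access costs as much energy as hundreds of floating-point operations, which is why the question of data movement per computation matters.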

Argonne Leadership Computing Facility http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/201604/McCormick-ASCAC.pdf

#### Is there a need to bring FPGAs into supercomputers? (cont.)



See also: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2008-130.pdf


#### Are there unique applications that are specifically suitable for FPGAs for supercomputing fields?

Fig. 9. Speed up of FHAST compared to BOWTIE for exact matches, one and two mismatches.

http://escholarship.org/uc/item/35x310n6

#### Are there unique applications... (cont.)

#### machine learning and neural networks

#### FPGA is faster than both the CPU and GPU, 10x more power efficient, and a much higher percentage of peak!




http://ieeexplore.ieee.org/abstract/document/7577314/

#### What are the challenges and/or major issues facing FPGAs for supporting supercomputing?

- Compile Time: Place & Route for a high-end FPGA can take hours or days, and that might be *per kernel* for a larger application. Large applications might have hundreds of kernels. Thus, FPGAs will have difficulty functioning as *general purpose* accelerators.
- Double-precision floating-point support. Most HPC applications currently require it.
- Code Size: It's unclear how many of the kernels used by real applications will fit on even a large FPGA.

Figure: Cumulative relative runtime vs. lines of code in kernels in DOE proxy apps: ASPA, SNAP, SNBone, SW4lite, Nekbone, SWFFT, miniFE, Lulesh, miniAero, MiniGMG, CoMD, CoSP2, HPCCG, miniTri, PENNANT, Pathfinder, RSBench, SimpleMOC, XSBench – Data collected by Brian Homerding (ALCF).

#### What and where are the opportunities? Who are the stakeholders?

FPGAs are well suited to exploit application parallelism!

Figure: System performance from parallelism (relative transistor performance, Giga to Exa scale, 1986 to 2021):

- Giga to Tera: 32x from transistor, 32x from parallelism
- Tera to Peta: 8x from transistor, 128x from parallelism
- Peta to Exa: 1.5x from transistor, 670x from parallelism

http://www.socforhpc.org/wp-content/uploads/2015/06/SBorkar-SoC-WS-DAC-June-7-2015-v1.pptx
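
The chart's factors can be checked with a few lines of Python. The sketch below multiplies the transistor and parallelism factors for each step and reports what share of the gain, on a log scale, comes from parallelism. Assigning each factor pair to a Giga/Tera/Peta/Exa step is my reading of the chart labels, and the log-scale share is an illustrative metric, not one from the slide.

```python
import math

# (gain from transistor, gain from parallelism) per 1000x step, read from the chart.
steps = {
    "Giga -> Tera": (32, 32),
    "Tera -> Peta": (8, 128),
    "Peta -> Exa": (1.5, 670),
}

results = {}
for step, (transistor, parallelism) in steps.items():
    total = transistor * parallelism
    # Fraction of the total gain, in log terms, contributed by parallelism.
    share = math.log(parallelism) / math.log(total)
    results[step] = (total, share)
    print(f"{step}: total {total:,.0f}x, parallelism share {share:.0%}")
```

Each step multiplies out to roughly 1000x, but the share that must come from parallelism grows with every step.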

#### What and where are the opportunities? Who are the stakeholders? (cont.)



- Memory-Latency Bound (FPGAs can pipeline deeply)
- Memory-Bandwidth Bound (FPGAs can do on-the-fly compression)

#### Name one thing that the FPGA industry should (or should not) do in the near term to facilitate FPGA's induction into supercomputers.

- Build integrated CGRAs: Addressing both compile-time and space constraints, while maintaining the advantages of FPGAs, will require increasing the amount and variety of hardened logic.






### A Brief History of CUDA GPUs in Supercomputers

- CUDA started in 2007
- First major adoption Tianhe-1 in 2009
  - David Kirk and Wen-mei Hwu tutorial in 2008
- First major US adoption Blue Waters and Titan in 2013
  - 3072 → 4224 Kepler GPUs in Blue Waters
  - NSF PAID program helps science teams to use GPUs
  - Numerous courses and tutorials along the way
- As of 2018, about 20% of the Blue Waters applications use GPUs in a significant way
- Most top supercomputers use GPUs in 2018



### What is limiting the GPU use today?

- Data transfer cost and GPU occupancy
  - Large granularity of the off-loaded compute is needed to get significant benefit
  - Level of data reuse must be high
- Insufficient DRAM/HBM capacity
  - Limited types of computation
- Programming effort
  - Application code often needs to be refactored
  - FORTRAN support is still weak
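
The first limit above (transfer cost versus offload granularity and data reuse) can be illustrated with a simple timing model. Everything in this sketch is hypothetical: the host, accelerator, and link rates are round numbers chosen for illustration, and the model ignores overlap of transfer with compute.

```python
def offload_speedup(bytes_moved, flops, host_flops_per_s, accel_flops_per_s, link_bytes_per_s):
    """Speedup of offloading versus staying on the host, with no transfer/compute overlap."""
    host_time = flops / host_flops_per_s
    accel_time = flops / accel_flops_per_s + bytes_moved / link_bytes_per_s
    return host_time / accel_time


# Hypothetical rates: 1 TFLOP/s host, 10 TFLOP/s accelerator, 16 GB/s link.
HOST, ACCEL, LINK = 1e12, 10e12, 16e9
DATA_BYTES = 1e9  # 1 GB moved to the accelerator

speedups = {}
for flops_per_byte in (1, 10, 100, 1000):
    # More flops per transferred byte means more data reuse on the accelerator.
    speedups[flops_per_byte] = offload_speedup(
        DATA_BYTES, DATA_BYTES * flops_per_byte, HOST, ACCEL, LINK
    )
    print(f"{flops_per_byte:>5} flops/byte -> speedup {speedups[flops_per_byte]:.2f}x")
```

In this model, offloading only pays off once enough computation is done per byte transferred, which matches the two sub-points about granularity and reuse.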

### **FPGA** as a Computing Device

- Strength
  - Logic and integer arithmetic
  - Flexible data reuse patterns
  - Energy efficiency
- Weakness
  - Clock frequency
  - Floating-point compute throughput
  - DRAM bandwidth
  - Program loading (reconfiguration)
  - DRAM latency tolerance

### **Q: One suggestion for the FPGA industry**

- FPGA in intelligent memory/storage controllers
  - Low-data-reuse compute
  - Data access pattern adjustment between memory bank and channels (e.g. strided accesses)
  - On-the-fly data compression/encryption
- Seamless collaboration with CPUs and GPUs crucial





# Erudite: placing NMA compute inside storage-class memory controllers







### FPGAs in Supercomputing: Challenges and Opportunities

Viraj Paropkari, Senior Manager, Global DC Marketing

25<sup>th</sup> February 2019



### 1. Is there a need to bring FPGAs into supercomputers? Why or why not?

- Absolutely YES..YES..YES
- Next-gen supercomputers will run many complex applications (no single killer workload, with AI integrated). At exascale and beyond, platform architectures need to change dramatically to accommodate power efficiency and flexibility in an HPC center. We believe FPGAs have the potential to get there. We don't expect developers to switch from CPUs/GPUs to FPGAs overnight, but there is enough traction in the HPC community, and we are working closely with it to define next-gen requirements, e.g. Versal with its capabilities such as AI engines, programmable HW, and SW maturity, and also Alveo board-level products with a standardized stack. It is entirely possible for CPUs, GPUs, and FPGAs to coexist in one system.
- FPGAs have great practical performance: GPUs deliver about 70% (in many cases 50%) of peak performance, which wastes about half of the HW. CPUs have high peak performance but high overhead as well. FPGAs deliver a higher percentage of peak performance with low overhead. This comes from the inherent architecture of logic units with an interconnected memory system, which needs less data movement; data movement is what consumes more power in CPUs and GPUs.
- Great power efficiency with FPGAs: 10x more power efficient than GPUs and 50x more power efficient than CPUs
- Use of FPGAs in HPC is not new, but it has been limited: HPC work from a few years ago shows many promising performance results using hand-tuned HDL code. That work remained confined to a very limited set of developers, because tuning an algorithm down to the HW level in HDL requires specialized HW knowledge. Now, with the maturity of SDAccel and HLS/OpenCL/C++ support from vendors such as Xilinx, the performance is encouraging relative to hand-tuned HDL code.

# Traditional HPC Application Requirements & FPGA Advantages

#### Compute Bound

- FPGAs have many compute DSP blocks, lots of registers, and flexible precision, e.g. Alveo U250
- Future FPGAs such as Versal are getting more powerful in flops rate: adaptable HW engines, intelligent engines, SW programmable engines

#### Memory Bandwidth Bound

- High-bandwidth HBM FPGA devices have recently become available, e.g. Alveo U280
- Lots of embedded memory and bandwidth compared to GPUs
  - 2x on-chip memory and 4x on-chip memory BW compared to Volta
- Other techniques such as on-the-fly compression/decompression

#### Memory Latency Bound

- FPGAs can pipeline deeply
- Algorithms such as FFTs benefit from deeper execution pipelines
- Versal will push the latency advantage further with high throughput
- Massive array of intelligent cores with local memory tightly coupled to adaptable HW

#### FAST

- Faster than CPUs & GPUs
- Latency advantage over GPUs

#### ADAPTABLE

- Optimized for any workload
- Adapt to changing algorithms

#### ACCESSIBLE

- Deploy in the cloud or on-premises
- Applications available now



### **Alveo Product Table**

| Category            | Specification                               | Alveo U200                                        | Alveo U250                                        | Alveo U280                                        |
|---------------------|---------------------------------------------|---------------------------------------------------|---------------------------------------------------|---------------------------------------------------|
| Dimensions          | Width                                       | Dual Slot                                         | Dual Slot                                         | Dual Slot                                         |
| Dimensions          | Form Factor, Passive<br>Form Factor, Active | Full Height, ¾ Length<br>Full Height, Full Length | Full Height, ¾ Length<br>Full Height, Full Length | Full Height, ¾ Length<br>Full Height, Full Length |
| Dimensions          | Weight (Passive/Active)                     | 1066g/1122g                                       | 1066g/1122g                                       | 1000g/1144g                                       |
| DRAM Memory         | DDR Format                                  | 4x 16GB 72b DIMM DDR4                             | 4x 16GB 72b DIMM DDR4                             | 2x 16GB 72b DIMM DDR4                             |
| DRAM Memory         | DDR Total Capacity                          | 64GB                                              | 64GB                                              | 32GB                                              |
| DRAM Memory         | DDR Max Data Rate                           | 2400MT/s                                          | 2400MT/s                                          | 2400MT/s                                          |
| DRAM Memory         | DDR Total Bandwidth                         | 77GB/s                                            | 77GB/s                                            | 38GB/s                                            |
| DRAM Memory         | HBM2 Total Capacity                         | -                                                 | -                                                 | 8GB                                               |
| DRAM Memory         | HBM2 Total Bandwidth                        | -                                                 | -                                                 | 460GB/s                                           |
| Internal SRAM       | Total Capacity                              | 35MB                                              | 54MB                                              | 41MB                                              |
| Internal SRAM       | Total Bandwidth                             | 31TB/s                                            | 38TB/s                                            | 30TB/s                                            |
| Interfaces          | PCI Express<sup>®</sup>                     | Gen3 x16                                          | Gen3 x16                                          | Gen4 x8 w/ CCIX                                   |
| Interfaces          | Network Interface                           | 2x QSFP28                                         | 2x QSFP28                                         | 2x QSFP28                                         |
| Power and Thermal   | Thermal Cooling                             | Passive, Active                                   | Passive, Active                                   | Passive, Active                                   |
| Power and Thermal   | Typical Power                               | 100W                                              | 110W                                              | 100W                                              |
| Power and Thermal   | Maximum Power                               | 225W                                              | 225W                                              | 225W                                              |
| Logic Resources     | Look-Up Tables                              | 892K                                              | 1,341K                                            | 1,079K                                            |
| Logic Resources     | Registers                                   | 1,831K                                            | 2,749K                                            | 2,179K                                            |
| Logic Resources     | DSP Slices                                  | 5,867                                             | 11,508                                            | 8,490                                             |
| Compute Performance | INT8 TOPs                                   | 18.6                                              | 33.3                                              | 24.5                                              |
|                     | Machine Learning                            | Machine Learning Solution Brief                   |                                                   |                                                   |
|                     | Acceleration Applications                   | Acceleration Application Solutions                |                                                   |                                                   |
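
The DDR bandwidth row in the table follows from the DIMM count and data rate. The sketch below reproduces it, assuming each 72-bit DIMM carries 64 data bits plus 8 ECC bits and that only the data bits count toward bandwidth; the function name is an illustrative choice.

```python
def ddr_bandwidth_gb_s(dimms, mega_transfers_per_s, data_bits=64):
    """Peak DDR bandwidth in GB/s: DIMMs x transfers per second x bytes per transfer."""
    bytes_per_transfer = data_bits / 8
    return dimms * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# DIMM counts and data rate taken from the table (2400 MT/s DDR4).
u200_u250 = ddr_bandwidth_gb_s(dimms=4, mega_transfers_per_s=2400)
u280 = ddr_bandwidth_gb_s(dimms=2, mega_transfers_per_s=2400)

print(f"Alveo U200/U250 DDR bandwidth: {u200_u250:.1f} GB/s")
print(f"Alveo U280 DDR bandwidth:      {u280:.1f} GB/s")
print(f"U280 HBM2 vs its DDR:          {460 / u280:.1f}x")
```

The results round to the 77GB/s and 38GB/s entries in the table, and show how much of the U280's bandwidth comes from HBM2 rather than DDR.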

# 2. Are there unique applications that are specifically suitable for FPGAs for supercomputing fields?

#### For compute:

- FPGAs are very good for the following traditional HPC scientific workloads:
  - Bioinformatics / gene sequencing: Illumina as a use case example
  - Material simulation: Quantum ESPRESSO, Maxeler use case
  - Weather forecasting: Maxeler use case with a university
  - Oil and gas imaging: Maxeler use case with Eni
- For emerging applications such as machine learning and neural networks, FPGAs are awesome

#### For storage:

- Storage acceleration, given the data explosion from experiments in scientific computing
- FPGAs provide end-to-end compression/decompression

#### For networking:

- Many architectures use FPGAs in inline network acceleration cards
- Ultra-low-latency communication with partial offload

#### Realizing the Promise of Personalized Medicine

Figure: 26 Hour Genome: Ultra-Rapid Whole Genome Diagnosis for Critically Ill Newborns (sample prep, sequencing, DRAGEN analysis, diagnosis). Previous best: 50 hours sample to answer. New world record: 26 hours sample to answer. Logos: Children's Mercy Hospital, Rady Children's Hospital-San Diego, Edico.

Figure: 3000<sup>3</sup> Seismic Imaging, presented by Eni, the Italian National Oil Company, at the Annual SEG Conference, 2010.





#### Global Weather Simulation in China

Imperial College London

#### Simulating Atmosphere Shallow Water Equation



[L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, Accelerating solvers for global atmospheric equations through mixed-precision data flow engine, FPL Conference 2013]

| Platform       | Performance | Speedup | Efficiency | Energy Improvement |
|----------------|-------------|---------|------------|--------------------|
| 6-core CPU     | 4.66K       | 1       | 20.71      | 1                  |
| Tianhe-1A node | 110.38K     | 23x     | 306.6      | 14.8x              |
| MaxWorkstation | 468.1K      | 100x    | 2.52K      | 121.6x             |
| Maxeler MPC-X  | 1.54M       | 330x    | 3K         | 144.9x             |

Chart annotations: 14x (Speedup), 9x (Energy Improvement).
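
The Speedup and Energy Improvement columns in the table above appear to be ratios of the Performance and Efficiency columns against the 6-core CPU row. The sketch below recomputes them as a consistency check; the units of the two source columns are not given on the slide, which does not matter for ratios.

```python
# (performance, efficiency) per platform, copied from the table.
rows = {
    "6-core CPU": (4.66e3, 20.71),
    "Tianhe-1A node": (110.38e3, 306.6),
    "MaxWorkstation": (468.1e3, 2.52e3),
    "Maxeler MPC-X": (1.54e6, 3e3),
}

base_perf, base_eff = rows["6-core CPU"]

# Ratios relative to the CPU baseline row.
derived = {
    name: (perf / base_perf, eff / base_eff)
    for name, (perf, eff) in rows.items()
}

for name, (speedup, energy_gain) in derived.items():
    print(f"{name:15s} speedup {speedup:7.1f}x   energy improvement {energy_gain:6.1f}x")
```

The recomputed ratios agree with the table to within rounding of the published figures.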





| System      | 1 rack of BlueGene/Q    | 8 Xilinx VU9P cards   | Comparison |
|-------------|-------------------------|-----------------------|------------|
| Space       | 205,920 in<sup>3</sup>  | 1,520 in<sup>3</sup>  | 135x       |
| Power       | 192 kW                  | 1 kW                  | 192x       |
| Performance | 338 cubes/s             | 2,237 cubes/s         | 6.6x       |

- BlueGene/Q contains significant water cooling and communication
  - FFT divided across 256 nodes
- Maxeler solution running on 8 VU9P devices in parallel
  - FFT in a single node
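
The comparison column can be recomputed from the raw figures, and the same figures give a combined performance-per-kilowatt ratio that the slide does not state. The sketch below does both; the combined ratio is a derived quantity, not a number from the original material.

```python
# Raw figures copied from the BlueGene/Q vs. VU9P comparison table.
bgq = {"space_in3": 205_920, "power_kw": 192, "cubes_per_s": 338}
fpga = {"space_in3": 1_520, "power_kw": 1, "cubes_per_s": 2_237}

space_ratio = bgq["space_in3"] / fpga["space_in3"]
power_ratio = bgq["power_kw"] / fpga["power_kw"]
perf_ratio = fpga["cubes_per_s"] / bgq["cubes_per_s"]

# Derived: throughput per kilowatt for each system, and the ratio between them.
bgq_per_kw = bgq["cubes_per_s"] / bgq["power_kw"]
fpga_per_kw = fpga["cubes_per_s"] / fpga["power_kw"]
perf_per_kw_ratio = fpga_per_kw / bgq_per_kw

print(f"space {space_ratio:.0f}x, power {power_ratio:.0f}x, performance {perf_ratio:.1f}x")
print(f"performance per kW: {perf_per_kw_ratio:.0f}x in favor of the FPGA cards")
```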

# For Machine learning and neural networks, Xilinx FPGAs are really good



Xilinx FPGAs are faster than CPUs, GPUs, and Intel FPGAs

- > 20x speedup over CPUs
- 4x speedup over NVIDIA GPUs
- 16x speedup over Intel FPGAs

Xilinx FPGAs are more power efficient than CPUs, GPUs, and Intel FPGAs

- 2x more efficient than NVIDIA GPUs
- 7x more efficient than Intel FPGAs

### Infuse Machine Learning with other accelerations



# 3. What are the challenges and/or major issues facing FPGAs for supporting supercomputing?

#### Double-precision floating-point myth created by GPUs and other architectures

- The important metric is achieved performance as a percentage of peak; one can't just look at theoretical flop/s and rule out FPGAs

#### Scalability at the cluster level

- Beyond 1 node is not mainstream yet

#### Ease of programming

- HPC work from a few years ago shows many promising performance results using hand-tuned HDL code. That work remained confined to a very limited set of developers, because tuning an algorithm down to the HW level in HDL requires specialized HW knowledge. Now, with the maturity of SDAccel and HLS/OpenCL/C++ support from vendors such as Xilinx, the performance is encouraging relative to hand-tuned HDL code.
- Support for FPGAs in higher-level programming models such as OpenMP and OpenACC
  - Still at a very nascent stage



© Copyright 2019 Xilinx

### Xilinx platform supports wider set of use cases vs GPU

#### Versal Expands To Even More Use Cases

| Domain     | Use Cases                                                                           | Xilinx Platforms                                                                                                                                                                                     | GPU                                                                                                                                                                                    |
|------------|-------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Compute    | ML Inference<br>Database / Big Data<br>Video<br>Financial Services<br>Genomics      | <ul> <li>✓ Low batch performance</li> <li>✓ Low / variable precision performance</li> <li>✓ Flexible datapath/memory for power efficiency</li> <li>✓ Hardware &amp; software programmable</li> </ul> | <ul> <li>X Parallel architecture poor fit for low-batch ML</li> <li>X Fixed and inflexible datapath and memory</li> <li>X SIMD architecture → inflexible &amp; power hungry</li> </ul> |
|            | High Precision HPC<br>ML Training                                                   | × Optimized for fixed point training and HPC                                                                                                                                                         | ✓ Optimized for high precision floating point                                                                                                                                          |
| Storage    | Compression<br>Encryption<br>Key-Value Store<br>Database / Big Data<br>ML Inference | <ul> <li>✓ Processing near memory / storage</li> <li>✓ Flexible low latency in-line processing</li> <li>✓ Adaptable parallel memory hierarchy</li> </ul>                                             | Poor Fit                                                                                                                                                                               |
| Networking | IPSec/SSL<br>OVS offload<br>Bare Metal Services<br>Security<br>Monitoring           | <ul> <li>✓ Optimization of latency and efficiency</li> <li>✓ Rich I/O, flex datapath for inline processing</li> <li>✓ Power efficient, flexible datapath/memory</li> </ul>                           | Poor Fit                                                                                                                                                                               |

# 4. What and where are the opportunities? Who are the stakeholders?

- Need participation from a wide set of users, ranging from developers to OEMs

#### HPC domain scientists

- Willing to put effort into algorithm optimization
- Library developers

#### HPC developers: traditional HDL/Verilog users

- Converting HDL codes to a higher level

#### CPU and FPGA vendors

- Fast cache-coherent networks that help HPC platform architecture, e.g. CCIX (AMD), OpenCAPI (IBM Power)
- Alveo U280 board is CCIX compliant

#### HPC OEMs: Cray, Atos, Dell

- Effort to create an HPC tool chain and communication-optimized libraries
- Resource management SW, e.g. Slurm

# 5. Name one thing that the FPGA industry should (or should not) do in the near term to facilitate FPGA's induction into supercomputers.

- The future of FPGAs in mainstream HPC is bright, and there is a path to it with recent advancements in HW and SW
- Do not expect FPGAs to solve all application problems overnight. For low-power, single-precision workloads, they can yield immediate benefits if developers put in the effort
- Focus on targeted domains where FPGAs have shown a value proposition
- Focus on building tools, SW libraries, and middleware
- For double-precision workloads:
  - The key is precision control: developers/scientists should be willing to experiment with lower precision/mixed precision with acceptable accuracy



## Adaptable. Intelligent.



