# Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center

Naif Tarafdar, Thomas Lin, Eric Fukuda,

Hadi Bannazadeh, Alberto Leon-Garcia, Paul Chow

University of Toronto

# Cloudy with a chance of FPGAs?

- Big data needs more compute power: how about FPGAs in datacenters?
- Datacenters are an approximately 200-billion-dollar industry
- Datacenter applications are large
- No large-scale multi-FPGA fabric was deployed in a datacenter until 2014



## Cloudy with a Chance of FPGAs?

- Microsoft Catapult
  - 1632 Servers
  - Bing search engine
  - 10% more power, 95% more throughput
- Intel acquisition of Altera in December 2015
- More than just a *chance*




# Large Clusters Difficult To Manage

- Large clusters of resources are difficult to manage
- Expensive
- Solution:
  - Give users a framework to create their own clusters from a pool of available resources



- Byma et al.:
  - FPGA broken into partial FPGAs
  - Multiple users share portions of the FPGA
- IBM Supervessel:
  - FPGA tightly coupled with virtual machine CPU
  - Connected to CPU via shared memory
  - Network connection through CPU
- Amazon:
  - FPGA(s) tightly coupled with virtual machine CPU
  - Up to 8 FPGAs connected via high-performance network link
- IBM Hyperscale:
  - Network-connected FPGAs
  - Modified OpenStack to accept a bitstream and return an IP address and a programmed FPGA to the user


# Problems We Target

- Large multi-FPGA systems
  - Create abstraction between FPGAs in multi-FPGA systems
  - Easy scalability of system
- Network capabilities
  - FPGA cluster directly accessible by any other network device in the datacenter



## Overall System View

- Input from user:
  - Logical cluster description
  - FPGA mapping file
- FPGA Cluster Generator:
  - Output to VM with FPGA tools: individual FPGA projects
  - Output to cloud manager: command for resource allocation, commands for connecting FPGAs to the network
  - Output to user: MAC addresses of the FPGAs in the multi-FPGA cluster
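The generator's core step, turning a logical cluster description and an FPGA mapping into per-FPGA projects, can be sketched roughly as follows. This is only an illustration under stated assumptions: the talk does not specify the file formats, so the list/dictionary layout, the kernel names, and the function `plan_cluster` are all hypothetical.

```python
from collections import defaultdict


def plan_cluster(kernels, connections, mapping):
    """Group kernels per FPGA and classify each connection.

    kernels:     list of kernel names (hypothetical logical cluster description)
    connections: list of (source_kernel, destination_kernel) pairs
    mapping:     dict of kernel name -> FPGA index (hypothetical FPGA mapping file)
    """
    missing = [k for k in kernels if k not in mapping]
    if missing:
        raise ValueError(f"kernels without an FPGA assignment: {missing}")

    # One project per FPGA, holding the kernels mapped to it.
    projects = defaultdict(list)
    for kernel in kernels:
        projects[mapping[kernel]].append(kernel)

    # A connection stays on chip if both ends share an FPGA;
    # otherwise it has to cross the network.
    on_chip, over_network = [], []
    for src, dst in connections:
        if mapping[src] == mapping[dst]:
            on_chip.append((src, dst))
        else:
            over_network.append((src, dst))

    return dict(projects), on_chip, over_network


# Hypothetical four-kernel pipeline split across two FPGAs.
kernels = ["parse", "filter", "join", "emit"]
connections = [("parse", "filter"), ("filter", "join"), ("join", "emit")]
mapping = {"parse": 0, "filter": 0, "join": 1, "emit": 1}

projects, on_chip, over_network = plan_cluster(kernels, connections, mapping)
print(projects)
print(over_network)
```

In this sketch, changing only `mapping` moves kernels between FPGAs without touching the logical description, which is the kind of abstraction the generator is meant to provide.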

## Baseline Infrastructure

- SAVI (Smart Applications on Virtualized Infrastructure)
- OpenStack (Cloud Managing Software)
- Xilinx SDAccel (FPGA Hypervisor)



# SAVI (Smart Applications on Virtualized Infrastructure)





# Cloud Managing Software: OpenStack




## FPGA Hypervisor: Xilinx SDAccel

- Abstracts the physical hardware on the FPGA and provides a software interface to these modules
- Publicly available through Xilinx
- No network interface




# Contributions

1. Non-network FPGA from cloud
2. Networking infrastructure for FPGAs to communicate in heterogeneous network
   - Modified FPGA hypervisor for networking support
   - FPGAs have MAC addresses and are accessible by any network device in the datacenter
3. FPGA cluster generator
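Because each FPGA is reachable by its MAC address, any device on the same network can address it with an ordinary Ethernet frame. The sketch below builds a generic Ethernet II frame in Python. It is an illustration, not the talk's protocol: the frame layout the FPGAs expect is not described in the slides, the MAC addresses are made-up locally administered addresses, and the EtherType `0x88B5` (an IEEE local experimental value) is an assumption.

```python
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    parts = mac.split(":")
    if len(parts) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return bytes(int(part, 16) for part in parts)


def build_frame(dst_mac, src_mac, payload, ethertype=0x88B5):
    """Build an Ethernet II frame without the trailing frame check sequence.

    The default EtherType is an assumption chosen for illustration.
    """
    header = (
        mac_to_bytes(dst_mac)
        + mac_to_bytes(src_mac)
        + struct.pack("!H", ethertype)  # 16-bit EtherType, network byte order
    )
    # Ethernet requires at least 46 payload bytes; pad short payloads with zeros.
    padded = payload.ljust(46, b"\x00")
    return header + padded


# Hypothetical addresses: destination is an FPGA, source is a VM.
frame = build_frame("02:00:00:00:00:01", "02:00:00:00:00:02", b"hello")
print(len(frame), frame[:14].hex())
```

Actually transmitting such a frame would need a raw socket and the appropriate privileges, which is outside the scope of this sketch.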



# Non-network FPGA from Cloud



Deployment Flow

1. User develops their application on a VM without an FPGA.
2. Save a snapshot of the VM.
3. Upload the VM snapshot to OpenStack.
4. Create a new VM with the snapshot and an FPGA.
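Steps 3 and 4 of the flow above might look like the following when driven through the standard `openstack` command-line client. This is a hedged sketch: the talk does not show the actual commands used on SAVI, and the file name, image name, VM name, the `qcow2` disk format, and the FPGA-enabled flavor `fpga.small` are all hypothetical.

```python
import shlex


def deployment_commands(snapshot_file, image_name, fpga_flavor, vm_name):
    """Return the CLI commands for steps 3 and 4 as argument lists."""
    upload_snapshot = [
        "openstack", "image", "create",
        "--disk-format", "qcow2",       # assumed snapshot format
        "--container-format", "bare",
        "--file", snapshot_file,
        image_name,
    ]
    boot_with_fpga = [
        "openstack", "server", "create",
        "--image", image_name,
        "--flavor", fpga_flavor,        # hypothetical flavor that includes an FPGA
        vm_name,
    ]
    return [upload_snapshot, boot_with_fpga]


commands = deployment_commands(
    "app-dev.qcow2", "app-dev-snapshot", "fpga.small", "app-with-fpga"
)
for command in commands:
    # Print the commands instead of running them, so the sketch has no side effects.
    print(shlex.join(command))
```

Returning argument lists rather than strings keeps the commands safe to pass to `subprocess.run` later without shell quoting issues.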




# FPGA Hypervisor: Networking Hypervisor

- Customized shell with:
  - PCIe module
  - Off-chip memory controller
  - 1 Gb Ethernet
- Note: no partial reconfiguration



# Logical Cluster Description




# Physical Mapping





## I/O to FPGAs in Cluster

[Figure: I/O to the FPGAs in a cluster; the only surviving label is "Input"]



# Scaling Up the Clusters

[Figure: scaling up the clusters; the only surviving label is "Output Module"]





#### Networking Backend




#### Resource Utilization

| Hardware Setup   | LUTs          | Flip-Flops    | BRAM        |
|------------------|---------------|---------------|-------------|
| SDAccel Base     | 53346 (12.3%) | 64550 (7.45%) | 228 (15.5%) |
| Ethernet Support | 8998 (2.1%)   | 11574 (1.34%) | 0 (0%)      |
| Input Module     | 169 (0.039%)  | 294 (0.033%)  | 2 (1.36%)   |
| Output Module    | 773 (0.178%)  | 402 (0.059%)  | 4 (2.72%)   |
| Total Available  | 433200        | 866400        | 1470        |
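One way to read the table is to ask how much logic the networking additions contribute on top of the SDAccel base. The short calculation below sums the Ethernet support, input module, and output module rows and expresses the result relative to the base shell. It uses only the raw counts as given in the table; the comparison against the base (rather than against the whole device) is a choice made for this illustration.

```python
# Resource counts copied from the table above.
base = {"LUTs": 53346, "Flip-Flops": 64550, "BRAM": 228}
added = {
    "Ethernet Support": {"LUTs": 8998, "Flip-Flops": 11574, "BRAM": 0},
    "Input Module": {"LUTs": 169, "Flip-Flops": 294, "BRAM": 2},
    "Output Module": {"LUTs": 773, "Flip-Flops": 402, "BRAM": 4},
}

# Total resources added by the networking support, per resource type.
totals = {
    resource: sum(module[resource] for module in added.values())
    for resource in base
}

# Added resources as a fraction of what the SDAccel base already uses.
relative = {resource: totals[resource] / base[resource] for resource in base}

for resource in base:
    print(f"{resource}: +{totals[resource]} ({relative[resource]:.1%} of base)")
```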


# Testing Latency and Throughput

- Directly connected CPU to FPGA
- VM to one FPGA in SAVI
- VM to two-FPGA chain in SAVI
- VM to three-FPGA chain in SAVI

### Round-trip Latency

| Test        | Latency (ms) |
|-------------|--------------|
| CPU + FPGA  | 0.0650          |
| VM + 1 FPGA | 0.500           |
| VM + 2 FPGA | 0.645           |
| VM + 3 FPGA | 0.790           |



- CPU + FPGA versus VM + 1 FPGA: CPU > VM, plus an extra network hop
- VM + 1 FPGA to VM + 3 FPGA: latency grows linearly, by 0.145 ms per additional FPGA
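The linear trend in the VM measurements can be checked with a small least-squares fit over the three reported points. The sketch below uses only the latencies from the table; the function name `fit_line` is ours, and any value predicted beyond three FPGAs would be an extrapolation, not a measurement from the talk.

```python
# Round-trip latency in ms for a VM talking to a chain of n FPGAs (from the table).
latency_ms = {1: 0.500, 2: 0.645, 3: 0.790}


def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    xs = list(points.keys())
    ys = list(points.values())
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(latency_ms)
print(f"per-FPGA cost: {slope:.3f} ms, fixed cost: {intercept:.3f} ms")
```

The slope is the cost of each extra FPGA in the chain, and the intercept is the fixed cost of the VM reaching the first FPGA over the network.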









# Case Study: Scalability of Query Processing Engine

- Representative Case study: Database Streaming Query Processing Engine
  - Size
  - Streaming
- Scalable







[Figure: throughput (Mbit/s) of replications]

#### Conclusion and Summary

- Users can easily create elastic FPGA clusters from the cloud
  - Inter-FPGA fabric is generated automatically
  - FPGAs are provided with a network interface
- Little overhead
- Easy to scale

# Future Work

- Infrastructure Upgrade
  - 10G
  - Partial Reconfiguration
- Automatic Partitioning/Scheduling
  - HLS Model (Scheduler): Behavioral
  - Circuit Partitioning
- Debugging of Large Clusters
  - Combine individual debug environments
  - Monitor health
- Large-Scale Applications
  - Networking Applications (NFV)
  - Distributed Applications (Web search)
  - Heterogeneous IoT Applications



# Thank You



## Questions?

Email: naif.tarafdar@mail.utoronto.ca


