### Don't Forget the Memory: Automatic Block RAM Modelling, Optimization, and Architecture Exploration

S. Yazdanshenas, K. Tatsumura<sup>\*</sup>, and V. Betz

University of Toronto, Canada; <sup>\*</sup>Toshiba Corporation, Japan











#### BRAM growing in importance

- Many applications (search engines, CNNs, ...) are BRAM-intensive
- Can't fully utilize FPGA's computation capacity without on-chip memory



### **BRAM's Evolution**

- Memory-richness growth
- Organization changes



# The Key Point

- BRAMs can't be neglected!
  - ~25% of area
  - Should respond to application demands
- Need BRAM models
  - How efficient is an architecture?
  - What's the best architecture?



# **BRAM design is Difficult**

- BRAM design is challenging!
  - Analog nature of some components
  - Variability of memory cells
  - Custom layout style
  - Significant FPGA-specific peripheral circuitry

#### Hand design of each candidate BRAM infeasible



# Use existing tools?

#### **BRAM** design is Different

- CACTI underestimates area
- CACTI also underestimates read energy
- CACTI overestimates operating frequency



# **Emerging Memory Technologies**

- Model promising emerging memory technologies
  - Magnetic Tunnel Junction (MTJ)
  - Phase Change Memory (PCM)
  - Resistive RAM (RRAM)
- Ideal: model any technology with SPICE support



# **BRAM Design Tool**

#### What do we need?



#### **COFFE: Logic & Routing**



C. Chiasson et al. "COFFE: Fully-Automated Transistor Sizing for FPGAs," FPT 2013


#### **COFFE BRAM Flow**







### 2-Bank MTJ-Based BRAM



## Simulation In Context: SRAM Precharge




- Variation changes SRAM cell read/write currents significantly
- Using the nominal memory cell will be inaccurate
  - BRAM energy will be underestimated
  - BRAM frequency will be overestimated
- We don't have the SPICE model for the worst-case cell!
- Use Monte Carlo simulation to find distribution of cell properties
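
The Monte Carlo idea can be sketched in a few lines. The toy model below is not the SPICE-based flow from the talk: it assumes a hypothetical square-law read current with a normally distributed threshold voltage, and every constant (`VDD`, `VTH_NOMINAL`, `VTH_SIGMA`, `K`) is an illustrative placeholder. It only shows how sampling yields a distribution from which a weak, low-current cell can be picked.

```python
import random
import statistics

# All constants are illustrative assumptions, not values from the talk.
VDD = 0.9            # supply voltage (V)
VTH_NOMINAL = 0.35   # nominal threshold voltage (V)
VTH_SIGMA = 0.03     # standard deviation of threshold voltage (V)
K = 200e-6           # transconductance factor (A/V^2)


def read_current(vth: float) -> float:
    """Toy square-law model of cell read current for a given threshold voltage."""
    overdrive = max(VDD - vth, 0.0)
    return 0.5 * K * overdrive ** 2


def sample_read_currents(n_samples: int, seed: int = 0) -> list[float]:
    """Draw n_samples cells with random threshold voltage; return sorted currents."""
    rng = random.Random(seed)
    currents = [read_current(rng.gauss(VTH_NOMINAL, VTH_SIGMA)) for _ in range(n_samples)]
    return sorted(currents)


def weak_cell_current(sorted_currents: list[float], quantile: float) -> float:
    """Return the current at a lower quantile (e.g. 0.001 = weakest 0.1% of cells)."""
    index = int(quantile * (len(sorted_currents) - 1))
    return sorted_currents[index]


currents = sample_read_currents(100_000)
nominal = read_current(VTH_NOMINAL)
weak = weak_cell_current(currents, 0.001)
print(f"nominal read current: {nominal * 1e6:.1f} uA")
print(f"median read current:  {statistics.median(currents) * 1e6:.1f} uA")
print(f"weak-cell current:    {weak * 1e6:.1f} uA ({weak / nominal:.0%} of nominal)")
```

A cell with lower read current takes longer to develop a bitline swing, which is consistent with the nominal cell overestimating BRAM frequency.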

## Monte Carlo Simulation: Worst-case Cell



K. Tatsumura et al. "High Density, Low Energy, Magnetic Tunnel Junction Based Block RAMs for Memory-Rich FPGAs," FPT 2016









# Validation and Results

#### Area Validation




### SRAM-based BRAM Area Breakdown

- SRAM area dominates for large BRAMs
- Smaller BRAMs: other components relevant



#### Area: MTJ vs. SRAM

- MTJ is more area-efficient
- It gets increasingly more efficient with BRAM size



### **Frequency validation**

- Reasonable alignment with commercial data
- Less guardband
- No aggressive banking




# Operating Frequency: MTJ vs. SRAM

- SRAM is faster
- The gap narrows with increasing size



### Simulation Results: Energy




#### SRAM-based BRAM Narrow Mode

- FPGA RAM configurable
- Often used in narrower modes
- Energy mostly unaffected



# Energy Per Bit: MTJ vs. SRAM

- MTJs are more efficient with large BRAM
- MTJ narrow mode more efficient



# Worst-Case Cell Modeling Crucial

- Use nominal memory cell?
  - Underestimates delay and energy
  - Gets more severe with increasing memory size

| BRAM Capacity<br>(Kbit) | Change in Delay | Change in Energy per<br>bit |
|-------------------------|-----------------|-----------------------------|
| 8                       | -21%            | -9%                         |
| 16                      | -19%            | -6%                         |
| 32                      | -27%            | -15%                        |
| 64                      | -22%            | -9%                         |
| 128                     | -30%            | -20%                        |
| 256                     | -42%            | -29%                        |
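
Because frequency is the reciprocal of delay, a delay underestimate translates into a larger frequency overestimate. The sketch below applies that conversion to the delay column of the table above. It assumes the maximum operating frequency is simply 1/delay, which is an illustrative simplification, and the function name is hypothetical.

```python
# Change in delay when the nominal cell replaces the worst-case cell,
# copied from the table above (BRAM capacity in Kbit -> fractional change).
DELAY_CHANGE = {8: -0.21, 16: -0.19, 32: -0.27, 64: -0.22, 128: -0.30, 256: -0.42}


def frequency_change(delay_change: float) -> float:
    """Fractional frequency change implied by a fractional delay change, assuming f = 1 / delay."""
    return 1.0 / (1.0 + delay_change) - 1.0


for capacity_kbit, change in DELAY_CHANGE.items():
    print(f"{capacity_kbit:>3} Kbit: delay {change:+.0%} -> frequency {frequency_change(change):+.0%}")
```

For the 256 Kbit BRAM, for example, a 42% delay underestimate corresponds to roughly a 72% frequency overestimate under this assumption.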

### Architecture Exploration: RAM-Mapping Flow

- BRAM models generated by COFFE can be used in architecture exploration
- Area-oriented RAM mapping
  - 69 industrial circuits
  - Used in development of Stratix V memory architecture
  - We have partial data:
    - Number of logic blocks used
    - Number, Sizes, and types of Logical RAMs
  - Gradually excluding less-memory-rich circuits
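
As a rough illustration of what area-oriented RAM mapping computes, the sketch below counts how many physical BRAMs a logical RAM needs in each configurable depth x width mode and picks the mode with the fewest blocks. This is a simplified stand-in, not the flow used in the talk: the power-of-two mode list, the `max_width` limit, the function names, and the example logical RAMs are all assumptions, and real mappers also account for port types and other constraints.

```python
from math import ceil


def bram_modes(capacity_bits: int, max_width: int) -> list[tuple[int, int]]:
    """List (depth, width) modes of a BRAM, assuming power-of-two widths up to max_width."""
    modes = []
    width = 1
    while width <= max_width:
        modes.append((capacity_bits // width, width))
        width *= 2
    return modes


def blocks_needed(
    logical_depth: int, logical_width: int, capacity_bits: int, max_width: int
) -> tuple[int, tuple[int, int]]:
    """Return (fewest physical BRAMs, chosen (depth, width) mode) for one logical RAM."""
    best = None
    for depth, width in bram_modes(capacity_bits, max_width):
        # Tile the logical RAM with physical blocks in both dimensions.
        count = ceil(logical_depth / depth) * ceil(logical_width / width)
        if best is None or count < best[0]:
            best = (count, (depth, width))
    return best


# Hypothetical logical RAMs (depth, width) mapped onto 16 Kbit BRAMs with widths up to 32.
for depth, width in [(512, 32), (4096, 8), (2048, 72)]:
    count, mode = blocks_needed(depth, width, 16 * 1024, 32)
    print(f"{depth}x{width} -> {count} BRAM(s) in {mode[0]}x{mode[1]} mode")
```

Summing the block count over all logical RAMs in a circuit, multiplied by the modelled area of one BRAM, gives a memory area figure that can be compared across candidate BRAM sizes.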

#### **SRAM-based BRAM**

- 16 Kbit is always the best
- Stratix V-like



#### **MTJ-based BRAM**

- MTJ always saves area
- The best architecture changes
  - 32 Kbit or 64 Kbit is best



- Nine VTR benchmark circuits with memory
- MTJ vs. SRAM
- Architecture Parameters
  - 32 Kbit BRAM, every 8 columns
    - MTJ BRAM is smaller $\rightarrow$ 2.3x more RAM blocks per column
  - Ten 6-LUTs per logic block

## Architecture Exploration: VTR

## Changes by switching to MTJ-based BRAMs:

| Circuit          | RAM/LUT<br>Ratio | Block<br>Area | Routing<br>Area | Total<br>Area | Block<br>Delay | Routing<br>Delay | Total<br>Delay | Area-delay<br>Product |
|------------------|------------------|---------------|-----------------|---------------|----------------|------------------|----------------|-----------------------|
| mcml             | 1%               | -11%          | 0               | -5%           | 5%             | -19%             | -6%            | -10%                  |
| LU32PEEng        | 7%               | -14%          | 9%              | -2%           | 9%             | -6%              | 0              | -1%                   |
| LU8PEEng         | 6%               | -13%          | 4%              | -5%           | -4%            | 9%               | 3%             | -2%                   |
| ch_intrinsics    | 2%               | -15%          | -2%             | -10%          | -1%            | 6%               | 3%             | -7%                   |
| mkDelayWorker32B | 24%              | -32%          | -47%            | -41%          | 60%            | -40%             | 3%             | -39%                  |
| mkPktMerge       | 198%             | -57%          | -48%            | -53%          | 141%           | -34%             | 12%            | -47%                  |
| mkSMAdapter4B    | 8%               | -16%          | 0               | -10%          | 92%            | -32%             | 9%             | -2%                   |
| or1200           | 2%               | -5%           | -4%             | 0             | 1%             | -7%              | -2%            | -3%                   |
| raygentop        | 1%               | -3%           | 2%              | -1%           | 9%             | -17%             | -4%            | -5%                   |
| boundtop         | 1%               | -3%           | 1%              | -1%           | 8%             | -5%              | 0              | -2%                   |
| Geometric Mean   | 4%               | -19%          | -11%            | -15%          | 25%            | -16%             | 2%             | -14%                  |
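
The last column follows from the two total columns: the change in area-delay product is (1 + area change) x (1 + delay change) - 1. The sketch below recomputes it for a few rows of the table above. Since the published percentages are rounded, a recomputed value can differ from the table by about one percentage point; the function name is hypothetical.

```python
# (total area change, total delay change, reported area-delay change), in percent,
# copied from the table above.
ROWS = {
    "mcml": (-5, -6, -10),
    "mkDelayWorker32B": (-41, 3, -39),
    "mkPktMerge": (-53, 12, -47),
    "mkSMAdapter4B": (-10, 9, -2),
    "Geometric Mean": (-15, 2, -14),
}


def area_delay_change(area_pct: float, delay_pct: float) -> float:
    """Percent change in area-delay product from percent changes in area and delay."""
    return ((1 + area_pct / 100) * (1 + delay_pct / 100) - 1) * 100


for name, (area, delay, reported) in ROWS.items():
    computed = area_delay_change(area, delay)
    print(f"{name:<17} computed {computed:+.1f}%  reported {reported:+d}%")
```

The mkPktMerge row shows why the product is useful: total delay rises by 12%, yet the 53% area reduction still leaves the area-delay product well below the SRAM baseline.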

## Conclusion

- First transistor sizing tool capable of BRAM modeling
  - SRAM-based
  - MTJ-based
- Simulation results align well with available commercial data
- COFFE now enables BRAM architecture exploration!
  - RAM-Mapping
  - VTR