### Impact of Memory Architecture on FPGA Energy Consumption



Edin Kadric, David Lakata, André DeHon



#### FPGA 2015

- FPGAs are one-size-fits-all architectures
  - $\rightarrow$ **mismatch** between App and Arch
- Memory cost: $\text{Energy}(32\,\text{Kb}) = 2 \times \text{Energy}(8\,\text{Kb})$
- Memory mismatch $\rightarrow$ excess energy, both when $M_{app} < M_{arch}$ and when $M_{arch} < M_{app}$
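To make the oversizing case ($M_{app} < M_{arch}$) concrete, here is a minimal sketch. It assumes, purely for illustration, that energy per access follows a power law in capacity fitted to the single data point above (4x the capacity costs 2x the energy). The power-law form and the names `relative_energy` and `oversize_overhead_pct` are assumptions, not the authors' energy model.

```python
import math

# Assumption: E(C) is proportional to C**ALPHA, with ALPHA fitted to the one
# data point on the slide: Energy(32Kb) = 2 x Energy(8Kb).
ALPHA = math.log(2) / math.log(32 / 8)  # about 0.5


def relative_energy(capacity_kb, reference_kb=8):
    """Energy of a memory of capacity_kb relative to one of reference_kb."""
    return (capacity_kb / reference_kb) ** ALPHA


def oversize_overhead_pct(m_app_kb, m_arch_kb):
    """Percent energy overhead when the application needs m_app_kb but the
    architecture only offers a larger m_arch_kb block (M_app < M_arch)."""
    if m_arch_kb < m_app_kb:
        raise ValueError("this sketch only covers M_app <= M_arch")
    return 100 * (relative_energy(m_arch_kb, m_app_kb) - 1)


if __name__ == "__main__":
    print(relative_energy(32))           # energy of 32Kb relative to 8Kb
    print(oversize_overhead_pct(8, 32))  # overhead of an 8Kb app on 32Kb blocks
```

Under this assumed model, an application memory that is four times smaller than the architecture's block pays roughly double the energy per access.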



### Mismatch Also Impacts Area

- "Architectural enhancements in Stratix V" Lewis *et al.* (FPGA 2013)
  - Moved from Stratix IV





- Goal: minimize area
- Question: How about energy?

- Stratix V: 20Kb memories, ~10% of columns
- <u>~Stratix V</u>: 16Kb memories placed every 10 columns
- On spree.v: 147% energy overhead
- Alternative: 64Kb memories placed every 5 columns
  - 7% overhead
  - Energy reduced by 2.3x
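The 2.3x figure follows from the two overheads, assuming both are measured against the same baseline energy for spree.v. A minimal sketch of that arithmetic (the function name `energy_ratio` is hypothetical):

```python
def energy_ratio(overhead_a_pct, overhead_b_pct):
    """Ratio of total energies for two designs whose overheads are given in
    percent relative to the same baseline (assumption)."""
    return (100 + overhead_a_pct) / (100 + overhead_b_pct)


if __name__ == "__main__":
    # ~Stratix V (147% overhead) versus the 64Kb alternative (7% overhead)
    print(round(energy_ratio(147, 7), 2))
```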




### To Reduce Energy (Mismatch):

- How large should the on-chip memories be?
- How frequently should they be placed?
- How should they be organized internally?
- How many different levels of memory?



### Outline

- 1) Motivation
- 2) Mismatch Reduction Techniques
  - a) Multiple Memory Levels
  - b) Continuous Hierarchy Memory (CHM)
- 3) Experimental Architecture Exploration

### Multiple Memory Levels

- Provides more choice to the mapping tool
- Stratix V: M20K (~16K)
- Stratix IV: M9K & M144K (~8K & 128K)

### Continuous Hierarchy Memory (CHM)

- Break down into smaller banks
- Access memory closest to I/O
- Recursively break down bank closest to I/O
- The hierarchy becomes more "continuous"

*[Figure: banked memory with signals Addr_in, Addr[n-1], Addr[n-2], Addr[n-3:0], and Data]*

"Kung Fu data energy, minimizing communication energy in FPGA computations" (Kadric *et al.* FCCM 2014)
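The recursive banking idea can be sketched as follows. This is an illustrative model only: it assumes each level splits the bank nearest the I/O in half, that the top address bit at each level selects between the far half (bit = 1) and the near half (bit = 0), and that bank 0 is the farthest. The bit polarity and the function names `chm_bank_sizes` and `chm_locate` are assumptions, not the authors' exact organization.

```python
def chm_bank_sizes(addr_bits, levels):
    """Bank sizes in words, farthest bank first, after `levels` recursive
    halvings of a 2**addr_bits-word memory (requires levels <= addr_bits)."""
    sizes = [2 ** (addr_bits - lvl) for lvl in range(1, levels + 1)]
    sizes.append(2 ** (addr_bits - levels))  # the nearest, smallest bank
    return sizes


def chm_locate(addr, addr_bits, levels):
    """Return (bank_index, offset) for an address.

    Addr[n-1] picks the far half, Addr[n-2] picks the far quarter of the
    remaining half, and so on; the remaining low bits index within the bank.
    """
    for lvl in range(1, levels + 1):
        bit = addr_bits - lvl
        if (addr >> bit) & 1:
            return lvl - 1, addr & ((1 << bit) - 1)
    return levels, addr & ((1 << (addr_bits - levels)) - 1)


if __name__ == "__main__":
    print(chm_bank_sizes(4, 2))      # a 16-word memory split over two levels
    for a in (0b1010, 0b0110, 0b0011):
        print(bin(a), chm_locate(a, 4, 2))
```

In this model, data mapped to low addresses lands in the small banks next to the I/O, so a small working set only pays for small, nearby accesses.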

### Outline

- 1) Motivation
- 2) Mismatch Issues
- 3) Experimental Architecture Exploration



### Limit Study

- We want memories of the right size at the right place
  - Assume the right size when calculating energy
  - Place them everywhere (every other column)
    - $\rightarrow$  pretend horizontal memory crossings are free













<u>% Geomean Overhead</u> (rows: dm; columns: Mem Size (Kb); the original heat map marks the ~Stratix V point)

| dm / Mem Size (Kb) | 1  | 2  | 4  | 8  | 16 | 32 | 64 | 128 | 256 |
|--------------------|----|----|----|----|----|----|----|-----|-----|
| 11                 | 50 | 39 | 36 | 35 | 48 | 53 | 77 | 120 | 210 |
| 10                 | 50 | 32 | 28 | 25 | 24 | 32 | 34 | 66  | 98  |
| 9                  | 47 | 32 | 22 | 22 | 24 | 31 | 32 | 64  | 100 |
| 8                  | 43 | 30 | 27 | 24 | 25 | 31 | 32 | 65  | 98  |
| 7                  | 41 | 28 | 23 | 21 | 24 | 29 | 34 | 69  | 100 |
| 6                  | 42 | 32 | 24 | 23 | 24 | 32 | 36 | 68  | 110 |
| 5                  | 39 | 30 | 25 | 21 | 26 | 32 | 37 | 69  | 110 |
| 4                  | 42 | 31 | 28 | 21 | 29 | 34 | 41 | 73  | 120 |
| 3                  | 39 | 32 | 27 | 25 | 32 | 37 | 44 | 79  | 140 |
| 2                  | 43 | 33 | 30 | 26 | 37 | 41 | 58 | 85  | 150 |
| 1                  | 44 | 38 | 42 | 38 | 61 | 63 | 92 | 130 | 230 |

<u>% Worst-case Overhead</u> (rows: dm; columns: Mem Size (Kb); the original heat map marks the ~Stratix V point)

| dm / Mem Size (Kb) | 1    | 2   | 4   | 8   | 16  | 32  | 64  | 128 | 256  |
|--------------------|------|-----|-----|-----|-----|-----|-----|-----|------|
| 11                 | 770  | 470 | 250 | 320 | 180 | 320 | 470 | 930 | 1600 |
| 10                 | 1300 | 330 | 310 | 220 | 140 | 270 | 360 | 810 | 1400 |
| 9                  | 770  | 410 | 180 | 220 | 150 | 260 | 350 | 820 | 1400 |
| 8                  | 680  | 300 | 430 | 320 | 190 | 240 | 330 | 790 | 1400 |
| 7                  | 640  | 300 | 180 | 200 | 160 | 240 | 320 | 790 | 1400 |
| 6                  | 760  | 490 | 200 | 210 | 110 | 230 | 310 | 780 | 1400 |
| 5                  | 600  | 360 | 250 | 150 | 140 | 220 | 300 | 780 | 1400 |
| 4                  | 570  | 310 | 330 | 140 | 180 | 230 | 310 | 760 | 1300 |
| 3                  | 370  | 370 | 150 | 210 | 150 | 210 | 300 | 750 | 1300 |
| 2                  | 660  | 280 | 150 | 150 | 130 | 220 | 300 | 770 | 1400 |
| 1                  | 260  | 160 | 130 | 140 | 150 | 230 | 330 | 750 | 1300 |

- Larger overheads
- Smaller optimum region
- Mem Size has more impact than dm
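One way to check the last observation against the worst-case numbers is to compare how much the average overhead moves along each axis. The sketch below copies the worst-case table; the first column label is garbled in the source and is assumed here to be 1Kb. Using the spread of row and column means as the measure of "impact" is a simplification chosen for illustration, not the authors' metric.

```python
# Worst-case overhead (%) copied from the table; rows are dm = 11 down to 1.
MEM_SIZES_KB = [1, 2, 4, 8, 16, 32, 64, 128, 256]  # first label assumed to be 1
DM_VALUES = [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
WORST_CASE = [
    [770, 470, 250, 320, 180, 320, 470, 930, 1600],
    [1300, 330, 310, 220, 140, 270, 360, 810, 1400],
    [770, 410, 180, 220, 150, 260, 350, 820, 1400],
    [680, 300, 430, 320, 190, 240, 330, 790, 1400],
    [640, 300, 180, 200, 160, 240, 320, 790, 1400],
    [760, 490, 200, 210, 110, 230, 310, 780, 1400],
    [600, 360, 250, 150, 140, 220, 300, 780, 1400],
    [570, 310, 330, 140, 180, 230, 310, 760, 1300],
    [370, 370, 150, 210, 150, 210, 300, 750, 1300],
    [660, 280, 150, 150, 130, 220, 300, 770, 1400],
    [260, 160, 130, 140, 150, 230, 330, 750, 1300],
]


def mean(values):
    return sum(values) / len(values)


def axis_spreads(table):
    """Return (dm_spread, mem_spread): the range of the per-row means and of
    the per-column means, a crude measure of each axis's impact."""
    row_means = [mean(row) for row in table]
    col_means = [mean(col) for col in zip(*table)]
    return max(row_means) - min(row_means), max(col_means) - min(col_means)


def best_point(table):
    """Return (dm, mem_size_kb, overhead_pct) of the smallest table entry."""
    value, r, c = min(
        (v, r, c) for r, row in enumerate(table) for c, v in enumerate(row)
    )
    return DM_VALUES[r], MEM_SIZES_KB[c], value


if __name__ == "__main__":
    dm_spread, mem_spread = axis_spreads(WORST_CASE)
    print(f"spread across dm: {dm_spread:.0f}, across Mem Size: {mem_spread:.0f}")
    print("best point (dm, Kb, %):", best_point(WORST_CASE))
```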




- Next: 4 sets of architecture styles to explore:
  - 1 or 2 memory size(s), with or without CHM

### Full Architectural Sweep



- CHM:
  - Extends optimum region
  - Shifts towards larger memories
  - Absolute value of minimum is reduced by ~2x
  - Area is increased

*[Figure: full architectural sweep, geomean, plotted against Mem Size (Kb); legend entries: 1-mem, 1-mem [CHM], 2-mem [CHM], and the ~StratixV reference point]*

### Conclusions

- Energy-optimum is different from area-optimum:
  - Multiple memory levels
  - Continuous Hierarchy Memories (CHM)
  - Placed more frequently than on commercial FPGAs
- 8-32Kb are good memory sizes in general



### Questions?

### $(M_{arch} < M_{app})$

- Memories on FPGAs have a native shape, *e.g.* 8K=256x32
- Can be used in different modes:
  - 256x32
  - 512x16
  - 1024x8
  - 2048x4
  - 4096x2
  - 8192x1
- Each mode costs the same (cost of native shape, 256x32)

- *e.g.*:
  - $M_{app} = 2\text{K} \times 32$
  - $M_{arch} = 256 \times 32$
- **Delay-optimized**: eight 8K memories, each in 2048x4 mode, all driven by addr[10:0]; each memory has an enable (en) and supplies 4b of the 32b word
- **Power-optimized**: the 8K memories are addressed by addr[7:0], with the enables (en) selecting which memory is accessed

*[Figure: delay-optimized versus power-optimized mapping of the 2Kx32 memory onto 8K blocks]*

"Power-efficient RAM mapping algorithms for FPGA embedded memory blocks" (Tessier *et al.* IEEE Trans. on CAD 2007)
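The difference between the two mappings can be sketched by counting blocks. The code below tiles a logical memory with 8K blocks in a chosen mode and reports how many blocks exist and how many are enabled per access. It assumes that only the blocks in the addressed row of the tiling are enabled and ignores decoder and multiplexer energy. The names `map_memory` and `options` are hypothetical and are not taken from the VPR integration.

```python
from math import ceil

# Modes of an 8Kb block with native shape 256x32, as (depth, width).
MODES_8K = [(256, 32), (512, 16), (1024, 8), (2048, 4), (4096, 2), (8192, 1)]


def map_memory(app_depth, app_width, mode):
    """Tile an app_depth x app_width logical memory with blocks used in `mode`.

    Returns (total_blocks, active_blocks_per_access). Assumption: every block
    in the addressed row of the tiling is enabled on an access, and blocks in
    the other rows stay disabled.
    """
    depth, width = mode
    blocks_wide = ceil(app_width / width)
    blocks_deep = ceil(app_depth / depth)
    return blocks_wide * blocks_deep, blocks_wide


def options(app_depth, app_width, modes=MODES_8K):
    """Evaluate every mode for one logical memory."""
    return {mode: map_memory(app_depth, app_width, mode) for mode in modes}


if __name__ == "__main__":
    # The 2Kx32 example: 2048x4 is the delay-optimized choice and
    # 256x32 (addr[7:0]) the power-optimized one.
    for (depth, width), (total, active) in options(2048, 32).items():
        print(f"{depth}x{width}: {total} blocks, {active} active per access")
```

Because each mode costs the same per access, the number of active blocks is a rough proxy for access energy: in this sketch the 2048x4 mapping enables all eight blocks on every access, while the 256x32 mapping enables one.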

- What we did: integrated with VPR
- Without P-opt: +4-19% geomean energy overhead, **+40-108% worst-case overhead**
- P-opt source release: http://ic.ese.upenn.edu/abstracts/meme\_fpga2015.html

