

#### Reducing Memory Requirements for High-Performance and Numerically Stable Gaussian Elimination

David Boland Monash University





# Less Memory for Energy-efficient and *Actually Useful* Gaussian Elimination

David Boland Monash University



- Those who passionately care about anything to do with Gaussian Elimination
  - You will get:
    - A walk through of various different structures to perform GE.
    - Tradeoffs of parallelism, memory, pipelining, numerical stability,...
    - Disclaimer: Still room for improvement



- Those who passionately care about anything to do with Gaussian Elimination
  - You will get:
    - A walk through of various different structures to perform GE.
    - Tradeoffs of parallelism, memory, pipelining, numerical stability,...
    - Disclaimer: Still room for improvement
- Those who stumbled into this room accidentally after a coffee break



- Those who passionately care about anything to do with Gaussian Elimination
  - You will get:
    - A walk through of various different structures to perform GE.
    - Tradeoffs of parallelism, memory, pipelining, numerical stability,...
    - Disclaimer: Still room for improvement
- Those who stumbled into this room accidentally after a coffee break
  - If you care somewhat about parallel processing, you will get:
    - Some thoughts about how to reduce memory and I/O requirements for systolic arrays to use the whole FPGA.
    - Reminder of some algorithm that you learnt ages ago, something about solving matrices.



- Those who passionately care about anything to do with Gaussian Elimination
  - You will get:
    - A walk through of various different structures to perform GE.
    - Tradeoffs of parallelism, memory, pipelining, numerical stability,...
    - Disclaimer: Still room for improvement
- Those who stumbled into this room accidentally after a coffee break
  - If you care somewhat about parallel processing, you will get:
    - Some thoughts about how to reduce memory and I/O requirements for systolic arrays to use the whole FPGA.
    - Reminder of some algorithm that you learnt ages ago, something about solving matrices.
  - If you don't
    - Confusion about whether the speaker's ethnicity. Is he Australian/English/Canadian/Chinese



A direct method to find a solution for Ax=b

$$A = \begin{matrix} 16 & 4 & 8 & -12 & & 4 \\ 8 & 10 & 12 & -10 & & b = \begin{matrix} 4 \\ 11 \\ -2 & -4.5 & 10.5 & 3.5 \end{matrix}$$



• First form an augmented matrix:

| 16 | 4    | 8    | -12 | 4   |
|----|------|------|-----|-----|
| 8  | 10   | 12   | -10 | 4   |
| 4  | -7   | -3   | 7   | 11  |
| -2 | -4.5 | 10.5 | 3.5 | 3.5 |



- First form an augmented matrix:
- Then perform row elimination between two rows.

| 16 | 4    | 8    | -12 | 4   |
|----|------|------|-----|-----|
| 8  | 10   | 12   | -10 | 4   |
| 4  | -7   | -3   | 7   | 11  |
| -2 | -4.5 | 10.5 | 3.5 | 3.5 |



- First form an augmented matrix:
- Then perform row elimination between two rows.

 $(Row_2 = Row_2 - \frac{1}{2} Row_1)$ 



- First form an augmented matrix:
- Repeat for all other rows in a column:



 $(Row_3 = Row_3 - \frac{1}{4} Row_1)$ 



- First form an augmented matrix:
- Repeat for all other rows in a column:



$$(Row_4 = Row_4 - \frac{1}{8} Row_1)$$



- First form an augmented matrix:
- Repeat for all other rows in a column:



 $(Row_3 = Row_3 + Row_2)$ 



• Eventually, an upper triangular matrix is formed:



• Eventually, an upper triangular matrix is formed:

Find solution by back substitution:

$$x_4 = -\frac{1}{4}$$



Eventually, an upper triangular matrix is formed:

Find solution by back substitution:

$$x_4 = -\frac{1}{4'}, x_3 = \frac{12 - 6 \cdot -(\frac{1}{4})}{3}, \dots$$



- Good:
  - Simple algorithm
  - Often works
- Bad:
  - Slow (limited parallelism)
  - Potentially poor numerical performance



Do parallel row elimination:

| X | X | X | X | X       | X   | X | X | X | X |
|---|---|---|---|---------|-----|---|---|---|---|
| X | X | X | X | X       | _ 0 | X | X | X | x |
| X | x | X | X | $x^{-}$ | → 0 | X | X | X | x |
| X | X | X | X | x       | 0   | X | x | X | X |



Do parallel row elimination:

| X | X | X | X | X       | X        | X | X | X | X       | X                         | X | X | X | X |
|---|---|---|---|---------|----------|---|---|---|---------|---------------------------|---|---|---|---|
| x | X | x | x | x       | 0        | x | x | x | x       | $\rightarrow \frac{0}{0}$ | x | X | X | X |
| X | X | x | X | $x^{-}$ | <b>0</b> | X | X | x | $x^{-}$ | <b>0</b>                  | 0 | X | X | X |
| x | X | X | x | x       | 0        | X | X | X | X       | 0                         | 0 | x | X | X |



Do parallel row elimination:

| X | X  | X  | X | X       | X        | X   | X   | X   | X       | X                         | X | X | X | X       | X        | X | X | X | X  |
|---|----|----|---|---------|----------|-----|-----|-----|---------|---------------------------|---|---|---|---------|----------|---|---|---|----|
| X | x  | x  | x | x       | 0        | X   | x   | x   | X       | $\rightarrow \frac{0}{0}$ | x | X | X | X       | 0        | x | x | X | X  |
| X | x  | x  | x | $x^{-}$ | <b>0</b> | X   | X   | x   | $x^{-}$ | <b>0</b>                  | 0 | X | X | $x^{-}$ | <b>0</b> | 0 | x | X | X  |
| v | 24 | 24 | ~ | ~       | Δ        | ~ ~ | ~ ~ | ~ ~ | ~       | Δ                         | Δ |   |   |         | 0        | Δ | Δ |   | 24 |



Do parallel row elimination:

 $\boldsymbol{\chi}$  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$ X X X  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$ X X  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$ X  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$ Х  $\begin{array}{ccccc} x & x & x & x \\ x & x & x & x \end{array} \xrightarrow{0} 0$  $\begin{array}{ccc} x & x \\ 0 & x \end{array}$ 0 X  $\boldsymbol{\chi}$ X X 0 X X X  $0 \quad x \quad x$ x x x 0 0 x x 0 0 0 x x X X X X X

- Need parallel access to all rows
  - First need to load the entire matrix on chip
  - What about large matrices?



- Divide matrix into blocks, load blocks into on-chip RAM
- Yellow: load to memory
- Green: update matrix
- Blue: stored matrix, needed for updates





- Divide matrix into blocks, load blocks into on-chip RAM
- Perform parallel GE

- Yellow: load to memory
- Green: update matrix
- Blue: stored matrix, needed for updates





- Divide matrix into blocks, load blocks into on-chip RAM
- Perform parallel GE
- Double buffer for performance

- Yellow: load to memory
- Green: update matrix
- Blue: stored matrix, needed for updates





- Divide matrix into blocks, load blocks into on-chip RAM
- Perform parallel GE
- Double buffer for performance
- Update to right

- Yellow: load to memory
- Green: update matrix
- Blue: stored matrix, needed for updates





- Divide matrix into blocks, load blocks into on-chip RAM
- Perform parallel GE
- Double buffer for performance
- Update to right, continue to next rows

- Yellow: load to memory
- Green: update matrix
- Blue: stored matrix, needed for updates





- Divide matrix into blocks, load blocks into on-chip RAM
- Perform parallel GE
- Double buffer for performance
- Update to right, continue to next rows



- Green: update matrix
- Blue: stored matrix, needed for updates





- Divide matrix into blocks, load blocks into on-chip RAM
- Perform parallel GE
- Double buffer for performance
- Update to right, continue to next rows

- Yellow: load to memory
- Green: update matrix
- Blue: stored matrix, needed for updates





- Good:
  - Simple algorithm
  - Fast
  - Top performing FPGA implementations
- Bad:
  - Potentially poor numerical performance



Making it actually work: GE with partial pivoting

• Simple GE algorithm may fail:





Making it actually work: GE with partial pivoting

• Simple GE algorithm may fail:



Just swap Row<sub>2</sub> and Row<sub>1</sub>



# Making it actually work: GE with partial pivoting

- More generally, for best numerical performance, you always want the row with the largest leading element unchanged
- E.g..



Still swap Row<sub>2</sub> and Row<sub>1</sub> before elimination



#### Back to the block-based algorithms

Partial pivoting makes basic block-based Gaussian elimination difficult:





# Tiled LU Factorisation



- 4 different subroutines, implement GE with partial pivoting & swap between blocks
- No known efficient FPGA implementation



#### But is a new problem forming?

Even with basic block-based Gaussian elimination, 5 NxN matrices must be stored in on-chip RAM



- Divide matrix into blocks, load blocks into on-chip RAM
- Perform parallel GE
- Double buffer for performance
- Update to right, continue to next rows

- Yellow: load to memory
- Green: update matrix
- Blue: stored matrix, needed for updates





## But is a new problem forming?

- Even with basic block-based Gaussian elimination, 5 NxN matrices must be stored in on-chip RAM
- To avoid I/O problems, only N parallel processing elements can be used. (O(N<sup>3</sup>) operations and O(N<sup>2</sup>) elements to load).



- Even with basic block-based Gaussian elimination, 5 NxN matrices must be stored in on-chip RAM
- To avoid I/O problems, only N parallel processing elements can be used. (O(N<sup>3</sup>) operations and O(N<sup>2</sup>) elements to load).
- Memory requirement (O(N<sup>2</sup>)) growing faster than required number of PEs (O(N))



- Even with basic block-based Gaussian elimination, 5 NxN matrices must be stored in on-chip RAM
- To avoid I/O problems, only N parallel processing elements can be used. (O(N<sup>3</sup>) operations and O(N<sup>2</sup>) elements to load).
- Memory requirement (O(N<sup>2</sup>)) growing faster than required number of PEs (O(N))
- Already a limitation on an Arria 10
  - 5 512\*512 matrices use 2560 M20Ks (up to 2713 available)
  - Uses 512 PEs (up to 1688 available)



An Ingenious solution: GE with pairwise pivoting























\* Because each PE performs comparison and swap if necessary, zeros will be at same location





\* Because each PE performs comparison and swap if necessary, zeros will be at same location















- Good:
  - Fast
  - Numerically stable
  - Advantages of systolic array (not continual access to memory)
- Bad:
  - I/O problem for large matrices



### Double the matrix size





## Using one PE to emulate two

• Let's study the output of one PE with a different input order:





## Using one PE to emulate two

• Let's study the output of one PE with a different input order:





## Using one PE to emulate two

• Let's study the output of one PE with a different input order:





PE



\* Because each PE performs comparison and swap if necessary, zeros will be at same location





\* Because each PE performs comparison and swap if necessary, zeros will be at same location



#### Can use second half of circuit for final solution



х X  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$  ${\mathcal X}$ Х X  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$ Х  $\boldsymbol{\chi}$ X Х X  ${\mathcal X}$ X X  $\boldsymbol{\chi}$ x x x x x x 0 0 0 X X х х х X  ${\mathcal X}$ х  $\boldsymbol{\chi}$  $\boldsymbol{\chi}$ X X Х  ${\mathcal X}$  $\rightarrow$ 0  $0 \xrightarrow{\rightarrow}$ x 0 0 0 0 X X X  $\boldsymbol{\chi}$ X X X X X X 0 0 0 0 0 0 0 0 X X X X X X X X X X X X

















#### Re-arrange

х x х х х  $x \quad x$ х х х 0 x x x x x x x х х *x x* 0 0 x x x x 0 х 0 x • 0 х x x x 0 x x x 0 0 x x x x 0 0 x x 0 0 0  $x \quad x \quad 0 \quad 0 \quad 0$ *x x* 0 0 0



## Double the matrix size





## Double the matrix size









### An alternative output circuit





### An alternative output circuit







(a) First pass















# Efficiency concerns

Second half only gets inputs every other cycle. PEs wasted





## Efficiency concerns

Second half only gets inputs every four cycles. PEs wasted





- Not 100% efficient (could be done, but trade for on-chip memory)
- Cycle 1:





- Not 100% efficient (could be done, but trade for on-chip memory)
- Cycle 2:





- Not 100% efficient (could be done, but trade for on-chip memory)
- Cycle 3:





 Not 100% efficient (could be done, but trade for on-chip memory)





|               | Resource Use |      |       |
|---------------|--------------|------|-------|
| Entity        | ALMs (1000s) | DSPs | M20Ks |
| Individual PE | 0.15         | 1    | 1     |
| Stage 1       | 0.9          | 5    | 4     |
| Stage 2       | 2.6          | 11   | 18    |
| Stage 3       | 3.6          | 14   | 27    |
| Stage 4       | 5.7          | 20   | 45    |
| Stage 5       | 10           | 32   | 80    |
| Stage 6       | 19           | 56   | 146   |
| Stage 7       | 35           | 104  | 280   |
| Stage 8       | 70           | 200  | 542   |
| Stage 9       | 95           | 260  | 395   |
| Full design   | 338          | 962  | 1932  |





# Limited by Slices (mainly due to pipeline registers to boost clock frequency)

|               | Resource Use |      |       |
|---------------|--------------|------|-------|
| Entity        | ALMs (1000s) | DSPs | M20Ks |
| Individual PE | 0.15         | 1    | 1     |
| Stage 1       | 0.9          | 5    | 4     |
| Stage 2       | 2.6          | 11   | 18    |
| Stage 3       | 3.6          | 14   | 27    |
| Stage 4       | 5.7          | 20   | 45    |
| Stage 5       | 10           | 32   | 80    |
| Stage 6       | 19           | 56   | 146   |
| Stage 7       | 35           | 104  | 280   |
| Stage 8       | 70           | 200  | 542   |
| Stage 9       | 95           | 260  | 395   |
| Full design   | 338          | 962  | 1932  |





# Limited by Slices (mainly due to pipeline registers to boost clock frequency)

|               | Resource Use |      |       |
|---------------|--------------|------|-------|
| Entity        | ALMs (1000s) | DSPs | M20Ks |
| Individual PE | 0.15         | 1    | 1     |
| Stage 1       | 0.9          | 5    | 4     |
| Stage 2       | 2.6          | 11   | 18    |
| Stage 3       | 3.6          | 14   | 27    |
| Stage 4       | 5.7          | 20   | 45    |
| Stage 5       | 10           | 32   | 80    |
| Stage 6       | 19           | 56   | 146   |
| Stage 7       | 35           | 104  | 280   |
| Stage 8       | 70           | 200  | 542   |
| Stage 9       | 95           | 260  | 395   |
| Full design   | 338          | 962  | 1932  |

Use more DSPs

#### Use less memory





# Limited by Slices (mainly due to pipeline registers to boost clock frequency)

|               | Resource Use |      |       |
|---------------|--------------|------|-------|
| Entity        | ALMs (1000s) | DSPs | M20Ks |
| Individual PE | 0.15         | 1    | 1     |
| Stage 1       | 0.9          | 5    | 4     |
| Stage 2       | 2.6          | 11   | 18    |
| Stage 3       | 3.6          | 14   | 27    |
| Stage 4       | 5.7          | 20   | 45    |
| Stage 5       | 10           | 32   | 80    |
| Stage 6       | 19           | 56   | 146   |
| Stage 7       | 35           | 104  | 280   |
| Stage 8       | 70           | 200  | 542   |
| Stage 9       | 95           | 260  | 395   |
| Full design   | 338          | 962  | 1932  |

Use more DSPs

#### Use less memory





# Limited by Slices (mainly due to pipeline registers to boost clock frequency)

|               | Resource Use |      |       |
|---------------|--------------|------|-------|
| Entity        | ALMs (1000s) | DSPs | M20Ks |
| Individual PE | 0.15         | 1    | 1     |
| Stage 1       | 0.9          | 5    | 4     |
| Stage 2       | 2.6          | 11   | 18    |
| Stage 3       | 3.6          | 14   | 27    |
| Stage 4       | 5.7          | 20   | 45    |
| Stage 5       | 10           | 32   | 80    |
| Stage 6       | 19           | 56   | 146   |
| Stage 7       | 35           | 104  | 280   |
| Stage 8       | 70           | 200  | 542   |
| Stage 9       | 95           | 260  | 395   |
| Full design   | 338          | 962  | 1932  |

Use more DSPs

#### Use less memory

Achieve comparable performance to the peak possible using basic block-based GE 300 250 200 မိ <sub>150</sub> 100 50 10 10<sup>8</sup> 10<sup>10</sup> 104 10<sup>6</sup> 10 Matrix Order

Vs basic block-based GE, it will work on many more algorithms (subject to single precision being sufficient)



 The approach of this paper saves memory, achieves high performance and is numerically stable (and opens doors for some room for improvement)



### Conclusions

- Please, please, please, no more GE/LU decomposition papers that don't include some form of pivoting.
- Perhaps consider if its possible to re-examine the use of systolic arrays in your designs, perhaps with re-ordered inputs, to reduce I/O or memory.



### Conclusions

- Please, please, please, no more GE/LU decomposition papers that don't include some form of pivoting.
- Perhaps consider if its possible to re-examine the use of systolic arrays in your designs, perhaps with re-ordered inputs, to reduce I/O or memory.

 (Feel free to give me a Stratix 10 to see some big performance gains)



- Please, please, please, no more GE/LU decomposition papers that don't include some form of pivoting.
- Perhaps consider if its possible to re-examine the use of systolic arrays in your designs, perhaps with re-ordered inputs, to reduce I/O or memory.

- (Feel free to give me a Stratix 10 to see some big performance gains)
- (Don't want to offend anyone from Xilinx, I'll take your boards too)

