# Architecture Exploration for HLS-Oriented FPGA Debug Overlays

## Al-Shahna Jamal, Jeffrey Goeders, Steve Wilton

FPGA'18 - Monterey, CA



THE UNIVERSITY OF BRITISH COLUMBIA

## What this talk is about...

*Recent work:* Source-level, in-system debugging of HLS circuits

- Debug instrumentation is inserted at compile time
- Changing this instrumentation (to trace new data) requires a *recompile*



<u>In this work:</u> Debug instrumentation still inserted at compile time BUT can be configured at runtime (**fast customization**)

**Impact**: Achieves software-like compile times (~1 s) between debug iterations



## Outline

- Motivation for In-System Debug
- Previous Work: In-System Debug Framework for HLS
  - Debug Instrumentation at compile time
- This paper: HLS Debug Overlay to allow customization at runtime
- Evaluation
- Future Work

## Motivation for In-System Debug



Software designers need a full ecosystem of tools:

- Testing, debugging, optimization, ...

#### **Debugging:** When do we have to do in-system debug?

- Simulation may take too long
- Bug may be dependent on system interactions, IO traffic, etc.

#### For certain bugs we have to perform in-system debug, observing the actual hardware

### Hardware Debug Tools



Not practical for a software designer!

## **Previous Work: In-System Debug Framework for HLS**

#### Capture system-level bugs $\rightarrow$ Need to run at-speed, on-chip



Limited on-chip memory $\rightarrow$ Need to select what we want to record and use memory efficiently

## **Previous Work: Taking Advantage of HLS Scheduling**



- Recorded signals change each cycle
- 50x-100x more memory efficient than traditional Embedded Logic Analyzer (ELA) approach

- Circuit-by-Circuit custom compression
- Based on signals selected for tracing (compression algorithms)
- Selecting a different subset of signals requires a **recompile**
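The memory saving from exploiting the HLS schedule can be illustrated with a toy model. This is only a sketch under stated assumptions: the schedule is taken to map each FSM state to the signals written in that state, the ELA-style recorder is assumed to store every signal on every cycle, and all names and widths (`WIDTHS`, `SCHEDULE`, `state_trace`) are hypothetical. It is not meant to reproduce the 50x-100x figure quoted above.

```python
# Toy model (hypothetical names and data): compare the trace-buffer bits used
# by an ELA-style recorder with those used by a schedule-aware recorder.

# Bit width of each traced signal.
WIDTHS = {"a": 32, "b": 32, "i": 8, "flag": 1}

# HLS schedule: FSM state -> signals that are written in that state.
SCHEDULE = {
    0: ["i"],
    1: ["a", "flag"],
    2: ["b"],
    3: [],  # e.g. a state that only waits on memory
}


def ela_bits(state_trace):
    """ELA-style: every signal is recorded on every cycle."""
    return len(state_trace) * sum(WIDTHS.values())


def schedule_aware_bits(state_trace):
    """Schedule-aware: record only the signals written in the current state."""
    return sum(WIDTHS[sig] for state in state_trace for sig in SCHEDULE[state])


# Sequence of FSM states visited during one run.
state_trace = [0, 1, 2, 3, 3, 1, 2]
print(ela_bits(state_trace), schedule_aware_bits(state_trace))
```

The gap grows with the number of signals that are idle in a typical state, which is why the recorded set depends on the schedule and on which signals were selected.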

## HLS Overlays: Software-like Debug Turn-Around Times



### Workflow Using the Debug Overlay



**Key:** The more general and flexible the overlay, the larger its area overhead

**Our Approach**: determine a set of **useful capabilities**, and architect an overlay that is *just flexible enough* to implement these

## What can this overlay do?

1. Selective Variable Tracing
   - Select user-visible variables to trace
2. Selective Function Tracing
   - Select a region of code to trace
3. Conditional Buffer Freeze
   - Specify a condition on the circuit that, when true, halts recording in the trace buffer

#### **Selective Variable Tracing: User Perspective**



## **Architecture to Support Capability**



## Selective Variable Tracing Architecture – Initial Ideas...

Could have a configurable memory that selects which RTL signals (those that map to C code variables) are traced. This memory is programmed at runtime...

Aside: Intel's In-System Memory Content Editor

Could associate a bit in Config RAM with each RTL signal that corresponds to a C code variable...
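A minimal sketch of this idea, assuming one enable bit per traceable signal and a recorder that drops any update whose bit is clear. The class and function names (`ConfigRAM`, `trace_cycle`) are hypothetical and only model the behaviour in software; they are not the overlay's RTL.

```python
class ConfigRAM:
    """One enable bit per traceable RTL signal (hypothetical software model)."""

    def __init__(self, signal_names):
        self.index = {name: i for i, name in enumerate(signal_names)}
        self.bits = [0] * len(signal_names)

    def program(self, selected):
        """Runtime personalization: rewrite the enable bits, no recompile."""
        self.bits = [0] * len(self.bits)
        for name in selected:
            self.bits[self.index[name]] = 1

    def enabled(self, name):
        return self.bits[self.index[name]] == 1


def trace_cycle(config, updates):
    """Return the (signal, value) pairs that would reach the trace buffer."""
    return [(name, value) for name, value in updates if config.enabled(name)]


config = ConfigRAM(["a", "b", "i"])
config.program(["a"])                       # trace only variable a
first = trace_cycle(config, [("a", 5), ("i", 1)])
config.program(["i"])                       # switch to variable i at runtime
second = trace_cycle(config, [("a", 5), ("i", 1)])
print(first, second)
```

Changing the traced set is then a memory write rather than a new compile, which is the property the overlay is after.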



## **Selective Variable Tracing Architecture: Variant A**



#### **Selective Variable Tracing Architecture: Variant B**



#### Variant B: Line Packer – Architectural Parameter "G"



- G: granularity of the line packer
- Increasing G splits the incoming trace data into smaller words, giving finer-grained packing
- Increasing G also increases the steering logic and therefore the area overhead
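The packing behaviour can be sketched as follows. This is an illustrative model, not the paper's design: it assumes G is the number of equal-width words each incoming trace line is split into, that each word carries a valid flag (set when the word holds selected data), and that the packer writes a buffer line once it has collected G valid words. The function name `pack_lines` and the sample data are hypothetical.

```python
def pack_lines(records, g):
    """Pack the valid words of each incoming record into full buffer lines.

    records: list of per-cycle lists holding exactly g (word, valid) pairs.
    Returns (full_lines, partial), where partial holds words still waiting
    for the current line to fill.
    """
    full_lines, current = [], []
    for record in records:
        if len(record) != g:
            raise ValueError("each record must contain exactly g words")
        for word, valid in record:
            if not valid:
                continue  # unselected data never reaches the trace buffer
            current.append(word)
            if len(current) == g:
                full_lines.append(current)
                current = []
    return full_lines, current


# Three cycles of trace data with G = 2; only four of the six words are valid.
records = [
    [("a0", True), ("x", False)],
    [("b0", True), ("b1", True)],
    [("y", False), ("c1", True)],
]
lines, partial = pack_lines(records, 2)
print(lines, partial)
```

Without packing these three cycles would occupy three buffer lines; with packing they occupy two. A larger G lets smaller pieces be dropped, at the cost of more steering logic to route any word to any slot.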



### Variant B – Multi-Bit Configuration ROM



#### **Selective Function Tracing: User Perspective**



#### **Selective Function Tracing: Same architecture!**





### **Conditional Buffer Freeze – User Perspective**

*[Screenshot: debugger GUI in FPGA replay mode, paused in `sha256_transform` at FSM state 12 (`LEGUP_F_sha256_transform_BB_21_12`) of `sha256_labeled.c`, with the freeze condition "a < 0, line 94" entered in the Condition field.]*

#### **Conditional Buffer Freeze**



## **Conditional Buffer Freeze – Architectural Parameter "C"**



- Increasing C (the number of conditional units) allows a more complex condition to be expressed
- Example: stop tracing when err flag 1 OR err flag 2 goes high
- The "stop write controller" receives a signal from each of the C units and ORs them to form the trigger
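The OR-of-units trigger can be sketched in software. The model below is an assumption-laden illustration: each conditional unit is taken to compare one signal against a constant, the buffer is a plain list rather than a circular trace buffer, and the names (`ConditionalUnit`, `run_trace`, `err_flag_1`, `err_flag_2`) are hypothetical.

```python
import operator

# Comparison operators a conditional unit may be configured with (assumed set).
OPS = {"<": operator.lt, "==": operator.eq, ">": operator.gt, "!=": operator.ne}


class ConditionalUnit:
    """Compares one signal against a constant; configured at runtime."""

    def __init__(self, signal, op, constant):
        self.signal = signal
        self.op = OPS[op]
        self.constant = constant

    def fires(self, values):
        return self.signal in values and self.op(values[self.signal], self.constant)


def run_trace(cycles, units):
    """Record cycles until any unit fires (OR trigger), then freeze."""
    buffer = []
    for values in cycles:
        buffer.append(values)
        if any(unit.fires(values) for unit in units):
            break  # the stop-write controller halts recording
    return buffer


# C = 2: freeze when err_flag_1 OR err_flag_2 goes high.
units = [ConditionalUnit("err_flag_1", "==", 1), ConditionalUnit("err_flag_2", "==", 1)]
cycles = [
    {"err_flag_1": 0, "err_flag_2": 0},
    {"err_flag_1": 0, "err_flag_2": 1},
    {"err_flag_1": 1, "err_flag_2": 0},
]
frozen = run_trace(cycles, units)
print(len(frozen))
```

With a single unit (C = 1) only one of the two error flags could be watched, so the second flag's rise would go unnoticed until the buffer had wrapped.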

#### **Evaluation: Run-Times**

Compile Time vs. Overlay Personalization Time (seconds)



## Variant A Overlay – Impact on Area



Baseline debug instrumentation is 20% the size of the user circuit\*



ALMs on average, and 1 M9K – cheap!

\*J. Goeders and S. J. E. Wilton, "Signal-Tracing Techniques for In-System FPGA Debugging of High-Level Synthesis Circuits," IEEE TCAD, 2017.

#### **Architecture vs. Trace Window Length**



**Overlay Variants** 

Architectural enhancements improve trace window length

#### **Overhead: Variant B vs. Variant A**



Area goes up dramatically for high granularity in line packer

#### **Overhead: Conditional Units**



Area increases with the number of C units, with a small decrease in Fmax

## How can an FPGA vendor use these results?



Provide a library of overlays.

Depending on the user's debugging needs and the resources available, select the appropriate library:

- Economy Library: cheaper overlay (e.g. only selective variable tracing)
- Deluxe Library: supports more capabilities (e.g. conditional trigger functions)

Can also take advantage of:

- User input / estimates to user
- Variable reconstruction

## **Future Work**

#### Currently, the user selects the overlay + capabilities to insert.

- Next step: create a tool that automatically determines the type of overlay to insert based on *estimated unused resources*
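One possible shape for such a tool is a simple greedy fit, sketched below. Everything about the policy is an assumption: the function name `choose_addons`, the priority order, and the idea that add-on costs are additive are all hypothetical. The two ALM costs are the figures quoted on the Summary slide.

```python
def choose_addons(free_alms, addons):
    """Greedily pick add-ons, in priority order, whose ALM cost still fits.

    addons: list of (name, alm_cost) pairs, highest priority first.
    Returns (chosen_names, remaining_alms).
    """
    chosen, remaining = [], free_alms
    for name, cost in addons:
        if cost <= remaining:
            chosen.append(name)
            remaining -= cost
    return chosen, remaining


# Costs on top of the baseline instrumentation, taken from the Summary slide.
# The priority order is an assumption made for this sketch.
ADDONS = [
    ("line packer (Variant B, G=2)", 335),
    ("conditional unit (C=1)", 249),
]

print(choose_addons(400, ADDONS))
print(choose_addons(600, ADDONS))
```

A real tool would also need the estimate of unused resources itself, and would have to account for memory blocks and timing, neither of which this sketch models.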

#### The overlay is passive (i.e. only monitors the user circuit)

- Investigate limited *controllability*
- Allow for simple "what if" scenarios

## Summary

Achieved software-like compile times between debug turns, in a limited context, via an HLS-oriented overlay

- Can personalize the overlay at runtime without a recompile
- Overlay supports a set of capabilities (selective variable/function tracing, conditional buffer freeze)
- Overheads are significant (335 ALMs for the Variant B, G=2 line packer; 249 ALMs for a C=1 unit) on top of the baseline instrumentation

#### Worth it for the option to have software-like compile times during debug

Thank you

Additional

### **Previous Work – Instrumentation Overhead**

| Circuit  | User Module (ALMs) | Instrumentation (100%): Fixed hlsd (ALMs) | Instrumentation (100%): Trace Scheduler (ALMs) | Proportion in Debug Partition |
|----------|--------------------|-------------------------------------------|------------------------------------------------|-------------------------------|
| adpcm    | 7019               | 480                                       | 1749                                           | 24.1%                         |
| aes      | 7135               | 479                                       | 754                                            | 14.7%                         |
| blowfish | 3038               | 528                                       | 1187                                           | 36.1%                         |
| dfadd    | 3605               | 495                                       | 1115                                           | 30.9%                         |
| dfdiv    | 6000               | 532                                       | 1124                                           | 21.6%                         |
| dfmul    | 1881               | 483                                       | 675                                            | 38.1%                         |
| dfsin    | 11864              | 529                                       | 2904                                           | 22.4%                         |
| gsm      | 4147               | 473                                       | 782                                            | 23.2%                         |
| jpeg     | 18735              | 506                                       | 2781                                           | 14.9%                         |
| mips     | 1441               | 505                                       | 419                                            | 39.1%                         |
| motion   | 6470               | 520                                       | 524                                            | 13.9%                         |
| sha      | 1720               | 514                                       | 334                                            | 33.0%                         |
| combined | 66522              | 583                                       | 13525                                          | 17.5%                         |
| Mean     | 10736              | 509                                       | 2114                                           | 25.4%                         |

Roughly ¼ is debug instrumentation.
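The last column of the table appears to be the instrumentation ALMs divided by the total ALMs (user module plus instrumentation). That reading is inferred from the numbers rather than stated on the slide; the short check below, with the hypothetical helper `debug_proportion`, recomputes three rows under that assumption.

```python
def debug_proportion(user_alms, fixed_alms, scheduler_alms):
    """Share of total ALMs taken by debug instrumentation (assumed formula)."""
    instrumentation = fixed_alms + scheduler_alms
    return instrumentation / (user_alms + instrumentation)


# (user module, fixed hlsd, trace scheduler) ALMs, copied from the table.
ROWS = {
    "adpcm": (7019, 480, 1749),
    "aes": (7135, 479, 754),
    "mips": (1441, 505, 419),
}

for circuit, alms in ROWS.items():
    print(circuit, round(100 * debug_proportion(*alms), 1))
```

Under this formula the three rows come out at 24.1%, 14.7% and 39.1%, matching the table.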