#### Delay-Bounded Routing for Shadow Registers

#### Eddie Hung, Josh Levine, Ed Stott, George Constantinides, Wayne Luk {e.hung,josh.levine,ed.stott, g.constantinides,w.luk}@ic.ac.uk

Department of Computing Department of Electrical and Electronic Engineering Imperial College London, UK

FPGA15 :: 23 Feb '15

- More accurate on-chip timing measurements
  - By inserting instruments precisely



- More accurate on-chip timing measurements
  - By inserting instruments precisely



- More accurate on-chip timing measurements
  - By inserting instruments precisely



- More accurate on-chip timing measurements
  - By inserting instruments precisely



In particular, we want to bound this routing delay <u>across 1000s</u> of instruments

- More accurate on-chip timing measurements
  - By inserting instruments precisely



- Most CAD focusses on minimising (or maximising)
  - We place and route instruments with bounded delay
- Key result: Insertion within 200ps avg. error

## Introduction

- Why do we care about on-chip timing?
  - Because of variation, vendors forced to provision for all possible scenarios
  - Fmax is worst-case path, at worst-case timing corner
  - Can we safely operate better than worst-case?
    - "Why operate at the worst corner if silicon is rarely that slow, and if critical-path is rarely (never) exercised?"

## Main Contributions

- Proposal for inserting shadow registers post place-and-route, using spare resources
  - Does not disrupt original circuit timing
- Modification to Dijkstra's algorithm for minimum cost (delay) bounds
- Experimental evaluation on commercial architecture
  - Achieve average bound error of 200ps

# Introduction

- How can on-chip timing instruments help?
  - Failure prediction
    - "Is my FPGA showing signs of being close to failing?"
  - Error detection
    - "Safe" overclocking (e.g. Razor)
  - Slack measurement
    - "How much leeway do I have before failing?"
  - All possible using shadow registers



User circuit (compiled using regular tools, and locked down)



Duplicates a critical path endpoint, but with a phase-shifted clock



Duplicates a critical path endpoint, but with a phase-shifted clock



Duplicates a critical path endpoint, but with a phase-shifted clock







By placing and routing these shadow registers carefully, target *consistent* skew

# **Existing Work**

- Existing CAD tools already do bounded routing
  - Minimum delay: Hold time minus Clock-to-Q
  - Maximum delay: Clock period minus Setup time
  - Algorithms target a relatively big landing window:



- We desire a very narrow window (ideally, 1ps)
  - Hard problem, but aided by two characteristics:
    - Freedom to <u>place</u> shadow register on any spare site
    - Freedom to <u>route</u> to any site using any spare resources
  - Solved simultaneously by routing to all sites



- FPGA routing is a case of graph search
  - Finding shortest path has long been solved: Dijkstra



Edge cost = 1 unless labelled

- FPGA routing is a case of graph search
  - Finding shortest path has long been solved: Dijkstra



- FPGA routing is a case of graph search
  - Finding shortest path has long been solved: Dijkstra



- FPGA routing is a case of graph search
  - Finding shortest path has long been solved: Dijkstra



- FPGA routing is a case of graph search
  - Finding shortest path has long been solved: Dijkstra



- FPGA routing is a case of graph search
  - Finding shortest path has long been solved: Dijkstra
  - But how can we find not-so-short paths?









• Summary: if path found is too short, undo search and rerun as if a prior edge did not exist



29



- Big picture: disjoint paths from each  ${\bf s}$  to any  ${\bf t}$ 



- Not dissimilar to Yen's K-shortest path algorithm
  - Guaranteed to find the shortest, and K-1 next shortest, paths
  - Essence: remove edge(s) of prior shortest paths, restart Dijkstra, then replace edge(s) and retry
  - Our heuristic:
    - Removes each edge permanently
    - Repair and continue, instead of restarting Dijkstra
    - No guarantee of finding next shortest, but much faster

• We propose three different algorithms:



• We propose three different algorithms:



Full details are in our paper!

## **Experimental Flow**

- Evaluated on Xilinx, but applicable to others
  - Place and route using ISE, apply our shadow register tool, then re-analyse using vendor STA
- Experimented on three large benchmarks:
  - LEON3: an 8-core system-on-chip
  - AES-x3: a chain of 3 encoders then 3 decoders
  - JPEG-x2: two parallel instances of CHStone JPEG synthesised using Vivado HLS
  - All occupying ~90% of mid-range Virtex6

#### Results – LEON3

• Critical endpoints with <10% slack (1424)



Critical Paths (by ascending slack)

- Critical endpoints with <10% slack (1424)</li>
  - With 1417 shadow registers at >1ns skew target



Critical Paths (by ascending slack)

- Critical endpoints with <10% slack (1424)</li>
  - With 1417 shadow registers at >1ns skew target



- Critical endpoints with <10% slack (1424)</li>
  - With 1417 shadow registers at >1ns skew target



- Average skew error of 200ps (from Xilinx STA)
  - Due to incomplete timing information
  - Based on our delay model, we achieve <1ps error!

- Average skew error of 200ps (from Xilinx STA)
  - Due to incomplete timing information
  - Based on our delay model, we achieve <1ps error!
  - Why so good?
    - Many spare registers (total: 300K, used: 62K, spare: 78K)
    - Millions of wires (nodes), 10s of millions of switches (edges)
    - ... compounding with up to 7 ways (with differing delays) to get to each:



41

# More in the paper...

- Experimental results for:
  - Three different algorithms on three benchmarks
  - Different number of steps to roll back
  - Different values for skew target
- Analysis on why not all nets could be shadowed

# More in the paper...

- Experimental results for:
  - Three different algorithms on three benchmarks
  - Different number of steps to roll back
  - Different values for skew target
- Analysis on why not all nets could be shadowed
- Future work: insert downstream infrastructure
  - Comparators, counters, recovery logic, etc.
  - Opportunities: latency-insensitive, reduction operation

# Conclusion

- Shadow registers can help detect, measure and react to physical imperfections
  - But need to be inserted precisely: consistent skew

# Conclusion

- Shadow registers can help detect, measure and react to physical imperfections
  - But need to be inserted precisely: consistent skew
- Our proposal: insert post place and route, with a minimum delay bound
  - Using just leftover resources  $\Rightarrow$  no timing impact
  - Bounding achieved by Dijkstra with Rollback
- Key result: achieve 200ps average skew error
  With more timing information, may do even better!

#### Backup

# Introduction

- Shadow register clock phase  $\boldsymbol{\phi}$ 



# Introduction

- Shadow register clock phase  $\boldsymbol{\phi}$ 



# Introduction

• Shadow register clock phase  $\phi$ 



- We insert shadow instruments entirely post P&R
  - By locking down all parts of existing user circuit ...
  - ... and using spare, leftover, resources only
  - This preserves timing of original user circuit
    - Barring one extra switch load: ~3ps







• For LEON3, shadowing top 10% critical nets:



only ~50% remains critical after recompiling!

• For LEON3, shadowing top 10% critical nets:



## Dijkstra with Rollback Profile



**Rollback Count** 

56

## Yen's K-shortest Profile



K-th Shortest Path

## Yen's K-shortest Profile



K-th Shortest Path

## **Benchmark Summary**

|                 | LEON3  | AES-x3 | JPEG-x2         | xc6vlx240t<br>capacity |
|-----------------|--------|--------|-----------------|------------------------|
| Slices          | 35217  | 32809  | 32733           | 37680                  |
| LUTs            | 95766  | 107685 | 91741           | 150720                 |
| Registers       | 60562  | 32025  | 125642          | 301440                 |
| $\mathbf{RAMs}$ | 688    | -      | 124             | 832                    |
| $\mathrm{DSPs}$ | 32     | -      | 172             | 768                    |
| Tcrit (ns)      | 13.329 | 6.055  | 13.936          | -                      |
| Wirelength      | 2.1M   | 1.1M   | $1.6\mathrm{M}$ | 6.3M                   |

# **Experimental Flow**

• Evaluated on Xilinx, but applicable to any vendor



# **Experimental Flow**

• Evaluated on Xilinx, but applicable to any vendor



# **Experimental Flow**

• Evaluated on Xilinx, but applicable to any vendor



• Unbounded (place closest reg, Xilinx PAR):



• >1ns on LEON3:

|                                                   | Base   | Shad<br>Source | ow Method<br><b>Post P&amp;R</b><br>(this work) |
|---------------------------------------------------|--------|----------------|-------------------------------------------------|
| Slice util.                                       | 93.5%  | 88.9%          | 94.5%                                           |
| LUT util.                                         | 63.5%  | 56.6%          | 64.0%                                           |
| Register util.                                    | 20.1%  | 20.6%          | $\mathbf{20.6\%}$                               |
| Tcrit (ns)                                        | 13.329 | 13.331         | 13.333                                          |
| Num. critical nets                                | 1436   | 1222           | 1436                                            |
| Common to Base                                    | 100%   | 41.0%          | 100%                                            |
| Shad. coverage                                    | -      | 41.0%          | $\mathbf{98.7\%}$                               |
| Pack & Place time (s)                             | 1875   | 1974           | -                                               |
| Routing runtime (s)                               | 1281   | 1241           | <b>183</b>                                      |
| (a) LEON3 (all critical nets within 10% of Tcrit) |        |                |                                                 |

>1ns on LEON3:

|                                                | Unbounded   | Delay-bounded    |         |          |  |
|------------------------------------------------|-------------|------------------|---------|----------|--|
| (ns)                                           | One-Closest | One-Closest      | One-All | All-All  |  |
| Nets shad.                                     | 1424        | 1051             | 1417    | 1418     |  |
| Max.  Err                                      | 1.942       | 1.779            | 1.002   | 0.928    |  |
| Mean  Err                                      | 0.867       | 0.329            | 0.223   | 0.249    |  |
| StdDev Err                                     | 0.367       | 0.187            | 0.111   | 0.107    |  |
| Range Err                                      | 2.771       | 1.744            | 1.075   | 0.956    |  |
| (a) From Xilinx trce STA.                      |             |                  |         |          |  |
|                                                | Unbounded   | $\mathbf{Delay}$ | bounded |          |  |
| (ns)                                           | One-Closest | One-Closest      | One-All | All-All  |  |
| Max.  Err                                      | -           | 1.636            | 0.114   | 0.114    |  |
| Mean  Err                                      | -           | 0.089            | < 0.001 | 0.001    |  |
| StdDev Err                                     | -           | 0.165            | 0.004   | 0.004    |  |
| Range Err                                      | -           | 1.636            | 0.114   | 0.114 65 |  |
| (c) From our delay model, omitting clock skew. |             |                  |         |          |  |

• >1ns on AES-x3:

|                         | Base     | Shade<br>Source | ow Method<br>Post P&R<br>(this work) |
|-------------------------|----------|-----------------|--------------------------------------|
| Slice util.             | 87.0%    | 86.9%           | 89.8%                                |
| LUT util.               | 71.4%    | 71.7%           | $\mathbf{72.7\%}$                    |
| Register util.          | 10.6%    | 12.4%           | $\mathbf{12.3\%}$                    |
| Tcrit (ns)              | 6.055    | 5.865           | 6.055                                |
| Num. critical nets      | 5362     | 7400            | $\boldsymbol{5362}$                  |
| Common to Base          | 100%     | 53.1%           | 100%                                 |
| Shad. coverage          | -        | 53.1%           | $\mathbf{94.2\%}$                    |
| Pack & Place time $(s)$ | 1014     | 1408            | -                                    |
| Routing runtime $(s)$   | 479      | 486             | ${\bf 224}$                          |
|                         | (b) AES- | x3 (within      | n $40\%$ of Tcrit)                   |

• >1ns on JPEG-x2:

|                         | Base     | Shade<br>Source | ow Method<br>Post P&R<br>(this work) |
|-------------------------|----------|-----------------|--------------------------------------|
| Slice util.             | 86.8%    | 88.8%           | 90.1%                                |
| LUT util.               | 60.9%    | 61.3%           | $\mathbf{61.8\%}$                    |
| Register util.          | 41.7%    | 42.7%           | $\mathbf{42.6\%}$                    |
| Tcrit (ns)              | 13.936   | 15.611          | 13.939                               |
| Num. critical nets      | 3290     | 3335            | 3295                                 |
| Common to Base          | 100%     | 46.7%           | $\mathbf{99.8\%}$                    |
| Shad. coverage          | -        | 46.7%           | $\boldsymbol{84.0\%}$                |
| Pack & Place time $(s)$ | 871      | 911             | -                                    |
| Routing runtime (s)     | 748      | 3658            | <b>276</b>                           |
|                         | (c) JPEG | -x2 (withi      | n $40\%$ of Tcrit)                   |

• Reasons behind failure

|                         | LEON3 | AES-x3 | JPEG-x2 |
|-------------------------|-------|--------|---------|
| Num. critical nets      | 1436  | 5362   | 3290    |
| Nets shadowed           | 1417  | 5049   | 2769    |
| i) Blocked in LE        | 10    | 313    | 468     |
| ii) Blocked out LE      | 4     | -      | 23      |
| iii) Path search failed | 5     | -      | 30      |

- i) and ii) require user circuit to be ripped up
- Only iii) could be solved by better algorithms