

#### Quality-Time Tradeoffs in Component-Specific Mapping



How to Train Your Dynamically Reconfigurable Array of Gates with Outrageous Network-delays

Hans Giesen, Raphael Rubin, Benjamin Gojman, André DeHon

University of Pennsylvania





# Component-Specific Mapping

- Traditionally: Same bitstream for every chip.
  - Poor performance and energy
- Mapping using actual delays reduces delay and energy.
  - Mapping can take days.
- Lightweight mapping combines the best of both.
  - 77% of the maximum delay benefit is achievable in 18 seconds.
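
One way to read the "77% of the maximum delay benefit" figure is as a ratio of delay reductions. This is an interpretation, not a definition taken from the slides, and the symbols are introduced here for illustration: $D_{\text{same}}$ is the delay with the same bitstream on every chip, $D_{\text{actual}}$ the delay when mapping with actual delays, and $D_{\text{light}}$ the delay of the lightweight mapping.

```latex
\[
  \text{fraction of benefit}
  = \frac{D_{\text{same}} - D_{\text{light}}}{D_{\text{same}} - D_{\text{actual}}}
  \approx 0.77
\]
```

Under this reading, a value of 1 would match the mapping that uses actual delays, and 0 would be no better than the shared bitstream.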



[Plot: x-axis is mapping time (s).]

# Our Contributions

- We achieved a lightweight mapping.
  - Using online measurement
  - Adapting precomputed alternatives
- We developed 5 algorithms at intermediate points.
  - Characterized time and quality for all algorithms.

#### Overview

- Introduction
- Lightweight solutions
- Experiments
- Conclusions

# Introduction

#### Mapping – CAD View



#### Process Variation



• One-Mapping-Fits-All (OMFA)

[Figure: probability density versus delay (s), axis ticks from 1e-10 to 1e-06, with the nominal delay, the clock period, and the timing margin marked.]

# Full-Knowledge (FK) Mapping



#### Full-Knowledge (FK) Mapping - Repaired




# Different Chip




#### Different Chip - Repaired



- Measure all path delays using GROK-LAB or GROK-INT [Gojman, 2014].
- Component-specific mapping using PathFinder.
- These steps can take hours to days.
- Must be repeated for every chip.
# Lightweight solutions

• Precompute alternatives for 2-point nets.

[Figure sequence: Net 1 is shown on its base route and then on alternatives 1 to 3 while Net 2 stays on its base route; Net 2 is then shown on alternatives 1 to 3 while Net 1 stays on its base route.]

### CYA for Variation

- Original CYA
  - Developed for defect-tolerance.
  - Functionality check at load-time: 1 or 0
- CYA for variation
  - Binary timing circuits: late or not
- But: alternatives may conflict.
  - Can run out of good alternatives.
  - Delay unknown, so cannot prioritize nets.
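
The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a route is modeled as a set of routing-resource names, `choose_routes` and `is_late` are hypothetical names, and `is_late` stands in for the binary timing circuit that only reports "late or not". Because the delay itself is unknown, nets are visited in a fixed order rather than by criticality, and a net whose alternatives all conflict or fail ends up with no route.

```python
from typing import Callable, Dict, FrozenSet, List, Optional

# Assumption: a route is the set of routing resources (e.g. wire names) it uses.
Route = FrozenSet[str]


def choose_routes(
    alternatives: Dict[str, List[Route]],
    is_late: Callable[[str, Route], bool],
) -> Dict[str, Optional[Route]]:
    """Pick one precomputed route per net using only a late/not-late test.

    alternatives[net][0] is the base route; later entries are alternatives.
    A net maps to None if it is late and no usable alternative remains.
    """
    # Start every net on its base route.
    chosen: Dict[str, Optional[Route]] = {
        net: routes[0] for net, routes in alternatives.items()
    }
    for net, routes in alternatives.items():
        if not is_late(net, routes[0]):
            continue  # base route meets timing, keep it
        # Resources currently held by all other nets.
        used = set()
        for other, route in chosen.items():
            if other != net and route is not None:
                used |= route
        chosen[net] = None
        for candidate in routes[1:]:
            if candidate & used:
                continue  # alternative conflicts with another net's route
            if not is_late(net, candidate):
                chosen[net] = candidate
                break
    return chosen


# Hypothetical example: net1's base route is late on this chip.
alternatives = {
    "net1": [frozenset({"w1"}), frozenset({"w2"}), frozenset({"w3"})],
    "net2": [frozenset({"w2"}), frozenset({"w4"})],
}
late_routes = {("net1", frozenset({"w1"}))}
result = choose_routes(alternatives, lambda net, route: (net, route) in late_routes)
```

In this example net1 skips its first alternative because it shares `w2` with net2's base route, which is the conflict case the slide warns about.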













#### Difference Detector with First-Fail Latch

• Adds 4% area overhead to Stratix-IV CLB.



#### Incremental CYA



- 20 chips, random data input
- Upcoming results reported at 95% yield point.
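
As a rough illustration of what reporting at a yield point can mean, the sketch below takes one delay per chip and returns the smallest clock period at which the requested fraction of chips meets timing. The function name `delay_at_yield` and this exact definition are assumptions made for illustration; the slides do not spell out how the 95% point is computed.

```python
import math
from typing import Sequence


def delay_at_yield(chip_delays: Sequence[float], yield_fraction: float = 0.95) -> float:
    """Smallest delay target met by at least yield_fraction of the chips."""
    if not chip_delays:
        raise ValueError("need at least one chip delay")
    if not 0.0 < yield_fraction <= 1.0:
        raise ValueError("yield_fraction must be in (0, 1]")
    ordered = sorted(chip_delays)
    # Number of chips that must meet timing; the small tolerance guards
    # against floating-point round-up in the product.
    needed = max(1, math.ceil(yield_fraction * len(ordered) - 1e-9))
    return ordered[needed - 1]


# Hypothetical delays for 20 chips, in arbitrary units.
delays = list(range(20, 0, -1))
point = delay_at_yield(delays, 0.95)
```

With 20 chips, a 95% yield point under this definition is set by the 19th fastest chip, so a single slow outlier does not determine the reported delay.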



# Experiments

### Methodology

- VPR 5.0.2 with CYA extension
  - Custom timing simulator for Incremental CYA
- Toronto20 benchmarks
- 22-nm CMOS, 0.8 V nominal
- Architecture similar to Stratix-IV
  - 4 LUTs, 16 CLB inputs, and 16 tracks extra
- 64 alternatives per 2-point net






















## Conclusions




- OMFA mapping has an outrageous timing margin.
- FK mapping typically takes days.
  - Must be repeated for each chip.
- Fast algorithms that eliminate most of the timing margin are feasible.
  - CYA achieves >50% of the delay and energy gain.
  - Incremental CYA achieves >50% of the energy gain and >70% of the delay gain.



