# PIMap

#### A Parallelized Iterative Improvement Approach to Area Optimization for LUT-Based Technology Mapping

Gai Liu and Zhiru Zhang

Computer Systems Lab, Electrical and Computer Engineering, Cornell University







- Technology mapping is an essential step in the FPGA CAD flow
  - Dictates the design area (i.e., number of LUTs)
  - Has a large impact on the timing of the final design

#### A typical FPGA CAD flow



- Cover a gate-level logic network using LUTs
  - A K-input LUT (K-LUT) can implement any K-input, 1-output combinational logic network
  - This work focuses on combinational circuits
- Quality metrics for technology mapping
  - Area: number of LUTs needed
  - Depth: longest path from a primary input (PI) to a primary output (PO), in number of LUTs
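The two ideas above (a K-LUT as a truth table, and area/depth as mapping metrics) can be illustrated with a minimal Python sketch. The helper names (`build_lut`, `eval_lut`, `area`, `depth`) and the dictionary-based network format are assumptions made for illustration; they are not taken from PIMap or ABC.

```python
from itertools import product


def build_lut(func, k):
    """Tabulate a k-input Boolean function into a 2**k-entry truth table."""
    return [func(*bits) for bits in product((0, 1), repeat=k)]


def eval_lut(table, inputs):
    """Look up the output; the first input is the most significant index bit."""
    index = 0
    for bit in inputs:
        index = (index << 1) | bit
    return table[index]


def area(network):
    """Area metric: number of LUTs in the mapped network."""
    return len(network)


def depth(network, outputs):
    """Depth metric: longest PI-to-PO path, counted in LUTs."""
    memo = {}

    def level(node):
        if node not in network:  # not a LUT, so it is a primary input
            return 0
        if node not in memo:
            memo[node] = 1 + max((level(f) for f in network[node]), default=0)
        return memo[node]

    return max(level(o) for o in outputs)


# Hypothetical example: xor5 covered by two 3-input LUTs.
xor3 = build_lut(lambda a, b, c: a ^ b ^ c, 3)
network = {"n1": ["a", "b", "c"], "out": ["n1", "d", "e"]}
print(area(network), depth(network, ["out"]))  # prints "2 2"
```

Any 3-input function fits in the same 8-entry table, which is why LUT count, rather than gate count, is the area metric after mapping.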



Goal: map to 3-input LUTs



- Case 1: no logic restructuring
  - Already NP-hard <sup>[1]</sup>
- Case 2: with logic restructuring
  - Even harder to find an optimal solution
  - Existing approach: heuristically transform the logic network for better mapping quality



[1] Farrahi and Sarrafzadeh, TCAD'02

Focus of this work: Case 2, mapping with logic restructuring



#### **Representative Academic Mappers**



Average area reduction for a set of MCNC benchmarks

- Chortle: Francis, et al., DAC'90
- DAGMap: Chen, et al., DT'92
- FlowMap: Cong and Ding, TCAD'94
- CutMap: Cong and Hwang, FPGA'95
- DAOMap: Chen and Cong, ICCAD'04
- K and L: Kao and Lai, TDAES'05
- Imap: Manohararajah, et al., TCAD'06
- ABC Map: Mishchenko, et al., TCAD'07
- Exact synthesis: Haaswijk, et al., ASPDAC'17


### **World Record for Area Optimization**

#### Best LUT-6 implementation for EPFL benchmark suite <sup>[1]</sup>

Best results for LUT6 count

Arithmetic

| Benchmark name      | Author's name                       | Author's affiliation | Synthesis Method   | Size | Depth |
|---------------------|-------------------------------------|----------------------|--------------------|------|-------|
| Adder               | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 201  | 73    |
| Barrel Shifter      | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 512  | 4     |
| Divisor             | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 3813 | 1542  |
| Hypotenuse          | ***                                 | ***                  | ***                | ***  | ***   |
| Log2                | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 7344 | 142   |
| Max                 | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 532  | 192   |
| Multiplier          | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 5681 | 120   |
| Sine                | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 1347 | 62    |
| Square-root         | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 3286 | 1180  |
| Square              | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 3800 | 116   |

Random-control

| Benchmark name      | Author's name                       | Author's affiliation | Synthesis Method   | Size | Depth |
|---------------------|-------------------------------------|----------------------|--------------------|------|-------|
| Round-robin arbiter | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 429  | 24    |
| ALU ctrl            | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 29   | 2     |
| Coding-CAVLC        | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 107  | 6     |
| Decoder             | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 272  | 2     |
| i2c controller      | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 230  | 7     |
| Int2float           | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 34   | 4     |
| Mem ctrl            | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 2399 | 23    |
| Priority encoder    | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 118  | 27    |
| Lookahead XY router | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 53   | 6     |
| Voter               | Robert K. Brayton & Alan Mishchenko | UC Berkeley          | ABC Extreme Mapper | 1521 | 18    |

[1] Amarù, et al., http://lsi.epfl.ch/benchmarks

### **Common Restructuring Techniques**

Heuristically shrink logic network by logic rewriting (e.g., [1])



[1] Mishchenko, Chatterjee, Brayton, DAC'06


► A typical area-minimizing script in ABC<sup>[1]</sup>:

balance  $\rightarrow$  rewrite  $\rightarrow$  balance  $\rightarrow$  rewrite

[1] Mishchenko, et al., TCAD'07
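As a rough illustration, the balance/rewrite sequence can be assembled into an ABC command string from Python. This is a sketch under assumptions: the input file name `xor5.blif` is hypothetical, the surrounding `read`, `strash`, `if -K`, and `print_stats` commands are added here to make a complete run and are not part of the script shown above, and the number of rounds is a free parameter.

```python
def area_script(input_file, k=6, rounds=2):
    """Build an ABC command string: repeated balance/rewrite, then LUT mapping.

    The file name, LUT size k, and number of rounds are illustrative
    assumptions, not values prescribed by the slides.
    """
    steps = [f"read {input_file}", "strash"]
    for _ in range(rounds):
        steps += ["balance", "rewrite"]
    steps += [f"if -K {k}", "print_stats"]
    return "; ".join(steps)


# The resulting string could be passed to ABC, e.g. `abc -c "<commands>"`,
# assuming an ABC binary is installed and on the PATH.
commands = area_script("xor5.blif")
```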

Initial AND-inverter graph (AIG) for xor5









Key rationale of previous techniques: smaller logic network → smaller post-mapping circuit **?**



Our study: a smaller logic network does **not necessarily** lead to a smaller post-mapping circuit

#### PIMap: A Parallelized Iterative Improvement Approach to LUT-Based Tech Mapping

- Couple mapping and logic transformation
  - Close the gap between logic optimization and tech mapping
  - Incrementally improve area



- Effective partitioning and parallelization technique
  - Improve both runtime and design quality



#### **PIMap Technique: Iterative Area Minimization**

Intuition: use mapping result to guide randomly proposed logic transformations


Metropolis-Hastings algorithm<sup>[1]</sup>: Accept current transformation if  $rand(0,1) < \exp(-\gamma \frac{N_{LUT\_new}}{N_{LUT\_old}})$ 

[1] Hastings, Biometrika'70
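The acceptance rule can be sketched in Python as follows. The function names, the value of `gamma`, the toy "design" (an integer standing in for a LUT count), and the tracking of the best design seen are all assumptions for illustration; in the real flow each proposal would be a logic transformation followed by technology mapping.

```python
import math
import random


def acceptance_probability(n_new, n_old, gamma):
    """exp(-gamma * N_LUT_new / N_LUT_old), as written in the rule above."""
    return math.exp(-gamma * n_new / n_old)


def accept(n_new, n_old, gamma, rng):
    """Accept the transformation if rand(0,1) falls below the probability."""
    return rng.random() < acceptance_probability(n_new, n_old, gamma)


def iterate(initial, propose, lut_count, gamma, iterations, rng):
    """Repeatedly propose a random transformation and accept it stochastically.

    Keeping the best design seen so far is an assumption of this sketch.
    """
    current = best = initial
    for _ in range(iterations):
        candidate = propose(current, rng)
        if accept(lut_count(candidate), lut_count(current), gamma, rng):
            current = candidate
            if lut_count(current) < lut_count(best):
                best = current
    return best


# Toy stand-ins (hypothetical): a design is just its LUT count, and a
# "transformation" changes it by -1, 0, or +1.
def toy_propose(n, rng):
    return max(1, n + rng.choice((-1, 0, 1)))


best = iterate(15, toy_propose, lambda n: n, gamma=1.0, iterations=100,
               rng=random.Random(0))
```

Under this rule a candidate with fewer LUTs is accepted with higher probability than one with more, but worse candidates are still occasionally accepted, which helps the search escape local minima.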


## **Partitioning Schemes**

- No circuit partitioning
  - Long runtime per trial
  - Easily stuck at local minimum
- Fine-grained partition
  - Similar concept to exact synthesis
  - Fast runtime per trial
  - Slow progress overall

- Coarse-grained partition
  - Balance runtime and solution quality
  - Repartition between trials to further improve quality



Flow: initial mapping to LUTs → subgraph extraction → iterative area minimization → recombine subgraphs



## **PIMap Technique: Repartition**

Flow: initial mapping to LUTs → subgraph extraction → iterative area minimization → recombine subgraphs → repartition using different seeds
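One way to picture seed-dependent repartitioning is a size-capped breadth-first growth from randomly chosen starting LUTs. The sketch below is an assumption made for illustration only: the slides do not specify the partitioning algorithm, and the function name, adjacency format, and parameters are hypothetical.

```python
import random
from collections import deque


def partition(adjacency, max_size, max_parts, seed):
    """Grow up to max_parts connected subgraphs of at most max_size nodes each.

    adjacency maps each LUT to its neighboring LUTs. Different seeds pick
    different starting nodes, so partition boundaries can vary between trials.
    """
    rng = random.Random(seed)
    unassigned = set(adjacency)
    parts = []
    while unassigned and len(parts) < max_parts:
        start = rng.choice(sorted(unassigned))
        unassigned.discard(start)
        part, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            part.append(node)
            neighbors = sorted(n for n in adjacency[node] if n in unassigned)
            rng.shuffle(neighbors)
            for n in neighbors:
                # Only claim a neighbor if the subgraph still has room for it.
                if len(part) + len(queue) < max_size:
                    unassigned.discard(n)
                    queue.append(n)
        parts.append(part)
    return parts


# Hypothetical example: a 10-node chain split into subgraphs of at most 3 nodes.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
parts_a = partition(chain, max_size=3, max_parts=16, seed=1)
parts_b = partition(chain, max_size=3, max_parts=16, seed=2)
```

Nodes left unassigned once `max_parts` is reached would simply stay untouched in that trial and may be picked up after the next repartition.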

### **PIMap Overall Flow**

Design C1908 from the MCNC benchmark suite, 5 trials in total



**Initial Design** 

#### Observations:

1. Partition boundaries vary between trials
   - → Uncover better structure
2. Overall network structure differs significantly between trials
   - → Discover a wide range of designs

## **Experimental Setup**

PIMap toolchain:

- ABC's tech mapper
- ABC's logic transformations: balance, rewrite, refactor
- Iterative area minimization routine
- Subgraph extraction and parallelization control

Benchmarks:

| Benchmark                              | Initial design                                        |
|----------------------------------------|-------------------------------------------------------|
| 10 largest MCNC designs <sup>[1]</sup> | Pre-synthesized using ABC's <i>compress2rs</i> script |
| EPFL arithmetic designs <sup>[2]</sup> | Best-known mapping designs <sup>[2]</sup>             |

[1] Yang, MCNC'91

[2] Amarù, et al., http://lsi.epfl.ch/benchmarks

#### Setup

| Setting            | Value                                                    |
|--------------------|----------------------------------------------------------|
| Configuration      | 40 trials, 100 iterations of area minimization per trial |
| Partitioning       | up to 16 subgraphs, each with up to 100 LUTs             |
| Computing resource | up to 8 machines, each with a quad-core Xeon processor   |



Figure legend: initial design, 5 trials, 10 trials, 40 trials

- Initial design: <u>best-known results</u> from the EPFL record
- Area improvements
  - EPFL: 7% on average, up to 14%
  - Can effectively handle a very large circuit (~44k LUTs)
- Also able to improve all 10 control-intensive designs in the EPFL benchmark suite

#### LUT Count vs. Gate Count Reduction

Verified: post-mapping area does not necessarily correlate with pre-mapping area



### Subgraph Size vs. Runtime

Tradeoff between runtime and progress per trial

Optimal subgraph size is around 100 LUTs



### **Depth Constrained Area Minimization**

- Constraint: no depth increase compared to initial design
  - Initial designs generated by ABC's depth-minimizing *resyn2* script
  - In PIMap, only accept designs within depth constraint after each trial
- Area improvements under depth constraint
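The depth check after each trial can be sketched as a simple filter. The `Design` record and `select_after_trial` function are hypothetical names, and choosing the smaller-area design among those that meet the constraint is an assumption of this sketch.

```python
from typing import NamedTuple


class Design(NamedTuple):
    area: int   # number of LUTs
    depth: int  # longest PI-to-PO path in LUTs


def select_after_trial(best, candidate, depth_limit):
    """Keep the candidate only if it meets the depth limit and reduces area."""
    if candidate.depth > depth_limit:
        return best  # violates the constraint, so it is rejected
    return candidate if candidate.area < best.area else best


# Hypothetical numbers: the depth limit is the depth of the initial design.
initial = Design(area=100, depth=10)
kept = select_after_trial(initial, Design(area=95, depth=10), initial.depth)
rejected = select_after_trial(initial, Design(area=90, depth=11), initial.depth)
```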



#### **Area Reduction under a Tight Runtime Limit**

- In use cases with a tight runtime budget
  - Use fewer trials and fewer iterations per trial
  - PIMap is still able to improve most of the best-known results for the EPFL benchmark designs

Area reduction using PIMap with a tight runtime limit (runtime limit: 10 seconds)

| Design  | Best-known | PIMap |
|---------|------------|-------|
| Adder   | 201        | 197   |
| Shifter | 512        | 512   |
| Divisor | 3813       | 3787  |
| Hyp     | 44635      | 44635 |
| Log2    | 7344       | 7305  |
| Max     | 532        | 526   |
| Mult    | 5681       | 5594  |
| Sine    | 1347       | 1309  |
| Sqrt    | 3286       | 3279  |
| Square  | 3800       | 3675  |
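To make the table easier to read, the relative reduction for each design can be computed directly from the LUT counts listed above. The helper name `reduction_percent` is an assumption; the numbers are copied from the table.

```python
# (best-known LUT count, PIMap LUT count) under the 10-second runtime limit
results = {
    "Adder": (201, 197),
    "Shifter": (512, 512),
    "Divisor": (3813, 3787),
    "Hyp": (44635, 44635),
    "Log2": (7344, 7305),
    "Max": (532, 526),
    "Mult": (5681, 5594),
    "Sine": (1347, 1309),
    "Sqrt": (3286, 3279),
    "Square": (3800, 3675),
}


def reduction_percent(before, after):
    """Relative area reduction in percent."""
    return 100.0 * (before - after) / before


reductions = {name: reduction_percent(b, a) for name, (b, a) in results.items()}
improved = [name for name, r in reductions.items() if r > 0]
largest = max(reductions, key=reductions.get)

for name, r in reductions.items():
    print(f"{name:8s} {r:5.2f}%")
```

With these numbers, 8 of the 10 designs improve within the runtime limit, and Square shows the largest relative reduction.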

## Conclusions

- Circuit area before and after mapping does not necessarily correlate
- Stochastic mapping-in-the-loop approach for area minimization
- Sub-circuit extraction and parallelization for runtime reduction
- Area improvement of up to 14%, and 7% on average, over the best-known records for the EPFL arithmetic benchmark suite
- Future work: depth minimization in tech mapping

