# Accelerating Face Detection on Programmable SoC Using C-Based Synthesis

Nitish Srivastava, Steve Dai, Rajit Manohar, Zhiru Zhang

Computer Systems Laboratory, Electrical and Computer Engineering, Cornell University






## Summary

- High-level synthesis (HLS): an emerging alternative to traditional register-transfer-level (RTL) design that improves design productivity
- Context: few complex applications are available for benchmarking FPGA HLS tools
- Case study: Viola-Jones face detection
  - A **<u>complex</u>** and **<u>reducible</u>** benchmark
  - Has a realistic performance constraint
  - Widely used in embedded applications (surveillance, photography, etc.)
- Previous work: to the best of our knowledge, there is no existing open-source RTL/HLS implementation

### Our contributions

- Making a pure software implementation<sup>[1]</sup> synthesizable and optimizing it for FPGAs
- Implementing and evaluating the design on a real FPSoC board
- Open-sourcing the design for the FPGA and HLS communities

## **Viola-Jones: A Realistic and Reducible App**

#### It has a realistic constraint

30 fps for real-time image processing
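To make the 30 fps target concrete, the sketch below converts a frame rate into a per-frame time and cycle budget. The function names are illustrative, and any clock frequency passed in (for example 100 MHz) is a hypothetical value, not a figure from this work.

```cpp
#include <cstdint>

// Time available to process one frame, in microseconds (integer division).
constexpr std::uint64_t frame_budget_us(std::uint64_t fps) {
    return 1000000ULL / fps;
}

// Clock cycles available per frame for a given clock frequency.
// The clock frequency is an assumption supplied by the caller.
constexpr std::uint64_t cycles_per_frame(std::uint64_t clock_hz,
                                         std::uint64_t fps) {
    return clock_hz / fps;
}
```

At 30 fps the whole detection pipeline has roughly 33 ms per frame, so every kernel has to fit inside that cycle budget.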





#### It is reducible

Composed of small kernels such as the integral image and the cascaded classifier:



Image

Sum of the pixels in the selected region: 1 + 5 + 3 + 2 + 5 + 4 = 20

Integral Image

|    |    |    |    |    |
|----|----|----|----|----|
| 1  | 3  | 5  | 9  | 10 |
| 4  | 10 | 13 | 22 | 25 |
| 6  | 15 | 21 | 32 | 39 |
| 10 | 20 | 31 | 46 | 59 |
| 16 | 29 | 42 | 58 | 74 |

The same sum from four integral-image lookups: 46 - 9 - 20 + 3 = 20
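The four-lookup sum can be reproduced with a minimal sketch, shown here for a fixed 5x5 image. The pixel values in `kExampleImage` are not listed in the source; they are recovered by differencing the integral-image table above. All function and constant names are illustrative.

```cpp
constexpr int N = 5;  // image size of the worked example

// ii[r][c] = sum of img over rows 0..r and columns 0..c.
void integral_image(const int img[N][N], int ii[N][N]) {
    for (int r = 0; r < N; ++r) {
        int row_sum = 0;  // running sum of the current row
        for (int c = 0; c < N; ++c) {
            row_sum += img[r][c];
            ii[r][c] = row_sum + (r > 0 ? ii[r - 1][c] : 0);
        }
    }
}

// Sum of img over the inclusive rectangle rows r0..r1, columns c0..c1,
// computed from at most four integral-image lookups.
int rect_sum(const int ii[N][N], int r0, int c0, int r1, int c1) {
    const int above_left = (r0 > 0 && c0 > 0) ? ii[r0 - 1][c0 - 1] : 0;
    const int above      = (r0 > 0) ? ii[r0 - 1][c1] : 0;
    const int left       = (c0 > 0) ? ii[r1][c0 - 1] : 0;
    return ii[r1][c1] - above - left + above_left;
}

// Pixels reconstructed from the integral-image table (an inference, not
// data printed in the source).
const int kExampleImage[N][N] = {
    {1, 2, 2, 4, 1},
    {3, 4, 1, 5, 2},
    {2, 3, 3, 2, 4},
    {4, 1, 5, 4, 6},
    {6, 3, 2, 1, 3},
};
```

With zero-based indices, `rect_sum(ii, 1, 2, 3, 3)` reads the entries 46, 9, 20, and 3 from the table, which matches the arithmetic above. The cost of a rectangle sum is constant regardless of the rectangle's size.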

Cascaded Classifier
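As a rough illustration of the cascade idea, the sketch below rejects a window at the first stage whose accumulated score falls below that stage's threshold. The stage count, scores, and thresholds are hypothetical placeholders; the actual classifier data and structure in this design are not described here.

```cpp
constexpr int kNumStages = 3;  // hypothetical; real cascades have more stages

// Returns the index of the first stage that rejects the window,
// or kNumStages if every stage accepts it.
int first_rejecting_stage(const int stage_sum[kNumStages],
                          const int stage_threshold[kNumStages]) {
    for (int s = 0; s < kNumStages; ++s) {
        if (stage_sum[s] < stage_threshold[s]) {
            return s;  // early exit: later stages are never examined
        }
    }
    return kNumStages;
}

// A window is reported as a face only if it passes all stages.
bool is_face(const int stage_sum[kNumStages],
             const int stage_threshold[kNumStages]) {
    return first_rejecting_stage(stage_sum, stage_threshold) == kNumStages;
}
```

For simplicity the per-stage scores are passed in precomputed; in a real cascade each stage's score would be computed only if the previous stage accepted the window, which is where the early exit saves work.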



### **Design Complexity in Parallel or Pipeline**



### **Design Complexity in Memory Banking**

Hand-coded window and line buffers

The integral image window buffer is stored in discrete registers
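A minimal sketch of the line-buffer and window-buffer pattern follows, assuming pixels arrive one at a time in raster order. The width `W`, window size `K`, and the `WindowBuffer` type are hypothetical and much smaller than a real design; this is not the authors' code. In an HLS flow the small `win` array would typically be fully partitioned so that it maps to registers, while `line` would map to on-chip memory.

```cpp
constexpr int W = 8;  // hypothetical image width
constexpr int K = 3;  // hypothetical window size (K x K)

struct WindowBuffer {
    int line[K - 1][W] = {};  // the K-1 most recent complete rows
    int win[K][K] = {};       // sliding K x K window

    // Consume one pixel at column `col` of the current row.
    void push(int pixel, int col) {
        // Shift the window one column to the left.
        for (int r = 0; r < K; ++r)
            for (int c = 0; c < K - 1; ++c)
                win[r][c] = win[r][c + 1];
        // Fill the new right column: older rows from the line buffer,
        // newest row from the incoming pixel.
        for (int r = 0; r < K - 1; ++r)
            win[r][K - 1] = line[r][col];
        win[K - 1][K - 1] = pixel;
        // Age this column of the line buffer and store the new pixel.
        for (int r = 0; r < K - 2; ++r)
            line[r][col] = line[r + 1][col];
        line[K - 2][col] = pixel;
    }
};
```

Each input pixel is read from external memory once; all reuse across neighboring windows is served from the two buffers. Note that the window contents are only meaningful once at least K rows and K columns of the current row have been pushed.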

**Integral image banking**, then **synthesis**: failed timing, **170K LUTs**



### Implementation

Source code: https://github.com/cornell-zhang/facedetect-fpga



- ZC-706 board
  - Xilinx Zynq-7000 XC7Z045 FPGA
  - ARM Cortex-A9 CPU
- Xilinx SDSoC 2016.1
  - Used to generate the data-motion network

| Resource  | Usage (% of device) |
|-----------|---------------------|
| LUT       | 65,522 (29%)        |
| Registers | 81,135 (19%)        |
| DSP48E    | 111 (12%)           |
| BRAM 18K  | 157 (29%)           |