#### A 25mm<sup>2</sup> SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET

**Thierry Tambe**<sup>1</sup>

En-Yu Yang<sup>1</sup>, Glenn G. Ko<sup>1</sup>, Yuji Chai<sup>1</sup>, Coleman Hooper<sup>1</sup>, Marco Donato<sup>2</sup>, Paul N. Whatmough<sup>1,3</sup>, Alexander M. Rush<sup>4</sup>, David Brooks<sup>1</sup>, and Gu-Yeon Wei<sup>1</sup>



<sup>1</sup>Harvard University, Cambridge, MA, <sup>2</sup>Tufts University, Medford, MA, <sup>3</sup>ARM, Boston, MA, <sup>4</sup>Cornell University, New York, NY

© 2021 IEEE International Solid-State Circuits Conference

9.8: A 25mm<sup>2</sup> SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET

#### **Self Introduction**

- Thierry Tambe is an EE PhD student at Harvard University.
- Current research focuses on designing algorithms and energy-efficient, high-performance hardware accelerators and systems for machine learning, and natural language processing in particular.
- Thierry was a staff engineer at Intel, Hillsboro, OR, USA (2012-2017), designing various analog/mixed-signal architectures for high-bandwidth memory and peripheral interfaces on Xeon and Xeon-Phi HPC SoCs.
- B.S. (2010) and M.Eng. (2012) both in Electrical Engineering from Texas A&M University.



# **Motivating Vision of the Future**

Cloud and/or smartphone-based *personal AI assistant* with speech-based conversational AI interfaces

Polyphonic adaptability *and* linguistic context understanding are of paramount importance

#### Amazon's Echo Frames



Double-press the mic off button to disconnect the microphones

Always-ON KWS

On-demand (or perpetual) ASR

Numerous NLP tasks, e.g., translation and text-to-speech

#### **Three Main Messages**



#### 1) Accelerating the Attention Mechanism



- The attention mechanism is nowadays the most effective neural building block in NLP
- Accelerating attention poses challenges and opportunities distinct from those of CNNs and RNNs

#### Further context learning with Bidir. RNNs



- Bidirectional RNNs capture context by concatenating forward and backward time steps
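As a rough illustration of the concatenation idea, the following NumPy sketch runs one recurrent pass in each direction and joins the per-time-step states. It assumes a plain tanh RNN cell in place of an LSTM, and every name and shape here (`rnn_pass`, `bidirectional_rnn`, `T`, `D`, `H`) is hypothetical rather than taken from the chip.

```python
import numpy as np

def rnn_pass(x, w_in, w_rec):
    """Run a plain tanh RNN over x (T, D) and return all hidden states (T, H)."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(w_in @ x_t + w_rec @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_rnn(x, fwd, bwd):
    """Concatenate forward states with re-aligned backward states per time step."""
    h_fwd = rnn_pass(x, *fwd)
    # Process the sequence in reverse, then flip back so index t matches time t.
    h_bwd = rnn_pass(x[::-1], *bwd)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
T, D, H = 5, 4, 3  # assumed toy sizes: time steps, input width, hidden width
x = rng.standard_normal((T, D))
fwd = (rng.standard_normal((H, D)), rng.standard_normal((H, H)))
bwd = (rng.standard_normal((H, D)), rng.standard_normal((H, H)))
out = bidirectional_rnn(x, fwd, bwd)  # shape (T, 2 * H)
```

Each output row sees the past through its first `H` entries and the future through its last `H` entries, which is the context benefit the slide refers to.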

## **Attention + Bidir. RNN Benefits**

| Model                                     | #Params | WER (LibriSpeech) | Improvement |
|-------------------------------------------|---------|-------------------|-------------|
| Unidir. LSTM w/o Attention (Conventional) | 3.9M    | 27.6              |             |
| Unidir. LSTM w/ Attention                 | 3.3M    | 20.11             | 27.1%       |
| Bidir. LSTM w/ Attention (this work)      | 3.5M    | 10.54             | 61.8%       |

- Attention-based bidirectional RNNs can reduce WER by up to 62% relative to a simpler unidirectional RNN
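The improvement percentages are consistent with a relative WER reduction against the unidirectional LSTM without attention. The short check below assumes that definition; the function name is illustrative.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

baseline = 27.6  # Unidir. LSTM w/o attention
uni_attn = round(relative_wer_reduction(baseline, 20.11), 1)  # Unidir. LSTM w/ attention
bi_attn = round(relative_wer_reduction(baseline, 10.54), 1)   # Bidir. LSTM w/ attention
print(uni_attn)  # 27.1
print(bi_attn)   # 61.8
```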

#### 2) Benefit from Pre-Recognition Denoising




### 3) End-to-End Evaluation



### Outline

- Motivation
- Speech-Enhancing ASR
  - Functional Pipeline
  - 16nm SoC Architecture
  - Markov Source Separation Engine (MSSE)
  - Attention-based Seq2Seq Accelerator (FlexASR)
    - FlexASR Processing Element
    - FlexASR Multi-Function Global Buffer
- Chip Measurement Results
- Summary

### Functional Pipeline








- M0: monitors incoming audio amplitudes and subsequently boots accelerators
- Dual A53: performs feature extraction tasks (framing, windowing, 1024-pt FFT)
- MSSE: optimized for unsupervised speech enhancement via Gibbs sampling
- FlexASR: optimized for large vocabulary attention-based bidirectional RNNs

### 16nm SoC Architecture




- Always-on ARM M0
- Dual A53 with 2MB L2 cache
- MSSE: 12 parallel Gibbs samplers
- FlexASR: 4 processing elements and a multi-function global buffer
- 128-bit AXI and 32-bit AHB NoCs





# **Gibbs Sampling Inference**



Figure legend:

- Nodes representing input speech features
- Nodes representing output labels corresponding to feature locations
- Node being sampled
- Observed node
- Neighbor labels

[2] G. Ko et al., A 3mm<sup>2</sup> Programmable Bayesian Inference Accelerator for Unsupervised Machine Perception using Parallel Gibbs Sampling in 16nm, VLSI Symposium, 2020


#### MRF Sound Source Separation Engine (MSSE)

- Highly optimized for sound source separation while PGMA [2] is a general-purpose Bayesian inference accelerator
  - Only supports binary labels
  - 2x speedup over PGMA [2]
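To make the binary-label Gibbs sampling concrete, here is a minimal NumPy sketch of sampling a binary MRF on a 2-D grid. It is not the MSSE's actual model or schedule: the unary term, the coupling strength `beta`, the 4-connected neighborhood, and the checkerboard update order are all assumptions chosen for illustration.

```python
import numpy as np

def neighbor_ones(labels):
    """Per node: number of 4-connected neighbors labeled 1, and neighbor count."""
    padded = np.pad(labels, 1)
    ones = padded[:-2, 1:-1] + padded[2:, 1:-1] + padded[1:-1, :-2] + padded[1:-1, 2:]
    valid = np.pad(np.ones_like(labels), 1)
    total = valid[:-2, 1:-1] + valid[2:, 1:-1] + valid[1:-1, :-2] + valid[1:-1, 2:]
    return ones, total

def gibbs_binary_mrf(unary, beta=1.0, iters=20, seed=0):
    """Sample binary labels; unary holds the per-node log-odds of label 1."""
    rng = np.random.default_rng(seed)
    labels = (unary > 0).astype(np.int64)
    rows, cols = np.indices(unary.shape)
    for _ in range(iters):
        # Same-colored checkerboard nodes share no edges, so they can be
        # resampled together given the other color.
        for color in (0, 1):
            ones, total = neighbor_ones(labels)
            # Neighbors agreeing with label 1 minus neighbors agreeing with label 0.
            log_odds = unary + beta * (2 * ones - total)
            p_one = 1.0 / (1.0 + np.exp(-log_odds))
            sample = (rng.random(unary.shape) < p_one).astype(np.int64)
            mask = (rows + cols) % 2 == color
            labels[mask] = sample[mask]
    return labels

# Toy example: a noisy square standing in for a speech/noise label mask.
truth = np.zeros((16, 16), dtype=np.int64)
truth[4:12, 4:12] = 1
obs = truth + 0.8 * np.random.default_rng(1).standard_normal(truth.shape)
unary = 2.0 * (obs - 0.5)  # assumed log-likelihood ratio
mask = gibbs_binary_mrf(unary, beta=1.0, iters=20)
```

Restricting labels to two values means each conditional reduces to a single sigmoid of the local log-odds, which is one plausible reason a binary-only engine can be simpler than a general-purpose sampler.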





## **FlexASR Processing Element**



- Processing element utilizes a floating-point datapath for high accuracy and high dynamic range computations
- 8-bit weight / activation precision with additional support for 4-bit indexes via weight clustering (2x compression)
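The 2x figure follows from storing a 4-bit index per weight instead of an 8-bit value. The sketch below shows one way this could work, assuming a 16-entry codebook found by 1-D k-means and two indices packed per byte; the clustering method, the float codebook, and all function names are assumptions, not the FlexASR implementation.

```python
import numpy as np

def cluster_weights(weights, n_clusters=16, iters=10):
    """1-D k-means: returns (codebook, indices) with indices in [0, n_clusters)."""
    flat = weights.ravel()
    codebook = np.quantile(flat, np.linspace(0.0, 1.0, n_clusters))
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                codebook[k] = flat[idx == k].mean()
    idx = np.argmin(np.abs(flat[:, None] - codebook[None, :]), axis=1)
    return codebook, idx.astype(np.uint8)

def pack_nibbles(idx):
    """Pack two 4-bit indices into each byte (expects an even number of indices)."""
    return (idx[0::2] << 4) | idx[1::2]

def unpack_nibbles(packed):
    """Recover the 4-bit indices from the packed bytes."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 16))          # one 16-by-16 weight tile
codebook, idx = cluster_weights(w)
packed = pack_nibbles(idx)                 # 128 bytes instead of 256 at 8 bits/weight
w_hat = codebook[unpack_nibbles(packed)].reshape(w.shape)
```

The codebook itself adds a small fixed overhead that this sketch ignores when quoting the 2x ratio.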

# Weight Tiling in FlexASR PE

![](_page_35_Figure_1.jpeg)

- 16-by-16 RNN weight tiles are reordered and interleaved in the weight buffer, as shown above, to ensure hazard-free computation in the activation unit


#### **FlexASR Multi-Function Global Buffer**

![](_page_37_Figure_1.jpeg)

## **FlexASR GB: Attention Mechanism**

![](_page_38_Figure_1.jpeg)

![](_page_38_Figure_2.jpeg)

- Attention mechanism
  - Computes numerically stable version of Softmax
  - MAC operations are skipped for null decoder states
    - Saves energy

# FlexASR GB: Time Step Pooling

![](_page_39_Figure_1.jpeg)

- Attention mechanism
- Time step pooling
  - Mean, Max and Element-Wise Addition
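Time step pooling shortens a sequence by merging adjacent time steps. The sketch below covers the three listed modes, assuming a window of two time steps and a `(T, H)` activation layout; both are illustrative choices rather than documented hardware parameters.

```python
import numpy as np

def pool_time_steps(x, mode="max", window=2):
    """Pool groups of `window` adjacent time steps of x (T, H) into (T // window, H)."""
    t = (x.shape[0] // window) * window  # drop a trailing partial group
    groups = x[:t].reshape(-1, window, x.shape[1])
    if mode == "max":
        return groups.max(axis=1)
    if mode == "mean":
        return groups.mean(axis=1)
    if mode == "add":
        return groups.sum(axis=1)
    raise ValueError(f"unknown mode: {mode}")

x = np.arange(12, dtype=np.float64).reshape(4, 3)  # 4 time steps, 3 features
pooled_max = pool_time_steps(x, "max")
pooled_mean = pool_time_steps(x, "mean")
pooled_add = pool_time_steps(x, "add")
```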

# FlexASR GB: Layer Normalization

![](_page_40_Figure_1.jpeg)

- Attention mechanism
- Time step pooling
- Layer Normalization
  - β and γ parameters stored in the auxiliary buffer
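For reference, layer normalization standardizes each activation vector and then applies a learned scale (gamma) and shift (beta). This is the textbook formulation in NumPy; the epsilon value and array shapes are assumptions, and the code does not model the chip's datapath or buffers.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last axis, then scale by gamma and shift by beta."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))  # 4 time steps, 8 hidden units (toy sizes)
gamma = np.ones(8)               # learned scale, identity here
beta = np.zeros(8)               # learned shift, zero here
y = layer_norm(x, gamma, beta)
```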

# FlexASR GB: Bidir. RNN Operation

![](_page_41_Figure_1.jpeg)

- Attention mechanism
- Time step pooling
- Layer Normalization
- Bidirectional RNN operation
  - Stripes forward and backward time steps across alternate banks in the unified activation buffer

#### FlexASR SW/HW Co-Design and Verification Flow

![](_page_42_Figure_1.jpeg)

#### Looped inside Catapult HLS environment

#### **Tunable HW Parameters**

- # of PEs
- MAC vector size
- Scratchpad size
- Datapath precision
- # of pipeline stages
- # of initiation intervals


From start of SystemC coding to tapeout in less than 4 months!



## **16nm Test Chip**

![](_page_45_Figure_1.jpeg)

![](_page_45_Figure_2.jpeg)

| Parameter      | Value             |
|----------------|-------------------|
| Technology     | TSMC 16nm FFC     |
| Die Area       | 25mm<sup>2</sup>  |
| Total SRAM     | 9.8MB             |
| Gate Count     | 11M               |
| Clock Domains  | 6                 |
| Power Domains  | 5                 |
| Supply Voltage | 0.55 – 1V         |
| Packaging      | Flip-chip BGA-672 |

#### Memory Breakdown (in MB)

| Block         | Sub-block | Size (MB) |
|---------------|-----------|-----------|
| FlexASR       | PEs       | 4.2       |
| FlexASR       | GB        | 1.1       |
| MSSE          |           | 0.103     |
| ARM Dual A53  |           | 2.41      |
| ARM M0        |           | 0.128     |
| SoC Top Level |           | 1.8       |
| Total         |           | 9.8       |

#### **End-to-End Evaluation**

![](_page_46_Figure_1.jpeg)

# **ASR Accuracy**

| Inference Scenario                                                     | ASR Model Size (MB) | Model fits on-chip? | Noise resistant? |
|------------------------------------------------------------------------|---------------------|---------------------|------------------|
| (A) Noiseless audio of the speaker                                     | 3.5                 | Yes                 | No               |
| (B) Noise mixed with the speaker's voice at 0.9dB SNR                  | 3.5                 | Yes                 | No               |
| (C) Large ASR model trained with a noise-corrupted LibriSpeech dataset | 22                  | No                  | Yes              |
| (D) Speech-Enhanced ASR Pipeline (this work)                           | 3.5                 | Yes                 | Yes              |

![](_page_47_Figure_2.jpeg)

- Speech-enhancing pipeline allows much smaller ASR models to be stored on-chip
  - Obviates very inefficient strategy of scaling up the DNN model in order to achieve noise robustness

![](_page_48_Figure_4.jpeg)

# **End-to-End ASR Latency**

- The speech-enhancing ASR pipeline is 4.3x faster than the scaled-up DNN approach (C)
- Speech denoising and ASR account for 19% and 40% of latency, respectively

![](_page_49_Figure_5.jpeg)

## **Breakdown of CPU Work**

![](_page_50_Figure_1.jpeg)

- FFT, Windowing and Framing: spectrogram synthesis
- FP8 Quantization: 32-bit fixed-point to 8-bit floating-point conversion needed for input to FlexASR
- **Expectation-Maximization Algorithm:** outer loop to update the MRF distribution after Gibbs sampling
- Label Mask Convolution: clean speech extraction from MSSE binary label mask
- Other: instruction dispatch, IRQ handling, misc. application logic
- All tasks are vectorized on the A53 where applicable
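The framing, windowing and FFT steps can be sketched in a few lines of NumPy. Only the 1024-point FFT size comes from the slides; the 16 kHz sampling rate, the hop of 512 samples, and the Hann window are assumptions, and the function name is illustrative.

```python
import numpy as np

def spectrogram(signal, frame_len=1024, hop=512):
    """Frame the signal, apply a Hann window, and take a 1024-pt FFT magnitude."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * window, n=frame_len, axis=1))

fs = 16000                             # assumed sampling rate in Hz
t = np.arange(fs) / fs                 # one second of audio
tone = np.sin(2 * np.pi * 1000.0 * t)  # 1 kHz test tone
spec = spectrogram(tone)               # shape (frames, frame_len // 2 + 1)
```

With these settings each FFT bin spans 16000 / 1024 = 15.625 Hz, so the 1 kHz tone lands exactly on bin 64 in every frame.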

# **End-to-End ASR Energy**

- The speech-enhancing ASR pipeline is 7x more energy-efficient than the scaled-up DNN approach (C)
- ASR dominates energy consumption

![](_page_51_Figure_5.jpeg)

## **Platform Comparison**

![](_page_52_Figure_1.jpeg)

- The speech-enhancing pipeline achieves real-time performance, whereas commercial edge platforms fall short despite substantial energy expenditures

# **Per-Layer Platform Comparison**

![](_page_53_Figure_1.jpeg)

- FlexASR: 4x – 716x faster, 11x – 228x more energy efficient
- MSSE: 2x – 1577x faster, 2x – 1969x more energy efficient

#### **Accelerator Efficiencies**

![](_page_54_Figure_1.jpeg)

- As voltage sweeps from 0.5V to 1.0V
  - MSSE: 4.33 – 17.6 GSamples/s/W
  - FlexASR: 2.6 – 7.8 TFLOPs/W

#### **Benefits with Scaled Vdd**

![](_page_55_Figure_1.jpeg)

- As voltage sweeps from 0.5V to 1.0V
  - Latency: 45ms to 15ms
  - SoC Power: 19mW to 227mW


## **Comparison Table**

|                           | [4]                  | [5]                   | [6]                   | [7]                   | This work                          |
|---------------------------|----------------------|-----------------------|-----------------------|-----------------------|------------------------------------|
| Technology                | 65 nm                | 65 nm                 | 65 nm                 | 28 nm                 | 16 nm                              |
| Core Dimension            | 9.6 mm<sup>2</sup>   | 6.2 mm<sup>2</sup>    | 2.6 mm<sup>2</sup>    | 1.3 mm<sup>2</sup>    | 21.8 mm<sup>2</sup>                |
| Application               | ASR                  | KWS                   | KWS                   | ASR                   | Speech Denoising, ASR              |
| Algorithm                 | HMM                  | RNN                   | RNN                   | CNN                   | Bayesian MRF + Attention-based RNN |
| On-Chip Speech Denoising  | No                   | No                    | No                    | No                    | Yes (7.3 dB SDR)                   |
| Dataset (Vocabulary Size) | News 2 (145k words)  | Smart Home (11 words) | GSCD (30 words)       | TIMIT (6k words)      | LibriSpeech (200k words)           |
| Datatype                  | 4-12b FxP            | 1b FxP                | 4b/8b FxP             | 1b FxP                | Denoising: 32b FxP; ASR: 8b FP     |
| Total SRAM                | 730 KB               | 18 KB                 | 105 KB                | 52 KB                 | 9.8 MB                             |
| Supply Voltage            | 0.6 V – 1.2 V        | 0.9 V – 1.1 V         | 0.6 V – 1.2 V         | 0.57 V – 0.9 V        | 0.55 V – 1.0 V                     |
| Frequency                 | 3 – 86 MHz           | 5 – 75 MHz            | 250 kHz – 12.5 MHz    | 2.5 – 50 MHz          | 130 – 775 MHz                      |
| Latency per Frame         | 1                    | 0.127 ms              | 16 ms                 | 0.5 ms – 25 ms        | 15 – 45 ms                         |
| Power                     | 7.78 mW @ 0.9V/40MHz | 26 mW @ 0.9V/75MHz    | 18.3 µW @ 0.6V/250kHz | 1.42 mW @ 0.58V/20MHz | 111 mW @ 0.8V/Fmax                 |
- First work to demonstrate on-chip support for denoised, large-vocabulary, attention-based ASR with competitive latency


# Summary

- Attention-based bidirectional RNNs enable significant WER improvement by enforcing context understanding
- Noise-isolating ASR is essential for edge/IoT applications
- A 16nm SoC executing an end-to-end speech-enhancing ASR pipeline is developed, featuring:
  - A programmable accelerator for seq2seq bidirectional RNNs with attention
  - A Markov source separation engine accelerator for speech denoising
- Measurements on the test chip show:
  - 18ms end-to-end per-frame latency enabling real-time performance
  - 2.24mJ end-to-end per-frame energy
- The pipeline obviates the very inefficient strategy of scaling up DNN model size to achieve noise robustness


#### **Open-Source Releases**

- FlexASR HW architecture and simulator will be publicly released by the end of February 2021 at this GitHub repository: <u>https://github.com/harvard-acc/FlexASR</u>
- FlexASR leveraged several SystemC/C++ IPs from MatchLib: <u>https://github.com/NVlabs/matchlib</u>
- Development and verification of this test chip leveraged several hardware IPs and tools from the CHIPKIT framework: <u>https://github.com/whatmough/CHIPKIT</u>

## Acknowledgements

- This work is supported in part by JUMP ADA, DARPA CRAFT and DSSoC programs, NSF Awards 1704834 and 1718160, Intel Corp., and Arm Inc.
- We thank B. Khailany, R. Venkatesan, B. Keller, and Y. Shao (Nvidia); and U. Gupta, L. Pentecost, and V. Reddi (Harvard); and S. Garg (Mentor) for helpful discussions.

# **Thank You**