Publications
2020
Udit Gupta, Young Geun Kim, Sylvia Lee, Jordan Tse, Hsien-Hsin S. Lee, Gu-Yeon Wei, David Brooks, Carole-Jean Wu. Chasing Carbon: The Elusive Environmental Footprint of Computing. IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021. Forthcoming. https://arxiv.org/abs/2011.02839
Abstract: Given recent algorithm, software, and hardware innovation, computing has enabled a plethora of new applications. As computing becomes increasingly ubiquitous, however, so does its environmental impact. This paper brings the issue to the attention of computer-systems researchers. Our analysis, built on industry-reported characterization, quantifies the environmental effects of computing in terms of carbon emissions. Broadly, carbon emissions have two sources: operational energy consumption, and hardware manufacturing and infrastructure. Although carbon emissions from the former are decreasing thanks to algorithmic, software, and hardware innovations that boost performance and power efficiency, the overall carbon footprint of computer systems continues to grow. This work quantifies the carbon output of computer systems to show that most emissions related to modern mobile and data-center equipment come from hardware manufacturing and infrastructure. We therefore outline future directions for minimizing the environmental impact of computing systems.
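As an illustration of the operational-versus-embodied split named in the abstract, the sketch below shows the basic bookkeeping. Every number is a placeholder chosen only to demonstrate the arithmetic; none comes from the paper.

```python
# Lifetime carbon footprint split into the two sources the abstract names.
# All values below are illustrative placeholders, not results from the paper.
embodied_kgco2 = 60.0                      # hardware manufacturing + infrastructure
lifetime_years = 3
energy_per_year_kwh = 10.0                 # operational energy of the device
grid_intensity_kgco2_per_kwh = 0.4         # carbon intensity of the electricity supply

operational_kgco2 = lifetime_years * energy_per_year_kwh * grid_intensity_kgco2_per_kwh
total_kgco2 = embodied_kgco2 + operational_kgco2

print(f"operational: {operational_kgco2:.1f} kgCO2e ({100 * operational_kgco2 / total_kgco2:.0f}% of total)")
print(f"embodied:    {embodied_kgco2:.1f} kgCO2e ({100 * embodied_kgco2 / total_kgco2:.0f}% of total)")
```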
Samuel Hsia, Udit Gupta, Mark Wilkening, Carole-Jean Wu, Gu-Yeon Wei, David Brooks. Cross-Stack Workload Characterization of Deep Recommendation Systems. IEEE International Symposium on Workload Characterization (IISWC), 2020.
Glenn G. Ko, Yuji Chai, Marco Donato, Paul N. Whatmough, Thierry Tambe, Rob A. Rutenbar, David Brooks, Gu-Yeon Wei. A Scalable Bayesian Inference Accelerator for Unsupervised Learning. IEEE Hot Chips 31 Symposium, 2020.
Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander M. Rush, David Brooks, Gu-Yeon Wei. Algorithm-Hardware Co-Design of Adaptive Floating-Point Encodings for Resilient Deep Learning Inference. Design Automation Conference (DAC 2020), 2020. Best paper award. http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2020/07/B1743_030_5_1594671163.pdf
Abstract: Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low precision as their shrunken dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present an algorithm-hardware co-design centered around a novel floating-point inspired number format, AdaptivFloat, that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encodings of neural network parameters. AdaptivFloat consistently produces higher inference accuracies compared to block floating-point, uniform, IEEE-like float, or posit encodings at low bit precision (≤ 8-bit) across a diverse set of state-of-the-art neural networks exhibiting narrow to wide weight distributions. Notably, at 4-bit weight precision, only a 2.1-point degradation in BLEU score is observed on the AdaptivFloat-quantized Transformer network, compared to total accuracy loss when encoded in the above-mentioned prominent datatypes. Furthermore, experimental results on a deep neural network (DNN) processing element (PE) exploiting AdaptivFloat logic in its computational datapath demonstrate per-operation energy and area of 0.9× and 1.14×, respectively, that of an equivalent bit-width NVDLA-like integer-based PE.
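The core per-layer idea the abstract describes, choosing the exponent range of a low-bit float format from each tensor's own maximum magnitude before rounding, can be sketched in a few lines. This is a simplified illustration under assumed format parameters, not the authors' reference implementation.

```python
import numpy as np

def adaptivfloat_quantize(weights, n_bits=4, n_exp=2):
    """Simplified AdaptivFloat-style quantization (illustrative sketch only)."""
    n_mant = n_bits - 1 - n_exp                      # 1 sign bit, n_exp exponent bits
    w = np.asarray(weights, dtype=np.float64)
    max_abs = float(np.max(np.abs(w)))
    if max_abs == 0.0:
        return np.zeros_like(w)

    # Pick the per-tensor exponent range so the largest representable
    # magnitude, (2 - 2**-n_mant) * 2**exp_max, covers max|w|.
    exp_max = int(np.ceil(np.log2(max_abs / (2.0 - 2.0 ** -n_mant))))
    exp_min = exp_max - (2 ** n_exp) + 1
    largest = (2.0 - 2.0 ** -n_mant) * 2.0 ** exp_max
    smallest = 2.0 ** exp_min

    sign = np.sign(w)
    mag = np.clip(np.abs(w), 0.0, largest)
    exps = np.clip(np.floor(np.log2(np.maximum(mag, smallest))), exp_min, exp_max)
    mant = np.round(mag / 2.0 ** exps * 2 ** n_mant) / 2 ** n_mant
    q = sign * np.minimum(mant * 2.0 ** exps, largest)
    q[np.abs(w) < smallest / 2.0] = 0.0              # underflow to zero (simplified)
    return q
```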
Paul N. Whatmough, Marco Donato, Glenn G. Ko, Sae Kyu Lee, David Brooks, Gu-Yeon Wei. CHIPKIT: An agile, reusable open-source framework for rapid test chip development. IEEE Micro, 2020.
Udit Gupta, Samuel Hsia, Vikram Saraph, Xiaodong Wang, Brandon Reagen, Gu-Yeon Wei, Hsien-Hsin S. Lee, David Brooks, Carole-Jean Wu. DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference. The 47th IEEE/ACM International Symposium on Computer Architecture (ISCA 2020), 2020. http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2020/05/DeepRecSys_Gupta_ISCA2020.pdf
Abstract: Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting significant compute demand of the cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity savings. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.
Liu Ke, Udit Gupta, Carole-Jean Wu, Benjamin Youngjae Cho, Mark Hempstead, Brandon Reagen, Xuan Zhang, David Brooks, Vikas Chandra, Utku Diril, Amin Firoozshahian, Kim Hazelwood, Bill Jia, Hsien-Hsin S. Lee, Meng Li, Bert Maher, Dheevatsa Mudigere, Maxim Naumov, Martin Schatz, Mikhail Smelyanskiy, Xiaodong Wang. RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing. The 47th IEEE/ACM International Symposium on Computer Architecture (ISCA 2020), 2020. http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2020/05/ISCA_2020_Near_Memory_Acceleration__Ke_.pdf
Abstract: Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity-DRAM-compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator-, and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP, which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, resulting in up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.
Glenn G. Ko, Yuji Chai, Marco Donato, Paul N. Whatmough, Thierry Tambe, Rob A. Rutenbar, David Brooks, Gu-Yeon Wei. A 3mm² Programmable Bayesian Inference Accelerator for Unsupervised Machine Perception using Parallel Gibbs Sampling in 16nm. IEEE Symposium on VLSI Circuits (VLSI), 2020.
Yu Emma Wang, Gu-Yeon Wei, David Brooks. A Systematic Methodology for Analysis of Deep Learning Hardware and Software Platforms. Third Conference on Machine Learning and Systems (MLSys), 2020. https://proceedings.mlsys.org/papers/2020/12
Abstract: Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware and software specialization to improve performance. To systematically compare deep learning systems, we introduce a methodology comprised of a set of analysis techniques and parameterized end-to-end models for fully connected, convolutional, and recurrent neural networks. This methodology can be applied to analyze various hardware and software systems, and is intended to complement traditional methods. We demonstrate its utility by comparing two generations of specialized platforms (Google's Cloud TPU v2/v3), three heterogeneous platforms (Google TPU, Nvidia GPU, and Intel CPU), and specialized software stacks (TensorFlow and CUDA).
Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Bill Jia, Hsien-Hsin S. Lee, Andrey Malevich, Dheevatsa Mudigere, Mikhail Smelyanskiy, Liang Xiong, Xuan Zhang. The Architectural Implications of Facebook's DNN-based Personalized Recommendation. The 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2020. https://arxiv.org/abs/1906.03109 https://drive.google.com/file/d/1v5LbaizV4PK1WvroA5DMBtXwDVRONZ4j/view https://drive.google.com/file/d/1WJ32Cdv1qPLxVk2VQC_Dbmh30sUe5Cfe/view
Abstract: The widespread application of deep learning has changed the landscape of computation in the data center. In particular, personalized recommendation for content ranking is now largely accomplished by leveraging deep neural networks. However, despite the importance of these models and the amount of compute cycles they consume, relatively little research attention has been devoted to systems for recommendation. To facilitate research and to advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: inference latency varies by 60% across three Intel server generations, batching and co-location of inferences can drastically improve latency-bounded throughput, and the diverse composition of recommendation models leads to different optimization strategies.
Paul Whatmough, Marco Donato, Glenn Ko, David Brooks, Gu-Yeon Wei. CHIPKIT: An agile, reusable open-source framework for rapid test chip development. arXiv preprint, 2020. https://arxiv.org/abs/2001.04504
Abstract: The current trend for domain-specific architectures (DSAs) has led to renewed interest in research test chips to demonstrate new specialized hardware. Tape-outs also offer huge pedagogical value garnered from real hands-on exposure to the whole system stack. However, successful tape-outs demand hard-earned experience, and the design process is time-consuming and fraught with challenges. Therefore, custom chips have remained the preserve of a small number of research groups, typically focused on circuit design research. This paper describes the CHIPKIT framework. We describe a reusable SoC subsystem which provides basic IO, an on-chip programmable host, memory, and peripherals. This subsystem can be readily extended with new IP blocks to generate custom test chips. We also present an agile RTL development flow, including a code generation tool called VGEN. Finally, we outline best practices for full-chip validation across the entire design cycle.
2019
Lillian Pentecost, Udit Gupta, Elisa Ngan, Gu-Yeon Wei, David Brooks, Johanna Beyer, Michael Behrisch. CHAMPVis: Comparative Hierarchical Analysis of Microarchitectural Performance. ProTools workshop, co-located with Supercomputing, 2019. http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2019/12/CHAMPVis2019.pdf
Abstract: Performance analysis and optimization are essential tasks for hardware and software engineers. In the age of datacenter-scale computing, it is particularly important to conduct comparative performance analysis to understand discrepancies and limitations among different hardware systems and applications. However, there is a distinct lack of productive visualization tools for these comparisons. We present CHAMPVis [1], a web-based, interactive visualization tool that leverages the hierarchical organization of hardware systems to enable productive performance analysis. With CHAMPVis, users can make definitive performance comparisons across applications or hardware platforms. In addition, CHAMPVis provides methods to rank and cluster based on performance metrics to identify common optimization opportunities. Our thorough task analysis reveals three types of datacenter-scale performance analysis tasks: summarization, detailed comparative analysis, and interactive performance bottleneck identification. We propose techniques for each class of tasks, including (1) 1-D feature space projection for similarity analysis; (2) hierarchical parallel coordinates for comparative analysis; and (3) user interactions for rapid diagnostic queries to identify optimization targets. We evaluate CHAMPVis by analyzing standard datacenter applications and machine learning benchmarks in two different case studies.
Lillian Pentecost, Marco Donato, Brandon Reagen, Udit Gupta, Siming Ma, Gu-Yeon Wei, David Brooks. MaxNVM: Maximizing DNN Storage Density and Inference Efficiency with Sparse Encoding and Error Mitigation. IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019. ISBN: 978-1-4503-6938-1/19/10. http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2019/12/MaxNVM2019.pdf https://dl.acm.org/citation.cfm?id=3358258
Abstract: Deeply embedded applications require low-power, low-cost hardware that fits within stringent area constraints. Deep learning has many potential uses in these domains, but introduces significant inefficiencies stemming from off-chip DRAM accesses of model weights. Ideally, models would fit entirely on-chip. However, even with compression, memory requirements for state-of-the-art models make on-chip inference impractical. Due to increased density, emerging eNVMs are one promising solution. We present MaxNVM, a principled co-design of sparse encodings, protective logic, and fault-prone MLC eNVM technologies (i.e., RRAM and CTT) to enable highly-efficient DNN inference. We find bit reduction techniques (e.g., clustering and sparse compression) increase weight vulnerability to faults. This limits the capabilities of MLC eNVM. To circumvent this limitation, we improve storage density (i.e., bits-per-cell) with minimal overhead using protective logic. Tradeoffs between density and reliability result in a rich design space. We show that by balancing these techniques, the weights of large networks are able to reasonably fit on-chip. Compared to a naive, single-level-cell eNVM solution, our highly-optimized MLC memory systems reduce weight area by up to 29×. We compare our technique against NVDLA, a state-of-the-art industry-grade CNN accelerator, and demonstrate up to 3.2× reduced power and up to 3.5× reduced energy per ResNet50 inference.
Marco Donato, Lillian Pentecost, David Brooks, Gu-Yeon Wei. MEMTI: Optimizing On-Chip Nonvolatile Storage for Visual Multitask Inference at the Edge. IEEE Micro, 2019. https://ieeexplore.ieee.org/document/8859219
Abstract: The combination of specialized hardware and embedded nonvolatile memories (eNVM) holds promise for energy-efficient deep neural network (DNN) inference at the edge. However, integrating DNN hardware accelerators with eNVMs still presents several challenges. Multilevel programming is desirable for achieving maximal storage density on chip, but the stochastic nature of eNVM writes makes them prone to errors and further increases the write energy and latency. In this article, we present MEMTI, a memory architecture that leverages a multitask learning technique for maximal reuse of DNN parameters across multiple visual tasks. We show that by retraining and updating only 10% of all DNN parameters, we can achieve efficient model adaptation across a variety of visual inference tasks. The system performance is evaluated by integrating the memory with the open-source NVIDIA deep learning architecture.
Udit Gupta, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander M. Rush, Gu-Yeon Wei, David Brooks. MASR: A Modular Accelerator for Sparse RNNs. International Conference on Parallel Architectures and Compilation Techniques (PACT), 2019. http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2019/09/masr_pact.pdf
Abstract: Recurrent neural networks (RNNs) are becoming the de facto solution for speech recognition. RNNs exploit long-term temporal relationships in data by applying repeated, learned transformations. Unlike fully-connected (FC) layers with single vector matrix operations, RNN layers consist of hundreds of such operations chained over time. This poses challenges unique to RNNs that are not found in convolutional neural networks (CNNs) or FC models, namely large dynamic activation. In this paper we present MASR, a principled and modular architecture that accelerates bidirectional RNNs for on-chip ASR. MASR is designed to exploit sparsity in both dynamic activations and static weights. The architecture is enhanced by a series of dynamic activation optimizations that enable compact storage, ensure no energy is wasted computing null operations, and maintain high MAC utilization for highly parallel accelerator designs. In comparison to current state-of-the-art sparse neural network accelerators (e.g., EIE), MASR provides 2× area, 3× energy, and 1.6× performance benefits. The modular nature of MASR enables designs that efficiently scale from resource-constrained low-power IoT applications to large-scale, highly parallel datacenter deployments.
Glenn G. Ko, Yuji Chai, Rob A. Rutenbar, David Brooks, Gu-Yeon Wei. Accelerating Bayesian Inference on Structured Graphs Using Parallel Gibbs Sampling. International Conference on Field-Programmable Logic and Applications (FPL), 2019. http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2019/10/ko-fpl2019.pdf
Abstract: Bayesian models and inference are a class of machine learning that is useful for solving problems where the amount of data is scarce and prior knowledge about the application allows you to draw better conclusions. However, Bayesian models often require computing high-dimensional integrals, and finding the posterior distribution can be intractable. One of the most commonly used approximate methods for Bayesian inference is Gibbs sampling, a Markov chain Monte Carlo (MCMC) technique to estimate the target stationary distribution. The idea in Gibbs sampling is to generate posterior samples by iterating through each of the variables and sampling from its conditional given all the other variables fixed. While Gibbs sampling is a popular method for probabilistic graphical models such as Markov Random Fields (MRFs), the plain algorithm is slow as it goes through each of the variables sequentially. In this work, we describe a binary-label MRF Gibbs sampling inference architecture and extend it to a 64-label version capable of running multiple perceptual applications, such as sound source separation and stereo matching. The described accelerator employs chromatic scheduling of variables to parallelize all the conditionally independent variables across 257 samplers, implemented on the FPGA portion of a CPU-FPGA SoC. For a real-time streaming sound source separation task, we show the hybrid CPU-FPGA implementation is 230x faster than a commercial mobile processor, while maintaining a recommended latency under 50 ms. The binary-label and 64-label versions of the MRF Gibbs sampling inference achieve 137x and 679x speedups, respectively.
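The chromatic-scheduling idea in the abstract, coloring the graph so that same-color variables are conditionally independent and can be resampled in parallel, can be illustrated on a 2-D grid MRF, where a checkerboard 2-coloring suffices. This is a generic software sketch of the algorithm, not the accelerator's implementation.

```python
import numpy as np

def chromatic_gibbs_binary_mrf(unary, coupling=1.0, n_sweeps=100, rng=None):
    """Chromatic (checkerboard) Gibbs sampling for a binary-label grid MRF.

    unary: (H, W, 2) array of per-pixel energies for labels {0, 1}.
    coupling: Potts-style penalty for each disagreeing 4-connected neighbor.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W, _ = unary.shape
    labels = rng.integers(0, 2, size=(H, W))
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    for _ in range(n_sweeps):
        for color in (0, 1):                       # all same-color pixels are independent
            mask = ((ii + jj) % 2) == color
            padded = np.pad(labels, 1)
            nbr_ones = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                        + padded[1:-1, :-2] + padded[1:-1, 2:])
            ones = np.pad(np.ones_like(labels), 1)
            n_nbrs = (ones[:-2, 1:-1] + ones[2:, 1:-1]
                      + ones[1:-1, :-2] + ones[1:-1, 2:])
            # Conditional energies of labels 1 and 0 given the fixed neighbors.
            e1 = unary[..., 1] + coupling * (n_nbrs - nbr_ones)
            e0 = unary[..., 0] + coupling * nbr_ones
            p1 = 1.0 / (1.0 + np.exp(e1 - e0))     # Gibbs conditional P(label = 1)
            draw = rng.random((H, W)) < p1
            labels[mask] = draw[mask].astype(labels.dtype)
    return labels
```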
Brian Plancher, Camelia D. Brumar, Iulian Brumar, Lillian Pentecost, Saketh Rama, David Brooks. Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM. IEEE High Performance Extreme Computing Conference (HPEC), 2019. http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2019/12/AppliedApproxMM2019.pdf
Abstract: Computational efficiency is a critical constraint for a variety of cutting-edge real-time applications. In this work, we identify an opportunity to speed up the end-to-end runtime of two such compute-bound applications by incorporating approximate linear algebra techniques. Particularly, we apply approximate matrix multiplication to artificial Neural Networks (NNs) for image classification and to the robotics problem of Distributed Simultaneous Localization and Mapping (DSLAM). Expanding upon recent sampling-based Monte Carlo approximation strategies for matrix multiplication, we develop updated theoretical bounds and an adaptive error prediction strategy. We then apply these techniques in the context of NNs and DSLAM, increasing the speed of both applications by 15-20% while maintaining a 97% classification accuracy for NNs running on the MNIST dataset and keeping the average robot position error under 1 meter (vs 0.32 meters for the exact solution). However, both applications experience variance in their results. This suggests that Monte Carlo matrix multiplication may be an effective technique to reduce the memory and computational burden of certain algorithms when used carefully, but more research is needed before these techniques can be widely used in practice.
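The sampling-based Monte Carlo approximation the abstract builds on is the classic column/row sampling estimator: A @ B is approximated by sampling a few column/row pairs with probability proportional to their norms and rescaling so the estimator is unbiased. The sketch below is a generic textbook version, not the paper's adaptive variant.

```python
import numpy as np

def approx_matmul(A, B, c, rng=None):
    """Monte Carlo approximation of A @ B using c sampled column/row pairs."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample index i with probability proportional to ||A[:, i]|| * ||B[i, :]||.
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=c, p=probs)
    # Rescale each sampled outer product by 1 / (c * p_i) so the estimate is unbiased.
    scale = 1.0 / (c * probs[idx])
    return (A[:, idx] * scale) @ B[idx, :]

# Usage: relative Frobenius error shrinks as more column/row pairs are kept.
A = np.random.randn(256, 512)
B = np.random.randn(512, 128)
exact = A @ B
approx = approx_matmul(A, B, c=128)
print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```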
Sae Kyu Lee, Paul Whatmough, David Brooks, Gu-Yeon Wei. A 16-nm always-on DNN processor with adaptive clocking and multi-cycle banked SRAMs. IEEE Journal of Solid-State Circuits, 2019. https://ieeexplore.ieee.org/document/8715387
Paul N. Whatmough, Sae Kyu Lee, Marco Donato, Hsea-Ching Hsueh, Sam Likun Xi, Udit Gupta, Lillian Pentecost, Glenn G. Ko, David Brooks, Gu-Yeon Wei. A 16nm 25mm² SoC with a 54.5x Flexibility-Efficiency Range from Dual-Core Arm Cortex-A53 to eFPGA and Cache-Coherent Accelerators. Symposium on VLSI Circuits, 2019. https://ieeexplore.ieee.org/abstract/document/8778002/authors#authors
Abstract: This paper presents a 25mm² SoC in 16nm FinFET technology targeting flexible acceleration of compute-intensive kernels in DNN, DSP, and security algorithms. The SoC includes an always-on sub-system, a dual-core Arm A53 CPU cluster, an embedded FPGA array, and a quad-core cache-coherent accelerator cluster. Measurement results demonstrate the following observations: 1) moving DSP/cryptography kernels from the A53 to the eFPGA increases energy efficiency between 5.5× - 28.9×, 2) the use of cache coherency for datapath accelerators increases throughput by 2.94×, and 3) the accelerator flexibility-efficiency (GOPS/W) range spans from 3.1× (A53+SIMD), to 16.5× (eFPGA), to 54.5× (CCA) compared to the dual-core CPU baseline on comparable tasks. The energy per inference on the MobileNet-128 CNN shows a peak improvement of 47.6×.
Yu Emma Wang, Yuhao Zhu, Glenn G. Ko, Brandon Reagen, Gu-Yeon Wei, David Brooks. Demystifying Bayesian Inference Workloads. International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019. https://yuemmawang.github.io/publications/wang-ispass2019.pdf
Abstract: The recent surge of machine learning has motivated computer architects to focus intently on accelerating related workloads, especially in deep learning. Deep learning has been the pillar algorithm that has led the advancement of learning patterns from a vast amount of labeled data, or supervised learning. However, for unsupervised learning, Bayesian methods often work better than deep learning. Bayesian modeling and inference works well with unlabeled or limited data, can leverage informative priors, and has interpretable models. Despite being an important branch of machine learning, Bayesian inference generally has been overlooked by the architecture and systems communities. In this paper, we facilitate the study of Bayesian inference with the development of BayesSuite, a collection of seminal Bayesian inference workloads. We characterize the power and performance profiles of BayesSuite across a variety of current-generation processors and find significant diversity. Manually tuning and deploying Bayesian inference workloads requires deep understanding of the workload characteristics and hardware specifications. To address these challenges and provide high-performance, energy-efficient support for Bayesian inference, we introduce a scheduling and optimization mechanism that can be plugged into a system scheduler. We also propose a computation elision technique that further improves the performance and energy efficiency of the workloads by skipping computations that do not improve the quality of the inference. Our proposed techniques are able to increase Bayesian inference performance by 5.8× on average over the naive assignment and execution of the workloads.
Yu Emma Wang, Victor Lee, Gu-Yeon Wei, David Brooks. Predicting New Workload or CPU Performance by Analyzing Public Datasets. ACM Transactions on Architecture and Code Optimization (TACO), 15(4), pp. 53:1–53:21, 2019. https://yuemmawang.github.io/publications/wang-taco2019.pdf
Abstract: The marketplace for general-purpose microprocessors offers hundreds of functionally similar models, differing by traits like frequency, core count, cache size, memory bandwidth, and power consumption. Their performance depends not only on microarchitecture, but also on the nature of the workloads being executed. Given a set of intended workloads, the consumer needs both performance and price information to make rational buying decisions. Many benchmark suites have been developed to measure processor performance, and their results for large collections of CPUs are often publicly available. However, repositories of benchmark results are not always helpful when consumers need performance data for new processors or new workloads. Moreover, the aggregate scores for benchmark suites designed to cover a broad spectrum of workload types can be misleading. To address these problems, we have developed a deep neural network (DNN) model, and we have used it to learn the relationship between the specifications of Intel CPUs and their performance on the SPEC CPU2006 and Geekbench 3 benchmark suites. We show that we can generate useful predictions for new processors and new workloads. We also cross-predict the two benchmark suites and compare their performance scores. The results quantify the self-similarity of these suites for the first time in the literature. This work should discourage consumers from basing purchasing decisions exclusively on Geekbench 3, and it should encourage academics to evaluate research using more diverse workloads than the SPEC CPU suites alone.
2018
Sae Kyu Lee, Paul N. Whatmough, Niamh Mulholland, Patrick Hansen, David Brooks, Gu-Yeon Wei. A wide dynamic range sparse FC-DNN processor with multi-cycle banked SRAM read and adaptive clocking in 16nm FinFET. ESSCIRC 2018 - IEEE 44th European Solid State Circuits Conference, 2018. https://ieeexplore.ieee.org/abstract/document/8494245
Paul N. Whatmough, Sae Kyu Lee, David Brooks, Gu-Yeon Wei. DNN ENGINE: A 28-nm Timing-Error Tolerant Sparse Deep Neural Network Processor for IoT Applications. IEEE Journal of Solid-State Circuits (JSSC), 2018.
Paul Whatmough, Sae Kyu Lee, Sam Xi, Udit Gupta, Lillian Pentecost, Marco Donato, Hsea-Ching Hsueh, David Brooks, Gu-Yeon Wei. SMIV: A 16nm SoC with Efficient and Flexible DNN Acceleration for Intelligent IoT Devices. Hot Chips 30: A Symposium on High Performance Chips, 2018.
Brandon Reagen, Udit Gupta, Lillian Pentecost, Paul Whatmough, Sae Kyu Lee, Niamh Mulholland, David Brooks, Gu-Yeon Wei Ares: a framework for quantifying the resilience of deep neural networks Conference Design Automation Conference, 2018. @conference{Reagen2018, title = {Ares: a framework for quantifying the resilience of deep neural networks}, author = {Brandon Reagen, Udit Gupta, Lillian Pentecost, Paul Whatmough, Sae Kyu Lee, Niamh Mulholland, David Brooks, Gu-Yeon Wei}, url = {https://dl.acm.org/citation.cfm?id=3195997}, year = {2018}, date = {2018-06-25}, booktitle = {Design Automation Conference}, abstract = {As the use of deep neural networks continues to grow, so does the fraction of compute cycles devoted to their execution. This has led the CAD and architecture communities to devote considerable attention to building DNN hardware. Despite these efforts, the fault tolerance of DNNs has generally been overlooked. This paper is the first to conduct a large-scale, empirical study of DNN resilience. Motivated by the inherent algorithmic resilience of DNNs, we are interested in understanding the relationship between fault rate and model accuracy. To do so, we present Ares: a light-weight, DNN-specific fault injection framework validated within 12% of real hardware. We find that DNN fault tolerance varies by orders of magnitude with respect to model, layer type, and structure. }, keywords = {}, pubstate = {published}, tppubtype = {conference} } As the use of deep neural networks continues to grow, so does the fraction of compute cycles devoted to their execution. This has led the CAD and architecture communities to devote considerable attention to building DNN hardware. Despite these efforts, the fault tolerance of DNNs has generally been overlooked. This paper is the first to conduct a large-scale, empirical study of DNN resilience. Motivated by the inherent algorithmic resilience of DNNs, we are interested in understanding the relationship between fault rate and model accuracy. To do so, we present Ares: a light-weight, DNN-specific fault injection framework validated within 12% of real hardware. We find that DNN fault tolerance varies by orders of magnitude with respect to model, layer type, and structure. |
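A rough sketch of the kind of experiment a fault-injection study automates is shown below: it flips random bits in the float32 weights of a toy linear classifier and reports how accuracy degrades as the fault rate grows. The model, data, and uniform bit-flip fault model are simplifying assumptions for illustration and are not the Ares framework itself.

```python
# Toy bit-flip fault injection into float32 weights; not the Ares framework,
# just an illustration of sweeping fault rate vs. model accuracy.
import numpy as np

rng = np.random.default_rng(1)

# Tiny synthetic 2-class problem and a "trained" linear model (closed form).
X = rng.normal(size=(1000, 16))
w_true = rng.normal(size=16)
y = (X @ w_true > 0).astype(np.float32)
w = np.linalg.lstsq(X, 2 * y - 1, rcond=None)[0].astype(np.float32)

def accuracy(weights):
    return float((((X @ weights) > 0) == y).mean())

def inject_bit_flips(weights, fault_rate, rng):
    """Flip each bit of the float32 weight array independently with prob. fault_rate."""
    bits = weights.view(np.uint32).copy()
    for b in range(32):
        mask = rng.random(bits.shape) < fault_rate
        bits[mask] ^= np.uint32(1 << b)
    return bits.view(np.float32)

for rate in [0.0, 1e-6, 1e-4, 1e-2]:
    accs = [accuracy(inject_bit_flips(w, rate, rng)) for _ in range(20)]
    print(f"fault rate {rate:g}: mean accuracy {np.mean(accs):.3f}")
```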
Marco Donato; Brandon Reagen; Lillian Pentecost; Udit Gupta; David Brooks, Gu-Yeon Wei On-Chip Deep Neural Network Storage with Multi-Level eNVM Inproceedings Design Automation Conference (DAC), 2018. @inproceedings{Donato2018, title = {On-Chip Deep Neural Network Storage with Multi-Level eNVM}, author = {Marco Donato and Brandon Reagen and Lillian Pentecost and Udit Gupta and David Brooks, Gu-Yeon Wei}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2018/04/dac2018-envm.pdf}, year = {2018}, date = {2018-06-24}, booktitle = {Design Automation Conference (DAC)}, abstract = {One of the biggest performance bottlenecks of today’s neural network (NN) accelerators is off-chip memory accesses. In this paper, we propose a method to use multi-level, embedded non-volatile memory (eNVM) to eliminate all off-chip weight accesses. The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm2, well within the area allocated to SRAM in modern NN accelerators.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } One of the biggest performance bottlenecks of today’s neural network (NN) accelerators is off-chip memory accesses. In this paper, we propose a method to use multi-level, embedded non-volatile memory (eNVM) to eliminate all off-chip weight accesses. The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm2, well within the area allocated to SRAM in modern NN accelerators. |
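The storage savings described above come from pruning and clustering the weights so that a few bits per cell suffice; a software-side analogue of that step is sketched below. The pruning threshold, bit width, and use of k-means are illustrative assumptions and do not capture the eNVM fault co-design that is the paper's core contribution.

```python
# Toy pruning + clustering of a weight matrix: prune small weights, then
# quantize the survivors to 2**BITS shared values (as a multi-level cell
# might store). Illustrative only; not the paper's eNVM co-design.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
W = rng.normal(0, 0.05, size=(256, 256)).astype(np.float32)

PRUNE_THRESHOLD = 0.02   # assumption: magnitude-based pruning
BITS = 3                 # assumption: 3 bits per stored weight -> 8 levels

mask = np.abs(W) > PRUNE_THRESHOLD
survivors = W[mask].reshape(-1, 1)
kmeans = KMeans(n_clusters=2**BITS, n_init=10, random_state=0).fit(survivors)

W_quant = np.zeros_like(W)
W_quant[mask] = kmeans.cluster_centers_[kmeans.labels_, 0]

dense_bits = W.size * 32
sparse_bits = int(mask.sum()) * BITS  # ignores index/metadata overhead
print(f"kept {mask.mean():.1%} of weights")
print(f"storage: {dense_bits/8/1024:.0f} KiB dense fp32 -> ~{sparse_bits/8/1024:.0f} KiB quantized (excl. indices)")
print(f"mean abs reconstruction error: {np.abs(W - W_quant).mean():.5f}")
```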
Brandon Reagen, Udit Gupta, Robert Adolf, Michael M. Mitzenmacher, Alexander M. Rush, Gu-Yeon Wei, David Brooks Weightless: Lossy Weight Encoding For Deep Neural Network Compression Conference International Conference on Machine Learning, 2018. @conference{Reagen2017b, title = {Weightless: Lossy Weight Encoding For Deep Neural Network Compression}, author = {Brandon Reagen, Udit Gupta, Robert Adolf, Michael M. Mitzenmacher, Alexander M. Rush, Gu-Yeon Wei, David Brooks }, url = {https://arxiv.org/abs/1711.04686}, year = {2018}, date = {2018-05-01}, booktitle = {International Conference on Machine Learning}, abstract = {The large memory requirements of deep neural networks limit their deployment and adoption on many devices. Model compression methods effectively reduce the memory requirements of these models, usually through applying transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. Leveraging the ability of neural networks to tolerate these imperfections and by re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496x with the same model accuracy. This results in up to a 1.51x improvement over the state-of-the-art.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } The large memory requirements of deep neural networks limit their deployment and adoption on many devices. Model compression methods effectively reduce the memory requirements of these models, usually through applying transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. Leveraging the ability of neural networks to tolerate these imperfections and by re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496x with the same model accuracy. This results in up to a 1.51x improvement over the state-of-the-art. |
Mario Lok; Elizabeth Farrell Helbling; Xuan Zhang; Robert Wood; David Brooks; Gu-Yeon Wei A Low Mass Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots Journal Article IEEE Transactions on Power Electronics, 33 (4), pp. 3180 - 3191, 2018. @article{Lok2018, title = {A Low Mass Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots}, author = {Mario Lok and Elizabeth Farrell Helbling and Xuan Zhang and Robert Wood and David Brooks and Gu-Yeon Wei}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2018/04/lok-tpe-2018.pdf}, year = {2018}, date = {2018-04-01}, journal = {IEEE Transactions on Power Electronics}, volume = {33}, number = {4}, pages = {3180 - 3191}, abstract = {This paper presents a power electronics design for the piezoelectric actuators of an insect-scale flapping-wing robot, the RoboBee. The proposed design outputs four high-voltage drive signals tailored for the two bimorph actuators of the RoboBee in an alternating drive configuration. It utilizes fully integrated drive stage circuits with a novel highside gate driver to save chip area and meet the strict mass constraint of the RoboBee. Compared with previous integrated designs, it also boosts efficiency in delivering energy to the actuators and recovering unused energy by applying three power saving techniques, dynamic common mode adjustment, envelope tracking, and charge sharing. Using this design to energize four 15 nF capacitor loads with a 200 V and 100 Hz drive signal and tracking the control commands recorded from an actual flight experiment for the robot, we measure an average power consumption of 290 mW.}, keywords = {}, pubstate = {published}, tppubtype = {article} } This paper presents a power electronics design for the piezoelectric actuators of an insect-scale flapping-wing robot, the RoboBee. The proposed design outputs four high-voltage drive signals tailored for the two bimorph actuators of the RoboBee in an alternating drive configuration. It utilizes fully integrated drive stage circuits with a novel highside gate driver to save chip area and meet the strict mass constraint of the RoboBee. Compared with previous integrated designs, it also boosts efficiency in delivering energy to the actuators and recovering unused energy by applying three power saving techniques, dynamic common mode adjustment, envelope tracking, and charge sharing. Using this design to energize four 15 nF capacitor loads with a 200 V and 100 Hz drive signal and tracking the control commands recorded from an actual flight experiment for the robot, we measure an average power consumption of 290 mW. |
2017 |
Sreela Kodali; Patrick Hansen; Niamh Mulholland; Paul Whatmough; David Brooks; Gu-Yeon Wei Applications of Deep Neural Networks for Ultra Low Power IoT Inproceedings International Conference on Computer Design, 2017. @inproceedings{Kodali2017, title = {Applications of Deep Neural Networks for Ultra Low Power IoT}, author = {Sreela Kodali and Patrick Hansen and Niamh Mulholland and Paul Whatmough and David Brooks and Gu-Yeon Wei}, year = {2017}, date = {2017-11-05}, booktitle = {International Conference on Computer Design}, abstract = {IoT devices are increasing in prevalence and popularity, becoming an indispensable part of daily life. Despite the stringent energy and computational constraints of IoT systems, specialized hardware can enable energy-efficient sensor-data classification in an increasingly diverse range of IoT applications. This paper demonstrates seven different IoT applications using a fully-connected deep neural network (FC-NN) accelerator on 28nm CMOS. The applications include audio keyword spotting, face recognition, and human activity recognition. For each application, a FC-NN model was trained from a preprocessed dataset and mapped to the accelerator. Experimental results indicate the models retained their state-of-the-art accuracy on the accelerator across a broad range of frequencies and voltages. Real-time energy results for the applications were found to be on the order of 100nJ per inference or lower.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } IoT devices are increasing in prevalence and popularity, becoming an indispensable part of daily life. Despite the stringent energy and computational constraints of IoT systems, specialized hardware can enable energy-efficient sensor-data classification in an increasingly diverse range of IoT applications. This paper demonstrates seven different IoT applications using a fully-connected deep neural network (FC-NN) accelerator on 28nm CMOS. The applications include audio keyword spotting, face recognition, and human activity recognition. For each application, a FC-NN model was trained from a preprocessed dataset and mapped to the accelerator. Experimental results indicate the models retained their state-of-the-art accuracy on the accelerator across a broad range of frequencies and voltages. Real-time energy results for the applications were found to be on the order of 100nJ per inference or lower. |
Paul Whatmough; Sae Kyu Lee; Gu-Yeon Wei; David Brooks Sub-uJ Deep Neural Networks for Embedded Applications Inproceedings IEEE 51st Asilomar Conference on Signals, Systems, and Computers, 2017. @inproceedings{Whatmough2017b, title = {Sub-uJ Deep Neural Networks for Embedded Applications}, author = {Paul Whatmough and Sae Kyu Lee and Gu-Yeon Wei and David Brooks}, year = {2017}, date = {2017-10-01}, booktitle = {IEEE 51st Asilomar Conference on Signals, Systems, and Computers}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Paul Whatmough; Saekyu Lee; Niamh Mulholland; Patrick Hansen; Sreela Kodali; David Brooks DNN ENGINE: A 16nm Sub-uJ Deep Neural Network Inference Accelerator for the Embedded Masses Inproceedings Hot Chips 29: A Symposium on High Performance Chips, 2017. @inproceedings{Whatmough2017b, title = {DNN ENGINE: A 16nm Sub-uJ Deep Neural Network Inference Accelerator for the Embedded Masses}, author = {Paul Whatmough and Saekyu Lee and Niamh Mulholland and Patrick Hansen and Sreela Kodali and David Brooks}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2018/04/HC29.22.711-DNN-Engine-Whatmough-ARM-0.6_clean.pdf}, year = {2017}, date = {2017-08-22}, booktitle = {Hot Chips 29: A Symposium on High Performance Chips}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, David Brooks Deep Learning for Computer Architects Book Morgan & Claypool Publishers, 2017. @book{Reagen2017b, title = {Deep Learning for Computer Architects}, author = {Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, David Brooks}, url = {http://www.morganclaypool.com/doi/abs/10.2200/S00783ED1V01Y201706CAC041}, year = {2017}, date = {2017-08-01}, publisher = {Morgan & Claypool Publishers}, series = {Synthesis Lectures on Computer Architecture}, abstract = {Machine learning, and specifically deep learning, has been hugely disruptive in many fields of computer science. The success of deep learning techniques in solving notoriously difficult classification and regression problems has resulted in their rapid adoption in solving real-world problems. The emergence of deep learning is widely attributed to a virtuous cycle whereby fundamental advancements in training deeper models were enabled by the availability of massive datasets and high-performance computer hardware. This text serves as a primer for computer architects in a new and rapidly evolving field. We review how machine learning has evolved since its inception in the 1960s and track the key developments leading up to the emergence of the powerful deep learning techniques that emerged in the last decade. Next we review representative workloads, including the most commonly used datasets and seminal networks across a variety of domains. In addition to discussing the workloads themselves, we also detail the most popular deep learning tools and show how aspiring practitioners can use the tools with the workloads to characterize and optimize DNNs. The remainder of the book is dedicated to the design and optimization of hardware and architectures for machine learning. As high-performance hardware was so instrumental in the success of machine learning becoming a practical solution, this chapter recounts a variety of optimizations proposed recently to further improve future designs. Finally, we present a review of recent research published in the area as well as a taxonomy to help readers understand how various contributions fall in context.}, keywords = {}, pubstate = {published}, tppubtype = {book} } Machine learning, and specifically deep learning, has been hugely disruptive in many fields of computer science. The success of deep learning techniques in solving notoriously difficult classification and regression problems has resulted in their rapid adoption in solving real-world problems. The emergence of deep learning is widely attributed to a virtuous cycle whereby fundamental advancements in training deeper models were enabled by the availability of massive datasets and high-performance computer hardware. This text serves as a primer for computer architects in a new and rapidly evolving field. We review how machine learning has evolved since its inception in the 1960s and track the key developments leading up to the emergence of the powerful deep learning techniques that emerged in the last decade. Next we review representative workloads, including the most commonly used datasets and seminal networks across a variety of domains. In addition to discussing the workloads themselves, we also detail the most popular deep learning tools and show how aspiring practitioners can use the tools with the workloads to characterize and optimize DNNs. The remainder of the book is dedicated to the design and optimization of hardware and architectures for machine learning. 
As high-performance hardware was so instrumental in the success of machine learning becoming a practical solution, this chapter recounts a variety of optimizations proposed recently to further improve future designs. Finally, we present a review of recent research published in the area as well as a taxonomy to help readers understand how various contributions fall in context. |
Brandon Reagen; Jose Miguel Hernandez-Lobato; Robert Adolf; Michael Gelbart; Paul Whatmough; Gu-Yeon Wei; David Brooks A Case for Efficient Accelerator Design Space Exploration via Bayesian Optimization Conference International Symposium on Low Power Electronics and Design, 2017. @conference{Reagen2017, title = {A Case for Efficient Accelerator Design Space Exploration via Bayesian Optimization}, author = {Brandon Reagen and Jose Miguel Hernandez-Lobato and Robert Adolf and Michael Gelbart and Paul Whatmough and Gu-Yeon Wei and David Brooks}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2017/07/reagen_islped2017.pdf}, year = {2017}, date = {2017-07-24}, booktitle = {International Symposium on Low Power Electronics and Design}, abstract = {In this paper we propose using machine learning to improve the design of deep neural network hardware accelerators. We show how to adapt multi-objective Bayesian optimization to overcome a challenging design problem: optimizing deep neural network hardware accelerators for both accuracy and energy efficiency. DNN accelerators exhibit all aspects of a challenging optimization space: the landscape is rough, evaluating designs is expensive, the objectives compete with each other, and both design spaces (algorithmic and microarchitectural) are unwieldy. With multi-objective Bayesian optimization, the design space exploration is made tractable and the design points found vastly outperform traditional methods across all metrics of interest.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } In this paper we propose using machine learning to improve the design of deep neural network hardware accelerators. We show how to adapt multi-objective Bayesian optimization to overcome a challenging design problem: optimizing deep neural network hardware accelerators for both accuracy and energy efficiency. DNN accelerators exhibit all aspects of a challenging optimization space: the landscape is rough, evaluating designs is expensive, the objectives compete with each other, and both design spaces (algorithmic and microarchitectural) are unwieldy. With multi-objective Bayesian optimization, the design space exploration is made tractable and the design points found vastly outperform traditional methods across all metrics of interest. |
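A minimal sketch of the multi-objective bookkeeping behind such a design-space exploration appears below: it samples hypothetical accelerator configurations, scores them with an invented (error, energy) cost model, and keeps the Pareto-optimal points. Random sampling stands in for the paper's Bayesian optimizer, and every number in the cost model is an assumption made for illustration.

```python
# Toy multi-objective design-space exploration: random sampling plus Pareto
# filtering. The paper uses multi-objective Bayesian optimization and real
# accelerator/DNN evaluations; the cost model here is a made-up stand-in.
import itertools
import random

random.seed(0)

def evaluate(cfg):
    """Hypothetical (prediction_error, energy_per_inference) model of a design point."""
    bits, lanes, sram_kb = cfg
    error = 0.05 + 0.5 / bits + 0.02 * (64 / sram_kb)           # less precision / SRAM -> more error
    energy = 0.1 * bits * lanes + 0.01 * sram_kb + 1.0 / lanes  # wider / bigger -> more energy
    return error, energy

def dominates(a, b):
    """True if point a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

space = list(itertools.product([4, 8, 16], [1, 2, 4, 8], [16, 32, 64, 128]))
samples = [(cfg, evaluate(cfg)) for cfg in random.sample(space, 20)]

pareto = [(cfg, obj) for cfg, obj in samples
          if not any(dominates(other, obj) for _, other in samples if other != obj)]

for cfg, (err, energy) in sorted(pareto, key=lambda p: p[1][0]):
    print(f"bits={cfg[0]:>2} lanes={cfg[1]} sram={cfg[2]:>3}KB  error={err:.3f}  energy={energy:.2f}")
```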
Xuan Zhang; Mario Lok; Tao Tong; Sae Kyu Lee; Brandon Reagen; Pierre-Emile J. Duhamel; Robert Wood; David Brooks; Gu-Yeon Wei A Fully Integrated Battery-Powered System-on-Chip in 40-nm CMOS for Closed-Loop Control of Insect-Scale Pico-Aerial Vehicle Journal Article IEEE Journal of Solid-State Circuits, 52 (9), 2017. @article{Zhang2017, title = {A Fully Integrated Battery-Powered System-on-Chip in 40-nm CMOS for Closed-Loop Control of Insect-Scale Pico-Aerial Vehicle}, author = {Xuan Zhang and Mario Lok and Tao Tong and Sae Kyu Lee and Brandon Reagen and Pierre-Emile J. Duhamel and Robert Wood and David Brooks and Gu-Yeon Wei}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2018/04/robobee-jssc.pdf}, year = {2017}, date = {2017-06-12}, journal = {IEEE Journal of Solid-State Circuits}, volume = {52}, number = {9}, abstract = {We demonstrate a fully integrated system-on-chip (SoC) optimized for insect-scale flapping-wing pico-aerial vehicles. The SoC is able to meet the stringent weight, power, and real-time performance demands of autonomous flight for a bee-sized robot. The entire integrated system with embedded voltage regulation, data conversion, clock generation, as well as both general-purpose and accelerated computing units, weighs less than 3 mg after die thinning. It is self-contained and can be powered directly off of a lithium battery. Measured results show open-loop wing flapping controlled by the SoC and improved energy efficiency through the use of hardware acceleration and supply resilience through the use of adaptive clocking.}, keywords = {}, pubstate = {published}, tppubtype = {article} } We demonstrate a fully integrated system-on-chip (SoC) optimized for insect-scale flapping-wing pico-aerial vehicles. The SoC is able to meet the stringent weight, power, and real-time performance demands of autonomous flight for a bee-sized robot. The entire integrated system with embedded voltage regulation, data conversion, clock generation, as well as both general-purpose and accelerated computing units, weighs less than 3 mg after die thinning. It is self-contained and can be powered directly off of a lithium battery. Measured results show open-loop wing flapping controlled by the SoC and improved energy efficiency through the use of hardware acceleration and supply resilience through the use of adaptive clocking. |
Svilen Kanev; Sam (Likun) Xi; Gu-Yeon Wei; David Brooks Mallacc: Accelerating Memory Allocation Conference International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017. @conference{Kanev2017, title = {Mallacc: Accelerating Memory Allocation}, author = {Svilen Kanev and Sam (Likun) Xi and Gu-Yeon Wei and David Brooks}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2017/02/asplos17mallacc.pdf}, year = {2017}, date = {2017-04-08}, booktitle = {International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS)}, abstract = {Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc is quite different from traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput and area overheads must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 μm² of silicon area, less than 0.006% of a typical high-performance processor core.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc is quite different from traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput and area overheads must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 μm² of silicon area, less than 0.006% of a typical high-performance processor core. |
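The first of the three accelerated operations, size-class computation, is easy to sketch in software. The class table below is invented for illustration; real allocators such as tcmalloc define their own classes, and Mallacc performs the equivalent lookup in dedicated hardware.

```python
# Toy size-class computation as a malloc front end would do it: round a
# request up to the smallest size class that fits. Class table is invented.
import bisect

# Hypothetical size classes in bytes (real allocators define their own).
SIZE_CLASSES = [8, 16, 32, 48, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096]

def size_class(request_bytes):
    """Return (class_index, class_size) for a small allocation request."""
    i = bisect.bisect_left(SIZE_CLASSES, request_bytes)
    if i == len(SIZE_CLASSES):
        raise ValueError("large allocation: falls through to the page allocator")
    return i, SIZE_CLASSES[i]

for req in [1, 24, 100, 4000]:
    idx, size = size_class(req)
    print(f"request {req:>4} B -> class {idx} ({size} B, {size - req} B internal fragmentation)")
```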
Sae Kyu Lee; Tao Tong; Xuan Zhang; David Brooks; Gu-Yeon Wei A 16-Core Voltage-Stacked System With Adaptive Clocking and an Integrated Switched-Capacitor DC–DC Converter Journal Article IEEE Transactions on VLSI, 25 (4), pp. 1271-1284, 2017. @article{Lee2017, title = {A 16-Core Voltage-Stacked System With Adaptive Clocking and an Integrated Switched-Capacitor DC–DC Converter}, author = {Sae Kyu Lee and Tao Tong and Xuan Zhang and David Brooks and Gu-Yeon Wei}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2017/06/sklee_tvlsi2017_final_print.pdf}, year = {2017}, date = {2017-04-01}, journal = {IEEE Transactions on VLSI}, volume = {25}, number = {4}, pages = {1271-1284}, abstract = {This paper presents a 16-core voltage-stacked system with adaptive frequency clocking (AFClk) and a fully integrated voltage regulator that demonstrates efficient on-chip power delivery for multicore systems. Voltage stacking alleviates power delivery inefficiencies due to off-chip parasitics but adds complexity to combat internal voltage noise. To address the corresponding issue of internal voltage noise, the system utilizes an AFClk scheme with an efficient switched-capacitor dc-dc converter to mitigate noise on the stack layers and to improve system performance and efficiency. Experimental results demonstrate robust voltage noise mitigation as well as the potential of voltage stacking as a highly efficient power delivery scheme. This paper also illustrates that augmenting the hardware techniques with intelligent workload allocation that exploits the inherent properties of voltage stacking can preemptively reduce the interlayer activity mismatch and improve system efficiency.}, keywords = {}, pubstate = {published}, tppubtype = {article} } This paper presents a 16-core voltage-stacked system with adaptive frequency clocking (AFClk) and a fully integrated voltage regulator that demonstrates efficient on-chip power delivery for multicore systems. Voltage stacking alleviates power delivery inefficiencies due to off-chip parasitics but adds complexity to combat internal voltage noise. To address the corresponding issue of internal voltage noise, the system utilizes an AFClk scheme with an efficient switched-capacitor dc-dc converter to mitigate noise on the stack layers and to improve system performance and efficiency. Experimental results demonstrate robust voltage noise mitigation as well as the potential of voltage stacking as a highly efficient power delivery scheme. This paper also illustrates that augmenting the hardware techniques with intelligent workload allocation that exploits the inherent properties of voltage stacking can preemptively reduce the interlayer activity mismatch and improve system efficiency. |
Paul N. Whatmough; Sae Kyu Lee; Hyunkwang Lee; Saketh Rama; David Brooks; Gu-Yeon Wei A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications Inproceedings International Solid-State Circuits Conference, 2017. @inproceedings{Whatmough2017, title = {A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications}, author = {Paul N. Whatmough and Sae Kyu Lee and Hyunkwang Lee and Saketh Rama and David Brooks and Gu-Yeon Wei}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2017/02/whatmough_isscc2017.pdf http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2017/02/whatmough_isscc2017_slides.pdf}, year = {2017}, date = {2017-02-05}, booktitle = {International Solid-State Circuits Conference}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
2016 |
Yakun Sophia Shao; Sam (Likun) Xi; Vijayalakshmi Srinivasan; Gu-Yeon Wei; David Brooks Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin Inproceedings International Symposium on Microarchitecture (MICRO), 2016. @inproceedings{Shao2016, title = {Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin}, author = {Yakun Sophia Shao and Sam (Likun) Xi and Vijayalakshmi Srinivasan and Gu-Yeon Wei and David Brooks}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2016/08/shao_micro2016.pdf}, year = {2016}, date = {2016-10-17}, booktitle = {International Symposium on Microarchitecture (MICRO)}, abstract = {Increasing demand for power-efficient, high- performance computing has spurred a growing number and diversity of hardware accelerators in mobile and server Systems on Chip (SoCs). This paper makes the case that the co-design of the accelerator microarchitecture with the system in which it belongs is critical to balanced, efficient accelerator microarchitectures. We find that data movement and coherence management for accelerators are significant yet often unaccounted components of total accelerator runtime, resulting in misleading performance predictions and inefficient accelerator designs. To explore the design space of accelerator-system co-design, we develop gem5-Aladdin, an SoC simulator that captures dynamic interactions between accelerators and the SoC platform, and validate it to within 6% against real hardware. Our co-design studies show that the optimal energy-delay-product (EDP) of an accelerator microarchitecture can improve by up to 7.4x when system-level effects are considered compared to optimizing accelerators in isolation.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } Increasing demand for power-efficient, high- performance computing has spurred a growing number and diversity of hardware accelerators in mobile and server Systems on Chip (SoCs). This paper makes the case that the co-design of the accelerator microarchitecture with the system in which it belongs is critical to balanced, efficient accelerator microarchitectures. We find that data movement and coherence management for accelerators are significant yet often unaccounted components of total accelerator runtime, resulting in misleading performance predictions and inefficient accelerator designs. To explore the design space of accelerator-system co-design, we develop gem5-Aladdin, an SoC simulator that captures dynamic interactions between accelerators and the SoC platform, and validate it to within 6% against real hardware. Our co-design studies show that the optimal energy-delay-product (EDP) of an accelerator microarchitecture can improve by up to 7.4x when system-level effects are considered compared to optimizing accelerators in isolation. |
Robert Adolf; Saketh Rama; Brandon Reagen; Gu-Yeon Wei ; David Brooks Fathom: Reference Workloads for Modern Deep Learning Methods Inproceedings IEEE International Symposium on Workload Characterization, 2016. @inproceedings{Adolf2016, title = {Fathom: Reference Workloads for Modern Deep Learning Methods}, author = {Robert Adolf and Saketh Rama and Brandon Reagen and Gu-Yeon Wei and David Brooks}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2016/08/iiswc2016-final.pdf http://arxiv.org/abs/1608.06581 https://rdadolf.github.io/fathom/}, year = {2016}, date = {2016-09-25}, booktitle = {IEEE International Symposium on Workload Characterization}, abstract = {Deep learning has been popularized by its recent successes on challenging artificial intelligence problems. One of the reasons for its dominance is also an ongoing challenge: the need for immense amounts of computational power. Hardware architects have responded by proposing a wide array of promising ideas, but to date, the majority of the work has focused on specific algorithms in somewhat narrow application domains. While their specificity does not diminish these approaches, there is a clear need for more flexible solutions. We believe the first step is to examine the characteristics of cutting edge models from across the deep learning community. Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook’s AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } Deep learning has been popularized by its recent successes on challenging artificial intelligence problems. One of the reasons for its dominance is also an ongoing challenge: the need for immense amounts of computational power. Hardware architects have responded by proposing a wide array of promising ideas, but to date, the majority of the work has focused on specific algorithms in somewhat narrow application domains. While their specificity does not diminish these approaches, there is a clear need for more flexible solutions. We believe the first step is to examine the characteristics of cutting edge models from across the deep learning community. Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook’s AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. 
We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling. |
Tao Tong; Sae Kyu Lee; Xuan Zhang; David Brooks; Gu-Yeon Wei A Fully Integrated Reconfigurable Switched-Capacitor DC-DC Converter With Four Stacked Output Channels for Voltage Stacking Applications Journal Article IEEE Journal of Solid-State Circuits, 51 (9), pp. 2142–2152, 2016. @article{Tong2016, title = {A Fully Integrated Reconfigurable Switched-Capacitor DC-DC Converter With Four Stacked Output Channels for Voltage Stacking Applications}, author = {Tao Tong and Sae Kyu Lee and Xuan Zhang and David Brooks and Gu-Yeon Wei}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2017/06/Tong_JSSC2016_final.pdf}, year = {2016}, date = {2016-09-18}, journal = {IEEE Journal of Solid-State Circuits}, volume = {51}, number = {9}, pages = {2142--2152}, abstract = {This work presents a fully integrated 4-to-1 DC-DC symmetric ladder switched-capacitor converter (SLSCC) for voltage stacking applications. The SLSCC absorbs inter-layer load power mismatch to provide minimum voltage guarantees for the internal rails of a multicore system that implements four-way voltage stacking. A new hybrid feedback control scheme reduces the voltage ripple across stacked voltage layers for high levels of current mismatch, a condition that exacerbates voltage noise in conventional SC converters. Furthermore, the proposed SLSCC dynamically allocates valuable flying capacitor resources according to different load conditions, which improves conversion efficiency and supports more power mismatch between the layers. Implemented in TSMC’s 40G process, the SLSCC converts a 3.6 V input voltage down to four stacked output voltage layers, each nominally at 900 mV.}, keywords = {}, pubstate = {published}, tppubtype = {article} } This work presents a fully integrated 4-to-1 DC-DC symmetric ladder switched-capacitor converter (SLSCC) for voltage stacking applications. The SLSCC absorbs inter-layer load power mismatch to provide minimum voltage guarantees for the internal rails of a multicore system that implements four-way voltage stacking. A new hybrid feedback control scheme reduces the voltage ripple across stacked voltage layers for high levels of current mismatch, a condition that exacerbates voltage noise in conventional SC converters. Furthermore, the proposed SLSCC dynamically allocates valuable flying capacitor resources according to different load conditions, which improves conversion efficiency and supports more power mismatch between the layers. Implemented in TSMC’s 40G process, the SLSCC converts a 3.6 V input voltage down to four stacked output voltage layers, each nominally at 900 mV. |
Brandon Reagen; Paul Whatmough; Robert Adolf; Saketh Rama; Hyunkwang Lee; Sae Kyu Lee; José Miguel Hernández-Lobato; Gu-Yeon Wei; David Brooks Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators Inproceedings International Symposium on Computer Architecture (ISCA), 2016. @inproceedings{Reagen2016, title = {Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators}, author = {Brandon Reagen and Paul Whatmough and Robert Adolf and Saketh Rama and Hyunkwang Lee and Sae Kyu Lee and José Miguel Hernández-Lobato and Gu-Yeon Wei and David Brooks}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2016/05/reagen_isca16.pdf}, year = {2016}, date = {2016-06-18}, booktitle = {International Symposium on Computer Architecture (ISCA)}, abstract = {The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices. |
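One of Minerva's optimizations, predicating away small activation values, can be illustrated in a few lines. The layer shapes, threshold, and operation counting below are assumptions made for the example; the paper applies this idea inside a fixed-point accelerator together with datatype and SRAM-voltage optimizations.

```python
# Toy illustration of pruning small activations before a fully connected
# layer and counting the multiply-accumulates that can be skipped.
# Threshold and layer shapes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
activations = np.maximum(rng.normal(0, 1, size=(1, 1024)), 0)  # post-ReLU, so many zeros already
weights = rng.normal(0, 0.05, size=(1024, 256))

THRESHOLD = 0.2  # assumption: activations below this contribute little to the output

pruned = np.where(np.abs(activations) < THRESHOLD, 0.0, activations)
full = activations @ weights
approx = pruned @ weights

total_macs = activations.size * weights.shape[1]
skipped_macs = int((pruned == 0).sum()) * weights.shape[1]
rel_err = np.abs(full - approx).mean() / (np.abs(full).mean() + 1e-12)

print(f"skipped {skipped_macs/total_macs:.1%} of MACs, mean relative output error {rel_err:.3%}")
```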
2015 |
Mario Lok; Xuan Zhang; Elizabeth Farrell Helbling; Robert Wood; David Brooks; Gu-Yeon Wei A Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots Inproceedings IEEE Custom Integrated Circuits Conference (CICC), 2015. @inproceedings{Lok2015, title = {A Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots}, author = {Mario Lok and Xuan Zhang and Elizabeth Farrell Helbling and Robert Wood and David Brooks and Gu-Yeon Wei}, url = {https://micro.seas.harvard.edu/papers/Lok_CICC_2015.pdf}, year = {2015}, date = {2015-09-01}, booktitle = {IEEE Custom Integrated Circuits Conference (CICC)}, abstract = {This paper describes a power electronics unit (PEU) for an insect-scale flapping-wing robot. Three power saving techniques used in the actuator driver of the PEU — envelope tracking, dynamic common mode, and charge sharing — reduce power consumption while retaining weight benefits of an inductor-less linear driver. A pair of actuator driver ICs energize four 15nF capacitor loads, which represent the piezoelectric actuators of a flapping-wing robot. The PEU consumes 290mW, which translates to 37% lower power compared to a design without these power saving techniques.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } This paper describes a power electronics unit (PEU) for an insect-scale flapping-wing robot. Three power saving techniques used in the actuator driver of the PEU — envelope tracking, dynamic common mode, and charge sharing — reduce power consumption while retaining weight benefits of an inductor-less linear driver. A pair of actuator driver ICs energize four 15nF capacitor loads, which represent the piezoelectric actuators of a flapping-wing robot. The PEU consumes 290mW, which translates to 37% lower power compared to a design without these power saving techniques. |
Xuan Zhang; Mario Lok; Tao Tong; Simon Chaput; Sae Kyu Lee; Brandon Reagen; Hyunkwang Lee; David Brooks; Gu-Yeon Wei A Multi-Chip System Optimized for Insect-Scale Flapping-Wing Robots Inproceedings IEEE Symposium on VLSI Circuits (VLSIC), 2015. @inproceedings{Zhang2015, title = {A Multi-Chip System Optimized for Insect-Scale Flapping-Wing Robots}, author = {Xuan Zhang and Mario Lok and Tao Tong and Simon Chaput and Sae Kyu Lee and Brandon Reagen and Hyunkwang Lee and David Brooks and Gu-Yeon Wei}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2015/06/vlsi2015_robobee.pdf}, year = {2015}, date = {2015-06-16}, booktitle = {IEEE Symposium on VLSI Circuits (VLSIC)}, abstract = {We demonstrate a battery-powered multi-chip system optimized for insect-scale flapping wing robots that meets the tight weight limit and real-time performance demands of autonomous flight. Measured results show open-loop wing flapping driven by a power electronics unit and energy efficiency improvements via hardware acceleration. }, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } We demonstrate a battery-powered multi-chip system optimized for insect-scale flapping wing robots that meets the tight weight limit and real-time performance demands of autonomous flight. Measured results show open-loop wing flapping driven by a power electronics unit and energy efficiency improvements via hardware acceleration. |
Sae Kyu Lee; Tao Tong; Xuan Zhang; David Brooks; Gu-Yeon Wei A 16-Core Voltage-Stacked System with an Integrated Switched-Capacitor DC-DC Converter Inproceedings IEEE Symposium on VLSI Circuits (VLSIC), 2015. @inproceedings{Lee2015, title = {A 16-Core Voltage-Stacked System with an Integrated Switched-Capacitor DC-DC Converter}, author = {Sae Kyu Lee and Tao Tong and Xuan Zhang and David Brooks and Gu-Yeon Wei }, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2015/06/VLSI_2015_FVS_Submission_Final.pdf}, year = {2015}, date = {2015-06-16}, booktitle = {IEEE Symposium on VLSI Circuits (VLSIC)}, abstract = {A 16-core voltage-stacked IC integrated with a switched-capacitor DC-DC converter demonstrates efficient power delivery. To overcome inter-layer voltage noise issues, the test chip implements and evaluates the benefits of self-timed clocking and clock-phase interleaving. The integrated converter offers minimum voltage guarantees and further reduces voltage noise. }, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } A 16-core voltage-stacked IC integrated with a switched-capacitor DC-DC converter demonstrates efficient power delivery. To overcome inter-layer voltage noise issues, the test chip implements and evaluates the benefits of self-timed clocking and clock-phase interleaving. The integrated converter offers minimum voltage guarantees and further reduces voltage noise. |
Paul N. Whatmough; George Smart; Shidhartha Das; Yiannis Andreopoulos; David M. Bull A 0.6V All-Digital Body-Coupled Wakeup Transceiver for IoT Applications Inproceedings IEEE Symposium on VLSI Circuits (VLSIC), 2015. @inproceedings{Whatmough2015, title = {A 0.6V All-Digital Body-Coupled Wakeup Transceiver for IoT Applications }, author = {Paul N. Whatmough and George Smart and Shidhartha Das and Yiannis Andreopoulos and David M. Bull}, url = {http://vlsiarch.eecs.harvard.edu/wp-content/uploads/2015/06/C5_3.pdf}, year = {2015}, date = {2015-06-16}, booktitle = {IEEE Symposium on VLSI Circuits (VLSIC)}, abstract = {A body-coupled symmetric wakeup transceiver is proposed for always-on device discovery in IoT applications requiring security and low-power consumption. The wakeup transceiver (WTRx) is implemented in 65nm CMOS, using digital logic cells and operates at 0.6V. A directly-modulated open-loop DCO generates an OOK-modulated 10MHz carrier, with a frequency-locked loop for intermittent calibration. A passive receiver incorporates a digital IO cell as hysteretic comparator, with a two-phase correlator bank. A novel MAC scheme allows for duty-cycling in both transmitter and receiver. Measured power consumption is 3.54μW, with sensitivity of 88mV and maximum wakeup latency of 150ms.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } A body-coupled symmetric wakeup transceiver is proposed for always-on device discovery in IoT applications requiring security and low-power consumption. The wakeup transceiver (WTRx) is implemented in 65nm CMOS, using digital logic cells and operates at 0.6V. A directly-modulated open-loop DCO generates an OOK-modulated 10MHz carrier, with a frequency-locked loop for intermittent calibration. A passive receiver incorporates a digital IO cell as hysteretic comparator, with a two-phase correlator bank. A novel MAC scheme allows for duty-cycling in both transmitter and receiver. Measured power consumption is 3.54μW, with sensitivity of 88mV and maximum wakeup latency of 150ms. |
Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, David Brooks Profiling a Warehouse-Scale Computer Inproceedings International Symposium on Computer Architecture (ISCA), 2015. @inproceedings{kanev15wsc, title = {Profiling a Warehouse-Scale Computer}, author = {Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, David Brooks}, url = {http://www.eecs.harvard.edu/~skanev/papers/isca15wsc.pdf}, year = {2015}, date = {2015-06-15}, booktitle = {International Symposium on Computer Architecture (ISCA)}, abstract = {With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three year period, and comprising thousands of different applications. We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This \"datacenter tax\" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three year period, and comprising thousands of different applications. We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. 
These observations motivate several interesting directions for future warehouse-scale computers. |
Sam Xi; Hans Jacobson; Pradip Bose; Gu-Yeon Wei; David Brooks Quantifying Sources of Error in McPAT and Potential Impacts on Architectural Studies Conference International Symposium on High Performance Computer Architecture (HPCA), 2015. @conference{Xi2015_hpca, title = {Quantifying Sources of Error in McPAT and Potential Impacts on Architectural Studies}, author = {Sam Xi and Hans Jacobson and Pradip Bose and Gu-Yeon Wei and David Brooks}, url = {http://www.samxi.org/papers/xi_hpca2015.pdf}, year = {2015}, date = {2015-02-07}, booktitle = {International Symposium on High Performance Computer Architecture (HPCA)}, abstract = {Architectural power modeling tools are widely used by the computer architecture community for rapid evaluations of high-level design choices and design space explorations. Currently, McPAT is the de facto power model, but the literature does not yet contain a careful examination of its modeling accuracy. In addition, the issue of how greatly power modeling error can affect architectural-level studies has not been quantified before. In this work, we present the first rigorous assessment of McPAT’s core power and area models with a detailed, validated power modeling toolchain used in current industrial practice. We find that McPAT’s predictions can have significant error because some of the models are either incomplete, too high-level, or assume implementations of structures that differ from that of the core at hand. We demonstrate that large errors are possible when using McPAT’s dynamic power estimates in the context of voltage noise and thermal hotspots, but for steady-state properties, accurately modeling leakage power is more important. Based on our analysis, we are able to provide guidelines for creating accurate McPAT models, even without access to detailed industrial power modeling tools. We conclude that in spite of its accuracy gaps, McPAT is still a very useful tool for many architectural studies, and its limitations can often be adequately addressed for a given research study of interest. }, keywords = {}, pubstate = {published}, tppubtype = {conference} } Architectural power modeling tools are widely used by the computer architecture community for rapid evaluations of high-level design choices and design space explorations. Currently, McPAT is the de facto power model, but the literature does not yet contain a careful examination of its modeling accuracy. In addition, the issue of how greatly power modeling error can affect architectural-level studies has not been quantified before. In this work, we present the first rigorous assessment of McPAT’s core power and area models with a detailed, validated power modeling toolchain used in current industrial practice. We find that McPAT’s predictions can have significant error because some of the models are either incomplete, too high-level, or assume implementations of structures that differ from that of the core at hand. We demonstrate that large errors are possible when using McPAT’s dynamic power estimates in the context of voltage noise and thermal hotspots, but for steady-state properties, accurately modeling leakage power is more important. Based on our analysis, we are able to provide guidelines for creating accurate McPAT models, even without access to detailed industrial power modeling tools. We conclude that in spite of its accuracy gaps, McPAT is still a very useful tool for many architectural studies, and its limitations can often be adequately addressed for a given research study of interest. |
Simone Campanoni; Glenn Holloway; Gu-Yeon Wei; David Brooks HELIX-UP: Relaxing Program Semantics to Unleash Parallelization Conference International Symposium on Code Generation and Optimization (CGO), 2015. @conference{Campanoni2015_cgo, title = {HELIX-UP: Relaxing Program Semantics to Unleash Parallelization}, author = {Simone Campanoni and Glenn Holloway and Gu-Yeon Wei and David Brooks}, url = {http://www.eecs.harvard.edu/~xan/lib/exe/fetch.php?media=research:cgo2015_paper.pdf}, year = {2015}, date = {2015-02-07}, booktitle = {International Symposium on Code Generation and Optimization (CGO)}, abstract = {Automatic generation of parallel code for general-purpose commodity processors is a challenging computational problem. Nevertheless, there is a lot of latent thread-level parallelism in the way sequential programs are actually used. To convert latent parallelism into performance gains, users may be willing to compromise on the quality of a program\'s results. We have developed a parallelizing compiler and runtime that substantially improve scalability by allowing parallelized code to briefly sidestep strict adherence to language semantics at run time. In addition to boosting performance, our approach limits the sensitivity of parallelized code to the parameters of target CPUs (such as core-to-core communication latency) and the accuracy of data dependence analysis.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } Automatic generation of parallel code for general-purpose commodity processors is a challenging computational problem. Nevertheless, there is a lot of latent thread-level parallelism in the way sequential programs are actually used. To convert latent parallelism into performance gains, users may be willing to compromise on the quality of a program's results. We have developed a parallelizing compiler and runtime that substantially improve scalability by allowing parallelized code to briefly sidestep strict adherence to language semantics at run time. In addition to boosting performance, our approach limits the sensitivity of parallelized code to the parameters of target CPUs (such as core-to-core communication latency) and the accuracy of data dependence analysis. |
Yakun Sophia Shao; Sam Xi; Viji Srinivasan; Gu-Yeon Wei; David Brooks Toward Cache-Friendly Hardware Accelerators Conference HPCA Sensors and Cloud Architectures Workshop (SCAW), 2015. @conference{Shao2015_scaw, title = {Toward Cache-Friendly Hardware Accelerators}, author = {Yakun Sophia Shao and Sam Xi and Viji Srinivasan and Gu-Yeon Wei and David Brooks}, url = {http://www.eecs.harvard.edu/~shao/papers/shao2015-scaw.pdf}, year = {2015}, date = {2015-02-07}, booktitle = {HPCA Sensors and Cloud Architectures Workshop (SCAW)}, abstract = {Increasing demand for power-efficient, high-performance computing has spurred a growing number and diversity of hardware accelerators in mobile Systems on Chip (SoCs) as well as servers and desktops. Despite their energy efficiency, fixed-function accelerators lack programmability, especially compared with general-purpose processors. Today’s accelerators rely on software-managed scratchpad memory and Direct Memory Access (DMA) to provide fixed-latency memory access and data transfer, which leads to significant chip resource and software engineering costs. On the other hand, hardware-managed caches with support for virtual memory and cache coherence are well-known to ease programmability in general-purpose processors, but these features are not commonly supported in today’s fixed-function accelerators. As a first step toward cache-friendly accelerator design, this paper discusses limitations of scratchpad-based memories in today’s accelerators, identifies challenges to support hardware-managed caches, and explores opportunities to ease the cache integration.}, keywords = {}, pubstate = {published}, tppubtype = {conference} } Increasing demand for power-efficient, high-performance computing has spurred a growing number and diversity of hardware accelerators in mobile Systems on Chip (SoCs) as well as servers and desktops. Despite their energy efficiency, fixed-function accelerators lack programmability, especially compared with general-purpose processors. Today’s accelerators rely on software-managed scratchpad memory and Direct Memory Access (DMA) to provide fixed-latency memory access and data transfer, which leads to significant chip resource and software engineering costs. On the other hand, hardware-managed caches with support for virtual memory and cache coherence are well-known to ease programmability in general-purpose processors, but these features are not commonly supported in today’s fixed-function accelerators. As a first step toward cache-friendly accelerator design, this paper discusses limitations of scratchpad-based memories in today’s accelerators, identifies challenges to support hardware-managed caches, and explores opportunities to ease the cache integration. |
Brandon Reagen; Robert Adolf; Gu-Yeon Wei; David Brooks The MachSuite Benchmark Conference Boston Area Architecture Workshop (BARC), 2015. @conference{Reagen_MSBARC_2015, title = {The MachSuite Benchmark}, author = {Brandon Reagen and Robert Adolf and Gu-Yeon Wei and David Brooks}, url = {http://www.eecs.harvard.edu/~reagen/papers/ms_barc.pdf}, year = {2015}, date = {2015-01-30}, booktitle = {Boston Area Architecture Workshop (BARC)}, keywords = {}, pubstate = {published}, tppubtype = {conference} } |