This paper presents GoldenEye, a functional simulator with fault injection capabilities for common and emerging numerical formats, implemented for the PyTorch deep learning framework. GoldenEye provides a unified framework for numerical format evaluation of DNNs, including traditional number systems such as fixed and floating point, as well as recent DNN-inspired formats such as block floating point and AdaptivFloat. Additionally, GoldenEye enables single- and multi-bit flips at various logical and functional points during a value’s lifetime for resiliency analysis, including, for the first time, attention to numerical values’ hardware metadata. This paper describes GoldenEye’s technical design and implementation, which make it an easy-to-use, extensible, versatile, and fast tool for dependability research and future DNN accelerator design. We showcase its utility with three case studies: a unifying platform for number system comparison and evaluation, a design-space exploration heuristic for data type selection, and fast DNN reliability analysis for different error models. GoldenEye is open-sourced and available at: https://github.com/ma3mool/goldeneye.
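As a rough illustration of the single-bit-flip error model mentioned above, the following sketch flips one bit of an IEEE-754 single-precision value using only the Python standard library. It is not GoldenEye's API: the function name `flip_bit_float32` and the bit-indexing convention (0 is the least significant mantissa bit, 31 is the sign bit) are assumptions made for this example.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value` with one bit of its float32 encoding inverted.

    Hypothetical helper for illustration only; bit 0 is the least
    significant mantissa bit and bit 31 is the sign bit.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in the range [0, 31]")
    # Reinterpret the float32 encoding as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # Invert the selected bit.
    bits ^= 1 << bit
    # Reinterpret the modified integer as a float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped
```

Under this convention, flipping the sign bit of 1.0 gives -1.0, while flipping the most significant exponent bit (bit 30) turns 1.0 into infinity, which suggests why the position of a fault within a format matters for resiliency analysis.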
Repeated off-chip memory accesses to DRAM drive up operating power for data-intensive applications, and SRAM technology scaling and leakage power limit the efficiency of embedded memories. Future on-chip storage will need higher density and energy efficiency, and the actively expanding field of emerging, embeddable non-volatile memory (eNVM) technologies is providing many potential candidates to satisfy this need. Each technology proposal presents distinct trade-offs in terms of density, read, write, and reliability characteristics, and we present a comprehensive framework for navigating and quantifying these design trade-offs alongside realistic system constraints and application-level impacts. This work evaluates eNVM-based storage for a range of application and system contexts, including machine learning on the edge, graph analytics, and general-purpose cache hierarchy, in addition to describing a freely available set of tools for application experts, system designers, and device experts to better understand, compare, and quantify the next generation of embedded memory solutions.
Swarm intelligence is being increasingly deployed in autonomous systems, such as drones and unmanned vehicles. Federated reinforcement learning (FRL), a key swarm intelligence paradigm where agents interact with their own environments and cooperatively learn a consensus policy while preserving privacy, has recently shown potential advantages and gained popularity. However, transient faults are increasing in the hardware system with continuous technology node scaling and can pose threats to FRL systems. Meanwhile, conventional redundancy-based protection methods are challenging to deploy on resource-constrained edge applications. In this paper, we experimentally evaluate the fault tolerance of FRL navigation systems at various scales with respect to fault models, fault locations, learning algorithms, layer types, communication intervals, and data types at both training and inference stages. We further propose two cost-effective fault detection and recovery techniques that can achieve up to 3.3x improvement in resilience with <2.7% overhead in FRL systems.
Multiparty computation approaches to private neural network inference require significant communication between server and client, incur tremendous runtime penalties, and impose massive storage overheads. The primary source of these expenses is garbled circuit operations for nonlinear activation functions (typically ReLU), which require on the order of kilobytes of data transfer for each individual operation and tens of kilobytes of preprocessing storage per operation per inference. We propose a replacement for garbled circuits: Tabula, an algorithm to securely and efficiently perform single-operand nonlinear functions for private neural network inference. Tabula performs a one-time client initialization procedure with the help of a trusted third party (or via fully homomorphic encryption), operates over smaller finite fields whose elements are representable with fewer than 16 bits, and employs a lookup table which stores the encrypted results of nonlinear operations over secret-shared values. We show Tabula is secure under a semi-honest threat model, allowing it to be used as a replacement for garbled circuit operations. Our results show that for private neural network inference, Tabula reduces communication by a factor of more than 50×, enables speedups over 10×, and reduces storage costs from O(n) to O(1).
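The general idea of evaluating a single-operand nonlinear function through a table over secret-shared values can be sketched as follows. This is a simplified toy illustration under stated assumptions, not Tabula's actual protocol: the field size `P`, the masking scheme, and all function names are hypothetical, the dealer stands in for the trusted third party, and no security claim is made for this code.

```python
import secrets

P = 251  # hypothetical small prime field; elements fit in 8 bits


def share(v):
    """Split v into two additive shares modulo P."""
    r = secrets.randbelow(P)
    return r, (v - r) % P


def reconstruct(a, b):
    """Recombine two additive shares modulo P."""
    return (a + b) % P


def relu_field(v):
    """ReLU on field elements; the upper half of the field encodes negatives."""
    return v if v <= P // 2 else 0


def preprocess(f):
    """Dealer step (assumed trusted): build a table of f shifted by a secret mask s.

    Entry i holds f(i - s), so looking up index x + s yields f(x).
    Both the mask and every table entry are handed out as additive shares.
    """
    s = secrets.randbelow(P)
    table = [f((i - s) % P) for i in range(P)]
    shares = [share(t) for t in table]
    s0, s1 = share(s)
    return (s0, [a for a, _ in shares]), (s1, [b for _, b in shares])


def online(x_shares, pre0, pre1):
    """Online step: open the masked input x + s, then index the shared table."""
    x0, x1 = x_shares
    (s0, t0), (s1, t1) = pre0, pre1
    # Each party publishes (its share of x) + (its share of s);
    # the sum is x + s, which hides x behind the uniformly random mask s.
    idx = (x0 + s0 + x1 + s1) % P
    # Each party reads its own share of table[idx] = f(x) locally.
    return t0[idx], t1[idx]
```

In this sketch the only online communication is the single masked field element each party publishes, which gives some intuition for how a table-based approach can exchange far less data than a garbled circuit evaluation of the same function.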
The twilight of Dennard scaling has activated a global trend towards application-based hardware specialization. This trend is currently accelerating due to the surging democratization and deployment of machine learning on mobile and IoT compute platforms. At the same time, the growing complexity of specialized system-on-chips (SoCs) is levying a more laborious tax on ASIC companies’ design and verification efforts. High-level synthesis (HLS) is emerging as a foremost agile VLSI development methodology gaining increasing adoption in the hardware design community. However, concerns over Quality of Results (QoR) remain a key factor inhibiting more mainstream adoption of HLS. Obtaining optimal PPA outcomes can sometimes be an elusive or challenging task and strongly correlates with the syntactic approach of the high-level source code. In this paper, we aim to share the proven HLS practices we employed to raise the level of confidence in the post-silicon functional and performance expectations from our accelerator designs. In doing so, we recount some of the main challenges we encountered in our HLS-based hardware-software co-design journey and offer a few recommendations cultivated from our learnings. Finally, we posit on where the research opportunities to further improve design QoR and HLS user experience lie.
Specialized accelerators are increasingly used to meet the power-performance goals of emerging applications such as machine learning, image processing, and graph analysis. Existing accelerator programming methodologies using APIs have several limitations: (1) The application code lacks portability to other platforms and compiler frameworks; (2) the lack of integration of accelerator code in the compiler limits useful optimizations such as instruction selection and operator fusion; and (3) the opacity of the accelerator function semantics limits the ability to check the final code for correctness. The root of these limitations is the lack of a formal software/hardware interface specification for accelerators. In this paper, we use the recently developed Instruction-Level Abstraction (ILA) for accelerators to serve this purpose, similar to how the Instruction Set Architecture (ISA) has been used as the software/hardware interface for processors. We propose a compiler flow termed D2A using the ILA and present a prototype that demonstrates this flow for deep learning (DL) applications. This prototype compiles programs from high-level domain-specific languages, e.g., PyTorch and MxNet, to multiple target accelerators with no target-specific extensions to the application or compiler - thus demonstrating application portability. It includes compiler optimizations through instruction selection using equality saturation-based flexible matching. Finally, we show checking the correctness of the resulting code through formal verification of individual matched operations and fully automated simulation-based validation of complete applications. The evaluation of the prototype compiler is based on six different DL applications and three different accelerators. Overall, this methodology lays the foundation for integrating accelerators in compiler flows using a formal software/hardware interface.