This paper presents GoldenEye, a functional simulator with fault injection capabilities for common and emerging numerical formats, implemented for the PyTorch deep learning framework. Gold- enEye provides a unified framework for numerical format evaluation of DNNs, including traditional number systems such as fixed and floating point, as well as recent DNN-inspired formats such as block floating point and AdaptivFloat. Additionally, GoldenEye enables single- and multi- bit flips at various logical and functional points during a value’s lifetime for resiliency analysis, including for the first time attention to numerical values’ hardware metadata. This paper describes Golden- Eye’s technical design and implementation which make it an easy-to- use, extensible, versatile, and fast tool for dependability research and future DNN accelerator design. We showcase its utility with three case studies: a unifying platform for number system comparison and eval- uation, a design-space exploration heuristic for data type selection, and fast DNN reliability analysis for different error models. GoldenEye is open-sourced and available at: https://github.com/ma3mool/goldeneye.
Specialized accelerators are increasingly used to meet the power-performance goals of emerging applications such as machine learning, image processing, and graph analysis. Existing accelerator programming methodologies using APIs have several limitations: (1) The application code lacks portability to other platforms and compiler frameworks; (2) the lack of integration of accelerator code in the compiler limits useful optimizations such as instruction selection and operator fusion; and (3) the opacity of the accelerator function semantics limits the ability to check the final code for correctness. The root of these limitations is the lack of a formal software/hardware interface specification for accelerators. In this paper, we use the recently developed Instruction-Level Abstraction (ILA) for accelerators to serve this purpose, similar to how the Instruction Set Architecture (ISA) has been used as the software/hardware interface for processors. We propose a compiler flow termed D2A using the ILA and present a prototype that demonstrates this flow for deep learning (DL) applications. This prototype compiles programs from high-level domain-specific languages, e.g., PyTorch and MxNet, to multiple target accelerators with no target-specific extensions to the application or compiler - thus demonstrating application portability. It includes compiler optimizations through instruction selection using equality saturation-based flexible matching. Finally, we show checking the correctness of the resulting code through formal verification of individual matched operations and fully automated simulation-based validation of complete applications. The evaluation of the prototype compiler is based on six different DL applications and three different accelerators. Overall, this methodology lays the foundation for integrating accelerators in compiler flows using a formal software/hardware interface.
In this paper we propose using machine learning to improve the design of deep neural network hardware accelerators. We show how to adapt multi-objective Bayesian optimization to overcome a challenging design problem: optimizing deep neural network hardware accelerators for both accuracy and energy efficiency. DNN accelerators exhibit all aspects of a challenging optimization space: the landscape is rough, evaluating designs is expensive, the objectives compete with each other, and both design spaces (algorithmic and microarchitectural) are unwieldy. With multi-objective Bayesian optimization, the design space exploration is made tractable and the design points found vastly outperform traditional methods across all metrics of interest.
Increasing demand for power-efficient, high- performance computing has spurred a growing number and diversity of hardware accelerators in mobile and server Systems on Chip (SoCs). This paper makes the case that the co-design of the accelerator microarchitecture with the system in which it belongs is critical to balanced, efficient accelerator microarchitectures. We find that data movement and coherence management for accelerators are significant yet often unaccounted components of total accelerator runtime, resulting in misleading performance predictions and inefficient accelerator designs. To explore the design space of accelerator-system co-design, we develop gem5-Aladdin, an SoC simulator that captures dynamic interactions between accelerators and the SoC platform, and validate it to within 6% against real hardware. Our co-design studies show that the optimal energy-delay-product (EDP) of an accelerator microarchitecture can improve by up to 7.4x when system-level effects are considered compared to optimizing accelerators in isolation.
Hardware specialization, in the form of accelerators that provide custom datapath and control for specific algorithms and applications, promises impressive performance and energy advantages compared to traditional architectures. Current research in accelerator analysis relies on RTL-based synthesis flows to produce accurate timing, power, and area estimates. Such techniques not only require significant effort and expertise but are also slow and tedious to use, making large design space exploration infeasible. To overcome this problem, we present Aladdin, a pre-RTL, power-performance accelerator modeling framework and demonstrate its application to system-on-chip (SoC) simulation. Aladdin estimates performance, power, and area of accelerators within 0.9%, 4.9%, and 6.6% with respect to RTL implementations. Integrated with architecture-level core and memory hierarchy simulators, Aladdin provides researchers an approach to model the power and performance of accelerators in an SoC environment.
This paper presents the many-accelerator architecture, a design approach combining the scalability of homogeneous multi-core architectures and system-on-chip’s high performance and power-efficient hardware accelerators. In preparation for systems containing tens or hundreds of accelerators, we characterize a diverse pool of accelerators and find each contains significant amounts of SRAM memory (up to 90% of their area). We take advantage of this discovery and introduce the accelerator store, a scalable architectural component to minimize accelerator area by sharing its memories between accelerators. We evaluate the accelerator store for two applications and find significant system area reductions (30%) in exchange for small overheads (2% performance, 0%–8% energy). The paper also identifies new research directions enabled by the accelerator store and the many-accelerator architecture.