Increasing demand for power-efficient, high- performance computing has spurred a growing number and diversity of hardware accelerators in mobile and server Systems on Chip (SoCs). This paper makes the case that the co-design of the accelerator microarchitecture with the system in which it belongs is critical to balanced, efficient accelerator microarchitectures. We find that data movement and coherence management for accelerators are significant yet often unaccounted components of total accelerator runtime, resulting in misleading performance predictions and inefficient accelerator designs. To explore the design space of accelerator-system co-design, we develop gem5-Aladdin, an SoC simulator that captures dynamic interactions between accelerators and the SoC platform, and validate it to within 6% against real hardware. Our co-design studies show that the optimal energy-delay-product (EDP) of an accelerator microarchitecture can improve by up to 7.4x when system-level effects are considered compared to optimizing accelerators in isolation.
Deep learning has been popularized by its recent successes on challenging artificial intelligence problems. One of the reasons for its dominance is also an ongoing challenge: the need for immense amounts of computational power. Hardware architects have responded by proposing a wide array of promising ideas, but to date, the majority of the work has focused on specific algorithms in somewhat narrow application domains. While their specificity does not diminish these approaches, there is a clear need for more flexible solutions. We believe the first step is to examine the characteristics of cutting edge models from across the deep learning community. Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook’s AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling.
This work presents a fully integrated 4-to-1 DC-DC symmetric ladder switched-capacitor converter (SLSCC) for voltage stacking applications. The SLSCC absorbs inter-layer load power mismatch to provide minimum voltage guarantees for the internal rails of a multicore system that implements four-way voltage stacking. A new hybrid feedback control scheme reduces the voltage ripple across stacked voltage layers for high levels of current mismatch, a condition that exacerbates voltage noise in conventional SC converters. Furthermore, the proposed SLSCC dynamically allocates valuable flying capacitor resources according to different load conditions, which improves conversion efficiency and supports more power mismatch between the layers. Implemented in TSMC’s 40G process, the SLSCC converts a 3.6 V input voltage down to four stacked output voltage layers, each nominally at 900 mV.
The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.
In previous work we presented design and manufacturing rules for optimizing the energy density of piezoelectric bimorph actuators through the use of laser-induced melting, insulating edge coating, and features for rigid ground attachments to maximize force output, as well as a pre-stacked technique to enable mass customization. Here we adapt these techniques to bending actuators with four active layers, which utilize thinner material layers. This allows the use of lower operating voltages, which is important for overall power usage optimization, as typical small-scale power supplies are low-voltage and the efficiency of boost-converter and drive circuitry increases with decreasing output voltage. We show that this optimization results in a 24%–47% reduction in the weight of the required power supply (depending on the type of drive circuit used). We also present scaling arguments to determine when multi-layer actuator are preferable to thinner actuators, and show that our techniques are capable of scaling down to sub-mg weight actuators.
A circuit for driving a plurality of capacitive actuators, the circuit having a low-voltage side, a high voltage side and a flyback transformer between the two. The low-voltage side comprises first and second pairs of low-side switches connected in series across an input voltage. The flyback transformer has a primary winding connected to the two pairs of switches. The high-voltage side has a pair of switches connected between the secondary winding of the flyback transformer and a ground and a plurality of capacitive loads and bidirectional switches to connect the loads to the secondary winding of the flyback transformer and a ground.
Software-based implementations of deep neural network predictions consume large amounts of energy, limiting their deployment in power-constrained environments. Hardware acceleration is a promising alternative. However, it is challenging to efficiently design accelerators that have both low prediction error and low energy consumption. Bayesian optimization can be used to accelerate the design problem. However, most of the existing techniques collect data in a coupled way by always evaluating the two objectives (energy and error) jointly at the same input, which is inefficient. Instead, in this work we consider a decoupled approach in which, at each iteration, we choose which objective to evaluate next and at which input. We show that considering decoupled evaluations produces better solutions when computational resources are limited. Our results also indicate that evaluating the prediction error is more important than evaluating the energy consumption.