The large memory requirements of deep neural networks limit their deployment and adoption on many devices. Model compression methods effectively reduce the memory requirements of these models, usually through applying transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. Leveraging the ability of neural networks to tolerate these imperfections and by re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496x with the same model accuracy. This results in up to a 1.51x improvement over the state-of-the-art.
Always-on classifiers for sensor data require a very wide operating range to support a variety of real-time workloads and must operate robustly at low supply voltages. We present a 16nm always-on wake-up controller with a fully-connected (FC) Deep Neural Network (DNN) accelerator that operates from 0.4-1 V. Calibration-free automatic voltage/frequency tuning is provided by tracking small non-zero Razor timing-error rates, and a novel timing-error driven sync-free fast adaptive clocking scheme provides resilience to on-chip supply voltage noise. The model access burden of neural networks is relaxed using a multicycle SRAM read, which allows memory voltage to be reduced at iso-throughput. The wide operating range allows for high performance at 1.36GHz, low-power consumption down to 750μW and state-of-the-art raw efficiency at 16-bit precision of 750 GOPS/W dense, or 1.81 TOPS/W sparse.
Recent advances in high-level synthesis (HLS) have enabled an automatic means of generating register-transfer level from high-level specifications without compromising performance. HLS provides substantial improvements to productivity and is a promising solution to designing future heterogeneous chips consisting of dozens of unique IP blocks (i.e., hardware accelerators). Despite their impressive capabilities, HLS tools today are commonly used to target a small subset of workloads, i.e., ones with inordinately regular control flow and memory access patterns. The challenges of achieving high-quality hardware for irregular workloads stems from HLS relying on static analysis. Static analysis is overly conservative when dealing with non-uniform memory access and imbalanced workloads, and identifying the most appropriate parallelizing strategy. In this brief, we propose the use of dynamic analysis to generate higher quality designs using commercial HLS tools. Our evaluations show that with dynamic dependence analysis, HLS designs achieve 3.3× performance improvement for the sparse matrix-vector multiply benchmark.
Emerging Internet of Things (IoT) devices necessitate system-on-chips (SoCs) that can scale from ultralow power always-on (AON) operation, all the way up to less frequent high-performance tasks at high energy efficiency. Specialized accelerators are essential to help meet these needs at both ends of the scale, but maintaining workload flexibility remains an important goal. This article presents a 25-mm² SoC in 16-nm FinFET technology which demonstrates targeted, flexible acceleration of key compute-intensive kernels spanning machine learning (ML), DSP, and cryptography. The SMIV SoC includes a dedicated AON sub-system, a dual-core Arm Cortex-A53 CPU cluster, an SoC-attached embedded field-programmable gate array (eFPGA) array, and a quad-core cache-coherent accelerator (CCA) cluster. Measurement results demonstrate: 1) 1236x power envelope, from 1.1 mW (only AON cluster), up to 1.36 W (whole SoC at maximum throughput); 2) 5.5-28.9x energy efficiency gain from offloading compute kernels from A53 to eFPGA; 3) 2.94x latency improvement using coherent memory access (CCA cluster); and 4) 55x MobileNetV1 energy per inference improvement on CCA compared to the CPU baseline. The overall flexibility-efficiency range on SMIV spans measured energy efficiencies of 1x (dual-core A53), 3.1x (A53 with SIMD), 16.5x (eFPGA), 54.9x (CCA), and 256x (AON) at a peak efficiency of 4.8 TOPS/W.
This paper presents a 28-nm system-on-chip (SoC) for Internet of things (IoT) applications with a programmable accelerator design that implements a powerful fully connected deep neural network (DNN) classifier. To reach the required low energy consumption, we exploit the key properties of neural network algorithms: parallelism, data reuse, small/sparse data, and noise tolerance. We map the algorithm to a very large scale integration (VLSI) architecture based around an singleinstruction, multiple-data data path with hardware support to exploit data sparsity by completely eliding unnecessary computation and data movement. This approach exploits sparsity, without compromising the parallel computation. We also exploit the inherent algorithmic noise-tolerance of neural networks, by introducing circuit-level timing violation detection to allow worst case voltage guard-bands to be minimized. The resulting intermittent timing violations may result in logic errors, which conventionally need to be corrected. However, in lieu of explicit error correction, we cope with this by accentuating the noise tolerance of neural networks. The measured test chip achieves high classification accuracy (98.36% for the MNIST test set), while tolerating aggregate timing violation rates>10 -1 . The accelerator achieves a minimum energy of 0.36 μJ/inference at 667 MHz; maximum throughput at 1.2 GHz and 0.57 μJ/inference; or a 10% margined operating point at 1 GHz and 0.58 μJ/inference.
One of the biggest performance bottlenecks of today's neural network (NN) accelerators is off-chip memory accesses . In this paper, we propose a method to use multi-level, embedded nonvolatile memory (eNVM) to eliminate all off-chip weight accesses. The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm2, well within the area allocated to SRAM in modern NN accelerators 
As the use of deep neural networks continues to grow, so does the fraction of compute cycles devoted to their execution. This has led the CAD and architecture communities to devote considerable attention to building DNN hardware. Despite these efforts, the fault tolerance of DNNs has generally been overlooked. This paper is the first to conduct a large-scale, empirical study of DNN resilience. Motivated by the inherent algorithmic resilience of DNNs, we are interested in understanding the relationship between fault rate and model accuracy. To do so, we present Ares: a light-weight, DNN-specific fault injection framework validated within 12% of real hardware. We find that DNN fault tolerance varies by orders of magnitude with respect to model, layer type, and structure.
One of the biggest performance bottlenecks of today’s neural network (NN) accelerators is off-chip memory accesses. In this paper, we propose a method to use multi-level, embedded non-volatile memory (eNVM) to eliminate all off-chip weight accesses. The use of multi-level memory cells increases the probability of faults. Therefore, we co-design the weights and memories such that their properties complement each other and the faults result in no noticeable NN accuracy loss. In the extreme case, the weights in fully connected layers can be stored using a single transistor. With weight pruning and clustering, we show our technique reduces the memory area by over an order of magnitude compared to an SRAM baseline. In the case of VGG16 (130M weights), we are able to store all the weights in 4.9 mm2, well within the area allocated to SRAM in modern NN accelerators.
This paper presents a power electronics design for the piezoelectric actuators of an insect-scale flapping-wing robot, the RoboBee. The proposed design outputs four high-voltage drive signals tailored for the two bimorph actuators of the RoboBee in an alternating drive configuration. It utilizes fully integrated drive stage circuits with a novel highside gate driver to save chip area and meet the strict mass constraint of the RoboBee. Compared with previous integrated designs, it also boosts efficiency in delivering energy to the actuators and recovering unused energy by applying three power saving techniques, dynamic common mode adjustment, envelope tracking, and charge sharing. Using this design to energize four 15 nF capacitor loads with a 200 V and 100 Hz drive signal and tracking the control commands recorded from an actual flight experiment for the robot, we measure an average power consumption of 290 mW.
This paper takes the position that, while cognitive computing today relies heavily on the cloud, we will soon see a paradigm shift where cognitive computing primarily happens on network edges. The shift toward edge devices is fundamentally propelled both by technological constraints in data centers and wireless network infrastructures, as well as practical considerations such as privacy and safety. The remainder of this paper lays out our view of how these constraints will impact future cognitive computing. Bringing cognitive computing to edge devices opens up several new opportunities and challenges, some of which demand new solutions and some of which require us to revisit entrenched techniques in light of new technologies. We close the paper with a call to action for future research.