Architectural Support for Deep Learning at Harvard
A Full-Stack Approach to Machine Learning
Designing a deep neural network accelerator is a multi-objective optimization problem: maximizing accuracy while minimizing energy consumption. The design space can easily contain millions of possible configurations and grows exponentially with the number of design parameters. We propose using Bayesian optimization to sample the design space intelligently. Compared to standard search solutions, Bayesian optimization:
- Generates better samples: on average, 4.3x more accurate with 3.6x less energy.
- Finds better Pareto points, consuming up to 8.7x less energy for equivalent accuracy.
- Discovers optimal designs faster, finding similar results with 3.2x fewer samples.
In submission; extended pre-print available
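The loop behind this approach can be sketched in a few dozen lines. The following is a minimal, illustrative single-objective version (not our actual toolflow): a Gaussian-process surrogate is fit to the configurations measured so far, and a lower-confidence-bound acquisition function picks the next configuration to evaluate, trading off predicted cost against uncertainty. The kernel length scale, jitter, and `cost` function here are all assumptions for the sketch.

```python
import numpy as np

def rbf(A, B, length=0.2):
    # Squared-exponential kernel between rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, y, Xs):
    # Gaussian-process posterior mean/std at candidates Xs,
    # conditioned on the configurations already measured (X, y).
    K = rbf(X, X) + 1e-6 * np.eye(len(X))   # jitter for stability
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    v = np.linalg.solve(K, Ks)
    var = np.clip(1.0 - (Ks * v).sum(0), 1e-12, None)
    return mu, np.sqrt(var)

def bayes_opt(cost, candidates, n_init=3, n_iter=10, seed=0):
    # Minimize `cost` over a discrete design space of candidate configs.
    rng = np.random.default_rng(seed)
    sampled = np.zeros(len(candidates), dtype=bool)
    init = rng.choice(len(candidates), n_init, replace=False)
    sampled[init] = True
    X = candidates[init]
    y = np.array([cost(x) for x in X])
    for _ in range(n_iter):
        mu, sd = gp_posterior(X, y, candidates)
        lcb = mu - 2.0 * sd      # lower-confidence-bound acquisition
        lcb[sampled] = np.inf    # never re-measure a configuration
        i = int(np.argmin(lcb))
        sampled[i] = True
        X = np.vstack([X, candidates[i]])
        y = np.append(y, cost(candidates[i]))
    best = int(np.argmin(y))
    return X[best], y[best]
```

The point of the surrogate is sample efficiency: each "measurement" here stands in for an expensive accuracy/energy evaluation of one accelerator configuration, so the optimizer spends its budget where the model is either promising or uncertain. A true multi-objective version would maintain a surrogate per objective and select points by expected hypervolume improvement over the Pareto front.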
Fathom is a collection of eight archetypal deep learning workloads to enable broad, realistic architecture research. Each model is derived from a seminal work in the deep learning community, ranging from the convolutional neural network of Krizhevsky et al. to the more exotic memory networks from Facebook's AI research group. We use a set of application-level modeling tools built around the TensorFlow deep learning framework to analyze the fundamental performance characteristics of each model. We break down where time is spent, computational similarities between the models, differences between inference and training, and the effects of parallelism on scaling.
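The core of a time breakdown is simple: attribute wall-clock time to each operation type and normalize. As a generic illustration (this is not Fathom's actual API, and real profiling would hook TensorFlow's runtime tracing rather than wrap calls by hand), one might accumulate per-op timings like this:

```python
import time
from collections import defaultdict

class OpTimer:
    """Accumulate wall-clock time per named operation type."""

    def __init__(self):
        self.totals = defaultdict(float)

    def timed(self, name, fn, *args, **kwargs):
        # Run fn, charging its elapsed time to the bucket `name`.
        t0 = time.perf_counter()
        out = fn(*args, **kwargs)
        self.totals[name] += time.perf_counter() - t0
        return out

    def breakdown(self):
        # Fraction of total time per operation type.
        total = sum(self.totals.values()) or 1.0
        return {k: v / total for k, v in self.totals.items()}
```

Aggregating such fractions across a full training or inference run is what reveals, for example, whether a model is dominated by dense matrix multiplies or by memory-bound embedding lookups.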
While published accelerators easily give an order of magnitude improvement over general-purpose hardware, few studies look beyond an initial implementation. Minerva is an automated co-design approach to optimizing DNN accelerators which goes further. Compared to a fixed-point accelerator baseline, we show that fine-grained data-type optimization reduces power by 1.5x; aggressive predication and pruning further reduce power by 2.0x; and active fault mitigation saves an additional 2.7x by lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1x power reduction over an accelerator baseline without compromising accuracy.
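Two of those optimizations are easy to sketch in software. The functions below are illustrative stand-ins, not the Minerva toolflow: `quantize` emulates a fixed-point datapath with a chosen integer/fraction bit split (narrower types mean smaller multipliers and SRAMs), and `prune` zeroes near-zero weights so a predicating accelerator can skip the corresponding MACs and SRAM reads. The threshold and bit-widths here are arbitrary example values.

```python
import numpy as np

def quantize(x, int_bits, frac_bits):
    # Fixed-point rounding: snap values to multiples of 2^-frac_bits
    # and clip to the representable range.
    step = 2.0 ** -frac_bits
    lim = 2.0 ** int_bits - step
    return np.clip(np.round(np.asarray(x) / step) * step, -lim, lim)

def prune(w, threshold):
    # Zero out near-zero weights; downstream hardware can predicate
    # (skip) the multiply-accumulates these would have fed.
    w = np.asarray(w)
    return np.where(np.abs(w) < threshold, 0.0, w)
```

In a co-design flow, the interesting part is sweeping these knobs per layer and checking model accuracy after each step, so the hardware savings are taken only where the network can tolerate them.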
Deep learning algorithms present an exciting opportunity for efficient VLSI implementations due to several useful properties: (1) an embarrassingly parallel dataflow graph, (2) significant sparsity in model parameters and intermediate results, and (3) resilience to noisy computation and storage. Exploiting these characteristics can offer significantly improved performance and energy efficiency. We have taped out two SoCs, one in 28nm bulk and one in 16nm FinFET. These chips contain CPUs, peripherals, on-chip memory and custom accelerators to allow us to tune and characterize the efficiency and resilience of deep learning algorithms in custom silicon.