Publications by Author: Robert Adolf

2023
Matthew Adiletta, Jesmin Jahan Tithi, Emmanouil-Ioannis Farsarakis, Gerasimos Gerogiannis, Robert Adolf, Robert Benke, Sidharth Kashyap, Samuel Hsia, Kartik Lakhotia, Fabrizio Petrini, Gu-Yeon Wei, and David Brooks. 4/24/2023. “Characterizing the Scalability of Graph Convolutional Networks on Intel® PIUMA.” In IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). Raleigh, North Carolina.

Large-scale Graph Convolutional Network (GCN) inference on traditional CPU/GPU systems is challenging due to a large memory footprint, sparse computational patterns, and irregular memory accesses with poor locality. Intel's Programmable Integrated Unified Memory Architecture (PIUMA) is designed to address these challenges for graph analytics. In this paper, a detailed characterization of GCNs is presented using the Open-Graph Benchmark (OGB) datasets to determine the viability of PIUMA as a potential solution to GCN scalability.

First, the extent of sparse matrix dense matrix multiplication (SpMM) as a performance driver for GCN on CPU and GPU is explored, offering a methodology for predicting GCN behavior as a function of dataset characteristics. Second, an SpMM kernel optimized for PIUMA is described and investigated for sensitivity to system parameters including memory bandwidth, latency, and thread count. SpMM scalability on PIUMA is demonstrated, while the scalability limitations of a Xeon-optimized SpMM implementation are discussed. Finally, GCN performance is compared on PIUMA versus a Xeon CPU system and Ampere GPU system, showing impressive results on PIUMA for large-scale datasets.

2018
Brandon Reagen, Udit Gupta, Robert Adolf, Michael Mitzenmacher, Alexander Rush, Gu-Yeon Wei, and David Brooks. 11/13/2018. “Weightless: Lossy Weight Encoding For Deep Neural Network Compression.” In International Conference on Machine Learning, Pp. 4324–4333.
The large memory requirements of deep neural networks limit their deployment and adoption on many devices. Model compression methods effectively reduce the memory requirements of these models, usually through applying transformations such as weight pruning or quantization. In this paper, we present a novel scheme for lossy weight encoding which complements conventional compression techniques. The encoding is based on the Bloomier filter, a probabilistic data structure that can save space at the cost of introducing random errors. Leveraging the ability of neural networks to tolerate these imperfections and by re-training around the errors, the proposed technique, Weightless, can compress DNN weights by up to 496x with the same model accuracy. This results in up to a 1.51x improvement over the state-of-the-art.
2017
Brandon Reagen, Robert Adolf, Paul Whatmough, Gu-Yeon Wei, and David Brooks. 8/2017. Deep Learning for Computer Architects. Morgan & Claypool Publishers.
Machine learning, and specifically deep learning, has been hugely disruptive in many fields of computer science. The success of deep learning techniques in solving notoriously difficult classification and regression problems has resulted in their rapid adoption in solving real-world problems. The emergence of deep learning is widely attributed to a virtuous cycle whereby fundamental advancements in training deeper models were enabled by the availability of massive datasets and high-performance computer hardware. This text serves as a primer for computer architects in a new and rapidly evolving field. We review how machine learning has evolved since its inception in the 1960s and track the key developments leading up to the emergence of the powerful deep learning techniques that emerged in the last decade. Next we review representative workloads, including the most commonly used datasets and seminal networks across a variety of domains. In addition to discussing the workloads themselves, we also detail the most popular deep learning tools and show how aspiring practitioners can use the tools with the workloads to characterize and optimize DNNs. The remainder of the book is dedicated to the design and optimization of hardware and architectures for machine learning. As high-performance hardware was so instrumental in the success of machine learning becoming a practical solution, this chapter recounts a variety of optimizations proposed recently to further improve future designs. Finally, we present a review of recent research published in the area as well as a taxonomy to help readers understand how various contributions fall in context.
Brandon Reagen, José Hernández-Lobato, Robert Adolf, Michael Gelbart, Paul Whatmough, Gu-Yeon Wei, and David Brooks. 7/24/2017. “A Case for Efficient Accelerator Design Space Exploration via Bayesian Optimization.” In International Symposium on Low Power Electronics and Design. Taipei, Taiwan.
In this paper we propose using machine learning to improve the design of deep neural network hardware accelerators. We show how to adapt multi-objective Bayesian optimization to overcome a challenging design problem: optimizing deep neural network hardware accelerators for both accuracy and energy efficiency. DNN accelerators exhibit all aspects of a challenging optimization space: the landscape is rough, evaluating designs is expensive, the objectives compete with each other, and both design spaces (algorithmic and microarchitectural) are unwieldy. With multi-objective Bayesian optimization, the design space exploration is made tractable and the design points found vastly outperform traditional methods across all metrics of interest.
Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2017. “The Design and Evolution of Deep Learning Workloads.” IEEE MICRO, 37, 1, Pp. 18–21.
2016
Robert Adolf, Saketh Rama, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 8/23/2016. “Fathom: Reference Workloads for Modern Deep Learning Methods.” In IEEE International Symposium on Workload Characterization.
Deep learning has been popularized by its recent successes on challenging artificial intelligence problems. One of the reasons for its dominance is also an ongoing challenge: the need for immense amounts of computational power. Hardware architects have responded by proposing a wide array of promising ideas, but to date, the majority of the work has focused on specific algorithms in somewhat narrow application domains. While their specificity does not diminish these approaches, there is a clear need for more flexible solutions. We believe the first step is to examine the characteristics of cutting edge models from across the deep learning community. Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook’s AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling.
Brandon Reagen, Paul Whatmough, Robert Adolf, Saketh Rama, Hyunkwang Lee, Sae Kyu Lee, José Hernández-Lobato, Gu-Yeon Wei, and David Brooks. 6/18/2016. “Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators.” In International Symposium on Computer Architecture (ISCA). Seoul, Korea (South).
The continued success of Deep Neural Networks (DNNs) in classification tasks has sparked a trend of accelerating their execution with specialized hardware. While published designs easily give an order of magnitude improvement over general-purpose hardware, few look beyond an initial implementation. This paper presents Minerva, a highly automated co-design approach across the algorithm, architecture, and circuit levels to optimize DNN hardware accelerators. Compared to an established fixed-point accelerator baseline, we show that fine-grained, heterogeneous datatype optimization reduces power by 1.5×; aggressive, inline predication and pruning of small activity values further reduces power by 2.0×; and active hardware fault detection coupled with domain-aware error mitigation eliminates an additional 2.7× through lowering SRAM voltages. Across five datasets, these optimizations provide a collective average of 8.1× power reduction over an accelerator baseline without compromising DNN model accuracy. Minerva enables highly accurate, ultra-low power DNN accelerators (in the range of tens of milliwatts), making it feasible to deploy DNNs in power-constrained IoT and mobile devices.
José Hernández-Lobato, Michael A Gelbart, Brandon Reagen, Robert Adolf, Daniel Hernández-Lobato, Paul N Whatmough, David Brooks, Gu-Yeon Wei, and Ryan P Adams. 2016. “Designing neural network hardware accelerators with decoupled objective evaluations.” In NIPS workshop on Bayesian Optimization, Pp. 10.
Software-based implementations of deep neural network predictions consume large amounts of energy, limiting their deployment in power-constrained environments. Hardware acceleration is a promising alternative. However, it is challenging to efficiently design accelerators that have both low prediction error and low energy consumption. Bayesian optimization can be used to accelerate the design problem. However, most of the existing techniques collect data in a coupled way by always evaluating the two objectives (energy and error) jointly at the same input, which is inefficient. Instead, in this work we consider a decoupled approach in which, at each iteration, we choose which objective to evaluate next and at which input. We show that considering decoupled evaluations produces better solutions when computational resources are limited. Our results also indicate that evaluating the prediction error is more important than evaluating the energy consumption.
2015
Brandon Reagen, Robert Adolf, Gu-Yeon Wei, and David Brooks. 10/26/2015. “The MachSuite Benchmark.” In Boston Area Architecture Workshop (BARC). Raleigh, NC, USA.
Recent high-level synthesis and accelerator-related architecture papers show a great disparity in workload selection. To improve standardization within the accelerator research community, we present MachSuite, a collection of 19 benchmarks for evaluating high-level synthesis tools and accelerator-centric architectures. MachSuite spans a broad application space, captures a variety of different program behaviors, and provides implementations tailored towards the needs of accelerator designers and researchers, including support for high-level synthesis. We illustrate these aspects by characterizing each benchmark along five different dimensions, highlighting trends and salient features.
2014
Brandon Reagen, Robert Adolf, Sophia Shao, Gu-Yeon Wei, and David Brooks. 10/26/2014. “MachSuite: Benchmarks for Accelerator Design and Customized Architectures.” In IEEE International Symposium on Workload Characterization (IISWC).
Recent high-level synthesis and accelerator-related architecture papers show a great disparity in workload selection among projects and research groups. To provide standardization within the accelerator research community, we present MachSuite, a benchmark suite for high-level synthesis tools and accelerator-centric architectures. MachSuite is the compilation of carefully selected workloads to cover a diverse application space and algorithm choices. All the benchmarks in MachSuite are implemented to be well suited for high-level synthesis. A thorough characterization further demonstrates the diverse behaviors among benchmarks, representative of different customization challenges. MachSuite enables commensurability across research projects while mitigating the burden of accelerator implementation and workload selection.