Publications by Year: 2019

Xi Likun, Yuan Yao, Kshitij Bhardwaj, Paul Whatmough, Gu Wei, and David Brooks. 12/10/2019. “SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads.” arXiv e-prints. Publisher's VersionAbstract
In recent years, there has been tremendous advances in hardware acceleration of deep neural networks. However, most of the research has focused on optimizing accelerator microarchitecture for higher performance and energy efficiency on a per-layer basis. We find that for overall single-batch inference latency, the accelerator may only make up 25-40%, with the rest spent on data movement and in the deep learning software framework. Thus far, it has been very difficult to study end-to-end DNN performance during early stage design (before RTL is available) because there are no existing DNN frameworks that support end-to-end simulation with easy custom hardware accelerator integration. To address this gap in research infrastructure, we present SMAUG, the first DNN framework that is purpose-built for simulation of end-to-end deep learning applications. SMAUG offers researchers a wide range of capabilities for evaluating DNN workloads, from diverse network topologies to easy accelerator modeling and SoC integration. To demonstrate the power and value of SMAUG, we present case studies that show how we can optimize overall performance and energy efficiency for up to 1.8-5x speedup over a baseline system, without changing any part of the accelerator microarchitecture, as well as show how SMAUG can tune an SoC for a camera-powered deep learning pipeline.
SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads
Siming Ma, David Brooks, and Gu Wei. 11/30/2019. “A binary-activation, multi-level weight RNN and training algorithm for processing-in-memory inference with eNVM.” arXiv preprint arXiv:1912.00106. Publisher's VersionAbstract
We propose a new algorithm for training neural networks with binary activations and multi-level weights, which enables efficient processing-in-memory circuits with embedded nonvolatile memories (eNVM). Binary activations obviate costly DACs and ADCs. Multi-level weights leverage multi-level eNVM cells. Compared to existing algorithms, our method not only works for feed-forward networks (e.g., fully-connected and convolutional), but also achieves higher accuracy and noise resilience for recurrent networks. In particular, we present an RNN-based trigger-word detection PIM accelerator, with detailed hardware noise models and circuit co-design techniques, and validate our algorithm's high inference accuracy and robustness against a variety of real hardware non-idealities.
A binary-activation, multi-level weight RNN and training algorithm for processing-in-memory inference with eNVM
Lillian Pentecost, Udit Gupta, Elisa Ngan, Gu Wei, David Brooks, Johanna Beyer, and Michael Behrisch. 11/17/2019. “CHAMPVis: Comparative Hierarchical Analysis of Microarchitectural Performance.” ProTools workshop co-located with Supercomputing. Publisher's VersionAbstract
Performance analysis and optimization are essential tasks for hardware and software engineers. In the age of datacenter-scale computing, it is particularly important to conduct comparative performance analysis to understand discrepancies and limitations among different hardware systems and applications. However, there is a distinct lack of productive visualization tools for these comparisons. We present CHAMPVis [1], a web-based, interactive visualization tool that leverages the hierarchical organization of hardware systems to enable productive performance analysis. With CHAMPVis, users can make definitive performance comparisons across applications or hardware platforms. In addition, CHAMPVis provides methods to rank and cluster based on performance metrics to identify common optimization opportunities. Our thorough task analysis reveals three types of datacenter-scale performance analysis tasks: summarization, detailed comparative analysis, and interactive performance bottleneck identification. We propose techniques for each class of tasks including (1) 1-D feature space projection for similarity analysis; (2) Hierarchical parallel coordinates for comparative analysis; and (3) User interactions for rapid diagnostic queries to identify optimization targets. We evaluate CHAMPVis by analyzing standard datacenter applications and machine learning benchmarks in two different case studies.
CHAMPVis: Comparative Hierarchical Analysis of Microarchitectural Performance
Marco Donato, Lillian Pentecost, David Brooks, and Gu Wei. 10/4/2019. “MEMTI: Optimizing On-Chip Nonvolatile Storage for Visual Multitask Inference at the Edge.” IEEE MICRO, 39, 6. Publisher's VersionAbstract
The combination of specialized hardware and embedded nonvolatile memories (eNVM) holds promise for energy-efficient deep neural network (DNN) inference at the edge. However, integrating DNN hardware accelerators with eNVMs still presents several challenges. Multilevel programming is desirable for achieving maximal storage density on chip, but the stochastic nature of eNVM writes makes them prone to errors and further increases the write energy and latency. In this article, we present MEMTI, a memory architecture that leverages a multitask learning technique for maximal reuse of DNN parameters across multiple visual tasks. We show that by retraining and updating only 10% of all DNN parameters, we can achieve efficient model adaptation across a variety of visual inference tasks. The system performance is evaluated by integrating the memory with the open-source NVIDIA deep learning architecture.
MEMTI: Optimizing On-Chip Nonvolatile Storage for Visual Multitask Inference at the Edge
Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debo Dutta, Udit Gupta, Kim Hazelwood, Andy Hock, Xinyuan Huang, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St John, Carole-Jean Wu, Lingjie Xu, Cliff Young, and Matei Zaharia. 10/2/2019. “Mlperf training benchmark.” arXiv preprint arXiv:1910.01500. Publisher's VersionAbstract
Machine learning (ML) needs industry-standard performance benchmarks to support design and competitive evaluation of the many emerging software and hardware solutions for ML. But ML training presents three unique benchmarking challenges absent from other domains: optimizations that improve training throughput can increase the time to solution, training is stochastic and time to solution exhibits high variance, and software and hardware systems are so diverse that fair benchmarking with the same binary, code, and even hyperparameters is difficult. We therefore present MLPerf, an ML benchmark that overcomes these challenges. Our analysis quantitatively evaluates MLPerf's efficacy at driving performance and scalability improvements across two rounds of results from multiple vendors.
Mlperf training benchmark
Lillian Pentecost, Marco Donato, Brandon Reagen, Udit Gupta, Siming Ma, Gu Wei, and David Brooks. 10/1/2019. “MaxNVM: Maximizing DNN Storage Density and Inference Efficiency with Sparse Encoding and Error Mitigation.” In MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Pp. 769–781. Publisher's VersionAbstract
Deeply embedded applications require low-power, low-cost hardware that fits within stringent area constraints. Deep learning has many potential uses in these domains, but introduces significant inefficiencies stemming from off-chip DRAM accesses of model weights. Ideally, models would fit entirely on-chip. However, even with compression, memory requirements for state-of-the-art mod- els make on-chip inference impractical. Due to increased density, emerging eNVMs are one promising solution. We present MaxNVM, a principled co-design of sparse encodings, protective logic, and fault-prone MLC eNVM technologies (i.e.,RRAM and CTT) to enable highly-efficient DNN inference. We find bit reduction techniques (e.g., clustering and sparse compression) increase weight vulnerability to faults. This limits the capabilities of MLC eNVM. To circumvent this limitation, we improve storage den- sity (i.e., bits-per-cell) with minimal overhead using protective logic. Tradeoffs between density and reliability result in a rich design space. We show that by balancing these techniques, the weights of large networks are able to reasonably fit on-chip. Compared to a naive, single-level-cell eNVM solution, our highly-optimized MLC memory systems reduce weight area by up to 29×. We compare our technique against NVDLA, a state-of-the-art industry-grade CNN accelerator, and demonstrate up to 3.2× reduced power and up to 3.5× reduced energy per ResNet50 inference.
MaxNVM: Maximizing DNN Storage Density and Inference Efficiency with Sparse Encoding and Error Mitigation
Brian Plancher, Camelia Brumar, Iulian Brumar, Lillian Pentecost, Saketh Rama, and David Brooks. 9/24/2019. “Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM.” In IEEE High Performance Extreme Computing Conference (HPEC). Waltham, MA, USA. Publisher's VersionAbstract
Computational efficiency is a critical constraint for a variety of cutting-edge real-time applications. In this work, we identify an opportunity to speed up the end-to-end runtime of two such compute bound applications by incorporating approximate linear algebra techniques. Particularly, we apply approximate matrix multiplication to artificial Neural Networks (NNs) for image classification and to the robotics problem of Distributed Simultaneous Localization and Mapping (DSLAM). Expanding upon recent sampling-based Monte Carlo approximation strategies for matrix multiplication, we develop updated theoretical bounds, and an adaptive error prediction strategy. We then apply these techniques in the context of NNs and DSLAM increasing the speed of both applications by 15-20% while maintaining a 97% classification accuracy for NNs running on the MNIST dataset and keeping the average robot position error under 1 meter (vs 0.32 meters for the exact solution). However, both applications experience variance in their results. This suggests that Monte Carlo matrix multiplication may be an effective technique to reduce the memory and computational burden of certain algorithms when used carefully, but more research is needed before these techniques can be widely used in practice.
Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM
Bhardwaj Kshitij, Havasi Marton, Yao Yuan, David Brooks, Lobato Hernández, and Gu Wei. 9/16/2019. “Determining Optimal Coherency Interface for Many-Accelerator SoCs Using Bayesian Optimization.” IEEE Computer Architecture Letters, 18, 2, Pp. 119–123.Abstract
The modern system-on-chip (SoC) of the current exascale computing era is complex. These SoCs not only consist of several general-purpose processing cores but also integrate many specialized hardware accelerators. Three common coherency interfaces are used to integrate the accelerators with the memory hierarchy: non-coherent,coherent with the last-level cache (LLC), and fully-coherent.However, using a single coherence interface for all the accelerators in an SoC can lead to significant overheads: in the non-coherent model, accelerators directly access the main memory, which can have considerable performance penalty; whereas in the LLC-coherent model, the accelerators access the LLC but may suffer from performance bottleneck due to contention between several accelerators; and the fully-coherent model, that relies on private caches, can incur non-trivial power/area overheads. Given the limitations of each of these interfaces, this paper proposes a novel performance-aware hybrid coherency interface, where different accelerators use different coherency models, decided at design time based on the target applications so as to optimize the overall system performance. A new Bayesian optimization based framework is also proposed to determine the optimal hybrid coherency interface, i.e., use machine learning to select the best coherency model for each of the accelerators in the SoC in terms of performance. For image processing and classification workloads, the proposed framework determined that a hybrid interface achieves up to 23 percent better performance compared to the other 'homogeneous` interfaces, where all the accelerators use a single coherency model.
Determining Optimal Coherency Interface for Many-Accelerator SoCs Using Bayesian Optimization
Glenn Ko, Yuji Chai, Rob Rutenbar, David Brooks, and Gu Wei. 9/8/2019. “Accelerating Bayesian Inference on Structured Graphs Using Parallel Gibbs Sampling.” International Conference on Field-Programmable Logic and Applications. Barcelona, Spain. Publisher's VersionAbstract
Bayesian models and inference is a class of machine learning that is useful for solving problems where the amount of data is scarce and prior knowledge about the application allows you to draw better conclusions. However, Bayesian models often requires computing high-dimensional integrals and finding the posterior distribution can be intractable. One of the most commonly used approximate methods for Bayesian inference is Gibbs sampling, which is a Markov chain Monte Carlo (MCMC) technique to estimate target stationary distribution. The idea in Gibbs sampling is to generate posterior samples by iterating through each of the variables to sample from its conditional given all the other variables fixed. While Gibbs sampling is a popular method for probabilistic graphical models such as Markov Random Field (MRF), the plain algorithm is slow as it goes through each of the variables sequentially. In this work, we describe a binary label MRF Gibbs sampling inference architecture and extend it to 64-label version capable of running multiple perceptual applications, such as sound source separation and stereo matching. The described accelerator employs a chromatic scheduling of variables to parallelize all the conditionally independent variables to 257 samplers, imple- mented on the FPGA portion of a CPU-FPGA SoC. For real-time streaming sound source separation task, we show the hybrid CPU- FPGA implementation is 230x faster than a commercial mobile processor, while maintaining a recommended latency under 50 ms. The 64-label version showed 137x and 679x speedups for binary label MRF Gibbs sampling inference and 64 labels, respectively.
Accelerating Bayesian Inference on Structured Graphs Using Parallel Gibbs Sampling
Thierry Tambe, Yu Yang, Zishen Wan, Yuntian Deng, Vijay Reddi, Alexander Rush, David Brooks, and Gu Wei. 9/2019. “AdaptivFloat: A Floating-Point Based Data Type for Resilient Deep Learning Inference.” arXiv preprint arXiv:1909.13271. Publisher's VersionAbstract
Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low word sizes as their shrinking dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present AdaptivFloat, a floating-point inspired number representation format for deep learning that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encoding of neural network parameters. AdaptivFloat consistently produces higher inference accuracies compared to block floating-point, uniform, IEEE-like float or posit encodings at very low precision (≤ 8-bit) across a diverse set of state-of-the-art neural network topologies. And notably, AdaptivFloat is seen surpassing baseline FP32 performance by up to +0.3 in BLEU score and -0.75 in word error rate at weight bit widths that are ≤ 8-bit. Experimental results on a deep neural network (DNN) hardware accelerator, exploiting AdaptivFloat logic in its computational datapath, demonstrate per-operation energy and area that is 0.9× and 1.14×, respectively, that of equivalent bit width integer-based accelerator variants.
AdaptivFloat: A Floating-Point Based Data Type for Resilient Deep Learning Inference
Udit Gupta, Brandon Reagen, Lillian Pentecost, Marco Donato, Thierry Tambe, Alexander Rush, Gu Wei, and David Brooks. 8/23/2019. “MASR: A Modular Accelerator for Sparse RNNs.” In International Conference on Parallel Architectures and Compilation Techniques. Publisher's VersionAbstract
Recurrent neural networks (RNNs) are becoming the de facto solution for speech recognition. RNNs exploit long-term temporal relationships in data by applying repeated, learned transformations. Unlike fully-connected (FC) layers with single vector matrix operations, RNN layers consist of hundreds of such operations chained over time. This poses challenges unique to RNNs that are not found in convolutional neural networks (CNNs) or FC models, namely large dynamic activation. In this paper we present MASR, a principled and modular architecture that accelerates bidirectional RNNs for on-chip ASR. MASR is designed to exploit sparsity in both dynamic activations and static weights. The architecture is enhanced by a series of dynamic activation optimizations that enable compact storage, ensure no energy is wasted computing null operations, and maintain high MAC utilization for highly parallel accelerator designs. In comparison to current state-of-the-art sparse neural network accelerators (e.g., EIE), MASR provides 2x area 3x energy, and 1.6x performance benefits. The modular nature of MASR enables designs that efficiently scale from resource-constrained low-power IoT applications to large-scale, highly parallel datacenter deployments.
MASR: A Modular Accelerator for Sparse RNNs
Yu Wang, Gu Wei, and David Brooks. 7/24/2019. “Benchmarking TPU, GPU, and CPU platforms for deep learning.” arXiv preprint arXiv:1907.10701. Publisher's VersionAbstract
Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.
Benchmarking TPU, GPU, and CPU platforms for deep learning
Siming Ma, Marco Donato, Sae Lee, David Brooks, and Gu Wei. 7/22/2019. “Fully-CMOS Multi-Level Embedded Non-Volatile Memory Devices With Reliable Long-Term Retention for Efficient Storage of Neural Network Weights.” IEEE Electron Device Letters, 40, 9, Pp. 1403–1406. Publisher's VersionAbstract
We present a fully CMOS-compatible multilevel non-volatile memory technology, without any special process cost. It is especially suitable for storing the weights of artificial neural networks on chip with low cost, high density, and high power-efficiency. We use hot carrier injection to program the single-transistor cells, and we conduct charge pumping experiments which identify interfacial traps, rather than bulk oxide traps, as the dominant factor in producing stable I-V shifts. We also derive a new physics-based experimentally verified logarithmic model to explain the rate of interfacial trap generation in large I-V shift regimes where the conventional power-law no longer applies. We fabricate two chips, one using TSMC's 16 nm FinFET and the other in 28 nm planar, and show the FinFET cells are more favorable for non-volatile memory due to their better channel control. We store multiple levels in each FinFET cell using a “program and check” strategy which sets memory cells' currents with standard deviations less than 2 μA across a shifting range of over 100 μA. We demonstrate 8 level FinFET cells with extrapolated 10-year charge loss within 10% at 125 °C.
Fully-CMOS Multi-Level Embedded Non-Volatile Memory Devices With Reliable Long-Term Retention for Efficient Storage of Neural Network Weights
Sae Lee, Paul Whatmough, David Brooks, and Gu Wei. 7/1/2019. “A 16-nm always-on DNN processor with adaptive clocking and multi-cycle banked SRAMs.” IEEE Journal of Solid-State Circuits, 54, 7, Pp. 1982 - 1992. Publisher's VersionAbstract
Always-on subsystems in mobile/Internet of Things (IoT) SoCs process a variety of real-time sensor data deep neural network (DNN) classification workloads in a heavily constrained energy budget. This can be achieved with robust, low-voltage circuits, and specialized hardware accelerators. We present a 16-nm always-on DNN processor, which consists primarily of a microcontroller and a DNN accelerator with on-chip SRAM for the model weights. The design operates robustly from 0.4 to 1-V, with calibration-free automatic voltage/frequency tuning provided by tracking small non-zero razor timing error rates. A novel timing error-driven synchronization-free adaptive clocking scheme significantly reduces the adaptation latency to provide resilience to fast on-chip supply noise and reduce margins. To accommodate the tight energy constraints of always-on IoT workloads, we implement a multi-cycle SRAM read scheme that allows the memory voltage to scale at iso-throughput, improving energy efficiency across the entire operating range. The wide operating range allows for high performance at 1.36 GHz, low-power consumption downs to 750 μW, and stateof-the-art raw efficiency at 16-bit precision of 750 GOPS/W dense or 1.81 TOPS/W sparse.
A 16-nm always-on DNN processor with adaptive clocking and multi-cycle banked SRAMs
Paul Whatmough, Sae Lee, Marco Donato, Hsea Hsueh, Sam Xi, Udit Gupta, Lillian Pentecost, Glenn Ko, David Brooks, and Gu Wei. 6/2019. “A 16nm 25mm2 SoC with a 54.5x Flexibility-Efficiency Range from Dual-Core Arm Cortex-A53 to eFPGA and Cache-Coherent Accelerators.” Symposium on VLSI Circuits. Publisher's VersionAbstract
This paper presents a 25mm^2 SoC in 16nm FinFET technology targeting flexible acceleration of compute intensive kernels in DNN, DSP and security algorithms. The SoC includes an always-on sub-system, a dual-core Arm A53 CPU cluster, an embedded FPGA array, and a quad-core cache-coherent accelerator cluster. Measurement results demonstrate the following observations: 1) moving DSP/cryptography kernels from A53 to eFPGA increases energy efficiency between 5.5× - 28.9×, 2) the use of cache coherency for datapath accelerators increases throughput by 2.94×, and 3) accelerator flexibility-efficiency (GOPS/W) range spans from 3.1× (A53+S1MD), to 16.5× (eFPGA), to 54.5× (CCA) compared to the dual-core CPU baseline on comparable tasks. The energy per inference on MobileNet-128 CNN shows a peak improvement of 47.6×.
A 16nm 25mm2 SoC with a 54.5x Flexibility-Efficiency Range from Dual-Core Arm Cortex-A53 to eFPGA and Cache-Coherent Accelerators
G Ko, Yuji Chai, A Rutenbar, David Brooks, and Gu Wei. 4/28/2019. “Flexgibbs: Reconfigurable parallel gibbs sampling accelerator for structured graphs.” In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Pp. 334–334. Publisher's VersionAbstract
Many consider one of the key components to the success of deep learning as its compatibility with existing accelerators, mainly GPU. While GPUs are great at handling linear algebra kernels commonly found in deep learning, they are not the optimal architecture for handling unsupervised learning methods such as Bayesian models and inference. As a step towards, achieving better understanding of architectures for probabilistic models, Gibbs sampling, one of the most commonly used algorithms for Bayesian inference, is studied with a focus on parallelism that converges to the target distribution and parameterized components. We propose FlexGibbs, a reconfigurable parallel Gibbs sampling inference accelerator for structured graphs. We designed an architecture optimal for solving Markov Random Field tasks using an array of parallel Gibbs samplers, enabled by chromatic scheduling. We show that for sound source separation application, FlexGibbs configured on the FPGA fabric of Xilinx Zync CPU-FPGA SoC achieved Gibbs sampling inference speedup of 1048x and 99.85% reduction in energy over running it on ARM Cortex-A53.
Flexgibbs: Reconfigurable parallel gibbs sampling accelerator for structured graphs
Yu Emma Wang, Yuhao Zhu, Glenn Ko, Brandon Reagen, Gu-Wei, and David Brooks. 3/24/2019. “Demystifying Bayesian Inference Workloads.” International Symposium on Performance Analysis of Systems and Software (ISPASS). Publisher's VersionAbstract
The recent surge of machine learning has motivated computer architects to focus intently on accelerating related workloads, especially in deep learning. Deep learning has been the pillar algorithm that has led the advancement of learning patterns from a vast amount of labeled data, or supervised learning. However, for unsupervised learning, Bayesian methods often work better than deep learning. Bayesian modeling and inference works well with unlabeled or limited data, can leverage informative priors, and has interpretable models. Despite being an important branch of machine learning, Bayesian inference generally has been overlooked by the architecture and systems communities. In this paper, we facilitate the study of Bayesian inference with the development of BayesSuite, a collection of seminal Bayesian inference workloads. We characterize the power and performance profiles of BayesSuite across a variety of current-generation processors and find significant diversity. Manually tuning and deploying Bayesian inference workloads requires deep understanding of the workload characteristics and hardware specifications. To address these challenges and provide high-performance, energy-efficient support for Bayesian infer- ence, we introduce a scheduling and optimization mechanism that can be plugged into a system scheduler. We also propose a computation elision technique that further improves the performance and energy efficiency of the workloads by skipping computations that do not improve the quality of the inference. Our proposed techniques are able to increase Bayesian inference performance by 5.8× on average over the naive assignment and execution of the workloads.
Demystifying Bayesian Inference Workloads
Yu Wang, Victor Lee, Gu Wei, and David Brooks. 1/1/2019. “Predicting New Workload or CPU Performance by Analyzing Public Datasets.” ACM Transactions on Architecture and Code Optimization (TACO), 15, 4, Pp. 53:1–53:21. Publisher's VersionAbstract
The marketplace for general-purpose microprocessors offers hundreds of functionally similar models, differing by traits like frequency, core count, cache size, memory bandwidth, and power consumption. Their performance depends not only on microarchitecture, but also on the nature of the workloads being executed. Given a set of intended workloads, the consumer needs both performance and price information to make rational buying decisions. Many benchmark suites have been developed to measure processor performance, and their results for large collections of CPUs are often publicly available. However, repositories of benchmark results are not always helpful when consumers need performance data for new processors or new workloads. Moreover, the aggregate scores for benchmark suites designed to cover a broad spectrum of workload types can be misleading. To address these problems, we have developed a deep neural network (DNN) model, and we have used it to learn the relationship between the specifications of Intel CPUs and their performance on the SPEC CPU2006 and Geekbench 3 benchmark suites. We show that we can generate useful predictions for new processors and new workloads. We also cross-predict the two benchmark suites and compare their performance scores. The results quantify the self-similarity of these suites for the first time in the literature. This work should discourage consumers from basing purchasing decisions exclusively on Geekbench 3, and it should encourage academics to evaluate research using more diverse workloads than the SPEC CPU suites alone.
Predicting New Workload or CPU Performance by Analyzing Public Datasets
Gu Wei, Tao Tong, David Brooks, and Sae Lee. 2019. “Capacitor reconfiguration of a single-input, multi-output, switched-capacitor converter”.
Gu Wei, Tao Tong, David Brooks, and Sae Lee. 2019. “Device and method for hybrid feedback control of a switch-capacitor multi-unit voltage regulator.” United States of America. Publisher's VersionAbstract

A device and method for hybrid feedback control of a switch-capacitor multi-unit voltage regulator are presented. A multi-unit switched-capacitor (SC) core includes a plurality of SC converter units, each unit with a capacitor and a plurality of switches controllable by a plurality of switching signals. Power switch drivers provide a switching signal to each SC converter unit. A secondary proactive loop circuit includes a feedback control circuit configured to control one or more of the plurality of switches. A comparator is configured to compare the regulator output voltage with a reference voltage and provide a comparator trigger signal. Ripple reduction logic is configured to receive the comparator trigger signal and provide an SC unit allocation signal. A multiplexer is configured to receive a first clock signal, a second clock signal, and the SC unit allocation signal and provide a signal to the power switch drivers.

Device and method for hybrid feedback control of a switch-capacitor multi-unit voltage regulator