Publications by Type: Journal Article

2019
Siming Ma, David Brooks, and Gu-Yeon Wei. 11/30/2019. “A binary-activation, multi-level weight RNN and training algorithm for processing-in-memory inference with eNVM.” arXiv preprint arXiv:1912.00106.
We propose a new algorithm for training neural networks with binary activations and multi-level weights, which enables efficient processing-in-memory circuits with embedded nonvolatile memories (eNVM). Binary activations obviate costly DACs and ADCs. Multi-level weights leverage multi-level eNVM cells. Compared to existing algorithms, our method not only works for feed-forward networks (e.g., fully-connected and convolutional), but also achieves higher accuracy and noise resilience for recurrent networks. In particular, we present an RNN-based trigger-word detection PIM accelerator, with detailed hardware noise models and circuit co-design techniques, and validate our algorithm's high inference accuracy and robustness against a variety of real hardware non-idealities.
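For illustration, the sketch below trains through a hard ±1 activation with a generic straight-through estimator (STE) in PyTorch; it is a minimal example of the binary-activation ingredient only, not the training algorithm proposed in the paper, and the gradient-clipping threshold of 1 is an assumption.

    # Generic straight-through-estimator sketch for a binary activation
    # (illustrative only; not the paper's training algorithm).
    import torch

    class BinaryAct(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return torch.sign(x)                      # forward: hard +/-1 activation
        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            # STE: pass gradients through where |x| <= 1, block them elsewhere
            return grad_out * (x.abs() <= 1).float()

    x = torch.randn(4, 8, requires_grad=True)
    BinaryAct.apply(x).sum().backward()
    print(x.grad)                                     # masked pass-through gradients
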
Marco Donato, Lillian Pentecost, David Brooks, and Gu-Yeon Wei. 10/4/2019. “MEMTI: Optimizing On-Chip Nonvolatile Storage for Visual Multitask Inference at the Edge.” IEEE Micro, 39, 6.
The combination of specialized hardware and embedded nonvolatile memories (eNVM) holds promise for energy-efficient deep neural network (DNN) inference at the edge. However, integrating DNN hardware accelerators with eNVMs still presents several challenges. Multilevel programming is desirable for achieving maximal storage density on chip, but the stochastic nature of eNVM writes makes them prone to errors and further increases the write energy and latency. In this article, we present MEMTI, a memory architecture that leverages a multitask learning technique for maximal reuse of DNN parameters across multiple visual tasks. We show that by retraining and updating only 10% of all DNN parameters, we can achieve efficient model adaptation across a variety of visual inference tasks. The system performance is evaluated by integrating the memory with the open-source NVIDIA Deep Learning Accelerator (NVDLA).
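As a minimal sketch of the parameter-reuse idea, the example below freezes a shared backbone and fine-tunes only a small task-specific slice; ResNet-18 and the chosen layers are stand-ins, not the MEMTI model or its 10% split.

    # Freeze shared weights and adapt only a small task-specific subset
    # (ResNet-18 is a stand-in backbone, not the MEMTI network).
    import torch, torchvision

    model = torchvision.models.resnet18(num_classes=10)
    for p in model.parameters():
        p.requires_grad = False                   # shared weights stay fixed (e.g., in eNVM)
    for p in model.fc.parameters():
        p.requires_grad = True                    # retrain only the task-specific head

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"updating {100 * trainable / total:.2f}% of parameters")

    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-3)
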
Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debo Dutta, Udit Gupta, Kim Hazelwood, Andy Hock, Xinyuan Huang, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St John, Carole-Jean Wu, Lingjie Xu, Cliff Young, and Matei Zaharia. 10/2/2019. “MLPerf training benchmark.” arXiv preprint arXiv:1910.01500.
Machine learning (ML) needs industry-standard performance benchmarks to support design and competitive evaluation of the many emerging software and hardware solutions for ML. But ML training presents three unique benchmarking challenges absent from other domains: optimizations that improve training throughput can increase the time to solution, training is stochastic and time to solution exhibits high variance, and software and hardware systems are so diverse that fair benchmarking with the same binary, code, and even hyperparameters is difficult. We therefore present MLPerf, an ML benchmark that overcomes these challenges. Our analysis quantitatively evaluates MLPerf's efficacy at driving performance and scalability improvements across two rounds of results from multiple vendors.
Kshitij Bhardwaj, Marton Havasi, Yuan Yao, David Brooks, José Miguel Hernández-Lobato, and Gu-Yeon Wei. 9/16/2019. “Determining Optimal Coherency Interface for Many-Accelerator SoCs Using Bayesian Optimization.” IEEE Computer Architecture Letters, 18, 2, Pp. 119–123.
The modern system-on-chip (SoC) of the current exascale computing era is complex. These SoCs not only consist of several general-purpose processing cores but also integrate many specialized hardware accelerators. Three common coherency interfaces are used to integrate the accelerators with the memory hierarchy: non-coherent, coherent with the last-level cache (LLC), and fully-coherent. However, using a single coherency interface for all the accelerators in an SoC can lead to significant overheads: in the non-coherent model, accelerators directly access the main memory, which can carry a considerable performance penalty; in the LLC-coherent model, the accelerators access the LLC but may suffer from performance bottlenecks due to contention among several accelerators; and the fully-coherent model, which relies on private caches, can incur non-trivial power/area overheads. Given the limitations of each of these interfaces, this paper proposes a novel performance-aware hybrid coherency interface, where different accelerators use different coherency models, decided at design time based on the target applications so as to optimize overall system performance. A new Bayesian-optimization-based framework is also proposed to determine the optimal hybrid coherency interface, i.e., to use machine learning to select the best coherency model for each of the accelerators in the SoC in terms of performance. For image processing and classification workloads, the proposed framework determined that a hybrid interface achieves up to 23 percent better performance compared to the other “homogeneous” interfaces, where all the accelerators use a single coherency model.
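To make the design space concrete, the toy sketch below enumerates per-accelerator coherency assignments against a placeholder cost model; the accelerator names and simulate() function are hypothetical, and exhaustive search stands in for the paper's Bayesian optimization.

    # Toy search over per-accelerator coherency assignments; simulate() is a
    # placeholder for a cycle-level performance model, and exhaustive search
    # replaces Bayesian optimization for brevity.
    import itertools, random

    ACCELERATORS = ["conv", "fft", "aes", "sort"]           # hypothetical accelerators
    INTERFACES = ["non-coherent", "llc-coherent", "fully-coherent"]

    def simulate(assignment):
        random.seed(hash(tuple(sorted(assignment.items()))) & 0xffffffff)
        return random.uniform(1.0, 2.0)                     # fake normalized performance

    best, best_perf = None, -1.0
    for combo in itertools.product(INTERFACES, repeat=len(ACCELERATORS)):
        assignment = dict(zip(ACCELERATORS, combo))
        perf = simulate(assignment)
        if perf > best_perf:
            best, best_perf = assignment, perf
    print(best, best_perf)
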
Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander Rush, David Brooks, and Gu-Yeon Wei. 9/2019. “AdaptivFloat: A Floating-Point Based Data Type for Resilient Deep Learning Inference.” arXiv preprint arXiv:1909.13271.
Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low word sizes as their shrinking dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present AdaptivFloat, a floating-point inspired number representation format for deep learning that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encodings of neural network parameters. AdaptivFloat consistently produces higher inference accuracies than block floating-point, uniform, IEEE-like float, or posit encodings at very low precision (≤ 8-bit) across a diverse set of state-of-the-art neural network topologies. Notably, AdaptivFloat surpasses baseline FP32 performance by up to +0.3 in BLEU score and -0.75 in word error rate at weight bit widths ≤ 8-bit. Experimental results on a deep neural network (DNN) hardware accelerator that exploits AdaptivFloat logic in its computational datapath demonstrate per-operation energy and area of 0.9× and 1.14×, respectively, relative to equivalent bit-width integer-based accelerator variants.
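The sketch below conveys the general idea of shifting a low-precision float grid per tensor so that its dynamic range just covers the largest weight magnitude; it is a simplified approximation of the concept, not the authors' AdaptivFloat implementation, and the bit allocation is an assumption.

    # Simplified per-tensor adaptive-range float quantization (an
    # approximation of the idea, not the AdaptivFloat reference code).
    import numpy as np

    def adaptive_float_quantize(w, n_bits=8, n_mantissa=3):
        n_exp = n_bits - 1 - n_mantissa                     # 1 sign bit, rest exponent
        max_exp = np.floor(np.log2(np.max(np.abs(w)) + 1e-12))
        min_exp = max_exp - (2 ** n_exp - 1)                # shifted exponent range
        sign, mag = np.sign(w), np.abs(w)
        exp = np.clip(np.floor(np.log2(mag + 1e-12)), min_exp, max_exp)
        scale = 2.0 ** (exp - n_mantissa)                   # mantissa step at each exponent
        q = sign * np.round(mag / scale) * scale
        max_val = (2 - 2.0 ** -n_mantissa) * 2.0 ** max_exp
        return np.clip(q, -max_val, max_val)                # saturate at top of range

    w = 0.05 * np.random.randn(256, 256).astype(np.float32)
    print("max abs error:", np.abs(w - adaptive_float_quantize(w)).max())
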
Yu Wang, Gu-Yeon Wei, and David Brooks. 7/24/2019. “Benchmarking TPU, GPU, and CPU platforms for deep learning.” arXiv preprint arXiv:1907.10701.
Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.
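To give a flavor of a parameterized benchmark generator, the sketch below sweeps the depth and width of fully-connected models; the sweep ranges are arbitrary and this is not the released ParaDnn suite.

    # Generate fully-connected models across a depth/width sweep
    # (arbitrary ranges; not the released ParaDnn code).
    import torch.nn as nn

    def make_fc(depth, width, n_in=784, n_out=10):
        layers, d = [], n_in
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers.append(nn.Linear(d, n_out))
        return nn.Sequential(*layers)

    for depth in (4, 8, 16):
        for width in (256, 1024, 4096):
            n_params = sum(p.numel() for p in make_fc(depth, width).parameters())
            print(f"depth={depth:2d} width={width:4d} params={n_params / 1e6:5.1f}M")
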
Siming Ma, Marco Donato, Sae Kyu Lee, David Brooks, and Gu-Yeon Wei. 7/22/2019. “Fully-CMOS Multi-Level Embedded Non-Volatile Memory Devices With Reliable Long-Term Retention for Efficient Storage of Neural Network Weights.” IEEE Electron Device Letters, 40, 9, Pp. 1403–1406.
We present a fully CMOS-compatible multilevel non-volatile memory technology, without any special process cost. It is especially suitable for storing the weights of artificial neural networks on chip with low cost, high density, and high power-efficiency. We use hot carrier injection to program the single-transistor cells, and we conduct charge pumping experiments which identify interfacial traps, rather than bulk oxide traps, as the dominant factor in producing stable I-V shifts. We also derive a new physics-based, experimentally verified logarithmic model to explain the rate of interfacial trap generation in large I-V shift regimes where the conventional power law no longer applies. We fabricate two chips, one using TSMC's 16 nm FinFET and the other in 28 nm planar, and show the FinFET cells are more favorable for non-volatile memory due to their better channel control. We store multiple levels in each FinFET cell using a “program and check” strategy which sets memory cells' currents with standard deviations less than 2 μA across a shifting range of over 100 μA. We demonstrate 8-level FinFET cells with extrapolated 10-year charge loss within 10% at 125 °C.
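A toy software model of the “program and check” loop follows: apply short programming pulses and re-measure until the cell current falls within a tolerance of the target level. The starting current, pulse strength, and noise are invented numbers, not measured device data.

    # Toy "program and check" loop; all numbers are illustrative, not data.
    import random

    def program_and_check(target_uA, tol_uA=1.0, max_pulses=200):
        current, pulses = 120.0, 0                    # hypothetical fresh-cell current (uA)
        while abs(current - target_uA) > tol_uA and pulses < max_pulses:
            current -= random.uniform(0.5, 1.0)       # one noisy HCI programming pulse
            pulses += 1                               # re-measure after each pulse
        return current, pulses

    for level in (100, 80, 60, 40):
        final, pulses = program_and_check(level)
        print(f"target {level} uA -> {final:5.1f} uA after {pulses} pulses")
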
Sae Kyu Lee, Paul Whatmough, David Brooks, and Gu-Yeon Wei. 7/1/2019. “A 16-nm always-on DNN processor with adaptive clocking and multi-cycle banked SRAMs.” IEEE Journal of Solid-State Circuits, 54, 7, Pp. 1982–1992.
Always-on subsystems in mobile/Internet of Things (IoT) SoCs process a variety of real-time sensor data with deep neural network (DNN) classification workloads under a heavily constrained energy budget. This can be achieved with robust, low-voltage circuits and specialized hardware accelerators. We present a 16-nm always-on DNN processor, which consists primarily of a microcontroller and a DNN accelerator with on-chip SRAM for the model weights. The design operates robustly from 0.4 to 1 V, with calibration-free automatic voltage/frequency tuning provided by tracking small non-zero Razor timing error rates. A novel timing-error-driven, synchronization-free adaptive clocking scheme significantly reduces the adaptation latency to provide resilience to fast on-chip supply noise and reduce margins. To accommodate the tight energy constraints of always-on IoT workloads, we implement a multi-cycle SRAM read scheme that allows the memory voltage to scale at iso-throughput, improving energy efficiency across the entire operating range. The wide operating range allows for high performance at 1.36 GHz, low power consumption down to 750 μW, and state-of-the-art raw efficiency at 16-bit precision of 750 GOPS/W dense or 1.81 TOPS/W sparse.
Paul Whatmough, Sae Kyu Lee, Marco Donato, Hsea-Ching Hsueh, Sam Xi, Udit Gupta, Lillian Pentecost, Glenn Ko, David Brooks, and Gu-Yeon Wei. 6/2019. “A 16nm 25mm² SoC with a 54.5x Flexibility-Efficiency Range from Dual-Core Arm Cortex-A53 to eFPGA and Cache-Coherent Accelerators.” Symposium on VLSI Circuits.
This paper presents a 25mm² SoC in 16nm FinFET technology targeting flexible acceleration of compute-intensive kernels in DNN, DSP and security algorithms. The SoC includes an always-on sub-system, a dual-core Arm A53 CPU cluster, an embedded FPGA array, and a quad-core cache-coherent accelerator cluster. Measurement results demonstrate the following observations: 1) moving DSP/cryptography kernels from A53 to eFPGA increases energy efficiency by 5.5×–28.9×, 2) the use of cache coherency for datapath accelerators increases throughput by 2.94×, and 3) the accelerator flexibility-efficiency (GOPS/W) range spans from 3.1× (A53+SIMD), to 16.5× (eFPGA), to 54.5× (CCA) compared to the dual-core CPU baseline on comparable tasks. The energy per inference on MobileNet-128 CNN shows a peak improvement of 47.6×.
Yu Wang, Victor Lee, Gu-Yeon Wei, and David Brooks. 1/1/2019. “Predicting New Workload or CPU Performance by Analyzing Public Datasets.” ACM Transactions on Architecture and Code Optimization (TACO), 15, 4, Pp. 53:1–53:21.
The marketplace for general-purpose microprocessors offers hundreds of functionally similar models, differing by traits like frequency, core count, cache size, memory bandwidth, and power consumption. Their performance depends not only on microarchitecture, but also on the nature of the workloads being executed. Given a set of intended workloads, the consumer needs both performance and price information to make rational buying decisions. Many benchmark suites have been developed to measure processor performance, and their results for large collections of CPUs are often publicly available. However, repositories of benchmark results are not always helpful when consumers need performance data for new processors or new workloads. Moreover, the aggregate scores for benchmark suites designed to cover a broad spectrum of workload types can be misleading. To address these problems, we have developed a deep neural network (DNN) model, and we have used it to learn the relationship between the specifications of Intel CPUs and their performance on the SPEC CPU2006 and Geekbench 3 benchmark suites. We show that we can generate useful predictions for new processors and new workloads. We also cross-predict the two benchmark suites and compare their performance scores. The results quantify the self-similarity of these suites for the first time in the literature. This work should discourage consumers from basing purchasing decisions exclusively on Geekbench 3, and it should encourage academics to evaluate research using more diverse workloads than the SPEC CPU suites alone.
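A minimal sketch of the modeling approach: a small fully-connected regressor mapping CPU spec features to a benchmark score. The feature list and the three training rows are invented placeholders, not the paper's dataset or model.

    # Regress a benchmark score from CPU spec features with a small DNN
    # (placeholder features and scores; not the paper's dataset or model).
    import torch
    import torch.nn as nn

    # Hypothetical features per CPU: [GHz, cores, L3 MB, DRAM GB/s, TDP W]
    specs = torch.tensor([[3.5, 4.0, 8.0, 34.1, 91.0],
                          [2.2, 18.0, 45.0, 85.3, 145.0],
                          [3.0, 8.0, 16.0, 76.8, 95.0]])
    scores = torch.tensor([[41.2], [97.5], [63.0]])   # made-up benchmark scores

    model = nn.Sequential(nn.Linear(5, 64), nn.ReLU(),
                          nn.Linear(64, 64), nn.ReLU(),
                          nn.Linear(64, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(specs), scores)
        loss.backward()
        opt.step()
    print(model(specs).detach())                      # predictions after fitting
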
2018
Sae Kyu Lee, Paul Whatmough, Niamh Mulholland, Patrick Hansen, David Brooks, and Gu-Yeon Wei. 10/18/2018. “A wide dynamic range sparse FC-DNN processor with multi-cycle banked SRAM read and adaptive clocking in 16nm FinFET.” ESSCIRC 2018 - IEEE 44th European Solid State Circuits Conference.
Always-on classifiers for sensor data require a very wide operating range to support a variety of real-time workloads and must operate robustly at low supply voltages. We present a 16nm always-on wake-up controller with a fully-connected (FC) Deep Neural Network (DNN) accelerator that operates from 0.4-1 V. Calibration-free automatic voltage/frequency tuning is provided by tracking small non-zero Razor timing-error rates, and a novel timing-error driven sync-free fast adaptive clocking scheme provides resilience to on-chip supply voltage noise. The model access burden of neural networks is relaxed using a multicycle SRAM read, which allows memory voltage to be reduced at iso-throughput. The wide operating range allows for high performance at 1.36GHz, low-power consumption down to 750μW and state-of-the-art raw efficiency at 16-bit precision of 750 GOPS/W dense, or 1.81 TOPS/W sparse.
Rafael Garibotti, Brandon Reagen, Yakun Shao, Gu-Yeon Wei, and David Brooks. 10/2018. “Assisting High-Level Synthesis Improve SpMV Benchmark Through Dynamic Dependence Analysis.” IEEE Transactions on Circuits and Systems II: Express Briefs, 65, 10, Pp. 1440–1444.
Recent advances in high-level synthesis (HLS) have enabled an automatic means of generating register-transfer-level designs from high-level specifications without compromising performance. HLS provides substantial improvements to productivity and is a promising solution for designing future heterogeneous chips consisting of dozens of unique IP blocks (i.e., hardware accelerators). Despite their impressive capabilities, HLS tools today are commonly used to target only a small subset of workloads, i.e., ones with inordinately regular control flow and memory access patterns. The challenge of achieving high-quality hardware for irregular workloads stems from HLS relying on static analysis. Static analysis is overly conservative when dealing with non-uniform memory accesses and imbalanced workloads, and when identifying the most appropriate parallelization strategy. In this brief, we propose the use of dynamic analysis to generate higher quality designs using commercial HLS tools. Our evaluations show that with dynamic dependence analysis, HLS designs achieve a 3.3× performance improvement for the sparse matrix-vector multiply benchmark.
Paul Whatmough, Sae Kyu Lee, Sam Xi, Udit Gupta, Lillian Pentecost, Marco Donato, Hsea-Ching Hsueh, David Brooks, and Gu-Yeon Wei. 10/2018. “SMIV: A 16nm SoC with Efficient and Flexible DNN Acceleration for Intelligent IoT Devices.” Hot Chips 30: A Symposium on High Performance Chips, 99, Pp. 1-1.
Emerging Internet of Things (IoT) devices necessitate system-on-chips (SoCs) that can scale from ultralow power always-on (AON) operation, all the way up to less frequent high-performance tasks at high energy efficiency. Specialized accelerators are essential to help meet these needs at both ends of the scale, but maintaining workload flexibility remains an important goal. This article presents a 25-mm² SoC in 16-nm FinFET technology which demonstrates targeted, flexible acceleration of key compute-intensive kernels spanning machine learning (ML), DSP, and cryptography. The SMIV SoC includes a dedicated AON sub-system, a dual-core Arm Cortex-A53 CPU cluster, an SoC-attached embedded field-programmable gate array (eFPGA) array, and a quad-core cache-coherent accelerator (CCA) cluster. Measurement results demonstrate: 1) 1236x power envelope, from 1.1 mW (only AON cluster), up to 1.36 W (whole SoC at maximum throughput); 2) 5.5-28.9x energy efficiency gain from offloading compute kernels from A53 to eFPGA; 3) 2.94x latency improvement using coherent memory access (CCA cluster); and 4) 55x MobileNetV1 energy per inference improvement on CCA compared to the CPU baseline. The overall flexibility-efficiency range on SMIV spans measured energy efficiencies of 1x (dual-core A53), 3.1x (A53 with SIMD), 16.5x (eFPGA), 54.9x (CCA), and 256x (AON) at a peak efficiency of 4.8 TOPS/W.
Paul Whatmough, Sae Kyu Lee, David Brooks, and Gu-Yeon Wei. 9/2018. “DNN ENGINE: A 28-nm Timing-Error Tolerant Sparse Deep Neural Network Processor for IoT Applications.” IEEE Journal of Solid-State Circuits (JSSC), 53, 9.
This paper presents a 28-nm system-on-chip (SoC) for Internet of Things (IoT) applications with a programmable accelerator design that implements a powerful fully connected deep neural network (DNN) classifier. To reach the required low energy consumption, we exploit the key properties of neural network algorithms: parallelism, data reuse, small/sparse data, and noise tolerance. We map the algorithm to a very large scale integration (VLSI) architecture based around a single-instruction, multiple-data (SIMD) datapath with hardware support to exploit data sparsity by completely eliding unnecessary computation and data movement. This approach exploits sparsity without compromising the parallel computation. We also exploit the inherent algorithmic noise tolerance of neural networks by introducing circuit-level timing violation detection to allow worst-case voltage guard-bands to be minimized. The resulting intermittent timing violations may result in logic errors, which conventionally need to be corrected. However, in lieu of explicit error correction, we cope with this by accentuating the noise tolerance of neural networks. The measured test chip achieves high classification accuracy (98.36% for the MNIST test set) while tolerating aggregate timing violation rates above 10⁻¹. The accelerator achieves a minimum energy of 0.36 μJ/inference at 667 MHz; maximum throughput at 1.2 GHz and 0.57 μJ/inference; or a 10% margined operating point at 1 GHz and 0.58 μJ/inference.
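A software analogue of the sparsity-eliding idea, assuming a plain fully-connected layer: multiply-accumulates are issued only for non-zero activations, mirroring the computation and data movement the hardware skips.

    # Skip multiply-accumulates for zero activations in a fully-connected
    # layer (software analogue of the hardware eliding logic; illustrative).
    import numpy as np

    def sparse_fc(x, W, b):
        y = b.copy()
        for i in np.flatnonzero(x):                   # only non-zero activations do work
            y += x[i] * W[i]                          # accumulate one activation's weight row
        return y

    x = np.maximum(np.random.randn(256), 0)           # ReLU output, roughly half zeros
    W = 0.1 * np.random.randn(256, 64)
    b = np.zeros(64)
    print("skipped", 256 - np.count_nonzero(x), "of 256 activations")
    assert np.allclose(sparse_fc(x, W, b), x @ W + b)
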
Mario Lok, Elizabeth Farrell Helbling, Xuan Zhang, Robert Wood, David Brooks, and Gu-Yeon Wei. 4/2018. “A Low Mass Power Electronics Unit to Drive Piezoelectric Actuators for Flying Microrobots.” IEEE Transactions on Power Electronics, 33, 4, Pp. 3180–3191.
This paper presents a power electronics design for the piezoelectric actuators of an insect-scale flapping-wing robot, the RoboBee. The proposed design outputs four high-voltage drive signals tailored for the two bimorph actuators of the RoboBee in an alternating drive configuration. It utilizes fully integrated drive stage circuits with a novel high-side gate driver to save chip area and meet the strict mass constraint of the RoboBee. Compared with previous integrated designs, it also boosts efficiency in delivering energy to the actuators and recovering unused energy by applying three power-saving techniques: dynamic common-mode adjustment, envelope tracking, and charge sharing. Using this design to energize four 15 nF capacitor loads with a 200 V, 100 Hz drive signal while tracking the control commands recorded from an actual flight experiment for the robot, we measure an average power consumption of 290 mW.
Yuhao Zhu, Gu-Yeon Wei, and David Brooks. 2/15/2018. “Cloud no longer a silver bullet, edge to the rescue.” arXiv preprint arXiv:1802.05943.
This paper takes the position that, while cognitive computing today relies heavily on the cloud, we will soon see a paradigm shift where cognitive computing primarily happens on network edges. The shift toward edge devices is fundamentally propelled both by technological constraints in data centers and wireless network infrastructures and by practical considerations such as privacy and safety. The remainder of this paper lays out our view of how these constraints will impact future cognitive computing. Bringing cognitive computing to edge devices opens up several new opportunities and challenges, some of which demand new solutions and some of which require us to revisit entrenched techniques in light of new technologies. We close the paper with a call to action for future research.
2017
Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy Jones, Gu-Yeon Wei, and David Brooks. 12/2017. “Automatically accelerating non-numerical programs by architecture-compiler co-design.” Communications of the ACM, 60, 12, Pp. 88–97.
Because of the high cost of communication between processors, compilers that parallelize loops automatically have been forced to skip a large class of loops that are both critical to performance and rich in latent parallelism. HELIX-RC is a compiler/microprocessor co-design that opens those loops to parallelization by decoupling communication from thread execution in conventional multicore architectures. Simulations of HELIX-RC, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.
Simon Chaput, David Brooks, and Gu-Yeon Wei. 10/5/2017. “An area-efficient 8-bit single-ended ADC with extended input voltage range.” IEEE Transactions on Circuits and Systems II: Express Briefs, 65, 11, Pp. 1549–1553.
This brief presents an 8-bit successive approximation register analog-to-digital converter (ADC) implemented within a system-on-chip (SoC) for autonomous flapping-wing microrobots. The ADC implements hybrid split-capacitor sub-digital-to-analog converter (DAC) techniques to achieve a 35.72% improvement in capacitor bank energy-area product. The device also implements an extended single-ended input voltage range allowing a direct connection to sensors while maintaining low-power operation. This technique allows a 51.7% DAC switching energy reduction compared to the state of the art. The SoC, fabricated in 40-nm CMOS, includes four parallel 0.001 mm² 1 MS/s ADC cores multiplexed across 13 input ports. It enables a 0 to 1.8 V input range while operating from a 0.9 V supply. At 1 MS/s, the ADC achieves a signal-to-noise and distortion ratio of 45.6 dB for a 1.6-Vpp input signal and consumes 10.4 μW.
Xuan Zhang, Mario Lok, Tao Tong, Sae Kyu Lee, Brandon Reagen, Pierre-Emile Duhamel, Robert Wood, David Brooks, and Gu-Yeon Wei. 6/12/2017. “A Fully Integrated Battery-Powered System-on-Chip in 40-nm CMOS for Closed-Loop Control of Insect-Scale Pico-Aerial Vehicle.” IEEE Journal of Solid-State Circuits, 52, 9, Pp. 2374–2387.
We demonstrate a fully integrated system-on-chip (SoC) optimized for insect-scale flapping-wing pico-aerial vehicles. The SoC is able to meet the stringent weight, power, and real-time performance demands of autonomous flight for a bee-sized robot. The entire integrated system with embedded voltage regulation, data conversion, clock generation, as well as both general-purpose and accelerated computing units, weighs less than 3 mg after die thinning. It is self-contained and can be powered directly off of a lithium battery. Measured results show open-loop wing flapping controlled by the SoC and improved energy efficiency through the use of hardware acceleration and supply resilience through the use of adaptive clocking.
Sae Kyu Lee, Tao Tong, Xuan Zhang, David Brooks, and Gu-Yeon Wei. 4/1/2017. “A 16-Core Voltage-Stacked System With Adaptive Clocking and an Integrated Switched-Capacitor DC–DC Converter.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25, 4, Pp. 1271–1284.
This paper presents a 16-core voltage-stacked system with adaptive frequency clocking (AFClk) and a fully integrated voltage regulator that demonstrates efficient on-chip power delivery for multicore systems. Voltage stacking alleviates power delivery inefficiencies due to off-chip parasitics but adds complexity in the form of internal voltage noise. To address this noise, the system utilizes an AFClk scheme with an efficient switched-capacitor dc-dc converter to mitigate noise on the stack layers and to improve system performance and efficiency. Experimental results demonstrate robust voltage noise mitigation as well as the potential of voltage stacking as a highly efficient power delivery scheme. This paper also illustrates that augmenting the hardware techniques with intelligent workload allocation that exploits the inherent properties of voltage stacking can preemptively reduce the interlayer activity mismatch and improve system efficiency.
