Sae Kyu Lee, Tao Tong, Xuan Zhang, David Brooks, and Gu Wei. 4/1/2017. “
A 16-Core Voltage-Stacked System With Adaptive Clocking and an Integrated Switched-Capacitor DC–DC Converter.” IEEE Transactions on VLSI, 25, 4, Pp. 1271-1284.
Abstract: This paper presents a 16-core voltage-stacked system with adaptive frequency clocking (AFClk) and a fully integrated voltage regulator that demonstrates efficient on-chip power delivery for multicore systems. Voltage stacking alleviates power delivery inefficiencies due to off-chip parasitics but adds complexity to combat internal voltage noise. To address the corresponding issue of internal voltage noise, the system utilizes an AFClk scheme with an efficient switched-capacitor dc-dc converter to mitigate noise on the stack layers and to improve system performance and efficiency. Experimental results demonstrate robust voltage noise mitigation as well as the potential of voltage stacking as a highly efficient power delivery scheme. This paper also illustrates that augmenting the hardware techniques with intelligent workload allocation that exploits the inherent properties of voltage stacking can preemptively reduce the interlayer activity mismatch and improve system efficiency.
Svilen Kanev, Sam Xi, Gu Wei, and David Brooks. 4/2017. “
Mallacc: Accelerating Memory Allocation.” In International Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2nd ed., 5: Pp. 33-45.
Abstract: Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc is quite different from traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput and area overheads must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 μm² of silicon area, less than 0.006% of a typical high-performance processor core.
Yuhao Zhu and Vijay Reddi. 3/2/2017. “
Cognitive computing safety: The new horizon for reliability.” IEEE Micro, 37, Pp. 15–21.
Abstract: This column includes two invited position papers about the challenges and opportunities in cognitive architectures.
Paul Whatmough, Sae Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu Wei. 2/9/2017. “
A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications.” In International Solid-State Circuits Conference. San Francisco, CA, USA.
Abstract: This paper presents a 28nm SoC with a programmable FC-DNN accelerator design that demonstrates: (1) HW support to exploit data sparsity by eliding unnecessary computations (4× energy reduction); (2) improved algorithmic error tolerance using sign-magnitude number format for weights and datapath computation; (3) improved circuit-level timing violation tolerance in datapath logic via time-borrowing; (4) combined circuit and algorithmic resilience with Razor timing violation detection to reduce energy via VDD scaling or increase throughput via FCLK scaling; and (5) high classification accuracy (98.36% for MNIST test set) while tolerating aggregate timing violation rates >10⁻¹. The accelerator achieves a minimum energy of 0.36μJ/pred at 667MHz, maximum throughput at 1.2GHz and 0.57μJ/pred, or a 10%-margined operating point at 1GHz and 0.58μJ/pred.
Paul N. Whatmough, Sae Lee, Hyunkwang Lee, Saketh Rama, David Brooks, and Gu Wei. 2/5/2017. “
14.3 A 28nm SoC with a 1.2GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications.” In 2017 IEEE International Solid-State Circuits Conference (ISSCC), Pp. 242–243. IEEE.
Abstract: This paper presents a 28nm SoC with a programmable FC-DNN accelerator design that demonstrates: (1) HW support to exploit data sparsity by eliding unnecessary computations (4× energy reduction); (2) improved algorithmic error tolerance using sign-magnitude number format for weights and datapath computation; (3) improved circuit-level timing violation tolerance in datapath logic via time-borrowing; (4) combined circuit and algorithmic resilience with Razor timing violation detection to reduce energy via VDD scaling or increase throughput via FCLK scaling; and (5) high classification accuracy (98.36% for MNIST test set) while tolerating aggregate timing violation rates >10⁻¹. The accelerator achieves a minimum energy of 0.36μJ/pred at 667MHz, maximum throughput at 1.2GHz and 0.57μJ/pred, or a 10%-margined operating point at 1GHz and 0.58μJ/pred.
Simon Chaput, David Brooks, and Gu Wei. 2/2/2017. “
21.5 A 3-to-5V input 100Vpp output 57.7mW 0.42% THD+N highly integrated piezoelectric actuator driver.” In 2017 IEEE International Solid-State Circuits Conference (ISSCC), Pp. 360–361. San Francisco, CA, USA: IEEE.
Abstract: Piezoelectric actuators are used in a growing range of applications, e.g., haptic feedback systems, cooling fans, and microrobots. However, to fully realize their potential, these actuators require drivers able to efficiently generate high-voltage (>100Vpp) low-frequency (<300Hz) analog waveforms from a low-voltage source (3-to-5V) with a small form factor. Certain applications, such as piezoelectric (PZT) cooling fans, further demand low-distortion waveforms (THD+N < 1%) to minimize sound emission from the actuator. Existing solutions for small PZT drivers typically rely on designs comprising a power converter to step up a low voltage followed by a high-voltage amplifier [1,2,3]. Although envelope tracking can help reduce amplifier power [3], none of these designs can recover the energy stored on the actuator to maximize efficiency. And while a differential bidirectional flyback converter [4] can recover energy, it requires four inductors, thereby incurring a large size penalty. This paper introduces a single-inductor, highly integrated, bidirectional, high-voltage actuator driver that achieves 12.6× lower power and 2.1× lower THD+N at a similar size to the currently available state-of-the-art solution [1]. Measured results from an IC prototype demonstrate 200Hz sinusoidal waveforms up to 100Vpp with 0.42% THD+N from a 3.6V source while dissipating 57.7mW to drive a 150nF capacitor. Beyond PZT actuators, the IC can also drive any type of capacitive load, e.g., electrostatic and electroactive polymer actuators.
Robert Adolf, Saketh Rama, Brandon Reagen, Gu Wei, and David Brooks. 2017. “
The Design and Evolution of Deep Learning Workloads.” IEEE Micro, 37, 1, Pp. 18–21.