Publications by Type: Journal Article

Mark Hempstead, David Brooks, and Gu Wei. 7/2011. “An Accelerator-Based Wireless Sensor Network Processor in 130 nm CMOS.” Emerging and Selected Topics in Circuits and Systems, IEEE Journal on, 1, 2, Pp. 193–202.
Networks of ultra-low-power nodes capable of sensing, computation, and wireless communication have applications in medicine, science, industrial automation, and security. Reducing power consumption requires the development of system-on-chip implementations that must provide both energy efficiency and adequate performance to meet the demands of the long deployment lifetimes and bursts of computation that characterize wireless sensor network (WSN) applications. Therefore, this work argues that designers should evaluate the design in terms of average power for an entire workload, including active and idle periods, not just the metric of energy-per-instruction.
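The average-power metric argued for above can be illustrated with a small calculation. This is a minimal sketch, not taken from the paper: the function name `average_power`, its parameters, and the example numbers are all hypothetical, and it assumes a workload that simply alternates between one active phase and one idle phase.

```python
def average_power(active_power_w, idle_power_w, active_time_s, idle_time_s):
    """Return average power in watts over one active/idle cycle of a workload."""
    period_s = active_time_s + idle_time_s
    if period_s <= 0:
        raise ValueError("workload period must be positive")
    # Total energy is the sum of the energy spent in each phase.
    energy_j = active_power_w * active_time_s + idle_power_w * idle_time_s
    return energy_j / period_s


# Hypothetical node: 1 mW while active for 1 s, 1 uW while idle for 99 s.
print(average_power(1e-3, 1e-6, 1.0, 99.0))
```

With a duty cycle this low, the idle term contributes a visible share of the total, which is why energy-per-instruction alone can be misleading for such workloads.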
Kevin Brownell, Ali Khan, Gu Wei, and David Brooks. 3/1/2011. “Automating design of voltage interpolation to address process variations.” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 99, Pp. 1–14.

Post-fabrication tuning provides a promising design approach to mitigate the performance and power overheads of process variation in advanced fabrication technologies. This paper explores design considerations and VLSI-CAD support for a recently proposed post-fabrication tuning knob called voltage interpolation. Successful implementation of this technique requires examination of the design tradeoffs between circuit tuning range and static power overheads within the synthesis flow of the design process, in addition to the implications of place and route. Results from the exploration of the scheme for a 64-core chip-multiprocessor machine using industrial-grade design blocks show that the scheme can be used to mitigate overhead arising from random and correlated within-die process variations. A design using voltage interpolation can match the nominal delay target with a 16% power cost, or for the same power budget, incur only a 13% delay overhead after variations.

Vijay Reddi, Svilen Kanev, Wonyoung Kim, Simone Campanoni, Michael Smith, Gu Wei, and David Brooks. 1/2011. “Voltage Noise in Production Processors.” IEEE Micro, 31, 1.
Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise characteristics of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they characterize the implications of resilient architecture design for voltage variation in future systems.
Benjamin Lee and David Brooks. 9/2010. “Applied inference: Case studies in microarchitectural design.” ACM Transactions on Architecture and Code Optimization (TACO), 7, 2, Pp. 8.

We propose and apply a new simulation paradigm for microarchitectural design evaluation and optimization. This paradigm enables more comprehensive design studies by combining spatial sampling and statistical inference. Specifically, this paradigm (i) defines a large, comprehensive design space, (ii) samples points from the space for simulation, and (iii) constructs regression models based on sparse simulations. This approach greatly improves the computational efficiency of microarchitectural simulation and enables new capabilities in design space exploration.

We illustrate new capabilities in three case studies for a large design space of approximately 260,000 points: (i) Pareto frontier, (ii) pipeline depth, and (iii) multiprocessor heterogeneity analyses. In particular, regression models are exhaustively evaluated to identify Pareto optimal designs that maximize performance for given power budgets. These models enable pipeline depth studies in which all parameters vary simultaneously with depth, thereby more effectively revealing interactions with nondepth parameters. Heterogeneity analysis combines regression-based optimization with clustering heuristics to identify efficient design compromises between similar optimal architectures. These compromises are potential core designs in a heterogeneous multicore architecture. Increasing heterogeneity can improve BIPS³/W efficiency by as much as 2.4×, a theoretical upper bound on heterogeneity benefits that neglects contention between shared resources as well as design complexity. Collectively these studies demonstrate regression models' ability to expose trends and identify optima in diverse design regions, motivating the application of such models in statistical inference for more effective use of modern simulator infrastructure.
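The Pareto-frontier step described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the function name `pareto_frontier`, the tuple layout, and the sample design points are assumptions, and the power and performance values stand in for whatever a fitted regression model would predict.

```python
def pareto_frontier(designs):
    """Return the designs that no other design beats on both power and performance.

    designs: iterable of (name, predicted_power, predicted_performance),
    where lower power and higher performance are better.
    """
    frontier = []
    best_perf = float("-inf")
    # Visit designs from cheapest to most expensive; at equal power,
    # the higher-performance design comes first.
    for name, power, perf in sorted(designs, key=lambda d: (d[1], -d[2])):
        # A design is kept only if it outperforms every cheaper design.
        if perf > best_perf:
            frontier.append((name, power, perf))
            best_perf = perf
    return frontier


# Hypothetical predicted points (name, watts, BIPS).
points = [("a", 10, 1.0), ("b", 20, 3.0), ("c", 15, 0.8), ("d", 20, 2.5), ("e", 30, 3.0)]
print([name for name, _, _ in pareto_frontier(points)])
```

Reading the frontier from left to right then gives the best predicted performance available under each power budget.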

Vijay Reddi, Simone Campanoni, Meeta Gupta, Michael Smith, Gu Wei, David Brooks, and Kim Hazelwood. 9/2010. “Eliminating voltage emergencies via software-guided code transformations.” ACM Transactions on Architecture and Code Optimization (TACO), 7, 2, Pp. 1-28.
In recent years, circuit reliability in modern high-performance processors has become increasingly important. Shrinking feature sizes and diminishing supply voltages have made circuits more sensitive to microprocessor supply voltage fluctuations. These fluctuations result from the natural variation of processor activity as workloads execute, but when left unattended, these voltage fluctuations can lead to timing violations or even transistor lifetime issues. In this paper, we present a hardware-software collaborative approach to mitigate voltage fluctuations. A checkpoint-recovery mechanism rectifies errors when voltage violates maximum tolerance settings, while a run-time software layer reschedules the program’s instruction stream to prevent recurring violations at the same program location. The run-time layer, combined with the proposed code rescheduling algorithm, removes 60% of all violations with minimal overhead, thereby significantly improving overall performance. Our solution is a radical departure from the ongoing industry standard approach to circumvent the issue altogether by optimizing for the worst case voltage flux, which compromises power and performance efficiency severely, especially looking ahead to future technology generations. Existing conservative approaches will have severe implications on the ability to deliver efficient microprocessors. The proposed technique reassembles a traditional reliability problem as a runtime performance optimization problem, thus allowing us to design processors for typical case operation by building intelligent algorithms that can prevent recurring violations.
Benton Calhoun and David Brooks. 7/2010. “Can Subthreshold and Near-Threshold Circuits Go Mainstream?” Micro, IEEE, 30, 4, Pp. 80–85.
Recent research has shown the potential benefits of subthreshold or near-threshold operation, which gives up a substantial degree of speed in order to reduce energy per operation. This is an excellent trade-off for many tasks, such as cyberphysical systems. This prolegomenon summarizes the benefits and challenges of subthreshold or near-threshold operation.
Michael Lyons, Mark Hempstead, Gu Wei, and David Brooks. 2/2010. “The Accelerator Store framework for high-performance, low-power accelerator-based systems.” IEEE Computer Architecture Letters, 9, 2, Pp. 53-56.
Hardware acceleration can increase performance and reduce energy consumption. To maximize these benefits, accelerator-based systems that emphasize computation on accelerators (rather than on general purpose cores) should be used. We introduce the “accelerator store,” a structure for sharing memory between accelerators in these accelerator-based systems. The accelerator store simplifies accelerator I/O and reduces area by mapping memory to accelerators when needed at runtime. Preliminary results demonstrate a 30% system area reduction with no energy overhead and less than 1% performance overhead in contrast to conventional DMA schemes.
Vijay Reddi, Meeta Gupta, Glenn Holloway, Michael Smith, Gu Wei, and David Brooks. 1/2010. “Predicting voltage droops using recurring program and microarchitectural event activity.” IEEE Micro, 30, 1.
Shrinking feature size and diminishing supply voltage are making circuits more sensitive to supply voltage fluctuations within a microprocessor. If left unattended, voltage fluctuations can lead to timing violations or even transistor lifetime issues. A mechanism that dynamically learns to predict dangerous voltage fluctuations based on program and microarchitectural events can help steer the processor clear of danger.
Ankur Agrawal, Andrew Liu, Pavan Kumar Hanumolu, and Gu-Yeon Wei. 8/2009. “An 8×5 Gb/s Parallel Receiver With Collaborative Timing Recovery.” IEEE Journal of Solid-State Circuits, 44, 11, Pp. 3120–3130.
This paper presents the design of an 8-channel, 5 Gb/s per channel parallel receiver with collaborative timing recovery and no forwarded clock. The receiver architecture exploits synchrony in the transmitted data streams in a parallel interface and combines error information from multiple phase detectors in the receiver to produce one global synthesized clock. This collaborative timing recovery scheme enables wideband jitter tracking without increasing the dithering jitter in the synthesized clock. Circuit design techniques employed to implement this receiver architecture are discussed. Experimental results from a 130 nm CMOS test chip demonstrate the enhanced tracking bandwidth and lower dithering jitter of the recovered clock.
Lukasz Strozek and David Brooks. 3/2009. “Energy- and area-efficient architectures through application clustering and architectural heterogeneity.” ACM Transactions on Architecture and Code Optimization (TACO), 6, 1, Pp. 4.

Customizing architectures for particular applications is a promising approach to yield highly energy-efficient designs for embedded systems. This work explores the benefits of architectural customization for a class of embedded architectures typically used in energy- and area-constrained application domains, such as sensor nodes and multimedia processing. We implement a process flow that performs an automatic synthesis and evaluation of the different architectures based on runtime profiles of applications and determines an efficient architecture, with consideration for both energy and area constraints. An expressive architectural model, used by our engine, is introduced that takes advantage of efficient opcode allocation, several memory addressing modes, and operand types. By profiling embedded benchmarks from a variety of sensor and multimedia applications, we show that the energy savings resulting from various architectural optimizations relative to the base architectures (e.g., MIPS and MSP430) are significant and can reach 50%, depending on the application. We then identify the set of architectures that achieves near-optimal savings for a group of applications. Finally, we propose the use of heterogeneous ISA processors implementing those architectures as a solution to capitalize on energy savings provided by application customization while executing a range of applications efficiently.

Vijay Reddi, Meeta Gupta, Krishna Rangan, Simone Campanoni, Glenn Holloway, Michael Smith, Gu Wei, and David Brooks. 1/2009. “Voltage noise: Why it’s bad, and what to do about it.” 5th IEEE Workshop on Silicon Errors in Logic-System Effects (SELSE), Palo Alto, CA.
Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose hardware-software collaboration to enable aggressive voltage margins: a fail-safe hardware mechanism tolerates margin violations in order to train a run-time software layer that reschedules instructions to avoid recurring violations. Additionally, the software controls an emergency signature-based predictor that throttles to suppress emergencies that code rescheduling cannot eliminate.
Mark Hempstead, Gu Wei, and David Brooks. 2009. “Navigo: An early-stage model to study power-constrained architectures and specialization.” Workshop on Modeling, Benchmarking, and Simulation.
As the number of transistors doubles, it becomes difficult to power all of them within a strict power budget and still achieve the performance gains that the industry has achieved historically. This work presents Navigo, a modeling framework for architecture exploration across future process technology generations. The model includes support for voltage and frequency scaling based on ITRS and PTM models. This work is designed to aid architects in the planning stages of next-generation microprocessors by addressing the space between early-stage back-of-the-envelope calculations and later-stage cycle-accurate simulators. Using parameters from existing commercial processor cores, we show how power consumption limits the theoretical throughput of future processors. Navigo shows that specialization is the answer to circumvent the power density limit that curbs performance gains and resume traditional 1.58x performance growth trends. We present an analysis, using the next generation of process technologies, showing that the fraction of area that must be allocated for specialization to maintain performance growth must increase with each new generation of process technology.
Ruwan Ratnayake, Aleksandar Kavcic, and Gu Wei. 9/16/2008. “A high-throughput maximum a posteriori probability detector.” IEEE Journal of Solid-State Circuits, 43, 8, Pp. 1846–1858.
This paper presents a maximum a posteriori probability (MAP) detector, based on a forward-only algorithm that can achieve high throughputs. The MAP algorithm is optimal in terms of bit error rate (BER) performance and, with Turbo decoding, can approach performance close to the channel capacity limit. The proposed detector utilizes a deep-pipelined architecture implemented in skew-tolerant domino, and experimentally measured results verify the detector can achieve throughputs greater than 750 MHz while consuming 2.4 W. The detector is implemented in a 0.13 μm CMOS technology and has a die area of 9.9 mm².
Helal M, Z Straayer, Gu Wei, and Perroth H. 4/2008. “A highly digital MDLL-based clock multiplier that leverages a self-scrambling time-to-digital converter to achieve subpicosecond jitter performance.” IEEE Journal of Solid-State Circuits, 43, 4, Pp. 855–863.
This paper presents a mostly digital multiplying delay-locked loop (MDLL) architecture that leverages a new time-to-digital converter (TDC) and a correlated double-sampling technique to achieve subpicosecond jitter performance. The key benefit of the proposed structure is that it provides a highly digital technique to reduce deterministic jitter in the MDLL output with low sensitivity to mismatch and offset in the associated tuning circuits. The TDC structure, which is based on a gated ring oscillator (GRO), is expected to benefit other PLL/DLL applications as well due to the fact that it scrambles and first-order noise shapes its associated quantization noise. Measured results are presented of a custom MDLL prototype that multiplies a 50 MHz reference frequency to 1.6 GHz with 928 fs rms jitter performance. The prototype consists of two 0.13 μm integrated circuits, which have a combined active area of 0.06 mm² and a combined core power of 5.1 mW, in addition to an FPGA board, a discrete DAC, and a simple RC filter.
Mark Hempstead, Michael Lyons, David Brooks, and Gu Wei. 4/2008. “Survey of hardware systems for wireless sensor networks.” Journal of Low Power Electronics, 4, 1, Pp. 11–20.
Wireless sensor networks have been gaining interest as a platform that changes how we interact with the physical world. Applications in medicine, military, inventory management, structural and environmental monitoring, and the like can benefit from low-power wireless nodes that communicate data collected via a variety of sensors. Current deployments of wireless sensor networks (WSN) rely on off-the-shelf commodity-based microcontrollers, but the unoptimized energy consumption of these systems can limit the effective lifetimes. Ideally, researchers would like to deeply embed wireless sensor network nodes in the physical world, relying on energy scavenged from the ambient environment. This paper provides a survey of ultra low power processors specifically designed for WSN applications that have begun to emerge from research labs, which require detailed understanding of tradeoffs between application space, architecture, and circuit techniques to implement these low-power systems.
Hanumolu Kumar, Kratyuk Volodymyr, Gu Wei, and Moon Un-Ku. 2/2008. “A sub-picosecond resolution 0.5–1.5 GHz digital-to-phase converter.” IEEE Journal of Solid-State Circuits, 43, 2, Pp. 414–424.
A digital-to-phase converter (DPC) is an essential building block in applications such as source-synchronous interfaces and digital phase modulators. The resolution of DPCs using analog phase interpolators is severely affected by the operating frequency and rise times of the interpolator inputs. In this paper, we present a new DPC architecture that achieves high resolution independent of both the operating frequency and the rise time. The 8 phases generated by a phase-locked loop are dithered using a delta-sigma modulator to shape the truncation error to high frequency and is subsequently filtered using a delay-locked loop phase filter. The test chip, fabricated in a 0.13 μm CMOS process, operates from 0.5–1.5 GHz and achieves a differential nonlinearity of less than ±0.1 ps and an integral nonlinearity of ±12 ps. The total power consumption while operating at 1 GHz is 15 mW.
Hanumolu Kumar, Gu Wei, and Moon Ku. 2/2008. “A wide-tracking range clock and data recovery circuit.” IEEE Journal of Solid-State Circuits, 43, 2, Pp. 425–439.
A hybrid analog-digital quarter-rate clock and data recovery circuit (CDR) that achieves a wide-tracking range and excellent frequency and phase tracking resolution is presented in this paper. A split-tuned analog phase-locked loop (PLL) provides eight equally spaced phases needed for quarter-rate data recovery and the digital CDR loop adjusts the phase of the PLL output clocks in a precise manner to facilitate plesiochronous clocking. The CDR employs a second-order digital loop filter and combines delta-sigma modulation with the analog PLL to achieve sub-picosecond phase resolution and better than 2 ppm frequency resolution. A test chip fabricated in a 0.18 μm CMOS process achieves BER < 10⁻¹² and consumes 14 mW power while operating at 2 Gb/s. The tracking range is greater than ±5000 ppm and ±2500 ppm at 10 kHz and 20 kHz modulation frequencies, respectively, making this CDR suitable for systems employing spread-spectrum clocking.
Xiaoyao Liang, Ramon Canal, Gu Wei, and David Brooks. 1/2008. “Replacing 6T SRAMs with 3T1D DRAMs in the L1 data cache to combat process variability.” Micro, IEEE, 28, 1, Pp. 60–68.
With continued technology scaling, process variations will be especially detrimental to six-transistor static memory structures (6T SRAMs). A memory architecture using three-transistor, one-diode DRAM (3T1D) cells in the L1 data cache tolerates wide process variations with little performance degradation, making it a promising choice for on-chip cache structures for next-generation microprocessors.
Michael Lyons and David Brooks. 2008. “Application-Specific Hardware Design for Wireless Sensor Network Energy and Delay Reduction.” Workshop on Optimizations for DSP and Embedded Systems (ODES).
Battery-powered embedded systems, such as wireless sensor network (WSN) motes, require low energy usage to extend system lifetime. WSN motes must power sensors, a processor, and a radio for wireless communication over long periods of time, and are therefore particularly sensitive to energy use. Recent techniques for reducing WSN energy consumption, such as aggregation, require additional computation to reduce the cost of sending data by minimizing radio data transmissions. Larger demands on the processor will require more computational energy, but traditional energy reduction approaches, such as multi-core scaling with reduced frequency and voltage, may prove heavy-handed and ineffective for motes. Instead, application-specific hardware design (ASHD) architectures can reduce computational energy consumption by processing operations common to specific applications more efficiently than a general purpose processor. By the nature of their deeply embedded operation, motes support a limited set of applications, and thus the conventional general purpose computing paradigm may not be well-suited to mote operation. Both simple and complex operations can improve performance and use orders of magnitude less energy with application-specific hardware. This paper examines the design considerations of a hardware accelerator for compressed Bloom filters, a data structure for efficiently storing set membership. Additionally, we evaluate our ASHD for three representative wireless sensor network applications: monitoring network-wide mote status, object tracking, and on-mote duplicate packet filtering. We demonstrate that ASHD reduces network latency by 59% and computational energy by 98%, and show the need for architecting processors for ASHD accelerators.
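For readers unfamiliar with the data structure named above, a plain Bloom filter can be sketched in software as follows. This is a minimal sketch under stated assumptions: it is an ordinary (uncompressed) Bloom filter, not the compressed variant or the hardware accelerator the paper evaluates, and the class name, default sizes, and the SHA-256-based hashing are illustrative choices rather than anything from the source.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with an index byte.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Hypothetical duplicate-packet check on a mote.
seen = BloomFilter()
seen.add(b"packet-17")
print(b"packet-17" in seen)
```

A membership test that returns false guarantees the packet has not been seen, so a filter like this can discard duplicates without storing every packet identifier.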
Xiaoyao Liang, Benjamin Lee, Gu Wei, and David Brooks. 2008. “Design and Test Strategies for Microarchitectural PostFabrication.”

Process variations are a major hurdle for continued technology scaling. Both systematic and random variations will affect the critical delay of fabricated chips, causing a wide frequency and power distribution. Tuning techniques adapt the microarchitecture to mitigate the impact of variations at post-fabrication testing time. This paper proposes a new post-fabrication testing framework that accounts for testing costs. This framework uses on-chip canary circuits to capture systematic variation while using statistical analysis to estimate random variation. We derive regression models to predict chip performance and power. These techniques comprise an integrated framework that identifies the most energy efficient post-fabrication tuning configuration for each chip.
