Publications by Type: Conference Paper

2007
Meeta Gupta, Krishna Rangan, Michael Smith, Gu Wei, and David Brooks. 8/27/2007. “Towards a software approach to mitigate voltage emergencies.” In Low Power Electronics and Design (ISLPED), 2007 ACM/IEEE International Symposium on, Pp. 123–128. IEEE.
Increases in peak current draw and reductions in the operating voltages of processors continue to amplify the importance of dealing with voltage fluctuations in processors. One approach suggested has been to not only react to these fluctuations but also attempt to eliminate future occurrences of these fluctuations by dynamically modifying the executing program. This paper investigates the potential of a very simple dynamic scheme to appreciably reduce the number of run-time voltage emergencies. It shows that we can map many of the voltage emergencies in the execution of the SPEC benchmarks on an aggressive superscalar design to a few static loops, categorize the microarchitectural cause of the emergencies in each important loop through simple observations and a simple priority function, and finally apply straightforward software optimization strategies to mitigate up to 70% of the future voltage swings.
Helal M, Straayer Z, Gu Wei, and Perrott H. 6/14/2007. “A low jitter 1.6 GHz multiplying DLL utilizing a scrambling time-to-digital converter and digital correlation.” In 2007 IEEE Symposium on VLSI Circuits, Pp. 166–167. IEEE.
This paper presents a 1.6 GHz multiplying delay-locked loop (MDLL) that leverages time-to-digital conversion and a digital correlation technique to achieve low deterministic jitter while still maintaining low random jitter. A proposed time-to-digital converter consists of a ring oscillator that is gated on and off to accurately measure time and scramble the measurement's residual error. Using a 50 MHz reference, the prototype system has measured reference spurs less than -59 dBc and an overall measured jitter of 1.41 ps.
Meeta Gupta, Jarod Oatley, Russ Joseph, Gu Wei, and David Brooks. 4/16/2007. “Understanding voltage variations in chip multiprocessors using a distributed power-delivery network.” In Design, Automation & Test in Europe Conference & Exhibition, 2007. DATE'07, Pp. 1–6. Nice, France: IEEE.
Recent efforts to address microprocessor power dissipation through aggressive supply voltage scaling and power management require that designers be increasingly cognizant of power supply variations. These variations, primarily due to fast changes in supply current, can be attributed to architectural gating events that reduce power dissipation. In order to study this problem, the authors propose a fine-grain, parameterizable model for power-delivery networks that allows system designers to study localized, on-chip supply fluctuations in high-performance microprocessors. Using this model, the authors analyze voltage variations in the context of next-generation chip-multiprocessor (CMP) architectures using both real applications and synthetic current traces. They find that the activity of distinct cores in CMPs presents several new design challenges when considering power supply noise, and they describe potentially problematic activity sequences that are unique to CMP architectures.
Benjamin Lee, David Brooks, Bronis Supinski, Martin Schulz, Karan Singh, and Sally McKee. 3/2007. “Methods of inference and learning for performance modeling of parallel applications.” In Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, Pp. 249–258. ACM.

Increasing system and algorithmic complexity combined with a growing number of tunable application parameters pose significant challenges for analytical performance modeling. We propose a series of robust techniques to address these challenges. In particular, we apply statistical techniques such as clustering, association, and correlation analysis to understand the application parameter space better. We construct and compare two classes of effective predictive models: piecewise polynomial regression and artificial neural networks. We compare these techniques with theoretical analyses and experimental results. Overall, both regression and neural networks are accurate, with median error rates ranging from 2.2 to 10.5 percent. The comparable accuracy of these models suggests differentiating features will arise from ease of use, transparency, and computational efficiency.

Benjamin Lee and David Brooks. 2/10/2007. “Illustrative design space studies with microarchitectural regression models.” In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, Pp. 340–351. Phoenix, Arizona, USA: IEEE.
We apply a scalable approach for practical, comprehensive design space evaluation and optimization. This approach combines design space sampling and statistical inference to identify trends from a sparse simulation of the space. The computational efficiency of sampling and inference enables new capabilities in design space exploration. We illustrate these capabilities using performance and power models for three studies of a 260,000 point design space: (1) Pareto frontier analysis, (2) pipeline depth analysis, and (3) multiprocessor heterogeneity analysis. For each study, we provide an assessment of predictive error and sensitivity of observed trends to such error. We construct Pareto frontiers and find predictions for Pareto optima are no less accurate than those for the broader design space. We reproduce and enhance prior pipeline depth studies, demonstrating constrained sensitivity studies may not generalize when many other design parameters are held at constant values. Lastly, we identify efficient heterogeneous core designs by clustering per-benchmark optimal architectures. Collectively, these studies motivate the application of techniques in statistical inference for more effective use of modern simulator infrastructure.
2006
Fang Chi, Sharon Kedar, Susan Owen, Gu Wei, David Brooks, and Jonathan Lees. 12/18/2006. “System-on-chip architecture design for intelligent sensor networks.” In 2006 International Conference on Intelligent Information Hiding and Multimedia, Pp. 579–582. IEEE.
While wireless sensor networks can generically be used for a wide variety of applications, breakthrough innovations are most often achieved when driven by a genuine need or application, with its specific system-level and science-related requirements and objectives. Hence, our work focuses on the development of wireless sensor network system-on-chip devices and supporting software for volcano monitoring, which we call Sensor Network for Active Volcanoes (SNAV). In this paper we present preliminary results of our research and development work on intelligent sensor networks for monitoring hazardous environments, especially the SNAV system-on-chip design for monitoring active volcanoes.
Xiaoyao Liang and David Brooks. 12/9/2006. “Mitigating the impact of process variations on processor register files and execution units.” In Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on, Pp. 504–514. IEEE.
Design variability due to die-to-die and within-die process variations has the potential to significantly reduce the maximum operating frequency and the effective yield of high-performance microprocessors in future process technology generations. One serious manifestation of this increased variability is a reduction in the mean frequency of fabricated chips due to fluctuations in device characteristics causing reduced circuit performance. In this paper, we propose to mitigate the impact of variations through variable-latency register files and execution units, which are key architectural components that may encounter variability problems. We also illustrate the importance of closing the gap in expected delay of these distinct structures. A post-fabrication test and configuration strategy is proposed. We find that a 23% mean frequency improvement with an average IPC loss of 3% (and never exceeding 5% for worst-case chips) is possible for the 65nm technology node by properly adopting the proposed schemes.
Benjamin Lee and David Brooks. 12/2006. “Accurate and efficient regression modeling for microarchitectural performance and power prediction.” In ACM SIGOPS Operating Systems Review, 5th ed., 40: Pp. 185–194. ACM.

We propose regression modeling as an efficient approach for accurately predicting performance and power for various applications executing on any microprocessor configuration in a large microarchitectural design space. This paper addresses fundamental challenges in microarchitectural simulation cost by reducing the number of required simulations and using simulated results more effectively via statistical modeling and inference. Specifically, we derive and validate regression models for performance and power. Such models enable computationally efficient statistical inference, requiring the simulation of only 1 in 5 million points of a joint microarchitecture-application design space while achieving median error rates as low as 4.1 percent for performance and 4.3 percent for power. Although both models achieve similar accuracy, the sources of accuracy are strikingly different. We present optimizations for a baseline regression model to obtain (1) application-specific models to maximize accuracy in performance prediction and (2) regional power models leveraging only the most relevant samples from the microarchitectural design space to maximize accuracy in power prediction. Assessing sensitivity to the number of samples simulated for model formulation, we find fewer than 4,000 samples from a design space of approximately 22 billion points are sufficient. Collectively, our results suggest significant potential in accurate and efficient statistical inference for microarchitectural design space exploration via regression models.

Xiaoyao Liang and David Brooks. 11/5/2006. “Microarchitecture parameter selection to optimize system performance under process variation.” In Computer-Aided Design, 2006. ICCAD'06. IEEE/ACM International Conference on, Pp. 429–436. IEEE.

Design variability due to within-die and die-to-die process variations has the potential to significantly reduce the maximum operating frequency and the effective yield of high-performance microprocessors in future process technology generations. This variability manifests itself by increasing the number and criticality of long delay paths. To quantify this impact, we use an architectural process variation model that is appropriate for the analysis of system performance in the early stages of the design process. We propose a method of selecting microarchitectural parameters to mitigate the frequency impact due to process variability for distinct structures, while minimizing IPC (instructions-per-cycle) loss. We propose an optimization procedure to be used for system-level design decisions, and we find that joint architecture and statistical timing analysis can be more advantageous than pure circuit-level optimization. Overall, the technique can improve the 90% yield frequency by about 14% with 3% IPC loss for a baseline machine with a 20FO4 logic depth per pipestage. This approach is sensitive to the selection of processor pipeline depth, and we demonstrate that machines with aggressive pipelines will experience greater challenges in coping with process variability.

Lukasz Strozek and David Brooks. 10/22/2006. “Efficient architectures through application clustering and architectural heterogeneity.” In Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, Pp. 190–200. ACM.

Customizing architectures for particular applications is a promising approach to yield highly energy-efficient designs for embedded systems. This work explores the benefits of architectural customization for a class of embedded architectures typically used in energy-constrained application domains such as sensor node and multimedia processing. We implement a process flow that analyzes runtime profiles of applications and combines this information with a model for our architectural design space providing a robust customization engine built upon a fully automated method for determining an efficient architecture (together with appropriate application transformations). By profiling embedded benchmarks from a variety of sensor and multimedia applications, the paper shows the relative energy savings resulting from various architectural optimizations and identifies the number of architectures that achieves near-optimal savings for a group of applications. This paper proposes the use of heterogeneous chip-multiprocessors as a cost-effective approach to capitalize on the potential energy savings provided by application customization while executing a range of applications efficiently.

Mark Hempstead, Gu Wei, and David Brooks. 10/2006. “Architecture and circuit techniques for low-throughput, energy-constrained systems across technology generations.” In Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, Pp. 368–378. ACM.

Rising interest in the applications of wireless sensor networks has spurred research in the development of computing systems for low-throughput, energy-constrained applications. Unlike traditional performance-oriented applications, sensor network nodes are primarily constrained by operation lifetime, which is limited by power consumption. Advanced CMOS process technologies provide ever-increasing transistor density and improved performance characteristics. However, shrinking feature size and decreasing threshold voltages also lead to significant increases in leakage current, which is especially troublesome for applications with significant idle times. This work investigates tradeoffs between leakage and active power for low-throughput applications. We study these issues across a range of process technologies on a computing architecture that provides explicit support for fine-grain leakage-control techniques such as Vdd-gating and adaptive body bias. We present a methodology for selecting design parameters, including choice of process technology, that makes the optimal tradeoff between active power and leakage power for a given workload. Our results show that leakage power will dominate the selection of process technology, and architectures that support advanced leakage control techniques at the circuit level will be essential. We argue that without advanced low-power architectures, future nano-scale process technologies will not be suited for sensor network applications.

Hanumolu Kumar, Gyu Kim, Gu Wei, and Moon Ku. 9/10/2006. “A 1.6 Gbps digital clock and data recovery circuit.” In IEEE Custom Integrated Circuits Conference 2006, Pp. 603–606. IEEE.
A digital clock and data recovery circuit employs simple 3-level digital-to-analog converters to interface the digital loop filter to the voltage controlled oscillator and achieves low jitter performance. A test chip fabricated in a 0.13 μm CMOS process achieves BER < 10⁻¹², ±1500 ppm lock-in range, ±2500 ppm tracking range, and recovered clock jitter of 8.9 ps rms, and consumes 12 mW from a single-pin 1.2 V supply while operating at 1.6 Gbps.
TanYuan and Gu Wei. 9/10/2006. “Adaptive-bandwidth mixing PLL/DLL based multi-phase clock generator for optimal jitter performance.” In IEEE Custom Integrated Circuits Conference 2006, Pp. 749–752. San Jose, CA, USA: IEEE.
This paper presents an adaptive-bandwidth mixing PLL/DLL (MX-PDLL) based multi-phase clock generator that can operate as a PLL, DLL, or a mixture of the two. Moreover, this clock generator can be used in a proposed dual-loop CDR to minimize output clock jitter under various noise environments. A test-chip prototype of the MX-PDLL and a 360° phase rotator was fabricated in a 0.18 μm CMOS process, operating off of a 1.8 V supply. Experimentally measured results verify that while PLL-mode operation offers the ability to better filter quantization noise from the digital CDR control, shifting towards DLL-mode operation offers the ability to reduce jitter as the amount of on-chip noise increases.
TanYuan and Gu Wei. 9/10/2006. “Phase mismatch detection and compensation for PLL/DLL based multi-phase clock generator.” In IEEE Custom Integrated Circuits Conference 2006, Pp. 417–420. San Jose, CA, USA: IEEE.
Device mismatch and systematic imbalances in the physical design can cause static phase mismatch in a PLL/DLL based multi-phase clock generator and degrade performance. This problem gets worse in deep sub-micron technologies. Interleaved transceiver architectures require precise clocking to maximize data rate and minimize bit errors. In this paper, a static phase mismatch compensation scheme for multiple sampling clocks is proposed and tested in an adaptive-bandwidth mixing PLL/DLL based multi-phase clock generator. The proposed charge pump compensator and power-efficient phase-averaging network together reduce the static phase mismatch standard deviation by 37% when operating in DLL mode. A simple and robust duty-cycle correction circuit exhibits a small residual error of 0.65% across a wide range (36% to 49%) of input clock duty-cycle values.
Howard W, Gu Wei, Dally J, and Horowitz Paul. 9/10/2006. “Pulsenet-A Parallel Flash Sampler and Digital Processor IC for Optical SETI.” In IEEE Custom Integrated Circuits Conference 2006, Pp. 261–264. IEEE.

PulseNet is a full-custom IC with parallel flash ADC and digital processing that enables an all-sky optical search for extraterrestrial intelligence. It integrates 448 sense amplifiers that digitize 32 analog signals at 1GS/s, and other circuits that filter samples, store candidate signals, and perform astronomical observations. Its ~250,000 CMOS transistors (TSMC 0.25μm) dissipate 1.1W at 400MHz and 2.5V.

Hanumolu P, Wei Y, and U-K Moon. 6/15/2006. “A wide tracking range 0.2-4Gbps clock and data recovery circuit.” In 2006 Symposium on VLSI Circuits, 2006. Digest of Technical Papers., Pp. 71–72. IEEE.

A hybrid analog and digital quarter-rate clock and data recovery circuit employs a second-order digital loop filter with delta-sigma truncation to achieve sub-ps phase resolution and better than 2 ppm frequency resolution. A test chip fabricated in a 0.18 μm CMOS process achieves BER < 10⁻¹² and consumes 14 mW while operating at 2 Gbps. The tracking range is greater than ±5000 ppm and ±2500 ppm at 10 kHz and 20 kHz modulation frequencies, respectively, making this CDR suitable for systems with spread spectrum clocking.

Yingmin Li, Benjamin Lee, David Brooks, Zhigang Hu, and Kevin Skadron. 2/11/2006. “CMP design space exploration subject to physical constraints.” In High-Performance Computer Architecture, 2006. The Twelfth International Symposium on, Pp. 17–28. Austin, TX, USA: IEEE.
This paper explores the multi-dimensional design space for chip multiprocessors, exploring the inter-related variables of core count, pipeline depth, superscalar width, L2 cache size, and operating voltage and frequency, under various area and thermal constraints. The results show the importance of joint optimization. Thermal constraints dominate other physical constraints such as pin-bandwidth and power delivery, demonstrating the importance of considering thermal constraints while optimizing these other parameters. For aggressive cooling solutions, reducing power density is at least as important as reducing total power, while for low-cost cooling solutions, reducing total power is more important. Finally, the paper shows the challenges of accommodating both CPU-bound and memory-bound workloads on the same design. Their respective preferences for more cores and larger caches lead to increasingly irreconcilable configurations as area and other constraints are relaxed; rather than accommodating a happy medium, the extra resources simply encourage more extreme optimization points.
2005
Qiang Wu, Margaret Martonosi, Douglas Clark, Vijay Reddi, Dan Connors, Youfeng Wu, Jin Lee, and David Brooks. 11/12/2005. “A dynamic compilation framework for controlling microprocessor energy and performance.” In Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, Pp. 271–282. Barcelona: IEEE Computer Society.
Dynamic voltage and frequency scaling (DVFS) is an effective technique for controlling microprocessor energy and performance. Existing DVFS techniques are primarily based on hardware, OS time-interrupts, or static-compiler techniques. However, substantially greater gains can be realized when control opportunities are also explored in a dynamic compilation environment. There are several advantages to deploying DVFS and managing energy/performance tradeoffs through the use of a dynamic compiler. Most importantly, dynamic compiler driven DVFS is fine-grained, code-aware, and adaptive to the current microarchitecture environment. This paper presents a design framework of the run-time DVFS optimizer in a general dynamic compilation system. A prototype of the DVFS optimizer is implemented and integrated into an industrial-strength dynamic compilation system. The obtained optimization system is deployed in a real hardware platform that directly measures CPU voltage and current for accurate power and energy readings. Experimental results, based on physical measurements for over 40 SPEC or Olden benchmarks, show that significant energy savings are achieved with little performance degradation. SPEC2K FP benchmarks benefit with energy savings of up to 70% (with 0.5% performance loss). In addition, SPEC2K INT show up to 44% energy savings (with 5% performance loss), SPEC95 FP save up to 64% (with 4.9% performance loss), and Olden save up to 61% (with 4.5% performance loss). On average, the technique leads to an energy delay product (EDP) improvement that is 3× to 5× better than static voltage scaling, and is more than 2× (22% vs. 9%) better than the reported DVFS results of prior static compiler work. While the proposed technique is an effective method for microprocessor voltage and frequency control, the design framework and methodology described in this paper have broader potential to address other energy and power issues such as di/dt and thermal control.
Yingmin Li, Mark Hempstead, Patrick Mauro, David Brooks, Zhigang Hu, and Kevin Skadron. 8/2005. “Power and thermal effects of SRAM vs. Latch-Mux design styles and clock gating choices.” In Proceedings of the 2005 international symposium on Low power electronics and design, Pp. 173–178. ACM.

This paper studies the impact on energy efficiency and thermal behavior of design style and clock-gating style in queue and array structures. These structures are major sources of power dissipation, and both design styles and various clock gating schemes can be found in modern, high-performance processors. Although some work in the circuits domain has explored these issues from a power perspective, thermal treatments are less common, and we are not aware of any work in the architecture domain. We study both SRAM and latch and multiplexer ("latch-mux") designs and their associated clock-gating options. Using circuit-level simulations of both design styles, we derive power-dissipation ratios which are then used in cycle-level power/performance/thermal simulations. We find that even though the "unconstrained" power of SRAM designs is always better than latch-mux designs, latch-mux designs dissipate less power in practice when a structure's average occupancy is low but access rate is high, especially when "stall gating" is used to minimize switching power. We also find that latch-mux designs with stall gating are especially promising from a thermal perspective, because they exhibit lower power density than SRAM designs. Overall, when combined with implementation and verification challenges for SRAMs, latch-mux designs with stall gating appear especially promising for designs with thermal constraints. This paper also shows the importance of considering the interaction between architectural and circuit-design choices when performing early-stage design exploration.

Mark Hempstead, Nikhil Tripathi, Patrick Mauro, Gu Wei, and David Brooks. 6/4/2005. “An ultra low power system architecture for sensor network applications.” In ACM SIGARCH Computer Architecture News, 33: Pp. 208–219. Madison, WI, USA: IEEE Computer Society.
Recent years have seen a burgeoning interest in embedded wireless sensor networks with applications ranging from habitat monitoring to medical applications. Wireless sensor networks have several important attributes that require special attention to device design. These include the need for inexpensive, long-lasting, highly reliable devices coupled with very low performance requirements. Ultimately, the "holy grail" of this design space is a truly untethered device that operates off of energy scavenged from the ambient environment. In this paper, we describe an application-driven approach to the architectural design and implementation of a wireless sensor device that recognizes the event-driven nature of many sensor-network workloads. We have developed a full-system simulator for our sensor node design to verify and explore our architecture. Our simulation results suggest one to two orders of magnitude reduction in power dissipation over existing commodity-based systems for an important class of sensor network applications. We are currently in the implementation stage of design, and plan to tape out the first version of our system within the next year.
