Publications

2006
Howard W, Gu Wei, Dally J, and Horowitz Paul. 9/10/2006. “Pulsenet-A Parallel Flash Sampler and Digital Processor IC for Optical SETI.” In IEEE Custom Integrated Circuits Conference 2006, Pp. 261–264. IEEE.

PulseNet is a full-custom IC with parallel flash ADC and digital processing that enables an all-sky optical search for extraterrestrial intelligence. It integrates 448 sense amplifiers that digitize 32 analog signals at 1GS/s, and other circuits that filter samples, store candidate signals, and perform astronomical observations. Its ~250,000 CMOS transistors (TSMC 0.25μm) dissipate 1.1W at 400MHz and 2.5V.

B Lee and David Brooks. 6/18/2006. “Statistically rigorous regression modeling for the microprocessor design space.” ISCA-33: Workshop on Modeling, Benchmarking, and Simulation.
Regression models enhance existing techniques in detailed microarchitectural simulation by reducing the number of required simulations and using simulation data more efficiently to identify trends and trade-offs. We present a rigorous derivation of such models for microprocessor performance and power prediction, emphasizing the need to apply domain-specific knowledge when performing statistical inference. In particular, we propose sampling observations uniformly at random from a large design space, discuss approaches for identifying statistically significant predictors, and detail strategies for effectively modeling predictor interaction and non-linearity. The resulting models enable computationally efficient statistical inference, requiring the simulation of only 1 in every 5 million points of a joint microarchitecture-application design space while achieving median prediction error rates as low as 4.1 percent for performance and 4.3 percent for power.
Hanumolu P, Wei Y, and U-K Moon. 6/15/2006. “A wide tracking range 0.2-4Gbps clock and data recovery circuit.” In 2006 Symposium on VLSI Circuits, 6/15/2006. Digest of Technical Papers., Pp. 71–72. IEEE.

A hybrid analog and digital quarter-rate clock and data recovery circuit employs a second-order digital loop filter with delta-sigma truncation to achieve sub-ps phase resolution and better than 2ppm frequency resolution. A test chip fabricated in a 0.18μm CMOS process achieves BER < 10⁻¹² and consumes 14mW power while operating at 2Gbps. The tracking range is greater than ±5000 ppm and ±2500 ppm at 10kHz and 20kHz modulation frequencies respectively, making this CDR suitable for systems with spread spectrum clocking.

Yingmin Li, Benjamin Lee, David Brooks, Zhigang Hu, and Kevin Skadron. 2/11/2006. “CMP design space exploration subject to physical constraints.” In High-Performance Computer Architecture, 2/11/2006. The Twelfth International Symposium on, Pp. 17–28. Austin, TX, USA: IEEE.
This paper explores the multi-dimensional design space for chip multiprocessors, exploring the inter-related variables of core count, pipeline depth, superscalar width, L2 cache size, and operating voltage and frequency, under various area and thermal constraints. The results show the importance of joint optimization. Thermal constraints dominate other physical constraints such as pin-bandwidth and power delivery, demonstrating the importance of considering thermal constraints while optimizing these other parameters. For aggressive cooling solutions, reducing power density is at least as important as reducing total power, while for low-cost cooling solutions, reducing total power is more important. Finally, the paper shows the challenges of accommodating both CPU-bound and memory-bound workloads on the same design. Their respective preferences for more cores and larger caches lead to increasingly irreconcilable configurations as area and other constraints are relaxed; rather than accommodating a happy medium, the extra resources simply encourage more extreme optimization points.
Qiang Wu, Margaret Martonosi, Douglas Clark, Vijay Reddi, Dan Connors, Youfeng Wu, Jin Lee, and David Brooks. 1/2006. “Dynamic-compiler-driven control for microprocessor energy and performance.” Micro, IEEE, 26, 1, Pp. 119–129.
A general dynamic-compilation environment offers power and performance control opportunities for microprocessors. The authors propose a dynamic-compiler-driven runtime voltage and frequency optimizer. A prototype of their design, implemented and deployed in a real system, achieves energy savings of up to 70 percent.
Xiaoyao Liang and David Brooks. 2006. “Latency adaptation for multiported register files to mitigate the impact of process variations.” Workshop on Architectural Support for Gigascale Integration (ASGI-06, held in conjunction with ISCA-33).

Design variability due to die-to-die and within-die process variations has the potential to significantly reduce the maximum operating frequency and the effective yield of high-performance microprocessors in future process technology generations. This variability manifests itself by increasing the frequency variance and decreasing the mean frequency of fabricated chips. In this paper we develop a model for the impact of variability on the performance of multiported SRAM-based structures such as physical register files which are key architectural components that may encounter variability problems. We find that naively resizing or increasing the access latency of these performance critical datapath resources can have frequency benefits, but may incur a significant IPC loss that limits overall system performance. We propose an extension to latency adaptation called port switching which more efficiently exploits the technique to remedy the IPC loss. We find that even under a conservative, worst case study, 18% mean frequency improvement with less than 5% IPC loss is possible for the 65nm technology node. Finally, we contrast the impact of die-to-die and within-die variations on chip performance and demonstrate that the proposed technique can compensate the frequency loss mainly due to within-die variations.

Yingmin Li, Kevin Skadron, Benjamin Lee, and David Brooks. 2006. “Quantifying Latency and Throughput Compromises in CMP Design”.

Designers of chip multiprocessors will increasingly be called upon to optimize for a combination of design metrics under a variety of design constraints. The adoption of chip multiprocessors has also led to a shift in design metrics toward aggregate throughput and away from single thread latency. We examine the compromises between latency and throughput under various power, thermal, area, and bandwidth constraints to quantify the latency penalties of a purely throughput optimized design. We consider a large chip multiprocessor design space that includes core count, core complexity (pipeline dimensions, in-order versus out-of-order execution), and cache hierarchy sizes. We demonstrate an approach to effectively assess trade-offs given a comprehensive core model, a set of optimization criteria, and a set of design constraints. We perform a number of case studies to evaluate these trade-offs, exposing significant single thread latency penalties when optimizing solely for throughput and neglecting other measures of performance. As single thread latency continues to be one of several design metrics, any choice to compromise latency should be well understood before implementation. Collectively, our results suggest single thread latency is still a design metric of importance given that optimizing throughput alone will significantly compromise latency. Furthermore, the case for simple, in-order cores should be taken with caution given this balanced view of performance.

Benjamin Lee and David Brooks. 2006. “Regression modeling strategies for microarchitectural performance and power prediction.” Proceedings of the 2006 ASPLOS Conference, Pp. 185–194.

We propose regression modeling as an effective approach for accurately predicting performance and power for various applications executing on any microprocessor configuration in a large microarchitectural design space. This report addresses fundamental challenges in microarchitectural simulation costs via statistical modeling. Specifically, we derive and validate regression models for performance and power. Such models enable computationally efficient statistical inference, requiring the simulation of only 1 in 5 million points of a joint microarchitecture-application design space while achieving error rates as low as 4.1 percent for performance and 4.3 percent for power. Although both models achieve similar accuracy, the sources of accuracy are strikingly different. We present optimizations for a baseline regression model to obtain (1) per benchmark application-specific models designed to maximize accuracy in performance prediction and (2) regional power models leveraging only the most relevant samples from the microarchitectural design space to maximize accuracy in power prediction. Assessing model sensitivity to sample and region sizes, we find that 4,000 samples from a design space of approximately 22 billion points are sufficient for both application-specific and regional modeling and prediction. Collectively, our results suggest significant potential in accurate and efficient statistical inference for microarchitectural design space exploration via regression models.

Benjamin Lee and David Brooks. 2006. “Statistical inference”.
2005
Qiang Wu, Margaret Martonosi, Douglas Clark, Vijay Reddi, Dan Connors, Youfeng Wu, Jin Lee, and David Brooks. 11/12/2005. “A dynamic compilation framework for controlling microprocessor energy and performance.” In Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, Pp. 271–282. Barcelona: IEEE Computer Society.
Dynamic voltage and frequency scaling (DVFS) is an effective technique for controlling microprocessor energy and performance. Existing DVFS techniques are primarily based on hardware, OS time-interrupts, or static-compiler techniques. However, substantially greater gains can be realized when control opportunities are also explored in a dynamic compilation environment. There are several advantages to deploying DVFS and managing energy/performance tradeoffs through the use of a dynamic compiler. Most importantly, dynamic compiler driven DVFS is fine-grained, code-aware, and adaptive to the current microarchitecture environment. This paper presents a design framework of the run-time DVFS optimizer in a general dynamic compilation system. A prototype of the DVFS optimizer is implemented and integrated into an industrial-strength dynamic compilation system. The obtained optimization system is deployed in a real hardware platform that directly measures CPU voltage and current for accurate power and energy readings. Experimental results, based on physical measurements for over 40 SPEC or Olden benchmarks, show that significant energy savings are achieved with little performance degradation. SPEC2K FP benchmarks benefit with energy savings of up to 70% (with 0.5% performance loss). In addition, SPEC2K INT show up to 44% energy savings (with 5% performance loss), SPEC95 FP save up to 64% (with 4.9% performance loss), and Olden save up to 61% (with 4.5% performance loss). On average, the technique leads to an energy delay product (EDP) improvement that is 3× to 5× better than static voltage scaling, and is more than 2× (22% vs. 9%) better than the reported DVFS results of prior static compiler work. While the proposed technique is an effective method for microprocessor voltage and frequency control, the design framework and methodology described in this paper have broader potential to address other energy and power issues such as di/dt and thermal control.
Yingmin Li, Mark Hempstead, Patrick Mauro, David Brooks, Zhigang Hu, and Kevin Skadron. 8/2005. “Power and thermal effects of SRAM vs. Latch-Mux design styles and clock gating choices.” In Proceedings of the 2005 international symposium on Low power electronics and design, Pp. 173–178. ACM.

This paper studies the impact on energy efficiency and thermal behavior of design style and clock-gating style in queue and array structures. These structures are major sources of power dissipation, and both design styles and various clock gating schemes can be found in modern, high-performance processors. Although some work in the circuits domain has explored these issues from a power perspective, thermal treatments are less common, and we are not aware of any work in the architecture domain. We study both SRAM and latch and multiplexer ("latch-mux") designs and their associated clock-gating options. Using circuit-level simulations of both design styles, we derive power-dissipation ratios which are then used in cycle-level power/performance/thermal simulations. We find that even though the "unconstrained" power of SRAM designs is always better than latch-mux designs, latch-mux designs dissipate less power in practice when a structure's average occupancy is low but access rate is high, especially when "stall gating" is used to minimize switching power. We also find that latch-mux designs with stall gating are especially promising from a thermal perspective, because they exhibit lower power density than SRAM designs. Overall, when combined with implementation and verification challenges for SRAMs, latch-mux designs with stall gating appear especially promising for designs with thermal constraints. This paper also shows the importance of considering the interaction between architectural and circuit-design choices when performing early-stage design exploration.

Mark Hempstead, Nikhil Tripathi, Patrick Mauro, Gu Wei, and David Brooks. 6/4/2005. “An ultra low power system architecture for sensor network applications.” In ACM SIGARCH Computer Architecture News, 33: Pp. 208–219. Madison, WI, USA: IEEE Computer Society.
Recent years have seen a burgeoning interest in embedded wireless sensor networks with applications ranging from habitat monitoring to medical applications. Wireless sensor networks have several important attributes that require special attention to device design. These include the need for inexpensive, long-lasting, highly reliable devices coupled with very low performance requirements. Ultimately, the "holy grail" of this design space is a truly untethered device that operates off of energy scavenged from the ambient environment. In this paper, we describe an application-driven approach to the architectural design and implementation of a wireless sensor device that recognizes the event-driven nature of many sensor-network workloads. We have developed a full-system simulator for our sensor node design to verify and explore our architecture. Our simulation results suggest one to two orders of magnitude reduction in power dissipation over existing commodity-based systems for an important class of sensor network applications. We are currently in the implementation stage of design, and plan to tape out the first version of our system within the next year.
Hanumolu Kumar, Gu Wei, and Moon Ku. 6/2005. “Equalizers for high-speed serial links.” International journal of high speed electronics and systems, 15, 02, Pp. 429–458.

In this tutorial paper we present equalization techniques to mitigate inter-symbol interference (ISI) in high-speed communication links. Both transmit and receive equalizers are analyzed and high-speed circuits implementing them are presented. It is shown that a digital transmit equalizer is the simplest to design, while a continuous-time receive equalizer generally provides better performance. Decision feedback equalizer (DFE) is described and the loop latency problem is addressed. Finally, techniques to set the equalizer parameters adaptively are presented.

Xuning Chen, Shiuan Peh, Gu Wei, Kai Huang, and Paul Prucnal. 2/12/2005. “Exploring the design space of power-aware opto-electronic networked systems.” In 11th International Symposium on High-Performance Computer Architecture, Pp. 120–131. San Francisco, CA, USA: IEEE.
As microprocessors become increasingly interconnected, the power consumed by the interconnection network can no longer be ignored. Moreover, with demand for link bandwidth increasing, optical links are replacing electrical links in inter-chassis and inter-board environments. As a result, the power dissipation of optical links is becoming as critical as their speed. In this paper, we first explore options for building high speed optoelectronic links and discuss the power characteristics of different link components. Then, we propose circuit and network mechanisms that can realize power-aware optical links, i.e., links whose power consumption can be tuned dynamically in response to changes in network traffic. Finally, we incorporate power control policies along with the power characterization of link circuitry into a detailed network simulator to evaluate the performance cost and power savings of building power-aware optoelectronic networked systems. Simulation results show that more than 75% savings in power consumption can be achieved with the proposed power-aware optoelectronic network.
Yingmin Li, K Skadron, David Brooks, and Zhigang Hu. 2/12/2005. “Performance, energy, and thermal considerations for SMT and CMP architectures.” In High-Performance Computer Architecture, 2/12/2005. HPCA-11. 11th International Symposium on, Pp. 71–82. IEEE.
Simultaneous multithreading (SMT) and chip multiprocessing (CMP) both allow a chip to achieve greater throughput, but their relative energy-efficiency and thermal properties are still poorly understood. This paper uses Turandot, PowerTimer, and HotSpot to explore this design space for a POWER4/POWER5-like core. For an equal-area comparison with this style of core, we find CMP to be superior in terms of performance and energy-efficiency for CPU-bound benchmarks, but SMT to be superior for memory-bound benchmarks due to a larger L2 cache. Although both exhibit similar peak operating temperatures and thermal management overheads, the mechanisms by which SMT and CMP heat up are quite different. More specifically, SMT heating is primarily caused by localized heating in certain key structures, whereas CMP heating is mainly caused by the global impact of increased energy output. Because of this difference in heat-up mechanism, we found that the best thermal management technique is also different for SMT and CMP. Indeed, non-DVS localized thermal-management can outperform DVS for SMT. Finally, we show that CMP and SMT scale differently as the contribution of leakage power grows, with CMP suffering from higher leakage due to the second core's higher temperature and the exponential temperature-dependence of subthreshold leakage.
Benjamin Lee and David Brooks. 1/2005. “Effects of pipeline complexity on SMT/CMP power-performance efficiency.” Power, 106, Pp. 1.
We consider processor core complexity and its implications for the power-performance efficiency of SMT and CMP architectures, exploring fundamental trade-offs between the efficiency of multi-core architectures and the complexity of their cores from a power-performance perspective. Taking pipeline depth and width as proxies for core complexity, we conduct power-performance simulations of several SMT and CMP architectures employing cores of varying complexity. Our analyses identify efficient pipeline dimensions and outline the implications of using a power-performance efficiency metric for core complexity. Collectively, our results suggest SMT architectures enable efficient increases in pipeline dimensions and core complexity. Furthermore, reducing pipeline dimensions in CMP cores is inefficient, assuming ideal power-performance scaling from voltage/frequency scaling and circuit re-tuning. Given these conclusions, we formulate guidelines for complexity effective design.
Jayanth Srinivasan, Sarita Adve, Pradip Bose, Jude Rivers, Y. Li, David Brooks, Z Hu, K Skadron, V Srinivasan, and M Gschwind. 2005. “The case for microarchitectural awareness of lifetime reliability.” IEEE Micro, 25, 3, Pp. 70–80.
Xiaoyao Liang and David Brooks. 2005. “Highly accurate power modeling method for SRAM structures with simple circuit simulation.” 2nd Watson Conf. Interaction Between Architecture, Circuits, Compilers.
2004
Mark Hempstead, Matt Welsh, and David Brooks. 11/16/2004. “TinyBench: The case for a standardized benchmark suite for TinyOS based wireless sensor network devices.” In Local Computer Networks, 11/16/2004. 29th Annual IEEE International Conference on, Pp. 585–586. Tampa, FL, USA: IEEE.
The growing wireless sensor network research community lacks a standard method for evaluating hardware platforms. Traditional benchmark suites do not sufficiently address the needs of sensor network designers. This work provides motivation for a benchmark suite and details an approach for benchmarking TinyOS compatible hardware. To aid the development of future hardware architectures, we propose the creation of a standard single node benchmark suite, based on both real applications and "stressmarks." We present sample benchmark results and call for further work in this area.
Yau Chin, John Sheu, and David Brooks. 10/11/2004. “Evaluating techniques for exploiting instruction slack.” In Computer Design: VLSI in Computers and Processors, 10/11/2004. ICCD 10/11/2004. Proceedings. IEEE International Conference on, Pp. 375–378. San Jose, CA, USA: IEEE.
In many workloads, 25% to 50% of instructions have slack allowing them to be delayed without impacting performance. To exploit this slack, processors may implement more power-efficient, longer latency pipelines or provide dynamically scaled pipelines using multiple clock domains. Issuing instructions with slack to slower pipelines can result in substantial power savings, with minimal performance loss. Considering both dynamic and static power dissipation, we found that by using longer latency pipelines the power of functional unit pipelines decreases by 20% to 55% with a performance impact of 0% to 3% for SPEC2000 and MediaBench workloads. Dynamic scaling reduces the performance loss in intense multimedia workloads by up to 2%, but achieves lower power savings.
