Publications by Author: David Brooks

2003
Russ Joseph, David Brooks, and Margaret Martonosi. 2/12/2003. “Control techniques to eliminate voltage emergencies in high performance processors.” In High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. The Ninth International Symposium on, Pp. 79–90. Anaheim, CA, USA: IEEE.
Increasing focus on power dissipation issues in current microprocessors has led to a host of proposals for clock gating and other power-saving techniques. While generally effective at reducing average power, many of these techniques have the undesired side-effect of increasing both the variability of power dissipation and the variability of current drawn by the processor. This increase in current variability, often referred to as the dI/dt problem, can cause supply voltage fluctuations. Such voltage fluctuations lead to unreliable circuits if not addressed, and increasingly expensive chip packaging techniques are needed to mitigate them. This paper proposes and evaluates a methodology for augmenting packaging techniques for dI/dt with microarchitectural control mechanisms. We discuss the resonant frequencies most relevant to current microprocessor packages, produce and evaluate a "dI/dt stressmark" that exercises the system at its resonant frequency, and characterize the behavior of more mainstream applications. Based on these results plus evaluations of the impact of controller error and delay, our microarchitectural control proposals offer bounds on supply voltage fluctuations, with nearly negligible impact on performance and energy. With the ITRS roadmap predicting aggressive drops in supply voltage and power supply impedances in coming chip generations, novel voltage control techniques will be required to stay on track. Our microarchitectural dI/dt controllers represent a step in this direction.
2002
Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose, Victor Zyuban, Philip Strenski, and Philip Emma. 11/18/2002. “Optimizing pipelines for power and performance.” In Microarchitecture, 2002. (MICRO-35). Proceedings. 35th Annual IEEE/ACM International Symposium on, Pp. 333–344. IEEE.
During the concept phase and definition of next generation high-end processors, power and performance will need to be weighted appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues confronting the architect at this stage is the choice of pipeline depth and target frequency. In this paper we present an optimization methodology that starts with an analytical power-performance model to derive optimal pipeline depth for a superscalar processor. The results are validated and further refined using detailed simulation based analysis. As part of the power-modeling methodology, we have developed equations that model the variation of energy as a function of pipeline depth. Our results using a set of SPEC2000 applications show that when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of these energy models.
M Martonosi, David Brooks, and V Tiwari. 2/2002. “Architecture-level power modeling with Wattch.” Computer, 35, 2, Pp. 64–64.
Alper Buyuktosunoglu, David Albonesi, Stanley Schuster, David Brooks, Pradip Bose, and Peter Cook. 1/2002. “Power-efficient issue queue design.” In Power aware computing, Pp. 35–58. Kluwer Academic Publishers.

Increasing levels of power dissipation threaten to limit the performance gains of future high-end, out-of-order issue microprocessors. Therefore, it is imperative that designers devise techniques that significantly reduce the power dissipation of the key hardware structures on the chip without unduly compromising performance. Such a key structure in out-of-order designs is the issue queue. Although crucial in achieving high performance, the issue queues are often a major contributor to the overall power consumption of the chip, potentially affecting both thermal issues related to hot spots and energy issues related to battery life. In this chapter, we present two techniques that significantly reduce issue queue power while maintaining high performance operation. First, we evaluate the power savings achieved by implementing a CAM/RAM structure for the issue queue as an alternative to the more power-hungry latch-based issue queue used in many designs. We then present the microarchitecture and circuit design of an adaptive issue queue that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. We compare two different dynamic adaptation algorithms that use issue queue utilization and parallelism metrics in order to size the issue queue on-the-fly during execution. Together, these two techniques provide over a 70% average reduction in issue queue power dissipation for a collection of the SPEC CPU2000 integer benchmarks, with only a 3% overall performance degradation.

Pradip Bose, David Brooks, Viji Srinivasan, and Philip Emma. 2002. Power-Performance and Power Swing Characterization in Adaptive Microarchitectures. Technical Paper Archive - Research Reports. IBM Research.
In this paper, we present an analysis of some of the fundamental power-performance tradeoffs in processors that employ adaptive techniques to vary sizes, bandwidths, clock-gating modes and clock frequencies. Initial expectations are set using simple analytical reasoning models. Later, simulation-based data is presented in the context of a simple, low-power superscalar processor prototype (called LPX) that is currently under development as a test vehicle. There are three fundamental issues that we attempt to address in this paper: (a) Does dynamic adaptation - in clocking or microarchitectural resources - help extend the power-performance efficiency range of wider-issue superscalars? (b) What factors of power and power-density reductions are within practical reach in future adaptive processors? (c) Does the presence of dynamic adaptation modes cause unacceptably large, worst-case power (or current) swings in affected sub-units?
2001
David Brooks, Margaret Martonosi, John Wellman, and Pradip Bose. 12/2001. “Power-performance modeling and tradeoff analysis for a high end microprocessor.” Power-Aware Computer Systems, Pp. 126–136.
We describe a new power-performance modeling toolkit, developed to aid in the evaluation and definition of future power-efficient PowerPC™ processors. The base performance models in use in this project are: (a) a fast but cycle-accurate, parameterized research simulator and (b) a slower, pre-RTL reference model that models a specific high-end machine in full, latch-accurate detail. Energy characterizations are derived from real, circuit-level power simulation data. These are then combined to form higher-level energy models that are driven by microarchitecture-level parameters of interest. The overall methodology allows us to conduct power-performance tradeoff studies in defining the follow-on design points within a given product family. We present a few experimental results to illustrate the kinds of tradeoffs one can study using this tool.
David Brooks and Margaret Martonosi (advisor). 11/2001. “Design and modeling of power-efficient computer architectures.” Princeton University ProQuest Dissertations Publishing.

Power dissipation and thermal issues are increasingly significant in modern processors. As a result, it is crucial that power/performance tradeoffs be made more visible to chip architects and compiler writers, in addition to circuit designers. Most traditional power analysis tools achieve high accuracy by calculating power estimates for designs only after the circuit design, layout, and floorplanning are complete. In addition to being available only late in the design process, such tools are often quite slow, which compounds the difficulty of running them for a large space of design possibilities.

This thesis presents a methodology for estimating power dissipation at a much earlier stage in the design cycle and at a much higher level. Wattch and PowerTimer are two working examples of the use of this methodology. Both tools provide a framework for analyzing and optimizing microprocessor power dissipation at the architecture-level. These tools are 1000X or more faster than existing layout-level power tools, and yet maintain accuracy within 10% of their estimates as verified using industry tools on leading-edge designs. These tools can allow architects to explore and cull the design space early on and open up the field of power-efficient computing to a wider range of researchers by providing a power evaluation methodology within the portable and familiar SimpleScalar framework.

This thesis also considers several applications of architectural-level power modeling to propose specific architectural-level power and temperature saving optimizations—value-based clock gating and dynamic thermal management. Value-based clock gating is a technique which exploits the dynamic bitwidth requirements of typical applications to save power within arithmetic units and the memory hierarchy. We have demonstrated that this technique can save roughly 50% of the power in the integer execution units. With dynamic thermal management, temperature sensors and throttling techniques are combined to adaptively slow down the CPU for extended periods of particularly high-power code sequences. This allows the CPU package and power delivery system to be designed for a much lower maximum power rating, with minimal performance impact for typical applications.

The techniques presented in this thesis represent some of the first work in the area of high-performance, low-power processor design at the architectural level. We hope that this work, and the other research in the area of low-power architectural modeling and design, will help future generations of processor architectures to meet the many new challenges in this area.

Russ Joseph, David Brooks, and Margaret Martonosi. 8/2001. “Live, runtime power measurements as a foundation for evaluating power/performance tradeoffs.” Workshop on Complexity Effective Design (WCED), held in conjunction with ISCA, 28.
Of the many ways one could gauge the complexity-effectiveness of a design or design element, one candidate approach is to consider a design's power/performance tradeoffs. This paper describes our early-stage results in a broad effort to evaluate the power-performance tradeoffs of a range of benchmarks and microarchitectures. In particular, this paper presents power data collected on-the-fly on real x86 machines as they execute carefully-constructed microbenchmarks. The microbenchmarks exercise aspects of the system such as data cache and branch predictor. They are parametrically-variable to consider how load dependence, cache miss rate, branch mispredict rate, and branch distance all impact the power and performance of a CPU. For example, from these experiments, we learn that CPU performance increases essentially monotonically with cache hit rate, while CPU power encounters a maximum at roughly 80-90% cache hit rates. Likewise, we show results demonstrating that performance-neutral issues such as bit populations in the data cache values can display interesting power trends. While the experimental results are preliminary, we feel that the techniques described in this paper will offer a useful foundation for a broad range of power/performance tradeoffs.
Alper Buyuktosunoglu, Stanley Schuster, David Brooks, Pradip Bose, Peter Cook, and David Albonesi. 6/11/2001. “An adaptive issue queue for reduced power at high performance.” Power-Aware Computer Systems, Pp. 25–39.

Increasing power dissipation has become a major constraint for future performance gains in the design of microprocessors. In this paper, we present the circuit design of an issue queue for a superscalar processor that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. A novel circuit structure dynamically gathers statistics of issue queue activity over intervals of instruction execution. These statistics are then used to change the size of an issue queue organization on-the-fly to improve issue queue energy and performance. When applied to a fixed, full-size issue queue structure, the result is up to a 70% reduction in energy dissipation. The complexity of the additional circuitry to achieve this result is almost negligible. Furthermore, self-timed techniques embedded in the adaptive scheme can provide a 56% decrease in cycle time of the CAM array read of the issue queue when we change the adaptive issue queue size from 32 entries (largest possible) to 8 entries (smallest possible in our design).

Alper Buyuktosunoglu, David Albonesi, Stanley Schuster, David Brooks, Pradip Bose, and Peter Cook. 3/2001. “A circuit level implementation of an adaptive issue queue for power-aware microprocessors.” In Proceedings of the 11th Great Lakes symposium on VLSI, Pp. 73–78. ACM.
Increasing power dissipation has become a major constraint for future performance gains in the design of microprocessors. In this paper, we present the circuit design of an issue queue for a superscalar processor that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. A novel circuit structure dynamically gathers statistics of issue queue activity over intervals of instruction execution. These statistics are then used to change the size of an issue queue organization on-the-fly to improve issue queue energy and performance. When applied to a fixed, full-size issue queue structure, the result is up to a 70% reduction in energy dissipation. The complexity of the additional circuitry to achieve this result is almost negligible. Furthermore, self-timed techniques embedded in the adaptive scheme can provide a 56% decrease in cycle time of the CAM array read of the issue queue when we change the adaptive issue queue size from 32 entries (largest possible) to 8 entries (smallest possible in our design).
David Brooks and Margaret Martonosi. 1/19/2001. “Dynamic thermal management for high-performance microprocessors.” In High-Performance Computer Architecture, 2001. HPCA. The Seventh International Symposium on, Pp. 171–182. IEEE.

With the increasing clock rate and transistor count of today's microprocessors, power dissipation is becoming a critical component of system design complexity. Thermal and power-delivery issues are becoming especially critical for high-performance computing systems. In this work, we investigate dynamic thermal management as a technique to control CPU power dissipation. With the increasing usage of clock gating techniques, the average power dissipation typically seen by common applications is becoming much less than the chip's rated maximum power dissipation. However, system designers still must design thermal heat sinks to withstand the worst-case scenario. We define and investigate the major components of any dynamic thermal management scheme. Specifically, we explore the tradeoffs between several mechanisms for responding to periods of thermal trauma and we consider the effects of hardware and software implementations. With appropriate dynamic thermal management, the CPU can be designed for a much lower maximum power rating, with minimal performance impact for typical applications.

Pradip Bose, Margaret Martonosi, and David Brooks. 2001. “Modeling and Analyzing CPU Power and Performance: Metrics, Methods, and Abstractions.” Tutorial, ACM SIGMETRICS.
2000
David Brooks and Margaret Martonosi. 2000. “Adaptive thermal management for high-performance microprocessors.” In Workshop on Complexity Effective Design. Citeseer.
David Brooks, Pradip Bose, Stanley Schuster, Hans Jacobson, Prabhakar Kudva, Alper Buyuktosunoglu, J Wellman, Victor Zyuban, Manish Gupta, and Peter Cook. 2000. “Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors.” Micro, IEEE, 20, 6, Pp. 26–44.
David Brooks and Margaret Martonosi. 2000. “Value-based clock gating and operation packing: dynamic strategies for improving processor power and performance.” ACM Transactions on Computer Systems (TOCS), 18, 2, Pp. 89–126.
Margaret Martonosi, Vivek Tiwari, and David Brooks. 2000. “Wattch: A framework for architectural-level power analysis and optimizations.” ISCA, Pp. 83.
1999
David Brooks and Margaret Martonosi. 1/9/1999. “Dynamically exploiting narrow width operands to improve processor power and performance.” In High-Performance Computer Architecture, 1999. Proceedings. Fifth International Symposium On, Pp. 13–22. Orlando, FL, USA: IEEE.

In general-purpose microprocessors, recent trends have pushed towards 64 bit word widths, primarily to accommodate the large addressing needs of some programs. Many integer problems, however, rarely need the full 64 bit dynamic range these CPUs provide. In fact, another recent instruction set trend has been increased support for sub-word operations (that is, manipulating data in quantities less than the full word size). In particular, most major processor families have introduced "multimedia" instruction set extensions that operate in parallel on several sub-word quantities in the same ALU. This paper notes that across the SPECint95 benchmarks, over half of the integer operation executions require 16 bits or less. With this as motivation, our work proposes hardware mechanisms that dynamically recognize and capitalize on these "narrow-bitwidth" instances. Both optimizations require little additional hardware, and neither requires compiler support. The first, power-oriented, optimization reduces processor power consumption by using aggressive clock gating to turn off portions of integer arithmetic units that will be unnecessary for narrow bitwidth operations. This optimization results in an over 50% reduction in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. The second optimization improves performance by merging together narrow integer operations and allowing them to share a single functional unit. Conceptually akin to a dynamic form of MMX, this optimization offers speedups of 4.3%-6.2% for SPECint95 and 8.0%-10.4% for MediaBench.

David Brooks and Margaret Martonosi. 1/1999. “Implementing application-specific cache-coherence protocols in configurable hardware.” In Network-Based Parallel Computing. Communication, Architecture, and Applications, Pp. 181–195. Springer.

Streamlining communication is key to achieving good performance in shared-memory parallel programs. While full hardware support for cache coherence generally offers the best performance, not all parallel machines provide it. Instead, software layers using Shared Virtual Memory (SVM) can be built to enforce coherence at a higher level. In prior work, researchers have studied application-specific cache coherence protocols implemented either in SVM systems or as handlers run by programmable protocol processors. Since the protocols are specialized to the needs of a single application, they can be particularly helpful in reducing the long latencies and processing overhead that sometimes degrade performance in SVM systems. This paper studies implementing application-specific protocols in hardware, but not via an instruction-based protocol processor as is typical. Instead, we consider configurable implementations based on Field-Programmable Gate Arrays (FPGAs). This approach can be faster than software-based techniques and less expensive than some hardware-based techniques. We study one application, appbt, in detail, including a VHDL-level design of the configurable protocol design. We sketch out approaches for other applications as well. Implementing protocol operations in configurable hardware improves communication performance by roughly 11X for a 32-node system. While overall speedups are a more modest 12%, our method is promising because of its flexibility and because it offers a new way of harnessing configurable hardware at the network interface, where it already exists or could be easily added to current systems.

1998
Christina Leung, David Brooks, Margaret Martonosi, and Douglas Clark. 1998. “Power-Aware Architecture Studies: Ongoing Work at Princeton.” Power-Driven Microarchitecture Workshop.
Power dissipation limits have emerged as a major constraint in the design of microprocessors. At the low end of the performance spectrum, namely in the world of handheld and portable devices or systems, power has always dominated over performance (execution time) as the primary design issue. Battery life and system cost constraints drive the design team to consider power over performance in such a scenario. Increasingly, however, power is also a key design issue in the workstation and server markets (see Gowan et al.).¹ In this high-end arena the increasing microarchitectural complexities, clock frequencies, and die sizes push the chip-level, and hence the system-level, power consumption to such levels that traditionally air-cooled multiprocessor server boxes may soon need budgets for liquid-cooling or refrigeration hardware. This need is likely to cause a break point, with a step upward, in the ever-decreasing price-performance ratio curve. As such, a design team that considers power consumption and dissipation limits early in the design cycle and can thereby adopt an inherently lower power microarchitectural line will have a definite edge over competing teams. Thus far, most of the work done in the area of high-level power estimation has been focused at the register-transfer-level (RTL) description in the processor design flow. Only recently have we seen a surge of interest in estimating power at the microarchitecture definition stage, and specific work on power-efficient microarchitecture design has been reported.²⁻⁸ Here, we describe the approach of using energy-enabled performance simulators in early design. We examine some of the emerging paradigms in processor design and comment on their inherent power-performance characteristics.
