Publications by Type: Conference Paper

2005
Yingmin Li, Mark Hempstead, Patrick Mauro, David Brooks, Zhigang Hu, and Kevin Skadron. 8/2005. “Power and thermal effects of SRAM vs. Latch-Mux design styles and clock gating choices.” In Proceedings of the 2005 international symposium on Low power electronics and design, Pp. 173–178. ACM.

This paper studies the impact on energy efficiency and thermal behavior of design style and clock-gating style in queue and array structures. These structures are major sources of power dissipation, and both design styles and various clock gating schemes can be found in modern, high-performance processors. Although some work in the circuits domain has explored these issues from a power perspective, thermal treatments are less common, and we are not aware of any work in the architecture domain. We study both SRAM and latch and multiplexer ("latch-mux") designs and their associated clock-gating options. Using circuit-level simulations of both design styles, we derive power-dissipation ratios which are then used in cycle-level power/performance/thermal simulations. We find that even though the "unconstrained" power of SRAM designs is always better than latch-mux designs, latch-mux designs dissipate less power in practice when a structure's average occupancy is low but access rate is high, especially when "stall gating" is used to minimize switching power. We also find that latch-mux designs with stall gating are especially promising from a thermal perspective, because they exhibit lower power density than SRAM designs. Overall, when combined with implementation and verification challenges for SRAMs, latch-mux designs with stall gating appear especially promising for designs with thermal constraints. This paper also shows the importance of considering the interaction between architectural and circuit-design choices when performing early-stage design exploration.

Mark Hempstead, Nikhil Tripathi, Patrick Mauro, Gu Wei, and David Brooks. 6/4/2005. “An ultra low power system architecture for sensor network applications.” In ACM SIGARCH Computer Architecture News, 33: Pp. 208–219. Madison, WI, USA: IEEE Computer Society.
Recent years have seen a burgeoning interest in embedded wireless sensor networks with applications ranging from habitat monitoring to medical applications. Wireless sensor networks have several important attributes that require special attention to device design. These include the need for inexpensive, long-lasting, highly reliable devices coupled with very low performance requirements. Ultimately, the "holy grail" of this design space is a truly untethered device that operates off of energy scavenged from the ambient environment. In this paper, we describe an application-driven approach to the architectural design and implementation of a wireless sensor device that recognizes the event-driven nature of many sensor-network workloads. We have developed a full-system simulator for our sensor node design to verify and explore our architecture. Our simulation results suggest one to two orders of magnitude reduction in power dissipation over existing commodity-based systems for an important class of sensor network applications. We are currently in the implementation stage of design, and plan to tape out the first version of our system within the next year.
Xuning Chen, Shiuan Peh, Gu Wei, Kai Huang, and Paul Prucnal. 2/12/2005. “Exploring the design space of power-aware opto-electronic networked systems.” In 11th International Symposium on High-Performance Computer Architecture, Pp. 120–131. San Francisco, CA, USA: IEEE.
As microprocessors become increasingly interconnected, the power consumed by the interconnection network can no longer be ignored. Moreover, with demand for link bandwidth increasing, optical links are replacing electrical links in inter-chassis and inter-board environments. As a result, the power dissipation of optical links is becoming as critical as their speed. In this paper, we first explore options for building high-speed optoelectronic links and discuss the power characteristics of different link components. Then, we propose circuit and network mechanisms that can realize power-aware optical links, that is, links whose power consumption can be tuned dynamically in response to changes in network traffic. Finally, we incorporate power control policies along with the power characterization of link circuitry into a detailed network simulator to evaluate the performance cost and power savings of building power-aware optoelectronic networked systems. Simulation results show that more than 75% savings in power consumption can be achieved with the proposed power-aware optoelectronic network.
Yingmin Li, K Skadron, David Brooks, and Zhigang Hu. 2/12/2005. “Performance, energy, and thermal considerations for SMT and CMP architectures.” In High-Performance Computer Architecture, 2005. HPCA-11. 11th International Symposium on, Pp. 71–82. IEEE.
Simultaneous multithreading (SMT) and chip multiprocessing (CMP) both allow a chip to achieve greater throughput, but their relative energy-efficiency and thermal properties are still poorly understood. This paper uses Turandot, PowerTimer, and HotSpot to explore this design space for a POWER4/POWER5-like core. For an equal-area comparison with this style of core, we find CMP to be superior in terms of performance and energy-efficiency for CPU-bound benchmarks, but SMT to be superior for memory-bound benchmarks due to a larger L2 cache. Although both exhibit similar peak operating temperatures and thermal management overheads, the mechanisms by which SMT and CMP heat up are quite different. More specifically, SMT heating is primarily caused by localized heating in certain key structures, whereas CMP heating is mainly caused by the global impact of increased energy output. Because of this difference in heat-up mechanism, we found that the best thermal management technique is also different for SMT and CMP. Indeed, non-DVS localized thermal-management can outperform DVS for SMT. Finally, we show that CMP and SMT scale differently as the contribution of leakage power grows, with CMP suffering from higher leakage due to the second core's higher temperature and the exponential temperature-dependence of subthreshold leakage.
2004
Mark Hempstead, Matt Welsh, and David Brooks. 11/16/2004. “TinyBench: The case for a standardized benchmark suite for TinyOS based wireless sensor network devices.” In Local Computer Networks, 2004. 29th Annual IEEE International Conference on, Pp. 585–586. Tampa, FL, USA: IEEE.
The growing wireless sensor network research community lacks a standard method for evaluating hardware platforms. Traditional benchmark suites do not sufficiently address the needs of sensor network designers. This work provides motivation for a benchmark suite and details an approach for benchmarking TinyOS compatible hardware. To aid the development of future hardware architectures, we propose the creation of a standard single node benchmark suite, based on both real applications and "stressmarks." We present sample benchmark results and call for further work in this area.
Yau Chin, John Sheu, and David Brooks. 10/11/2004. “Evaluating techniques for exploiting instruction slack.” In Computer Design: VLSI in Computers and Processors, 2004. ICCD 2004. Proceedings. IEEE International Conference on, Pp. 375–378. San Jose, CA, USA: IEEE.
In many workloads, 25% to 50% of instructions have slack allowing them to be delayed without impacting performance. To exploit this slack, processors may implement more power-efficient, longer latency pipelines or provide dynamically scaled pipelines using multiple clock domains. Issuing instructions with slack to slower pipelines can result in substantial power savings, with minimal performance loss. Considering both dynamic and static power dissipation, we found that by using longer latency pipelines the power of functional unit pipelines decreases by 20% to 55% with a performance impact of 0% to 3% for SPEC2000 and MediaBench workloads. Dynamic scaling reduces the performance loss in intense multimedia workloads by up to 2%, but achieves lower power savings.
Yingmin Li, David Brooks, Zhigang Hu, Kevin Skadron, and Pradip Bose. 8/11/2004. “Understanding the energy efficiency of simultaneous multithreading.” In Proceedings of the 2004 international symposium on Low power electronics and design, Pp. 44–49. Newport Beach, CA, USA: ACM.
Simultaneous multithreading (SMT) has proven to be an effective method of increasing the performance of microprocessors by extracting additional instruction-level parallelism from multiple threads. In current microprocessor designs, power-efficiency is of critical importance, and we present modeling extensions to an architectural simulator to allow us to study the power-performance efficiency of SMT. After a thorough design space exploration we find that SMT can provide a performance speedup of nearly 20% for a wide range of applications with a power overhead of roughly 24%. Thus, SMT can provide a substantial benefit for energy-efficiency metrics such as ED². We also explore the underlying reasons for the power uplift, analyze the impact of leakage-sensitive process technologies, and discuss our model validation strategy.
Kim Hazelwood and David Brooks. 8/2004. “Eliminating voltage emergencies via microarchitectural voltage control feedback and dynamic optimization.” In Proceedings of the 2004 international symposium on Low power electronics and design, Pp. 326–331. ACM.

Microprocessor designers use techniques such as clock gating to reduce power dissipation. An unfortunate side-effect of these techniques is the processor current fluctuations that stress the power-delivery network. Recent research has focused on hardware-only mechanisms to detect and eliminate these fluctuations. While the solutions have been effective at avoiding operating-range violations, they have done so at a performance penalty to the executing program.

Compilers are well equipped to rearrange instructions such that current fluctuations are less dramatic, with minimal performance implications. Furthermore, a dynamic optimizer can eliminate the problem at run time, avoiding the difficult task of statically predicting voltage emergencies.

This paper proposes complementing existing hardware solutions with additional run-time software to address problematic code sequences that cause recurring voltage swings. Our proposal extends existing hardware techniques to additionally provide feedback to a dynamic optimizer, which can provide a long-term solution, often without impacting the performance of the executing application.

We found that recurring voltage fluctuations do exist in the SPEC2000 benchmarks, and that given very little information from the hardware, a dynamic optimizer can locate and correct many of the recurring voltage emergencies.

Ruwan Ratnayake, Gu-Yeon Wei, and Aleksandar Kavcic. 6/2004. “Pipelined parallel architecture for high throughput MAP detectors.” In 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), 2: Pp. II–505. IEEE.
A maximum a posteriori probability (MAP) detector based on a forward only algorithm with high throughput is considered in this paper. MAP gives the optimal performance and, with Turbo decoding, can achieve performance close to the channel capacity limits. Deep pipelined architecture for the forward only method is presented and compared with the other throughput-increasing methods. Simulation results based on the iterative MAP-LDPC (low-density parity check) system are shown. Hardware implementation issues that exploit the regularities of the structure are also discussed.
Hanumolu Kumar, Casper Bryan, Mooney Randy, Gu Wei, and Moon Ku. 5/26/2004. “Jitter in high-speed serial and parallel links.” In 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), 4: Pp. IV–425. Vancouver, BC, Canada: IEEE.
Jitter degrades the performance of both high-speed serial and parallel I/O links by limiting the maximum achievable data-rates. We present analytical expressions to evaluate the effect of jitter on the performance of high-speed links. These expressions enable simple calculation of worst-case voltage and timing margins in the presence of jitter. This analysis is also extended to equalized links. Finally, we show that the limited bandwidth of the channel can amplify high frequency jitter and present means to counteract jitter amplification.
Yong Cheol Bae and Gu-Yeon Wei. 5/23/2004. “A mixed PLL/DLL architecture for low jitter clock generation.” In 2004 IEEE International Symposium on Circuits and Systems (IEEE Cat. No. 04CH37512), 4: Pp. IV–788. Vancouver, BC, Canada: IEEE.
This paper presents a mixed PLL/DLL architecture for low-jitter clock generation that merges phase-locked loop (PLL) and delay-locked loop (DLL) characteristics. It relies on an interpolator to configure the loop to operate more like a PLL or more like a DLL depending on the interpolator's settings. The ability to vary interpolator settings enables wide range control of the clock generator's loop bandwidth. Therefore, the loop bandwidth can readily be adjusted to accommodate different noise conditions. A discrete-time Z-domain analysis is provided to illustrate the noise filtering characteristics of the loop in the presence of various noise sources and highlight the potential advantages of the mixed PLL/DLL architecture. Simulation results verify stable operation of the loop designed for a 0.18 μm CMOS process.
Fulford RF, Gu Wei, and Matt Welsh. 2/2004. “A portable, low-power, wireless two-lead EKG system.” In The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 1: Pp. 2141–2144. IEEE.

Sensor devices ("motes") which integrate an embedded microprocessor, low-power radio and a limited amount of storage have the potential to significantly enhance the provision of emergency medical care. Wearable vital sign sensors can wirelessly monitor patient condition, alerting healthcare providers to changes in status while simultaneously delivering data to a backend archival system for longer-term storage. As part of the CodeBlue initiative at Harvard University, we previously developed a mote-based pulse oximetry module which gathers data from a noninvasive finger sensor and transmits it wirelessly to a base station. To expand the capabilities of the mote for healthcare applications we now introduce EKG on Mica2, the first custom-designed electrocardiograph sensor board to interface with this platform. We additionally present VitalEKG, a collection of software components which allow the capture and wireless transmission of heart activity traces. We present preliminary test results which validate our approach and suggest the feasibility of future enhancements.

2003
Gu Wei, Stonick T, Weinlader Dan, Sonntag Jeff, and Searles Shawn. 12/13/2003. “A 500MHz MP/DLL clock generator for a 5Gb/s backplane transceiver in 0.25 μm CMOS.” In 2003 IEEE International Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC., Pp. 464–465. IEEE.
Low-jitter clock generation is a critical component for enabling robust high-speed operation of 5Gb/s backplane transceivers. The implementation of a 500MHz clock synthesizer that operates either as a multiplying phase-locked loop (MPLL) or a multiplying delay-locked loop (MDLL) is described. The choice depends on the noise characteristics of the input clock source. This MP/DLL design is implemented in a 0.25 μm CMOS process and operates with a 2.5V supply.
Russ Joseph, David Brooks, and Margaret Martonosi. 2/12/2003. “Control techniques to eliminate voltage emergencies in high performance processors.” In High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings. The Ninth International Symposium on, Pp. 79–90. Anaheim, CA, USA: IEEE.
Increasing focus on power dissipation issues in current microprocessors has led to a host of proposals for clock gating and other power-saving techniques. While generally effective at reducing average power, many of these techniques have the undesired side-effect of increasing both the variability of power dissipation and the variability of current drawn by the processor. This increase in current variability, often referred to as the dI/dt problem, can cause supply voltage fluctuations. Such voltage fluctuations lead to unreliable circuits if not addressed, and increasingly expensive chip packaging techniques are needed to mitigate them. This paper proposes and evaluates a methodology for augmenting packaging techniques for dI/dt with microarchitectural control mechanisms. We discuss the resonant frequencies most relevant to current microprocessor packages, produce and evaluate a "dI/dt stressmark" that exercises the system at its resonant frequency, and characterize the behavior of more mainstream applications. Based on these results plus evaluations of the impact of controller error and delay, our microarchitectural control proposals offer bounds on supply voltage fluctuations, with nearly negligible impact on performance and energy. With the ITRS roadmap predicting aggressive drops in supply voltage and power supply impedances in coming chip generations, novel voltage control techniques will be required to stay on track. Our microarchitectural dI/dt controllers represent a step in this direction.
2002
Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose, Victor Zyuban, Philip Strenski, and Philip Emma. 11/18/2002. “Optimizing pipelines for power and performance.” In Microarchitecture, 2002. (MICRO-35). Proceedings. 35th Annual IEEE/ACM International Symposium on, Pp. 333–344. IEEE.
During the concept phase and definition of next generation high-end processors, power and performance will need to be weighted appropriately to deliver competitive cost/performance. It is not enough to adopt a CPI-centric view alone in early-stage definition studies. One of the fundamental issues confronting the architect at this stage is the choice of pipeline depth and target frequency. In this paper we present an optimization methodology that starts with an analytical power-performance model to derive optimal pipeline depth for a superscalar processor. The results are validated and further refined using detailed simulation based analysis. As part of the power-modeling methodology, we have developed equations that model the variation of energy as a function of pipeline depth. Our results using a set of SPEC2000 applications show that when both power and performance are considered for optimization, the optimal clock period is around 18 FO4. We also provide a detailed sensitivity analysis of the optimal pipeline depth against key assumptions of these energy models.
Alper Buyuktosunoglu, David Albonesi, Stanley Schuster, David Brooks, Pradip Bose, and Peter Cook. 1/2002. “Power-efficient issue queue design.” In Power aware computing, Pp. 35–58. Kluwer Academic Publishers.

Increasing levels of power dissipation threaten to limit the performance gains of future high-end, out-of-order issue microprocessors. Therefore, it is imperative that designers devise techniques that significantly reduce the power dissipation of the key hardware structures on the chip without unduly compromising performance. Such a key structure in out-of-order designs is the issue queue. Although crucial in achieving high performance, the issue queues are often a major contributor to the overall power consumption of the chip, potentially affecting both thermal issues related to hot spots and energy issues related to battery life. In this chapter, we present two techniques that significantly reduce issue queue power while maintaining high performance operation. First, we evaluate the power savings achieved by implementing a CAM/RAM structure for the issue queue as an alternative to the more power-hungry latch-based issue queue used in many designs. We then present the microarchitecture and circuit design of an adaptive issue queue that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. We compare two different dynamic adaptation algorithms that use issue queue utilization and parallelism metrics in order to size the issue queue on-the-fly during execution. Together, these two techniques provide over a 70% average reduction in issue queue power dissipation for a collection of the SPEC CPU2000 integer benchmarks, with only a 3% overall performance degradation.

2001
Alper Buyuktosunoglu, David Albonesi, Stanley Schuster, David Brooks, Pradip Bose, and Peter Cook. 3/2001. “A circuit level implementation of an adaptive issue queue for power-aware microprocessors.” In Proceedings of the 11th Great Lakes symposium on VLSI, Pp. 73–78. ACM.
Increasing power dissipation has become a major constraint for future performance gains in the design of microprocessors. In this paper, we present the circuit design of an issue queue for a superscalar processor that leverages transmission gate insertion to provide dynamic low-cost configurability of size and speed. A novel circuit structure dynamically gathers statistics of issue queue activity over intervals of instruction execution. These statistics are then used to change the size of an issue queue organization on-the-fly to improve issue queue energy and performance. When applied to a fixed, full-size issue queue structure, the result is up to a 70% reduction in energy dissipation. The complexity of the additional circuitry to achieve this result is almost negligible. Furthermore, self-timed techniques embedded in the adaptive scheme can provide a 56% decrease in cycle time of the CAM array read of the issue queue when we change the adaptive issue queue size from 32 entries (largest possible) to 8 entries (smallest possible in our design).
David Brooks and Margaret Martonosi. 1/19/2001. “Dynamic thermal management for high-performance microprocessors.” In High-Performance Computer Architecture, 2001. HPCA. The Seventh International Symposium on, Pp. 171–182. IEEE.

With the increasing clock rate and transistor count of today's microprocessors, power dissipation is becoming a critical component of system design complexity. Thermal and power-delivery issues are becoming especially critical for high-performance computing systems. In this work, we investigate dynamic thermal management as a technique to control CPU power dissipation. With the increasing usage of clock gating techniques, the average power dissipation typically seen by common applications is becoming much less than the chip's rated maximum power dissipation. However, system designers still must design thermal heat sinks to withstand the worst-case scenario. We define and investigate the major components of any dynamic thermal management scheme. Specifically, we explore the tradeoffs between several mechanisms for responding to periods of thermal trauma and we consider the effects of hardware and software implementations. With appropriate dynamic thermal management, the CPU can be designed for a much lower maximum power rating, with minimal performance impact for typical applications.

2000
Sidiropoulos Stefanos, Liu Dean, Kim Jaeha, Gu Wei, and Horowitz Mark. 2000. “Adaptive bandwidth DLLs and PLLs using regulated supply CMOS buffers.” In 2000 Symposium on VLSI Circuits. Digest of Technical Papers (Cat. No. 00CH37103), Pp. 124–127. IEEE.
David Brooks and Margaret Martonosi. 2000. “Adaptive thermal management for high-performance microprocessors.” In Workshop on Complexity Effective Design. Citeseer.
