Publications by Year: 2006

2006
Yingmin Li, Benjamin Lee, David Brooks, Zhigang Hu, and Kevin Skadron. 12/22/2006. “Impact of thermal constraints on multi-core architectures.” 10th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronics Systems, San Diego.
This paper shows how thermal constraints affect the multidimensional design space for chip multiprocessors, considering the inter-related variables of CPU count, pipeline depth, superscalar width, L2 cache size, and operating voltage and frequency. The results show the importance of thermal modeling and the need for new thermal modeling capabilities and hence the need for collaboration between the thermal engineering and computer architecture communities. Thermal constraints both shift the optimal intra- and inter-core organization, and dominate other physical constraints such as pin-bandwidth and power delivery. Different thermal constraints also require different optimization strategies. For aggressive cooling solutions, reducing power density is at least as important as reducing total power, while for low-cost cooling solutions, reducing total power is more important.
Fang Chi, Sharon Kedar, Susan Owen, Gu Wei, David Brooks, and Jonathan Lees. 12/18/2006. “System-on-chip architecture design for intelligent sensor networks.” In 2006 International Conference on Intelligent Information Hiding and Multimedia, Pp. 579–582. IEEE.
While wireless sensor networks can generically be used for a wide variety of applications, breakthrough innovations are most often achieved when driven by a genuine need or application, with its specific system-level and science-related requirements and objectives. Hence, our work focuses on the development of wireless sensor network system-on-chip devices and supporting software for volcano monitoring, which we call Sensor Network for Active Volcanoes (SNAV). In this paper we present preliminary results of our research and development work on intelligent sensor networks for monitoring hazardous environments, especially the SNAV system-on-chip design for active volcano monitoring.
Xiaoyao Liang and David Brooks. 12/9/2006. “Mitigating the impact of process variations on processor register files and execution units.” In 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39), Pp. 504–514. IEEE.
Design variability due to die-to-die and within-die process variations has the potential to significantly reduce the maximum operating frequency and the effective yield of high-performance microprocessors in future process technology generations. One serious manifestation of this increased variability is a reduction in the mean frequency of fabricated chips due to fluctuations in device characteristics causing reduced circuit performance. In this paper, we propose to mitigate the impact of variations through variable-latency register files and execution units, which are key architectural components that may encounter variability problems. We also illustrate the importance of closing the gap in expected delay of these distinct structures. A post-fabrication test and configuration strategy is proposed. We find that a 23% mean frequency improvement with an average IPC loss of 3% (and never exceeding 5% for worst-case chips) is possible for the 65nm technology node by properly adopting the proposed schemes.
Benjamin Lee and David Brooks. 12/2006. “Accurate and efficient regression modeling for microarchitectural performance and power prediction.” In ACM SIGOPS Operating Systems Review, 40, 5, Pp. 185–194. ACM.

We propose regression modeling as an efficient approach for accurately predicting performance and power for various applications executing on any microprocessor configuration in a large microarchitectural design space. This paper addresses fundamental challenges in microarchitectural simulation cost by reducing the number of required simulations and using simulated results more effectively via statistical modeling and inference. Specifically, we derive and validate regression models for performance and power. Such models enable computationally efficient statistical inference, requiring the simulation of only 1 in 5 million points of a joint microarchitecture-application design space while achieving median error rates as low as 4.1 percent for performance and 4.3 percent for power. Although both models achieve similar accuracy, the sources of accuracy are strikingly different. We present optimizations for a baseline regression model to obtain (1) application-specific models to maximize accuracy in performance prediction and (2) regional power models leveraging only the most relevant samples from the microarchitectural design space to maximize accuracy in power prediction. Assessing sensitivity to the number of samples simulated for model formulation, we find fewer than 4,000 samples from a design space of approximately 22 billion points are sufficient. Collectively, our results suggest significant potential in accurate and efficient statistical inference for microarchitectural design space exploration via regression models.

Xiaoyao Liang and David Brooks. 11/5/2006. “Microarchitecture parameter selection to optimize system performance under process variation.” In IEEE/ACM International Conference on Computer-Aided Design (ICCAD'06), Pp. 429–436. IEEE.

Design variability due to within-die and die-to-die process variations has the potential to significantly reduce the maximum operating frequency and the effective yield of high-performance microprocessors in future process technology generations. This variability manifests itself by increasing the number and criticality of long delay paths. To quantify this impact, we use an architectural process variation model that is appropriate for the analysis of system performance in the early stages of the design process. We propose a method of selecting microarchitectural parameters to mitigate the frequency impact due to process variability for distinct structures, while minimizing IPC (instructions-per-cycle) loss. We propose an optimization procedure to be used for system-level design decisions, and we find that joint architecture and statistical timing analysis can be more advantageous than pure circuit-level optimization. Overall, the technique can improve the 90% yield frequency by about 14% with 3% IPC loss for a baseline machine with a 20 FO4 logic depth per pipestage. This approach is sensitive to the selection of processor pipeline depth, and we demonstrate that machines with aggressive pipelines will experience greater challenges in coping with process variability.

Lukasz Strozek and David Brooks. 10/22/2006. “Efficient architectures through application clustering and architectural heterogeneity.” In Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, Pp. 190–200. ACM.

Customizing architectures for particular applications is a promising approach to yield highly energy-efficient designs for embedded systems. This work explores the benefits of architectural customization for a class of embedded architectures typically used in energy-constrained application domains such as sensor node and multimedia processing. We implement a process flow that analyzes runtime profiles of applications and combines this information with a model for our architectural design space providing a robust customization engine built upon a fully automated method for determining an efficient architecture (together with appropriate application transformations). By profiling embedded benchmarks from a variety of sensor and multimedia applications, the paper shows the relative energy savings resulting from various architectural optimizations and identifies the number of architectures that achieves near-optimal savings for a group of applications. This paper proposes the use of heterogeneous chip-multiprocessors as a cost-effective approach to capitalize on the potential energy savings provided by application customization while executing a range of applications efficiently.

BC Lee and David Brooks. 10/20/2006. “Wild and Crazy Ideas Session-Session 5-Estimation and Prediction of Power and Performance-Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction.” SIGOPS Operating Systems Review, 40, 5, Pp. 185–194.

Mark Hempstead, Gu Wei, and David Brooks. 10/2006. “Architecture and circuit techniques for low-throughput, energy-constrained systems across technology generations.” In Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, Pp. 368–378. ACM.

Rising interest in the applications of wireless sensor networks has spurred research in the development of computing systems for low-throughput, energy-constrained applications. Unlike traditional performance oriented applications, sensor network nodes are primarily constrained by operation lifetime, which is limited by power consumption. Advanced CMOS process technologies provide ever increasing transistor density and improved performance characteristics. However, shrinking feature size and decreasing threshold voltages also lead to significant increases in leakage current, which is especially troublesome for applications with significant idle times. This work investigates tradeoffs between leakage and active power for low-throughput applications. We study these issues across a range of process technologies on a computing architecture that provides explicit support for fine-grain leakage-control techniques such as Vdd-gating and adaptive body bias. We present a methodology for selecting design parameters, including choice of process technology, that makes the optimal tradeoff between active power and leakage power for a given workload. Our results show that leakage power will dominate the selection of process technology, and architectures that support advanced leakage control techniques at the circuit level will be essential. We argue that without advanced low-power architectures future nano-scale process technologies will not be suited for sensor network applications.

Benjamin Lee, David Brooks, Bronis Supinski, and Martin Schulz. 9/29/2006. “Regression Modeling Strategies for Parameter Space Exploration.”
Increasing system and algorithmic complexity, combined with a growing number of tunable application parameters, poses significant challenges for analytical performance modeling. This report outlines a series of robust techniques that enable efficient parameter space exploration based on empirical statistical modeling. In particular, this report applies statistical techniques such as clustering, association, and correlation analyses to understand the parameter space better. Results from these statistical techniques guide the construction of piecewise polynomial regression models. Residual and significance tests ensure the resulting model is unbiased and efficient. We demonstrate these techniques in R, a statistical computing environment, for predicting the performance of semicoarsening multigrid. 50 and 75 percent of predictions achieve error rates of 5.5 and 10.0 percent or less, respectively.
Hanumolu Kumar, Gyu Kim, Gu Wei, and Moon Ku. 9/10/2006. “A 1.6 Gbps digital clock and data recovery circuit.” In IEEE Custom Integrated Circuits Conference 2006, Pp. 603–606. IEEE.
A digital clock and data recovery circuit employs simple 3-level digital-to-analog converters to interface the digital loop filter to the voltage controlled oscillator and achieves low jitter performance. A test chip fabricated in a 0.13μm CMOS process achieves BER < 10^-12, ±1500ppm lock-in range, ±2500ppm tracking range, recovered clock jitter of 8.9ps rms and consumes 12mW power from a single-pin 1.2V supply, while operating at 1.6Gbps.
TanYuan and Gu Wei. 9/10/2006. “Adaptive-bandwidth mixing PLL/DLL based multi-phase clock generator for optimal jitter performance.” In IEEE Custom Integrated Circuits Conference 2006, Pp. 749–752. San Jose, CA, USA: IEEE.
This paper presents an adaptive-bandwidth mixing PLL/DLL (MX-PDLL) based multi-phase clock generator that can operate as a PLL, DLL, or a mixture of the two. Moreover, this clock generator can be used in a proposed dual-loop CDR to minimize output clock jitter under various noise environments. A test-chip prototype of the MX-PDLL and a 360° phase rotator was fabricated in a 0.18μm CMOS process, operating off of a 1.8V supply. Experimentally measured results verify that while PLL-mode operation offers the ability to better filter quantization noise from the digital CDR control, shifting towards DLL-mode operation offers the ability to reduce jitter as the amount of on-chip noise increases.
TanYuan and Gu Wei. 9/10/2006. “Phase mismatch detection and compensation for PLL/DLL based multi-phase clock generator.” In IEEE Custom Integrated Circuits Conference 2006, Pp. 417–420. San Jose, CA, USA: IEEE.
Device mismatch and systematic imbalances in the physical design can cause static phase mismatch in a PLL/DLL based multi-phase clock generator and degrade performance. This problem gets worse in deep sub-micron technologies. Interleaved transceiver architectures require precise clocking to maximize data rate and minimize bit errors. In this paper, a static phase mismatch compensation scheme for multiple sampling clocks is proposed and tested in an adaptive-bandwidth mixing PLL/DLL based multi-phase clock generator. The proposed charge pump compensator and power-efficient phase-averaging network together reduce the static phase mismatch standard deviation by 37% when operating in DLL mode. A simple and robust duty-cycle correction circuit exhibits a small residual error of 0.65% across a wide range (36% to 49%) of input clock duty-cycle values.
Howard W, Gu Wei, Dally J, and Horowitz Paul. 9/10/2006. “Pulsenet-A Parallel Flash Sampler and Digital Processor IC for Optical SETI.” In IEEE Custom Integrated Circuits Conference 2006, Pp. 261–264. IEEE.

PulseNet is a full-custom IC with parallel flash ADC and digital processing that enables an all-sky optical search for extraterrestrial intelligence. It integrates 448 sense amplifiers that digitize 32 analog signals at 1GS/s, and other circuits that filter samples, store candidate signals, and perform astronomical observations. Its ~250,000 CMOS transistors (TSMC 0.25μm) dissipate 1.1W at 400MHz and 2.5V.

B Lee and David Brooks. 6/18/2006. “Statistically rigorous regression modeling for the microprocessor design space.” ISCA-33: Workshop on Modeling, Benchmarking, and Simulation.
Regression models enhance existing techniques in detailed microarchitectural simulation by reducing the number of required simulations and using simulation data more efficiently to identify trends and trade-offs. We present a rigorous derivation of such models for microprocessor performance and power prediction, emphasizing the need to apply domain-specific knowledge when performing statistical inference. In particular, we propose sampling observations uniformly at random from a large design space, discuss approaches for identifying statistically significant predictors, and detail strategies for effectively modeling predictor interaction and non-linearity. The resulting models enable computationally efficient statistical inference, requiring the simulation of only 1 in every 5 million points of a joint microarchitecture-application design space while achieving median prediction error rates as low as 4.1 percent for performance and 4.3 percent for power.
Hanumolu P, Wei Y, and U-K Moon. 6/15/2006. “A wide tracking range 0.2-4Gbps clock and data recovery circuit.” In 2006 Symposium on VLSI Circuits, Digest of Technical Papers, Pp. 71–72. IEEE.

A hybrid analog and digital quarter-rate clock and data recovery circuit employs a second-order digital loop filter with delta-sigma truncation to achieve sub-ps phase resolution and better than 2ppm frequency resolution. A test chip fabricated in a 0.18μm CMOS process achieves BER < 10^-12 and consumes 14mW power while operating at 2Gbps. The tracking range is greater than ±5000 ppm and ±2500 ppm at 10kHz and 20kHz modulation frequencies respectively, making this CDR suitable for systems with spread spectrum clocking.

Yingmin Li, Benjamin Lee, David Brooks, Zhigang Hu, and Kevin Skadron. 2/11/2006. “CMP design space exploration subject to physical constraints.” In The Twelfth International Symposium on High-Performance Computer Architecture, Pp. 17–28. Austin, TX, USA: IEEE.
This paper explores the multi-dimensional design space for chip multiprocessors, exploring the inter-related variables of core count, pipeline depth, superscalar width, L2 cache size, and operating voltage and frequency, under various area and thermal constraints. The results show the importance of joint optimization. Thermal constraints dominate other physical constraints such as pin-bandwidth and power delivery, demonstrating the importance of considering thermal constraints while optimizing these other parameters. For aggressive cooling solutions, reducing power density is at least as important as reducing total power, while for low-cost cooling solutions, reducing total power is more important. Finally, the paper shows the challenges of accommodating both CPU-bound and memory-bound workloads on the same design. Their respective preferences for more cores and larger caches lead to increasingly irreconcilable configurations as area and other constraints are relaxed; rather than accommodating a happy medium, the extra resources simply encourage more extreme optimization points.
Qiang Wu, Margaret Martonosi, Douglas Clark, Vijay Reddi, Dan Connors, Youfeng Wu, Jin Lee, and David Brooks. 1/2006. “Dynamic-compiler-driven control for microprocessor energy and performance.” Micro, IEEE, 26, 1, Pp. 119–129.
A general dynamic-compilation environment offers power and performance control opportunities for microprocessors. The authors propose a dynamic-compiler-driven runtime voltage and frequency optimizer. A prototype of their design, implemented and deployed in a real system, achieves energy savings of up to 70 percent.
Xiaoyao Liang and David Brooks. 2006. “Latency adaptation for multiported register files to mitigate the impact of process variations.” Workshop on Architectural Support for Gigascale Integration (ASGI-06, held in conjunction with ISCA-33).

Design variability due to die-to-die and within-die process variations has the potential to significantly reduce the maximum operating frequency and the effective yield of high-performance microprocessors in future process technology generations. This variability manifests itself by increasing the frequency variance and decreasing the mean frequency of fabricated chips. In this paper we develop a model for the impact of variability on the performance of multiported SRAM-based structures such as physical register files, which are key architectural components that may encounter variability problems. We find that naively resizing or increasing the access latency of these performance-critical datapath resources can have frequency benefits, but may incur a significant IPC loss that limits overall system performance. We propose an extension to latency adaptation called port switching, which more efficiently exploits the technique to remedy the IPC loss. We find that even under a conservative, worst-case study, an 18% mean frequency improvement with less than 5% IPC loss is possible for the 65nm technology node. Finally, we contrast the impact of die-to-die and within-die variations on chip performance and demonstrate that the proposed technique can compensate for the frequency loss mainly due to within-die variations.

Yingmin Li, Kevin Skadron, Benjamin Lee, and David Brooks. 2006. “Quantifying Latency and Throughput Compromises in CMP Design.”

Designers of chip multiprocessors will increasingly be called upon to optimize for a combination of design metrics under a variety of design constraints. The adoption of chip multiprocessors has also led to a shift in design metrics toward aggregate throughput and away from single thread latency. We examine the compromises between latency and throughput under various power, thermal, area, and bandwidth constraints to quantify the latency penalties of a purely throughput optimized design. We consider a large chip multiprocessor design space that includes core count, core complexity (pipeline dimensions, in-order versus out-of-order execution), and cache hierarchy sizes. We demonstrate an approach to effectively assess trade-offs given a comprehensive core model, a set of optimization criteria, and a set of design constraints. We perform a number of case studies to evaluate these trade-offs, exposing significant single thread latency penalties when optimizing solely for throughput and neglecting other measures of performance. As single thread latency continues to be one of several design metrics, any choice to compromise latency should be well understood before implementation. Collectively, our results suggest single thread latency is still a design metric of importance given that optimizing throughput alone will significantly compromise latency. Furthermore, the case for simple, in-order cores should be taken with caution given this balanced view of performance.

Benjamin Lee and David Brooks. 2006. “Regression modeling strategies for microarchitectural performance and power prediction.” Proceedings of the 2006 ASPLOS Conference, Pp. 185–194.

We propose regression modeling as an effective approach for accurately predicting performance and power for various applications executing on any microprocessor configuration in a large microarchitectural design space. This report addresses fundamental challenges in microarchitectural simulation costs via statistical modeling. Specifically, we derive and validate regression models for performance and power. Such models enable computationally efficient statistical inference, requiring the simulation of only 1 in 5 million points of a joint microarchitecture-application design space while achieving error rates as low as 4.1 percent for performance and 4.3 percent for power. Although both models achieve similar accuracy, the sources of accuracy are strikingly different. We present optimizations for a baseline regression model to obtain (1) per-benchmark, application-specific models designed to maximize accuracy in performance prediction and (2) regional power models leveraging only the most relevant samples from the microarchitectural design space to maximize accuracy in power prediction. Assessing model sensitivity to sample and region sizes, we find that 4,000 samples from a design space of approximately 22 billion points are sufficient for both application-specific and regional modeling and prediction. Collectively, our results suggest significant potential in accurate and efficient statistical inference for microarchitectural design space exploration via regression models.
