Dynamic voltage and frequency scaling (DVFS) is an effective technique for controlling microprocessor energy and performance. Existing DVFS techniques are primarily based on hardware, OS time-interrupts, or static-compiler techniques. However, substantially greater gains can be realized when control opportunities are also explored in a dynamic compilation environment. There are several advantages to deploying DVFS and managing energy/performance tradeoffs through the use of a dynamic compiler. Most importantly, dynamic compiler driven DVFS is fine-grained, code-aware, and adaptive to the current microarchitecture environment. This paper presents a design framework of the run-time DVFS optimizer in a general dynamic compilation system. A prototype of the DVFS optimizer is implemented and integrated into an industrial-strength dynamic compilation system. The obtained optimization system is deployed in a real hardware platform that directly measures CPU voltage and current for accurate power and energy readings. Experimental results, based on physical measurements for over 40 SPEC or Olden benchmarks, show that significant energy savings are achieved with little performance degradation. SPEC2K FP benchmarks benefit with energy savings of up to 70% (with 0.5% performance loss). In addition, SPEC2K INT show up to 44% energy savings (with 5% performance loss), SPEC95 FP save up to 64% (with 4.9% performance loss), and Olden save up to 61% (with 4.5% performance loss). On average, the technique leads to an energy delay product (EDP) improvement that is 3times-5times better than static voltage scaling, and is more than 2times (22% vs. 9%) better than the reported DVFS results of prior static compiler work. While the proposed technique is an effective method for microprocessor voltage and frequency control, the design framework and methodology described in this paper have broader potential to address other energy and power issues such as di/dt and thermal control
This paper studies the impact on energy efficiency and thermal behavior of design style and clock-gating style in queue and array structures. These structures are major sources of power dissipation, and both design styles and various clock gating schemes can be found in modern, high-performance processors. Although some work in the circuits domain has explored these issues from a power perspective, thermal treatments are less common, and we are not aware of any work in the architecture domain.We study both SRAM and latch and multiplexer ("latch-mux") designs and their associated clock-gating options. Using circuit-level simulations of both design styles, we derive power-dissipation ratios which are then used in cycle-level power/performance/thermal simulations. We find that even though the "unconstrained" power of SRAM designs is always better than latch-mux designs, latch-mux designs dissipate less power in practice when a structure's average occupancy is low but access rate is high, especially when "stall gating" is used to minimize switching power. We also find that latch-mux designs with stall gating are especially promising from a thermal perspective, because they exhibit lower power density than SRAM designs. Overall, when combined with implementation and verification challenges for SRAMs, latch-mux designs with stall gating appear especially promising for designs with thermal constraints. This paper also shows the importance of considering the interaction between architectural and circuit-design choices when performing early-stage design exploration
Recent years have seen a burgeoning interest in embedded wireless sensor networks with applications ranging from habitat monitoring to medical applications. Wireless sensor networks have several important attributes that require special attention to device design. These include the need for inexpensive, long-lasting, highly reliable devices coupled with very low performance requirements. Ultimately, the "holy grail" of this design space is a truly untethered device that operates off of energy scavenged from the ambient environment. In this paper, we describe an application-driven approach to the architectural design and implementation of a wireless sensor device that recognizes the event-driven nature of many sensor-network workloads. We have developed a full-system simulator for our sensor node design to verify and explore our architecture. Our simulation results suggest one to two orders of magnitude reduction in power dissipation over existing commodity-based systems for an important class of sensor network applications. We are currently in the implementation stage of design, and plan to tape out the first version of our system within the next year.
In this tutorial paper we present equalization techniques to mitigate inter-symbol interference (ISI) in high-speed communication links. Both transmit and receive equalizers are analyzed and high-speed circuits implementing them are presented. It is shown that a digital transmit equalizer is the simplest to design, while a continuous-time receive equalizer generally provides better performance. Decision feedback equalizer (DFE) is described and the loop latency problem is addressed. Finally, techniques to set the equalizer parameters adaptively are presented.
As microprocessors become increasingly interconnected, the power consumed by the interconnection network can no longer be ignored. Moreover, with demand for link bandwidth increasing, optical links are replacing electrical links in inter-chassis and inter-board environments. As a result, the power dissipation of optical links is becoming as critical as their speed. In this paper, we first explore options for building high speed optoelectronic links and discuss the power characteristics of different link components. Then, we propose circuit and network mechanisms that can realize power-aware optical links -links whose power consumption can be tuned dynamically in response to changes in network traffic. Finally, we incorporate power control policies along with the power characterization of link circuitry into a detailed network simulator to evaluate the performance cost and power savings of building power aware optoelectronic networked systems. Simulation results show that more than 75% savings in power consumption can be achieved with the proposed power aware optoelectronic network.
Simultaneous multithreading (SMT) and chip multiprocessing (CMP) both allow a chip to achieve greater throughput, but their relative energy-efficiency and thermal properties are still poorly understood. This paper uses Turandot, PowerTimer, and HotSpot to explore this design space for a POWER4/POWER5-like core. For an equal-area comparison with this style of core, we find CMP to be superior in terms of performance and energy-efficiency for CPU-bound benchmarks, but SMT to be superior for memory-bound benchmarks due to a larger L2 cache. Although both exhibit similar peak operating temperatures and thermal management overheads, the mechanism by which SMT and CMP heat up are quite different. More specifically, SMT heating is primarily caused by localized heating in certain key structures, CMP heating is mainly caused by the global impact of increased energy output. Because of this difference in heat up mechanism, we found that the best thermal management technique is also different for SMT and CMP Indeed, non-DVS localized thermal-management can outperform DVS for SMT. Finally, we show that CMP and SMT scales differently as the contribution of leakage power grows, with CMP suffering from higher leakage due to the second core's higher temperature and the exponential temperature-dependence of subthreshold leakage.
We consider processor core complexity and its impli-cations for the power-performance efficiency of SMT and CMP architectures, exploring fundamental trade-offs be-tween the efficiency of multi-core architectures and the com-plexity of their cores from a power-performance perspec-tive. Taking pipeline depth and width as proxies for core complexity, we conduct power-performance simulations of several SMT and CMP architectures employing cores of varying complexity. Our analyses identify efficient pipeline dimensions and outline the implications of using a power-performance efficiency metric for core complexity. Collectively, our results suggest SMT architectures en-able efficient increases in pipeline dimensions and core complexity. Furthermore, reducing pipeline di-mensions in CMP cores is inefficient, assuming ideal power-performance scaling from voltage/frequency scal-ing and circuit re-tuning. Given these conclusions, we formulate guidelines for complexity effective design.