Publications

2009
Michael Karpelson, Gu-Yeon Wei, and Robert J. Wood. 5/12/2009. “Milligram-scale high-voltage power electronics for piezoelectric microrobots.” In 2009 IEEE international conference on robotics and automation, Pp. 2217–2224. IEEE.

Compact yet powerful actuators are vital in many robotic applications, particularly small-scale autonomous systems such as bio-inspired microrobots. In recent years, a number of actuation methods have been proposed or applied in a microrobotic context, including piezoelectric [1], electrostatic [2], and dielectric elastomer actuators [3]. These actuation methods have the potential to achieve high efficiencies and power densities in very small geometries. Piezoelectric actuators in particular have shown promise in applications with very stringent weight and power density requirements, such as the Harvard Microrobotic Fly (HMF)—a flapping-wing robotic insect capable of liftoff with external power [4].

In order to produce mechanical output, the actuation methods mentioned above rely on the presence of electric charge on various electrodes in order to either generate high electric fields, as in the case of piezoelectric actuators, or high electrostatic forces, as in the case of electrostatic and dielectric elastomer actuators. Moreover, the geometries of such actuators inherently produce significant electrical capacitance, and therefore high operating voltages are usually necessary to accumulate a sufficient amount of charge on the actuator electrodes, ranging from tens to thousands of volts. For example, the piezoelectric actuators used in the HMF require drive voltages in the range of 200–300V. There are two major challenges in the design of power electronics capable of driving capacitive actuators: generating high voltages from low-voltage sources and recovering unused energy from the actuator.

Most compact energy sources suitable for microrobotic applications, such as lithium batteries, supercapacitors [5], solar cells [6], and fuel cells [7], generate output voltages below 5V. Connecting many such cells in series to obtain high voltage is generally not practical because the packaging overhead causes a significant reduction in energy density. Consequently, the generation of high voltages for HMF actuators requires voltage conversion circuits with step-up ratios ranging from 50 to 100. While there are a number of circuit topologies with high step-up ratios, many of them cannot be easily miniaturized and/or suffer from poor efficiency at the low output power levels common in microrobotic applications. Careful selection and optimization of the conversion circuit is necessary to ensure that heavy, inefficient electronics do not compromise system performance.

In addition to the voltage step-up functionality, the power electronics circuitry must generate a time-varying signal on the input electrodes of the actuator. The second challenge stems from the fact that, depending on the properties of the actuator, the nature of the mechanical load, and the characteristics of the drive signal, only a small fraction of the electrical energy stored in the actuator is converted into useful mechanical output [8]. In order to maximize overall system efficiency, it is highly desirable to both generate an appropriate drive signal and recover as much of the unused energy as possible, which imposes additional requirements on the drive circuitry.

This paper describes promising power electronics circuits that can generate the high, time-varying voltages necessary for the operation of piezoelectric actuators, while meeting the stringent weight requirements of microrobotic systems and maximizing system efficiency. Although the analysis focuses on piezoelectric actuators, many of the concepts described here can easily be adapted to other high-voltage capacitive actuators, such as electrostatic comb drives or dielectric elastomer actuators. This work reviews the electrical properties and drive requirements of piezoelectric actuators (Section II), and presents power electronics circuits applicable to various types and configurations of piezoelectric actuators (Sections III and IV). Experimental realizations of the drive circuits are described (Section V), including applications to milligram-scale microrobots, such as flapping-wing robotic insects.

Meeta Gupta, Vijay Reddi, Glenn Holloway, Gu-Yeon Wei, and David Brooks. 4/20/2009. “An event-guided approach to reducing voltage noise in processors.” In Design, Automation & Test in Europe Conference & Exhibition, 2009. DATE'09, Pp. 160–165. Nice, France: IEEE.
Kevin Brownell, Ali Khan, David Brooks, and Gu-Yeon Wei. 3/16/2009. “Place and route considerations for voltage interpolated designs.” In Quality of Electronic Design, 2009. ISQED 2009. Quality Electronic Design, Pp. 594–600. IEEE.

Voltage interpolation is a promising post-fabrication technique for combating the effects of process variations. The benefits of voltage interpolation are well understood. Its implementation in a VLSI-CAD flow has been considered through the synthesis stage. In this paper we study the implications of place and route on voltage interpolation. We evaluate multiple placement strategies and conclude that a hybridization of forced placement and cluster boxing techniques results in minimum overhead.

Lukasz Strozek and David Brooks. 3/2009. “Energy- and area-efficient architectures through application clustering and architectural heterogeneity.” ACM Transactions on Architecture and Code Optimization (TACO), 6, 1, Pp. 4.

Customizing architectures for particular applications is a promising approach to yield highly energy-efficient designs for embedded systems. This work explores the benefits of architectural customization for a class of embedded architectures typically used in energy- and area-constrained application domains, such as sensor nodes and multimedia processing. We implement a process flow that performs an automatic synthesis and evaluation of the different architectures based on runtime profiles of applications and determines an efficient architecture, with consideration for both energy and area constraints. An expressive architectural model, used by our engine, is introduced that takes advantage of efficient opcode allocation, several memory addressing modes, and operand types. By profiling embedded benchmarks from a variety of sensor and multimedia applications, we show that the energy savings resulting from various architectural optimizations relative to the base architectures (e.g., MIPS and MSP430) are significant and can reach 50%, depending on the application. We then identify the set of architectures that achieves near-optimal savings for a group of applications. Finally, we propose the use of heterogeneous ISA processors implementing those architectures as a solution to capitalize on energy savings provided by application customization while executing a range of applications efficiently.

Vijay Reddi, Meeta Gupta, Glenn Holloway, Michael Smith, Gu-Yeon Wei, and David Brooks. 2/14/2009. “Voltage emergency prediction: Using Signatures to Reduce Operating Margins.” In 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

Inductive noise forces microprocessor designers to sacrifice performance in order to ensure correct and reliable operation of their designs. The possibility of wide fluctuations in supply voltage means that timing margins throughout the processor must be set pessimistically to protect against worst-case droops and surges. While sensor-based reactive schemes have been proposed to deal with voltage noise, inherent sensor delays limit their effectiveness. Instead, this paper describes a voltage emergency predictor that learns the signatures of voltage emergencies (the combinations of control flow and microarchitectural events leading up to them) and uses these signatures to prevent recurrence of the corresponding emergencies. In simulations of a representative superscalar microprocessor in which fluctuations beyond 4% of nominal voltage are treated as emergencies (an aggressive configuration), these signatures can pinpoint the likelihood of an emergency some 16 cycles ahead of time with 90% accuracy. This lead time allows machines to operate with much tighter voltage margins (4% instead of 13%) and up to 13.5% higher performance, which closely approaches the 14.2% performance improvement possible with an ideal oracle-based predictor.

Vijay Reddi, Meeta Gupta, Glenn Holloway, Gu-Yeon Wei, Michael Smith, and David Brooks. 2/14/2009. “Voltage emergency prediction: Using signatures to reduce operating margins.” In 2009 IEEE 15th International Symposium on High Performance Computer Architecture, Pp. 18–29. Raleigh, NC, USA: IEEE.
Vijay Reddi, Meeta Gupta, Krishna Rangan, Simone Campanoni, Glenn Holloway, Michael Smith, Gu-Yeon Wei, and David Brooks. 1/2009. “Voltage noise: Why it’s bad, and what to do about it.” 5th IEEE Workshop on Silicon Errors in Logic-System Effects (SELSE), Palo Alto, CA.
Power constrained designs are becoming increasingly sensitive to supply voltage noise. We propose hardware-software collaboration to enable aggressive voltage margins: a fail-safe hardware mechanism tolerates margin violations in order to train a run-time software layer that reschedules instructions to avoid recurring violations. Additionally, the software controls an emergency signature-based predictor that throttles to suppress emergencies that code rescheduling cannot eliminate.
Xiaoyao Liang, Benjamin Lee, Gu-Yeon Wei, and David Brooks. 2009. “Design and test strategies for microarchitectural post-fabrication tuning.” In Computer Design, 2009. ICCD 2009. IEEE International Conference on, Pp. 84–90. IEEE.
Process variations are a major hurdle for continued technology scaling. Both systematic and random variations will affect the critical delay of fabricated chips, causing a wide frequency and power distribution. Tuning techniques adapt the microarchitecture to mitigate the impact of variations at post-fabrication testing time. This paper proposes a new post-fabrication testing framework that accounts for testing costs. This framework uses on-chip canary circuits to capture systematic variation while using statistical analysis to estimate random variation. We derive regression models to predict chip performance and power. These techniques comprise an integrated framework that identifies the most energy efficient post-fabrication tuning configuration for each chip.
Lukasz Strozek and David Brooks. 2009. “Efficient architectures through application clustering and heterogeneity.” In CASES '06: Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, Pp. 190–200. Citeseer.

Customizing architectures for particular applications is a promising approach to yield highly energy-efficient designs for embedded systems. This work explores the benefits of architectural customization for a class of embedded architectures typically used in energy-constrained application domains such as sensor node and multimedia processing. We implement a process flow that analyzes runtime profiles of applications and combines this information with a model for our architectural design space providing a robust customization engine built upon a fully automated method for determining an efficient architecture (together with appropriate application transformations). By profiling embedded benchmarks from a variety of sensor and multimedia applications, the paper shows the relative energy savings resulting from various architectural optimizations and identifies the number of architectures that achieves near-optimal savings for a group of applications. This paper proposes the use of heterogeneous chip-multiprocessors as a cost-effective approach to capitalize on the potential energy savings provided by application customization while executing a range of applications efficiently.

Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. “Navigo: An early-stage model to study power-constrained architectures and specialization.” Workshop on Modeling, Benchmarking, and Simulation.
As the number of transistors doubles, it becomes difficult to power all of them within a strict power budget and still achieve the performance gains that the industry has achieved historically. This work presents Navigo, a modeling framework for architecture exploration across future process technology generations. The model includes support for voltage and frequency scaling based on ITRS and PTM models. This work is designed to aid architects in the planning stages of next-generation microprocessors by addressing the space between early-stage back-of-the-envelope calculations and later-stage cycle-accurate simulators. Using parameters from existing commercial processor cores, we show how power consumption limits the theoretical throughput of future processors. Navigo shows that specialization is the answer to circumvent the power density limit that curbs performance gains and resume traditional 1.58x performance growth trends. We present analysis, using the next generation of process technologies, showing that the fraction of area that must be allocated for specialization to maintain performance growth must increase with each new generation of process technology.
2008
Benjamin Lee, Jamison Collins, Hong Wang, and David Brooks. 11/8/2008. “CPR: Composable performance regression for scalable multiprocessor models.” In 2008 41st IEEE/ACM International Symposium on Microarchitecture, Pp. 270–281. IEEE.
Uniprocessor simulators track resource utilization cycle by cycle to estimate performance. Multiprocessor simulators, however, must account for synchronization events that increase the cost of every cycle simulated and shared resource contention that increases the total number of cycles simulated. These effects cause multiprocessor simulation times to scale superlinearly with the number of cores. Composable performance regression (CPR) fundamentally addresses these intractable multiprocessor simulation times, estimating multiprocessor performance with a combination of uniprocessor, contention, and penalty models. The uniprocessor model predicts baseline performance of each core while the contention models predict interfering accesses from other cores. Uniprocessor and contention model outputs are composed by a penalty model to produce the final multiprocessor performance estimate. Trained with a production-quality simulator, CPR is accurate, with median errors of 6.63 and 4.83 percent for dual- and quad-core multiprocessors, respectively. Furthermore, composable regression is scalable, requiring 0.33x the simulations required by prior regression strategies.
Benjamin Lee and David Brooks. 10/24/2008. “Roughness of microarchitectural design topologies and its implications for optimization.” In 2008 IEEE 14th International Symposium on High Performance Computer Architecture, Pp. 240–251. IEEE.
Recent advances in statistical inference and machine learning close the divide between simulation and classical optimization, thereby enabling more rigorous and robust microarchitectural studies. To most effectively utilize these now computationally tractable techniques, we characterize design topology roughness and leverage this characterization to guide our usage of analysis and optimization methods. In particular, we compute roughness metrics that require high-order derivatives and multi-dimensional integrals of design metrics, such as performance and power. These roughness metrics exhibit noteworthy correlations (1) against regression model error, (2) against non-linearities and non-monotonicities of contour maps, and (3) against the effectiveness of optimization heuristics such as gradient ascent. Thus, this work quantifies the implications of design topology roughness for commonly used methods and practices in microarchitectural analysis. 
Hayun Chung, Andrew Liu, and Gu-Yeon Wei. 9/21/2008. “A 12.5-Gbps, 7-bit transmit DAC with 4-tap LUT-based equalization in 0.13 μm CMOS.” In 2008 IEEE Custom Integrated Circuits Conference, Pp. 563–566. IEEE.
This paper presents a 12.5-Gbps transmitter that uses a lookup table (LUT)-based equalizer to compensate for within-die imperfections. An equalization technique with 2x sampling is proposed to accommodate timing offsets in the multiphase clocks used for 8:1 serialization. LUT code remapping is also demonstrated to compensate for mismatch effects that introduce nonlinearity in the transmit DAC. Experimental results of a 7-bit resolution transmitter with 4-tap equalization, implemented in 0.13 μm CMOS, show the LUT-based equalizer can significantly improve the signal integrity of an otherwise closed eye for data transmitted at 12.5 Gbps.
Ankur Agrawal, Pavan Kumar Hanumolu, and Gu-Yeon Wei. 9/21/2008. “A 8 x 5 Gb/s source-synchronous receiver with clock generator phase error correction.” In 2008 IEEE Custom Integrated Circuits Conference, Pp. 459–462. IEEE.
This paper describes the design and implementation of an 8 × 5 Gb/s source-synchronous receiver in a 0.13 μm CMOS technology. The receiver employs a cascaded-DLL architecture that avoids filtering of the jitter on the received clock to enhance jitter tolerance bandwidth. A technique is proposed to correct phase spacing mismatch in DLLs that reduces the error standard deviations by more than 40% and improves receiver timing margins.
Ruwan Ratnayake, Aleksandar Kavcic, and Gu-Yeon Wei. 9/16/2008. “A high-throughput maximum a posteriori probability detector.” IEEE Journal of solid-state circuits, 43, 8, Pp. 1846–1858.
This paper presents a maximum a posteriori probability (MAP) detector, based on a forward-only algorithm that can achieve high throughputs. The MAP algorithm is optimal in terms of bit error rate (BER) performance and, with Turbo decoding, can approach performance close to the channel capacity limit. The proposed detector utilizes a deep-pipelined architecture implemented in skew-tolerant domino, and experimentally measured results verify the detector can achieve throughputs greater than 750 MHz while consuming 2.4 W. The detector is implemented in a 0.13 μm CMOS technology and has a die area of 9.9 mm².
Xuning Chen, Gu-Yeon Wei, and Li-Shiuan Peh. 8/11/2008. “Design of low-power short-distance opto-electronic transceiver front-ends with scalable supply voltages and frequencies.” In Proceedings of the 2008 international symposium on Low Power Electronics & Design, Pp. 277–282.
The need for low-power I/Os is widely recognized, as I/Os take up a significant portion of total chip power. In recent years, researchers have pointed to the potential system-level power savings that can be realized if dynamic voltage scalable I/Os are available. However, substantial challenges remain in building such links. This paper presents the design and implementation details of opto-electronic transceiver front-end blocks where supply voltage can scale from 1.2V to 0.6V with almost linearly scalable bandwidth from 8Gb/s to 4Gb/s, and power consumption from 36mW to 5mW in a 130nm CMOS process. To the best of our knowledge, this is the first circuit demonstration of voltage-scalable optical links. It demonstrates the feasibility of dynamic voltage scalable optical I/Os.
Gu-Yeon Wei, David Brooks, Ali Khan, and Xiaoyao Liang. 8/11/2008. “Instruction-driven clock scheduling with glitch mitigation.” In Proceedings of the 13th international symposium on Low power electronics and design (ISLPED '08), Pp. 357–362. ACM.
Instruction-driven clock scheduling is a mechanism that minimizes clock power in deeply-pipelined datapaths. Analysis of realistic processor workloads shows a preponderance of bubbles persist through pipelines like the floating point unit. Clock scheduling ostensibly adapts pipeline depth with respect to bubbles in the instruction stream without performance loss. Unfortunately, shallower pipelines (i.e. longer pipe stages) are prone to larger amounts of glitches propagating through logic, increasing dynamic power. Experimentally measured results from a 130 nm FPU test chip with flexible clocking capabilities show a super-linear increase in glitch-induced dynamic power for shallower pipelines. While higher glitch power can severely diminish the power savings offered by clock scheduling, judicious clocking of intermediate stages offers glitch mitigation to recover power savings for worst-case scenarios. Detailed analysis of clock scheduling applied to a FPU in a POWER4-like processor running realistic workloads shows an average net power savings of 15% compared to an aggressively clock-gated design.
Xiaoyao Liang, Gu-Yeon Wei, and David Brooks. 6/21/2008. “Revival: A variation-tolerant architecture using voltage interpolation and variable latency.” In Computer Architecture, 2008. ISCA'08. 35th International Symposium on, Pp. 191–202. IEEE.
Process variations are poised to significantly degrade performance benefits sought by moving to the next nanoscale technology node. Parameter fluctuations in devices can introduce large variations in peak operation among chips, among cores on a single chip, and among microarchitectural blocks within one core. Hence, it will be difficult to rely only on traditional frequency binning to efficiently cover the large variations that are expected. Furthermore, multiple voltage/frequency domains introduce significant hardware overhead and alone cannot address the full extent of delay variations expected in future multi-core systems. In this paper, we present ReVIVaL, which combines two fine-grained post-fabrication tuning techniques: voltage interpolation (VI) and variable latency (VL). We show that the frequency variation between chips, between cores on one chip, and between functional units within cores can be reduced to a very small range. The effectiveness of these techniques is further verified through experiments on test chips fabricated in a 130 nm CMOS process. Detailed architectural simulations of multi-core processors demonstrate that significant performance and power advantages are possible by combining variable latency with voltage interpolation.
Michael Karpelson, Gu-Yeon Wei, and Robert J. Wood. 5/19/2008. “A review of actuation and power electronics options for flapping-wing robotic insects.” In 2008 IEEE international conference on robotics and automation, Pp. 779–786. IEEE.
Flapping-wing robotic insects require actuators with high power densities at centimeter to micrometer scales. Due to the low weight budget, the selection and design of the actuation mechanism needs to be considered in parallel with the design of the power electronics required to drive it. This paper explores the design space of flapping-wing microrobots weighing 1g and under by determining mechanical requirements for the actuation mechanism, analyzing potential actuation technologies, and discussing the design and realization of the required power electronics. Promising combinations of actuators and power circuits are identified and used to estimate microrobot performance.
Mark Hempstead, Gu-Yeon Wei, and David Brooks. 5/18/2008. “System design considerations for sensor network applications.” In 2008 IEEE International Symposium on Circuits and Systems (ISCAS), Pp. 2566–2569. Seattle, WA: IEEE.
Systems research in the emerging space of wireless sensor networks has exploded. Researchers have deployed nodes composed of a wireless radio, MEMS sensors and low power computation for applications from medical sensing to volcanic monitoring. We must consider several requirements - including the need for inexpensive, long-lasting, highly reliable devices coupled with very low performance requirements - when designing devices for wireless sensor networks. An untethered, fully-integrated node that operates off of energy scavenged from the ambient environment is the ultimate goal. We take an application-driven approach to the design of a wireless sensor network node. Our approach addresses the event-driven nature that is characteristic of many sensor network workloads. We have completed a detailed architectural analysis of this space using a full-system simulator and RTL model. From this analysis, we chose to implement a design that best achieves the power goals and performance requirements of wireless sensor network applications.
