The HELIX Project
The HELIX project is an automatic parallelization framework. It consists of four major components:
- HELIX, a parallelizing compiler that uncovers parallelism among loop iterations.
- ILDJIT, a compilation framework using a high-level intermediate representation for easy code analyses and transformations.
- RingCache, a lightweight microarchitectural enhancement that enables fast core-to-core communication of machine words.
- XIOSim, a multicore x86 performance simulator that models in-order and out-of-order cores with RingCache.
By conventional definition, a parallel program is either expressed in terms of explicit parallel threads or tasks or is heavily annotated to guide compilers in mapping its data and control structures to parallel hardware. Research in recent years (such as the Liberty project at Princeton), however, has shown that in a very practical sense, every program is a parallel program, even one that has been designed and implemented with sequential semantics. Every long-running program depends on loops, and an increasing body of work demonstrates that automatic parallelization of loops, without help from the programmer, can lead to substantial speedup of the overall program.
HELIX runs a loop in parallel by assigning its iterations to separate processing elements (cores). In general, the cores that handle separate iterations must communicate, both to synchronize and to exchange data. Thus, successful parallelization of a loop depends on whether the benefit of running it in parallel outweighs the communication costs. Historically, this approach was not widely used because the cost of communication greatly outweighed the benefits of parallel execution. On a modern multicore chip, however, inter-core communication costs have decreased dramatically, making the approach more viable. We couple this with heuristics that select the most profitable loops for parallelization and with many code transformations that automatically parallelize sequential code. Real system measurements show an average speedup of over 2x on six cores for the SPEC CPU2000 benchmark suite.
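The execution model described above can be illustrated with a minimal sketch. It is not code generated by HELIX. It assumes round-robin assignment of iterations to workers, uses Python threads as stand-ins for cores, and uses a hypothetical function name (`run_loop_in_parallel`). Each iteration has an independent part that can run concurrently and a short sequential segment that updates loop-carried state, so every iteration waits for a signal from its predecessor before entering that segment.

```python
import threading


def run_loop_in_parallel(values, num_cores=4):
    """Running sum of squares, with iterations spread across worker threads.

    Illustrative only: threads stand in for cores, and the round-robin
    iteration assignment is an assumption of this sketch.
    """
    n = len(values)
    # One event per iteration, set once its sequential segment has finished.
    done = [threading.Event() for _ in range(n)]
    shared = {"total": 0}   # loop-carried state shared between iterations
    prefix = [0] * n        # per-iteration results

    def worker(core_id):
        # This worker handles iterations core_id, core_id + num_cores, ...
        for i in range(core_id, n, num_cores):
            squared = values[i] * values[i]   # independent (parallel) part
            if i > 0:
                done[i - 1].wait()            # synchronize with iteration i-1
            shared["total"] += squared        # sequential segment
            prefix[i] = shared["total"]
            done[i].set()                     # signal iteration i+1

    threads = [threading.Thread(target=worker, args=(c,))
               for c in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return prefix


if __name__ == "__main__":
    print(run_loop_in_parallel([1, 2, 3, 4, 5], num_cores=3))
```

The wait and signal on each iteration are the communication cost discussed above: the shorter the independent part is relative to that handoff, the less there is to gain from running the loop in parallel.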
Since our original CGO publication, we have dramatically improved our compiler framework, thereby improving speedups. Please refer to the publications on this page for more details.
- (2015): HELIX-UP: Relaxing Program Semantics to Unleash Parallelization. International Symposium on Code Generation and Optimization (CGO), 2015.
- (2012): HELIX: Automatic parallelization of irregular programs for chip multiprocessing. In: International Symposium on Code Generation and Optimization (CGO), ACM 2012.
- (2012): The HELIX project: overview and directions. In: Design Automation Conference (DAC), ACM 2012.
ILDJIT (Intermediate Language Distributed Just In Time) is a unified compilation framework for bytecode languages designed and developed by Simone Campanoni. It includes static, dynamic and ahead-of-time compilers. ILDJIT has been designed for multicores from the ground up: it can exploit multicore machines, taking advantage of physical parallelism to hide both static and dynamic compilation latencies. It is also highly extensible: all compiler components are plugins, making it simple to add new code analyses or transformations. ILDJIT starts from CIL bytecode and uses LLVM for backend code generation.
- (2008): A parallel dynamic compiler for CIL bytecode. ACM SIGPLAN Notices, ACM 2008.
Our original HELIX implementation was only able to achieve meaningful speedups on SPECfp benchmarks, because these tend to be numerical in nature and more naturally data-parallel. For SPECint workloads, the control flow complexity in larger loops was too great for our code analysis techniques to achieve sufficient accuracy. Code analysis is much more accurate for small loops (averaging fewer than 50 cycles per iteration), and parallelizing them could potentially unlock a large amount of parallelism. However, communication between cores via the cache coherence protocol was measured to take at least 75 cycles, which means it would take more time to communicate data than to complete a single loop iteration. To solve this problem, we designed HELIX-RC, which combines HELIX with RingCache, a lightweight microarchitectural enhancement that sits between each core and its private L1 data cache. RingCache connects all cores in the system in a ring topology and proactively forwards potentially shared data across the network with single-cycle latencies, which largely erases communication costs. HELIX-RC achieves significant speedups on SPECint benchmarks, a result not previously demonstrated in the literature.
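A back-of-the-envelope model makes the cycle counts above concrete. This is a simplified sketch, not the profitability heuristic used by HELIX. It assumes that each iteration must receive one value from its predecessor, so consecutive iterations cannot start their sequential segments closer together than the segment length plus the communication latency. The function name and the 5-cycle sequential segment are hypothetical; the 50-cycle iteration, the 75-cycle coherence latency and the six cores come from the text.

```python
def estimated_speedup(iter_cycles, seq_cycles, comm_cycles, num_cores):
    """Rough speedup estimate for a loop with one loop-carried dependence.

    Simplifying assumption of this sketch: steady-state time per iteration
    is bounded below by both the evenly divided work and the dependence
    chain (sequential segment plus one inter-core transfer).
    """
    ideal = iter_cycles / num_cores        # work divided evenly across cores
    chain = seq_cycles + comm_cycles       # handoff between adjacent iterations
    return iter_cycles / max(ideal, chain)


if __name__ == "__main__":
    # 50-cycle iterations, hypothetical 5-cycle sequential segment, 6 cores.
    via_coherence = estimated_speedup(50, 5, 75, 6)   # 75-cycle transfers
    via_fast_ring = estimated_speedup(50, 5, 1, 6)    # single-cycle transfers
    print(f"cache coherence: {via_coherence:.2f}x")
    print(f"single-cycle forwarding: {via_fast_ring:.2f}x")
```

Under these assumptions the 75-cycle transfer makes the parallel version slower than sequential execution (an estimate below 1x), while a single-cycle transfer leaves the loop limited by the available cores rather than by communication. The model ignores effects such as hop count on the ring and contention.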
- (2014): HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs. In: International Symposium on Computer Architecture (ISCA), 2014.
XIOSim is a detailed microarchitectural x86 simulator. Its main design objective is detail in simulation, leading to high accuracy, sometimes at the expense of simulation speed. It includes in-order (Atom-style) and out-of-order (Nehalem-style) core models, an extensive cache model, as well as DRAM models (DRAMSim2), power models (McPAT) and voltage models. XIOSim is a user-level simulator that uses Pin as its functional model, but it can also handle multi-programmed workloads, modeling OS details like thread scheduling and core allocation.
People: Svilen Kanev, Kevin Brownell, Simone Campanoni, Sam Xi
- (2012): XIOSim: power-performance modeling of mobile x86 cores. In: International Symposium on Low Power Electronics and Design (ISLPED), ACM 2012.