Increasing computational demand on mobile devices calls for energy-friendly solutions for accelerating single programs. In the multicore era, thread level parallelism (TLP) can accelerate single-threaded programs without requiring power-hungry cores. HELIX-RC, a recently proposed co-design between the HELIX parallelizing compiler and its target architecture, shows that substantial TLP can be extracted from loops with small bodies by optimizing core-to-core communication. Previously, the effectiveness of the HELIX-RC approach has been demonstrated through simulation. In this paper, we evaluate a HELIXRC-like solution on a real platform. We have developed a simplified version of the HELIX-RC architecture that we call unified cache, and we have implemented it on an FPGA board. Our design augments a multicore platform with a simplified ring cache—the architectural component of the HELIX-RC co-design. With the aid of microbenchmarks, our FPGA prototype confirms the HELIX-RC findings. David Brooks After describing both the ring cache and the parallel code generated by the HELIX compiler, we sketch the design of the unified cache and we evaluate its implementation on a Xilinx VC707 FPGA board.