Archive for the ‘Publications’ Category

Reducing communication overhead (1) — shared memory approach

May 26th, 2012 Comments off

In a computer system employing on-chip accelerator processors such as streaming co-processor or reconfigurable hardware co-processor, it is vital to minimize communication overhead lest the communication overhead cancel out any performance improvement by the co-processor accelerator.

Our first cut approach to this problem, tailored to reconfigurable hardware co-processors that we’ve been working with, is an on-chip shared memory that is little more intelligent than just a scratch-pad memory but less so than a cache. We call it Configurable Range Memory (CRM). Here is an article introducing its basic ideas.

Application-specific hardware and reconfigurable processors can dramatically speed up compute-intensive kernels of  applications, offloading the burden of main processor.  To minimize the communication overhead in such a coprocessor  approach, the two processors can share an on-chip memory, which may be considered by each processor as a scratchpad  memory.  However, this setup poses a significant challenge to the main processor, which now must manage data on the  scratchpad explicitly, often resulting in superfluous data copy.  This paper presents an enhancement to scratchpad,  called Configurable Range Memory (CRM), that can reduce the need for explicit management and thus reduce data  copy and promote data reuse on the shared memory.  Our experimental results using benchmarks from DSP and multimedia  applications demonstrate that our CRM architecture can significantly reduce the communication overhead compared to the  architecture without shared memory, while not requiring explicit data management.


Read the full text via this.

Categories: Publications Tags:

PICA: Processor Idle Cycle Aggregation for Low Power

February 2nd, 2011 Comments off

Processor Idle Cycle Aggregation (PICA) is a promising approach for low power execution of processors, in which small memory stalls are aggregated to create a large one, and the processor is switched to low-power mode in it.  We extend the previous proposed approach in two dimensions.  i) We develop static analysis for the PICA technique and present optimum parameters for five common types of loops based on steady-state analysis.  ii) We show that software only control is unable to guarantee its correctness in a varying runtime environment, potentially causing deadlocks.  We enhance the robustness of PICA with minimal hardware extension, ensuring correct execution for any loops and parameters, which greatly facilitates exploration based parameter optimization.  The combined use of our static analysis and exploration based fine-tuning makes the PICA technique applicable, to any memory-bound loop, with energy reduction.  We validate our analytical models against simulation based optimization and also show through our experiments on embedded application benchmarks, that our technique can be applied to a wide range of loops with average 20% energy reductions compared to executions without PICA.
This work was presented in the CODES+ISSS conference, which was held in Atlanta, Georgia, as part of 2008 ESWEEK.  Read the paper…
Categories: Publications Tags:

Pure Software Approach to Reducing Transient Faults in Register Files

January 3rd, 2011 Comments off

Device miniaturization is causing significant problems in semiconductor reliability. One particularly nasty problem is what is called transient fault — transient as opposed to permanent because these kinds of faults or errors happen only temporarily. So you may experience this kind of error one time, but you may not experience the same error when you do the same operation again, thus no reproducibility. This can pose a very serious challenge to “testing”, and equally challenging is how to mitigate the effects of such transient errors at runtime. The question that is traditionally asked is i) how to detect such errors and ii) how to correct computation once they are detected.

A very different approach to the same problem is, to try to reduce the rate of such errors.. say to 1/100 times, because if the errors happen very rarely it may not be a problem. This can be done as easily as by recompiling the program… Sounds intriguing? For m0re detail, please check this out: “A compiler optimization to reduce soft errors in register files,” ACM SIGPLAN Notices, Vol. 44, No. 7, pp. 41-49, by Jongeun Lee and Aviral Shrivastava, 2009.

Register file (RF) is extremely vulnerable to soft errors, and traditional redundancy based schemes to protect the RF are prohibitive not only because RF is often in the timing critical path of the processor, but also since it is one of the hottest blocks on the chip, and therefore adding any extra circuitry to it is not desirable.  Pure software approaches would be ideal in this case, but previous approaches that are based on program duplication have very significant runtime overheads, and others based on instruction scheduling are only moderately effective due to local scope.  We show that the problem of protecting registers inherently requires inter-procedural analysis, and intra-procedural optimization are ineffective.  This paper presents a pure compiler approach, based on inter-procedural code analysis to reduce the vulnerability of registers by temporarily writing live variables to protected memory.  We formulate the problem as an integer linear programming problem and also present a very efficient heuristic algorithm.  Our experiments demonstrate that our proposed technique can reduce the vulnerability of the RF by 33~37% on average and up to 66%, with a small 2% increase in runtime.  In addition, our overhead reduction optimizations can effectively reduce the code size overhead, by more than 40% on average, to a mere 5~6%, as compared to highly optimized binaries.

Dynamic vs. static view of a program, used to analyze the effect of compilation on the transient errors.

Dynamic vs. Static view of a program. Transient error can be best defined/understood in the dynamic view (left) of the program, but compilers can only see the static view (right), thus the challenge of this approach.

Categories: Publications Tags:

Memory-Aware Mapping for Reconfigurable Architectures

October 13th, 2010 Comments off

We presented a first approach to optimizing software for the memory architecture of the target reconfigurable computing system, in HiPEAC 2010, held in Pisa, Italy. The motivation of this work is that often in multimedia applications the performance bottleneck is in data transfer, not in computation per se. While previously we have tried only to maximize computation rate only, it may be better to sacrifice computation rate a little to increase data transfer rate if there is a trade-off between computation and data transfer as in the case of CGRA mapping. We targeted our compiler for a ADRES-like architecture, with a slightly simplified local memory subsystem:  double-buffered, multi-banked, and the banks of the local memory are one-to-one mapped to the load-store units of the reconfigurable architecture.

Coarse-Grained Reconfigurable Arrays (CGRAs) are a very promising platform, providing both, up to 10-100 MOps/mW of power efficiency and are software programmable. However, this cardinal promise of CGRAs critically hinges on the effectiveness of application mapping onto CGRA platforms. While previous solutions have greatly improved the computation speed, they have largely ignored the impact of the local memory architecture on the achievable power and performance. This paper motivates the need for memory-aware application mapping for CGRAs, and proposes an effective solution for application mapping that considers the effects of various memory architecture parameters including the number of banks, local memory size, and the communication bandwidth between the local memory and the external main memory. Our proposed solution achieves 62% reduction in the energy-delay product, which factors into about 47% and 28% reduction in the energy consumption and runtime, respectively, as compared to memory-unaware mapping for realistic local memory architectures. We also show that our scheme scales across a range of applications, and memory parameters.

Read the full paper: “Memory-Aware Application Mapping on Coarse-Grained Reconfigurable Arrays,” Lecture Notes in Computer Science (HiPEAC ’10), Vol. 5952, pp. 171-185, by Yongjoo Kim, Jongeun Lee, Aviral Shrivastava, Jonghee W. Yoon and Yunheung Paek, 2010.

Categories: Publications Tags: , ,

Circular Stack Management for Scratchpad Memory

August 20th, 2010 Comments off
A nice Stack Management technique for a limited local memory architecture like IBM Cell, is presented at ASP-DAC and in IEEE Trans. CAD in 2009. The pointer problem was extremely difficult to handle correctly, but we managed to do it, at least from the correctness point of view.

A dynamic scratch pad memory (SPM) management scheme for program stack data with the objective of processor power  reduction is presented.  Basic  technique  does  not  need the SPM size at compile time, does not mandate any hardware changes, does not need profile information, and seamlessly integrates support for recursive functions. Stack frames are managed using a software SPM manager, integrated into the application binary, and shows average energy savings of 32% along with a performance improvement of 13%, on benchmarks from MiBench. SPM management  can  be further optimized and made pointer-safe, by knowing the SPM size.

Read the full paper: “A Software-Only Solution to Use Scratch Pads for Stack Data,” by Aviral Shrivastava, Arun Kannan, and Jongeun Lee*, published in IEEE Transactions on CAD, vol. 28, no. 11, pp. 1719-1727, November 2009.

Categories: Multicore, Publications Tags: , , , ,