You are here: Foswiki>Main Web>SimpleScalar>ResearchTopics>ReconfigurableComputingReadingList>Beck2008 (06 Apr 2011, ToanMai)Edit Attach

Transparent Reconfigurable Acceleration for Heterogeneous Embedded Applications

Introduction
Related Work
Proposed approach
Description of the system
Experiment & Benchmark Resutls
- Testbed
- Results

Introduction

What?
- Binary Translation:
  - transforms sequences of instructions @ run-time
  - implemented in coarse-grained reconfigurable array
  - works in parallel to a MIPS processor
Why?
- Novel solutions for
  - high performance of embedded devices
  - low power dissipation.
- Advantages of Reconfigurable architectures:
  - a sequence of code = combinational logic -> performance gains & energy savings
  - explore the ILP of the apps
  - *speed up sequences of data dependent instructions.
- Problems:
  - Reconfigurable systems is oftenfor some target apps.
    - -> reconfigurable systems must also adapt to increasing number of apps.
  - Transformations often modify the source/binary code
    - -> preclude the wide spread usage of reconfigurable systems.
    - -> To reduce the disign cycle & maintain backward compatibility:
      - sustaining binary compatibility
      - allowing legacy code reuse & traditional programming paradigms
Solution:
- Dynamic Instruction Merging (DIM):
  - Binary Translation (BT):
    - detect sequences of instructions at run-time
    - transform & execute in a reconfigurable array.
  - coarse-grained array -> reduce the configuration complexity.
  - transparent process -> allowing full binary code reuse.
  - no distinct kernel subject of optimization.

Related Work

Dynamic detection & reconfiguration
- To avoid recompilation
- Two ideas:
  - Binary Translation (BT):
    - monitor, analyzing & transforming parts of a running program
  - Trace Reuse:
    - sequences of instructions with the same operands will be repeated constantly during the execution of the program.

Proposed approach

DIM:
- Detect & transform instruction groups
- The configuration:
  - saved in a special cache
  - indexed by the program counter (PC)
- Next time the saved sequence is found:
  - no more analysis
  - The processor:
    - loads the previously stored configuration from the special cache.
    - loads the operands from the register bank
    - activate the reconfigurable HW as FU.
  - The reconfigurable array executes the configuration (including write back of the results).
  - The PC is updated -> continue with the execution fo the normal instructions.
- Entire app can be optimized depending on the size of the special cache.

Description of the system

Architecture of the array

2D Dynamic coarse-grain array
- is tightly coupled to the processor
- works as an additional FU similar to Chimaera.
- -> need no external accesses to the array
Instruction is allocated in an intersection between 1 row & 1 column.
2 data-independent instructions can be in the same row & execute in parallel
Each column:
- is homogeneous
- contain ordinary FUs of a particular type (e.g. ALUs, shifters, etc.)
About Input Operands:
- There's a set of buses
  - receiving values from the registers
  - connected to each FU
- Multiplexer is responsible for choosing the correct input/output value

The binary translation algorithm

Starts working on the 1st instruction found after a branch execution
Stops the translation when detecting an unsupported instruction or another branch
If > 3 instructions found:
- a new entry in the cache (based on FIFO) is created
- data of the special buffer keepping the temporary translation is saved
The translation needs a set of tables:
- only for the detection phase
- used to keep the information about the sequence of instructions being processed.
  - The routing of the operands
  - The configuration of the FUs
  - other intermediate tables
For each incoming instruction:
- 1st task: verification of RAW dependencies
  - Source operands are compared to a bitmap of target regs of each line composing the dependence table
    - if no equal target reg in the {current line, all above} -> allocate on that line (1st position from the left).

Reconfiguration and Execution

Reconfiguration phase invloves:
- Loading of the conf bits for the MUX, FUs & immediate values from the special cache.
- Fetching of the operands from the reg bank
A configuration:
- is indexed by the PC of its 1st instruction.
- obtained in the 1st stage
- 3 cycles for reconfiguration (exe stage: 4th)
  - -> if not enough?
    - the processor will be stalled & wait for the reconf
Execution
- Mem accesses
  - done by LD/ST units.
  - addresses are calculated by ALUs in previous lines.
- Operations depending on a load:
  - cache hit = total load delay
  - if miss? -> the whole array operation stops & wait until the miss is resolved.
- When operands are not used anymore
  - written back either in the mem or the local regs.

Experiment & Benchmark Resutls

Testbed

VHDL version of Minimips processor based on R3000 version.
Area evaluation: Mentor Leonardo Spectrum
Power Estimations: Synopsis PowerCompiler
Library: TSMC 0.18u
System evaluation: Mibench Benchmark Suite
- larger range of different app behaviors comparing to other benchmark sets (SPEC2000, etc.).

Results

More instructions per BB is better for reconfigurable architectures, why?
- exploiting parallelism.
More branches is worse, why?
- additional paths increase execution time & area for configuration
Best for reconfigurable systems: few BBs are responsible for most of the program execution time:
- just need to focus on those BBs

Topic revision: r9 - 06 Apr 2011, ToanMai

Main

Webs
Main
Sandbox
System

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback