Transparent Reconfigurable Acceleration for Heterogeneous Embedded Applications

Introduction

  • What?
    • Binary Translation:
      • transforms sequences of instructions @ run-time
      • implemented in coarse-grained reconfigurable array
      • works in parallel to a MIPS processor
  • Why?
    • Novel solutions for
      • high performance of embedded devices
      • low power dissipation.
    • Advantages of Reconfigurable architectures:
      • a sequence of code = combinational logic -> performance gains & energy savings
      • explore the ILP of the apps
      • *speed up sequences of data dependent instructions.
    • Problems:
      • Reconfigurable systems is oftenfor some target apps.
        • -> reconfigurable systems must also adapt to increasing number of apps.
      • Transformations often modify the source/binary code
        • -> preclude the wide spread usage of reconfigurable systems.
        • -> To reduce the disign cycle & maintain backward compatibility:
          • sustaining binary compatibility
          • allowing legacy code reuse & traditional programming paradigms
  • Solution:
    • Dynamic Instruction Merging (DIM):
      • Binary Translation (BT):
        • detect sequences of instructions at run-time
        • transform & execute in a reconfigurable array.
      • coarse-grained array -> reduce the configuration complexity.
      • transparent process -> allowing full binary code reuse.
      • no distinct kernel subject of optimization.
  • Dynamic detection & reconfiguration
    • To avoid recompilation
    • Two ideas:
      • Binary Translation (BT):
        • monitor, analyzing & transforming parts of a running program
      • Trace Reuse:
        • sequences of instructions with the same operands will be repeated constantly during the execution of the program.

Proposed approach

  • DIM:
    • Detect & transform instruction groups
    • The configuration:
      • saved in a special cache
      • indexed by the program counter (PC)
    • Next time the saved sequence is found:
      • no more analysis
      • The processor:
        • loads the previously stored configuration from the special cache.
        • loads the operands from the register bank
        • activate the reconfigurable HW as FU.
      • The reconfigurable array executes the configuration (including write back of the results).
      • The PC is updated -> continue with the execution fo the normal instructions.
    • Entire app can be optimized depending on the size of the special cache.

Description of the system

Architecture of the array

  • 2D Dynamic coarse-grain array
    • is tightly coupled to the processor
    • works as an additional FU similar to Chimaera.
    • -> need no external accesses to the array
  • Instruction is allocated in an intersection between 1 row & 1 column.
  • 2 data-independent instructions can be in the same row & execute in parallel
  • Each column:
    • is homogeneous
    • contain ordinary FUs of a particular type (e.g. ALUs, shifters, etc.)
  • About Input Operands:
    • There's a set of buses
      • receiving values from the registers
      • connected to each FU
    • Multiplexer is responsible for choosing the correct input/output value

The binary translation algorithm

  • Starts working on the 1st instruction found after a branch execution
  • Stops the translation when detecting an unsupported instruction or another branch
  • If > 3 instructions found:
    • a new entry in the cache (based on FIFO) is created
    • data of the special buffer keepping the temporary translation is saved
  • The translation needs a set of tables:
    • only for the detection phase
    • used to keep the information about the sequence of instructions being processed.
      • The routing of the operands
      • The configuration of the FUs
      • other intermediate tables
  • For each incoming instruction:
    • 1st task: verification of RAW dependencies
      • Source operands are compared to a bitmap of target regs of each line composing the dependence table
        • if no equal target reg in the {current line, all above} -> allocate on that line (1st position from the left).

Reconfiguration and Execution

  • Reconfiguration phase invloves:
    • Loading of the conf bits for the MUX, FUs & immediate values from the special cache.
    • Fetching of the operands from the reg bank
  • A configuration:
    • is indexed by the PC of its 1st instruction.
    • obtained in the 1st stage
    • 3 cycles for reconfiguration (exe stage: 4th)
      • -> if not enough?
        • the processor will be stalled & wait for the reconf
  • Execution
    • Mem accesses
      • done by LD/ST units.
      • addresses are calculated by ALUs in previous lines.
    • Operations depending on a load:
      • cache hit = total load delay
      • if miss? -> the whole array operation stops & wait until the miss is resolved.
    • When operands are not used anymore
      • written back either in the mem or the local regs.

Experiment & Benchmark Resutls

Testbed

  • VHDL version of Minimips processor based on R3000 version.
  • Area evaluation: Mentor Leonardo Spectrum
  • Power Estimations: Synopsis PowerCompiler
  • Library: TSMC 0.18u
  • System evaluation: Mibench Benchmark Suite
    • larger range of different app behaviors comparing to other benchmark sets (SPEC2000, etc.).

Results

  • More instructions per BB is better for reconfigurable architectures, why?
    • exploiting parallelism.
  • More branches is worse, why?
    • additional paths increase execution time & area for configuration
  • Best for reconfigurable systems: few BBs are responsible for most of the program execution time:
    • just need to focus on those BBs
Topic revision: r9 - 06 Apr 2011, ToanMai
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Foswiki? Send feedback