Beck2008 (06 Apr 2011, ToanMai)
Transparent Reconfigurable Acceleration for Heterogeneous Embedded Applications
Introduction
Related Work
Proposed approach
Description of the system
Architecture of the array
The binary translation algorithm
Reconfiguration and Execution
Experiment & Benchmark Results
Testbed
Results
Introduction
What?
Binary Translation:
transforms sequences of instructions at run-time
implemented in a coarse-grained reconfigurable array
works in parallel to a MIPS processor
Why?
Novel solutions for
high performance of embedded devices
low power dissipation.
Advantages of Reconfigurable architectures:
a sequence of code is mapped to combinational logic -> performance gains & energy savings
exploit the ILP of the apps
speed up sequences of data-dependent instructions.
Problems:
Reconfigurable systems are often designed for only some target apps.
-> reconfigurable systems must also adapt to an increasing number of apps.
Transformations often modify the source/binary code
-> precludes the widespread usage of reconfigurable systems.
-> To reduce the design cycle & maintain backward compatibility:
sustain binary compatibility
allow legacy code reuse & traditional programming paradigms
Solution:
Dynamic Instruction Merging (DIM):
Binary Translation (BT):
detect sequences of instructions at run-time
transform & execute in a reconfigurable array.
coarse-grained array -> reduces the configuration complexity.
transparent process -> allows full binary code reuse.
no distinct kernel needs to be singled out for optimization.
Related Work
Dynamic detection & reconfiguration
To avoid recompilation
Two ideas:
Binary Translation (BT):
monitoring, analyzing & transforming parts of a running program
Trace Reuse:
relies on sequences of instructions with the same operands being repeated constantly during the execution of the program.
Proposed approach
DIM:
Detect & transform instruction groups
The configuration:
saved in a special cache
indexed by the program counter (PC)
Next time the saved sequence is found: no more analysis
The processor:
loads the previously stored configuration from the special cache.
loads the operands from the register bank
activates the reconfigurable HW as an FU.
The reconfigurable array
executes the configuration (including write back of the results).
The PC is updated -> continue with the execution of the normal instructions.
The entire app can be optimized, depending on the size of the special cache.
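The lookup described above can be sketched as a small PC-indexed table. This is only a toy model, not the hardware from the paper: the class name ConfigCache, its capacity, the addresses and the configuration strings are all made up for illustration. FIFO replacement is taken from the note on cache entries in the binary translation section.

```python
from collections import OrderedDict


class ConfigCache:
    """Toy model of the special cache: PC -> stored configuration (hypothetical API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as FIFO order

    def lookup(self, pc):
        # A hit means the sequence starting at this PC was already translated,
        # so no further analysis is needed.
        return self.entries.get(pc)

    def store(self, pc, config):
        if pc not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry (FIFO)
        self.entries[pc] = config


cache = ConfigCache(capacity=2)
cache.store(0x400100, "cfg_A")
cache.store(0x400200, "cfg_B")
cache.store(0x400300, "cfg_C")  # cache is full: the entry for 0x400100 is evicted
print(cache.lookup(0x400100))   # None
print(cache.lookup(0x400300))   # cfg_C
```

A larger capacity keeps more translated sequences alive, which is why the share of the app that gets accelerated depends on the cache size.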
Description of the system
Architecture of the array
2D Dynamic coarse-grain array
is tightly coupled to the processor
works as an additional FU similar to Chimaera.
-> needs no external accesses to the array
Each instruction is allocated at the intersection of 1 row & 1 column.
2 data-independent instructions can be in the same row & execute in parallel
Each column:
is homogeneous
contains ordinary FUs of a particular type (e.g. ALUs, shifters)
About Input Operands:
There's a set of buses
receiving values from the registers
connected to each FU
A multiplexer is responsible for choosing the correct input/output value
The binary translation algorithm
Starts working on the 1st instruction found after a branch execution
Stops the translation when detecting an unsupported instruction or another branch
If more than 3 instructions are found:
a new entry in the cache (based on FIFO) is created
data of the special buffer keeping the temporary translation is saved
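The start/stop rule can be sketched over an instruction trace as follows. This is an illustration under assumptions, not the detection hardware: the opcode sets BRANCHES and UNSUPPORTED and the function find_sequences are hypothetical (the notes do not list which instructions are unsupported), and a sequence still open at the end of the trace is simply dropped.

```python
BRANCHES = {"beq", "bne", "j", "jal", "jr"}  # illustrative subset of MIPS branches/jumps
UNSUPPORTED = {"syscall"}                    # hypothetical: the notes do not list these
MIN_LENGTH = 4                               # "more than 3 instructions"


def find_sequences(trace):
    """trace: list of (pc, opcode). Returns {pc of 1st instruction: [opcodes]}."""
    sequences = {}
    buffer, start_pc = [], None  # temporary translation buffer
    collecting = False           # True only after a branch has executed
    for pc, op in trace:
        if op in BRANCHES or op in UNSUPPORTED:
            # Stop condition: keep the buffered sequence only if it is long enough.
            if len(buffer) >= MIN_LENGTH:
                sequences[start_pc] = buffer
            buffer, start_pc = [], None
            collecting = op in BRANCHES  # translation restarts only after a branch
        elif collecting:
            if start_pc is None:
                start_pc = pc
            buffer.append(op)
    return sequences


trace = [
    (0x00, "beq"),
    (0x04, "addu"), (0x08, "lw"), (0x0C, "sll"), (0x10, "subu"),
    (0x14, "bne"),
    (0x18, "addu"), (0x1C, "or"),
    (0x20, "j"),
]
print(find_sequences(trace))  # only the 4-instruction run starting at 0x04 is kept
```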
The translation needs a set of tables:
only for the detection phase
used to keep the information about the sequence of instructions being processed.
The routing of the operands
The configuration of the FUs
other intermediate tables
For each incoming instruction:
1st task: verification of RAW dependencies
Source operands are compared to a bitmap of target regs of each line composing the dependence table
if neither the current line nor any line above it has a target reg equal to a source operand -> allocate on that line (1st available position from the left).
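A minimal sketch of the RAW check with per-line target bitmaps follows. The notes leave the scan direction implicit, so the sketch assumes the intent is ordinary RAW ordering: an instruction lands on the earliest line that comes after every line writing one of its source registers. The names allocate, reg_mask and COLUMNS_PER_LINE are hypothetical, the table grows without bound, and column types (ALU, shifter, etc.) are ignored.

```python
COLUMNS_PER_LINE = 3  # hypothetical array width


def reg_mask(regs):
    """Build a bitmap with one bit set per register number."""
    mask = 0
    for r in regs:
        mask |= 1 << r
    return mask


def allocate(table, sources, target):
    """Place one instruction in the dependence table; return (line, column).

    table is a list of lines; each line is a dict with
      "targets": bitmap of registers written on that line
      "used":    number of occupied columns
    """
    src_mask = reg_mask(sources)
    # First candidate: one past the deepest line that writes a source register.
    line = 0
    for i, entry in enumerate(table):
        if entry["targets"] & src_mask:
            line = i + 1
    # Skip lines that have no free column left.
    while line < len(table) and table[line]["used"] >= COLUMNS_PER_LINE:
        line += 1
    if line == len(table):
        table.append({"targets": 0, "used": 0})
    entry = table[line]
    column = entry["used"]  # first free position from the left
    entry["used"] += 1
    entry["targets"] |= 1 << target
    return line, column


table = []
print(allocate(table, sources=[1, 2], target=3))  # (0, 0)
print(allocate(table, sources=[4, 5], target=6))  # (0, 1) independent: same line
print(allocate(table, sources=[3, 6], target=7))  # (1, 0) reads r3 and r6: next line
```

The two independent instructions share line 0 and can execute in parallel, while the third one, which reads both results, is pushed to line 1.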
Reconfiguration and Execution
Reconfiguration phase involves:
Loading of the conf bits for the MUX, FUs & immediate values from the special cache.
Fetching of the operands from the reg bank
A configuration:
is indexed by the PC of its 1st instruction.
this PC is obtained in the 1st pipeline stage
3 cycles are available for reconfiguration (the exe stage is the 4th)
-> if that is not enough, the processor is stalled & waits for the reconfiguration to finish
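The timing rule above reduces to a one-line calculation. The sketch assumes the reconfiguration time is known in cycles; the function name stall_cycles and the sample values are hypothetical, and only the 3-cycle window comes from the notes.

```python
RECONFIG_WINDOW = 3  # cycles available before the execute stage (from the notes)


def stall_cycles(reconfig_cycles):
    """Cycles the processor waits when reconfiguration overruns the window."""
    return max(0, reconfig_cycles - RECONFIG_WINDOW)


print(stall_cycles(3))  # 0: reconfiguration is hidden behind the pipeline stages
print(stall_cycles(5))  # 2: the processor stalls for the 2 extra cycles
```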
Execution
Mem accesses
done by LD/ST units.
addresses are calculated by ALUs in previous lines.
Operations depending on a load:
the cache-hit latency is taken as the total load delay
on a miss -> the whole array operation stops & waits until the miss is resolved.
When operands are no longer used:
they are written back to either the mem or the local regs.
Experiment & Benchmark Results
Testbed
VHDL version of Minimips processor based on R3000 version.
Area evaluation: Mentor Leonardo Spectrum
Power estimations: Synopsys Power Compiler
Library: TSMC 0.18 µm
System evaluation:
MiBench benchmark suite
larger range of different app behaviors compared to other benchmark sets (SPEC2000, etc.).
Results
More instructions per basic block (BB) is better for reconfigurable architectures. Why?
more room for exploiting parallelism.
More branches is worse. Why?
additional paths increase execution time & area for configuration
Best for reconfigurable systems:
few BBs are responsible for most of the program execution time:
just need to focus on those BBs