ADRES (Architecture for Dynamically Reconfigurable Embedded System): a novel architecture that tightly couples a VLIW processor with a Coarse-Grained Reconfigurable Matrix.
Why?
little attention has been paid to integrating the Instruction Set Processor (ISP) & the Reconfigurable Matrix (RM)
loose coupling -> programming difficulty & communication overhead
How?
VLIW processor & coarse-grained RM:
are integrated into one single architecture
2 virtual functional views
Advantages:
improved performance
simplified programming model
reduced communication costs
substantial resource sharing
But these advantages require good support from mapping tools.
ADRES Architecture
Architecture Description
ADRES core connected to a memory hierarchy
ADRES core
consists of basic components (FUs & RFs) connected in a certain topology
FUs execute word-level ops
RFs store intermediate data
ADRES matrix has 2 functional views
share physical resources
but their executions never overlap
VLIW processor:
FUs connected through 1 multi-port RF
these FUs are more powerful than those of the matrix in terms of functionality & speed
capable of executing more ops such as branch
some are connected to the mem hierarchy depending on available ports
-> mem access is done through the ld/st ops available on those FUs
RM:
shares FUs & RFs with the VLIW processor
has a number of Reconfigurable Cells (RCs), each basically comprising FUs & RFs as well
RC
FU:
can be heterogeneous supporting different operation sets
support predicated ops to remove the control flow inside loops
RF: small, with fewer ports
MUXes are used to direct data from different sources
configuration:
configuration RAM stores a few local configurations
local configurations can be loaded on a cycle-by-cycle basis
configurations can also be loaded from the mem hierarchy, at the cost of a higher delay
configurations control behavior of the basic components by selecting operations & multiplexors
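The configuration mechanism above can be sketched as a toy model. This is only an illustration under stated assumptions: the names `CONFIG_RAM`, `src_a`, `src_b`, the source labels (`north`, `west`, `rf`) and the three-operation set are hypothetical and do not reflect the real ADRES configuration format.

```python
import operator

# Hypothetical operation set of one FU.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

# Hypothetical configuration RAM: each local configuration selects one
# operation and one source per operand multiplexer.
CONFIG_RAM = [
    {"op": "add", "src_a": "north", "src_b": "rf"},
    {"op": "mul", "src_a": "rf", "src_b": "west"},
]


def execute_cycle(context, inputs):
    """Apply the selected operation to the operands chosen by the MUXes."""
    a = inputs[context["src_a"]]
    b = inputs[context["src_b"]]
    return OPS[context["op"]](a, b)


# Values currently visible at the MUX inputs (made-up numbers).
inputs = {"north": 3, "west": 5, "rf": 2}

results = []
for cycle in range(4):
    # A different local configuration can be selected every cycle.
    context = CONFIG_RAM[cycle % len(CONFIG_RAM)]
    results.append(execute_cycle(context, inputs))

print(results)  # [5, 10, 5, 10]
```

The same FU behaves as an adder in one cycle and a multiplier in the next simply because a different stored configuration is selected.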
The matrix also includes the FUs & RF of the VLIW processor
access to the mem is also performed through the FUs of the VLIW processor
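The note that FUs support predicated ops to remove control flow inside loops can be illustrated with a small if-conversion sketch. The function names and the clipping example are assumptions chosen for illustration; Python only models the idea and is not how an ADRES FU is programmed.

```python
def clip_with_branch(values, limit):
    """Loop body containing control flow (an if inside the loop)."""
    out = []
    for v in values:
        if v > limit:
            v = limit
        out.append(v)
    return out


def clip_predicated(values, limit):
    """Same loop after if-conversion: the condition becomes a data value
    (the predicate) and the result is selected arithmetically."""
    out = []
    for v in values:
        p = int(v > limit)           # predicate: 1 if the op should take effect
        v = p * limit + (1 - p) * v  # select without branching
        out.append(v)
    return out


print(clip_with_branch([1, 5, 9], 4))  # [1, 4, 4]
print(clip_predicated([1, 5, 9], 4))   # [1, 4, 4]
```

With the branch gone, every iteration executes the same sequence of ops, which is the form a kernel needs in order to be mapped onto an array of FUs.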
ADRES is a template of architectures, not a fixed one.
even the actual organization of the RC is not fixed
for ex: 2 FUs can share 1 RF
an XML-based architecture description language is used to define
communication topology
supported operation set
resource allocation
timing of the target architecture
The specified architecture will be translated to an internal architecture representation to facilitate compilation techniques
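The idea of a template that is instantiated from a description can be sketched as follows. Everything here is hypothetical: the class and field names, the 2D mesh topology, and the choice to place the VLIW FUs in row 0 are assumptions for illustration, and the real description is written in the XML-based language, not in Python.

```python
from dataclasses import dataclass


@dataclass
class FU:
    name: str
    ops: frozenset               # supported operation set
    has_mem_port: bool = False   # whether this FU can issue ld/st


@dataclass
class ArchInstance:
    rows: int
    cols: int
    vliw_ops: frozenset
    rc_ops: frozenset
    mem_ports: int

    def build(self):
        """Resource allocation: row 0 holds the VLIW FUs (assumption),
        the remaining rows hold the FUs of the reconfigurable cells."""
        fus = {}
        for r in range(self.rows):
            for c in range(self.cols):
                if r == 0:
                    fus[(r, c)] = FU(f"vliw_fu{c}", self.vliw_ops, c < self.mem_ports)
                else:
                    fus[(r, c)] = FU(f"rc_{r}_{c}", self.rc_ops)
        return fus

    def neighbours(self, r, c):
        """Communication topology: a plain mesh (assumption)."""
        cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        return [(i, j) for i, j in cand if 0 <= i < self.rows and 0 <= j < self.cols]


arch = ArchInstance(
    rows=4,
    cols=4,
    vliw_ops=frozenset({"add", "mul", "ld", "st", "branch"}),
    rc_ops=frozenset({"add", "mul"}),
    mem_ports=2,
)
fus = arch.build()
print(len(fus))                                     # 16
print(sum(fu.has_mem_port for fu in fus.values()))  # 2
print(arch.neighbours(0, 0))                        # [(1, 0), (0, 1)]
```

Changing the parameters yields a different instance of the same template, which is the role the architecture description plays for the compiler.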
Improved Performance with the VLIW Processor
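Amdahl's Law (1): Overall Speedup = 1 / ((1 - F) + F / S), where F is the fraction of execution time that is accelerated & S is the speedup of that fraction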
Example
kernels representing 90% of execution time are mapped to the RM & obtain 30x acceleration
Overall Speedup = 1 / (0.1 + 0.9/30) ~ 7.69
-> high kernel speedup (30x) doesn't mean a high overall speedup (7.69)
The unaccelerated part
is often irregular & control-intensive
becomes a bottleneck
-> speeding up this part is essential for overall performance
a VLIW processor can exploit the ILP available in this part
if the unaccelerated code gets 3x acceleration
overall speedup = 1 / (0.1/3 + 0.9/30) ~ 15.8
-> the importance of a balanced system
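The two figures in the example can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and the `rest_speedup` parameter are illustrative assumptions that extend Amdahl's Law to the case where the non-kernel code is accelerated too.

```python
def overall_speedup(kernel_fraction, kernel_speedup, rest_speedup=1.0):
    """Amdahl's Law with two parts: the kernels and the remaining code."""
    rest_fraction = 1.0 - kernel_fraction
    return 1.0 / (rest_fraction / rest_speedup + kernel_fraction / kernel_speedup)


# Kernels are 90% of execution time and run 30x faster on the RM.
kernel_only = overall_speedup(0.9, 30.0)

# Same kernels, plus 3x acceleration of the remaining 10% on the VLIW processor.
balanced = overall_speedup(0.9, 30.0, rest_speedup=3.0)

print(f"{kernel_only:.2f}")  # 7.69
print(f"{balanced:.2f}")     # 15.79
```

Speeding up only 10% of the execution time by 3x roughly doubles the overall speedup, which is the point about a balanced system.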
Simplified Programming Model & Reduced Communication Cost
VLIW processor & RM share access to the mem
-> achieve simplified programming model & reduced communication cost
Traditional Reconfigurable architectures
Processor & RM are separated
communicate through explicit data copying
normal execution steps
(1) copy data from processor mem -> RM mem
(2) RM computes the kernel
(3) the results are copied back from RM mem -> processor mem
programming point of view:
the separation requires
significant code rewriting, starting from the software implementation, to map kernels to the matrix
identifying the data structures used for communication
then replacing them with communication primitives
-> complex & error-prone
ADRES architecture
Data communication is performed through the shared RFs & memory
High-level language code such as C maps easily to ADRES:
when the code is compiled for a processor
local vars are allocated in the RF
static vars & arrays are allocated in the mem space
when the control is transferred between the VLIW processor & the RM
all those vars used for communication still stay where they were
-> the copying is unnecessary
Programming point of view
shared-mem architecture is more compiler-friendly than the message-passing one
Processor & RM access the RFs & mem alternately, never at the same time
-> eliminates data synchronization & integrity problems
-> code can be handled by the compiler instead of being rewritten by hand
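The difference between the two models can be sketched by counting the words that must be copied. This is only an illustration: the `kernel` function, the list-based "memories" and the function names are hypothetical stand-ins, not part of any ADRES tool.

```python
def kernel(buf):
    """Hypothetical kernel: doubles every element in place."""
    for i in range(len(buf)):
        buf[i] *= 2


def run_loosely_coupled(processor_mem):
    """Traditional model: separate memories, explicit copies."""
    words_copied = 0
    rm_mem = list(processor_mem)   # (1) copy processor mem -> RM mem
    words_copied += len(rm_mem)
    kernel(rm_mem)                 # (2) RM computes the kernel
    processor_mem[:] = rm_mem      # (3) copy results back
    words_copied += len(rm_mem)
    return words_copied


def run_shared(shared_mem):
    """Shared-memory model: the data stays where it is."""
    kernel(shared_mem)
    return 0


a = [1, 2, 3]
b = [1, 2, 3]
print(run_loosely_coupled(a), a)  # 6 [2, 4, 6]
print(run_shared(b), b)           # 0 [2, 4, 6]
```

Both versions produce the same result, but the shared version needs no communication code at all, so the original program structure survives unchanged.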