ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained RM

Introduction

  • What?
    • ADRES (Architecture for Dynamically Reconfigurable Embedded System): a novel architecture that tightly couples a VLIW processor with a Coarse-Grained Reconfigurable Matrix.
  • Why?
    • not much attention is paid to the integration of Instruction Set Processor (ISP) & Reconfigurable Matrix (RM)
      • loose coupling -> programming difficulty & communication overhead
  • How?
    • VLIW processor & coarse-grained RM:
      • are integrated into one single architecture
      • 2 virtual functional views
    • Advantages:
      • improved performance
      • simplified programming model
      • reduced communication costs
      • substantial resource sharing
    • But good support from mapping tools is needed.

ADRES Architecture

Architecture Description

  • ADRES core connected to a memory hierarchy
  • ADRES core
    • consists of basic components (FUs & RFs) connected in a certain topology
    • FUs execute word-level ops
    • RFs store intermediate data
    • ADRES matrix has 2 functional views
      • share physical resources
      • but their executions never overlap
    • VLIW processor:
      • FUs connected through 1 multi-port RF
      • these FUs are more powerful than those of the matrix in terms of functionality & speed
        • capable of executing more ops such as branch
        • some are connected to the mem hierarchy depending on available ports
          • -> mem access is done through the load/store ops available on those FUs
    • RM:
      • shares FUs & RFs with the VLIW processor
      • has a number of Reconfigurable Cells (RCs) basically comprising FUs & RFs too
      • RC
        • FU:
          • can be heterogeneous supporting different operation sets
          • support predicated ops to remove the control flow inside loops
        • RF: small, with fewer ports
        • MUXes are used to direct data from different sources
        • configuration:
          • configuration RAM stores a few local configurations
          • local configurations can be loaded on a cycle-by-cycle basis
          • configurations can also be loaded from the mem hierarchy, with a higher delay
          • configurations control the behavior of the basic components by selecting operations & controlling multiplexers
        • The matrix also includes the FUs & RF of the VLIW processor
          • access to the mem is also performed through the FUs of the VLIW processor
  • ADRES is a template of architectures, not a fixed one.
    • even the actual organization of the RC is not fixed
      • for ex: 2 FUs can share 1 RF
    • an XML-based architecture description language is used to define
      • communication topology
      • supported operation set
      • resource allocation
      • timing of the target architecture
    • The specified architecture will be translated to an internal architecture representation to facilitate compilation techniques
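The role of the configuration RAM described above can be illustrated with a toy model of one Reconfigurable Cell. This is only a sketch under stated assumptions: the `Config` fields, the `OPS` table, and the `ReconfigurableCell` class are hypothetical names chosen for illustration, and the real operation set, RF size, and MUX sources depend on the architecture description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical operation set of one FU; a real ADRES instance defines
# its operation set in the architecture description, not in this table.
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


@dataclass
class Config:
    """One local configuration: selects the FU operation and the MUX inputs."""
    op: str     # operation executed by the FU
    src_a: str  # MUX select for operand A: "rf0", "rf1", or "in"
    src_b: str  # MUX select for operand B
    dst: str    # local RF entry that receives the result


class ReconfigurableCell:
    def __init__(self, config_ram: List[Config]):
        self.config_ram = config_ram    # a few local configurations
        self.rf = {"rf0": 0, "rf1": 0}  # small local register file

    def step(self, cycle: int, external_in: int) -> int:
        # A different local configuration can be selected on every cycle.
        cfg = self.config_ram[cycle % len(self.config_ram)]
        sources = {**self.rf, "in": external_in}
        result = OPS[cfg.op](sources[cfg.src_a], sources[cfg.src_b])
        self.rf[cfg.dst] = result
        return result


# Two configurations that together compute (x * x) + x over two cycles.
cell = ReconfigurableCell([
    Config(op="mul", src_a="in", src_b="in", dst="rf0"),
    Config(op="add", src_a="rf0", src_b="in", dst="rf1"),
])
outputs = [cell.step(cycle, external_in=3) for cycle in range(2)]
print(outputs)
```

The same FU does a multiply in one cycle and an add in the next purely because a different configuration is selected, which is the point of keeping several local configurations in the configuration RAM.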

Improved Performance with the VLIW Processor

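  • (1) is the equation of Amdahl's Law: overall speedup = 1 / ((1 - f) + f / s), where f is the fraction of execution time that is accelerated & s is the speedup of that fraction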
  • Example
    • kernels representing 90% of the execution time are mapped to the RM to obtain 30x acceleration
    • Overall Speedup ~ 7.69
    • -> high kernel speedup (30x) doesn't mean a high overall speedup (7.69)
  • The unaccelerated part
    • is often irregular & control-intensive
    • becomes a bottleneck
    • -> speeding up this part is essential for overall performance
      • a VLIW processor makes it possible to exploit the ILP in this part
        • if the unaccelerated code is accelerated 3x
          • overall speedup ~ 15.8
          • -> the importance of a balanced system
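The two speedup figures above can be checked with a few lines of arithmetic. The sketch below applies Amdahl's Law with an optional speedup on the non-kernel part; the function name `overall_speedup` and its parameters are assumptions made for this example, not notation from the paper.

```python
def overall_speedup(kernel_fraction, kernel_speedup, rest_speedup=1.0):
    """Amdahl's Law, with an optional speedup applied to the non-kernel part."""
    rest_fraction = 1.0 - kernel_fraction
    # Execution time after acceleration, relative to an original time of 1.0.
    new_time = rest_fraction / rest_speedup + kernel_fraction / kernel_speedup
    return 1.0 / new_time


# Kernels (90% of the time) run 30x faster on the RM; the rest is unchanged.
rm_only = overall_speedup(0.90, 30.0)
# Same kernels, plus a 3x speedup of the remaining 10% on the VLIW processor.
balanced = overall_speedup(0.90, 30.0, rest_speedup=3.0)

print(f"RM only:   {rm_only:.2f}x")
print(f"RM + VLIW: {balanced:.2f}x")
```

With only the kernels accelerated, the remaining 10% accounts for most of the new execution time (0.10 out of 0.13), which is why a modest 3x gain on that part roughly doubles the overall speedup.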

Simplified Programming Model & Reduced Communication Cost

  • VLIW processor & RM share access to the mem
    • -> achieve simplified programming model & reduced communication cost

Traditional Reconfigurable architectures

    • Processor & RM are separated
      • communicate through explicit data copying
        • normal execution steps
          • (1) copy data from processor mem -> RM mem
          • (2) RM computes the kernel
          • (3) the results are copied back from RM mem -> processor mem
      • programming point of view:
        • the separation requires
          • significant rewriting of the software implementation to map kernels to the matrix
          • identifying the data structures used for communication
            • then replacing them with communication primitives
        • -> complex & error-prone
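The three execution steps listed above can be mimicked in a small sketch. It is a toy model, not the interface of any real reconfigurable platform: the two dictionaries stand in for the separate processor and RM memories, and `rm_kernel` is a hypothetical stand-in workload.

```python
def rm_kernel(buf):
    """Stand-in kernel: doubles every element (hypothetical workload)."""
    return [2 * x for x in buf]


# Separate memories, as in a loosely coupled processor + RM system.
processor_mem = {"samples": [1, 2, 3, 4], "result": None}
rm_mem = {}
copies = 0

# (1) copy the input data from processor memory to RM memory
rm_mem["in"] = list(processor_mem["samples"])
copies += 1

# (2) the RM computes the kernel on its private copy
rm_mem["out"] = rm_kernel(rm_mem["in"])

# (3) copy the results back from RM memory to processor memory
processor_mem["result"] = list(rm_mem["out"])
copies += 1

print(processor_mem["result"], "explicit copies:", copies)
```

Every kernel invocation pays for two explicit copies, and the programmer has to know which data structures (`samples` and `result` here) cross the boundary.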

ADRES architecture

  • Data communication is performed through the shared RFs & memory
  • High-level language code like C is easily mapped to ADRES:
    • when the code is compiled for a processor
      • local vars are allocated in the RF
      • static vars & arrays are allocated in the mem space
    • when the control is transferred between the VLIW processor & the RM
      • all those vars used for communication still stay where they were
      • -> the copying is unnecessary
  • Programming point of view
    • shared-mem architecture is more compiler-friendly than the message-passing one
    • Processor & RM access the RFs & mem alternately
      • -> eliminates data synchronization & integrity problems
        • -> code can be handled easily by the compiler instead of being rewritten
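The shared-resource model described above can be sketched as two phases that operate on the same register file and memory. This is a toy illustration under stated assumptions: `shared_rf`, `shared_mem`, `vliw_phase`, and `rm_phase` are hypothetical names, and plain Python dictionaries stand in for the hardware RF and memory.

```python
# One register file and one memory, seen by both functional views.
shared_rf = {"n": 0, "acc": 0}
shared_mem = {"samples": [1, 2, 3, 4]}


def vliw_phase(rf, mem):
    """Control-oriented setup code running in the VLIW view."""
    rf["n"] = len(mem["samples"])
    rf["acc"] = 0


def rm_phase(rf, mem):
    """Loop kernel running in the RM view, on the same RF and memory."""
    for i in range(rf["n"]):
        rf["acc"] += mem["samples"][i] * mem["samples"][i]


# The two views never execute at the same time: control is handed over,
# and all variables stay where they were, so nothing is copied.
vliw_phase(shared_rf, shared_mem)
rm_phase(shared_rf, shared_mem)

print(shared_rf["acc"])
```

Because the phases run strictly one after the other on the same state, the sketch needs neither copy steps nor any synchronization between them.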

Substantial Resource Sharing

  • Processor & RM are only virtually separated
    • -> substantial resource sharing, which saves cost
Topic revision: r4 - 10 Apr 2011, ToanMai