Abstract
With the recent emergence of large-scale knowledge discovery, data mining and social network analysis, irregular applications have gained renewed interest. Cache-based architectures do not perform optimally on such workloads, mainly because the control and memory access patterns of these applications exhibit low spatial and temporal locality. This paper presents a multi-node, multi-core, multi-threaded shared-memory system architecture designed for the execution of large-scale irregular applications and built on three pillars that support these workloads. First, transparent hardware support for a Partitioned Global Address Space (PGAS) provides a large globally shared address space with no software library overhead. Second, multi-threaded multi-core processing nodes provide the latency tolerance required when accessing physically distributed global memory. Third, hardware support is provided for inter-thread synchronization on the global address space. An analytical performance model that accounts for the main architecture and application characteristics is presented. The hardware design of the proposed custom architectural building blocks is then described. Finally, a multi-board FPGA prototype of the proposed system is presented and evaluated with typical irregular kernels and benchmarks. The experimental evaluation demonstrates the performance scalability of the architecture across different configurations of the whole system.