Abstract
We consider a non-uniform access latency cache architecture (NUCA) design for 3D chip multi-processors (CMPs) in which the cache is divided into small banks interconnected by a network-on-chip (NoC). Earlier NUCA designs place data in banks either statically (S-NUCA) or dynamically (D-NUCA), and both approaches face several challenges when scaling to hundreds of cores. We therefore propose a new NUCA architecture with an inclusive, octal-tree-based hierarchical directory (T-NUCA-8), which has the potential to scale to hundreds of cores with performance comparable to D-NUCA at a fraction of the energy cost. Our evaluations indicate that, relative to D-NUCA, T-NUCA-8 reduces network usage by 92%, energy by 87%, and energy-delay product (EDP) by 87%, at a performance cost of 10%.