Abstract
The most widely used programming models expect the hardware to guarantee coherent shared-memory accesses. However, as the number of cores integrated on a chip increases, scalable cache coherence protocols that are efficient in both resources and performance are needed. To address the scalability issues caused by the size of the sharing set, we propose to encode, on a fixed-size bit-vector, a rectangular cluster whose goal is to cover most of the sharers. The cluster size is fixed, but its height, width, and position are determined for each cache block and can change during execution. We use a fixed-size linked list for the first few outliers, and resort to broadcast when the list overflows. We compare our solution to a snoop-based protocol, a directory-based full bit-vector protocol, and Ackwise. It achieves a mean latency similar to that of Ackwise with 10% less traffic, and is only a few percent above the complete sharing set on these metrics. More importantly, it generates ten times fewer broadcasts than Ackwise while using similar hardware resources on a 64-core architecture.
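To make the encoding idea concrete, the following minimal Python sketch models one directory entry as a fixed-area rectangle, a bounded outlier list, and a broadcast fallback. It is an illustration under stated assumptions, not the proposed hardware design: the 8x8 mesh, the cluster area of 16 cores, the outlier capacity of 2, the exhaustive search for the best rectangle, and the names `encode` and `targets` are all hypothetical, and the abstract does not specify how the rectangle is actually chosen or updated.

```python
from itertools import product

MESH_W, MESH_H = 8, 8   # assumed 8x8 mesh (64 cores)
CLUSTER_AREA = 16       # assumed fixed cluster size, in cores
MAX_OUTLIERS = 2        # assumed capacity of the outlier list

# All (height, width) shapes with the fixed area that fit in the mesh.
SHAPES = [(h, CLUSTER_AREA // h) for h in range(1, MESH_H + 1)
          if CLUSTER_AREA % h == 0 and CLUSTER_AREA // h <= MESH_W]


def encode(sharers):
    """Build a hypothetical directory entry for a set of (x, y) sharers."""
    best = None
    # Exhaustive search (an assumption): keep the first rectangle that
    # covers the largest number of sharers.
    for h, w in SHAPES:
        for x0, y0 in product(range(MESH_W - w + 1), range(MESH_H - h + 1)):
            covered = {(x, y) for (x, y) in sharers
                       if x0 <= x < x0 + w and y0 <= y < y0 + h}
            if best is None or len(covered) > len(best[1]):
                best = ((x0, y0, h, w), covered)
    (x0, y0, h, w), covered = best

    # One bit per core of the rectangle, in row-major order.
    bits = 0
    for x, y in covered:
        bits |= 1 << ((y - y0) * w + (x - x0))

    outliers = sorted(set(sharers) - covered)
    broadcast = len(outliers) > MAX_OUTLIERS  # outlier list overflow
    return {"origin": (x0, y0), "shape": (h, w), "bits": bits,
            "outliers": [] if broadcast else outliers,
            "broadcast": broadcast}


def targets(entry):
    """Cores that must receive an invalidation for this entry."""
    if entry["broadcast"]:
        return {(x, y) for x in range(MESH_W) for y in range(MESH_H)}
    x0, y0 = entry["origin"]
    h, w = entry["shape"]
    in_cluster = {(x0 + i % w, y0 + i // w)
                  for i in range(h * w) if (entry["bits"] >> i) & 1}
    return in_cluster | set(entry["outliers"])
```

In this sketch, three sharers that fall inside one 4x4 rectangle plus one distant sharer are tracked exactly, whereas five widely scattered sharers overflow the outlier list and force a broadcast to all 64 cores.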