Optimised Global Reduction on QsNet^II
In this paper we describe how QsNet^II supports reduction, a key collective operation for massively parallel applications. Results from jobs run on a 512-node, quad-CPU cluster show excellent scaling: a 2048-process global sum takes 22 microseconds on average.