Abstract
Improving the performance and energy efficiency of scientific applications running on high performance computing systems has become crucial. Software- and hardware-based solutions for improving communication performance are recognized as an important means of achieving performance gains, and thus energy savings, for such applications. Because distributed matrix multiplication is a fundamental component of most numerical linear algebra algorithms, improving its performance and energy efficiency is a major concern. To this end, we propose a high performance communication scheme that fully exploits network bandwidth via non-blocking pipeline broadcast with a tuned chunk size. Empirically, we achieve performance gains of up to 8.4% and energy savings of up to 6.9% compared to blocking pipeline broadcast, and performance gains of up to 6.5% and energy savings of up to 6.1% compared to binomial tree broadcast, on a 64-core cluster.