Abstract
Sparse Matrix-Transpose Vector Product (SMTVP) is a frequently used computation pattern in High Performance Computing applications. In current linear algebra packages, it is typically computed as an explicit transposition followed by a Sparse Matrix-Vector Product (SMVP). However, the transposition step can be a serious bottleneck on modern parallel computing platforms. Previous work proposed a relatively complex data structure for efficient SMTVP computation on multi-core CPUs, but it proved inefficient on GPUs. In this work, we show that the Compressed Sparse Row (CSR)-based SMVP algorithm can also compute SMTVP efficiently on modern GPUs. The proposed method uses atomic operations to perform the reduction needed to accumulate each inner product between a row of the transposed matrix and the input vector. Experimental results show that this simple technique can outperform the transposition-plus-SMVP flow provided by the CUSPARSE package by up to 405-fold.
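The following CUDA kernel is a minimal sketch of the idea described above, not the paper's exact implementation: it computes y = A^T x directly from the CSR representation of A, resolving the concurrent per-column accumulations with atomicAdd. The kernel name, argument names, and the one-thread-per-row mapping are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Sketch: y = A^T * x using the CSR arrays of A (no explicit transposition).
// Each thread processes one row of A and scatters its contributions into y;
// since several rows may touch the same output element, the updates are atomic.
__global__ void smtvp_csr_atomic(int n_rows,
                                 const int   *row_ptr,  // CSR row pointers of A
                                 const int   *col_idx,  // CSR column indices of A
                                 const float *vals,     // CSR nonzero values of A
                                 const float *x,        // input vector (length n_rows)
                                 float       *y)        // output vector, pre-zeroed
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    float xi = x[row];
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k) {
        // Nonzero A[row][col] contributes vals[k] * x[row] to y[col] of A^T * x.
        atomicAdd(&y[col_idx[k]], vals[k] * xi);
    }
}
```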