Abstract
Many modern networks used to interconnect nodes in cluster-based computing systems provide network interface cards (NICs) with programmable processors. Substantial research has focused on offloading processing from the host to the NIC processor. However, this research has primarily addressed the static offload of specific features to the NIC, mainly to optimize common collective and synchronization-based communications. We describe the design and implementation of a framework based on MPICH-GM that supports the dynamic NIC-based offload of user-defined modules for Myrinet clusters. We evaluate our implementation on a 16-node cluster using a NIC-based version of the common broadcast operation. We find a maximum factor of improvement of 1.2 with respect to total latency and a maximum factor of improvement of 2.2 with respect to average CPU utilization under conditions of process skew. In addition, these improvements increase with system size, indicating that our NIC-based framework offers enhanced scalability compared to a purely host-based approach.