Hybrid MPI-CUDA Programming


Overview: Heterogeneous Cluster Programming for Scientific Computing

Scientific computations require a huge amount of resources to solve complex problems across a wide range of disciplines. Many such problems are modeled through discretization processes that lead to linear systems or eigenvalue problems involving very large, sparse matrices.

A common approach to solving large, sparse linear systems is the use of iterative solvers capable of exploiting the sparsity of the problem. The sheer size of the problem is handled by applying parallel programming techniques on a cluster of computational units. This approach requires a very different design from traditional serial programming and introduces additional complexity, most notably in the communication among the computational nodes.
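
To make the communication cost concrete: even the simplest building block of an iterative solver, the dot product between two distributed vectors, needs a global reduction across all processes at every iteration. The fragment below (plain C with MPI; the function name and its arguments are ours, chosen only for illustration) shows that structure.

    #include <mpi.h>

    /* Dot product of two distributed vectors: each process owns n_local
     * entries of x and y.  The local partial sum is combined across all
     * processes with a single collective reduction -- this is the kind of
     * per-iteration communication an iterative solver incurs. */
    double dist_dot(const double *x, const double *y, int n_local, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n_local; i++)
            local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }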

In this scenario, hybrid CPU-GPU clusters are becoming very popular, as attested by the number of such systems in the Top 500 list.


Our Project: Sparse Matrix Computations on GPUs

In the last few years our team has developed a set of matrix storage formats and CUDA kernels for sparse linear algebra computations on GPUs. The resulting library, named SPGPU, has shown good performance when compared with the cuSPARSE library provided by NVIDIA.
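
To give an idea of what such kernels look like, the fragment below is a minimal sketch of a sparse matrix-vector product for an ELLPACK-style layout (values and column indices padded to a fixed row length and stored column-major), with one thread per row. It is written for this page as an illustration; it is not the SPGPU code and does not reflect its actual interface.

    /* ELLPACK sparse matrix-vector product y = A*x, one thread per row.
     * values/colidx are stored column-major (entry j of row i at
     * j*nrows + i) so that consecutive threads read consecutive memory.
     * Padded entries carry a column index of -1 and are skipped.
     * Illustrative only: not the SPGPU kernel or its interface. */
    __global__ void ell_spmv(int nrows, int maxnz,
                             const double *values, const int *colidx,
                             const double *x, double *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= nrows) return;

        double sum = 0.0;
        for (int j = 0; j < maxnz; j++) {
            int idx = j * nrows + row;     /* column-major access */
            int col = colidx[idx];
            if (col >= 0)
                sum += values[idx] * x[col];
        }
        y[row] = sum;
    }

The column-major layout lets consecutive threads read consecutive memory locations, which is the main reason ELLPACK-like formats tend to perform well on GPUs for matrices with a fairly uniform number of nonzeros per row.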

In order to employ SPGPU on real problems, we have recently integrated it into the Parallel Sparse BLAS (PSBLAS) project, a scientific library of Basic Linear Algebra Subroutines for parallel sparse applications that facilitates the porting of complex computations to multicomputers. PSBLAS is written in Fortran 2003 with an object-oriented (OO) design aimed at achieving both maximal flexibility and optimal performance. We have recently exploited design patterns in PSBLAS; they turn out to be very useful on heterogeneous platforms, especially for handling efficiently the computations on general-purpose GPU (GPGPU) devices.
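
PSBLAS itself is written in Fortran 2003, but the idea behind the pattern is easy to convey in C++ terms: the matrix object that applications see delegates its storage format to an inner polymorphic object, so a GPU-resident format can be plugged in without touching application code. The sketch below is our own C++ illustration of that Strategy/State-style design; the class names are invented and it is not PSBLAS code.

    #include <cstddef>
    #include <memory>
    #include <vector>

    /* The inner object: a polymorphic storage format. */
    struct MatrixStorage {
        virtual ~MatrixStorage() = default;
        virtual void spmv(const std::vector<double> &x,
                          std::vector<double> &y) const = 0;
    };

    /* One possible backend: CSR kept in host memory. */
    struct CsrHostStorage : MatrixStorage {
        std::vector<int> rowptr, colidx;
        std::vector<double> val;
        void spmv(const std::vector<double> &x,
                  std::vector<double> &y) const override {
            for (std::size_t i = 0; i + 1 < rowptr.size(); ++i) {
                double s = 0.0;
                for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                    s += val[k] * x[colidx[k]];
                y[i] = s;
            }
        }
    };

    /* The outer object applications see: the storage backend can be swapped
     * (e.g. for a GPU-resident ELLPACK variant) without changing callers. */
    class SparseMatrix {
        std::unique_ptr<MatrixStorage> storage_;
    public:
        explicit SparseMatrix(std::unique_ptr<MatrixStorage> s)
            : storage_(std::move(s)) {}
        void set_storage(std::unique_ptr<MatrixStorage> s) { storage_ = std::move(s); }
        void spmv(const std::vector<double> &x, std::vector<double> &y) const {
            storage_->spmv(x, y);
        }
    };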

An impressive amount of work on heterogeneous programming techniques has been done by both the scientific and commercial communities, but how to use heterogeneous cluster platforms efficiently remains an open challenge. Many issues arise when using a heterogeneous cluster, among them the following.

The first issue is the optimal data partitioning between CPU and GPU within the cluster nodes. This problem has been investigated recently, but the existing solutions require a non-negligible amount of computation time to produce the optimal partitioning. Our approach is to study various heuristics through a model of the heterogeneous cluster built from data gathered by a specific benchmark. The goal is to execute such a benchmark during the installation phase of PSBLAS and then select the most efficient heuristic for the data partitioning between CPU and GPU.
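
As a simple example of what such a heuristic may look like, the fragment below splits the locally owned matrix rows between CPU and GPU in proportion to the throughputs measured by the installation-time benchmark. The function and its inputs are hypothetical, used here only to illustrate the idea.

    /* Split n_local rows between CPU and GPU in proportion to the
     * throughputs (e.g. sparse matrix-vector GFLOP/s) measured by an
     * installation-time benchmark.  Illustrative only. */
    int gpu_row_share(int n_local, double cpu_gflops, double gpu_gflops)
    {
        double frac = gpu_gflops / (cpu_gflops + gpu_gflops);
        int n_gpu = (int)(frac * n_local + 0.5);  /* rows assigned to the GPU   */
        return n_gpu;                             /* remaining rows stay on CPU */
    }

For example, with a CPU sustaining 10 GFLOP/s and a GPU sustaining 40 GFLOP/s on the benchmark, 80% of the local rows would be assigned to the GPU.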

The second issue, communication, is fundamental in any cluster of computational units, and a large body of literature covers the optimal mapping between processes and nodes to minimize the communication overhead. In a heterogeneous cluster this issue is critical: the bandwidth bottleneck between CPU and GPU can nullify any benefit provided by the GPU computing power. Our project therefore plans to explore various communication techniques within and among the cluster nodes, employing technologies such as CUDA-aware Open MPI and GPUDirect.
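
With a CUDA-aware Open MPI build (and, where the hardware allows it, GPUDirect), device pointers can be passed directly to MPI calls, avoiding an explicit staging copy through host memory. The fragment below sketches a halo exchange with one neighbouring process under that assumption; the function name and arguments are placeholders of ours.

    #include <mpi.h>

    /* Exchange 'count' halo entries with a neighbouring rank.  send_d and
     * recv_d are device (GPU) pointers: a CUDA-aware MPI library reads and
     * writes GPU memory directly, so no intermediate cudaMemcpy to a host
     * buffer is needed. */
    void halo_exchange(double *send_d, double *recv_d, int count,
                       int neighbour, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Irecv(recv_d, count, MPI_DOUBLE, neighbour, 0, comm, &reqs[0]);
        MPI_Isend(send_d, count, MPI_DOUBLE, neighbour, 0, comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }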

The third issue is a very popular research topic in the numerical analysis community, where ways are being explored to trade increased local computation for reduced communication. This trend is only likely to expand in the near future, since data movement is the most important factor in the performance of the overall system.

Preconditioning denotes a set of techniques for improving and accelerating the convergence of iterative methods; not all preconditioning techniques are suited to platforms such as hybrid CPU/GPU nodes, and a substantial effort will be needed to identify the best ones in various application contexts, for example fluid dynamics simulations.
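
One preconditioner that maps very naturally onto GPUs is point-Jacobi (diagonal scaling), since its application is fully parallel across rows and needs no communication; the kernel below is a minimal sketch of that application step, written here only as an illustration.

    /* Apply a point-Jacobi preconditioner: z = D^{-1} r, where diag holds
     * the matrix diagonal.  Each thread handles one entry, so the operation
     * is embarrassingly parallel and communication-free -- one reason
     * diagonal scaling is popular on GPUs, even though stronger
     * preconditioners often converge in far fewer iterations. */
    __global__ void jacobi_apply(int n, const double *diag,
                                 const double *r, double *z)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            z[i] = r[i] / diag[i];
    }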


Team


Publications



http://www.ce.uniroma2.it/hybrid
This page was last modified: January 8, 2014.
For comments, you are welcome to send email to: Salvatore Filippone