Hybrid MPI-CUDA Programming


Overview: Heterogeneous Cluster Programming for Scientific Computing

Scientific computations require a huge amount of resources to solve complex problems across a wide range of disciplines. Many such problems are modeled through discretization processes that lead to linear systems or eigenvalue problems involving very large, sparse matrices.

A common approach to solving large, sparse linear systems is the use of iterative solvers capable of exploiting the sparsity of the problem. The sheer size of the problem is handled by applying parallel programming techniques on a cluster of computational units. This approach requires a very different design from traditional serial programming and introduces additional complexity, most notably in the communication among the computational nodes.
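
To make the communication cost concrete: even the simplest building block of an iterative solver, the dot product between two distributed vectors, needs a global reduction across all processes at every iteration. The fragment below (plain C with MPI; the function name and its arguments are ours, chosen only for illustration) shows that structure.

    #include <mpi.h>

    /* Dot product of two distributed vectors: each process owns n_local
     * entries of x and y.  The local partial sum is combined across all
     * processes with a single collective reduction -- this is the kind of
     * per-iteration communication an iterative solver incurs. */
    double dist_dot(const double *x, const double *y, int n_local, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n_local; i++)
            local += x[i] * y[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }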

In this scenario, hybrid CPU-GPU clusters are becoming very popular, as attested by the number of such systems in the Top 500 list.


Our Project: Sparse Matrix Computations on GPUs

In the last few years our team has developed a set of matrix storage formats and CUDA kernels for sparse linear algebra computations on GPUs. The resulting library, named SPGPU, has shown good performance when compared with the cuSPARSE library provided by NVIDIA.
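
To give an idea of what such kernels look like, the fragment below is a minimal sketch of a sparse matrix-vector product for an ELLPACK-style layout (values and column indices padded to a fixed row length and stored column-major), with one thread per row. It is written for this page as an illustration; it is not the SPGPU code and does not reflect its actual interface.

    /* ELLPACK sparse matrix-vector product y = A*x, one thread per row.
     * values/colidx are stored column-major (entry j of row i at
     * j*nrows + i) so that consecutive threads read consecutive memory.
     * Padded entries carry a column index of -1 and are skipped.
     * Illustrative only: not the SPGPU kernel or its interface. */
    __global__ void ell_spmv(int nrows, int maxnz,
                             const double *values, const int *colidx,
                             const double *x, double *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= nrows) return;

        double sum = 0.0;
        for (int j = 0; j < maxnz; j++) {
            int idx = j * nrows + row;     /* column-major access */
            int col = colidx[idx];
            if (col >= 0)
                sum += values[idx] * x[col];
        }
        y[row] = sum;
    }

The column-major layout lets consecutive threads read consecutive memory locations, which is the main reason ELLPACK-like formats tend to perform well on GPUs for matrices with a fairly uniform number of nonzeros per row.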

In order to employ SPGPU on real problems, we have recently integrated it into the Parallel Sparse BLAS (PSBLAS) project, a scientific library of Basic Linear Algebra Subroutines for parallel sparse applications that facilitates the porting of complex computations to multicomputers. PSBLAS is written in Fortran 2003 with an object-oriented (OO) design aimed at achieving both maximal flexibility and optimal performance. We have recently exploited design patterns in PSBLAS; they turn out to be very useful on heterogeneous platforms, especially for handling efficiently the computations on general-purpose GPU (GPGPU) devices.
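
PSBLAS itself is written in Fortran 2003, but the idea behind the pattern is easy to convey in C++ terms: the matrix object that applications see delegates its storage format to an inner polymorphic object, so a GPU-resident format can be plugged in without touching application code. The sketch below is our own C++ illustration of that Strategy/State-style design; the class names are invented and it is not PSBLAS code.

    #include <cstddef>
    #include <memory>
    #include <vector>

    /* The inner object: a polymorphic storage format. */
    struct MatrixStorage {
        virtual ~MatrixStorage() = default;
        virtual void spmv(const std::vector<double> &x,
                          std::vector<double> &y) const = 0;
    };

    /* One possible backend: CSR kept in host memory. */
    struct CsrHostStorage : MatrixStorage {
        std::vector<int> rowptr, colidx;
        std::vector<double> val;
        void spmv(const std::vector<double> &x,
                  std::vector<double> &y) const override {
            for (std::size_t i = 0; i + 1 < rowptr.size(); ++i) {
                double s = 0.0;
                for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                    s += val[k] * x[colidx[k]];
                y[i] = s;
            }
        }
    };

    /* The outer object applications see: the storage backend can be swapped
     * (e.g. for a GPU-resident ELLPACK variant) without changing callers. */
    class SparseMatrix {
        std::unique_ptr<MatrixStorage> storage_;
    public:
        explicit SparseMatrix(std::unique_ptr<MatrixStorage> s)
            : storage_(std::move(s)) {}
        void set_storage(std::unique_ptr<MatrixStorage> s) { storage_ = std::move(s); }
        void spmv(const std::vector<double> &x, std::vector<double> &y) const {
            storage_->spmv(x, y);
        }
    };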

An impressive amount of work on heterogeneous programming techniques has been done by both the scientific and commercial communities, but how to use heterogeneous cluster platforms efficiently remains an open challenge. Many issues arise when using a heterogeneous cluster, among them the following.

The first issue is the optimal data partitioning between CPU and GPU within the cluster nodes. This problem has been investigated recently, but the existing solutions require a non-negligible amount of computation time to produce the optimal partitioning. Our approach is to study various heuristics through a model of the heterogeneous cluster built from data gathered by a specific benchmark. The goal is to execute such a benchmark during the installation phase of PSBLAS and then select the most efficient heuristic for the data partitioning between CPU and GPU.
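
As a simple example of what such a heuristic may look like, the fragment below splits the locally owned matrix rows between CPU and GPU in proportion to the throughputs measured by the installation-time benchmark. The function and its inputs are hypothetical, used here only to illustrate the idea.

    /* Split n_local rows between CPU and GPU in proportion to the
     * throughputs (e.g. sparse matrix-vector GFLOP/s) measured by an
     * installation-time benchmark.  Illustrative only. */
    int gpu_row_share(int n_local, double cpu_gflops, double gpu_gflops)
    {
        double frac = gpu_gflops / (cpu_gflops + gpu_gflops);
        int n_gpu = (int)(frac * n_local + 0.5);  /* rows assigned to the GPU   */
        return n_gpu;                             /* remaining rows stay on CPU */
    }

For example, with a CPU sustaining 10 GFLOP/s and a GPU sustaining 40 GFLOP/s on the benchmark, 80% of the local rows would be assigned to the GPU.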

The second issue, communication, is fundamental in any cluster of computational units, and a large body of literature covers the optimal mapping between processes and nodes to minimize the communication overhead. In a heterogeneous cluster this issue is critical: the bandwidth bottleneck between CPU and GPU can nullify any benefit provided by the GPU computing power. Our project therefore plans to explore various communication techniques within and among the cluster nodes, employing technologies such as CUDA-aware Open MPI and GPUDirect.
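
With a CUDA-aware Open MPI build (and, where the hardware allows it, GPUDirect), device pointers can be passed directly to MPI calls, avoiding an explicit staging copy through host memory. The fragment below sketches a halo exchange with one neighbouring process under that assumption; the function name and arguments are placeholders of ours.

    #include <mpi.h>

    /* Exchange 'count' halo entries with a neighbouring rank.  send_d and
     * recv_d are device (GPU) pointers: a CUDA-aware MPI library reads and
     * writes GPU memory directly, so no intermediate cudaMemcpy to a host
     * buffer is needed. */
    void halo_exchange(double *send_d, double *recv_d, int count,
                       int neighbour, MPI_Comm comm)
    {
        MPI_Request reqs[2];
        MPI_Irecv(recv_d, count, MPI_DOUBLE, neighbour, 0, comm, &reqs[0]);
        MPI_Isend(send_d, count, MPI_DOUBLE, neighbour, 0, comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }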

The third issue is a very popular research topic in the numerical analysis community, where ways are being explored to trade increased local computation for reduced communication. This trend is only likely to expand in the near future, since data movement is the most important factor in the performance of the overall system.

Preconditioning denotes a set of techniques for improving and accelerating the convergence of iterative methods; not all preconditioning techniques are suited to platforms such as hybrid CPU/GPU nodes, and a substantial effort will be needed to identify the best ones in various application contexts, for example fluid dynamics simulations.
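
One preconditioner that maps very naturally onto GPUs is point-Jacobi (diagonal scaling), since its application is fully parallel across rows and needs no communication; the kernel below is a minimal sketch of that application step, written here only as an illustration.

    /* Apply a point-Jacobi preconditioner: z = D^{-1} r, where diag holds
     * the matrix diagonal.  Each thread handles one entry, so the operation
     * is embarrassingly parallel and communication-free -- one reason
     * diagonal scaling is popular on GPUs, even though stronger
     * preconditioners often converge in far fewer iterations. */
    __global__ void jacobi_apply(int n, const double *diag,
                                 const double *r, double *z)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            z[i] = r[i] / diag[i];
    }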


Team


Publications



http://www.ce.uniroma2.it/hybrid
This page was last modified: January 8, 2014.
For comments, you are welcome to send email to: Salvatore Filippone