Strategies

*Recent progress and challenges in exploiting graphics processors in computational fluid dynamics* provides some general strategies, based on its review of the literature, for using multiple levels of parallelism across GPUs, CPU cores, and cluster nodes:
- Global memory should be arranged to coalesce read/write requests, which can improve performance by an order of magnitude (theoretically, up to 32 times: the number of threads in a warp).
- Shared memory should be used for global reduction operations (e.g., summing residual values, finding maximum values) such that only one value per block needs to be returned.
- Use asynchronous memory transfers, as demonstrated by Phillips et al. and DeLeon et al. when parallelizing solvers across multiple GPUs, to limit the idle time of both the CPU and the GPU.
- Minimize slow CPU-GPU communication during a simulation by performing all possible calculations on the GPU.
Two example implementations on GitHub illustrate the scaling with grid size for some simple 2D problems:
- A Laplace solver running on the GPU using CUDA, with an equivalent CPU version for comparison.
- A lid-driven cavity solver using the finite difference method on the GPU, with an equivalent CPU version for comparison.
One of the interesting references from the paper mentioned above is *Hybridizing S3D into an Exascale Application using OpenACC*. The authors combine OpenACC directives for GPU offloading, OpenMP directives for multi-core parallelism, and MPI for multi-node communication. Their three-level hybrid approach performs better than any single approach alone, and with some clever algorithmic tweaks they are able to run the same code on a node without a GPU with only a modest performance penalty.