TY - GEN
T1 - Toward efficient architecture-independent algorithms for dynamic programs
AU - Javanmard, Mohammad Mahdi
AU - Ganapathi, Pramod
AU - Das, Rathish
AU - Ahmad, Zafar
AU - Tschudi, Stephen
AU - Chowdhury, Rezaul
N1 - Publisher Copyright: © Springer Nature Switzerland AG 2019.
PY - 2019
Y1 - 2019
N2 - We argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms to run efficiently under both shared-memory (multi- and manycores) and distributed-memory settings. The depth-first recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivity when resource availability (e.g., available space in shared caches) changes during runtime. Higher data locality leads to better intra-node I/O and cache performance and lower inter-node communication complexity, which in turn can reduce running times and energy consumption. Indeed, we show that a class of grid-based parallel recursive divide-and-conquer algorithms (for dynamic programs) can be run with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributed-memory machines (communication complexity) without changing the algorithm’s basic structure. Two-way recursive divide-and-conquer algorithms are known for solving dynamic programming (DP) problems on shared-memory multicore machines. In this paper, we show how to extend them to run efficiently also on manycore GPUs and distributed-memory machines. Our GPU algorithms work efficiently even when the data is too large to fit into the host RAM. These are external-memory algorithms based on recursive r-way divide and conquer, where r (≥ 2) varies based on the current depth of the recursion. Our distributed-memory algorithms are also based on multi-way recursive divide and conquer that extends naturally inside each shared-memory multicore/manycore compute node. We show that these algorithms are work-optimal and have low latency and bandwidth bounds. We also report empirical results for our GPU and distribute memory algorithms.
AB - We argue that the recursive divide-and-conquer paradigm is highly suited for designing algorithms to run efficiently under both shared-memory (multi- and manycores) and distributed-memory settings. The depth-first recursive decomposition of tasks and data is known to allow computations with potentially high temporal locality, and automatic adaptivity when resource availability (e.g., available space in shared caches) changes during runtime. Higher data locality leads to better intra-node I/O and cache performance and lower inter-node communication complexity, which in turn can reduce running times and energy consumption. Indeed, we show that a class of grid-based parallel recursive divide-and-conquer algorithms (for dynamic programs) can be run with provably optimal or near-optimal performance bounds on fat cores (cache complexity), thin cores (data movements), and purely distributed-memory machines (communication complexity) without changing the algorithm’s basic structure. Two-way recursive divide-and-conquer algorithms are known for solving dynamic programming (DP) problems on shared-memory multicore machines. In this paper, we show how to extend them to run efficiently also on manycore GPUs and distributed-memory machines. Our GPU algorithms work efficiently even when the data is too large to fit into the host RAM. These are external-memory algorithms based on recursive r-way divide and conquer, where r (≥ 2) varies based on the current depth of the recursion. Our distributed-memory algorithms are also based on multi-way recursive divide and conquer that extends naturally inside each shared-memory multicore/manycore compute node. We show that these algorithms are work-optimal and have low latency and bandwidth bounds. We also report empirical results for our GPU and distribute memory algorithms.
KW - Communication efficiency
KW - Distributed memory
KW - Dynamic programming
KW - Exascale
KW - GPU
KW - I/O efficiency
KW - Recursive divide & conquer
KW - Shared memory
UR - https://www.scopus.com/pages/publications/85067508267
U2 - 10.1007/978-3-030-20656-7_8
DO - 10.1007/978-3-030-20656-7_8
M3 - Conference contribution
SN - 9783030206550
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 143
EP - 164
BT - High Performance Computing - 34th International Conference, ISC High Performance 2019, Proceedings
A2 - Juckeland, Guido
A2 - Trinitis, Carsten
A2 - Sadayappan, Ponnuswamy
A2 - Weiland, Michèle
PB - Springer Verlag
T2 - 34th International Conference on High Performance Computing, ISC High Performance 2019
Y2 - 16 June 2019 through 20 June 2019
ER -