TY - GEN
T1 - Comparing and Contrasting User and Runtime Directed Data Placement Strategies for Owner-Compute, Multi-accelerator Distributed Task Based Scheduling
AU - Bouteiller, Aurelien
AU - Cao, Qinglei
AU - Schuchart, Joseph
AU - Herault, Thomas
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
N2 - Given GPU accelerators’ high arithmetic capacity, reducing data motion and optimizing locality are critical to achieving high performance. The task-based programming paradigm, as employed in the PaRSEC micro-task runtime system, decouples data distribution and the mapping of computation to resources from the algorithm’s base expression. In this paper, we leverage this capability to explore the performance impact of several data placement strategies, some automatic and runtime-directed and some user-directed, for the owner-compute scheduling model in the context of split-memory accelerators. We implement three different strategies for data and task mapping: a randomized first-touch policy that assigns data randomly to an accelerator, a load-balancing strategy that assigns data to the accelerator with the lowest load, and a user-directed strategy that minimizes cross-accelerator traffic by placing tasks so as to minimize cross-memory bandwidth. We carry out the evaluation on a variety of multi-GPU accelerated systems, including the Frontier system, and demonstrate that runtime-directed automatic data placement can improve locality compared to naive strategies, but also highlight that easily modifiable user-directed data placement remains crucial to achieving peak performance.
AB - Given GPU accelerators’ high arithmetic capacity, reducing data motion and optimizing locality are critical to achieving high performance. The task-based programming paradigm, as employed in the PaRSEC micro-task runtime system, decouples data distribution and the mapping of computation to resources from the algorithm’s base expression. In this paper, we leverage this capability to explore the performance impact of several data placement strategies, some automatic and runtime-directed and some user-directed, for the owner-compute scheduling model in the context of split-memory accelerators. We implement three different strategies for data and task mapping: a randomized first-touch policy that assigns data randomly to an accelerator, a load-balancing strategy that assigns data to the accelerator with the lowest load, and a user-directed strategy that minimizes cross-accelerator traffic by placing tasks so as to minimize cross-memory bandwidth. We carry out the evaluation on a variety of multi-GPU accelerated systems, including the Frontier system, and demonstrate that runtime-directed automatic data placement can improve locality compared to naive strategies, but also highlight that easily modifiable user-directed data placement remains crucial to achieving peak performance.
KW - Cholesky factorization
KW - Matrix computations
KW - Task-based runtime
KW - accelerator
UR - https://www.scopus.com/pages/publications/105019055223
U2 - 10.1007/978-3-031-97196-9_12
DO - 10.1007/978-3-031-97196-9_12
M3 - Conference contribution
SN - 9783031971952
T3 - Lecture Notes in Computer Science
SP - 140
EP - 153
BT - Asynchronous Many-Task Systems and Applications - 3rd International Workshop, WAMTA 2025, Proceedings
A2 - Diehl, Patrick
A2 - Cao, Qinglei
A2 - Herault, Thomas
A2 - Bosilca, George
PB - Springer Science and Business Media Deutschland GmbH
T2 - 3rd International Workshop on Asynchronous Many-Task Systems and Applications, WAMTA 2025
Y2 - 19 February 2025 through 21 February 2025
ER -