
A quantitative analysis of node sharing on HPC clusters using XDMoD application kernels

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

11 Scopus citations

Abstract

In this investigation, we study how application performance is affected when jobs are permitted to share compute nodes. A series of application kernels consisting of a diverse set of benchmark calculations were run in both exclusive and node-sharing modes on the Center for Computational Research's high-performance computing (HPC) cluster. Very little increase in runtime was observed due to job contention among application kernel jobs run on shared nodes. The small differences in runtime were quantitatively modeled in order to characterize the resource contention and attempt to determine the circumstances under which it would or would not be important. A machine learning regression model applied to the runtime data successfully fitted the small differences between the exclusive and shared node runtime data; it also provided insight into the contention for node resources that occurs when jobs are allowed to share nodes. Analysis of a representative job mix shows that runtime of shared jobs is affected primarily by the memory subsystem, in particular by the reduction in the effective cache size due to sharing; this leads to higher utilization of DRAM. Insights such as these are crucial when formulating policies proposing node sharing as a mechanism for improving HPC utilization.
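The comparison described in the abstract, running the same application kernel in exclusive and node-sharing modes and looking at the difference in runtime, can be illustrated with a minimal sketch. This is not the authors' analysis or their regression model: the function name `relative_slowdown`, the kernel labels, and all runtime values below are hypothetical placeholders chosen only to show the arithmetic.

```python
from statistics import mean


def relative_slowdown(exclusive_runtimes, shared_runtimes):
    """Return (mean shared runtime - mean exclusive runtime) / mean exclusive runtime.

    A positive value means the kernel ran slower, on average, on shared nodes.
    """
    if not exclusive_runtimes or not shared_runtimes:
        raise ValueError("both runtime lists must be non-empty")
    baseline = mean(exclusive_runtimes)
    if baseline <= 0:
        raise ValueError("exclusive runtimes must have a positive mean")
    return (mean(shared_runtimes) - baseline) / baseline


# Placeholder runtimes in seconds, invented for illustration only.
# They are NOT measurements from the paper.
placeholder_runs = {
    "kernel_a": {"exclusive": [100.0, 100.0], "shared": [101.0, 103.0]},
    "kernel_b": {"exclusive": [50.0, 50.0], "shared": [50.0, 50.0]},
}

slowdowns = {
    name: relative_slowdown(runs["exclusive"], runs["shared"])
    for name, runs in placeholder_runs.items()
}

for name, value in slowdowns.items():
    print(f"{name}: {value:+.1%}")
```

A per-kernel relative difference like this is only a starting point; relating it to node-level metrics (for example cache or DRAM counters) would require a regression step of the kind the abstract mentions, which this sketch does not attempt.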

Original language: English
Title of host publication: Proceedings of XSEDE 2016
Subtitle of host publication: Diversity, Big Data, and Science at Scale
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450347556
State: Published - Jul 17 2016
Event: Conference on Diversity, Big Data, and Science at Scale, XSEDE 2016 - Miami, United States
Duration: Jul 17 2016 to Jul 21 2016

Publication series

Name: ACM International Conference Proceeding Series
Volume: 17-21-July-2016

Conference

Conference: Conference on Diversity, Big Data, and Science at Scale, XSEDE 2016
Country/Territory: United States
City: Miami
Period: 07/17/16 to 07/21/16

Keywords

  • HPC
  • Node sharing
  • Performance Co-Pilot
  • SUPReMM
  • TACC-Stats
  • XDMoD
