
DynamicMF: A Matrix Factorization Approach to Monitor Resource Usage in High Performance Computing Systems

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Scopus citations

Abstract

High performance computing (HPC) facilities consist of a large number of interconnected computing units (or nodes) that execute highly complex simulations to support scientific research. Monitoring such facilities in real time is essential to ensure that the system operates at peak efficiency. Such systems are typically monitored using a variety of measurements and log data that capture the state of the various components within the system at regular intervals. As modern HPC systems grow in capacity and complexity, current resource monitoring tools produce data at a scale that analysts can no longer feasibly monitor visually. We propose a method that transforms the multidimensional output of resource monitoring tools into a low-dimensional representation that facilitates understanding of the behavior of an HPC system. The proposed method automatically extracts the low-dimensional signal in the data, which can be used to track system efficiency and identify performance anomalies. The method models the resource usage data as a three-dimensional tensor (capturing the usage of different resources by all compute nodes over time). A dynamic matrix factorization algorithm, called dynamicMF, is proposed to extract a low-dimensional temporal signal for each node, which is subsequently fed into an anomaly detector. Results on resource usage data show that the identified anomalies are correlated with anomalous events identified in the syslog messages.
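The pipeline outlined in the abstract (usage tensor, low-dimensional signal per node, anomaly detector) can be illustrated with a minimal sketch. This is not the dynamicMF algorithm, whose details the abstract does not give: a plain truncated SVD stands in for the factorization, and a median/MAD threshold on the reconstruction residual stands in for the anomaly detector. The function names, the tensor layout (nodes × resources × time), the synthetic data, and the threshold value are all assumptions made for illustration.

```python
import numpy as np


def low_dim_signals(X, rank):
    """Project a (nodes, resources, time) usage tensor onto `rank` components.

    Stand-in for a dynamic factorization: a single truncated SVD over all
    (node, time) observations. Returns per-node temporal signals with shape
    (nodes, time, rank) and residual norms with shape (nodes, time).
    """
    n_nodes, n_resources, n_steps = X.shape
    # Unfold the tensor: one row per (node, time) pair, one column per resource.
    rows = X.transpose(0, 2, 1).reshape(n_nodes * n_steps, n_resources)
    centered = rows - rows.mean(axis=0)
    # Right singular vectors give the shared low-dimensional resource basis.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank].T                      # (resources, rank)
    coords = centered @ basis                # low-dimensional coordinates
    residual = np.linalg.norm(centered - coords @ basis.T, axis=1)
    return (coords.reshape(n_nodes, n_steps, rank),
            residual.reshape(n_nodes, n_steps))


def flag_anomalies(scores, threshold=8.0):
    """Flag scores whose robust z-score (median/MAD) exceeds `threshold`."""
    med = np.median(scores)
    mad = np.median(np.abs(scores - med))
    mad = mad if mad > 0 else 1e-12          # guard against a zero MAD
    return 0.6745 * (scores - med) / mad > threshold


# Synthetic data (hypothetical): 8 nodes, 5 resources, 200 time steps,
# driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
n_nodes, n_resources, n_steps, rank = 8, 5, 200, 2
loadings = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 0]], dtype=float)
latent = rng.random((n_nodes, n_steps, rank))
X = (latent @ loadings.T).transpose(0, 2, 1)
X = X + 0.01 * rng.standard_normal(X.shape)
X[3, 4, 120] += 5.0                          # injected spike: node 3, step 120

signals, residuals = low_dim_signals(X, rank)
flags = flag_anomalies(residuals)
print(np.argwhere(flags))                    # (node, time) pairs flagged
```

In this sketch the per-node signal `signals[n]` is a short multivariate time series that could be plotted or passed to any other detector; the residual-based rule is only one simple choice.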

Original language: English
Title of host publication: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
Editors: Naoki Abe, Huan Liu, Calton Pu, Xiaohua Hu, Nesreen Ahmed, Mu Qiao, Yang Song, Donald Kossmann, Bing Liu, Kisung Lee, Jiliang Tang, Jingrui He, Jeffrey Saltz
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1302-1307
Number of pages: 6
ISBN (Electronic): 9781538650356
State: Published - Jul 2 2018
Event: 2018 IEEE International Conference on Big Data, Big Data 2018 - Seattle, United States
Duration: Dec 10 2018 to Dec 13 2018

Publication series

Name: Proceedings - 2018 IEEE International Conference on Big Data, Big Data 2018

Conference

Conference: 2018 IEEE International Conference on Big Data, Big Data 2018
Country/Territory: United States
City: Seattle
Period: 12/10/18 to 12/13/18

Keywords

  • HPC
  • anomaly detection
  • matrix factorization
  • performance profiling
  • system monitoring
