Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI

Rajachandrasekar, Raghunath; Besseron, Xavier; Panda, Dhabaleswar K.

doi:10.1109/IPDPSW.2012.139

Paper published in a book (Scientific congresses, symposiums and conference proceedings)

2012 • In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum

Peer reviewed

Permalink
https://hdl.handle.net/10993/1422

DOI
10.1109/IPDPSW.2012.139

Files

Full Text

paper.final.pdf

Author postprint (245.91 kB)

All documents in ORBilu are protected by a user license.

Details

Keywords :

Fault detection; Coordinated fault propagation; IPMI; FTB; Clusters

Abstract :

[en] Fault detection and prediction in HPC clusters and cloud-computing systems are increasingly challenging issues. Several system middleware components, such as job schedulers and MPI implementations, support both reactive and proactive mechanisms to tolerate faults. These techniques rely on external components such as system logs and infrastructure monitors to provide information about hardware and software failures, either as detections or as predictions. However, these middleware components work in isolation, without disseminating their knowledge of the faults encountered. In this context, we propose a lightweight multi-threaded service, FTB-IPMI, which provides distributed fault monitoring using the Intelligent Platform Management Interface (IPMI) and coordinated propagation of fault information using the Fault-Tolerance Backplane (FTB). In essence, it serves as a middleman between the system hardware and the software stack: it translates raw hardware events into structured software events and delivers them to any interested component through a publish-subscribe framework. Fault predictors and other decision-making engines that rely on distributed failure information can use FTB-IPMI to drive proactive fault-tolerance mechanisms such as preemptive job migration. We have developed a fault-prediction engine within MVAPICH2, an RDMA-based MPI implementation, to demonstrate this capability. Failure predictions made by this engine trigger the migration of processes from failing nodes to healthy spare nodes, thereby providing resilience to the MPI application. Experimental evaluation indicates that a single instance of FTB-IPMI can scale to several hundred nodes with a low resource-utilization footprint. A deployment of FTB-IPMI servicing a cluster of 128 compute nodes sweeps the entire cluster and collects IPMI sensor information on CPU temperature, system voltages, and fan speeds in about 0.75 seconds. The average CPU utilization of this service running on a single node is 0.35%.
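
The translate-and-publish flow described in the abstract can be illustrated with a minimal Python sketch. It is not the FTB-IPMI implementation and uses neither the FTB nor the IPMI API: the names `FaultEvent`, `EventBus`, `translate`, and `sweep`, the sensor key `cpu_temp_c`, and the threshold values are all hypothetical, chosen only to show raw readings being turned into structured events and delivered to subscribers.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class FaultEvent:
    """Structured software event derived from one raw sensor reading."""
    node: str
    sensor: str
    value: float
    severity: str  # "WARNING" or "CRITICAL"


# Hypothetical thresholds (warning, critical); these values are assumptions,
# not figures from the paper.
THRESHOLDS: Dict[str, tuple] = {"cpu_temp_c": (75.0, 90.0)}


def translate(node: str, sensor: str, value: float) -> Optional[FaultEvent]:
    """Map a raw reading to an event, or None if the reading is unremarkable."""
    limits = THRESHOLDS.get(sensor)
    if limits is None:
        return None
    warning, critical = limits
    if value >= critical:
        return FaultEvent(node, sensor, value, "CRITICAL")
    if value >= warning:
        return FaultEvent(node, sensor, value, "WARNING")
    return None


class EventBus:
    """In-process stand-in for a publish-subscribe backplane."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[FaultEvent], None]]] = defaultdict(list)

    def subscribe(self, severity: str, callback: Callable[[FaultEvent], None]) -> None:
        self._subscribers[severity].append(callback)

    def publish(self, event: FaultEvent) -> None:
        # Deliver the event to every callback registered for its severity.
        for callback in self._subscribers[event.severity]:
            callback(event)


def sweep(readings: Dict[str, Dict[str, float]], bus: EventBus) -> int:
    """Scan one round of per-node readings; return the number of events published."""
    published = 0
    for node, sensors in readings.items():
        for sensor, value in sensors.items():
            event = translate(node, sensor, value)
            if event is not None:
                bus.publish(event)
                published += 1
    return published
```

A subscriber standing in for a migration trigger could register with `bus.subscribe("CRITICAL", handler)` and collect the names of nodes to evacuate. In this sketch the sweep is sequential and the readings are passed in as a plain dictionary; the service described above is multi-threaded and reads sensors over IPMI.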

Disciplines :

Computer science

Identifiers :

UNILU:UL-CONFERENCE-2012-421

Author, co-author :

Rajachandrasekar, Raghunath; Network-Based Computing Laboratory, The Ohio State University

Besseron, Xavier; University of Luxembourg > Faculty of Science, Technology and Communication (FSTC) > Computer Science and Communications Research Unit (CSC)

Panda, Dhabaleswar K.; Network-Based Computing Laboratory, The Ohio State University

Language :

English

Title :

Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI

Publication date :

2012

Event name :

International Workshop on System Management Techniques, Processes, and Services (SMTPS'12), held in conjunction with IPDPS'12

Event place :

Shanghai, China

Event date :

May 21-May 25, 2012

Main work title :

Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum

Publisher :

IEEE Computer Society

ISBN/EAN :

978-1-4673-0974-5

Pages :

1136-1143

Peer reviewed :

Peer reviewed

Focus Area :

Computational Sciences

Available on ORBilu :

since 13 May 2013

Statistics

Number of views

82 (8 by Unilu)

Number of downloads

2 (2 by Unilu)
