Developing and Evaluating Advanced Methods for Resilience at Scale

funded by: SNL
funding level: $49,999
duration: 02/01/2010 - 06/30/2011

On large-scale high-performance computing (HPC) systems with tens to hundreds of thousands of cores, faults have become the norm rather than the exception. The objective of the proposed work is to alleviate the scalability limitations of current fault-tolerance practices on petascale installations, which could pave the way for forthcoming exascale systems. To this end, we propose to develop and evaluate advanced mechanisms that make large-scale HPC jobs resilient to failures. We will combine in-place rollback with redundant computing and evaluate the combination. We will also develop techniques to detect and recover from silent data corruption.

Publications:

"A Tunable, Software-based DRAM Error Detection and Correction Library for HPC" by D. Fiala, K. Ferreira, F. Mueller, C. Engelmann, Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, Sep 2011.