Developing and Evaluating Advanced Methods for Resilience at Scale


For large-scale high-performance computing (HPC) systems with tens to hundreds of thousands of cores, faults have become the norm rather than the exception. The objective of the proposed work is to alleviate the scalability limitations of current fault-tolerance practices on petascale installations, which could pave the way for forthcoming exascale systems. To this end, we propose to develop and evaluate advanced mechanisms that make large-scale HPC jobs resilient to failures. We will combine in-place rollback with redundant computing and evaluate the combination. We will also develop techniques to detect and recover from silent data corruption.

Publications: