2005 International Conference on Dependable Systems and Networks (DSN'05)
pp. 476-485
Filtering Failure Logs for a BlueGene/L Prototype
Yinglung Liang, Rutgers University
Yanyong Zhang, Rutgers University
Anand Sivasubramaniam, Penn State University
Ramendra K. Sahoo, IBM T. J. Watson Research Center
Jose Moreira, IBM T. J. Watson Research Center
Manish Gupta, IBM T. J. Watson Research Center
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DSN.2005.50
Abstract
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L, which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from an 8192-processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We apply a three-step filtering algorithm to these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.
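The three filtering steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Event` fields, the severity labels, the 300-second windows, and all function names are hypothetical, and the categorization in step 1 is reduced to a simple severity check.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since the start of the log (assumed unit)
    location: str     # hypothetical location label, e.g. a node or chip ID
    severity: str     # hypothetical severity label
    message: str      # error description text


# Assumption: which severity labels count as failures.
FAILURE_SEVERITIES = {"FATAL", "FAILURE"}


def extract_failures(events: Iterable[Event]) -> List[Event]:
    """Step 1: keep only failure events, ordered by time."""
    failures = [e for e in events if e.severity in FAILURE_SEVERITIES]
    return sorted(failures, key=lambda e: e.timestamp)


def temporal_filter(events: List[Event], window: float) -> List[Event]:
    """Step 2: drop a report if the same location reported the same
    message no more than `window` seconds before it."""
    last_seen: Dict[Tuple[str, str], float] = {}
    kept: List[Event] = []
    for e in events:
        key = (e.location, e.message)
        prev = last_seen.get(key)
        if prev is None or e.timestamp - prev > window:
            kept.append(e)
        last_seen[key] = e.timestamp  # repeated reports extend the window
    return kept


def spatial_filter(events: List[Event], window: float) -> List[Event]:
    """Step 3: coalesce reports of the same message from any location
    that arrive no more than `window` seconds apart."""
    last_seen: Dict[str, float] = {}
    kept: List[Event] = []
    for e in events:
        prev = last_seen.get(e.message)
        if prev is None or e.timestamp - prev > window:
            kept.append(e)
        last_seen[e.message] = e.timestamp
    return kept


def filter_log(events: Iterable[Event],
               temporal_window: float = 300.0,
               spatial_window: float = 300.0) -> List[Event]:
    """Run the three steps in order and return the surviving events."""
    failures = extract_failures(events)
    deduplicated = temporal_filter(failures, temporal_window)
    return spatial_filter(deduplicated, spatial_window)


# Made-up sample log, for illustration only.
sample = [
    Event(0.0, "A", "FATAL", "cache error"),
    Event(10.0, "A", "FATAL", "cache error"),    # repeat from the same location
    Event(20.0, "B", "FATAL", "cache error"),    # same error, other location
    Event(30.0, "A", "INFO", "boot"),            # not a failure
    Event(1000.0, "A", "FATAL", "cache error"),  # outside both windows
]
filtered = filter_log(sample)
```

In this sketch each step only ever removes entries, so the compression achieved depends entirely on the window lengths chosen; the values used here are placeholders rather than the thresholds from the paper.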
Additional Information
Citation:
Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Ramendra K. Sahoo, Jose Moreira, Manish Gupta, "Filtering Failure Logs for a BlueGene/L Prototype," 2005 International Conference on Dependable Systems and Networks (DSN'05), pp. 476-485, 2005.