Wednesday, February 27 • 1:30pm - 1:55pm
End-to-end I/O Monitoring on a Leading Supercomputer

This paper presents an effort to overcome the complexities of production-use I/O performance monitoring. We design Beacon, an end-to-end I/O resource monitoring and diagnosis system, for the 40,960-node Sunway TaihuLight supercomputer, currently ranked No. 3 in the world. Beacon simultaneously collects and correlates I/O tracing and profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching and storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Higher-level per-application I/O performance behaviors are reconstructed from system-level monitoring data to reveal correlations between system performance bottlenecks, utilization symptoms, and application behaviors. Beacon further provides query, statistics, and visualization utilities to users and administrators, allowing comprehensive and in-depth analysis without requiring any code or script modification.

Drawing on around 18 months of deployment on TaihuLight, we demonstrate Beacon's effectiveness with a collection of real-world use cases for identifying and diagnosing I/O performance issues. It has helped center administrators identify obscure design or configuration flaws, system anomalies, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, and others are currently being addressed. In addition, we demonstrate Beacon's generality through its recent extension to monitor interconnection networks, another contention point on supercomputers. Finally, both the code and the collected data are to be released.

Speakers

Bin Yang

Shandong University, National Supercomputing Center in Wuxi

Xu Ji

Tsinghua University, National Supercomputing Center in Wuxi

Xiaosong Ma

Qatar Computing Research Institute, HBKU

Xiyang Wang

National Supercomputing Center in Wuxi

Tianyu Zhang

Shandong University, National Supercomputing Center in Wuxi

Xiupeng Zhu

Shandong University, National Supercomputing Center in Wuxi

Nosayba El-Sayed

Emory University

Haidong Lan

Shandong University

Yibo Yang

Shandong University

Jidong Zhai

Tsinghua University

Weiguo Liu

Shandong University, National Supercomputing Center in Wuxi

Wei Xue

Tsinghua University, National Supercomputing Center in Wuxi
Wednesday February 27, 2019 1:30pm - 1:55pm EST
Constitution Ballroom