LLload: Simplifying Real-Time Job Monitoring for HPC Users
Chansup Byun, Julia Mullen, Albert Reuther, William Arcand, William Bergeron, David Bestor, Daniel Burrill, Vijay Gadepally, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Andrew Prout, Antonio Rosa, Charles Yee

TL;DR
LLload is a user-friendly tool developed at MIT Lincoln Laboratory that simplifies real-time resource monitoring for HPC users, aiding performance tuning and resource management.
Contribution
The paper introduces LLload, a new tool built from standard HPC utilities that makes resource monitoring more accessible for researchers, especially newcomers.
Findings
LLload provides real-time snapshots of resource usage per user.
It helps researchers develop performance monitoring skills.
It helps users make better-informed resource requests.
Abstract
One of the more complex tasks for researchers using HPC systems is performance monitoring and tuning of their applications. Developing a practice of continuous performance improvement, both for speed-up and for efficient use of resources, is essential to the long-term success of both the HPC practitioner and the research project. Profiling tools provide a good view of an application's performance but often have a steep learning curve and rarely provide an easy-to-interpret view of resource utilization. Lower-level tools such as top and htop provide a view of resource utilization for those familiar and comfortable with Linux but present a barrier for newer HPC practitioners. To expand the existing profiling and job monitoring options, the MIT Lincoln Laboratory Supercomputing Center created LLload, a tool that captures a snapshot of the resources being used by a job on a per-user basis. LLload is…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
