TL;DR
DMI-Lib is a high-speed, flexible system for observing the internal states of large language models during inference. It keeps detailed access to those states while cutting latency overhead by 2x-15x relative to existing baselines.
Contribution
Introduces DMI-Lib, an asynchronous observability system that decouples internal-state monitoring from the inference hot path, so detailed internal signals can be collected efficiently.
Findings
Incurs only 0.4%-6.8% overhead in offline inference
Averages 6% overhead in moderate online serving
Reduces latency overhead by 2x-15x compared to existing baselines with similar observability
Abstract
Today's inference-time workloads increasingly depend on timely access to a model's internal states. We present DMI-Lib, a high-speed deep model inspector that treats internal observability as a first-class systems primitive, decoupling it from the inference hot path via an asynchronous observability substrate built from Ring^2, a GPU-CPU memory abstraction for capturing and staging tensors, and a policy-controlled host backend that exports them. DMI-Lib enables the placement of observation points across a rich space of internal signals and diverse inference backends while preserving serving optimizations and adhering to tight GPU memory budgets. Our experiments demonstrate that DMI-Lib incurs only 0.4%-6.8% overhead in offline batch inference and an average of 6% in moderate online serving, reducing latency overhead by 2x-15x compared to existing baselines with similar observability…
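The decoupling idea in the abstract can be illustrated with a small CPU-only sketch. This is not DMI-Lib's API and does not model Ring^2 or GPU memory; the names `StagingRing`, `exporter`, and `run_demo` are hypothetical, and the drop-oldest policy is an assumption chosen only to show how a bounded staging buffer keeps the producer (the "inference loop") from blocking on export.

```python
import threading
from collections import deque


class StagingRing:
    """Hypothetical bounded staging buffer (illustrative only, not DMI-Lib's Ring^2).

    The producer never waits for the consumer: when the buffer is full,
    the oldest staged entry is dropped (an assumed policy).
    """

    def __init__(self, capacity):
        self._buf = deque()
        self._capacity = capacity
        self._cond = threading.Condition()
        self._closed = False
        self.dropped = 0

    def push(self, item):
        # Called from the hot path: constant-time, never waits on the exporter.
        with self._cond:
            if len(self._buf) >= self._capacity:
                self._buf.popleft()
                self.dropped += 1
            self._buf.append(item)
            self._cond.notify()

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def pop(self):
        # Called from the export thread: waits until data arrives or the ring closes.
        with self._cond:
            while not self._buf and not self._closed:
                self._cond.wait()
            if self._buf:
                return self._buf.popleft()
            return None  # closed and fully drained


def exporter(ring, sink):
    """Background consumer that drains staged records into a sink list."""
    while True:
        item = ring.pop()
        if item is None:
            break
        sink.append(item)


def run_demo(num_steps=100, capacity=8):
    ring = StagingRing(capacity)
    sink = []
    worker = threading.Thread(target=exporter, args=(ring, sink))
    worker.start()
    for step in range(num_steps):
        hidden = [float(step)] * 4  # stand-in for a captured internal tensor
        ring.push((step, hidden))
    ring.close()
    worker.join()
    return sink, ring.dropped


if __name__ == "__main__":
    exported, dropped = run_demo()
    print(f"exported={len(exported)} dropped={dropped}")
```

How many records are exported versus dropped depends on thread scheduling, but every record is accounted for as one or the other, and exported records keep their capture order. A real system would choose its own policy for a full buffer; dropping the oldest entry is only one option.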
