Modality-Aware Zero-Shot Pruning and Sparse Attention for Efficient Multimodal Edge Inference

Yueyuan Sui; Payal Mohapatra; Doğaç Eldenk; Haodong Yang; Yiting Zhang; Haoyan Zhang; Qi Zhu; Stephen Xia

arXiv:2604.08971 · cs.LG · April 13, 2026

TL;DR

The paper introduces SentryFuse, a framework for modality-aware zero-shot pruning and sparse attention that improves efficiency and robustness of multimodal models on edge devices without fine-tuning.

Contribution

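The paper combines modality-conditioned importance learning (SentryGate) with sparse grouped-query attention (SentryAttend), so that attention heads and feed-forward channels can be pruned at deployment without fine-tuning while attention cost is also reduced.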

Findings

01

Improves accuracy by 12.7% on average over baseline pruning methods.

02

Reduces memory usage by 28.2% and lowers latency by up to $1.63\times$.

03

Yields a net 15% reduction in GFLOPs across multiple architectures.

Abstract

Edge devices increasingly run multimodal sensing pipelines that must remain accurate despite fluctuating power budgets and unpredictable sensor dropout. Existing pruning methods fail under these conditions: they generally require fine-tuning after compression, consuming over $10 \times$ the deployment energy, and they assign static importance scores that are blind to which sensors are present. We present the SentryFuse framework, which addresses both challenges jointly through two key components. First, SentryGate learns modality-conditioned importance scores during training via first-order saliency supervision and then prunes attention heads and feed-forward channels at deployment without fine-tuning. Second, SentryAttend replaces dense self-attention, a key bottleneck in contemporary multimodal architectures, with sparse grouped-query attention, yielding a net 15% reduction in GFLOPs…
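
To make the idea of modality-conditioned pruning concrete, here is a minimal Python sketch of choosing which attention heads to keep given the set of sensors currently present. It is an illustration only, not the paper's SentryGate: the function name `head_keep_mask`, the per-modality score table, the rule of summing scores over present modalities, and the `keep_ratio` budget are all assumptions introduced for this example. The abstract states only that importance scores are learned during training via first-order saliency supervision and conditioned on modality.

```python
import numpy as np


def head_keep_mask(importance, present, keep_ratio):
    """Return a boolean mask over attention heads to keep (hypothetical helper).

    importance: (num_modalities, num_heads) scores, assumed to be learned offline
    present:    (num_modalities,) booleans, True where the sensor is available
    keep_ratio: fraction of heads to keep under the current compute budget
    """
    importance = np.asarray(importance, dtype=float)
    present = np.asarray(present, dtype=bool)
    if not present.any():
        raise ValueError("at least one modality must be present")

    # Assumption: a head's score is the sum of its scores over present
    # modalities, so absent sensors contribute nothing to the ranking.
    scores = importance[present].sum(axis=0)

    # Keep at least one head, and otherwise the top fraction by score.
    num_keep = max(1, int(round(keep_ratio * scores.size)))
    order = np.argsort(-scores, kind="stable")

    mask = np.zeros(scores.size, dtype=bool)
    mask[order[:num_keep]] = True
    return mask


# Two hypothetical modalities, four heads.
scores_table = [
    [0.9, 0.1, 0.6, 0.2],  # modality 0
    [0.1, 0.8, 0.5, 0.3],  # modality 1
]
both_present = head_keep_mask(scores_table, [True, True], keep_ratio=0.5)
only_second = head_keep_mask(scores_table, [False, True], keep_ratio=0.5)
```

With these made-up scores, the kept heads change when the first modality drops out, which is the behavior a static importance ranking cannot provide. No weights are updated at any point, which matches the zero-shot (no fine-tuning) setting described in the abstract.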
