VLM in a flash: I/O-Efficient Sparsification of Vision-Language Model via Neuron Chunking
Kichang Yang, Seonjun Kim, Minjae Kim, Nairan Zhang, Chi Zhang, Youngki Lee

TL;DR
This paper introduces Neuron Chunking, a novel sparsification method for vision-language models that improves I/O efficiency during flash-based weight offloading by considering storage access patterns.
Contribution
Neuron Chunking is an I/O-efficient sparsification strategy that operates on chunks of contiguous neurons and selects them by utility, defined as neuron importance normalized by estimated I/O latency.
Findings
Up to 4.65x I/O efficiency improvement on Jetson Orin Nano.
Up to 5.76x I/O efficiency improvement on Jetson AGX Orin.
Aligns sparsification decisions with storage access patterns instead of selecting neurons by activation magnitude alone.
Abstract
Edge deployment of large Vision-Language Models (VLMs) increasingly relies on flash-based weight offloading, where activation sparsification is used to reduce I/O overhead. However, conventional sparsification remains model-centric, selecting neurons solely by activation magnitude and neglecting how access patterns influence flash performance. We present Neuron Chunking, an I/O-efficient sparsification strategy that operates on chunks (i.e., groups of contiguous neurons in memory) and couples neuron importance with storage access cost. The method models I/O latency through a lightweight abstraction of access contiguity and selects chunks with high utility, defined as neuron importance normalized by estimated latency. By aligning sparsification decisions with the underlying storage behavior, Neuron Chunking improves I/O efficiency by up to 4.65x and 5.76x on Jetson Orin Nano and Jetson…
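The utility idea in the abstract (importance normalized by estimated latency) can be illustrated with a minimal sketch. This is not the authors' implementation: the cost model (a fixed per-request overhead plus a per-neuron transfer cost, with the overhead waived when a chunk extends an already selected neighbor), the greedy loop, the latency budget, and every name and parameter below are assumptions made for illustration only.

```python
from typing import List, Set


def marginal_latency(idx: int, selected: Set[int], chunk_len: int,
                     overhead: float, per_neuron: float) -> float:
    """Assumed cost model: reading a chunk next to an already selected chunk
    extends that read and avoids the fixed per-request overhead."""
    extends_run = (idx - 1 in selected) or (idx + 1 in selected)
    return chunk_len * per_neuron + (0.0 if extends_run else overhead)


def select_chunks(importance: List[float], chunk_size: int, budget: float,
                  overhead: float = 4.0, per_neuron: float = 1.0) -> List[int]:
    """Greedily pick chunk indices by utility = importance / marginal latency
    until no remaining chunk fits in the (hypothetical) latency budget."""
    # Split the neuron importance scores into contiguous fixed-size chunks.
    chunks = [importance[i:i + chunk_size]
              for i in range(0, len(importance), chunk_size)]
    scores = [sum(chunk) for chunk in chunks]

    selected: Set[int] = set()
    spent = 0.0
    while True:
        best, best_utility, best_cost = None, 0.0, 0.0
        for idx, chunk in enumerate(chunks):
            if idx in selected:
                continue
            cost = marginal_latency(idx, selected, len(chunk),
                                    overhead, per_neuron)
            if spent + cost > budget:
                continue  # this chunk no longer fits in the budget
            utility = scores[idx] / cost
            if utility > best_utility:
                best, best_utility, best_cost = idx, utility, cost
        if best is None:
            break
        selected.add(best)
        spent += best_cost
    return sorted(selected)


# Toy activation magnitudes for ten neurons, grouped into five chunks of two.
importance = [0.9, 0.8, 0.1, 0.1, 0.5, 0.5, 0.1, 0.0, 0.7, 0.6]
picked = select_chunks(importance, chunk_size=2, budget=15.0)
print(picked)
```

Under this assumed cost model, a low-importance chunk that sits next to a selected one can be cheaper to add than a more important but isolated chunk, which is the kind of trade-off a purely magnitude-based selection would not see.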
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Advanced Neural Network Applications
