TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints

Vinh-Thuan Ly; Hoang M. Truong; Xuan-Huong Nguyen

arXiv:2508.17595 · cs.CV · August 26, 2025

TL;DR

TinyGiantVLM is a lightweight, modular vision-language model designed for spatial reasoning in industrial environments, effectively combining global and regional features from RGB and depth data to improve understanding of complex scenes.

Contribution

The paper introduces TinyGiantVLM, a novel two-stage framework with a Mixture-of-Experts fusion module, optimized for resource-constrained spatial reasoning in warehouse-scale settings.

Findings

01 Achieved 5th place on the AI City Challenge 2025 leaderboard with a score of 66.8861.

02 Demonstrated improved spatial reasoning with an 80M-parameter MoE-enhanced model.

03 Effectively encodes multi-modal features for industrial spatial understanding.

Abstract

Reasoning about fine-grained spatial relationships in warehouse-scale environments poses a significant challenge for existing vision-language models (VLMs), which often struggle to comprehend 3D layouts, object arrangements, and multimodal cues in real-world industrial settings. In this paper, we present TinyGiantVLM, a lightweight and modular two-stage framework designed for physical spatial reasoning, distinguishing itself from traditional geographic reasoning in complex logistics scenes. Our approach encodes both global and region-level features from RGB and depth modalities using pretrained visual backbones. To effectively handle the complexity of high-modality inputs and diverse question types, we incorporate a Mixture-of-Experts (MoE) fusion module, which dynamically combines spatial representations to support downstream reasoning tasks and improve convergence. Training is…
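To make the idea of a Mixture-of-Experts fusion module concrete, the following is a minimal sketch, not the authors' implementation. It assumes that global and region-level RGB and depth features are simply concatenated into one vector, and that a softmax gate weights the outputs of a few linear experts (dense soft gating). The class name `MoEFusion`, the dimensions, the random weights, and the toy feature values are all hypothetical; the summary above does not specify the actual expert design or routing scheme.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MoEFusion:
    """Hypothetical dense soft-gated mixture of linear experts."""

    def __init__(self, in_dim, out_dim, num_experts, seed=0):
        rng = random.Random(seed)
        # The gate produces one score per expert from the input features.
        self.gate = random_matrix(num_experts, in_dim, rng)
        # Each expert is a single linear map (an assumption for brevity).
        self.experts = [
            random_matrix(out_dim, in_dim, rng) for _ in range(num_experts)
        ]

    def __call__(self, features):
        gate_weights = softmax(linear(self.gate, features))
        expert_outputs = [linear(expert, features) for expert in self.experts]
        out_dim = len(expert_outputs[0])
        # Fused output is the gate-weighted sum of the expert outputs.
        fused = [
            sum(g * out[i] for g, out in zip(gate_weights, expert_outputs))
            for i in range(out_dim)
        ]
        return fused, gate_weights


# Toy stand-ins for pretrained-backbone features (values are made up).
rgb_global = [0.2, 0.5]
depth_global = [0.1, 0.9]
rgb_region = [0.7, 0.3]
depth_region = [0.4, 0.6]

# Assumed fusion input: plain concatenation of all four feature groups.
features = rgb_global + depth_global + rgb_region + depth_region

fusion = MoEFusion(in_dim=len(features), out_dim=4, num_experts=3)
fused, gate_weights = fusion(features)
print("gate weights:", gate_weights)
print("fused vector:", fused)
```

Because the gate weights depend on the input, different scenes or question types can emphasize different experts, which is the sense in which such a module combines representations dynamically. A sparse top-k gate would be an alternative design; dense gating is used here only to keep the sketch short.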
