LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
Dominick Reilly, Rajatsubhra Chakraborty, Arkaprava Sinha, Manish Kumar Govind, Pu Wang, Francois Bremond, Le Xue, Srijan Das

TL;DR
This paper introduces LLAVIDAL, a multimodal large language vision model specialized for Activities of Daily Living, trained on a new dataset ADL-X, and demonstrates its superior performance in fine-grained ADL understanding tasks.
Contribution
It presents a semi-automated framework for creating ADL datasets, a new multimodal dataset ADL-X, and a novel training strategy MMPro for improved ADL activity modeling.
Findings
LLAVIDAL achieves state-of-the-art results on ADL benchmarks.
The multimodal curriculum training improves model performance.
The ADL-X dataset enhances fine-grained activity understanding.
Abstract
Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro)…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Context-Aware Activity Recognition Systems
