PerspectiveNet: Multi-View Perception for Dynamic Scene Understanding
Vinh Nguyen

TL;DR
PerspectiveNet is a lightweight, multi-view perception model that combines visual encoders, a connector module, and large language models to generate detailed descriptions of dynamic scenes from multiple camera viewpoints.
Contribution
The paper introduces PerspectiveNet, a novel architecture that effectively integrates visual features and LLMs for multi-view scene understanding, with a focus on efficiency and detailed description generation.
Findings
Achieves accurate scene descriptions from multiple camera views.
Efficient training and inference with a lightweight architecture.
Effective in the Traffic Safety Description and Analysis task.
Abstract
Generating detailed descriptions from multiple cameras and viewpoints is challenging due to the complex and inconsistent nature of visual data. In this paper, we introduce PerspectiveNet, a lightweight yet efficient model for generating long descriptions across multiple camera views. Our approach utilizes a vision encoder, a compact connector module that converts visual features into a fixed-size tensor, and a large language model (LLM) whose strong natural language generation capabilities we harness. The connector module is designed with three main goals: mapping visual features onto LLM embeddings, emphasizing key information needed for description generation, and producing a fixed-size feature matrix. Additionally, we augment our solution with a secondary task, correct frame sequence detection, enabling the model to search for the correct sequence of frames to generate…
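The three connector goals named in the abstract (mapping into the LLM embedding space, emphasizing key information, and producing a fixed-size matrix) can be illustrated with a small sketch. The abstract does not say how the connector is built, so the design below is an assumption: a linear projection followed by attention pooling over a fixed set of queries. The names `ToyConnector`, `fake_features`, `d_vis`, `d_llm`, and `num_queries` are hypothetical, and the random weights stand in for trained parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyConnector:
    """Hypothetical connector (an assumption, not the paper's design)."""

    def __init__(self, d_vis, d_llm, num_queries, seed=0):
        rng = random.Random(seed)
        self.d_llm = d_llm
        # Goal 1: map visual features into the LLM embedding space.
        self.proj = random_matrix(d_vis, d_llm, rng)
        # Goal 3: one query per output row fixes the output size.
        self.queries = random_matrix(num_queries, d_llm, rng)

    def project(self, feat):
        return [
            sum(feat[i] * self.proj[i][j] for i in range(len(feat)))
            for j in range(self.d_llm)
        ]

    def __call__(self, feats):
        keys = [self.project(f) for f in feats]
        scale = 1.0 / math.sqrt(self.d_llm)
        out = []
        for q in self.queries:
            # Goal 2: attention weights emphasize the most relevant features.
            weights = softmax([dot(q, k) * scale for k in keys])
            out.append([
                sum(w * k[j] for w, k in zip(weights, keys))
                for j in range(self.d_llm)
            ])
        return out


def fake_features(n, d_vis, seed):
    """Random vectors standing in for vision-encoder outputs."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(d_vis)] for _ in range(n)]


connector = ToyConnector(d_vis=16, d_llm=8, num_queries=4)
short_clip = connector(fake_features(5, 16, seed=1))
long_clip = connector(fake_features(12, 16, seed=2))
print(len(short_clip), len(short_clip[0]))  # 4 8
print(len(long_clip), len(long_clip[0]))    # 4 8
```

Whether the input holds 5 or 12 feature vectors, the output has `num_queries` rows of width `d_llm`, which is the fixed-size property the abstract asks for. A trained connector would learn the projection and the queries rather than sampling them.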
Taxonomy
Topics: Image Retrieval and Classification Techniques
