PerspectiveNet: Multi-View Perception for Dynamic Scene Understanding

Vinh Nguyen

arXiv:2410.16824 · cs.CV · July 22, 2025

TL;DR

PerspectiveNet is a lightweight, multi-view perception model that combines visual encoders, a connector module, and large language models to generate detailed descriptions of dynamic scenes from multiple camera viewpoints.

Contribution

The paper introduces PerspectiveNet, a novel architecture that effectively integrates visual features and LLMs for multi-view scene understanding, with a focus on efficiency and detailed description generation.

Findings

01. Generates accurate scene descriptions from multiple camera views.

02. Trains and runs inference efficiently thanks to a lightweight architecture.

03. Performs effectively on the Traffic Safety Description and Analysis task.

Abstract

Generating detailed descriptions from multiple cameras and viewpoints is challenging due to the complex and inconsistent nature of visual data. In this paper, we introduce PerspectiveNet, a lightweight and efficient model for generating long descriptions across multiple camera views. Our approach combines a vision encoder, a compact connector module that converts visual features into a fixed-size tensor, and large language models (LLMs), whose strong natural language generation capabilities it harnesses. The connector module is designed with three main goals: mapping visual features onto LLM embeddings, emphasizing key information needed for description generation, and producing a fixed-size feature matrix. Additionally, we augment our solution with a secondary task, correct frame sequence detection, enabling the model to search for the correct sequence of frames to generate…
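
The abstract names the connector's three goals but does not describe its internals. One common way to meet all three is cross-attention with a fixed set of learned query vectors, and the sketch below assumes that design purely for illustration; it is not the paper's implementation. The class name `ToyConnector`, its parameters, and the helper functions are hypothetical, and random weights stand in for trained ones.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Gaussian-initialised matrix; a stand-in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ToyConnector:
    """Hypothetical connector: linear projection plus learned-query attention."""

    def __init__(self, vis_dim, llm_dim, num_queries, seed=0):
        rng = random.Random(seed)
        # Goal 1: map visual features into the LLM embedding width.
        self.proj = random_matrix(vis_dim, llm_dim, rng)
        # Goals 2 and 3: a fixed number of queries attends over all tokens,
        # so the output always has num_queries rows.
        self.queries = random_matrix(num_queries, llm_dim, rng)
        self.scale = 1.0 / math.sqrt(llm_dim)

    def __call__(self, visual_tokens):
        # visual_tokens: (n_tokens, vis_dim); n_tokens may differ per input.
        projected = matmul(visual_tokens, self.proj)        # (n_tokens, llm_dim)
        keys_t = [list(col) for col in zip(*projected)]     # (llm_dim, n_tokens)
        scores = matmul(self.queries, keys_t)               # (num_queries, n_tokens)
        # Each query's weights sum to 1, emphasising the tokens it scores highest.
        weights = [softmax([s * self.scale for s in row]) for row in scores]
        return matmul(weights, projected)                   # (num_queries, llm_dim)


if __name__ == "__main__":
    rng = random.Random(1)
    connector = ToyConnector(vis_dim=6, llm_dim=4, num_queries=3)
    for n_tokens in (5, 11):  # e.g. different numbers of views or frames
        out = connector(random_matrix(n_tokens, 6, rng, scale=1.0))
        print(n_tokens, "tokens in ->", len(out), "x", len(out[0]), "out")
```

Under this assumption the output shape depends only on `num_queries` and `llm_dim`, so inputs with different numbers of visual tokens yield the same fixed-size matrix for the language model.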

Taxonomy

Topics: Image Retrieval and Classification Techniques