Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection
Sungjune Park, Hyunjun Kim, Yong Man Ro

TL;DR
This paper introduces a method that leverages large language models to extract and incorporate appearance knowledge into pedestrian detection systems, significantly improving detection accuracy across diverse scenes.
Contribution
The paper presents an approach that integrates language-derived appearance elements with visual cues, improving pedestrian detection performance and achieving state-of-the-art results.
Findings
Noticeable performance gains on benchmarks
Effective integration of language and visual cues
Achieved state-of-the-art detection results
Abstract
Large language models (LLMs) have shown that they can capture contextual and semantic information about instance appearances. In this paper, we introduce a novel approach that uses the strengths of LLMs in understanding contextual appearance variations and transfers this knowledge to a vision model (here, pedestrian detection). While pedestrian detection is one of the crucial tasks directly related to our safety (e.g., in intelligent driving systems), it is challenging because of varying appearances and poses in diverse scenes. Therefore, we propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection. To this end, we establish a description corpus that includes numerous narratives describing various appearances of pedestrians and other instances. By feeding them through an LLM, we…
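The abstract is cut off before it says how the appearance elements are formed or how they are combined with visual cues, so the following is only an illustrative sketch of one generic way such a combination could work, not the authors' method. It assumes a small bank of appearance-element vectors (hypothetical stand-ins for LLM-derived embeddings) and a visual feature vector for one image region, and fuses them with similarity-weighted attention. The names `fuse`, `elements`, and `region`, the temperature value, and the toy numbers are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(visual_feature, appearance_elements, temperature=0.1):
    """Attend from a visual feature to the element bank and add the result.

    Assumption: fusion is a residual sum of the visual feature and a
    similarity-weighted average of the elements. The paper may do
    something different.
    """
    sims = [cosine(visual_feature, e) for e in appearance_elements]
    weights = softmax(sims, temperature)
    dim = len(visual_feature)
    context = [
        sum(w * e[i] for w, e in zip(weights, appearance_elements))
        for i in range(dim)
    ]
    fused = [v + c for v, c in zip(visual_feature, context)]
    return fused, weights


# Toy data: two hypothetical appearance elements and one region feature.
elements = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region = [0.9, 0.1, 0.0]
fused, weights = fuse(region, elements)
```

In this toy setup the region feature lies closest to the first element, so that element receives most of the attention weight. A lower temperature sharpens the weighting, and a higher one spreads it across elements.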
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Infrastructure Maintenance and Monitoring
