# MSMVD: Exploiting Multi-scale Image Features via Multi-scale BEV Features for Multi-view Pedestrian Detection

**Authors:** Taiga Yamane, Satoshi Suzuki, Ryo Masumura, Shota Orihashi, Tomohiro Tanaka, Mana Ihori, Naoki Makishima, Naotaka Kawata

arXiv: 2508.20447 · 2025-08-29

## TL;DR

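MSMVD builds multi-scale BEV features from multi-scale image features for multi-view pedestrian detection. This improves accuracy for pedestrians that appear consistently small, consistently large, or at very different scales across views, and raises the previous highest MODA on the GMVD dataset by 4.5 points.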

## Contribution

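The paper proposes a two-step approach. First, multi-scale image features extracted from each view are projected into BEV space scale by scale, so that each BEV feature inherits the properties of the image features at its corresponding scale. Second, a feature pyramid network processes the resulting multi-scale BEV features to combine information across scales and views.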

## Key findings

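- Outperforms the previous highest MODA on the GMVD dataset by 4.5 points.
- Scale-specific BEV features help detect pedestrians that appear consistently small or consistently large in the views.
- Processing the multi-scale BEV features with a feature pyramid network improves detection of pedestrians whose scale differs greatly between views.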
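To make the headline number concrete, here is a minimal sketch of how MODA (Multiple Object Detection Accuracy) is commonly computed from error counts. The function name `moda` and the counts in the usage lines are illustrative only and are not taken from the paper. The matching step that produces the miss and false-positive counts is not shown, and the sketch assumes "points" means percentage points.

```python
def moda(num_gt: int, false_negatives: int, false_positives: int) -> float:
    """MODA = 1 - (misses + false positives) / number of ground-truth objects.

    The error counts are assumed to come from a separate step that matches
    detections to ground-truth positions on the ground plane (not shown).
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives) / num_gt


# Illustrative counts only (hypothetical, not results from the paper).
score = moda(num_gt=1000, false_negatives=60, false_positives=40)
print(f"MODA = {100 * score:.1f}")
```

Under this definition, a gain of 4.5 points corresponds to reducing misses plus false positives by 4.5% of the ground-truth count.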

## Abstract

Multi-View Pedestrian Detection (MVPD) aims to detect pedestrians in the form of a bird's eye view (BEV) from multi-view images. In MVPD, end-to-end trainable deep learning methods have progressed greatly. However, they often struggle to detect pedestrians with consistently small or large scales in views or with vastly different scales between views. This is because they do not exploit multi-scale image features to generate the BEV feature and detect pedestrians. To overcome this problem, we propose a novel MVPD method, called Multi-Scale Multi-View Detection (MSMVD). MSMVD generates multi-scale BEV features by projecting multi-scale image features extracted from individual views into the BEV space, scale-by-scale. Each of these BEV features inherits the properties of its corresponding scale image features from multiple views. Therefore, these BEV features help the precise detection of pedestrians with consistently small or large scales in views. Then, MSMVD combines information at different scales of multiple views by processing the multi-scale BEV features using a feature pyramid network. This improves the detection of pedestrians with vastly different scales between views. Extensive experiments demonstrate that exploiting multi-scale image features via multi-scale BEV features greatly improves the detection performance, and MSMVD outperforms the previous highest MODA by $4.5$ points on the GMVD dataset.
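The two steps in the abstract can be sketched roughly as follows. This is a simplified NumPy illustration under stated assumptions, not the authors' implementation. It assumes a ground-plane homography per view and per scale that maps BEV grid cells to feature-map pixels, nearest-neighbour sampling, mean aggregation across views, and an FPN-style top-down pathway without learned convolutions. All function names are hypothetical.

```python
import numpy as np


def project_to_bev(feat, hom_bev_to_img, bev_hw):
    """Warp one view's feature map (C, H, W) onto a BEV grid (nearest neighbour).

    hom_bev_to_img is an assumed 3x3 homography mapping a BEV cell (x, y, 1)
    to pixel coordinates (u, v, w) in this feature map.
    """
    C, H, W = feat.shape
    hb, wb = bev_hw
    ys, xs = np.meshgrid(np.arange(hb), np.arange(wb), indexing="ij")
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(hb * wb)])  # (3, hb*wb)
    uvw = hom_bev_to_img @ pts
    ok = np.abs(uvw[2]) > 1e-8                # skip degenerate points
    w = np.where(ok, uvw[2], 1.0)
    u = np.rint(uvw[0] / w).astype(int)
    v = np.rint(uvw[1] / w).astype(int)
    inside = ok & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    bev = np.zeros((C, hb * wb), dtype=feat.dtype)
    bev[:, inside] = feat[:, v[inside], u[inside]]  # cells outside the view stay 0
    return bev.reshape(C, hb, wb)


def multi_scale_bev(feats_per_scale, homs_per_scale, bev_sizes):
    """Project every view at every scale, giving one BEV feature per scale."""
    bev_feats = []
    for feats, homs, size in zip(feats_per_scale, homs_per_scale, bev_sizes):
        per_view = [project_to_bev(f, h, size) for f, h in zip(feats, homs)]
        # Assumption: views are aggregated by a plain mean.
        bev_feats.append(np.mean(per_view, axis=0))
    return bev_feats


def top_down_fuse(bev_feats):
    """FPN-style top-down fusion; bev_feats is ordered fine -> coarse.

    Assumes each finer grid size is an integer multiple of the next coarser one.
    """
    fused = [bev_feats[-1]]
    for finer in reversed(bev_feats[:-1]):
        coarse = fused[0]
        fy = finer.shape[1] // coarse.shape[1]
        fx = finer.shape[2] // coarse.shape[2]
        up = coarse.repeat(fy, axis=1).repeat(fx, axis=2)  # nearest upsampling
        fused.insert(0, finer + up)
    return fused


# Toy usage: 2 views, 3 channels, 2 scales, identity homographies.
rng = np.random.default_rng(0)
fine = rng.normal(size=(2, 3, 8, 8))
coarse = rng.normal(size=(2, 3, 4, 4))
eye = np.eye(3)
bev = multi_scale_bev([fine, coarse], [[eye, eye], [eye, eye]], [(8, 8), (4, 4)])
fused = top_down_fuse(bev)
print([f.shape for f in fused])
```

In a real pipeline the homographies would come from camera calibration and be rescaled for each feature stride, and the fusion would include learned layers. The sketch only shows how information from coarser BEV features reaches finer ones after per-scale projection.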

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20447/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20447/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/2508.20447/full.md

---
Source: https://tomesphere.com/paper/2508.20447