UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

Yecheng Zhang; Rong Zhao; Zhizhou Sha; Yong Li; Lei Wang; Ce Hou; Wen Ji; Hao Huang; Yunshan Wan; Jian Yu; Junhao Xia; Yuru Zhang; Chunlei Shi

arXiv:2602.19442 · cs.CV · March 12, 2026

TL;DR

UrbanAlign is a post-hoc calibration method that aligns frozen vision-language models with human preferences in urban scene perception tasks without retraining, using a three-stage interpretability-driven pipeline.

Contribution

The paper introduces a three-stage post-hoc calibration pipeline that aligns frozen VLMs with human preferences without modifying any weights, relying on interpretable evaluation dimensions and concept extraction.

Findings

01 Achieves 72.2% accuracy on Place Pulse 2.0 perception categories.

02 Outperforms all baselines by +11.0 percentage points.

03 Zero-shot VLM performance improved by +15.5 percentage points.

Abstract

Vision-language models (VLMs) can describe urban scenes in rich detail, yet consistently fail to produce reliable human preference labels in domain-specific tasks such as safety assessment and aesthetic evaluation. The standard fix, fine-tuning or RLHF, requires large-scale annotations and model retraining. We ask a different question: can a frozen VLM be aligned with human preferences without modifying any weights? Our key insight is that VLMs are strong concept extractors but poor decision calibrators. We propose a three-stage post-hoc pipeline that exploits this asymmetry: (i) interpretable evaluation dimensions are automatically mined from consensus exemplars; (ii) an Observer-Debater-Judge chain extracts robust concept scores from the frozen VLM; and (iii) locally-weighted ridge regression on a hybrid manifold calibrates these scores to human ratings. Applied as UrbanAlign on Place…
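
Stage (iii) of the abstract, locally-weighted ridge regression from concept scores to human ratings, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `locally_weighted_ridge`, the Gaussian kernel, the `bandwidth` and `ridge` parameters, and the use of plain Euclidean distance in concept-score space (standing in for the paper's "hybrid manifold", which the abstract does not define) are all assumptions made for illustration.

```python
import numpy as np


def locally_weighted_ridge(X_train, y_train, x_query, bandwidth=1.0, ridge=1e-2):
    """Predict a rating at x_query from a ridge fit weighted toward nearby samples.

    X_train: (n, d) concept scores, y_train: (n,) human ratings,
    x_query: (d,) concept scores of the item to calibrate.
    All names and defaults here are illustrative assumptions.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Gaussian kernel weights from squared Euclidean distance to the query.
    # Assumption: Euclidean distance stands in for the "hybrid manifold" metric.
    sq_dist = np.sum((X_train - x_query) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))

    # Append a constant column so the local model has an intercept.
    A = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    a_query = np.append(x_query, 1.0)

    # Weighted ridge normal equations: (A^T W A + ridge * I) beta = A^T W y.
    AtW = A.T * weights
    penalty = ridge * np.eye(A.shape[1])
    penalty[-1, -1] = 0.0  # leave the intercept unpenalized
    beta = np.linalg.solve(AtW @ A + penalty, AtW @ y_train)

    return float(a_query @ beta)
```

A separate local model is fit for every query, so items with similar concept scores dominate the fit while the ridge term keeps the solve stable when few neighbors carry meaningful weight. In this sketch, `bandwidth` controls how local the fit is and `ridge` controls the shrinkage; both would need tuning against held-out human ratings.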

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications