TL;DR
This paper proposes a multi-dimensional evaluation framework for city-trip recommendations using LLMs as judges, incorporating calibration to address biases and interpretability issues.
Contribution
It introduces a three-phase calibration framework for LLM-based evaluation of travel recommendations across multiple dimensions, improving transparency and bias-awareness.
Findings
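Model-specific biases and high dimension-level variance appear even when judges agree on overall rankings.
Calibration clarifies each judge's reasoning per dimension.
Calibration also exposes divergent interpretations of sustainability, underscoring the need for transparent, bias-aware LLM evaluation.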
Abstract
Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable city-trip lists across four dimensions (relevance, diversity, sustainability, and popularity balance) and propose a three-phase calibration framework: (1) baseline judging with multiple LLMs, (2) expert evaluation to identify systematic misalignment, and (3) dimension-specific calibration via rules and few-shot examples. Across two recommendation settings, we observe model-specific biases and high dimension-level variance, even when judges agree on overall rankings. Calibration clarifies reasoning per dimension but exposes divergent interpretations of sustainability, highlighting the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility:…
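To make concrete how judges can agree on overall rankings while disagreeing within a dimension, here is a minimal Python sketch. It is not the paper's released code: the judge names, list names, 1-5 scale, equal weighting of dimensions, and every score below are hypothetical, chosen only for illustration.

```python
from statistics import mean, pvariance

DIMENSIONS = ("relevance", "diversity", "sustainability", "popularity_balance")

# Hypothetical 1-5 scores from two LLM judges for three recommendation lists.
# These numbers are made up for illustration; they are not results from the paper.
scores = {
    "judge_a": {
        "list_1": {"relevance": 5, "diversity": 4, "sustainability": 2, "popularity_balance": 4},
        "list_2": {"relevance": 3, "diversity": 3, "sustainability": 3, "popularity_balance": 3},
        "list_3": {"relevance": 2, "diversity": 2, "sustainability": 1, "popularity_balance": 2},
    },
    "judge_b": {
        "list_1": {"relevance": 4, "diversity": 4, "sustainability": 5, "popularity_balance": 3},
        "list_2": {"relevance": 3, "diversity": 2, "sustainability": 4, "popularity_balance": 3},
        "list_3": {"relevance": 2, "diversity": 1, "sustainability": 3, "popularity_balance": 2},
    },
}


def overall_ranking(judge_scores):
    """Rank lists by their mean score across all dimensions, best first.

    Equal weighting of dimensions is an assumption of this sketch.
    """
    totals = {
        name: mean(dims[d] for d in DIMENSIONS)
        for name, dims in judge_scores.items()
    }
    return sorted(totals, key=totals.get, reverse=True)


def dimension_variance(all_scores):
    """Per dimension: between-judge population variance, averaged over lists."""
    list_names = list(next(iter(all_scores.values())))
    return {
        d: mean(
            pvariance([all_scores[judge][name][d] for judge in all_scores])
            for name in list_names
        )
        for d in DIMENSIONS
    }


if __name__ == "__main__":
    for judge, judge_scores in scores.items():
        print(judge, overall_ranking(judge_scores))
    print(dimension_variance(scores))
```

In this toy data both judges produce the same overall ranking, yet the between-judge variance is largest on sustainability, which is the kind of disagreement that an aggregate ranking alone would hide.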
