Multi-Dimensional Evaluation of Sustainable City Trips with LLM-as-a-Judge and Human-in-the-Loop

Ashmi Banerjee; Adithi Satish; Wolfgang Wörndl; Yashar Deldjoo

arXiv:2604.24158·cs.AI·April 28, 2026


TL;DR

This paper proposes a multi-dimensional framework for evaluating sustainable city-trip recommendations with LLMs as judges, adding a calibration step to address model-specific bias and make per-dimension reasoning easier to interpret.

Contribution

The paper introduces a three-phase calibration framework for LLM-based evaluation of travel recommendations across four dimensions (relevance, diversity, sustainability, and popularity balance), aimed at more transparent, bias-aware judging.
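The third phase (dimension-specific calibration via rules and few-shot examples) can be pictured as prompt assembly. The following is a minimal sketch only, not the authors' released prompts: the `Calibration` class, the `build_judge_prompt` function, the prompt wording, and the 1-5 scale are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field

# The four dimensions named in the paper's abstract.
DIMENSIONS = ("relevance", "diversity", "sustainability", "popularity balance")


@dataclass
class Calibration:
    """Hypothetical container for one dimension's calibration material."""
    rules: list = field(default_factory=list)     # plain-language rules
    examples: list = field(default_factory=list)  # (trip list, expert score) pairs


def build_judge_prompt(dimension, candidate, calibration=None):
    """Assemble a judge prompt; without calibration this is the baseline prompt."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    # Assumption: a 1-5 rating scale; the source does not state the scale.
    lines = [f"Rate the following city-trip list for {dimension} on a 1-5 scale."]
    if calibration is not None:
        if calibration.rules:
            lines.append("Rules:")
            lines.extend(f"- {rule}" for rule in calibration.rules)
        for text, score in calibration.examples:
            lines.append(f"Example: {text} -> {score}")
    lines.append(f"List: {candidate}")
    return "\n".join(lines)


# Hypothetical calibration content, for illustration only.
sustainability_calibration = Calibration(
    rules=["Consider how reachable each city is by rail."],
    examples=[("Vienna, Bratislava, Budapest", 4)],
)
baseline_prompt = build_judge_prompt("sustainability", "Lyon, Turin, Milan")
calibrated_prompt = build_judge_prompt(
    "sustainability", "Lyon, Turin, Milan", sustainability_calibration
)
```

Keeping rules and examples per dimension means a judge can be recalibrated on one dimension without changing how it is prompted on the other three.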

Findings

01

LLM judges show model-specific biases and high dimension-level variance, even when they agree on overall rankings.

02

Calibration reduces bias and clarifies reasoning per dimension.

03

Calibration exposes divergent interpretations of sustainability, underscoring the need for transparent, bias-aware LLM evaluation.

Abstract

Evaluating nuanced conversational travel recommendations is challenging when human annotations are costly and standard metrics ignore stakeholder-centric goals. We study LLMs-as-Judges for sustainable city-trip lists across four dimensions (relevance, diversity, sustainability, and popularity balance) and propose a three-phase calibration framework: (1) baseline judging with multiple LLMs, (2) expert evaluation to identify systematic misalignment, and (3) dimension-specific calibration via rules and few-shot examples. Across two recommendation settings, we observe model-specific biases and high dimension-level variance, even when judges agree on overall rankings. Calibration clarifies reasoning per dimension but exposes divergent interpretations of sustainability, highlighting the need for transparent, bias-aware LLM evaluation. Prompts and code are released for reproducibility:…
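To make concrete how judges can agree on overall rankings while disagreeing on a single dimension, here is a toy sketch. The judge names, list names, and every score are invented for illustration and are not results from the paper; averaging the four dimensions with equal weight into an overall score is also an assumption.

```python
from statistics import mean, pvariance

# Hypothetical scores: judge -> recommendation list -> dimension -> score.
scores = {
    "judge_a": {
        "list_1": {"relevance": 5, "diversity": 4, "sustainability": 2, "popularity balance": 4},
        "list_2": {"relevance": 3, "diversity": 3, "sustainability": 2, "popularity balance": 3},
    },
    "judge_b": {
        "list_1": {"relevance": 4, "diversity": 4, "sustainability": 5, "popularity balance": 3},
        "list_2": {"relevance": 3, "diversity": 2, "sustainability": 4, "popularity balance": 2},
    },
}


def overall_ranking(judge_scores):
    """Rank lists by the unweighted mean over dimensions (an assumed aggregate)."""
    totals = {name: mean(dims.values()) for name, dims in judge_scores.items()}
    return sorted(totals, key=totals.get, reverse=True)


def dimension_variance(all_scores, item):
    """Population variance across judges, per dimension, for one list."""
    dimensions = next(iter(all_scores.values()))[item].keys()
    return {
        dim: pvariance([all_scores[judge][item][dim] for judge in all_scores])
        for dim in dimensions
    }


rankings = {judge: overall_ranking(s) for judge, s in scores.items()}
variance_list_1 = dimension_variance(scores, "list_1")
most_contested = max(variance_list_1, key=variance_list_1.get)
```

In this toy data both judges rank `list_1` above `list_2`, yet their sustainability scores for `list_1` differ far more than their scores on any other dimension, which is the kind of disagreement an overall ranking alone would hide.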

Code & Models

Repositories

ashmibanerjee/trs-llm-calibration (GitHub)
