Multi-stage Large Language Model Pipelines Can Outperform GPT-4o in Relevance Assessment
Julian A. Schnabel, Johanne R. Trippas, Falk Scholer, Danula Hettiachchi

TL;DR
This paper introduces a multi-stage LLM pipeline for relevance assessment that outperforms GPT-4o in accuracy and cost-efficiency, providing a scalable alternative to human annotation.
Contribution
The authors develop a modular, multi-stage LLM pipeline that improves relevance assessment accuracy and reduces costs compared to existing GPT-4o methods.
Findings
Achieved an 18.4% increase in Krippendorff's α accuracy over GPT-4o mini on TREC-DL.
Maintained a low cost of about 0.2 USD per million input tokens.
Enhanced GPT-4o's accuracy by 9.7% using the pipeline approach.
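The findings above are reported in terms of Krippendorff's α, an agreement coefficient defined as α = 1 - D_o / D_e, where D_o is the observed disagreement and D_e the disagreement expected by chance. The sketch below computes the nominal variant from a coincidence matrix. It is an illustration only: this summary does not state which distance metric (nominal, ordinal, or interval) the authors used, and the function name `krippendorff_alpha_nominal` is hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha (assumed variant, not confirmed by the paper).

    units: one list of labels per judged item, e.g. [model_label, human_label];
           None marks a missing label.
    """
    coincidences = Counter()
    for ratings in units:
        values = [v for v in ratings if v is not None]
        m = len(values)
        if m < 2:
            continue  # an item needs at least two labels to form a pair
        # Every ordered pair of labels from different positions counts 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c of the coincidence matrix.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return math.nan

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return math.nan  # only one label value occurs; alpha is undefined
    return 1.0 - observed / expected
```

For example, `krippendorff_alpha_nominal([[1, 1], [1, 1], [0, 0], [0, 1]])` compares two label sources on four items that disagree once.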
Abstract
The effectiveness of search systems is evaluated using relevance labels that indicate the usefulness of documents for specific queries and users. While obtaining these relevance labels from real users is ideal, scaling such data collection is challenging. Consequently, third-party annotators are employed, but their inconsistent accuracy demands costly auditing, training, and monitoring. We propose an LLM-based modular classification pipeline that divides the relevance assessment task into multiple stages, each utilising different prompts and models of varying sizes and capabilities. Applied to TREC Deep Learning (TREC-DL), one of our approaches showed an 18.4% Krippendorff's α accuracy increase over OpenAI's GPT-4o mini while maintaining a cost of about 0.2 USD per million input tokens, offering a more efficient and scalable solution for relevance assessment. This approach beats…
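The staged design described in the abstract can be pictured as a cascade in which a cheap stage settles the easy cases and defers the rest to a more capable stage. The following is a minimal sketch of that control flow, not the authors' implementation. The stage names, the relative costs, the 0-3 label scale, and the term-overlap heuristics are all assumptions; the heuristics are plain stand-ins for the prompted LLM calls a real pipeline would make.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A judge returns a relevance label, or None to defer to the next stage.
Judge = Callable[[str, str], Optional[int]]


@dataclass
class Stage:
    name: str
    judge: Judge
    cost_per_call: float  # hypothetical relative cost, not a real price


def _terms(text: str) -> set:
    return set(text.lower().split())


def cheap_filter(query: str, passage: str) -> Optional[int]:
    """Stand-in for a small model with a binary prompt (assumption)."""
    if not (_terms(query) & _terms(passage)):
        return 0  # confidently irrelevant: stop here
    return None  # possibly relevant: defer to the next stage


def graded_judge(query: str, passage: str) -> Optional[int]:
    """Stand-in for a larger model with a graded prompt (assumption)."""
    query_terms = _terms(query)
    ratio = len(query_terms & _terms(passage)) / max(len(query_terms), 1)
    if ratio >= 0.99:
        return 3
    if ratio >= 0.5:
        return 2
    return 1


def run_pipeline(
    stages: List[Stage], query: str, passage: str, default: int = 0
) -> Tuple[int, str, float]:
    """Return (label, name of the deciding stage, accumulated cost)."""
    cost = 0.0
    for stage in stages:
        cost += stage.cost_per_call
        label = stage.judge(query, passage)
        if label is not None:
            return label, stage.name, cost
    return default, "default", cost


STAGES = [
    Stage("cheap_filter", cheap_filter, cost_per_call=1.0),
    Stage("graded_judge", graded_judge, cost_per_call=10.0),
]
```

Calling `run_pipeline(STAGES, "python list sort", "how to bake bread")` stops at the first stage, so only query-passage pairs that survive the cheap filter incur the cost of the larger stage. That early exit is the intuition behind how a staged pipeline might keep the average cost per judgment low.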
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
