ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models
Yaswanth Narsupalli, Abhranil Chandra, Sreevatsa Muppirala, Manish Gupta, Pawan Goyal

TL;DR
ReFeR is a tuning-free, hierarchical evaluation framework for generative models that improves assessment accuracy and reasoning capabilities while offering variants optimized for speed and cost-efficiency.
Contribution
The paper introduces ReFeR, a novel hierarchy-based, tuning-free evaluation framework for generative models that surpasses previous benchmarks and extends to reasoning tasks.
Findings
ReFeR outperforms previous evaluation benchmarks.
ReFeR demonstrates superior reasoning abilities across tasks.
ReFeR-Lite is 7.7 times more efficient with comparable accuracy.
Abstract
Assessing the quality of outputs generated by generative models, such as large language models and vision language models, presents notable challenges. Traditional methods for evaluation typically rely on either human assessments, which are resource-intensive, or automatic metrics that often show a low correlation with human judgment. Another common approach is to use deep learning systems, which not only consume a substantial amount of compute and time but also require extensive training data. In this study, we introduce a tuning-free framework called ReFeR, designed to evaluate generative outputs, including both text and images, by leveraging a 2-level hierarchy of LLMs and VLMs themselves. We rigorously evaluate our framework, ReFeR, across four diverse evaluation tasks. The framework not only improves the accuracy of these evaluations, surpassing previous benchmarks but also…
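The two-level hierarchy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the reviewer functions below are hypothetical deterministic stand-ins for first-level model calls, the names `brevity_peer`, `keyword_peer`, `area_chair`, and `evaluate` are invented for illustration, and the averaging rule in the second level is a placeholder assumption. The "area chair" label follows the peer-review analogy the reviewers mention.

```python
from statistics import mean
from typing import Callable, List, Tuple

Review = Tuple[float, str]          # (score on a 1-5 scale, short comment)
Reviewer = Callable[[str], Review]  # hypothetical: candidate text -> review


def brevity_peer(candidate: str) -> Review:
    """Stand-in for one first-level model call; rewards short outputs."""
    is_short = len(candidate.split()) <= 20
    return (5.0 if is_short else 2.0), ("concise" if is_short else "too long")


def keyword_peer(candidate: str) -> Review:
    """Stand-in for another first-level model; checks for a required term."""
    found = "hierarchy" in candidate.lower()
    return (4.0 if found else 1.0), ("on topic" if found else "off topic")


def area_chair(candidate: str, peer_reviews: List[Review]) -> Review:
    """Second level: receives every peer review before giving a verdict.

    Placeholder rule (an assumption): average the peer scores and join the
    comments. A real system would instead prompt a model with the candidate
    and the peer reviews; the candidate is unused by this placeholder.
    """
    final_score = mean(score for score, _ in peer_reviews)
    summary = "; ".join(comment for _, comment in peer_reviews)
    return final_score, summary


def evaluate(candidate: str, peers: List[Reviewer]) -> Review:
    peer_reviews = [peer(candidate) for peer in peers]  # level 1
    return area_chair(candidate, peer_reviews)          # level 2


result = evaluate(
    "A two-level hierarchy of models scores the output.",
    [brevity_peer, keyword_peer],
)
print(result)  # (4.5, 'concise; on topic')
```

Because the structure is tuning-free, swapping a stand-in for a real model call would change only the body of a reviewer function, not the control flow.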
Peer Reviews
Decision · Submitted to ICLR 2025
- The proposed framework improves the reasoning of multi-agent systems.
- The evaluations and experiments presented are sound and well ablated.
- The error analysis section is useful and provides insight into where the framework could be improved further.
- The paper states on lines 36-47 that few works use multiple LLMs for evaluation, but some such methods do exist, for example: https://openreview.net/forum?id=FQepisCUWu and https://arxiv.org/html/2401.16788v1. The authors emphasize the novelty of using multiple LLMs for evaluation early and strongly, but the existence of the above works calls that claim into question. The paper would benefit much more from a better literature review, and inclusion
1. The paper is well written and the method is easy to reproduce; 2. Extending the evaluation tasks to reasoning tasks is a good generalization; 3. Some key issues, such as how to select models and how many models to use, are addressed through detailed ablation studies.
1. The distinctions between this method and similar methods, such as multi-agent debate or multi-agent peer review, are not very clear. For example, debating or summarizing are merely different forms of prompts. In fact, for the multi-agent peer review method, if a reviewer can receive information from all other reviewers before refining their own review, they would essentially be playing a role very similar to that of an area chair. 2. The two-level hierarchical structure that summarizes throug
The originality is in adapting academic peer review processes to AI evaluation, offering a different approach to assessing LLM outputs. Quality is evident through comprehensive testing across reasoning tasks and multimodal outputs, with the framework showing consistent performance across different types of evaluation challenges. The paper is nicely structured, presenting both theoretical foundations and practical implementations through its two variants (ReFeR-Turbo and ReFeR-Lite). The framewor
The paper's main weakness is its unclear primary contribution, as it's difficult to distinguish how the ReFeR framework meaningfully differs from other multi-LLM evaluation systems beyond borrowing concepts from academic peer review (MoA). Additionally, the paper's organization is scattered across too many topics, with important elements like the instruction tuning dataset (shown in Figure 1) being buried in the appendix rather than properly discussed in the main text.
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis
Methods: Attention Is All You Need · Cosine Annealing · Label Smoothing · Linear Layer · Weight Decay · Softmax · Position-Wise Feed-Forward Layer · Multi-Head Attention
