A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task
Yuya Fujisaki, Shiro Takagi, Hideki Asoh, Wataru Kumagai

TL;DR
This paper introduces a new dataset for evaluating LLM-based methods in extracting research questions from scientific papers, revealing current evaluation functions' limitations and supporting future improvements.
Contribution
The paper presents a novel dataset of research papers with extracted research questions and human evaluations, enabling systematic comparison of LLM evaluation functions for RQ extraction.
Findings
Existing LLM evaluation functions do not strongly correlate with human judgments.
The dataset facilitates benchmarking and development of better evaluation methods.
The dataset is publicly available for further research.
Abstract
The progress in text summarization techniques has been remarkable. However, the task of accurately extracting and summarizing necessary information from highly specialized documents such as research papers has not been sufficiently investigated. We focus on the task of extracting research questions (RQs) from research papers and construct a new dataset consisting of machine learning papers, RQs extracted from these papers by GPT-4, and human evaluations of the extracted RQs from multiple perspectives. Using this dataset, we systematically compared recently proposed LLM-based evaluation functions for summarization and found that none of them showed sufficiently high correlation with human evaluations. We expect our dataset to provide a foundation for further research on developing better evaluation functions tailored to the RQ extraction task, and to contribute to enhancing the…
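The comparison described in the abstract comes down to measuring how well an evaluation function's scores track human ratings of the same extracted RQs. The sketch below illustrates that idea with a Spearman rank correlation written in plain Python. The choice of Spearman is an assumption, since this summary does not say which correlation measure the authors used. The score lists are made-up illustrative values, not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for six extracted RQs (illustrative only).
human_scores = [3, 1, 2, 3, 2, 1]                   # e.g. human ratings on a 1-3 scale
evaluator_scores = [0.9, 0.2, 0.4, 0.7, 0.5, 0.1]   # e.g. an LLM-based evaluator's output

rho = spearman(human_scores, evaluator_scores)
print(f"Spearman rho = {rho:.3f}")
```

Because the human evaluations cover multiple perspectives, a check like this would be repeated once per perspective and per evaluation function. A value near 1 would indicate that the function ranks extracted RQs much as human raters do.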
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Advanced Text Analysis Techniques
Methods: Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Attention Is All You Need · Position-Wise Feed-Forward Layer · Residual Connection · Linear Layer
