Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research
Anthea Dathe, Kiran Hoffmann, Aline Mangold

TL;DR
This study evaluates AI tools for academic research. It finds them useful for exploration but limited in precision, transparency, and reproducibility, and it emphasizes the need for human-centered assessment.
Contribution
It introduces a benchmarking framework combining human-centered and computer-centered metrics to evaluate AI research tools, revealing their strengths and weaknesses.
Findings
Q&A tools provide useful overviews but are not reliable for precise information.
Explainable AI accuracy was low, affecting trust and verification.
Literature review tools support exploration but lack reproducibility and transparency.
Abstract
Artificial intelligence (AI) tools are being incorporated into scientific research workflows with the potential to enhance efficiency in tasks such as document analysis, question answering (Q&A), and literature search. However, system outputs are often difficult to verify, lack transparency in their generation, and remain prone to errors. Suitable benchmarks are needed to document and evaluate the issues that arise. Existing benchmarking approaches, however, do not adequately capture human-centered criteria such as usability, interpretability, and integration into research workflows. To address this gap, the present work proposes and applies a benchmarking framework combining human-centered and computer-centered metrics to evaluate AI-based Q&A and literature review tools for research use. The findings suggest that Q&A tools can offer valuable overviews and generally accurate summaries;…
