Advancing Academic Chatbots: Evaluation of Non Traditional Outputs
Nicole Favero, Francesca Salute, Daniel Hardt

TL;DR
This paper evaluates large language models' ability to produce non-traditional academic outputs like slides and scripts, comparing retrieval strategies and assessing quality through human and AI judgments.
Contribution
It evaluates LLMs on non-traditional academic tasks (slide decks and podcast scripts) and compares two retrieval strategies, Graph RAG and Advanced RAG, finding that GPT-4o mini with Advanced RAG gives the most accurate answers.
Findings
GPT-4o mini with Advanced RAG yields the highest accuracy.
Graph RAG increases hallucinations and offers limited improvements.
Human review is essential for assessing layout and style quality.
Abstract
Most evaluations of large language models focus on standard tasks such as factual question answering or short summarization. This research expands that scope in two directions: first, by comparing two retrieval strategies for QA, Graph RAG (structured, knowledge-graph based) and Advanced RAG (hybrid keyword-semantic search); and second, by evaluating whether LLMs can generate high-quality non-traditional academic outputs, specifically slide decks and podcast scripts. We implemented a prototype combining Meta's LLaMA 3 70B (open-weight) and OpenAI's GPT-4o mini (API-based). QA performance was evaluated using both human ratings across eleven quality dimensions and large language model judges for scalable cross-validation. GPT-4o mini with Advanced RAG produced the most accurate responses. Graph RAG offered limited improvements and led to more hallucinations, partly due to its structural…
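The abstract describes Advanced RAG only as hybrid keyword-semantic search and gives no implementation details. The sketch below is one possible illustration of that idea, not the authors' pipeline: it blends a keyword-overlap score with a similarity score and ranks passages by the weighted sum. The function names, the `alpha` weight, and the use of character-trigram cosine similarity as a stand-in for a real embedding model are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


def keyword_score(query, doc):
    """Fraction of distinct query tokens that also appear in the document."""
    q = set(tokenize(query))
    d = set(tokenize(doc))
    return len(q & d) / len(q) if q else 0.0


def char_trigrams(text):
    """Character-trigram counts; a toy stand-in for a dense embedding."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_rank(query, docs, alpha=0.5):
    """Rank docs by alpha * keyword score + (1 - alpha) * similarity score.

    alpha is a hypothetical mixing weight; the source does not specify one.
    Returns (score, doc) pairs sorted from best to worst.
    """
    query_vec = char_trigrams(query)
    scored = []
    for doc in docs:
        kw = keyword_score(query, doc)
        sem = cosine(query_vec, char_trigrams(doc))
        scored.append((alpha * kw + (1 - alpha) * sem, doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    passages = [
        "Graph RAG builds a knowledge graph",
        "Hybrid search mixes keyword and semantic retrieval",
        "Podcast scripts and slide decks",
    ]
    for score, passage in hybrid_rank("keyword semantic search", passages):
        print(f"{score:.3f}  {passage}")
```

In a real system the trigram similarity would be replaced by embeddings from a trained model and the keyword score by something like BM25; the weighted-sum fusion shown here is only one of several ways to combine the two signals.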
Taxonomy
Topics: AI in Service Interactions · Topic Modeling · Artificial Intelligence in Healthcare and Education
