Reasoning Over Recall: Evaluating the Efficacy of Generalist Architectures vs. Specialized Fine-Tunes in RAG-Based Mental Health Dialogue Systems

Md Abdullah Al Kafi; Raka Moni; Sumit Kumar Banshal

arXiv:2601.01341 · cs.CL · January 6, 2026

TL;DR

This study compares generalist and domain-specific models in RAG-based mental health dialogue systems, finding that generalist models with strong reasoning outperform fine-tuned models in empathy and understanding.

Contribution

It provides a direct comparison of generalist versus fine-tuned models in RAG-based mental health systems, highlighting the importance of reasoning over domain-specific training.

Findings

01. Generalist models outperform domain-specific ones in empathy.

02. All models perform well in safety, but generalists show better contextual understanding.

03. Strong reasoning in models is more crucial than domain-specific vocabulary.

Abstract

The deployment of Large Language Models (LLMs) in mental health counseling faces the dual challenges of hallucinations and lack of empathy. While the former may be mitigated by RAG (retrieval-augmented generation) by anchoring answers in trusted clinical sources, there remains an open question as to whether the most effective model under this paradigm would be one that is fine-tuned on mental health data, or a more general and powerful model that succeeds purely on the basis of reasoning. In this paper, we perform a direct comparison by running four open-source models through the same RAG pipeline using ChromaDB: two generalist reasoners (Qwen2.5-3B and Phi-3-Mini) and two domain-specific fine-tunes (MentalHealthBot-7B and TherapyBot-7B). We use an LLM-as-a-Judge framework to automate evaluation over 50 turns. We find a clear trend: the generalist models outperform the domain-specific…
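The comparison design described in the abstract (every model answers the same turns through the same retrieval step, and a judge scores each reply) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the toy bag-of-words retriever stands in for the ChromaDB store, and `embed`, `retrieve`, `build_prompt`, `evaluate`, the prompt layout, and the `judge` callable are all hypothetical names and choices. The paper's actual prompts, rubric, and scoring scale are not given in this excerpt.

```python
import math
import re
from collections import Counter
from statistics import mean
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Return the k passages most similar to the query (stand-in for a vector store)."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Hypothetical prompt layout: retrieved context first, then the user turn."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nUser: {query}\nCounselor:"


def evaluate(
    models: Dict[str, Callable[[str], str]],
    judge: Callable[[str, str], float],
    turns: List[str],
    corpus: List[str],
    k: int = 2,
) -> Dict[str, float]:
    """Give every model the same retrieved context per turn; average judge scores."""
    scores: Dict[str, List[float]] = {name: [] for name in models}
    for query in turns:
        prompt = build_prompt(query, retrieve(query, corpus, k))
        for name, generate in models.items():
            scores[name].append(judge(query, generate(prompt)))
    return {name: mean(vals) for name, vals in scores.items()}
```

In use, each entry in `models` would wrap one of the compared LLMs and `judge` would wrap the judging LLM; because retrieval runs once per turn and its output is shared, differences in the averaged scores reflect the generator rather than the retrieved context.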

Taxonomy

Topics: Topic Modeling · Mental Health via Writing · Machine Learning in Healthcare