Evaluating Large Language Models' Responses to Sexual and Reproductive Health Queries in Nepali

Medha Sharma; Supriya Khadka; Udit Chandra Aryal; Bishnu Hari Bhatta; Bijayan Bhattarai; Santosh Dahal; Kamal Gautam; Pushpa Joshi; Saugat Kafle; Shristi Khadka; Shushila Khadka; Binod Lamichhane; Shilpa Lamichhane; Anusha Parajuli; Sabina Pokharel; Suvekshya Sitaula; Neha Verma; Bishesh Khanal

arXiv:2603.22291 · cs.CL · March 25, 2026

Open Access

TL;DR

This study introduces the LEAF framework to evaluate large language models on sexual and reproductive health queries in Nepali, revealing significant limitations in accuracy, usability, and safety, and emphasizing the need for improvements in handling sensitive topics.

Contribution

The paper presents the LEAF evaluation framework, a comprehensive tool for assessing LLM responses across accuracy, usability, and safety in low-resource and sensitive domains like SRH in Nepali.

Findings

1. Only 35.1% of responses were proper, meeting accuracy and safety standards.

2. Performance varies between ChatGPT versions in usability and safety.

3. Current LLMs have significant limitations in handling sensitive SRH queries.

Abstract

As Large Language Models (LLMs) become integrated into daily life, they are increasingly used for personal queries, including those on Sexual and Reproductive Health (SRH), because users can chat anonymously without fear of judgment. However, current evaluation methods focus primarily on accuracy, often for objective queries in high-resource languages, and lack criteria to assess usability and safety, especially for low-resource languages and culturally sensitive domains like SRH. This paper introduces the LLM Evaluation Framework (LEAF), which assesses responses across multiple criteria: accuracy, language, usability gaps (relevance, adequacy, and cultural appropriateness), and safety gaps (safety, sensitivity, and confidentiality). Using LEAF, we assessed 14K SRH queries in Nepali from over 9K users. Responses were manually annotated by SRH experts according to the…

Taxonomy

Topics: ICT in Developing Communities · Mobile Health and mHealth Applications · Topic Modeling