# A Framework for Evaluating AI-Powered Virtual Assistants to Support Older Adults’ Information-Seeking Needs

**Authors:** Walter Boot, Emily Langston, Varitnan Hattakitjamroen, Mario Hernandez, Hye Soo Lee, Hannah Mason, Willencia Louis-Charles

PMC · DOI: 10.1093/geroni/igaf122.1640 · 2025-12-31

## TL;DR

This paper introduces a framework to evaluate AI-powered virtual assistants for helping older adults find health and financial information.

## Contribution

The study presents a novel framework and case example for evaluating AI assistants' accuracy and usability for older adults.

## Key findings

- LLM-based assistants (Bard, ChatGPT-4) were more accurate than non-LLM systems: 6% of Bard's responses were inaccurate, compared to 60% of Alexa's.
- Bard offered high levels of additional information in 79% of responses, compared to 37% for ChatGPT-4 and under 20% for the other assistants.
- Responses varied over time; together with potential inaccuracies and response complexity, this points to the need for further refinement and user training.

## Abstract

Older adults often face the challenge of searching for critical health, financial, and resource-related information to make complex decisions, a process further complicated by age-related cognitive changes that impact information processing and decision-making. Artificial intelligence (AI)-powered virtual assistants may help by providing concise, easy-to-understand information, yet their accuracy and effectiveness remain unclear. This presentation will introduce a general framework for evaluating AI’s potential to support important decisions of older adults and provide a case example illustrating this approach. To examine the accuracy and utility of AI-powered virtual assistants, we assessed the responses of Alexa, Google Assistant, Bard, and ChatGPT-4 to queries related to Medicare, long-term care insurance, and resource access. Findings showed that Large Language Model (LLM)-based assistants (Bard, ChatGPT-4) were more accurate than non-LLM systems, with Bard producing 6% inaccurate responses compared to Alexa’s 60%. They also provided more supplemental details, with Bard offering high levels of additional information in 79% of responses, compared to 37% for ChatGPT-4 and under 20% for others. However, response variability was observed over time. While LLM-powered virtual assistants may be useful tools for older adults seeking health and financial information, potential inaccuracies, response complexity, and variability must be considered. We will outline key challenges in conducting this research and implementing AI solutions, emphasizing the need for further refinement and user training to enhance reliability and usability for older users.
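The percentages reported above are per-assistant rates over rated responses. As a minimal sketch of that arithmetic, assuming each response has been hand-coded for accuracy and for whether it carried a high level of supplemental information, the tally might look like the following. The record layout, the assistant labels, and the `summarize` helper are hypothetical, and the rows are placeholders rather than data from the study.

```python
from collections import defaultdict

# Hypothetical rating records: (assistant, is_accurate, has_high_supplemental_info).
# These rows are illustrative placeholders, NOT data from the study.
ratings = [
    ("assistant_a", True, True),
    ("assistant_a", True, False),
    ("assistant_a", False, False),
    ("assistant_a", True, True),
    ("assistant_b", False, False),
    ("assistant_b", True, False),
]


def summarize(records):
    """Return, per assistant, the percentage of inaccurate responses and the
    percentage of responses with a high level of supplemental information."""
    totals = defaultdict(lambda: {"n": 0, "inaccurate": 0, "supplemental": 0})
    for assistant, accurate, supplemental in records:
        row = totals[assistant]
        row["n"] += 1
        row["inaccurate"] += int(not accurate)
        row["supplemental"] += int(supplemental)
    return {
        name: {
            "pct_inaccurate": 100 * row["inaccurate"] / row["n"],
            "pct_supplemental": 100 * row["supplemental"] / row["n"],
        }
        for name, row in totals.items()
    }


summary = summarize(ratings)
for name, stats in summary.items():
    print(name, stats)
```

Running the same tally on ratings collected at different dates would be one simple way to surface the response variability over time that the abstract mentions.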

---
Source: https://tomesphere.com/paper/PMC12761666