From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

Quang-Huy Nguyen; Thanh-Hai Nguyen; Khac-Manh Thai; Duc-Hoang Pham; Huy-Son Nguyen; Cam-Van Thi Nguyen; Masoud Mansoury; Duc-Trong Le; Hoang-Quynh Le

arXiv:2604.19663·cs.IR·April 22, 2026

TL;DR

This study systematically reproduces, re-implements, and benchmarks eleven state-of-the-art counterfactual explanation methods for recommender systems, providing a unified evaluation framework and analyzing their effectiveness, sparsity, and scalability across diverse datasets and models.

Contribution

The paper introduces a comprehensive benchmarking framework for counterfactual explanations in recommender systems and evaluates existing methods under standardized protocols.

Findings

01. Effectiveness-sparsity trade-off varies by method and setting.

02. Performance consistency between item-level and list-level explanations.

03. Graph-based explainers face scalability issues on large graphs.

Abstract

Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implements, and re-evaluates eleven state-of-the-art CE methods for recommender systems, covering both native explainers (e.g., LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph-based explainers originally proposed for GNNs. We propose a unified benchmarking framework to assess explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and…
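
To make the abstract's definition concrete, here is a minimal sketch of an item-level counterfactual explanation: a small set of past interactions whose removal changes the top-1 recommendation. It is not any of the eleven benchmarked methods. The toy similarity-sum recommender, the greedy removal heuristic, and all names (`SIM`, `recommend`, `greedy_counterfactual`) are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical item-item similarity matrix for 5 items (symmetric, illustrative values).
SIM = np.array([
    [1.0, 0.1, 0.9, 0.1, 0.2],
    [0.1, 1.0, 0.2, 0.1, 0.6],
    [0.9, 0.2, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.0, 1.0, 0.2],
    [0.2, 0.6, 0.1, 0.2, 1.0],
])

def recommend(history, sim, k=1):
    """Toy recommender: score each item by its summed similarity to the history."""
    items = sorted(history)
    scores = sim[:, items].sum(axis=1)
    scores[items] = -np.inf  # never recommend items the user already interacted with
    top = np.argsort(-scores, kind="stable")[:k]
    return [int(i) for i in top]

def greedy_counterfactual(history, sim):
    """Greedily remove interactions until the top-1 recommendation changes.

    Returns the removed interactions (the explanation), or None if the top-1
    item survives every removal that still leaves one interaction in place.
    """
    target = recommend(history, sim)[0]
    remaining = set(history)
    removed = []
    while len(remaining) > 1:  # keep at least one interaction in the history
        # Heuristic: drop the interaction contributing most to the target's score.
        best = max(sorted(remaining), key=lambda j: sim[target, j])
        remaining.remove(best)
        removed.append(best)
        if recommend(remaining, sim)[0] != target:
            return removed
    return None

history = {0, 1, 3}
original_top1 = recommend(history, SIM)[0]
explanation = greedy_counterfactual(history, SIM)
print(f"top-1 = {original_top1}, counterfactual removal set = {explanation}")
```

In this toy setup, the size of the returned set plays the role of sparsity, and whether the top-1 item actually changes plays the role of effectiveness. A list-level variant would compare the whole top-k list (via the `k` argument) before and after removal instead of a single item.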

Code & Models

Repositories

L2R-UET/CFExpRec (GitHub)
