NiceWebRL: a Python library for human subject experiments with reinforcement learning environments

Wilka Carvalho; Vikram Goddla; Ishaan Sinha; Hoon Shin; Kunal Jha

arXiv:2508.15693·cs.AI·August 22, 2025

Open Access · 4 Reviews

TL;DR

NiceWebRL is a Python library that transforms RL environments into online interfaces for human experiments, enabling comparisons between human and AI performance across various domains and supporting multi-agent collaboration research.

Contribution

It introduces a versatile Python tool that converts Jax-based RL environments into online interfaces for human subject experiments, facilitating new research in human-AI interaction.

Findings

01

Enabled testing of human-like RL models in grid world and Minecraft domains.

02

Developed a multi-agent RL algorithm that generalizes to human partners in Overcooked.

03

Studied LLM assistance in complex hierarchical tasks within XLand-Minigrid.

Abstract

We present NiceWebRL, a research tool that enables researchers to use machine reinforcement learning (RL) environments for online human subject experiments. NiceWebRL is a Python library that allows any Jax-based environment to be transformed into an online interface, supporting both single-agent and multi-agent environments. As such, NiceWebRL enables AI researchers to compare their algorithms to human performance, cognitive scientists to test ML algorithms as theories for human cognition, and multi-agent researchers to develop algorithms for human-AI collaboration. We showcase NiceWebRL with 3 case studies that demonstrate its potential to help develop Human-like AI, Human-compatible AI, and Human-assistive AI. In the first case study (Human-like AI), NiceWebRL enables the development of a novel RL model of cognition. Here, NiceWebRL facilitates testing this model against human…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The authors provide compelling use cases showing how NiceWebRL would be convenient for researchers deploying human-AI experiments. I believe this tool would be valuable to share with the ICLR community.
- The paper is well written and easy to follow.

Weaknesses

- The fact that all possible states have to be pre-computed on the server side inherently limits NiceWebRL to environments with discrete action spaces.
- NiceWebRL only supports environments that already have a Jax implementation, though this will be less of an issue as more and more RL environments become Jax-compatible.
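The first weakness above can be made concrete with a toy sketch. It is written in plain Python and does not use NiceWebRL's API; the grid environment and the names `step`, `precompute_next_states`, and `respond_to_key` are hypothetical and exist only for illustration. The idea shown is that the server computes the successor state for every action before the participant presses a key, so responding to the key press is a lookup. Enumerating the successors this way requires a finite action set, which is why the reviewer sees the approach as tied to discrete action spaces.

```python
from typing import Dict, Tuple

State = Tuple[int, int]  # (row, col) of the agent in a toy grid (hypothetical env)
ACTIONS = ("up", "down", "left", "right")
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GRID_SIZE = 3


def step(state: State, action: str) -> State:
    """Deterministic toy transition: move one cell, clipped to the grid."""
    d_row, d_col = MOVES[action]
    row = min(max(state[0] + d_row, 0), GRID_SIZE - 1)
    col = min(max(state[1] + d_col, 0), GRID_SIZE - 1)
    return (row, col)


def precompute_next_states(state: State) -> Dict[str, State]:
    """Compute the successor for every discrete action before any key press."""
    return {action: step(state, action) for action in ACTIONS}


def respond_to_key(cache: Dict[str, State], action: str) -> State:
    """Serving a key press is a dictionary lookup, not a fresh environment step."""
    return cache[action]


state = (0, 0)
cache = precompute_next_states(state)       # done while the participant is deciding
next_state = respond_to_key(cache, "right")  # done when the key press arrives
```

With a continuous action space there is no finite `ACTIONS` tuple to iterate over, so the cache cannot be filled in advance. In a Jax environment the loop over actions would presumably be batched rather than run one action at a time, but that detail is an assumption here and is not taken from the paper.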

Reviewer 02 · Rating 6 · Confidence 3

Strengths

This paper is very well written and seems like a useful tool for researchers. The proposed library would be of value to the research community. The technical implementation seems thoughtfully engineered and robust, making use of JAX's unique characteristics for efficiency and lower latency. This seems to fill a practical gap in the AI and cognitive science ecosystems. The case studies enabled by the tool do seem to be of value to the scientific community.

Weaknesses

Whilst a useful and well-engineered tool for research, this work doesn't contribute any new test settings, acting mostly as an interface for existing environments. Therefore the tool itself seems incremental. I think expanding the comparison between LLM agents in 5.3 (case study 3) beyond just GPT-5 and Gemini 2.5 Pro would be of interest, although the authors acknowledge this as a proof of concept. Given the memory boundedness as the number of users increases, the paper could also benefit fr

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. It is a practical tool/library that unifies JAX envs and a web GUI.
2. The experimental validation around latency and memory scaling is good. Architecture often grounded in JAX.
3. Has useful widgets and feedback built-in.
4. Code is provided.

Weaknesses

1. The case studies rely on fairly small samples and limited trial counts, and lack reporting of statistical significance tests (e.g. p-values, effect sizes), which limits the strength of the authors' claims.
2. As this is a library / tool, it should be compared to the most similar tools. I would have expected head-to-head system-level benchmarking vs. alternative platforms (e.g., PsychLab, jsPsych+Python bridges) for latency, throughput or developer effort.
3. It is only focused on JAX-based environ

Reviewer 04 · Rating 4 · Confidence 4

Strengths

- Comprehensive literature review: The authors provide valuable context on existing tools for human subject experiments, clearly identifying the gap between JavaScript-based experimental frameworks and Python-based ML environments.
- Clear conceptual framework (Figure 1): Excellent visual overview of different paradigms for integrating humans into RL environments, spanning human-like, human-compatible, and human-assistive AI research.
- Timely integration of LLMs: The incorporation of LLM assist

Weaknesses

- Questionable performance justification: The authors repeatedly emphasize JAX's speed advantages (lines 91, 221, 242), but fail to justify why this matters for human experiments. JavaScript and Unity/C# environments already provide sub-millisecond response times, which are far faster than human reaction times (typically 200-300ms).
- Artificial JAX limitation: The restriction to JAX-based environments appears unmotivated. The core contribution -- enabling human-subject experiments -- doesn't inhe

Taxonomy

Topics: Ethics and Social Impacts of AI · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)