Semantic-Metric Bayesian Risk Fields: Learning Robot Safety from Human Videos with a VLM Prior
Timothy Chen, Marcus Dominguez-Kuhne, Aiden Swann, Xu Liu, Mac Schwager

TL;DR
This paper introduces a Bayesian framework that leverages vision-language models and human demonstration videos to learn human-like, context-aware risk assessments for robots, improving safety and decision-making in dynamic environments.
Contribution
It presents a novel semantically-conditioned, spatially-varying risk model using a Bayesian approach with VLM priors, enabling generalization and fast adaptation for robot safety.
Findings
Risk estimates align with human preferences.
The framework can be used in robot planning and trajectory optimization.
Model generalizes to new objects and contexts.
Abstract
Humans interpret safety not as a binary signal but as a continuous, context- and spatially-dependent notion of risk. While risk is subjective, humans form rational mental models that guide action selection in dynamic environments. This work proposes a framework for extracting implicit human risk models by introducing a novel, semantically-conditioned and spatially-varying parametrization of risk, supervised directly from safe human demonstration videos and VLM common sense. Notably, we define risk through a Bayesian formulation. The prior is furnished by a pretrained vision-language model. To encourage the risk estimate to be more human-aligned, a likelihood function modulates the prior to produce a relative metric of risk. Specifically, the likelihood is a learned ViT that maps pretrained features to pixel-aligned risk values. Our pipeline ingests RGB images and a query…
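The Bayesian structure described in the abstract (a prior modulated by a likelihood to give a relative, unnormalized risk value per pixel) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `relative_risk`, the toy 2x2 arrays, and the use of a plain elementwise product are assumptions made for illustration only. In the paper the prior comes from a VLM and the likelihood from a learned ViT; here both are hand-written numbers.

```python
import numpy as np


def relative_risk(prior: np.ndarray, likelihood: np.ndarray) -> np.ndarray:
    """Unnormalized Bayes product, computed independently at each pixel.

    Hypothetical helper: `prior` stands in for a VLM-derived prior map and
    `likelihood` for a per-pixel likelihood map, both of shape (H, W).
    The evidence (normalization) term is deliberately omitted, so the
    output is only meaningful up to a positive scale factor.
    """
    if prior.shape != likelihood.shape:
        raise ValueError("prior and likelihood must have the same shape")
    if (prior < 0).any() or (likelihood < 0).any():
        raise ValueError("prior and likelihood must be non-negative")
    return prior * likelihood


# Toy 2x2 "image": values are made up purely for illustration.
prior = np.array([[0.9, 0.5],
                  [0.2, 0.1]])
likelihood = np.array([[0.8, 0.4],
                       [0.5, 0.5]])

risk = relative_risk(prior, likelihood)

# Rank pixels from highest to lowest relative risk (flattened indices).
ranking = np.argsort(-risk, axis=None)

# Rescaling by any positive constant leaves the ranking unchanged, which
# is what "relative metric" means in this sketch.
rescaled_ranking = np.argsort(-(3.0 * risk), axis=None)
```

Because the normalization term is dropped, only comparisons between pixels carry information in this sketch; the absolute magnitude of `risk` does not.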
Peer Reviews
Decision · Submitted to ICLR 2026
* The paper is well-written and conceptually clear, with strong motivation for modeling semantic safety. * The Bayesian formulation is elegant and aligns well with human-like reasoning about risk. * The integration of VLM features and LLM-derived priors is creative and enables generalization to unseen contexts. * The image processing pipeline and use of Bézier curve fitting for CDFs are technically practical.
* While the framework is novel in its composition, many components (e.g., risk from demonstrations, VLM features, LLM priors) are adapted from existing ideas. * The theoretical contributions (e.g., viability consistency, risk consistency) are intuitive and not particularly deep. * The dataset used for likelihood regression is small, and the evaluation lacks rigorous quantitative comparisons to baselines. * The prior fitting relies heavily on LLM outputs, which may not always align with human pre
- The paper addresses the important question of inferring semantic safety that is aligned with human preferences. - The proposed method does that by learning from safe-only human demonstrations.
- While the work is motivated by semantic safety, it relies heavily on distance between objects and collision avoidance. - The absolute value of the learned viability in this work has no meaning, only the relative value does, which makes the value difficult to interpret. This is because the normalization factor in Bayesian inference cannot be computed. More commonly, risk is defined as the probability of failure $\in [0, 1]$. - The experimental results are weak. There is no comparison
1. The core Bayesian decomposition is an elegant way to formalize intangible risk. It provides a clear and interpretable separation between behavior learned from observation and common sense knowledge. 2. A major strength is the data-collection strategy. The framework learns without requiring any unsafe demonstrations. The Likelihood is learned from safe-only human videos and the Prior is generated by a VLM. 3. This paper leverages a suite of modern foundation models to build its system. The fac
1. The entire framework relies on the critical assumption that the evidence term is independent of the semantic context. This assumption is necessary to avoid computing an intractable term. However, this seems to contradict the paper's own premise. Is it really true that the general distribution of distances between objects is independent of their semantics? Humans are likely to behave differently (and thus create different distance distributions) around a `knife` vs. a `teddy bear`, even in safe s
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Multimodal Machine Learning Applications · Robot Manipulation and Learning
