Submodular Reinforcement Learning

Manish Prajapat, Mojmír Mutný, Melanie N. Zeilinger, Andreas Krause

arXiv:2307.13372 · cs.LG · May 27, 2024

TL;DR

This paper introduces Submodular Reinforcement Learning (SubRL), a framework for optimizing non-additive rewards with diminishing returns in RL using a greedy-inspired policy gradient algorithm, with theoretical guarantees and practical applications.

Contribution

The paper proposes SubRL, a new paradigm for RL with submodular rewards, and introduces SubPO, a policy gradient algorithm with approximation guarantees for this setting.

Findings

01

SubPO achieves constant-factor approximations in submodular bandits.

02

SubRL is effective in applications like biodiversity monitoring and experiment design.

03

The approach scales to high-dimensional state-action spaces.
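
For context on the constant-factor finding, here is a minimal sketch of the classical greedy rule for monotone submodular maximization under a cardinality budget, which is known to reach at least a (1 - 1/e) fraction of the optimum. This is not SubPO and not the authors' code: the arm names, the `covers` table, and the helper functions are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import e


def coverage(selected, covers):
    """Submodular value: number of distinct items covered by the selected arms."""
    covered = set()
    for arm in selected:
        covered |= covers[arm]
    return len(covered)


def greedy_select(covers, budget):
    """Repeatedly add the arm with the largest marginal coverage gain."""
    selected = []
    for _ in range(budget):
        remaining = [arm for arm in covers if arm not in selected]
        if not remaining:
            break
        best = max(remaining, key=lambda arm: coverage(selected + [arm], covers))
        selected.append(best)
    return selected


def best_by_enumeration(covers, budget):
    """Exact optimum by brute force; only feasible for tiny instances."""
    return max(coverage(list(subset), covers) for subset in combinations(covers, budget))


# Hypothetical instance on which greedy is suboptimal but within the guarantee.
covers = {
    "x": {1, 2, 3, 4},
    "y": {1, 2, 5},
    "z": {3, 4, 6},
}

greedy_value = coverage(greedy_select(covers, budget=2), covers)  # 5: picks "x" first
optimal_value = best_by_enumeration(covers, budget=2)             # 6: "y" and "z" together
ratio_holds = greedy_value >= (1 - 1 / e) * optimal_value         # True
```

On this instance greedy commits to the largest single arm and ends with 5 covered items, while the best pair covers 6, so the ratio 5/6 sits above the (1 - 1/e) bound.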

Abstract

In reinforcement learning (RL), rewards of states are typically considered additive, and following the Markov assumption, they are independent of states visited previously. In many important applications, such as coverage control, experiment design and informative path planning, rewards naturally have diminishing returns, i.e., their value decreases in light of similar states visited previously. To tackle this, we propose submodular RL (SubRL), a paradigm which seeks to optimize more general, non-additive (and history-dependent) rewards modelled via submodular set functions which capture diminishing returns. Unfortunately, in general, even in tabular settings, we show that the resulting optimization problem is hard to approximate. On the other hand, motivated by the success of greedy algorithms in classical submodular optimization, we propose SubPO, a simple policy…
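
To make the abstract's notion of diminishing returns concrete, the following is a minimal sketch rather than the authors' code. It assumes a hypothetical set-coverage reward: each state covers a set of items, and a trajectory is scored by the number of distinct items covered. The names `coverage_reward`, `marginal_gain`, and the `covers` table are illustrative assumptions.

```python
def coverage_reward(trajectory, covers):
    """F(trajectory): number of distinct items covered by the visited states."""
    covered = set()
    for state in trajectory:
        covered |= covers[state]
    return len(covered)


def marginal_gain(prefix, state, covers):
    """Extra reward from visiting `state` after the states in `prefix`."""
    extended = list(prefix) + [state]
    return coverage_reward(extended, covers) - coverage_reward(prefix, covers)


# Hypothetical environment: three states with overlapping coverage.
covers = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {1, 2},
}

trajectory = ["a", "b", "c"]

# Additive (Markovian) reward counts overlapping items repeatedly.
additive_total = sum(len(covers[s]) for s in trajectory)   # 3 + 2 + 2 = 7

# Submodular reward counts each item once, so it depends on the history.
submodular_total = coverage_reward(trajectory, covers)     # |{1, 2, 3, 4}| = 4

# Diminishing returns: "c" is worth less once "a" has been visited.
gain_c_first = marginal_gain([], "c", covers)              # 2
gain_c_after_a = marginal_gain(["a"], "c", covers)         # 0
```

The last two values show the history dependence: the same state `"c"` yields a gain of 2 on an empty history but 0 after `"a"`, which a fixed per-state additive reward cannot express.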

Peer Reviews

Decision · ICLR 2024 spotlight

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

- The considered problem is interesting and significant.
- Extensive and rigorous experiment results have been presented in Section 7.
- The paper is well-written in general, and easy to read.

Weaknesses

- The idea behind the proposed algorithm, Submodular Policy Optimization, is quite straightforward. It is just a relatively straightforward extension of the classical policy optimization algorithm.
- The analysis in Section 5 seems to be very restricted. Could the authors provide a similar analysis in more general settings?

Reviewer 02 · Rating 8 · accept, good paper · Confidence 4

Strengths

Combining submodularity with reinforcement learning in a generalized way seems so intuitive that I am surprised it has not been proposed before. This emphasizes the significance of the paper's contribution. The main idea of the paper is a simple yet powerful one. Additionally, the paper is well written and the ideas are conveyed clearly.

Weaknesses

These are more minor suggestions for improvement rather than weaknesses:
- On the last paragraph of page 1, the adverbs firstly, secondly, thirdly can be just replaced with first, second, and third. Also, we after the firstly should be lowercase.
- I think there can be a broader discussion of using submodular functions in reinforcement learning setups in the related work section. I am aware that the introduction also mentions some examples of submodular rewards, but I believe it is interesting e

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

- The submission introduces a novel and "mathematically" interesting framework that accounts for diminishing returns of repeated actions.
- The view of submodular rewards is fresh. The hardness result is new and interesting.
- The selected toy examples sound interesting and well-suited for the proposed framework.

Weaknesses

- I do not see much contribution in positive results. Not only does the assumption sound strong from a practical perspective, but it seems quite contrived only for the sake of analysis.
- Literature review: I agree with the motivation from diminishing returns, but a submodular reward design is not the only way to address that. For example, there is a blocking-bandit style framework that discourages repeated actions [1]. Maybe good to discuss why the submodular reward design is better. I also

Code & Models

Repositories

manish-pra/non-additive-rl
PyTorch · Official

Videos

Submodular Reinforcement Learning · SlidesLive

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Machine Learning and Algorithms · Machine Learning and Data Classification