Permutation Invariant Learning with High-Dimensional Particle Filters
Akhilan Boopathy, Aneesh Muppidi, Peggy Yang, Abhiram Iyer, William Yue, Ila Fiete

TL;DR
This paper introduces a permutation-invariant learning framework using high-dimensional particle filters to address issues like catastrophic forgetting and plasticity loss in sequential deep learning tasks, demonstrating improved performance and stability.
Contribution
It proposes a novel permutation-invariant particle filter method that combines Bayesian and gradient-based approaches for high-dimensional models, mitigating order dependence in training.
Findings
Improves continual learning performance on supervised and reinforcement learning benchmarks, including SplitMNIST, SplitCIFAR100, and ProcGen.
Reduces variance compared to standard methods.
Theoretically invariant to training data order.
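The order-invariance finding can be illustrated with a minimal sketch. This is an idealized filter with a fixed grid of particles that are only reweighted by `exp(-loss)` on each minibatch; it has no resampling and no gradient moves, so it is not the authors' algorithm. The scalar model `y = theta * x`, the two conflicting "tasks", the particle grid, and all function names are assumptions made for illustration. Because the log-weights are a sum over minibatches, reversing the minibatch order leaves the final weights unchanged, while plain SGD on the same data ends at a different parameter.

```python
import math


def batch_loss(theta, batch):
    """Mean squared error of the scalar model y = theta * x on one minibatch."""
    return sum((theta * x - y) ** 2 for x, y in batch) / len(batch)


def particle_filter_weights(particles, batches):
    """Reweight fixed particles by exp(-loss) per minibatch, working in log space."""
    log_w = [0.0] * len(particles)
    for batch in batches:
        log_w = [lw - batch_loss(th, batch) for lw, th in zip(log_w, particles)]
    # Normalize with the max-subtraction trick for numerical stability.
    peak = max(log_w)
    unnorm = [math.exp(lw - peak) for lw in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def sgd(theta, batches, lr=0.1):
    """Plain SGD over the minibatches in the given order."""
    for batch in batches:
        grad = sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)
        theta -= lr * grad
    return theta


# Hypothetical data: two minibatches from task A (y = 1x), then two from task B (y = 3x).
xs = (-1.0, -0.5, 0.5, 1.0)
batches = [[(x, slope * x) for x in xs] for slope in (1.0, 1.0, 3.0, 3.0)]
particles = [i * 0.25 for i in range(17)]  # candidate thetas on a grid from 0.0 to 4.0

forward = particle_filter_weights(particles, batches)
backward = particle_filter_weights(particles, batches[::-1])
sgd_forward = sgd(0.0, batches)
sgd_backward = sgd(0.0, batches[::-1])

print("max weight difference:", max(abs(a - b) for a, b in zip(forward, backward)))
print("SGD end points:", sgd_forward, sgd_backward)
```

In this toy setting the highest-weight particle sits between the two task optima regardless of order, whereas the SGD end point is pulled toward whichever task was seen last.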
Abstract
Sequential learning in deep models often suffers from challenges such as catastrophic forgetting and loss of plasticity, largely due to the permutation dependence of gradient-based algorithms, where the order of training data impacts the learning outcome. In this work, we introduce a novel permutation-invariant learning framework based on high-dimensional particle filters. We theoretically demonstrate that particle filters are invariant to the sequential ordering of training minibatches or tasks, offering a principled solution to mitigate catastrophic forgetting and loss-of-plasticity. We develop an efficient particle filter for optimizing high-dimensional models, combining the strengths of Bayesian methods with gradient-based optimization. Through extensive experiments on continual supervised and reinforcement learning benchmarks, including SplitMNIST, SplitCIFAR100, and ProcGen, we…
Peer Reviews
Decision: Submitted to ICLR 2025
The paper identifies an interesting perspective based on permutation invariance to tackle two challenges observed in sequential learning. Experimentally, the results seem to correlate with this idea.
1. Idealization of the particle filter updates: it is unclear why (6) and (7) hold for suitable constants. I recommend that the authors give some examples to make this more understandable.
2. The bounds in (8), (9), and (12) are all exponential in $T$ and are potentially vacuous. These exponential terms in $T$ are treated as constants and not discussed at all. I recommend that the authors explain why they think these constants are small. As an example, consider Theorem 2 and (12). The loss is upper
This is an excellent paper. The authors nicely set up the motivation by explaining prevailing challenges in continual learning, clearly explain why particle filters can address this challenge, and then explain their own method. The benchmark experiments nicely demonstrate the strength of their method. Indeed, I believe that this method could inspire a host of follow-up research, further digging into how to combine particle filters and gradient descent-based optimization. I therefore strongly rec
I have two primary concerns I'd like to see the authors address:
**1 Further related work** Your method seems related to the use of ensembles for continual learning, e.g. [1-3]. Could you discuss the relation of your paper to this prior work?
**2 Connection between sections 3.2-3.3 and 3.4** In sections 3.2 and 3.3 you provided a set of guarantees about particle filters under assumptions (6) and (7). You then introduce your own method, but don't explain whether this method meets these assump
+ The method outperforms, or performs very well against, the baselines. In addition, it is shown that the method is complementary and, when combined with other methods, can improve their performance.
+ The idea of the paper can be followed well (though due to the complexity of the problem some central derivations are in the appendix).
+ Moving towards more real-world problems, the topic of continual learning is an important one for the community to tackle.
- The document should contain a section on Limitations. One thing that should be mentioned is that the algorithm needs N times more memory to keep the model parameters (or compute), with N being the number of particles, than a simple standard method with N=1. In particular, given that the community has moved to very large models, this is potentially a very big limitation. If you made any design choices to overcome this limitation, this should be discussed (or mentioned as future work).
- the writing could be mad
Taxonomy
Topics: Face and Expression Recognition · Machine Learning and ELM · Text and Document Classification Technologies
