On the Lipschitz Continuity of Set Aggregation Functions and Neural Networks for Sets
Giannis Nikolentzos, Konstantinos Skianis

TL;DR
This paper analyzes the Lipschitz continuity of set aggregation functions and neural networks for set-structured data, providing theoretical bounds and empirical validation for their stability and generalization properties.
Contribution
It characterizes Lipschitz continuity of common aggregation functions and neural networks for sets, deriving bounds and highlighting limitations of attention-based methods.
Findings
Each aggregation function (sum, mean, max) is Lipschitz continuous with respect to only one of the three distance functions considered.
Attention-based aggregation is not Lipschitz continuous under the considered metrics.
Upper bounds on the Lipschitz constants of neural networks for sets are derived and related to the stability and generalization of these models.
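To illustrate what an upper bound on a network's Lipschitz constant looks like, the sketch below uses the standard product-of-spectral-norms bound for a plain ReLU MLP. This is a generic, well-known bound and not the paper's result for set networks; the layer sizes, random weights, and the names `mlp`, `upper_bound`, and `worst_ratio` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-layer ReLU MLP with random weights (illustration only).
weights = [
    rng.normal(size=(16, 8)),
    rng.normal(size=(16, 16)),
    rng.normal(size=(4, 16)),
]

def mlp(x):
    """Apply the MLP: ReLU after every layer except the last."""
    for W in weights[:-1]:
        x = np.maximum(W @ x, 0.0)
    return weights[-1] @ x

# ReLU is 1-Lipschitz, so the product of the layers' spectral norms
# upper-bounds the Lipschitz constant with respect to the Euclidean norm.
upper_bound = float(np.prod([np.linalg.norm(W, ord=2) for W in weights]))

# Empirical check: the ratio ||f(x) - f(y)|| / ||x - y|| never exceeds the bound.
worst_ratio = 0.0
for _ in range(2000):
    x, y = rng.normal(size=8), rng.normal(size=8)
    ratio = np.linalg.norm(mlp(x) - mlp(y)) / np.linalg.norm(x - y)
    worst_ratio = max(worst_ratio, ratio)

print(f"empirical worst ratio: {worst_ratio:.3f}, upper bound: {upper_bound:.3f}")
```

The empirical ratio is usually well below the bound, which is why such bounds are read as guarantees on stability rather than as tight estimates.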
Abstract
The Lipschitz constant of a neural network is connected to several important properties of the network such as its robustness and generalization. It is thus useful in many settings to estimate the Lipschitz constant of a model. Prior work has focused mainly on estimating the Lipschitz constant of multi-layer perceptrons and convolutional neural networks. Here we focus on data modeled as sets or multi-sets of vectors and on neural networks that can handle such data. These models typically apply some permutation invariant aggregation function, such as the sum, mean or max operator, to the input multisets to produce a single vector for each input sample. In this paper, we investigate whether these aggregation functions, along with an attention-based aggregation function, are Lipschitz continuous with respect to three distance functions for unordered multisets, and we compute their…
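As a small illustration of why the choice of distance matters, the sketch below compares the sum, mean, and max aggregators under the standard Hausdorff distance with a Euclidean ground metric. The paper's exact definitions and constants are not reproduced in this summary, so the `hausdorff` helper, the example multisets, and the $\sqrt{d}$ bound used in the check are assumptions based on an elementary argument: each coordinate of the max moves by at most the Hausdorff distance.

```python
import numpy as np

def hausdorff(X, Y):
    """Hausdorff distance between the rows of X and Y (Euclidean ground metric)."""
    # Pairwise distances, shape (len(X), len(Y)).
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

AGGREGATORS = {
    "sum": lambda X: X.sum(axis=0),
    "mean": lambda X: X.mean(axis=0),
    "max": lambda X: X.max(axis=0),
}

# Two multisets with the same support but different multiplicities.
a, b = np.array([1.0, 0.0]), np.array([0.0, 2.0])
X = np.stack([a, b])
Y = np.stack([a, a, a, b])

dist = hausdorff(X, Y)  # the supports coincide, so the distance is 0
gaps = {name: float(np.linalg.norm(f(X) - f(Y))) for name, f in AGGREGATORS.items()}
# Sum and mean change although the distance is 0, so no finite Lipschitz
# constant can hold for them under this distance; max is unchanged here.

# Random check that max changes by at most sqrt(d) times the Hausdorff distance.
rng = np.random.default_rng(0)
d = 3
worst_ratio = 0.0
for _ in range(1000):
    A = rng.normal(size=(rng.integers(1, 6), d))
    B = rng.normal(size=(rng.integers(1, 6), d))
    ratio = np.linalg.norm(A.max(axis=0) - B.max(axis=0)) / hausdorff(A, B)
    worst_ratio = max(worst_ratio, ratio)
```

The duplicated-element example only works because the inputs are multisets: the Hausdorff distance ignores multiplicities, while sum and mean do not.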
Peer Reviews
Decision: ICLR 2026 Poster
The authors provide a significant contribution to the analysis of Lipschitz continuity of aggregation functions. The paper is well structured, and the results build on each other, from the fundamental results in Table 1 to the bounds on input perturbations. A small set of benchmarks nicely accompanies the theoretical results.
The paper would benefit from a better explanation of the related literature and of its connections to it. Based on the theoretical and experimental results, the conclusions for practitioners should be spelled out more clearly (see details below).
- The authors study the key properties of neural networks for unordered multisets, a context that appears to have been little studied to date.
- They provide a comprehensive study of the Lipschitz continuity of these networks and their aggregation functions with respect to three known distances (EMD, Hausdorff, matching distance), and provide new bounds on the Lipschitz constant when available. I'm not an expert in this topic, but the results seem novel.
- Where available, they show theoretica
Overall, the article is interesting and easy to read despite its technical nature and the variety of aggregation function<->distance pairings considered. I have listed the following errors that the authors should correct. True weaknesses are labeled as such in the list below.
- (weakness) In Sec. 3.1, while the numerical treatment is clear and legitimate, I didn't get why the authors need three different trained neural networks to simply test the Lipschitz continuity of the aggregation f
- Presents a clear overview of which combinations of {SUM/MEAN/MAX} × {EMD, Hausdorff, Matching} are Lipschitz continuous, and additionally provides strengthened results for fixed cardinalities as lemmas.
- Proposes a simple yet useful composition rule that allows direct derivation of Lipschitz upper bounds for set neural networks.
- Demonstrates the effectiveness of the analysis across both image processing and natural language domains.
- The main results focus on correlation plots, but lack comparisons of task performance (accuracy) and ablation studies (e.g., classification accuracy differences among SUM/MEAN/MAX).
- It is unclear what practical benefits this work brings to neural networks for set functions.
- The Matching distance is a metric only when “no zero vector is included” (Proposition 2.2), which may be inconsistent with real-world preprocessing (e.g., padding).
Taxonomy
Topics: Fuzzy Systems and Optimization · Control Systems and Identification · Neural Networks and Applications
