SoftStep: Learning Sparse Similarity Powers Deep Neighbor-Based Regression
Aviad Susman, Baihan Lin, Mayte Suárez-Fariñas, Joseph T Colonel

TL;DR
SoftStep introduces a learnable, sparse similarity module that enhances neighbor-based regression in neural networks, outperforming linear heads across various tasks and architectures by enabling better internal representations.
Contribution
The paper presents SoftStep, a novel parametric module that learns sparse, instance-wise similarities, unlocking the potential of neighbor-based methods in deep learning models.
Findings
SoftStep improves regression accuracy over linear heads.
Neighbor-based prediction with SoftStep induces well-structured embeddings.
Applicable to various deep learning paradigms beyond regression.
Abstract
Neighbor-based methods are a natural alternative to linear prediction for tabular data when relationships between inputs and targets exhibit complexity such as nonlinearity, periodicity, or heteroscedasticity. Yet in deep learning on unstructured data, nonparametric neighbor-based approaches are rarely implemented in lieu of simple linear heads. This is primarily due to the ability of systems equipped with linear regression heads to co-learn internal representations along with the linear head's parameters. To unlock the full potential of neighbor-based methods in neural networks we introduce SoftStep, a parametric module that learns sparse instance-wise similarity measures directly from data. When integrated with existing neighbor-based methods, SoftStep enables regression models that consistently outperform linear heads across diverse architectures, domains, and training scenarios. We…
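To make the contrast with a linear head concrete, here is a minimal NumPy sketch of a generic soft neighbor-based regression head: each prediction is a similarity-weighted average of the targets of a set of support points. It is not the SoftStep module. The function name `soft_neighbor_predict`, the `temperature` parameter, and the choice of a softmax over negative squared Euclidean distances (an NCA-style weighting) are assumptions made for illustration.

```python
import numpy as np

def soft_neighbor_predict(query_emb, support_emb, support_y, temperature=1.0):
    """Predict each query target as a weighted average of support targets.

    Illustrative only: the weights are a softmax over negative squared
    distances, which is an assumed (NCA-style) similarity, not SoftStep.
    """
    # Pairwise squared Euclidean distances, shape (n_query, n_support).
    diff = query_emb[:, None, :] - support_emb[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)

    # Softmax over support points, shifted for numerical stability.
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Dense weights: every support point contributes to every prediction.
    return weights @ support_y, weights
```

In a deep model the embeddings would come from the encoder, so the loss gradient flows through the distances into the representation. Note that the weights here are dense: every support point receives nonzero weight, which is the property a sparse similarity is meant to change.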
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
This paper proposes a new read-out head architecture that appears to have better performance than linear read-out on a suite of benchmark datasets. The method is novel to the best of my knowledge. The method is well motivated, and generally well presented.
* A nearest-neighbor head seems much harder to scale in the number of data points than a linear head. New ideas seem to be needed to make this method work for larger datasets. Accordingly, the experiments are on relatively small-scale benchmarks.
* The presentation was generally good, but I was confused by some aspects: see my questions below.
The idea itself is quite neat. As I understand it, the authors allow for learning a smooth function over the number of neighbors to use when making predictions. It is also nice that it works with both the differentiable kNN and the neighborhood component analysis setups. The presentation in the first three sections in particular is very easy to follow.
I think the paper has three primary weaknesses. The first is that section 4 is quite strange in how it is presented. It seems that the authors are trying to make theoretical statements verifying that their approach works. However, it's not clear precisely what is being shown, and it seems there are some mistakes in this section? For example, the phrase "we demonstrate that a neighbor-based regression model paired with MSE loss yields implicit optimization conditions for structuring pairs of
- Clear, modular formulation: SoftStep is a drop-in, differentiable sparsifier for neighbor heads.
- Empirical results are consistently better than linear regression heads (and vanilla neighbor baselines) across multiple regression datasets.
- Theoretical intuition is reasonable: neighbor-based MSE induces pair/triplet structure.
- The method exposes meaningful knobs (ℓ, u, t; global vs instance-wise) that could be useful for controlling sparsity and locality.
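The idea of a differentiable sparsifier for neighbor weights can be sketched as follows. This is a hypothetical illustration, not the paper's definition of SoftStep: the source does not say what the three knobs named in the review control, so the sketch simply assumes a lower threshold `lower`, an upper threshold `upper`, and a softmax `temperature`, and uses a standard smoothstep polynomial as the gate. All function and parameter names are invented for this example.

```python
import numpy as np

def smoothstep_gate(w, lower, upper):
    """Gate in [0, 1]: exactly 0 below `lower`, 1 above `upper`, smooth between."""
    x = np.clip((w - lower) / (upper - lower), 0.0, 1.0)
    return x * x * (3.0 - 2.0 * x)

def sparse_neighbor_weights(d2, lower=0.05, upper=0.5, temperature=1.0):
    """Turn squared distances (n_query, n_support) into sparse neighbor weights.

    Assumed design, for illustration only: dense softmax weights are
    multiplied by a smoothstep gate and then renormalized per query.
    """
    logits = -d2 / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)

    # Weights below `lower` are zeroed exactly, which yields sparsity.
    gated = w * smoothstep_gate(w, lower, upper)
    total = gated.sum(axis=1, keepdims=True)

    # If the gate removed every neighbor, fall back to the dense weights.
    return np.where(total > 0, gated / np.maximum(total, 1e-12), w)
```

In this sketch the thresholds are fixed scalars. Making them learnable, and predicting them per query rather than sharing them globally, would correspond to the "global vs instance-wise" distinction the review mentions, though how the paper actually does this is not stated here.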
- Backbone/scale generalization is thin: claims of easy applicability are not substantiated across diverse and larger encoders (e.g., ViT/ConvNeXt/BERT) or large-scale datasets; results remain small to mid-scale.
- Comparisons are not strong enough: lacks head-to-head against robust modern alternatives for regression/metric learning (e.g., recent metric learning algorithms).
- Attribution is unclear: separate SoftStep from soft-rank effects: run soft-rank on/off ablations, and test rank-free NCA
1) SoftStep provides a learnable sparse similarity metric, which differs from existing methods that rely on fixed sparsity patterns.
2) The paper derives implicit geometric constraints from the MSE loss, revealing the structural properties underlying neighbor-based regression.
3) Experimental results show some improvements over linear heads and traditional NCA/kNN methods across multiple tasks.
1) The experimental evaluation appears limited in scope, as it lacks comparisons with contemporary sparse attention mechanisms such as Sparsemax and Sparse Transformer. The current baseline methods, limited to linear attention and NCA, are insufficient to comprehensively validate the method's advantages. Furthermore, the proposed approach includes multiple variants that complicate the interpretation of results and dilute the clarity of the key contributions.
2) The method is only validated on reg
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Face recognition and analysis · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need · k-Nearest Neighbors · Shrink and Fine-Tune
