Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling
Mahdi Karami, Ali Ghodsi

TL;DR
Orchid introduces a flexible, data-dependent convolution architecture that captures long-range dependencies efficiently, outperforming traditional attention models in language and image tasks while enabling longer sequence processing.
Contribution
The paper presents Orchid, a novel data-dependent convolution method that reduces complexity and enhances scalability for sequence modeling, with shift-equivariant conditioning networks.
Findings
Outperforms BERT and Vision Transformers in accuracy with smaller models
Enables processing of longer sequences beyond dense attention limits
Maintains high expressivity and efficiency across domains
Abstract
In the rapidly evolving field of deep learning, the demand for models that are both expressive and computationally efficient has never been more critical. This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms without compromising the ability to capture long-range dependencies and in-context learning. At the core of this architecture lies a new data-dependent global convolution layer, which contextually adapts its kernel conditioned on input sequence using a dedicated conditioning neural network. We design two simple conditioning networks that maintain shift equivariance in our data-dependent convolution operation. The dynamic nature of the proposed convolution kernel grants Orchid high expressivity while maintaining quasilinear scalability for long sequences. We evaluate the proposed model across multiple…
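As a rough illustration of the idea described in the abstract, the NumPy sketch below builds a global circular convolution whose kernel is derived from the input, and evaluates it with the FFT in O(L log L) time. The conditioning step (a short local convolution followed by the magnitude of its spectrum, which does not change under circular shifts of the input) is an assumption chosen for illustration and is not taken from the paper's code. The names `circular_conv_fft`, `conditioned_kernel_spectrum`, `data_dependent_conv`, and `w_local` are hypothetical.

```python
import numpy as np


def circular_conv_fft(x, h):
    """Circular convolution of two real length-L signals via the FFT, O(L log L)."""
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(h), n=len(x))


def conditioned_kernel_spectrum(x, w_local):
    """Hypothetical conditioning network (an assumption, not the paper's code).

    Applies a short local filter to x, then takes the magnitude of the
    spectrum. A circular shift of x only changes the phase of its spectrum,
    so the returned kernel spectrum is the same for every shifted copy of x.
    """
    L = len(x)
    w = np.zeros(L)
    w[: len(w_local)] = w_local          # zero-pad the short filter to length L
    z = circular_conv_fft(x, w)          # local mixing of neighbouring positions
    return np.abs(np.fft.rfft(z))        # shift-invariant, input-dependent spectrum


def data_dependent_conv(x, w_local):
    """Global convolution of x with a kernel conditioned on x itself."""
    H = conditioned_kernel_spectrum(x, w_local)
    return np.fft.irfft(np.fft.rfft(x) * H, n=len(x))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16)
    w_local = rng.standard_normal(3)

    y = data_dependent_conv(x, w_local)
    y_shifted = data_dependent_conv(np.roll(x, 5), w_local)

    # Shift equivariance check: shifting the input shifts the output equally.
    print(np.allclose(y_shifted, np.roll(y, 5)))
```

Under these assumptions, the kernel changes with the content of the input but not with its position, so a circular shift of the input produces the same shift in the output. The cost is dominated by a handful of length-L FFTs rather than an L-by-L attention matrix.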
Peer Reviews
Decision: NeurIPS 2024 poster
- Authors introduce an innovative method for increasing the expressivity of subquadratic methods for long-range dependencies.
- The paper is well-written, authors explain and motivate their modelling choices well, and provide helpful figures.
- The approach of conditioning convolutional kernels based on input data is interesting in its own right and might warrant exploration in architectures not specifically tailored for modelling long-range dependencies.
- Limited set of experiments and comparisons against baselines. Although the authors also show results on image data, they do not compare against 2D-convolutional long-range approaches, which limits the interpretability of the results.
- Authors do not thoroughly explore their shift-invariance constraints, which might not be appropriate in all settings; for example, I can imagine that for textual data absolute positioning in a sentence does impact semantic meaning. On the other hand, authors provide good motiva
- The idea of input-dependent convolutional kernels presented in the paper is novel, sound and very appealing.
- The empirical results provide compelling evidence of the proposed model's abilities.
- In my understanding, certain parts of the paper oversell what the model is capable of. Specifically, in Line 125, the authors argue that "This allows each input token to attend to the entire sequence with personalized, adaptive weights derived from its specific representation". However, to the best of my understanding, Orchid only considers local information (both spatial and spectral) for conditioning. See also Lines 302-304 and 308. The authors should be clear and
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Algorithms and Data Compression · Advanced Computational Techniques and Applications
Methods: Linear Layer · Dropout · Layer Normalization · Attention Dropout · Multi-Head Attention · Linear Warmup With Linear Decay · Dense Connections · Adam · Attention Is All You Need
