Matrix Product Sketching via Coordinated Sampling
Majid Daliri, Juliana Freire, Danrong Li, Christopher Musco

TL;DR
This paper introduces a coordinated sampling method for matrix product approximation that outperforms classical linear sketching techniques in sparse settings, with practical benefits demonstrated in distributed regression and language models.
Contribution
The paper presents a novel coordinated sampling approach for matrix product sketching that improves efficiency over traditional methods in sparse data scenarios.
Findings
For a given Frobenius norm error target, coordinated sampling needs smaller sketches than linear sketching when the matrices are sparse.
Empirical results show an order of magnitude improvement in real applications.
The method outperforms classical linear sketching in distributed regression and language models.
Abstract
We revisit the well-studied problem of approximating a matrix product, $A^T B$, based on small space sketches $\mathcal{S}(A)$ and $\mathcal{S}(B)$ of $A \in \mathbb{R}^{n \times d}$ and $B \in \mathbb{R}^{n \times m}$. We are interested in the setting where the sketches must be computed independently of each other, except for the use of a shared random seed. We prove that, when $A$ and $B$ are sparse, methods based on \emph{coordinated random sampling} can outperform classical linear sketching approaches, like Johnson-Lindenstrauss Projection or CountSketch. For example, to obtain Frobenius norm error $\epsilon \|A\|_F \|B\|_F$, coordinated sampling requires sketches of size $O(s/\epsilon^2)$ when $A$ and $B$ have at most $s \leq d, m$ non-zeros per row. In contrast, linear sketching leads to…
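The coordination idea in the abstract can be illustrated with a minimal sketch. It assumes two matrices with the same number of rows, each sketched on its own, and a shared seed that makes both sides draw the same pseudo-random value for row index i. The choice of sampling probabilities proportional to squared row norms, the hash-based shared randomness, and the names `shared_uniform`, `threshold_sketch`, `estimate_product`, and the budget parameter `k` are all assumptions made for illustration. This is not necessarily the authors' exact algorithm.

```python
import hashlib


def shared_uniform(seed, i):
    """Pseudo-uniform value in [0, 1) for row index i, derived only from the shared seed."""
    digest = hashlib.sha256(f"{seed}:{i}".encode()).digest()
    # Keep 53 bits so the result is an exact float strictly below 1.0.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def threshold_sketch(M, k, seed):
    """Sketch a matrix (list of rows) without looking at the other matrix.

    Assumption: row i is kept with probability p_i = min(1, k * ||M_i||^2 / ||M||_F^2),
    and it is kept exactly when the shared value u_i falls below p_i.
    Returns {row index: (row, p_i)}; the expected number of kept rows is at most k.
    """
    sq_norms = [sum(x * x for x in row) for row in M]
    total = sum(sq_norms)
    sketch = {}
    if total == 0:
        return sketch
    for i, (row, w) in enumerate(zip(M, sq_norms)):
        p = min(1.0, k * w / total)
        if shared_uniform(seed, i) < p:
            sketch[i] = (row, p)
    return sketch


def estimate_product(sk_a, sk_b, d, m):
    """Estimate the d x m product A^T B from two sketches built with the same seed.

    Row i is in both sketches exactly when u_i < min(p_i, q_i), so dividing its
    contribution by min(p_i, q_i) makes the estimate unbiased.
    """
    est = [[0.0] * m for _ in range(d)]
    for i in sk_a.keys() & sk_b.keys():
        row_a, p = sk_a[i]
        row_b, q = sk_b[i]
        scale = 1.0 / min(p, q)
        for r, a in enumerate(row_a):
            if a == 0:
                continue  # zero entries contribute nothing to the outer product
            for c, b in enumerate(row_b):
                est[r][c] += scale * a * b
    return est
```

Calling `threshold_sketch(A, k, seed)` and `threshold_sketch(B, k, seed)` with the same `seed`, then `estimate_product` on the two results, gives the estimate. Because both sides reuse the same per-row random value, heavy rows of both matrices tend to be sampled together, which is what separates this from sampling the two matrices independently.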
Peer Reviews
Decision: ICLR 2025 Poster
1. I found the presentation of this paper to be very clear and engaging. The problem setting, requirements, and notation are all well-defined, and the core idea is explained thoroughly, making it easy for readers to follow. The proof is presented in a clean and structured manner, enhancing readability.
2. The paper provides strong motivation for studying the problem of computing independent sketches and discusses several potential applications, demonstrating their proposed algorithm in a one im
1. One concern I have regarding the experiments is that while vector quantization—a nonlinear compression technique—has been widely studied and applied in practice for approximating the computation of the key matrix in the attention layer, it remains unclear whether using linear compression techniques, such as approximate matrix products, to approximate $QK^T$ or just the key matrix $K$ could degrade model performance significantly in downstream applications. I suggest that the authors cite work
S1: Interesting problem. S2: Elegant solutions. S3: Solid experiments.
W1: The result and the approach are not very surprising, given the prior work of Bessa et al. and Daliri et al. W2: The analysis of one of the algorithms (Threshold Sampling) seems fairly straightforward.
- The theoretical analysis of this paper is solid. The paper gives a new sketching algorithm with size $O(s^2 /\epsilon^2)$. This bound is better for sparse matrices than those of previous methods, which is interesting to me.
- The paper gives a detailed experiment that demonstrates the advantage of the proposed algorithms.
- The presentation of the paper is good. The paper has a nice introduction section.
- I still do not fully understand the motivation for the new model the paper discusses (see the questions below). Maybe the authors can give more explanation about this?
- It would be better if the experiments also gave a comparison to the previous sampling-based method.
Taxonomy
Topics: 3D Shape Modeling and Analysis · Human Motion and Animation · Manufacturing Process and Optimization
Methods: Softmax · Attention Is All You Need · Linear Regression
