Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods

Amirkeivan Mohtashami; Sebastian Stich; Martin Jaggi

arXiv:2202.01838 · cs.LG · February 7, 2022

TL;DR

This paper analyzes how the order in which data is visited affects the convergence speed of sequential gradient methods such as incremental gradient descent. It proposes a measure for selecting good data orders and demonstrates improved training performance.

Contribution

The paper introduces a convergence bound that depends on the chosen data order, develops a greedy algorithm for selecting good permutations, and shows practical benefits over random reshuffling.

Findings

01 Structured shuffling improves convergence in datasets with multiple abstraction levels.

02 The greedy order selection algorithm outperforms random reshuffling by over 14% in accuracy.

03 Theoretical bounds relate data order to convergence speed, guiding better permutation choices.

Abstract

While SGD, which samples from the data with replacement, is widely studied in theory, a variant called Random Reshuffling (RR) is more common in practice. RR iterates through random permutations of the dataset and has been shown to converge faster than SGD. When the order is chosen deterministically, a variant called incremental gradient descent (IG), the existing convergence bounds show improvement over SGD but are worse than those for RR. However, these bounds do not differentiate between a good and a bad ordering and hold for the worst choice of order. Meanwhile, in some cases, choosing the right order when using IG can lead to faster convergence than RR. In this work, we quantify the effect of order on convergence speed, obtaining convergence bounds based on the chosen sequence of permutations while also recovering previous results for RR. In addition, we show benefits of using structured…
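The three sampling schemes contrasted in the abstract differ only in how the index sequence for each epoch is produced. The following is a minimal sketch of that difference, not the authors' code: the function names (`sgd_indices`, `rr_indices`, `ig_indices`, `run`), the one-dimensional least-squares objective, and the step size are all assumptions chosen for illustration.

```python
import random

def sgd_indices(n, epochs, rng):
    """With-replacement SGD: every step draws an index uniformly at random."""
    return [[rng.randrange(n) for _ in range(n)] for _ in range(epochs)]

def rr_indices(n, epochs, rng):
    """Random Reshuffling (RR): a fresh random permutation in every epoch."""
    schedule = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        schedule.append(perm)
    return schedule

def ig_indices(n, epochs, order=None):
    """Incremental gradient (IG): the same deterministic order in every epoch."""
    order = list(range(n)) if order is None else list(order)
    return [list(order) for _ in range(epochs)]

def run(schedule, xs, ys, lr=0.05, w=0.0):
    """Sequential gradient steps on the toy loss f_i(w) = 0.5 * (w * x_i - y_i)**2."""
    for epoch in schedule:
        for i in epoch:
            grad = (w * xs[i] - ys[i]) * xs[i]  # derivative of f_i at w
            w -= lr * grad
    return w

# Toy data (hypothetical): y = 2 * x, so the minimizer is w = 2.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [2.0 * x for x in xs]
rng = random.Random(0)

for name, schedule in [
    ("SGD", sgd_indices(len(xs), 50, rng)),
    ("RR", rr_indices(len(xs), 50, rng)),
    ("IG", ig_indices(len(xs), 50)),
]:
    print(name, run(schedule, xs, ys))
```

On this toy problem every example shares the same minimizer, so all three schemes reach it; the ordering effects the paper studies appear when the per-example minimizers disagree. Passing a custom `order` to `ig_indices` is where a selected permutation would plug in.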

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Face and Expression Recognition · Face recognition and analysis

Methods: Stochastic Gradient Descent