Foundations and Models in Modern Computer Vision: Key Building Blocks in Landmark Architectures

Radu-Andrei Bourceanu; Neil De La Fuente; Jan Grimm; Andrei Jardan; Andriy Manucharyan; Cornelius Weiss; Daniel Cremers; Roman Pflugfelder

arXiv:2507.23357·cs.CV·September 5, 2025

TL;DR

This paper reviews key architectural and methodological advances in modern computer vision, including residual networks, transformers, generative models, and self-supervised learning techniques, highlighting their contributions to the field's evolution.

Contribution

It provides a comprehensive analysis of six influential papers, detailing foundational architectures and innovative models that have shaped current computer vision paradigms.

Findings

01

ResNet introduced residual connections, which enabled effective training of significantly deeper convolutional networks.

02

ViT demonstrated the effectiveness of attention mechanisms in image recognition.

03

Latent Diffusion Models (LDMs) achieved high-fidelity image synthesis with improved efficiency.
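
As a rough illustration of the first finding, a residual connection adds a block's input back onto the output of its learned branch, so the block computes f(x) + x. The sketch below uses plain Python lists rather than a deep learning framework. The names `residual_block` and `zero_branch` are hypothetical and chosen for this example only; they do not come from the paper.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, f: Callable[[Vector], Vector]) -> Vector:
    """Return f(x) + x: the learned residual plus the identity shortcut."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("the residual branch must preserve the input shape")
    return [a + b for a, b in zip(fx, x)]


def zero_branch(x: Vector) -> Vector:
    # Hypothetical residual branch whose output is all zeros,
    # standing in for a branch that has learned nothing yet.
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.0]
out = residual_block(x, zero_branch)  # the shortcut passes x through unchanged
```

When the branch outputs zeros, the block reduces to the identity mapping, which is one common intuition for why stacking more such blocks should not make a network harder to fit.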

Abstract

This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. The analysis begins with foundational architectures for image recognition. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem and enable effective training of significantly deeper convolutional networks. Subsequently, we examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches, demonstrating the efficacy of attention-based models for large-scale image recognition. Building on these visual representation backbones, we investigate generative models. Generative Adversarial Networks (GANs) are analyzed for their novel adversarial training process, which challenges a generator against a discriminator to learn complex data distributions. Then, Latent…
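
To make "sequences of image patches" concrete, here is a minimal sketch of how an image can be cut into non-overlapping patches and flattened into a token sequence. It assumes a single-channel image stored as a nested list and a patch size that divides both dimensions. The function name `image_to_patch_sequence` is hypothetical, and the sketch omits the learned linear projection and position embeddings that a ViT applies to each patch afterwards.

```python
from typing import List

Image = List[List[float]]  # H x W single-channel image (an assumption of this sketch)


def image_to_patch_sequence(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch tiles, each flattened row by row."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("patch size must divide both image dimensions")
    sequence = []
    # Walk the tiles left to right, top to bottom, so the sequence order is raster order.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            sequence.append(
                [
                    image[row][col]
                    for row in range(top, top + patch)
                    for col in range(left, left + patch)
                ]
            )
    return sequence


# A 4 x 4 image whose pixel values are their raster index (0..15).
image = [[float(row * 4 + col) for col in range(4)] for row in range(4)]
tokens = image_to_patch_sequence(image, 2)  # 4 tokens, each of length 4
```

Each entry of `tokens` plays the role that a word embedding plays in a text Transformer: the model attends over the list of patches rather than over individual pixels.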
