Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations
Alexa R. Tartaglini, Sheridan Feucht, Michael A. Lepori, Wai Keen Vong, Charles Lovering, Brenden M. Lake, and Ellie Pavlick

TL;DR
This study demonstrates that with appropriate training and architecture choices, deep neural networks, especially pretrained transformers, can learn and generalize the same-different visual relation beyond the training distribution.
Contribution
It shows that certain pretrained transformer models can learn and generalize the same-different relation both within and out-of-distribution, challenging prior assumptions.
Findings
Pretrained transformers achieve near-perfect out-of-distribution generalization.
Fine-tuning on abstract, textureless shapes enhances generalization.
Deep neural networks can learn abstract relations with the right approach.
Abstract
Although deep neural networks can achieve human-level performance on many object recognition benchmarks, prior work suggests that these same models fail to learn simple abstract relations, such as determining whether two objects are the same or different. Much of this prior work focuses on training convolutional neural networks to classify images of two same or two different abstract shapes, testing generalization on within-distribution stimuli. In this article, we comprehensively study whether deep neural networks can acquire and generalize same-different relations both within and out-of-distribution using a variety of architectures, forms of pretraining, and fine-tuning datasets. We find that certain pretrained transformers can learn a same-different relation that generalizes with near perfect accuracy to out-of-distribution stimuli. Furthermore, we find that fine-tuning on abstract…
Peer Reviews
Decision·Submitted to ICLR 2024
The work attempts to better understand same-different challenges for current architectures. The work can provide useful conclusions for the same-different task.
The novelty and contributions of this work are limited. The work fine-tuned some pretrained existing architectures to improve the same-different recognition performance. The work is also very incremental in relation with the work: Guillermo Puebla and Jeffrey S Bowers. Can deep convolutional neural networks support relational reasoning in the same-different task? Journal of Vision, 22(10):11–11, 2022. I think the contributions of this work are not enough for publication at such strong venue. I
1. The paper is clearly written. 2. The authors present their empirical results in a sound way. For example, in section 3.2, they provide additional experiments fine-tuning on random noise to illustrate that "closeness" of stimuli is not a perfect correlate of OOD generalization.
1. It is hard to tell how much the paper can contribute to the community. In my opinion, this work is an empirical study. For this category of works, the criteria are usually a) how many new observations are found, how surprising they are, and how useful they are for future works; b) whether the study is systematic and convincing; c) whether a new perspective is proposed or a new methodology is used to study the problem. I think this paper focuses more on a) and b). I am not sure how novel and i
1. In my opinion, understanding the ability to learn generalizable same-different visual relations is important, especially if one would like a DNN to also perform basic logical operations in addition to instance-level perceptions. This paper gives a well-organized experiment-based summary that attempts to answer how and why recent DNN architectures and pretraining datasets can perform generalizable same-different visual relation classification. 2. The paper is well-written and easy to follow.
I am afraid that the definition of same-different visual relation should be beyond comparing objects at a pixel level. Thus, the observations and analyses may be of limited insights. The reasons are explained below: - In the human-visual system, two objects that are considered the same are usually based on some specific semantic and/or attribute similarities, rather than counting how many pixels that exactly have the same values. It means the dataset/task should define same-different visual rela
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsImage Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques
