IO Transformer: Evaluating SwinV2-Based Reward Models for Computer Vision
Maxwell Meyer, Jack Spruyt

TL;DR
This paper introduces two SwinV2-based reward models for computer vision, the IO Transformer and the Output Transformer, and shows that they assess the output quality of other models with high accuracy.
Contribution
The paper presents SwinV2-based reward models for evaluating the outputs of other models, shows their effectiveness across vision tasks, and explores modifications to the Swin V2 architecture.
Findings
The IO Transformer achieves perfect evaluation accuracy on the Change Dataset 25 (CD25)
Swin V2 scores 95.41% on the IO Segmentation Dataset
Swin V2 outperforms the IO Transformer when the output is not solely dependent on the input
Abstract
Transformers and their derivatives have achieved state-of-the-art performance across text, vision, and speech recognition tasks. However, minimal effort has been made to train transformers capable of evaluating the output quality of other models. This paper examines two SwinV2-based reward models, the Input-Output Transformer (IO Transformer) and the Output Transformer. These reward models can be leveraged for tasks such as inference quality evaluation, data categorization, and policy optimization. Our experiments demonstrate highly accurate model output quality assessment across domains where the output is entirely dependent on the input, with the IO Transformer achieving perfect evaluation accuracy on the Change Dataset 25 (CD25). We also explore modified Swin V2 architectures. Ultimately, Swin V2 remains on top with a score of 95.41% on the IO Segmentation Dataset, outperforming…
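The distinction the abstract draws between a reward model that sees both input and output and one that sees only the output can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `io_reward`, `output_reward`, and `accuracy`, the flat-list "images", the 0.5 threshold, and the toy scoring rules are hypothetical stand-ins, not the paper's SwinV2-based models or datasets. It only shows how evaluation accuracy might be computed over labeled (input, output) pairs when output quality depends entirely on the input.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in: an "image" is a flat sequence of pixel values in [0, 1].
Image = Sequence[float]
# One labeled example: (model input, model output, quality label: 1 = good, 0 = bad).
Example = Tuple[Image, Image, int]


def io_reward(inp: Image, out: Image) -> float:
    """Toy input-output reward: scores the output relative to its input.

    Assumption for illustration only: a good output reproduces the input,
    so the score is 1 minus the mean absolute difference.
    """
    err = sum(abs(a - b) for a, b in zip(inp, out)) / len(inp)
    return 1.0 - err


def output_reward(out: Image) -> float:
    """Toy output-only reward: never sees the input (here, mean brightness)."""
    return sum(out) / len(out)


def accuracy(score: Callable[[Example], float],
             data: List[Example],
             threshold: float = 0.5) -> float:
    """Fraction of examples where the thresholded score matches the label."""
    correct = sum(int(score(ex) >= threshold) == ex[2] for ex in data)
    return correct / len(data)


# Toy data in which quality is entirely determined by the input.
data: List[Example] = [
    ([0.0, 1.0], [0.0, 1.0], 1),  # output matches input: good
    ([0.0, 1.0], [1.0, 0.0], 0),  # output inverted: bad
    ([1.0, 1.0], [1.0, 1.0], 1),  # output matches input: good
    ([0.0, 0.0], [0.0, 0.0], 1),  # output matches input: good
]

io_acc = accuracy(lambda ex: io_reward(ex[0], ex[1]), data)
out_acc = accuracy(lambda ex: output_reward(ex[1]), data)
print(f"input-output scorer accuracy: {io_acc:.2f}")
print(f"output-only scorer accuracy:  {out_acc:.2f}")
```

On this toy data the input-output scorer labels every pair correctly, while the output-only scorer cannot, because it has no way to tell whether an output is right for its input. This says nothing about the reported numbers; it only makes the interface difference concrete.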
Taxonomy
Topics: Infrared Target Detection Methodologies · CCD and CMOS Imaging Sensors · Industrial Vision Systems and Defect Detection
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dropout · Absolute Position Encodings
