$\color{black}\rule{365px}{3px}$
Link to Paper:
“An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale” - 2021
arxiv.org
Table of Contents
1. Introduction
$\color{black}\rule{365px}{3px}$
Motivations
- Inspired by the Transformer scaling successes in NLP.
- In NLP, Transformers with over 100B parameters still show no sign of saturating performance. → What if we apply this to Computer Vision? 🤔
Contributions
- Applied a standard Transformer directly to images, with the fewest possible modifications.
- The Vision Transformer (ViT) attains excellent results when pre-trained at sufficient scale and then transferred to tasks with fewer datapoints.
- 88.55% accuracy on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.
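The "fewest possible modifications" amount to treating an image as a sequence of flattened patches. A minimal NumPy sketch of that patchify-and-embed step is below; the function name is hypothetical, and a random matrix stands in for the learned projection `E` from the paper:

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, d_model=64, rng=None):
    """Split an image into non-overlapping patches, flatten each patch,
    and linearly project it to d_model dimensions (one token per patch)."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "image dims must be divisible by patch size"
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)
    # The paper learns this projection E; a random matrix here for illustration
    E = rng.normal(size=(P * P * C, d_model))
    return patches @ E  # shape (N, d_model)

# A 224x224 RGB image yields (224/16)^2 = 196 patch tokens
img = np.zeros((224, 224, 3))
tokens = image_to_patch_embeddings(img)
print(tokens.shape)  # (196, 64)
```

The resulting token sequence (plus a class token and position embeddings, in the paper) is what the otherwise standard Transformer encoder consumes.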