vision transformers

an extra learnable CLS token is prepended to the input sequence; its final representation is what gets used for classification
self-attention in ViT is unmasked (every token attends to every other token), unlike GPT, which uses causal masked attention
instead of predicting the next word over a vocabulary, the output layer predicts over a fixed list of image labels!
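
a minimal sketch of the unmasked vs causal difference noted above, in plain Python. the helper names and the uniform 4-token score matrix are made up for illustration; a real model computes scores from queries and keys.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal=False):
    """Turn an n x n matrix of raw scores into attention weights, row by row."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = list(scores[i])
        if causal:
            # GPT-style: token i may not attend to later tokens j > i
            for j in range(i + 1, n):
                row[j] = float("-inf")
        weights.append(softmax(row))
    return weights

# hypothetical uniform scores for 4 tokens
scores = [[0.0] * 4 for _ in range(4)]
vit = attention_weights(scores, causal=False)  # ViT: no mask
gpt = attention_weights(scores, causal=True)   # GPT: causal mask

print(vit[0])  # [0.25, 0.25, 0.25, 0.25]
print(gpt[0])  # [1.0, 0.0, 0.0, 0.0]
```

without the mask the first token spreads its attention over all four tokens; with the causal mask it can only see itself.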

each image is split into patches of size 16 x 16 x 3, and each patch is flattened and linearly projected before going into the transformer!
flattening gives a 768-dimensional vector per patch (1 x 768), since 16 x 16 x 3 = 768
to get the patch embedding, multiply the flattened patch by the transpose of a weight matrix of dim (1024 x 768) and add a bias of dim (1 x 1024): (1 x 768) x (768 x 1024) + (1 x 1024) gives a patch embedding of dim (1 x 1024)
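
the patch arithmetic above can be sketched in plain Python. assumptions: the image is a nested list with height and width divisible by 16, the 32 x 32 random image is made up, and the embedding width is shrunk to 8 so the demo runs fast (the notes use 1024). `patchify` and `project` are hypothetical helper names.

```python
import random

def patchify(image, patch=16):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append the C channel values
            patches.append(flat)
    return patches

def project(token, weight, bias):
    """Compute token @ weight^T + bias for one flattened patch."""
    return [sum(w * t for w, t in zip(row, token)) + b
            for row, b in zip(weight, bias)]

random.seed(0)
# made-up 32 x 32 RGB image -> 2 x 2 = 4 patches
image = [[[random.random() for _ in range(3)] for _ in range(32)]
         for _ in range(32)]
patches = patchify(image)

D = 8  # demo embedding width; the notes use 1024
weight = [[random.gauss(0, 0.02) for _ in range(768)] for _ in range(D)]  # (D x 768)
bias = [0.0] * D                                                          # (1 x D)
embeddings = [project(p, weight, bias) for p in patches]

print(len(patches), len(patches[0]), len(embeddings[0]))  # 4 768 8
```
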
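the final input sequence has (number of patches + 1) embeddings, each with its positional embedding added; the extra one is the class (CLS) embedding, which is what the classification head actually reads at the end! through attention it collects information from all patches into one vector (the ViT paper reports that average pooling over the patch tokens works comparably once the learning rate is tuned, so the CLS token is a design choice more than an accuracy requirement)
positional embeddings are learnable params, trained just like the weight matrix that creates the patch embeddings; they are needed because self-attention by itself has no notion of patch order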
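
a rough sketch of how the input sequence is assembled. assumptions: a 224 x 224 image (so 14 x 14 = 196 patches), a tiny embedding width of 4, a zero CLS token and random stand-ins for values that would be learned in a real model. `build_input_sequence` is a hypothetical helper name.

```python
import random

def build_input_sequence(patch_embeddings, cls_token, pos_embeddings):
    """Prepend the class token, then add one positional embedding per position."""
    sequence = [cls_token] + patch_embeddings
    assert len(sequence) == len(pos_embeddings)
    return [[t + p for t, p in zip(token, pos)]
            for token, pos in zip(sequence, pos_embeddings)]

random.seed(0)
D = 4                            # demo embedding width
num_patches = (224 // 16) ** 2   # assumed 224 x 224 image -> 196 patches

# random stand-ins; in a real model these come from the patch projection
patch_embeddings = [[random.random() for _ in range(D)]
                    for _ in range(num_patches)]
cls_token = [0.0] * D            # learnable in a real model
pos_embeddings = [[random.gauss(0, 0.02) for _ in range(D)]
                  for _ in range(num_patches + 1)]  # one per position, incl. CLS

sequence = build_input_sequence(patch_embeddings, cls_token, pos_embeddings)
print(len(sequence), len(sequence[0]))  # 197 4
```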

ViT is an encoder-only transformer, so it uses unmasked multi-head attention (MHA), unlike GPT with its masked MHA!
it stacks L encoder layers, and the skip (residual) connections give gradients a direct path backward during backprop, which helps against vanishing gradients in deep stacks
it also uses layer norm to stabilize training: each token's features are normalized to mean 0 and variance 1, then passed through a learnable scale and shift
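
a small sketch of the layer norm step for a single token vector. the `eps` value and the default scale of 1 and shift of 0 are assumptions for the demo; in a real model the scale and shift are learned.

```python
import math

def layer_norm(token, gamma=None, beta=None, eps=1e-5):
    """Normalize one token's features to mean 0 and variance 1, then scale and shift."""
    n = len(token)
    mean = sum(token) / n
    var = sum((x - mean) ** 2 for x in token) / n
    normed = [(x - mean) / math.sqrt(var + eps) for x in token]
    if gamma is None:
        gamma = [1.0] * n  # scale (learnable in a real model)
    if beta is None:
        beta = [0.0] * n   # shift (learnable in a real model)
    return [g * x + b for g, x, b in zip(gamma, normed, beta)]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)
out_var = sum((x - out_mean) ** 2 for x in out) / len(out)
print(round(out_mean, 6), round(out_var, 3))
```

the output mean comes out at about 0 and the variance just under 1 (the small gap is from `eps`).
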
finally, after the last encoder block, only the CLS embedding goes to the MLP head, which maps it to n class scores; softmax turns these into probabilities and the class with the highest probability is the prediction!
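
a toy version of the classification step. assumptions: a single linear layer stands in for the MLP head, and the 2-dimensional CLS embedding and 3-class weights are made-up numbers. `classify` is a hypothetical helper name.

```python
import math

def classify(cls_embedding, head_weight, head_bias):
    """Map the CLS embedding to class probabilities and pick the most likely class."""
    logits = [sum(w * x for w, x in zip(row, cls_embedding)) + b
              for row, b in zip(head_weight, head_bias)]
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs, probs.index(max(probs))

cls_embedding = [1.0, 0.0]                              # made-up final CLS vector
head_weight = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]      # (n classes x D), n = 3
head_bias = [0.0, 0.0, 0.0]

probs, predicted = classify(cls_embedding, head_weight, head_bias)
print(predicted)  # 1
```

the logits here are 0, 2 and 1, so class 1 gets the highest probability.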

requirements:

Links:

202605242041