swin transformer
paper link - https://arxiv.org/pdf/2103.14030
- shifted window transformer
- released by Microsoft Research Asia
- showed that a transformer can serve as a general-purpose backbone for vision tasks: classification, detection and segmentation.
problems with ViT
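- attention is costly: global self-attention over N patches takes O(N^2) time and memory, and N grows with the area of the image, so the cost grows quadratically with the number of pixels (doubling H and W gives 4x the patches and 16x the attention scores)
- swin avoids this by computing attention inside fixed-size local windows, which keeps the cost linear in the number of patches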
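
a quick sanity check of the quadratic growth, as a minimal sketch. the helper name `attention_cost` and the 16 x 16 patch size (as in ViT-B/16) are assumptions made for illustration, and the count covers only the N x N score matrix of a single head, not the full FLOPs of a layer.

```python
def attention_cost(height, width, patch=16):
    """Return (N, N^2): token count and number of pairwise attention scores."""
    n = (height // patch) * (width // patch)  # one token per non-overlapping patch
    return n, n * n                           # global attention scores every pair


for side in (224, 448, 896):
    n, pairs = attention_cost(side, side)
    print(f"{side} x {side}: N = {n} tokens, N^2 = {pairs} attention scores")
```

each doubling of the side length multiplies N by 4 and N^2 by 16.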

swin transformer architecture

understanding the dimensions:
- Patch partition: input image is H x W x 3 → split into non-overlapping 4 x 4 patches along H and W → (H/4) x (W/4) x 48. each patch holds 4 x 4 = 16 pixels with 3 channels each, so flattening a patch gives 4 x 4 x 3 = 48 values
- Linear embedding: project each 48-dim patch vector to C (the channel dimension the transformer blocks work with) by multiplying with a 48 x C matrix → (H/4) x (W/4) x C (C = 96 for Swin-T and Swin-S)
- Patch merging: reduces the number of patches and widens the channel dimension. the features of each 2 x 2 group of neighbouring patches are concatenated, so (H/4) x (W/4) x C → (H/8) x (W/8) x 4C, and a 4C x 2C linear projection then brings this to (H/8) x (W/8) x 2C
- the same merge step repeats in the later stages: (H/16) x (W/16) x 4C, then (H/32) x (W/32) x 8C. as patches get merged and channels get wider, each token covers a larger region of the image, so the model builds a hierarchical feature map (fine detail early, coarse context later) similar to CNNs
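
the shape bookkeeping above can be traced with a small NumPy sketch. this is not the official implementation: the function names are made up for illustration, the weights are random placeholders for learned parameters, and the transformer blocks (and the normalisation around the projections) between the merges are left out, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)


def patch_partition(image, patch=4):
    """(H, W, 3) -> (H/patch, W/patch, patch*patch*3) by flattening each patch."""
    h, w, c = image.shape
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/p, W/p, p, p, c): group pixels per patch
    return x.reshape(h // patch, w // patch, patch * patch * c)


def linear_embedding(tokens, dim=96):
    """(h, w, 48) -> (h, w, C) with a 48 x C matrix (random here, learned in practice)."""
    weight = rng.standard_normal((tokens.shape[-1], dim))
    return tokens @ weight


def patch_merging(x):
    """(h, w, C) -> (h/2, w/2, 2C): concat each 2 x 2 neighbourhood, project 4C -> 2C."""
    c = x.shape[-1]
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )  # (h/2, w/2, 4C)
    weight = rng.standard_normal((4 * c, 2 * c))  # placeholder for the learned projection
    return merged @ weight


image = rng.standard_normal((224, 224, 3))  # dummy 224 x 224 RGB input
x = linear_embedding(patch_partition(image))
shapes = [x.shape]
for _ in range(3):  # the three patch-merging steps before stages 2, 3 and 4
    x = patch_merging(x)
    shapes.append(x.shape)
print(shapes)
```

for a 224 x 224 input with C = 96 the four stage shapes come out as 56 x 56 x 96, 28 x 28 x 192, 14 x 14 x 384 and 7 x 7 x 768.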
Links:
202605261605