Why do we divide by the square root of the key dimension (d_k) in Scaled Dot-Product Attention? In this video, we dive deep into the intuition and mathematics behind this crucial step.

Understand:
How scaling prevents extreme attention scores.
The impact of dimensionality on softmax.
Why this scaling makes models more stable and efficient.

If you’ve ever wondered about this subtle yet vital detail, this video is for you: we go in depth into why this scaling is so important for keeping training stable.
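If you want to try the idea yourself, here is a minimal NumPy sketch (illustrative only, not code from the video; the names d_k, q, k and the sample sizes are my own choices). It shows that raw dot products between mean-0, variance-1 query and key vectors have variance roughly equal to d_k, while dividing by sqrt(d_k) brings the variance back to about 1 and keeps the softmax from saturating into a near one-hot distribution.

import numpy as np

# Sketch: why dot-product scores blow up with dimension d_k,
# and how dividing by sqrt(d_k) keeps their variance near 1.
rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max()          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

for d_k in [4, 64, 512]:
    # Query/key entries drawn with mean 0 and variance 1
    q = rng.standard_normal((10000, d_k))
    k = rng.standard_normal((10000, d_k))

    scores = (q * k).sum(axis=1)      # raw dot products q . k
    scaled = scores / np.sqrt(d_k)    # scaled dot products

    print(f"d_k={d_k:4d}  var(raw)={scores.var():7.1f}  var(scaled)={scaled.var():.2f}")

# Effect on softmax: large raw scores saturate it into a near one-hot
# distribution (tiny gradients); the scaled version stays softer.
raw_row = rng.standard_normal(5) * np.sqrt(512)   # scores with variance ~512
print("softmax(raw)   :", np.round(softmax(raw_row), 3))
print("softmax(scaled):", np.round(softmax(raw_row / np.sqrt(512)), 3))

Running it, you should see var(raw) grow roughly in proportion to d_k while var(scaled) stays near 1, and the unscaled softmax collapse almost entirely onto one position.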

NOTES: https://github.com/Coding-Lane/Transformer-notes/blob/main/Variance%20of%20a%20matrix.pdf

A MINOR CORRECTION IN THE SLIDES - https://github.com/Coding-Lane/Transformer-notes/blob/main/Self-Attention%20Correction%20Slide.pdf



Timestamps:
0:00 Intro
1:12 Recap of Self-Attention
4:39 Increase in variance
8:05 Why does the variance increase?
12:41 Why is high variance a problem in Deep Learning?
15:30 Why divide by square root of dimension
19:07 Outro



Follow my entire Transformers playlist:

Transformers Playlist: https://www.youtube.com/watch?v=lRylkiFdUdk&list=PLuhqtP7jdD8CQTxwVsuiFYGvHtFpNhlR3&index=1&t=0s



RNN Playlist: https://www.youtube.com/watch?v=lWPkNkShNbo&list=PLuhqtP7jdD8ARBnzj8SZwNFhwWT89fAFr&t=0s

CNN Playlist: https://www.youtube.com/watch?v=E5Z7FQp7AQQ&list=PLuhqtP7jdD8CD6rOWy20INGM44kULvrHu&t=0s

Complete Neural Network: https://www.youtube.com/watch?v=mlk0rddP3L4&list=PLuhqtP7jdD8CftMk831qdE8BlIteSaNzD&t=0s

Complete Logistic Regression Playlist: https://www.youtube.com/watch?v=U1omz0B9FTw&list=PLuhqtP7jdD8Chy7QIo5U0zzKP8-emLdny&t=0s

Complete Linear Regression Playlist: https://www.youtube.com/watch?v=nwD5U2WxTdk&list=PLuhqtP7jdD8AFocJuxC6_Zz0HepAWL9cF&t=0s