In the scaled dot-product attention mechanism, queries and keys are vectors representing pieces of information, and their dot product measures how much focus to place on specific parts of the input. When the dimension of these vectors is large, the magnitude of the dot product tends to grow significantly. This large value pushes the output of the subsequent softmax function into regions where th....
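The growth of the dot product with vector dimension can be checked numerically. The sketch below is an illustration only, assuming query and key components are independent draws with zero mean and unit variance; the helper names (`dot`, `softmax`, `score_std`) are hypothetical and not taken from any library. Under that assumption the unscaled dot product has standard deviation close to the square root of the dimension, while dividing by the square root of the dimension brings it back near 1.

```python
import math
import random


def dot(q, k):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(q, k))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def score_std(dim, scaled, trials=2000, seed=0):
    """Empirical standard deviation of q.k for random q, k of size `dim`.

    Assumption: components are i.i.d. standard normal. If `scaled` is
    True, each score is divided by sqrt(dim) before measuring.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        s = dot(q, k)
        if scaled:
            s /= math.sqrt(dim)
        scores.append(s)
    mean = sum(scores) / trials
    var = sum((s - mean) ** 2 for s in scores) / trials
    return math.sqrt(var)


if __name__ == "__main__":
    for dim in (4, 64, 512):
        print(dim, score_std(dim, scaled=False), score_std(dim, scaled=True))

    # Larger score magnitudes make softmax put almost all weight on one entry.
    print(softmax([1.0, 2.0, 3.0]))
    print(softmax([10.0, 20.0, 30.0]))
```

Run as a script, the unscaled spread should increase with `dim` while the scaled spread stays near 1, and the second softmax call should put nearly all of its weight on the last entry, which is the saturation effect the paragraph describes.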