The specific architectural design choice is the causal attention mask, which is applied within the self-attention mechanism of the Transformer architecture. In a Decoder-only model, self-attention allows each word, or token, in a sequence to calculate its relationship with every other word in that same sequence. During training, the model receives....
Log in to view the answer