Question

Which specific architectural design choice allows a Decoder-only model to generate text sequentially while preventing the model from 'seeing' future tokens during the training phase?

Accepted Answer

The specific architectural design choice is the causal attention mask, which is applied within the self-attention mechanism of the Transformer architecture. Without a mask, self-attention lets each token in a sequence attend to every other token in that same sequence. During training, the model receives the entire sequence at once, so the predictions for all positions can be computed in parallel. However, because the goal is to predict the next token from the previous ones only, the model must be prevented from attending to tokens that appear later in the sequence. The causal attention mask achieves this by adding negative infinity (in practice, often a very large negative number) to the attention scores at every position above the main diagonal, that is, wherever a token would attend to a later token, before the scores pass through the softmax function. The softmax maps those entries to zero, so future tokens receive no attention weight, and the representation computed for a given token depends only on that token and the tokens that preceded it. The masked positions form the strictly upper triangle of the score matrix, which leaves a lower-triangular matrix of attention weights. This makes the attention mechanism unidirectional, matching the left-to-right order of text generation while maintaining the speed of parallelized training.
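The masking step can be illustrated with a minimal single-head sketch in NumPy. This is an illustration under simplifying assumptions, not how any particular library implements it: the function name `causal_self_attention` is hypothetical, the input is used directly as queries, keys, and values (real models apply learned projections first), and the data is random.

```python
import numpy as np

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: arrays of shape (seq_len, d).
    Returns (outputs, attention_weights).
    """
    seq_len, d = queries.shape
    scores = queries @ keys.T / np.sqrt(d)  # shape (seq_len, seq_len)

    # True strictly above the diagonal: row i, column j with j > i,
    # i.e. token i attending to a later token j.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Row-wise softmax; exp(-inf) is 0, so future tokens get zero weight.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Hypothetical toy input: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, weights = causal_self_attention(x, x, x)
```

In this sketch, `weights` is lower triangular and each row sums to 1. Changing the last row of `x` leaves the outputs for the first three tokens unchanged, which is the property that lets all positions be trained in parallel without leaking future tokens.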
