Question

When applying Transformer models to DNA sequences, what is the mathematical role of the 'attention mask' in managing variable-length genomic inputs?

Accepted Answer

Transformer models process DNA sequences in batches, so shorter sequences are padded with a placeholder token (often one whose ID is zero) until every sequence in the batch has the same length. The attention mask acts as a mathematical gate that tells the self-attention mechanism to ignore these padding tokens. The attention mechanism computes a raw score for each pair of positions as the dot product of a query vector and a key vector, divided by a scaling factor (the square root of the key dimension); a softmax function then turns the scores into weights. The attention mask is added to the scaled scores before the softmax operation. For positions representing actual DNA nucleotides, the mask contains a zero, leaving the scores unchanged. For positions representing padding, the mask contains negative infinity or, in practice, a very large negative number. Because the exponential of negative infinity is zero, the softmax assigns padding tokens an attention weight of zero (or effectively zero when a large finite value is used), while the weights over the real positions still sum to one. Padding therefore contributes nothing to the representation of any real token. This process allows the model to compute context-aware embeddings for variable-length DNA fragments without the filler tokens corrupting the biological information.
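
The masking step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it: the function name masked_attention_weights, the toy two-dimensional vectors, and the 1/0 mask convention (1 for a real token, 0 for padding) are all assumptions made for the example. It also assumes at least one position is a real token, since a fully masked row would have no valid weights.

```python
import math


def masked_attention_weights(query, keys, attention_mask):
    """Attention weights of one query over all key positions.

    query: list of floats of length d
    keys: list of key vectors, each of length d
    attention_mask: list of 1 (real token) or 0 (padding), one per key
    """
    d = len(query)
    # Scaled dot-product scores: (q . k) / sqrt(d)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    # Additive mask: add 0 for real tokens, -inf for padding
    masked = [
        s + (0.0 if m == 1 else -math.inf)
        for s, m in zip(scores, attention_mask)
    ]
    # Numerically stable softmax; math.exp(-inf) evaluates to 0.0
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: a 3-token fragment padded to length 5.
# The key vectors are made-up values, not real embeddings.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0, 0]
query = [1.0, 0.0]

weights = masked_attention_weights(query, keys, attention_mask)
print(weights)
```

Without the mask, the two all-zero padding keys would receive a score of zero and therefore a nonzero share of the softmax. With the mask, their weights are exactly zero and the three real positions share the full probability mass.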
