Question

When employing a cross-encoder for re-ranking retrieved documents, why does the architectural requirement to process the query and document simultaneously preclude the use of pre-computed embeddings?

Accepted Answer

A cross-encoder concatenates the query and the document into a single input sequence and feeds that sequence through the transformer. Self-attention then lets every query token attend to every document token, and vice versa, from the first layer onward. This token-level interaction is what makes cross-encoders more accurate than models that encode the two texts separately. Pre-computed embeddings, by contrast, are fixed vectors generated for a single piece of text in isolation, without knowledge of any specific query. In a cross-encoder, every hidden state of the document tokens depends on the query it is paired with, so there is no query-independent document representation that could be computed ahead of time and cached. A pre-computed vector cannot contain the query-document attention interactions that determine the model's output. Consequently, each query-document pair must be passed through the transformer layers together at inference time to produce a relevance score, which is why pre-computing document embeddings does not apply to this architecture.
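The dependence of the document's internal state on the query can be illustrated with a toy example. The sketch below is not a real cross-encoder: it assumes a single self-attention layer with no learned weights, and the 2-dimensional token vectors and the names EMBED, self_attention, doc_states, and max_diff are all hypothetical, chosen only for illustration. It shows that the document's token states change as soon as a query is placed in the same sequence, and change again for a different query, so a cached copy computed from the document alone would not match either.

```python
import math

# Hypothetical 2-d token vectors; illustrative values, not from a trained model.
EMBED = {
    "capital": [1.0, 0.0],
    "river": [-1.0, 0.5],
    "paris": [1.0, 1.0],
    "france": [0.0, 1.0],
}


def self_attention(tokens):
    """One layer of scaled dot-product self-attention without learned weights."""
    vecs = [EMBED[t] for t in tokens]
    dim = len(vecs[0])
    outputs = []
    for q in vecs:
        # Attention scores of this token against every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vecs]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        # Weighted average of all token vectors in the sequence.
        outputs.append(
            [sum(w / total * v[i] for w, v in zip(weights, vecs)) for i in range(dim)]
        )
    return outputs


def doc_states(query, doc):
    """States of the document tokens when encoded jointly with the query."""
    return self_attention(query + doc)[len(query):]


def max_diff(a, b):
    """Largest absolute element-wise difference between two lists of vectors."""
    return max(abs(x - y) for u, v in zip(a, b) for x, y in zip(u, v))


doc = ["paris", "france"]
alone = self_attention(doc)             # what a pre-computed cache would hold
with_q1 = doc_states(["capital"], doc)  # same document, paired with query 1
with_q2 = doc_states(["river"], doc)    # same document, paired with query 2

print("cached vs. query 1: ", max_diff(alone, with_q1))
print("query 1 vs. query 2:", max_diff(with_q1, with_q2))
```

Both printed differences are non-zero. In a real cross-encoder the same effect compounds across many layers and attention heads, which is why the forward pass has to be repeated for every query-document pair.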

Community Answers