“Self-Attention”, also known as “intra-attention”, is a type of attention mechanism used in machine learning models that process sequence data, such as sentences in natural language processing tasks.
In typical attention models, attention is computed between two different sequences (for example, between an English sentence and its French translation in a machine translation task). In self-attention, by contrast, the model computes attention scores within a single sequence.
Here’s how it works:
- Given an input sequence (such as a sentence), each word is first transformed into a vector representation, for example through word embedding.
- For each word, the self-attention mechanism calculates a score against every other word in the sequence to determine how much each one should contribute to the new representation of the current word. This score is usually based on the dot product of the words’ vector representations (in the Transformer, learned “query” and “key” projections of them), passed through a softmax function to turn the scores into probabilities that sum to 1.
- These probabilities are then used to form a weighted sum of all the word vectors in the sequence (in the Transformer, their “value” projections), so that words with higher scores have more influence on the new representation of the current word.
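To make these steps concrete, here is a minimal NumPy sketch of self-attention. It follows the Transformer’s scaled dot-product formulation, so the projection matrices `W_q`, `W_k`, `W_v`, the scaling by the square root of the key dimension, and the toy sizes are illustrative assumptions rather than part of the description above:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence.

    X            : (seq_len, d_model) matrix of word embeddings
    W_q/W_k/W_v  : (d_model, d_k) learned projection matrices (assumed here)
    """
    Q = X @ W_q                          # query vector for each word
    K = X @ W_k                          # key vector for each word
    V = X @ W_v                          # value vector for each word
    d_k = Q.shape[-1]
    # Pairwise scores: how much each word attends to every other word.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    # Weighted sum of value vectors -> new context-aware vector per word.
    return weights @ V

# Toy example: a "sentence" of 4 words with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8): one updated representation per input word
```

Each row of the output is a new vector for the corresponding word, built from every word in the sequence in proportion to its attention weight.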
The key advantage of self-attention is that it lets each word interact directly with every other word in the sequence, not just with nearby ones. This allows the model to capture long-range dependencies between words, which can be crucial for understanding the meaning of a sentence.
The Transformer architecture, used in state-of-the-art models such as GPT-3 and BERT, relies heavily on self-attention. Instead of processing the sequence with recurrent or convolutional layers, a Transformer stacks multiple layers of self-attention, allowing it to process all words in the sequence in parallel and capture complex patterns in the data.
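As one illustration of this building block, PyTorch exposes it as `nn.MultiheadAttention`; passing the same tensor as the query, key, and value turns it into self-attention, and all positions are processed in a single parallel call (the dimensions below are arbitrary):

```python
import torch
import torch.nn as nn

# One self-attention layer like those stacked inside a Transformer block.
embed_dim, num_heads, seq_len = 64, 8, 10

attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
x = torch.randn(1, seq_len, embed_dim)   # (batch, sequence, features)

# Same tensor as query, key, and value = self-attention over the sequence.
out, weights = attn(x, x, x)
print(out.shape)      # torch.Size([1, 10, 64]): updated vector per position
print(weights.shape)  # torch.Size([1, 10, 10]): attention from each word to every word
```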