AI Term: Skip-grams


Skip-grams are a method used in natural language processing and machine learning for generating pairs of words that co-occur within a context window in a given text corpus. They are often used in the training process of word2vec, a popular algorithm for creating word embeddings.

In a skip-gram model, the goal is to predict the context words (the words surrounding a given word) from the target word (the word of interest). The “skip” in skip-gram refers to the fact that the pairs can skip over intervening words: the model takes into account not just the immediate neighbors of the target word, but also words further away within the window.

For example, take the sentence “The cat sat on the mat”. If we choose “sat” as our target word and use a context window of size 2, the skip-gram pairs are:

  • (sat, The)
  • (sat, cat)
  • (sat, on)
  • (sat, the)

The context includes the two words before and the two words after the target word “sat”. The pairs are formed by pairing the target word with each of the context words.
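
As a concrete sketch, the short Python function below generates (target, context) pairs for every word in a tokenized sentence; the function name and the whitespace tokenization are illustrative rather than taken from any particular library:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) pairs within a fixed window."""
    pairs = []
    for i, target in enumerate(tokens):
        # Look up to `window` positions to the left and right of the target.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "The cat sat on the mat".split()
print(skipgram_pairs(tokens)[5:9])
# [('sat', 'The'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```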

Once these pairs are generated, they can be used to train a model (like word2vec) to create word embeddings. The underlying idea is that words appearing in similar contexts tend to have similar meanings. Therefore, by training the model to predict context words based on a target word, the resulting word embeddings capture the semantic relationships between words.
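
In the original word2vec formulation (Mikolov et al., 2013), this prediction task corresponds roughly to maximizing the average log-probability of the context words given each target word:

```latex
% T: corpus length, c: window size,
% w_t: target word, w_{t+j}: a context word.
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\left(w_{t+j} \mid w_t\right)
```

Computing p(w_{t+j} | w_t) as a softmax over the full vocabulary is expensive, so implementations typically approximate it with negative sampling or hierarchical softmax.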

Skip-grams are one of two architectural choices for training word2vec models, the other being Continuous Bag of Words (CBOW). While skip-grams predict context words from a target word, CBOW does the opposite: it predicts a target word from its context. Skip-grams are slower to train, but they work well even with smaller amounts of training data and are better at capturing infrequent words than CBOW.
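
As a usage illustration, the gensim library exposes this architectural choice through its sg parameter; the sketch below assumes gensim 4.x and uses a toy corpus purely for demonstration:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
]

# sg=1 selects the skip-gram architecture; sg=0 (the default) selects CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)  # (50,) -- the learned embedding for "cat"
```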
