“TF-IDF”, which stands for “Term Frequency-Inverse Document Frequency”, is a numerical statistic used in information retrieval and natural language processing to reflect how important a word is to a document in a collection or corpus. It’s often used as a weighting factor during text analysis and is one of the most popular methods for feature extraction from text.
The TF-IDF value for a word increases proportionally to the number of times a word appears in the document (this is the Term Frequency part), but it is offset by the frequency of the word in the corpus (this is the Inverse Document Frequency part), which helps to adjust for the fact that some words appear more frequently in general.
Here’s a breakdown of the two components:
- Term Frequency (TF): This measures how frequently a term occurs in a document. A word that appears often in a document is assumed to matter more to that document. The TF is usually normalized (i.e., divided by the total number of words in the document) so that longer documents do not skew the results.
- Inverse Document Frequency (IDF): This measures how distinctive a term is across the entire corpus. If a word appears in many documents, it does little to tell them apart, so it’s less important. The IDF of a word is the logarithm of the total number of documents divided by the number of documents that contain the word.
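Written out, the two components in one common formulation look like the following. The notation is an assumption introduced here for illustration: f(t, d) is the raw count of term t in document d, and N is the number of documents in the corpus D. Libraries often use variants with smoothing or different normalization.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
```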
The TF-IDF value of a word in a document is the product of its TF and IDF values. So, words that are common in a single document but rare in the corpus will have a high TF-IDF value. This makes TF-IDF useful for tasks like text classification and clustering, where the goal is to distinguish documents based on their content.
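To make the product concrete, here is a minimal sketch in plain Python. It assumes the unsmoothed definitions described above (length-normalized TF, natural-log IDF); the function name `tf_idf` and the toy corpus are hypothetical, and real libraries typically add smoothing and normalization on top of this.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    `docs` is a list of token lists. Hypothetical helper for illustration:
    TF is count / document length, IDF is log(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus (made up for this example), tokenized on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
weights = tf_idf(corpus)
```

In this toy corpus, “the” appears in every document, so its IDF is log(1) = 0 and its weight is zero everywhere. In the first document, “mat” occurs in no other document and therefore gets a higher weight than “cat”, which also appears in the third.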
However, while TF-IDF captures how important a word is to a document, it does not capture a word’s position in the text, its meaning, its co-occurrence with other words across documents, or word order. Other techniques, such as word embeddings or recurrent neural networks, address these limitations.