“Stemming” is a process in natural language processing that reduces words to a root or base form, called the stem. The aim is to map related words to the same stem even when their surface forms differ. This is usually done by stripping suffixes from the ends of words.
For example, the words “running”, “runner”, and “runs” all share the root “run”. An ideal stemmer would reduce all three to “run”, although in practice a rule-based stemmer may leave some derived forms, such as “runner”, unchanged.
Stemming is a heuristic process that works by applying a set of rules to the words. For instance, a simple stemming algorithm for English might remove the ‘s’ or ‘es’ at the end of words to change plural nouns to singular, or it might remove ‘ing’ at the end of words to change present participle verbs to their base form.
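The suffix rules described above can be sketched in a few lines of Python. This is a toy illustration only: the function name `naive_stem`, the suffix list, and the minimum-length check are assumptions made for this example, not part of any published stemming algorithm.

```python
def naive_stem(word: str) -> str:
    """Strip a few common English suffixes (toy heuristic, not the Porter algorithm)."""
    word = word.lower()
    # Try longer suffixes first so that "es" is checked before "s".
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words such as "bus" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["cats", "boxes", "walking", "running", "bus"]:
    print(w, "->", naive_stem(w))
```

Note that this sketch turns “running” into “runn”, because it has no rule for doubled consonants. Practical stemmers add further rules to handle cases like this.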
One of the most widely used stemming algorithms is the Porter stemmer, published by Martin Porter in 1980. It applies a predefined set of suffix rules in a fixed sequence of steps to reduce a word to its stem.
Stemming shrinks the vocabulary, which reduces the dimensionality of text representations and the cost of processing them, but it can be overly simplistic and aggressive, leading to errors. For instance, stemming does not relate irregular forms to one another (such as “went” and “go”), and it may produce stems that are not actual words (for instance, “happi” for both “happiness” and “happy”).
An alternative to stemming is lemmatization, which reduces a word to its dictionary form (the lemma) using morphological analysis of the word rather than suffix rules alone. Lemmatization tends to be more sophisticated and accurate, but also more computationally intensive.
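As a rough illustration of the difference, the sketch below looks words up in a small table instead of stripping suffixes. The table `LEMMAS` and the function `toy_lemmatize` are hypothetical and hand-written for this example. Real lemmatizers rely on large lexicons and usually on part-of-speech information.

```python
# Hypothetical, hand-written lemma table used only for this illustration.
LEMMAS = {
    "went": "go",
    "gone": "go",
    "mice": "mouse",
    "running": "run",
    "happiness": "happiness",
}


def toy_lemmatize(word: str) -> str:
    """Return the dictionary form if the word is in the table, else the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ["went", "mice", "running", "table"]:
    print(w, "->", toy_lemmatize(w))
```

Because the mapping comes from a lexicon rather than from suffix rules, an irregular form such as “went” can be linked to “go”, and every output is an actual word.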