In a previous post, I started an elementary search library using C#. Now, I will improve it by adding a Porter Stemming filter to the indexing and searching processes. The resulting source code is available on my GitHub.
Note: I’ve been learning about it from the “Introduction to Information Retrieval” book.
Stemming?
From the “Introduction to Information Retrieval” book:
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
The goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
Porter Stemming?
From the “Introduction to Information Retrieval” book:
The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective is the Porter Algorithm. It consists of several phases of word reductions applied sequentially.
Implementing it
Recall the significant steps in inverted index construction:
- Collect the documents to be indexed (I will use simple strings for now);
- Tokenize the text, turning each document into a list of tokens;
- Do linguistic preprocessing, producing a list of indexing terms;
- Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
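As a rough sketch, the whole pipeline looks like the toy program below. This is not the library's API: the `Stem` helper is a hypothetical stand-in for the real Porter filter, and the dictionary-of-postings is a deliberately minimal inverted index.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical Stem function standing in for the Porter filter.
static string Stem(string token) => token.ToLowerInvariant().TrimEnd('s');

var documents = new[] { "cats eat", "a cat sleeps" };   // step 1: collect
var index = new Dictionary<string, SortedSet<int>>();

for (int docId = 0; docId < documents.Length; docId++)
{
    foreach (var token in documents[docId].Split(' '))  // step 2: tokenize
    {
        var term = Stem(token);                         // step 3: preprocess
        if (!index.TryGetValue(term, out var postings))
            index[term] = postings = new SortedSet<int>();
        postings.Add(docId);                            // step 4: postings
    }
}

// Both documents contain a form of "cat", so both appear in its posting list.
Console.WriteLine(string.Join(",", index["cat"]));      // 0,1
```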
The Porter Stemming filter fits into step 3, right?
```csharp
public class PorterStemmerFilter : IFilter
{
    public bool Process(TokenSource source)
    {
        PerformStep1(source);
        PerformStep2(source);
        PerformStep3(source);
        PerformStep4(source);
        PerformStep5(source);
        PerformStep6(source);

        return source.Size > 0;
    }
    // ..
```
I used this implementation as a starting point to write my own.
Step 1
In step 1, we remove common suffixes and pluralizations.
```csharp
[Theory]
[InlineData("caresses", "caress")]
[InlineData("ponies", "poni")]
[InlineData("ties", "ti")]
[InlineData("caress", "caress")]
[InlineData("cats", "cat")]
[InlineData("feed", "feed")]
[InlineData("agreed", "agree")]
[InlineData("disabled", "disable")]
[InlineData("matting", "mat")]
[InlineData("mating", "mate")]
[InlineData("meeting", "meet")]
[InlineData("milling", "mill")]
[InlineData("messing", "mess")]
[InlineData("meetings", "meet")]
public void Step1(string from, string to)
{
    var filter = new PorterStemmerFilter();
    using (var reader = new StringReader(from))
    {
        var tokenSource = new TokenSource(reader);
        tokenSource.Next();

        filter.PerformStep1(tokenSource);

        Assert.Equal(to, tokenSource.ToString());
    }
}
```
To do that:
```csharp
public void PerformStep1(TokenSource source)
{
    if (source.EndsWith('s'))
    {
        if (source.EndsWith("sses") || source.EndsWith("ies"))
        {
            source.Size -= 2;
        }
        else if (source.Buffer[source.Size - 2] != 's')
        {
            source.Size -= 1;
        }
    }

    if (source.EndsWith("eed"))
    {
        var limit = source.Size - 3; // source.Length
        if (source.NumberOfConsoantSequences(limit) > 0)
        {
            source.Size -= 1;
        }
    }
    else
    {
        var limit = 0;
        if (source.EndsWith("ed"))
        {
            limit = source.Size - 2;
        }
        else if (source.EndsWith("ing"))
        {
            limit = source.Size - 3;
        }

        if (limit != 0 && source.ContainsVowel(limit))
        {
            source.Size = limit;
            if (source.EndsWith("at") ||
                source.EndsWith("bl") ||
                source.EndsWith("iz"))
            {
                source.InsertIntoBuffer('e');
            }
            else if (source.EndsWithDoubleConsonant())
            {
                var ch = source.LastChar;
                if (ch != 'l' && ch != 's' && ch != 'z')
                {
                    source.Size--;
                }
            }
            else if (source.NumberOfConsoantSequences(source.Size - 1) == 1 &&
                     source.HasCvcAt(source.Size - 1))
            {
                source.InsertIntoBuffer('e');
            }
        }
    }
}
```
Notes:
- The EndsWith method checks if the end of the current token matches the specified string/char.
- The Buffer is a plain old fixed-size char array.
- The Size is an integer holding how much of Buffer is used to store the current token.
- The NumberOfConsoantSequences method returns how many consonant sequences are present in the specified portion of the buffer. It is useful to check that we are not oversimplifying the word.
- HasCvcAt verifies if there is a Consonant-Vowel-Consonant sequence ending at the specified position. Again, it is relevant to guarantee that we are not oversimplifying.
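To make the "consonant sequences" idea concrete: Porter views every word as [C](VC)^m[V], and the measure m counts the vowel-consonant sequences. Here is a standalone sketch of that computation (my own helper names, not the article's TokenSource API). A letter is a consonant unless it is a, e, i, o, u, or a 'y' preceded by a consonant.

```csharp
using System;

static bool IsConsonant(string w, int i)
{
    char c = w[i];
    if ("aeiou".IndexOf(c) >= 0) return false;
    // 'y' counts as a vowel when it follows a consonant (e.g. "by").
    if (c == 'y') return i == 0 || !IsConsonant(w, i - 1);
    return true;
}

static int Measure(string w)
{
    int m = 0;
    bool prevVowel = false;
    for (int i = 0; i < w.Length; i++)
    {
        bool cons = IsConsonant(w, i);
        if (prevVowel && cons) m++;   // closes one VC sequence
        prevVowel = !cons;
    }
    return m;
}

Console.WriteLine(Measure("tree"));    // 0
Console.WriteLine(Measure("trouble")); // 1
Console.WriteLine(Measure("oaten"));   // 2
```

These values match the examples in Porter's original paper; rules like "only remove -eed when m > 0" use exactly this count, so short stems such as "fe" in "feed" are left alone.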
Step 2
Step 2 turns a terminal -y into -i when there is another vowel in the stem.
```csharp
public void PerformStep2(TokenSource source)
{
    if (source.EndsWith('y') && source.ContainsVowel(source.Size - 2))
    {
        source.Buffer[source.Size - 1] = 'i';
    }
}
```
Step 3
Step 3 maps double suffixes to single ones. So -ization maps to -ize, and so on.
Testing Step3 in isolation:
```csharp
[Theory]
[InlineData("international", "internate")]
[InlineData("rational", "rational")]
[InlineData("constitutional", "constitution")]
[InlineData("energizer", "energize")]
[InlineData("internacionalization", "internacionalize")]
[InlineData("enumeration", "enumerate")]
[InlineData("consolidator", "consolidate")]
[InlineData("tropicalism", "tropical")]
[InlineData("vandalism", "vandal")]
[InlineData("activeness", "active")]
[InlineData("remorsefulness", "remorseful")]
public void Step3(string from, string to)
{
    var filter = new PorterStemmerFilter();
    using (var reader = new StringReader(from))
    {
        var tokenSource = new TokenSource(reader);
        tokenSource.Next();

        filter.PerformStep3(tokenSource);

        Assert.Equal(to, tokenSource.ToString());
    }
}
```
And now, step 3's implementation:
```csharp
public void PerformStep3(TokenSource source)
{
    if (source.Size == 0) return;

    switch (source.Buffer[source.Size - 2])
    {
        case 'a':
            if (source.ChangeSuffix("ational", "ate")) break;
            source.ChangeSuffix("tional", "tion");
            break;
        case 'c':
            if (source.ChangeSuffix("enci", "ence")) break;
            source.ChangeSuffix("anci", "ance");
            break;
        case 'e':
            source.ChangeSuffix("izer", "ize");
            break;
        case 'l':
            if (source.ChangeSuffix("bli", "ble")) break;
            if (source.ChangeSuffix("alli", "al")) break;
            if (source.ChangeSuffix("entli", "ent")) break;
            if (source.ChangeSuffix("eli", "e")) break;
            source.ChangeSuffix("ousli", "ous");
            break;
        case 'o':
            if (source.ChangeSuffix("ization", "ize")) break;
            if (source.ChangeSuffix("ation", "ate")) break;
            source.ChangeSuffix("ator", "ate");
            break;
        case 's':
            if (source.ChangeSuffix("alism", "al")) break;
            if (source.ChangeSuffix("iveness", "ive")) break;
            if (source.ChangeSuffix("fulness", "ful")) break;
            source.ChangeSuffix("ousness", "ous");
            break;
        case 't':
            if (source.ChangeSuffix("aliti", "al")) break;
            if (source.ChangeSuffix("iviti", "ive")) break;
            source.ChangeSuffix("biliti", "ble");
            break;
        case 'g':
            source.ChangeSuffix("logi", "log");
            break;
    }
}
```
Notes:
- ChangeSuffix replaces the specified suffix only when the remaining stem has at least one consonant sequence.
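That contract can be sketched on its own, assuming a much simpler vowel test than the real filter uses (the helper names here are mine, not the TokenSource API, and the vowel check ignores the special handling of 'y'):

```csharp
using System;

// True if the stem contains at least one vowel-consonant sequence,
// i.e. Porter's measure m > 0 (simplified: 'y' is treated as a consonant).
static bool HasVowelConsonantSequence(string stem)
{
    bool prevVowel = false;
    foreach (char c in stem)
    {
        bool vowel = "aeiou".IndexOf(c) >= 0;
        if (prevVowel && !vowel) return true;
        prevVowel = vowel;
    }
    return false;
}

// Replace the suffix only when the remaining stem is "long enough";
// otherwise leave the word untouched.
static string ChangeSuffix(string word, string suffix, string replacement)
{
    if (!word.EndsWith(suffix, StringComparison.Ordinal)) return word;
    var stem = word.Substring(0, word.Length - suffix.Length);
    return HasVowelConsonantSequence(stem) ? stem + replacement : word;
}

Console.WriteLine(ChangeSuffix("relational", "ational", "ate")); // relate
Console.WriteLine(ChangeSuffix("rational", "ational", "ate"));   // rational (stem "r" is too short)
```

This is why "rational" survives step 3 unchanged in the test above while longer words get their suffixes rewritten.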
Step 4
Step 4 deals with -ic-, -ful, -ness, etc., in a similar fashion to step 3.
```csharp
public void PerformStep4(TokenSource source)
{
    if (source.Size == 0) return;

    switch (source.LastChar)
    {
        case 'e':
            if (source.ChangeSuffix("icate", "ic")) break;
            if (source.RemoveSuffix("ative")) break;
            source.ChangeSuffix("alize", "al");
            break;
        case 'i':
            source.ChangeSuffix("iciti", "ic");
            break;
        case 'l':
            if (source.ChangeSuffix("ical", "ic")) break;
            source.RemoveSuffix("ful");
            break;
        case 's':
            source.RemoveSuffix("ness");
            break;
    }
}
```
Notes:
- RemoveSuffix removes the specified suffix only when the remaining stem has at least one consonant sequence.
Step 5
Step 5 takes off -ant, -ence, etc.
```csharp
public void PerformStep5(TokenSource source)
{
    if (source.Size == 0) return;

    switch (source.Buffer[source.Size - 2])
    {
        case 'a':
            source.RemoveSuffix("al");
            return;
        case 'c':
            if (source.RemoveSuffix("ance")) return;
            source.RemoveSuffix("ence");
            return;
        case 'e':
            source.RemoveSuffix("er");
            return;
        case 'i':
            source.RemoveSuffix("ic");
            return;
        case 'l':
            if (source.RemoveSuffix("able")) return;
            source.RemoveSuffix("ible");
            return;
        case 'n':
            if (source.RemoveSuffix("ant")) return;
            if (source.RemoveSuffix("ement")) return;
            if (source.RemoveSuffix("ment")) return;
            source.RemoveSuffix("ent");
            return;
        case 'o':
            if (source.ChangeSuffix("tion", "t")) return;
            if (source.ChangeSuffix("sion", "s")) return;
            source.RemoveSuffix("ou");
            return;
        case 's':
            source.RemoveSuffix("ism");
            return;
        case 't':
            if (source.RemoveSuffix("ate")) return;
            source.RemoveSuffix("iti");
            return;
        case 'u':
            source.RemoveSuffix("ous");
            return;
        case 'v':
            source.RemoveSuffix("ive");
            return;
        case 'z':
            source.RemoveSuffix("ize");
            return;
        default:
            return;
    }
}
```
Step 6
Step 6 removes a final -e and reduces a double -l.
```csharp
public void PerformStep6(TokenSource source)
{
    switch (source.LastChar)
    {
        case 'e':
            var a = source.NumberOfConsoantSequences(source.Size - 1);
            if (a > 1 || a == 1 && !source.HasCvcAt(source.Size - 2))
                source.Size--;
            break;
        case 'l' when source.ContainsDoubleConsonantAt(source.Size - 1)
                   && source.NumberOfConsoantSequences(source.Size - 2) > 0:
            source.Size--;
            break;
    }
}
```
How it affects the indexing and the search results
Adopting the Porter Stemming Algorithm reduces the number of distinct tokens in the dictionary and increases the sizes of the posting lists.
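A toy illustration of that trade-off (again using a hypothetical suffix-stripping stand-in for the Porter filter, not the library's API): word forms that previously got their own dictionary entries collapse into one term, whose posting list then covers all of them.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the Porter filter.
static string Stem(string token) => token.TrimEnd('s');

var tokens = new[] { "cat", "cats", "meeting", "meetings" };

var distinctTokens = tokens.Distinct().Count();                // 4 dictionary entries without stemming
var distinctTerms  = tokens.Select(Stem).Distinct().Count();   // 2 entries with stemming

Console.WriteLine($"{distinctTokens} -> {distinctTerms}");     // 4 -> 2
```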
Let’s see what happens with the queries.
```csharp
[Theory]
[InlineData(
    new[]
    {
        "Human cannibalism is the act or practice of humans eating the flesh or internal organs of other human beings. ",
        "There are cannibals in some primitive communities.",
        "In marketing strategy, cannibalization refers to a reduction in sales volume, sales revenue,... ",
    },
    "Cannibalization",
    new[] { 0, 1, 2 }
)]
public void SearchByQuery_TestingPorterStemming(
    string[] documents,
    string query,
    int[] expectedResults
)
{
    var index = new StringIndexer().CreateIndex(documents);
    var searcher = new Searcher(index);

    var results = searcher.Search(query, DefaultAnalyzer.Instance);

    Assert.Equal(expectedResults, results);
}
```
In the test query, all documents are returned. Why? Because cannibalization, cannibals, and cannibalism are all stemmed to "cannib".
Final Words
In this post, I shared my implementation of the famous Porter Stemming Algorithm. Needless to say, I enjoyed writing it.
Again, I am still studying Information Retrieval, so if you find any errors in my implementation, let me know.