In a previous post, I started an elementary search library using C#. Now, I will improve it by adding a Porter Stemming filter to the indexing and searching processes. The resulting source code is available on my GitHub.
Note: I’ve been learning about it from the “Introduction to Information Retrieval” book.
Stemming?
From the “Introduction to Information Retrieval” book:
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
The goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
Porter Stemming?
From the “Introduction to Information Retrieval” book:
The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective is the Porter Algorithm. It consists of several phases of word reductions applied sequentially.
Implementing it
Recall the significant steps in inverted index construction:
- Collect the documents to be indexed (I will use simple strings for now);
- Tokenize the text, turning each document into a list of tokens;
- Do linguistic preprocessing, producing a list of indexing terms;
- Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
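As a rough sketch, the whole pipeline looks like the toy program below. This is not the library's API: the `Stem` helper is a hypothetical stand-in for the real Porter filter, and the dictionary-of-postings is a deliberately minimal inverted index.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical Stem function standing in for the Porter filter.
static string Stem(string token) => token.ToLowerInvariant().TrimEnd('s');

var documents = new[] { "cats eat", "a cat sleeps" };   // step 1: collect
var index = new Dictionary<string, SortedSet<int>>();

for (int docId = 0; docId < documents.Length; docId++)
{
    foreach (var token in documents[docId].Split(' '))  // step 2: tokenize
    {
        var term = Stem(token);                         // step 3: preprocess
        if (!index.TryGetValue(term, out var postings))
            index[term] = postings = new SortedSet<int>();
        postings.Add(docId);                            // step 4: postings
    }
}

// Both documents contain a form of "cat", so both appear in its posting list.
Console.WriteLine(string.Join(",", index["cat"]));      // 0,1
```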
The Porter Stemming filter fits into step 3, right?
```csharp
public class PorterStemmerFilter : IFilter
{
    public bool Process(TokenSource source)
    {
        PerformStep1(source);
        PerformStep2(source);
        PerformStep3(source);
        PerformStep4(source);
        PerformStep5(source);
        PerformStep6(source);

        return source.Size > 0;
    }
    // ..
```
I used this implementation as a starting point to write my own.
Step 1
In step 1, we remove common suffixes and pluralizations.
```csharp
[Theory]
[InlineData("caresses", "caress")]
[InlineData("ponies", "poni")]
[InlineData("ties", "ti")]
[InlineData("caress", "caress")]
[InlineData("cats", "cat")]
[InlineData("feed", "feed")]
[InlineData("agreed", "agree")]
[InlineData("disabled", "disable")]
[InlineData("matting", "mat")]
[InlineData("mating", "mate")]
[InlineData("meeting", "meet")]
[InlineData("milling", "mill")]
[InlineData("messing", "mess")]
[InlineData("meetings", "meet")]
public void Step1(string from, string to)
{
    var filter = new PorterStemmerFilter();
    using (var reader = new StringReader(from))
    {
        var tokenSource = new TokenSource(reader);
        tokenSource.Next();

        filter.PerformStep1(tokenSource);

        Assert.Equal(to, tokenSource.ToString());
    }
}
```
To do that:
```csharp
public void PerformStep1(TokenSource source)
{
    if (source.EndsWith('s'))
    {
        if (source.EndsWith("sses") || source.EndsWith("ies"))
        {
            source.Size -= 2;
        }
        else if (source.Buffer[source.Size - 2] != 's')
        {
            source.Size -= 1;
        }
    }

    if (source.EndsWith("eed"))
    {
        var limit = source.Size - 3; // source.Length
        if (source.NumberOfConsoantSequences(limit) > 0)
        {
            source.Size -= 1;
        }
    }
    else
    {
        var limit = 0;
        if (source.EndsWith("ed"))
        {
            limit = source.Size - 2;
        }
        else if (source.EndsWith("ing"))
        {
            limit = source.Size - 3;
        }

        if (limit != 0 && source.ContainsVowel(limit))
        {
            source.Size = limit;
            if (source.EndsWith("at") ||
                source.EndsWith("bl") ||
                source.EndsWith("iz"))
            {
                source.InsertIntoBuffer('e');
            }
            else if (source.EndsWithDoubleConsonant())
            {
                var ch = source.LastChar;
                if (ch != 'l' && ch != 's' && ch != 'z')
                {
                    source.Size--;
                }
            }
            else if (source.NumberOfConsoantSequences(source.Size - 1) == 1 &&
                     source.HasCvcAt(source.Size - 1))
            {
                source.InsertIntoBuffer('e');
            }
        }
    }
}
```
Notes:
- The EndsWith method checks if the end of the current token matches the specified string/char.
- The Buffer is a plain old fixed-size char array.
- The Size is an integer holding how much of Buffer is used to store the current token.
- The NumberOfConsoantSequences method returns how many consonant sequences are present in the specified portion of the buffer. It is useful to check that we are not oversimplifying the word.
- HasCvcAt verifies if there is a Consonant-Vowel-Consonant sequence ending at the specified position. Again, it is relevant to guarantee that we are not oversimplifying.
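To make the "consonant sequences" idea concrete: Porter views every word as [C](VC)^m[V], and the measure m counts the vowel-consonant sequences. Here is a standalone sketch of that computation (my own helper names, not the article's TokenSource API). A letter is a consonant unless it is a, e, i, o, u, or a 'y' preceded by a consonant.

```csharp
using System;

static bool IsConsonant(string w, int i)
{
    char c = w[i];
    if ("aeiou".IndexOf(c) >= 0) return false;
    // 'y' counts as a vowel when it follows a consonant (e.g. "by").
    if (c == 'y') return i == 0 || !IsConsonant(w, i - 1);
    return true;
}

static int Measure(string w)
{
    int m = 0;
    bool prevVowel = false;
    for (int i = 0; i < w.Length; i++)
    {
        bool cons = IsConsonant(w, i);
        if (prevVowel && cons) m++;   // closes one VC sequence
        prevVowel = !cons;
    }
    return m;
}

Console.WriteLine(Measure("tree"));    // 0
Console.WriteLine(Measure("trouble")); // 1
Console.WriteLine(Measure("oaten"));   // 2
```

These values match the examples in Porter's original paper; rules like "only remove -eed when m > 0" use exactly this count, so short stems such as "fe" in "feed" are left alone.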
Step 2
Step 2 turns a terminal -y into -i when there is another vowel in the stem.
```csharp
public void PerformStep2(TokenSource source)
{
    if (source.EndsWith('y') && source.ContainsVowel(source.Size - 2))
    {
        source.Buffer[source.Size - 1] = 'i';
    }
}
```
Step 3
Step 3 maps double suffixes to single ones. So -ization maps to -ize, and so on.
Testing Step3 in isolation:
```csharp
[Theory]
[InlineData("international", "internate")]
[InlineData("rational", "rational")]
[InlineData("constitutional", "constitution")]
[InlineData("energizer", "energize")]
[InlineData("internacionalization", "internacionalize")]
[InlineData("enumeration", "enumerate")]
[InlineData("consolidator", "consolidate")]
[InlineData("tropicalism", "tropical")]
[InlineData("vandalism", "vandal")]
[InlineData("activeness", "active")]
[InlineData("remorsefulness", "remorseful")]
public void Step3(string from, string to)
{
    var filter = new PorterStemmerFilter();
    using (var reader = new StringReader(from))
    {
        var tokenSource = new TokenSource(reader);
        tokenSource.Next();

        filter.PerformStep3(tokenSource);

        Assert.Equal(to, tokenSource.ToString());
    }
}
```
And now, step 3's implementation:
```csharp
public void PerformStep3(TokenSource source)
{
    if (source.Size == 0) return;

    switch (source.Buffer[source.Size - 2])
    {
        case 'a':
            if (source.ChangeSuffix("ational", "ate")) break;
            source.ChangeSuffix("tional", "tion");
            break;
        case 'c':
            if (source.ChangeSuffix("enci", "ence")) break;
            source.ChangeSuffix("anci", "ance");
            break;
        case 'e':
            source.ChangeSuffix("izer", "ize");
            break;
        case 'l':
            if (source.ChangeSuffix("bli", "ble")) break;
            if (source.ChangeSuffix("alli", "al")) break;
            if (source.ChangeSuffix("entli", "ent")) break;
            if (source.ChangeSuffix("eli", "e")) break;
            source.ChangeSuffix("ousli", "ous");
            break;
        case 'o':
            if (source.ChangeSuffix("ization", "ize")) break;
            if (source.ChangeSuffix("ation", "ate")) break;
            source.ChangeSuffix("ator", "ate");
            break;
        case 's':
            if (source.ChangeSuffix("alism", "al")) break;
            if (source.ChangeSuffix("iveness", "ive")) break;
            if (source.ChangeSuffix("fulness", "ful")) break;
            source.ChangeSuffix("ousness", "ous");
            break;
        case 't':
            if (source.ChangeSuffix("aliti", "al")) break;
            if (source.ChangeSuffix("iviti", "ive")) break;
            source.ChangeSuffix("biliti", "ble");
            break;
        case 'g':
            source.ChangeSuffix("logi", "log");
            break;
    }
}
```
Notes:
- ChangeSuffix replaces the specified suffix only when the remaining stem has at least one consonant sequence.
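That contract can be sketched on its own, assuming a much simpler vowel test than the real filter uses (the helper names here are mine, not the TokenSource API, and the vowel check ignores the special handling of 'y'):

```csharp
using System;

// True if the stem contains at least one vowel-consonant sequence,
// i.e. Porter's measure m > 0 (simplified: 'y' is treated as a consonant).
static bool HasVowelConsonantSequence(string stem)
{
    bool prevVowel = false;
    foreach (char c in stem)
    {
        bool vowel = "aeiou".IndexOf(c) >= 0;
        if (prevVowel && !vowel) return true;
        prevVowel = vowel;
    }
    return false;
}

// Replace the suffix only when the remaining stem is "long enough";
// otherwise leave the word untouched.
static string ChangeSuffix(string word, string suffix, string replacement)
{
    if (!word.EndsWith(suffix, StringComparison.Ordinal)) return word;
    var stem = word.Substring(0, word.Length - suffix.Length);
    return HasVowelConsonantSequence(stem) ? stem + replacement : word;
}

Console.WriteLine(ChangeSuffix("relational", "ational", "ate")); // relate
Console.WriteLine(ChangeSuffix("rational", "ational", "ate"));   // rational (stem "r" is too short)
```

This is why "rational" survives step 3 unchanged in the test above while longer words get their suffixes rewritten.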
Step 4
Step 4 deals with -ic-, -ful, -ness, etc., in a similar fashion to step 3.
```csharp
public void PerformStep4(TokenSource source)
{
    if (source.Size == 0) return;

    switch (source.LastChar)
    {
        case 'e':
            if (source.ChangeSuffix("icate", "ic")) break;
            if (source.RemoveSuffix("ative")) break;
            source.ChangeSuffix("alize", "al");
            break;
        case 'i':
            source.ChangeSuffix("iciti", "ic");
            break;
        case 'l':
            if (source.ChangeSuffix("ical", "ic")) break;
            source.RemoveSuffix("ful");
            break;
        case 's':
            source.RemoveSuffix("ness");
            break;
    }
}
```
Notes:
- RemoveSuffix removes the specified suffix only when the remaining stem has at least one consonant sequence.
Step 5
Step 5 takes off -ant, -ence, etc.
```csharp
public void PerformStep5(TokenSource source)
{
    if (source.Size == 0) return;

    switch (source.Buffer[source.Size - 2])
    {
        case 'a':
            source.RemoveSuffix("al");
            return;
        case 'c':
            if (source.RemoveSuffix("ance")) return;
            source.RemoveSuffix("ence");
            return;
        case 'e':
            source.RemoveSuffix("er");
            return;
        case 'i':
            source.RemoveSuffix("ic");
            return;
        case 'l':
            if (source.RemoveSuffix("able")) return;
            source.RemoveSuffix("ible");
            return;
        case 'n':
            if (source.RemoveSuffix("ant")) return;
            if (source.RemoveSuffix("ement")) return;
            if (source.RemoveSuffix("ment")) return;
            source.RemoveSuffix("ent");
            return;
        case 'o':
            if (source.ChangeSuffix("tion", "t")) return;
            if (source.ChangeSuffix("sion", "s")) return;
            source.RemoveSuffix("ou");
            return;
        case 's':
            source.RemoveSuffix("ism");
            return;
        case 't':
            if (source.RemoveSuffix("ate")) return;
            source.RemoveSuffix("iti");
            return;
        case 'u':
            source.RemoveSuffix("ous");
            return;
        case 'v':
            source.RemoveSuffix("ive");
            return;
        case 'z':
            source.RemoveSuffix("ize");
            return;
        default:
            return;
    }
}
```
Step 6
Step 6 removes a final -e and reduces a double -l.
```csharp
public void PerformStep6(TokenSource source)
{
    switch (source.LastChar)
    {
        case 'e':
            var a = source.NumberOfConsoantSequences(source.Size - 1);
            if (a > 1 || a == 1 && !source.HasCvcAt(source.Size - 2))
                source.Size--;
            break;
        case 'l' when source.ContainsDoubleConsonantAt(source.Size - 1)
                   && source.NumberOfConsoantSequences(source.Size - 2) > 0:
            source.Size--;
            break;
    }
}
```
How it affects the indexing and the search results
Adopting the Porter Stemming Algorithm reduces the number of distinct tokens in the dictionary and increases the sizes of the posting lists.
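A toy illustration of that trade-off (again using a hypothetical suffix-stripping stand-in for the Porter filter, not the library's API): word forms that previously got their own dictionary entries collapse into one term, whose posting list then covers all of them.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the Porter filter.
static string Stem(string token) => token.TrimEnd('s');

var tokens = new[] { "cat", "cats", "meeting", "meetings" };

var distinctTokens = tokens.Distinct().Count();                // 4 dictionary entries without stemming
var distinctTerms  = tokens.Select(Stem).Distinct().Count();   // 2 entries with stemming

Console.WriteLine($"{distinctTokens} -> {distinctTerms}");     // 4 -> 2
```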
Let’s see what happens with the queries.
```csharp
[Theory]
[InlineData(
    new[]
    {
        "Human cannibalism is the act or practice of humans eating the flesh or internal organs of other human beings. ",
        "There are cannibals in some primitive communities.",
        "In marketing strategy, cannibalization refers to a reduction in sales volume, sales revenue,... ",
    },
    "Cannibalization",
    new[] { 0, 1, 2 }
)]
public void SearchByQuery_TestingPorterStemming(
    string[] documents,
    string query,
    int[] expectedResults
)
{
    var index = new StringIndexer().CreateIndex(documents);
    var searcher = new Searcher(index);

    var results = searcher.Search(query, DefaultAnalyzer.Instance);

    Assert.Equal(expectedResults, results);
}
```
In the test query, all documents are returned. Why? Because cannibalization, cannibals, and cannibalism are all stemmed to "cannib".
Final Words
In this post, I shared my implementation of the famous Porter Stemming Algorithm. Needless to say, I enjoyed writing it.
Again, I am still studying Information Retrieval, so if you find any errors in my implementation, let me know.