A First Take at Building an Inverted Index and Querying Using C#

In this post, I would like to share my first attempt to create a (still) elementary search library. The source code is available on my GitHub.

I am learning how to do it from the “Introduction to Information Retrieval” book (which is part of my current reading list). Also, all my code is inspired by Ayende’s Corax project.

Major steps to build an inverted index

Following what I read, I would need to:

  1. Collect the documents to be indexed – I will use simple strings for now;
  2. Tokenize the text, turning each document into a list of tokens;
  3. Do linguistic preprocessing, producing a list of indexing terms;
  4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

So, let’s do it.
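Before diving into each step, here is roughly how I expect the pieces to fit together in the end. StringIndexer, InvertedIndex, and its Search method are all built throughout this post; the snippet below is just a sketch of the intended usage (note that Search expects terms that have already been analyzed, i.e., lower-cased and free of stop words):

var index = StringIndexer.CreateIndex(
    "Elemar is learning how to create an inverted index",
    "It is cool to use csharp."
);

// ids of the documents containing both terms
var documentIds = index.Search("inverted", "index");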

Collecting and tokenizing documents

Starting simple, I decided that my documents would be plain strings. So, here is a test describing the tokenization I expect.

public class TokenSourceTests
{
    [Theory]
    [InlineData(
        "This is a simple string.",
        new[] {"This", "is", "a", "simple", "string"}
    )]
    public void TokenizationWorks(string input, string[] expected)
    {
        using (var reader = new StringReader(input))
        {
            var tokenSource = new TokenSource(reader);
            var result = tokenSource.ReadAll().ToArray();
            Assert.Equal(expected, result);
        }
    }
}

I will ignore punctuation and white spaces.

public sealed class TokenSource
{
    private readonly TextReader _reader;
        
    public TokenSource(TextReader reader)
    {
        _reader = reader;
    }

    public char[] Buffer { get; } = new char[256];
    public int Size { get; private set; } = 0;

    public bool Next()
    {
        Size = 0;
        int r;
        while ((r = _reader.Read()) != -1) // EOF
        {
            var ch = (char)r;

            if (ch == '\r' || ch == '\n' || char.IsWhiteSpace(ch) || char.IsPunctuation(ch))
            {
                if (Size > 0)
                {
                    return true;
                }
            }
            else
            {
                Recognize(ch);
            }
        }

        return Size > 0;
    }

    private void Recognize(char curr)
    {
        Buffer[Size++] = curr;
    }

    public IEnumerable<string> ReadAll()
    {
        while (Next())
        {
            yield return ToString();
        }
    }

    public IEnumerable<string> ReadAll(Func<TokenSource, bool> filter)
    {
        while (Next())
        {
            if (filter(this))
            {
                yield return ToString();
            }
        }
    }

    public override string ToString()
    {
        return new string(Buffer, 0, Size);
    }
}

The idea of reusing a single char array (which comes from Ayende) looks excellent: the tokenizer does not generate “garbage” strings while scanning the input.

Doing linguistic preprocessing

Having the tokens, I need to normalize them (for now, merely lower-casing) and to ignore stop words.

public class DefaultAnalyzerTests
{
    [Theory]
    [InlineData(
        "This is a simple string.",
        new[] { "simple", "string" }
    )]
    public void TokenizationWorks(string input, string[] expected)
    {
        using (var reader = new StringReader(input))
        {
            var tokenSource = new TokenSource(reader);
            var result = tokenSource
                .ReadAll(DefaultAnalyzer.Instance.Process)
                .ToArray();
            Assert.Equal(expected, result);
        }
    }
}

As you can see, the DefaultAnalyzer class does all the filtering/normalization.

using System;
using Tupi.Indexing.Filters;

namespace Tupi.Indexing
{
    public class DefaultAnalyzer : IAnalyzer
    {
        private readonly IFilter[] _filters =
        {
            new ToLowerFilter(), 
            new StopWordsFilter()
        };

        public bool Process(TokenSource source)
        {
            for (var i = 0; i < _filters.Length; i++)
            {
                if (!_filters[i].Process(source))
                    return false;
            }
            return true;
        }

        private static readonly Lazy<DefaultAnalyzer> LazyInstance = 
            new Lazy<DefaultAnalyzer>(() => new DefaultAnalyzer());
        public static DefaultAnalyzer Instance => LazyInstance.Value;
    }
}
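By the way, the IFilter and IAnalyzer interfaces are not reproduced in this post. Judging by how they are used, they amount to something like the sketch below (the actual definitions in the repository may differ):

// Sketch only: inferred from usage, assuming TokenSource lives in Tupi.Indexing,
// as the DefaultAnalyzer snippet suggests.
namespace Tupi.Indexing
{
    public interface IAnalyzer
    {
        // Returns true when the token survives all filters.
        bool Process(TokenSource source);
    }
}

namespace Tupi.Indexing.Filters
{
    using Tupi.Indexing;

    public interface IFilter
    {
        // Returning false means the current token should be discarded.
        bool Process(TokenSource source);
    }
}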

The analyzer uses two filters. The first one makes all tokens lower case.

public class ToLowerFilter : IFilter
{
    public bool Process(TokenSource source)
    {
        for (var i = 0; i < source.Size; i++)
        {
            source.Buffer[i] = char.ToLowerInvariant(source.Buffer[i]);
        }
        return true;
    }
}

The second one ignores stop words.

using System.Collections.Generic;

namespace Tupi.Indexing.Filters
{
    // based on:
    // https://github.com/ayende/Corax/blob/master/Corax/Indexing/Filters/StopWordsFilter.cs
    public class StopWordsFilter : IFilter
    {
        private static readonly string[] DefaultStopWords =
        {
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
            "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
        };

        private readonly HashSet<ArraySegmentKey> _stopWords
            = new HashSet<ArraySegmentKey>();

        public StopWordsFilter()
        {
            foreach (var word in DefaultStopWords)
            {
                _stopWords.Add(new ArraySegmentKey(word.ToCharArray()));
            }
        }

        public bool Process(TokenSource source)
        {
            var term = new ArraySegmentKey(source.Buffer, source.Size);
            return !_stopWords.Contains(term);
        }
    }
}
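ArraySegmentKey is another type that the post does not show; it comes from the Corax-inspired part of the code. Conceptually, it wraps a char array (or a prefix of it) and implements equality and hashing over the characters, so the token currently sitting in the shared buffer can be looked up in the HashSet without allocating a string. A minimal sketch of such a type (not necessarily the exact implementation in the repository) could be:

using System;

// Sketch only: a value type that compares char sequences by content,
// so tokens can be matched against stop words without creating strings.
public struct ArraySegmentKey : IEquatable<ArraySegmentKey>
{
    private readonly char[] _array;
    private readonly int _size;

    public ArraySegmentKey(char[] array) : this(array, array.Length) { }

    public ArraySegmentKey(char[] array, int size)
    {
        _array = array;
        _size = size;
    }

    public bool Equals(ArraySegmentKey other)
    {
        if (_size != other._size) return false;
        for (var i = 0; i < _size; i++)
        {
            if (_array[i] != other._array[i]) return false;
        }
        return true;
    }

    public override bool Equals(object obj) =>
        obj is ArraySegmentKey other && Equals(other);

    public override int GetHashCode()
    {
        unchecked
        {
            var hash = 17;
            for (var i = 0; i < _size; i++)
            {
                hash = hash * 31 + _array[i];
            }
            return hash;
        }
    }
}

Note that the keys stored in _stopWords own their arrays (built from word.ToCharArray()), while the lookup key only wraps the tokenizer’s buffer; since the lookup never adds anything to the set, that sharing is safe.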

Creating an inverted index

Now that we have all the components, the indexing process is simple.

public class StringIndexer
{
    public static InvertedIndex CreateIndex(
        params string[] documents
    )
    {
        var result = new InvertedIndex();

        for (var i = 0; i < documents.Length; i++)
        {
            using (var reader = new StringReader(documents[i]))
            {
                var tokenSource = new TokenSource(reader);

                var tokens = tokenSource
                    .ReadAll(DefaultAnalyzer.Instance.Process)
                    .Distinct()
                    .ToArray();

                foreach (var token in tokens)
                {
                    result.Append(token, i);
                }
            }
        }

        return result;
    }
}

I tokenize all the documents, updating a simple data structure.

public class InvertedIndex
{
    private readonly IDictionary<string, List<int>> _data = 
        new Dictionary<string, List<int>>();
    internal void Append(string term, int documentId)
    {
        if (_data.ContainsKey(term))
        {
            _data[term].Add(documentId);
        }
        else
        {
            var postings = new List<int> {documentId};
            _data.Add(term, postings);
        }
    }
    // ...
}

As you can see, an inverted index is just a dictionary of terms, where each value is the list of ids of the documents containing that term (its postings list).
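To make that concrete, indexing these two small documents (ids 0 and 1):

0: "Elemar is learning how to create an inverted index"
1: "It is cool to use csharp."

produces roughly the following entries, after lower-casing and stop-word removal:

"elemar"   -> [0]
"learning" -> [0]
"how"      -> [0]
"create"   -> [0]
"inverted" -> [0]
"index"    -> [0]
"cool"     -> [1]
"use"      -> [1]
"csharp"   -> [1]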

Processing Boolean Queries

The technique for Boolean queries is simple. Considering that I have two terms to query, I would need to:

  1. Locate the first term in the dictionary
  2. Retrieve its postings
  3. Locate the second term in the dictionary
  4. Retrieve its postings
  5. Intersect the two posting lists (see the worked example below).
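As an illustration, take the four documents from the test below. The query "Elemar inverted" is analyzed into the terms "elemar" and "inverted", and the search boils down to intersecting their postings:

"elemar"   -> [0, 2, 3]
"inverted" -> [0, 3]

intersection -> [0, 3]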

Here is my test:

[Theory]
[InlineData(
    new[]
    {
        "Elemar is learning how to create an inverted index",
        "It is cool to use csharp.",
        "This is a simple string document created by Elemar",
        "This string will be indexed in an inverted index generated by Elemar"
    },
    "Elemar inverted",
    new[] { 0, 3 }
)]
public void SearchByQuery(
    string[] documents,
    string query,
    int[] expectedResults
)
{
    string[] terms;
    using (var reader = new StringReader(query))
    {
        terms = new TokenSource(reader)
            .ReadAll(DefaultAnalyzer.Instance.Process)
            .ToArray();
    }

    var index = StringIndexer.CreateIndex(documents);
    var results = index.Search(terms);
    Assert.Equal(expectedResults, results);
}

I start by splitting the query into terms and then use these terms to search the index.

Here is my implementation:

public IEnumerable<int> Search(params string[] terms)
{
    // no terms
    if (terms.Length == 0)
    {
        return Enumerable.Empty<int>();
    }

    // only one term
    if (terms.Length == 1)
    {
        return _data.ContainsKey(terms[0])
            ? _data[terms[0]]
            : Enumerable.Empty<int>();
    }

    var postings = new List<List<int>>();
    for (var i = 0; i < terms.Length; i++)
    {
        // there is a term with no results
        if (!_data.ContainsKey(terms[i]))
        {
            return Enumerable.Empty<int>();
        }
        postings.Add(_data[terms[i]]);
    }

    // sort by size so the smallest posting lists are intersected first
    postings.Sort((l1, l2) => l1.Count.CompareTo(l2.Count));

    var results = postings[0];

    for (var i = 1; i < postings.Count; i++)
    {
        results = results.Intersect(postings[i]).ToList();
    }

    return results;
}

That’s it.

Conclusions

This is just my first attempt in years at working with inverted indexes. It’s not that hard, but it’s not that simple either.

Ayende’s Corax project was an excellent reference for tokenizing and analyzing documents. Also, the “Information Retrieval” book that I have been reading is straightforward to follow and understand.

Please, share your thoughts about the code. Let’s improve it!

 
