A First Take at Building an Inverted Index and Querying Using C#

In this post, I would like to share my first attempt to create a (still) elementary search library. The source code is available on my GitHub.

I am learning how to do it from the “Introduction to Information Retrieval” book (which is part of my current reading list). Also, all my code is inspired by the Ayende’s Corax project.

Major steps to build an inverted index

Following what I read, I would need:

  1. Collect the documents to be indexed – I will use simple strings for while;
  2. Tokenize the text, turning each document into a list of tokens
  3. Do linguistic preprocessing, producing a list of indexing terms
  4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.

So, let’s do it.

Collecting and tokenizing documents

Starting simple, I decided that my documents would be simple strings. So, here is a test to do it.

public class TokenSourceTests
{
    [Theory]
    [InlineData(
        "This is a simple string.",
        new[] {"This", "is", "a", "simple", "string"}
    )]
    public void TokenizationWorks(string input, string[] expected)
    {
        using (var reader = new StringReader(input))
        {
            var tokenSource = new TokenSource(reader);
            var result = tokenSource.ReadAll().ToArray();
            Assert.Equal(expected, result);
        }
    }
}

I will ignore punctuation and white spaces.

public sealed class TokenSource
{
    private readonly TextReader _reader;
        
    public TokenSource(TextReader reader)
    {
        _reader = reader;
    }

    public char[] Buffer { get; } = new char[256];
    public int Size { get; private set; } = 0;

    public bool Next()
    {
        Size = 0;
        int r;
        while ((r = _reader.Read()) != -1) // EOF
        {
            var ch = (char)r;

            if (ch == 'r' || ch == 'n' || char.IsWhiteSpace(ch) || char.IsPunctuation(ch))
            {
                if (Size > 0)
                {
                    return true;
                }
            }
            else
            {
                Recognize(ch);
            }
        }

        return Size > 0;
    }

    private void Recognize(char curr)
    {
        Buffer[Size++] = curr;
    }

    public IEnumerable<string> ReadAll()
    {
        while (Next())
        {
            yield return ToString();
        }
    }

    public IEnumerable<string> ReadAll(Func<TokenSource, bool> filter)
    {
        while (Next())
        {
            if (filter(this))
            {
                yield return ToString();
            }
        }
    }

    public override string ToString()
    {
        return new string(Buffer, 0, Size);
    }
}

The idea of using this array of chars looks excellent (that comes from Ayende). The code does not generate “garbage” strings.

Doing linguistic preprocessing

Having tokens, I need to normalize them (I will merely use lower cases) and to ignore stop words.

public class DefaultAnalyzerTests
{
    [Theory]
    [InlineData(
        "This is a simple string.",
        new[] { "simple", "string" }
    )]
    public void TokenizationWorks(string input, string[] expected)
    {
        using (var reader = new StringReader(input))
        {
            var tokenSource = new TokenSource(reader);
            var result = tokenSource
                .ReadAll(DefaultAnalyzer.Instance.Process)
                .ToArray();
            Assert.Equal(expected, result);
        }
    }
}

As you can see, DefaultAnalyzer class do all the filtering/normalizing  process.

using System;
using Tupi.Indexing.Filters;

namespace Tupi.Indexing
{
    public class DefaultAnalyzer : IAnalyzer
    {
        private readonly IFilter[] _filters =
        {
            new ToLowerFilter(), 
            new StopWordsFilter()
        };

        public bool Process(TokenSource source)
        {
            for (var i = 0; i < _filters.Length; i++)
            {
                if (!_filters[i].Process(source))
                    return false;
            }
            return true;
        }

        private static readonly Lazy<DefaultAnalyzer> LazyInstance = 
            new Lazy(() => new DefaultAnalyzer());
        public static DefaultAnalyzer Instance => LazyInstance.Value;
    }
}

The analyzer uses two filters. One to make all tokens lower case.

public class ToLowerFilter : IFilter
{
    public bool Process(TokenSource source)
    {
        for (var i = 0; i < source.Size; i++)
        {
            source.Buffer[i] = char.ToLowerInvariant(source.Buffer[i]);
        }
        return true;
    }
}

One to ignore stop words.

using System.Collections.Generic;

namespace Tupi.Indexing.Filters
{
    // based on:
    // https://github.com/ayende/Corax/blob/master/Corax/Indexing/Filters/StopWordsFilter.cs
    public class StopWordsFilter : IFilter
    {
        private static readonly string[] DefaultStopWords =
        {
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
            "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"
        };

        private readonly HashSet<ArraySegmentKey> _stopWords
            = new HashSet<ArraySegmentKey>();

        public StopWordsFilter()
        {
            foreach (var word in DefaultStopWords)
            {
                _stopWords.Add(new ArraySegmentKey(word.ToCharArray()));
            }
        }

        public bool Process(TokenSource source)
        {
            var term = new ArraySegmentKey(source.Buffer, source.Size);
            return !_stopWords.Contains(term);
        }
    }
}

Creating an inverted index

Now that we have all the components, the indexing process is simple.

public class StringIndexer
{
    public static InvertedIndex CreateIndex(
        params string[] documents
    )
    {
        var result = new InvertedIndex();

        for (var i = 0; i < documents.Length; i++)
        {
            using (var reader = new StringReader(documents[i]))
            {
                var tokenSource = new TokenSource(reader);

                var tokens = tokenSource
                    .ReadAll(DefaultAnalyzer.Instance.Process)
                    .Distinct()
                    .ToArray();

                foreach (var token in tokens)
                {
                    result.Append(token, i);
                }
            }
        }

        return result;
    }
}

I tokenize all the documents, updating a simple data structure.

public class InvertedIndex
{
    private readonly IDictionary<string, List> _data = 
        new Dictionary<string, List>();
    internal void Append(string term, int documentId)
    {
        if (_data.ContainsKey(term))
        {
            _data[term].Add(documentId);
        }
        else
        {
            var postings = new List {documentId};
            _data.Add(term, postings);
        }
    }
    // ...
}

As you can see, an inverted index is just a dictionary of terms, where the values are lists of documents containing these terms.

Processing Boolean Queries

The technique for boolean queries is simple. Considering that I have to terms to query, I would need:

  1. Locate the first term in the dictionary
  2. Retrieve its postings
  3. Locate the second term in the dictionary
  4. Retrieve its postings
  5. Intersect the two posting lists.

Here is my test:

[Theory]
[InlineData(
    new[]
    {
        "Elemar is learning how to create an inverted index",
        "It is cool to use csharp.",
        "This is a simple string document created by Elemar",
        "This string will be indexed in an inverted index generated by Elemar"
    },
    "Elemar inverted",
    new[] { 0, 3 }
)]
public void SearchByQuery(
    string[] documents,
    string query,
    int[] expectedResults
)
{
    string[] terms;
    using (var reader = new StringReader(query))
    {
        terms = new TokenSource(reader)
            .ReadAll(DefaultAnalyzer.Instance.Process)
            .ToArray();
    }

    var index = StringIndexer.CreateIndex(documents);
    var results = index.Search(terms);
    Assert.Equal(expectedResults, results);
}

I start splitting the query into terms and then using these terms to search.

Here is my implementation:

public IEnumerable<int> Search(params string[] terms)
{
    // no terms
    if (terms.Length == 0)
    {
        return Enumerable.Empty();
    }

    // only one term
    if (terms.Length == 1)
    {
        return _data.ContainsKey(terms[0])
            ? _data[terms[0]]
            : Enumerable.Empty();
    }

    var postings = new List<List>();
    for (var i = 0; i < terms.Length; i++)
    {
        // there is a term with no results
        if (!_data.ContainsKey(terms[i]))
        {
            return Enumerable.Empty();
        }
        postings.Add(_data[terms[i]]);
    }

    // 
    postings.Sort((l1, l2) => l1.Count.CompareTo(l2.Count));

    var results = postings[0];

    for (var i = 1; i < postings.Count; i++)
    {
        results = results.Intersect(postings[1]).ToList();
    }

    return results;
}

That’s it.

Conclusions

It is just my first attempt in years to work with inverted indexes. It’s not that hard, but it is not that simple too.

Ayende’s Corax project was an excellent reference for tokenizing and analyzing documents. Also, the “Information Retrieval” book that I have been reading is straightforward to follow and understand.

Please, share your thoughts about the code. Let’s improve it!

 

Compartilhe este insight:

5 respostas

  1. Hi Elemar, which book do you refer to for information retrieval reference? I would like to use it/read it and then even use python as well. Thank you.

Deixe uma resposta

O seu endereço de e-mail não será publicado. Campos obrigatórios são marcados com *

Elemar Júnior

Sou fundador e CEO da EximiaCo e atuo como tech trusted advisor ajudando diversas empresas a gerar mais resultados através da tecnologia.

Elemar Júnior

Sou fundador e CEO da EximiaCo e atuo como tech trusted advisor ajudando diversas empresas a gerar mais resultados através da tecnologia.

Mais insights para o seu negócio

Veja mais alguns estudos e reflexões que podem gerar alguns insights para o seu negócio:

Ao utilizar recursos não gerenciados, em .NET, precisamos garantir que estes sejam devidamente liberados. Para isso, quando criamos classes que...
As professionals, we should focus on developing great technical solutions. But, these solutions need to solve real problems. At the...
Investimos consideravelmente no planejamento estratégico de nossas organizações. Entretanto, a execução não tem sido tarefa fácil. Em minha interpretação, boa...
Este exemplo é inspirado no livro do Ayende Se você deseja aprender RavenDB, recomendo que se inscreva no RavenDB bootcamp...
I believe that, from time to time, it is interesting to learn a new language or framework that takes us...
In this post, I would like to explain a basic but confusing concept of CUDA programming: Thread Hierarchies. It will...