Processing Boolean Queries using C#

In the first post of this series, I explained how to produce an inverted index using C#. In the second post, I explained how to improve the tokenization process, using the Porter Stemming Algorithm. Now, I will share how to process boolean queries.

Inverted Index?

Within a  document collection, I assume that each document has a unique serial number (a documentId). During the index construction I simple assign successive integer to each new document.

The input to the indexing process is an enumeration of normalized tokens for each document.  The output is a dictionary where the keys are the tokens, and the values are lists of the documentIds of the documents which contains that token.

using System.Collections.Generic;
using System.Linq;

namespace Tupi.Indexing
{
    public class InvertedIndex
    {
        private readonly ISet _indexedDocuments = 
            new SortedSet();
        public IEnumerable<int> IndexedDocuments => _indexedDocuments;

        private readonly IDictionary<string, List<int>> _data = 
            new Dictionary<string, List>();

        internal void Append(string term, int documentId)
        {
            if (_data.ContainsKey(term))
            {
                _data[term].Add(documentId);
            }
            else
            {
                var postings = new List {documentId};
                _data.Add(term, postings);
            }

            _indexedDocuments.Add(documentId);
        }

        public IEnumerable<int> GetPostingsFor(string term) => !_data.ContainsKey(term) 
            ? Enumerable.Empty() 
            : _data[term];
    }
}

The Query Abstraction

Given an inverted index, I want to be able to perform a query that results in all documents that meet the specified criteria.

In my implementation, the Query class is a base class and a factory.

using System.Collections.Generic;
using Tupi.Indexing;

namespace Tupi.Querying.Queries
{
    public abstract class Query
    {
        public abstract IEnumerable<int> Execute(InvertedIndex index);

        public static Query Term(string term) =>
            TermQuery.From(term);

        public static Query All(params string[] terms) =>
            AllQuery.From(terms);

        public static Query All(IEnumerable<int> queries) =>
            new AllQuery(queries);

        public static Query And(Query left, Query right) =>
            new AllQuery(left, right);

        public static Query Any(params string[] terms) =>
            AnyQuery.From(terms);

        public static Query Or(Query left, Query right) =>
            new AnyQuery(left, right);

        public static Query Not(Query innerQuery) 
            => new NotQuery(innerQuery);
    }
}

Use this abstraction to define queries is simple.

var query1 = Query.Or(
    Query.Term("csharp"),
    Query.And(
        Query.Term("Elemar"),
        Query.Term("inverted")
    )
);

var query2 = Query.Not(
    Query.And(
        Query.Term("Elemar"),
        Query.Term("inverted")
    )
);

Querying terms

Retrieving the list of documents containing a term is pretty simple. We just need to normalize the term and then check the dictionary.

using System;
using System.Collections.Generic;
using System.IO;
using Tupi.Indexing;

namespace Tupi.Querying.Queries
{
    public class TermQuery : Query
    {
        private readonly string _term;

        public TermQuery(string term)
        {
            _term = term;
        }

        public override IEnumerable<int> Execute(InvertedIndex index) 
            => index.GetPostingsFor(_term);

        public static Query From(string term) =>
            From(term, DefaultAnalyzer.Instance);

        public static Query From(string term, IAnalyzer analyzer)
        {
            using (var reader = new StringReader(term))
            {
                var tokenSource = new TokenSource(reader);
                tokenSource.Next();

                if (!analyzer.Process(tokenSource))
                {
                    throw new InvalidOperationException($"Could not generate a term from: {term}");
                }

                return new TermQuery(tokenSource.ToString());
            }
        }
    }
}

AND operator

Then AND implementation just intersects the results of two (or more) queries.

using System.Collections.Generic;
using System.Linq;
using Tupi.Indexing;

namespace Tupi.Querying.Queries
{
    public class AllQuery : Query
    {
        private readonly IEnumerable<Query> _queries;

        public AllQuery(IEnumerable<Query> queries)
        {
            _queries = queries;
        }

        public AllQuery(params Query[] queries)
        {
            _queries = queries;
        }

        public override IEnumerable<int> Execute(InvertedIndex index)
        {
            var postings = new List<List>();
            foreach (var query in _queries)
            {
                var l = query.Execute(index).ToList();

                if (!l.Any())
                {
                    return Enumerable.Empty();
                }

                postings.Add(l.ToList());
            }

            postings.Sort((l1, l2) => l1.Count.CompareTo(l2.Count));

            var results = postings[0];

            for (var i = 1; i < postings.Count; i++) { results = results.Intersect(postings[1]).ToList(); } return results; } public static Query From(params string[] terms) => 
            new AllQuery(terms.Select(TermQuery.From).ToList());
    }
}

OR operator

The OR implementation just merges the results of two (or more) queries.

using System.Collections.Generic;
using System.Linq;
using Tupi.Indexing;

namespace Tupi.Querying.Queries
{
    public class AnyQuery : Query
    {
        private readonly IEnumerable<Query> _queries;

        public AnyQuery(IEnumerable<Query> queries)
        {
            _queries = queries;
        }

        public AnyQuery(params Query[] queries)
        {
            _queries = queries;
        }

        public override IEnumerable<int> Execute(InvertedIndex index)
        {
            var results = Enumerable.Empty();
            return _queries.Aggregate(
                seed: Enumerable.Empty(),
                func: (current, query) => current.Union(query.Execute(index))
            );
        }

        public static Query From(params string[] terms) =>
            new AnyQuery(terms.Select(TermQuery.From).ToList());
    }
}

NOT operator

The NOT implementation returns all indexed documents, except the returned by the execution of the inner query.

using System.Collections.Generic;
using System.Linq;
using Tupi.Indexing;

namespace Tupi.Querying.Queries
{
    public class NotQuery : Query
    {
        private readonly Query _query;

        public NotQuery(Query query)
        {
            _query = query;
        }

        public override IEnumerable<int> Execute(InvertedIndex index) => 
            index.IndexedDocuments.Except(_query.Execute(index));
    }
}

Last words

In this post a shared how to implement boolean queries on an inverted index. I am pretty sure that there are a lot of improvement opportunities here. Please give me your recommendations in the comments.

Compartilhe este insight:

Elemar Júnior

Sou fundador e CEO da EximiaCo e atuo como tech trusted advisor ajudando diversas empresas a gerar mais resultados através da tecnologia.

Elemar Júnior

Sou fundador e CEO da EximiaCo e atuo como tech trusted advisor ajudando diversas empresas a gerar mais resultados através da tecnologia.

Mais insights para o seu negócio

Veja mais alguns estudos e reflexões que podem gerar alguns insights para o seu negócio:

Ligo a TV e assisto algum político falando besteiras. Acesso o Instagram e vejo a foto de um casal de...
When designing systems that need to scale you always need to remember that using better resources could help you to...
Would you like to learn about NoSQL? Are you looking for help to make your first steps using RavenDB? So,...
Um dos benefícios mais bacanas do programa MVP é o direito a participação no MVP Global Summit – evento anual,...
Há pouco menos de um ano, aceitei o desafio de liderar o time de tecnologia e estratégia de negócios da...
Tem coisas que a gente até sabe, mas ignora pela vaidade… Uma das features mais aclamadas do Microsoft Teams, para...
Oferta de pré-venda!

Mentoria em
Arquitetura de Software

Práticas, padrões & técnicas para Arquitetura de Software, de maneira efetiva, com base em cenários reais para profissionais envolvidos no projeto e implantação de software.

× Precisa de ajuda?