In this post, I would like to share my first attempt to create a (still) elementary search library. The source code is available on my GitHub.
I am learning how to do it from the “Introduction to Information Retrieval” book (which is part of my current reading list). Also, all my code is inspired by Ayende’s Corax project.
Major steps to build an inverted index
Following what I read, I would need to:
- Collect the documents to be indexed (I will use simple strings for now);
- Tokenize the text, turning each document into a list of tokens
- Do linguistic preprocessing, producing a list of indexing terms
- Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
So, let’s do it.
Collecting and tokenizing documents
Starting simple, I decided that my documents would be plain strings. So, here is a test for the tokenization.
public class TokenSourceTests
{
    [Theory]
    [InlineData(
        "This is a simple string.",
        new[] {"This", "is", "a", "simple", "string"}
    )]
    public void TokenizationWorks(string input, string[] expected)
    {
        using (var reader = new StringReader(input))
        {
            var tokenSource = new TokenSource(reader);
            var result = tokenSource.ReadAll().ToArray();

            Assert.Equal(expected, result);
        }
    }
}
I will ignore punctuation and white spaces.
public sealed class TokenSource
{
    private readonly TextReader _reader;

    public TokenSource(TextReader reader)
    {
        _reader = reader;
    }

    public char[] Buffer { get; } = new char[256];
    public int Size { get; private set; } = 0;

    public bool Next()
    {
        Size = 0;
        int r;
        while ((r = _reader.Read()) != -1) // EOF
        {
            var ch = (char)r;
            if (ch == '\r' || ch == '\n' ||
                char.IsWhiteSpace(ch) || char.IsPunctuation(ch))
            {
                if (Size > 0)
                {
                    return true;
                }
            }
            else
            {
                Recognize(ch);
            }
        }
        return Size > 0;
    }

    private void Recognize(char curr)
    {
        Buffer[Size++] = curr;
    }

    public IEnumerable<string> ReadAll()
    {
        while (Next())
        {
            yield return ToString();
        }
    }

    public IEnumerable<string> ReadAll(Func<TokenSource, bool> filter)
    {
        while (Next())
        {
            if (filter(this))
            {
                yield return ToString();
            }
        }
    }

    public override string ToString()
    {
        return new string(Buffer, 0, Size);
    }
}
The idea of reusing this char buffer is excellent (it comes from Ayende): the tokenizer does not generate “garbage” strings while scanning the input.
Doing linguistic preprocessing
Having tokens, I need to normalize them (I will simply lower-case them) and ignore stop words.
public class DefaultAnalyzerTests
{
    [Theory]
    [InlineData(
        "This is a simple string.",
        new[] { "simple", "string" }
    )]
    public void TokenizationWorks(string input, string[] expected)
    {
        using (var reader = new StringReader(input))
        {
            var tokenSource = new TokenSource(reader);
            var result = tokenSource
                .ReadAll(DefaultAnalyzer.Instance.Process)
                .ToArray();

            Assert.Equal(expected, result);
        }
    }
}
As you can see, the DefaultAnalyzer class does all the filtering and normalization.
using System;
using Tupi.Indexing.Filters;

namespace Tupi.Indexing
{
    public class DefaultAnalyzer : IAnalyzer
    {
        private readonly IFilter[] _filters =
        {
            new ToLowerFilter(),
            new StopWordsFilter()
        };

        public bool Process(TokenSource source)
        {
            for (var i = 0; i < _filters.Length; i++)
            {
                if (!_filters[i].Process(source))
                    return false;
            }
            return true;
        }

        private static readonly Lazy<DefaultAnalyzer> LazyInstance =
            new Lazy<DefaultAnalyzer>(() => new DefaultAnalyzer());

        public static DefaultAnalyzer Instance => LazyInstance.Value;
    }
}
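The IAnalyzer and IFilter interfaces are not listed in this post. Judging by how DefaultAnalyzer and the filters use them, they are presumably just thin contracts over the token buffer; the exact declarations below are an assumption:

// Assumed shape of the interfaces (not shown in the post).
// A filter inspects or mutates the TokenSource buffer in place and
// returns false to reject the current token.
public interface IFilter
{
    bool Process(TokenSource source);
}

// An analyzer runs a token through its whole filter chain.
public interface IAnalyzer
{
    bool Process(TokenSource source);
}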
The analyzer uses two filters. The first one makes all tokens lower case.
public class ToLowerFilter : IFilter
{
    public bool Process(TokenSource source)
    {
        for (var i = 0; i < source.Size; i++)
        {
            source.Buffer[i] = char.ToLowerInvariant(source.Buffer[i]);
        }
        return true;
    }
}
The second one ignores stop words.
using System.Collections.Generic;

namespace Tupi.Indexing.Filters
{
    // based on:
    // https://github.com/ayende/Corax/blob/master/Corax/Indexing/Filters/StopWordsFilter.cs
    public class StopWordsFilter : IFilter
    {
        private static readonly string[] DefaultStopWords =
        {
            "a", "an", "and", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is", "it",
            "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "was", "will", "with"
        };

        private readonly HashSet<ArraySegmentKey> _stopWords =
            new HashSet<ArraySegmentKey>();

        public StopWordsFilter()
        {
            foreach (var word in DefaultStopWords)
            {
                _stopWords.Add(new ArraySegmentKey(word.ToCharArray()));
            }
        }

        public bool Process(TokenSource source)
        {
            var term = new ArraySegmentKey(source.Buffer, source.Size);
            return !_stopWords.Contains(term);
        }
    }
}
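The ArraySegmentKey type is not shown here either (Corax has a generic ArraySegmentKey<T> playing the same role). The idea is a hashable key over a slice of a char array, so the stop-word lookup does not allocate a string per token. A minimal sketch, assuming a non-generic version over char buffers:

using System;

// Hypothetical sketch of ArraySegmentKey: equality and hashing over the
// first Size characters of a char buffer.
public struct ArraySegmentKey : IEquatable<ArraySegmentKey>
{
    private readonly char[] _buffer;
    private readonly int _size;

    public ArraySegmentKey(char[] buffer) : this(buffer, buffer.Length) { }

    public ArraySegmentKey(char[] buffer, int size)
    {
        _buffer = buffer;
        _size = size;
    }

    public bool Equals(ArraySegmentKey other)
    {
        if (_size != other._size) return false;
        for (var i = 0; i < _size; i++)
        {
            if (_buffer[i] != other._buffer[i]) return false;
        }
        return true;
    }

    public override bool Equals(object obj)
    {
        return obj is ArraySegmentKey && Equals((ArraySegmentKey)obj);
    }

    public override int GetHashCode()
    {
        // simple hash over the used portion of the buffer
        unchecked
        {
            var hash = 17;
            for (var i = 0; i < _size; i++)
            {
                hash = hash * 31 + _buffer[i];
            }
            return hash;
        }
    }
}

Note that a key built over the shared TokenSource.Buffer is only safe for lookups; the stop words themselves are stored over private copies created by ToCharArray.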
Creating an inverted index
Now that we have all the components, the indexing process is simple.
public class StringIndexer
{
    public static InvertedIndex CreateIndex(params string[] documents)
    {
        var result = new InvertedIndex();
        for (var i = 0; i < documents.Length; i++)
        {
            using (var reader = new StringReader(documents[i]))
            {
                var tokenSource = new TokenSource(reader);
                var tokens = tokenSource
                    .ReadAll(DefaultAnalyzer.Instance.Process)
                    .Distinct()
                    .ToArray();

                foreach (var token in tokens)
                {
                    result.Append(token, i);
                }
            }
        }
        return result;
    }
}
I tokenize all the documents, updating a simple data structure.
public class InvertedIndex
{
    private readonly IDictionary<string, List<int>> _data =
        new Dictionary<string, List<int>>();

    internal void Append(string term, int documentId)
    {
        if (_data.ContainsKey(term))
        {
            _data[term].Add(documentId);
        }
        else
        {
            var postings = new List<int> {documentId};
            _data.Add(term, postings);
        }
    }

    // ...
}
As you can see, an inverted index is just a dictionary of terms, where each value is a posting list: the list of the documents that contain that term.
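For instance, indexing two small (made-up) documents would produce a mapping like this:

// A sketch of what the index holds after indexing two documents:
var index = StringIndexer.CreateIndex(
    "Tupi is a search library",                  // document 0
    "an inverted index maps terms to documents"  // document 1
);
// _data (conceptually):
//   "tupi"      -> [0]
//   "search"    -> [0]
//   "library"   -> [0]
//   "inverted"  -> [1]
//   "index"     -> [1]
//   "maps"      -> [1]
//   "terms"     -> [1]
//   "documents" -> [1]
// The stop words ("is", "a", "an", "to") never make it into the dictionary.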
Processing Boolean Queries
The technique for Boolean queries is simple. Considering that I have two terms to query, I would need to:
- Locate the first term in the dictionary
- Retrieve its postings
- Locate the second term in the dictionary
- Retrieve its postings
- Intersect the two posting lists (a sketch of the classic merge for this last step follows the list).
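My implementation further down simply leans on LINQ’s Intersect, but the classic approach from the book walks two sorted posting lists with a pair of cursors and intersects them in linear time. Here is a rough sketch of that merge (a hypothetical helper, not part of the current code), assuming the posting lists are sorted by document id, which they are, since documents are indexed in order:

// Textbook linear-time intersection of two sorted posting lists.
private static List<int> IntersectSorted(List<int> first, List<int> second)
{
    var results = new List<int>();
    int i = 0, j = 0;
    while (i < first.Count && j < second.Count)
    {
        if (first[i] == second[j])
        {
            // document appears in both lists
            results.Add(first[i]);
            i++;
            j++;
        }
        else if (first[i] < second[j])
        {
            i++; // advance the cursor pointing at the smaller id
        }
        else
        {
            j++;
        }
    }
    return results;
}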
Here is my test:
[Theory]
[InlineData(
    new[]
    {
        "Elemar is learning how to create an inverted index",
        "It is cool to use csharp.",
        "This is a simple string document created by Elemar",
        "This string will be indexed in an inverted index generated by Elemar"
    },
    "Elemar inverted",
    new[] { 0, 3 }
)]
public void SearchByQuery(
    string[] documents,
    string query,
    int[] expectedResults
)
{
    string[] terms;
    using (var reader = new StringReader(query))
    {
        terms = new TokenSource(reader)
            .ReadAll(DefaultAnalyzer.Instance.Process)
            .ToArray();
    }

    var index = StringIndexer.CreateIndex(documents);
    var results = index.Search(terms);

    Assert.Equal(expectedResults, results);
}
I start by splitting the query into terms and then use those terms to search the index.
Here is my implementation:
public IEnumerable<int> Search(params string[] terms)
{
    // no terms
    if (terms.Length == 0)
    {
        return Enumerable.Empty<int>();
    }

    // only one term
    if (terms.Length == 1)
    {
        return _data.ContainsKey(terms[0])
            ? _data[terms[0]]
            : Enumerable.Empty<int>();
    }

    var postings = new List<List<int>>();
    for (var i = 0; i < terms.Length; i++)
    {
        // there is a term with no results
        if (!_data.ContainsKey(terms[i]))
        {
            return Enumerable.Empty<int>();
        }
        postings.Add(_data[terms[i]]);
    }

    // postings.Sort((l1, l2) => l1.Count.CompareTo(l2.Count));

    var results = postings[0];
    for (var i = 1; i < postings.Count; i++)
    {
        results = results.Intersect(postings[i]).ToList();
    }
    return results;
}
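One detail worth noting: the commented-out Sort would put the shortest posting lists first before intersecting. That is the classic optimization described in the book, because starting with the rarest terms keeps the intermediate result lists as small as possible.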
That’s it.
Conclusions
This is just my first attempt in years to work with inverted indexes. It is not that hard, but it is not that simple either.
Ayende’s Corax project was an excellent reference for tokenizing and analyzing documents. Also, the “Introduction to Information Retrieval” book that I have been reading is straightforward to follow and understand.
Please, share your thoughts about the code. Let’s improve it!