Cases de melhoria de performance: NÃO utilize RegEx no caminho crítico

Expressões regulares são fantásticas. Entretanto, precisam ser utilizadas com moderação pois podem impactar de forma perceptível a performance.

A expressão regular do código abaixo (criado para esse post) estava no caminho crítico de execução do sistema de um cliente.

public static bool VersionWithRegex(string strToFind, string input)
{
    var regex = new Regex($".*({strToFind})[/.].*", RegexOptions.IgnoreCase);
    return regex.Match(input).Success;
}

Basicamente, essa expressão verifica se uma string, seguida por um “.” ou por uma “/” estão contidos em uma outra string, desconsiderando minúsculos e maiúsculos.

Para que possamos apenas ter “um cheiro” do peso de execução dessa expressão regular, criei um teste simples.

static void Main(string[] args)
{
    Func<string, string, bool> sut = VersionWithRegex;

    var g0 = GC.CollectionCount(0);
    var g1 = GC.CollectionCount(1);
    var g2 = GC.CollectionCount(2);

    var sw = new Stopwatch();
    sw.Start();


    for (int i = 0; i < 100_000; i++)
    {
        if (!sut("abc", "xptoABC.abc"))
            throw new Exception();

        if (!sut("abc", "xptoAbC/abc"))
            throw new Exception();

        if (!sut("abc", "xptoABC."))
            throw new Exception();

        if (!sut("abc", "xptoAbC/"))
            throw new Exception();

        if (!sut("abc", "ABC."))
            throw new Exception();

        if (!sut("abc", "AbC/"))
            throw new Exception();

        if (!sut("abc", "xptoabC.xpto"))
            throw new Exception();

        if (!sut("abc", "xptoabC/xpto"))
            throw new Exception();

        if (sut("abcd", "xptoabC/xpto"))
            throw new Exception();

        if (sut("abc/", "xptoabC/xpto"))
            throw new Exception();
    }

    Console.WriteLine($"Total time: {sw.ElapsedMilliseconds}ms.");
    Console.WriteLine($"Gen0: {GC.CollectionCount(0) - g0}.");
    Console.WriteLine($"Gen1: {GC.CollectionCount(1) - g1}.");
    Console.WriteLine($"Gen2: {GC.CollectionCount(2) - g2}.");
}

Resultado:

Pouco mais de nove segundos para verificar mais de 1_000_000 de strings não parece um tempo ruim. Entretanto, como sempre digo, 10s, para um computador, é muito tempo. Vejamos se conseguimos melhorar.

Primeira modificação: adeus Regex

Minha primeira proposta foi remover o Regex. Afinal, o teste não seria tão custo de implementar.

public static class Attempt1
{
    public static bool VersionWithoutRegex(string strToFind, string input)
    {

        return
            input.IndexOf($"{strToFind}/", StringComparison.InvariantCultureIgnoreCase) != -1 ||
            input.IndexOf($"{strToFind}.", StringComparison.InvariantCultureIgnoreCase) != -1;
    }
}

Resultado:

Surpreendente, não? O código acima é mais simples (na minha opnião) e ~22 vezes mais rápido. Aliás, interessante observar o número de coletas que o Regex provoca.

Essa performance já seria suficiente para o cliente. Mas, apenas por curiosidade, será que conseguimos melhorar?

Segunda modificação: Adeus GC

Sabemos que o GC impacta a performance da aplicação. Então, vamos ver se podemos ter algum ganho não alocando memória.

public static bool VersionWithoutRegex(string strToFind, string input) 
    => input.Contains(strToFind, '.') || 
       input.Contains(strToFind, '/');

private static bool Contains(this string input, string strToFind, char suffix)
{
    var limit = input.Length - (strToFind.Length + 1);
    var c0 = char.ToUpperInvariant(strToFind[0]);

    for (int i = 0; i <= limit; i++)
    {
        if (char.ToUpperInvariant(input[i]) == c0)
        {
            bool success = true;

            for (int j = 1; j < strToFind.Length; j++)
            {
                if (char.ToUpperInvariant(input[i + j]) != char.ToUpperInvariant(strToFind[j]))
                {
                    success = false;
                    break;
                }
            }

            if (success && char.ToUpperInvariant(input[i + strToFind.Length]) == char.ToUpperInvariant(suffix))
            {
                return true;
            }
        }
    }

    return false;
}

Observe que, definitivamente, abraçamos alguma complexidade nesse código. Vejamos o resultado:

Terceira modifição: Comparando de forma mais inteligente

Reparei que estava usando uma “comparação” pesada demais para os caracteres, por isso, resolvi “aliviar”

public static bool VersionWithoutRegex(string strToFind, string input)
{

    return
        (input.Contains(strToFind, '.') || input.Contains(strToFind, '/'));
}
private static bool Contains(this string input, string strToFind, char suffix)
{
    var limit = input.Length - (strToFind.Length + 1);

    for (int i = 0; i <= limit; i++)
    {
        if (strToFind[0].EqualsIgnoringCase(input[i]))
        {
            bool success = true;

            for (int j = 1; j < strToFind.Length; j++)
            {
                if (input[i + j].EqualsIgnoringCase(strToFind[j]))
                {
                    success = false;
                    break;
                }
            }

            if (success && input[i + strToFind.Length].EqualsIgnoringCase(suffix))
            {
                return true;
            }
        }
    }

    return false;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool EqualsIgnoringCase(this char c1, char c2)
{
    if (!char.IsLetter(c1))
    {
        return c1 == c2;
    }

    if (char.IsLower(c1))
    {
        return char.IsLower(c2) ? c1 == c2 : c1 == char.ToLowerInvariant(c2);
    }

    return char.IsUpper(c2) ? c1 == c2 : c1 == char.ToUpperInvariant(c2);
}

Resultado:

Quarta modifição: Comparando de forma AINDA mais inteligente

Minha abordagem anterior para comparar caracteres se mostrou mais inteligente (e mais performática). Vejamos outra ainda mais efetiva:

public static bool VersionWithoutRegex(string strToFind, string input)
{

    return
        (input.Contains(strToFind, '.') || input.Contains(strToFind, '/'));
}
private static bool Contains(this string input, string strToFind, char suffix)
{
    var limit = input.Length - (strToFind.Length + 1);

    for (int i = 0; i <= limit; i++)
    {
        if (strToFind[0].EqualsIgnoringCase(input[i]))
        {
            bool success = true;
            for (int j = 1; j < strToFind.Length; j++)
            {
                if (input[i + j].EqualsIgnoringCase(strToFind[j]))
                {
                    success = false;
                    break;
                }
            }

            if (success && input[i + strToFind.Length].EqualsIgnoringCase(suffix))
            {
                return true;
            }
        }
    }

    return false;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool EqualsIgnoringCase(this char c1, char c2)
{
    return (
        (c1 - 'a') == (c2 - 'a') ||
        (c1 - 'a') == (c2 - 'A') ||
        (c1 - 'A') == (c2 - 'a') ||
        (c1 - 'A') == (c2 - 'A')
        );
}

Resultado:

Quinta (última) modificação: Adeus processamento em duplicidade

Percebi que meu código de “contido” rodava duas vezes – uma para cada sufixo. Vamos resolver isso.

public static bool VersionWithoutRegex(string strToFind, string input)
{

    return
        (input.Contains(strToFind, '.', '/'));
}
private static bool Contains(this string input, string strToFind, char suffix1, char suffix2)
{
    var limit = input.Length - (strToFind.Length + 1);

    for (int i = 0; i <= limit; i++)
    {
        if (strToFind[0].EqualsIgnoringCase(input[i]))
        {
            bool success = true;
            for (int j = 1; j < strToFind.Length; j++)
            {
                if (input[i + j].EqualsIgnoringCase(strToFind[j]))
                {
                    success = false;
                    break;
                }
            }

            if (success && input[i + strToFind.Length].EqualsIgnoringCase(suffix1) || input[i + strToFind.Length].EqualsIgnoringCase(suffix2))
            {
                return true;
            }
        }
    }

    return false;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static bool EqualsIgnoringCase(this char c1, char c2)
{
    return (
        (c1 - 'a') == (c2 - 'a') ||
        (c1 - 'a') == (c2 - 'A') ||
        (c1 - 'A') == (c2 - 'a') ||
        (c1 - 'A') == (c2 - 'A')
        );
}

Resultado:

22ms!

Comparando a performance de forma mais “profissional”

Nosso código, que originalmente consumia quase 10 segundos, agora performa em ~22ms. Vejamos uma análise mais qualificada com o BenchmarkDotNet.

Conclusão

[tweet]Performance é uma feature. Escrever código de boa performance é um hábito.[/tweet] Código que performa melhor pode, eventualmente, ficar mais complexo. Entretanto, dependendo da demanda, essa complexidade é compensada.

O primeiro código desse post, que está em produção hoje em uma grande empresa, provavelmente é o mais simples. Também é ~440x mais lento que o último código.

No próximo dia 29, irei falar, de forma gratuita, para o Canal.NET sobre como escrever código com performance ótima. Faça sua inscrição.

NÃO utilize RegEx no caminho crítico

Primeira modificação: adeus Regex

Segunda modificação: Adeus GC

Terceira modifição: Comparando de forma mais inteligente

Quarta modifição: Comparando de forma AINDA mais inteligente

Quinta (última) modificação: Adeus processamento em duplicidade

Comparando a performance de forma mais “profissional”

Conclusão

Elemar Júnior

Excelência tecnológica potencializando resultados.

Elemar Júnior

Excelência tecnológica potencializando resultados.

Live Unit Testing: Feedback instantâneo do meu código!

Getting started with microservices

Vale a pena estimar?

[INSCRIÇÕES ABERTAS] Workshop sobre RavenDB, comigo, em São Paulo, dia 22 de Junho

From GoF to Lambdas – The Strategy Pattern

SOLUTION: What does this code do?

NÃO utilize RegEx no caminho crítico

Primeira modificação: adeus Regex

Segunda modificação: Adeus GC

Terceira modifição: Comparando de forma mais inteligente

Quarta modifição: Comparando de forma AINDA mais inteligente

Quinta (última) modificação: Adeus processamento em duplicidade

Comparando a performance de forma mais “profissional”

Conclusão

Elemar Júnior

Elemar Júnior

Mentoria em Arquitetura de Software

Mentoria em
Arquitetura de Software