Dynamic PGO in .NET 6.0.md

 2 years ago
source link: https://gist.github.com/EgorBo/dc181796683da3d905a5295bfd3dd95b

Dynamic PGO in .NET 6.0

Dynamic PGO (Profile-guided optimization) is a JIT-compiler optimization technique that lets the JIT collect additional runtime information (a profile) in tier0 code and rely on it later, when hot methods are promoted from tier0 to tier1, to make them even more efficient.

What exactly can PGO optimize for us?

  1. Profile-driven inlining - the inliner relies on PGO data, so it can be very aggressive on hot paths and care less about cold ones; see dotnet/runtime#52708 and dotnet/runtime#55478. A good example where it has a visible effect is the StringBuilder benchmark (ProfileDrivingInlining) in the Benchmarks section below.

  2. Guarded devirtualization - most monomorphic virtual/interface calls can be devirtualized using PGO data, e.g.:

void DisposeMe(IDisposable d)
{
    d.Dispose();
}

It looks like nothing can be optimized here, right? It is just an ordinary virtual (interface) call on an object of unknown type: it goes through several indirections to reach the actual Dispose() implementation, and that body will never be inlined here. Now let's see what PGO can do.
With Dynamic PGO on, this method is compiled to something like this in tier0 (shown as pseudocode for the machine code):

void DisposeMe(IDisposable d)
{
+   call CORINFO_HELP_CLASSPROFILE32(d, offset);
    d.Dispose();
}

We now record the underlying type of d on every call of that method. Yes, that makes it slightly slower, but eventually the method is recompiled at tier1 into something like this:

void DisposeMe(IDisposable d)
{
+   if (d is MyType)           // E.g. Profile states that Dispose here is 'mostly' called on MyType.
+       ((MyType)d).Dispose(); // Direct call - can be inlined now!
+   else
        d.Dispose();           // a cold fallback, just in case
}

[Image: codegen diff for a case where MyType::Dispose is empty]
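
To make the two steps above more concrete, here is a minimal hand-written C# sketch. It is not the JIT's actual implementation: `ClassProfileSketch`, `GuardedDevirtualizationSketch` and the `DisposeCount` field are hypothetical names introduced for illustration, and the dictionary is a simplification of whatever data structure the runtime really uses for its class profile. The first class mimics recording the observed type on every call, the second shows the guarded call shape written by hand.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the "MyType" used in the example above.
public sealed class MyType : IDisposable
{
    public int DisposeCount;
    public void Dispose() => DisposeCount++;
}

// Hypothetical, simplified model of a per-call-site class profile.
public sealed class ClassProfileSketch
{
    private readonly Dictionary<Type, int> _hits = new();

    // Roughly what the instrumented tier0 code does on each call: note the exact type.
    public void Record(object? o)
    {
        if (o is null) return;
        Type t = o.GetType();
        _hits[t] = _hits.TryGetValue(t, out int n) ? n + 1 : 1;
    }

    // The type seen most often, or null if nothing was recorded.
    public Type? DominantType()
    {
        Type? best = null;
        int bestCount = 0;
        foreach (var (type, count) in _hits)
        {
            if (count > bestCount) { best = type; bestCount = count; }
        }
        return best;
    }
}

public static class GuardedDevirtualizationSketch
{
    // Hand-written version of the tier1 shape: guard on the likely type,
    // call it directly (so the call can be inlined), otherwise fall back.
    public static void DisposeMe(IDisposable d)
    {
        if (d is MyType known)
            known.Dispose(); // direct call on a sealed type
        else
            d.Dispose();     // cold fallback through the interface
    }
}
```

Because `MyType` is sealed in this sketch, the `is` test only succeeds for that exact type, which is close in spirit to a guard on one specific class. If `DominantType()` returned a different type for a call site, the guard would be written against that type instead.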

  3. Hot-cold block reordering - the JIT reorders blocks to keep hot ones close to each other and pushes cold ones to the end of the method. The following code:

void DoWork(int a)
{
    if (a > 0)
        DoWork1();
    else
        DoWork2();
}

Is compiled like this in tier0:

void DoWork(int a)
{
    if (a > 0)
    {
+       __block_0_counter++;
        DoWork1();
    }
    else
    {
+       __block_1_counter++;
        DoWork2();
    }
}

And again: once it's recompiled to tier1 it is optimized into:

void DoWork(int a)
{
    // E.g. __block_0_counter is smaller or even zero => DoWork1 is rarely (never) taken
    // and JIT re-orders DoWork2 with DoWork1:
-   if (a > 0)
+   if (a <= 0)
-       DoWork1();
+       DoWork2();
    else
-       DoWork2();
+       DoWork1();
}

  4. Misc - some optimizations such as Loop Cloning, Inlined Casts, etc. aren't applied in cold blocks.
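
The block counters from the hot-cold reordering item can be imitated by hand. The sketch below is only an illustration under stated assumptions: `BlockCounterSketch` and its members are hypothetical names, and the "hot" decision here is a plain comparison of two counters, not the JIT's real layout heuristic.

```csharp
using System;

// Hypothetical model of the tier0 block counters for DoWork(int a).
public sealed class BlockCounterSketch
{
    public long Block0Counter; // counts executions of the "a > 0" block
    public long Block1Counter; // counts executions of the "else" block

    // Instrumented version of DoWork: bump the counter of whichever block runs.
    public void DoWorkInstrumented(int a)
    {
        if (a > 0)
        {
            Block0Counter++;
            // DoWork1() would run here
        }
        else
        {
            Block1Counter++;
            // DoWork2() would run here
        }
    }

    // True when the "a > 0" block ran at least as often as the "else" block,
    // i.e. when it is the one worth keeping on the fall-through path.
    public bool IsPositivePathHot() => Block0Counter >= Block1Counter;
}
```

For example, after calling `DoWorkInstrumented` with mostly non-positive arguments, `IsPositivePathHot()` returns false, which corresponds to the case above where the condition is flipped so that DoWork2 comes first.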

Benchmarks (Default mode vs Dynamic PGO)

using System.Collections.Generic;
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

// Run the benchmarks
BenchmarkRunner.Run<PgoBenchmarks>();


[Config(typeof(MyEnvVars))]
public class PgoBenchmarks
{
    // Custom config to define "Default vs PGO"
    class MyEnvVars : ManualConfig
    {
        public MyEnvVars()
        {
            // Use .NET 6.0 default mode:
            AddJob(Job.Default.WithId("Default mode"));

            // Use Dynamic PGO mode:
            AddJob(Job.Default.WithId("Dynamic PGO")
                .WithEnvironmentVariables(
                    new EnvironmentVariable("DOTNET_TieredPGO", "1"),
                    new EnvironmentVariable("DOTNET_TC_QuickJitForLoops", "1"),
                    new EnvironmentVariable("DOTNET_ReadyToRun", "0")));
        }
    }


    //
    // Benchmark 1: Devirtualize unknown virtual calls:
    //

    public IEnumerable<object> TestData()
    {
        // Test data for 'GuardedDevirtualization(ICollection<int>)'
        yield return new List<int>();
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public void GuardedDevirtualization(ICollection<int> collection)
    {
        // a chain of unknown virtual calls...
        collection.Clear();
        collection.Add(1);
        collection.Add(2);
        collection.Add(3);
    }


    //
    // Benchmark 2: Allow inliner to be way more aggressive than usual
    //              for profiled call-sites:
    //

    [Benchmark]
    public StringBuilder ProfileDrivingInlining()
    {
        StringBuilder sb = new();
        for (int i = 0; i < 1000; i++)
            sb.Append("hi"); // see https://twitter.com/EgorBo/status/1451149444183990273
        return sb;
    }


    //
    // Benchmark 3: Reorder hot-cold blocks for better performance
    //

    [Benchmark]
    [Arguments(42)]
    public string HotColdBlockReordering(int a)
    {
        if (a == 1)
            return "a is 1";
        if (a == 2)
            return "a is 2";
        if (a == 3)
            return "a is 3";
        if (a == 4)
            return "a is 4";
        if (a == 5)
            return "a is 5";
        return "a is too big"; // this branch is always taken in this benchmark (a is 42)
    }
}

Results:

| Method                  | Job          |          Mean |      Error |     StdDev |
|------------------------ |------------- |--------------:|-----------:|-----------:|
| GuardedDevirtualization | Default mode |     5.7448 ns |  0.0020 ns |  0.0017 ns |
| GuardedDevirtualization | Dynamic PGO  |     3.2651 ns |  0.0233 ns |  0.0182 ns |
| ProfileDrivingInlining  | Default mode | 3,538.2980 ns | 26.7256 ns | 23.6915 ns |
| ProfileDrivingInlining  | Dynamic PGO  | 2,167.8397 ns |  5.0619 ns |  4.2269 ns |
| HotColdBlockReordering  | Default mode |     1.5244 ns |  0.0029 ns |  0.0025 ns |
| HotColdBlockReordering  | Dynamic PGO  |     0.0181 ns |  0.0051 ns |  0.0040 ns |
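
To read these numbers as speedups, divide the default-mode mean by the Dynamic PGO mean. The helper below is a small sketch for that arithmetic only; `SpeedupSketch` is a hypothetical name, and the inputs are the means reported above.

```csharp
using System;

public static class SpeedupSketch
{
    // Ratio of the default-mode mean to the Dynamic PGO mean (both in nanoseconds).
    public static double Speedup(double defaultNs, double pgoNs) => defaultNs / pgoNs;

    // Prints the ratio for each benchmark using the means from the results above.
    public static void PrintAll()
    {
        Console.WriteLine($"GuardedDevirtualization: {Speedup(5.7448, 3.2651):F2}x");
        Console.WriteLine($"ProfileDrivingInlining:  {Speedup(3538.2980, 2167.8397):F2}x");
        Console.WriteLine($"HotColdBlockReordering:  {Speedup(1.5244, 0.0181):F2}x");
    }
}
```

The first two ratios come out at about 1.76x and 1.63x. The third is far larger, but a mean of 0.0181 ns is so close to zero that it is probably near the measurement floor, so that ratio should be read loosely.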

How can I try it in production?

You only need to make sure the following environment variables are defined in the environment of your program's process:

# Enable Dynamic PGO
export DOTNET_TieredPGO=1

# Precompiled (ReadyToRun/AOT) images aren't instrumented, so we disable them to collect
# relevant PGO data for literally everything. This hurts startup time,
# but leads to higher performance after warm-up.
export DOTNET_ReadyToRun=0

# In .NET 7.0 we will hopefully enable full-fledged OSR, but for now methods with loops
# always bypass tier0; we need them in tier0 so that they are instrumented for PGO.
export DOTNET_TC_QuickJitForLoops=1

The commands above are for Linux/macOS. For Windows PowerShell:

$env:DOTNET_TieredPGO=1
$env:DOTNET_ReadyToRun=0
$env:DOTNET_TC_QuickJitForLoops=1
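
If you want to confirm from inside the application that the variables actually reached the process, a small check like the one below may help. This is a sketch, not part of any official tooling: `PgoEnvCheck` is a hypothetical helper, and it only verifies that the three variables listed above are visible with the expected values. It cannot tell whether the runtime acted on them.

```csharp
#nullable enable
using System;

public static class PgoEnvCheck
{
    // The three variables and values listed above.
    private static readonly (string Name, string Expected)[] Settings =
    {
        ("DOTNET_TieredPGO", "1"),
        ("DOTNET_ReadyToRun", "0"),
        ("DOTNET_TC_QuickJitForLoops", "1"),
    };

    // Prints each variable and returns true only if all of them
    // are set to the expected value in the current process.
    public static bool AllSet()
    {
        bool ok = true;
        foreach (var (name, expected) in Settings)
        {
            string? actual = Environment.GetEnvironmentVariable(name);
            Console.WriteLine($"{name} = {actual ?? "<not set>"} (expected {expected})");
            ok &= actual == expected;
        }
        return ok;
    }
}
```

Calling `PgoEnvCheck.AllSet()` early at startup and logging the result is one way to catch a deployment where the variables were set in the wrong shell or service definition. Note that the variables have to be in place before the process starts: setting them from inside the running program would satisfy this check but would not change how the already-started runtime behaves.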

Community feedback on PGO in .NET 6.0

Please tag me (EgorBo) on Twitter and I'll forward your feedback to the team.








