Wednesday, April 18, 2007
Recently, Dan Shechter sent me a code sample that illustrates a certain peculiarity of the .NET garbage collector. Here's a version of the sample:
using System;
using System.Text;

class Program
{
  static long g_LastMemory;

  static void DisplayMemory(string caption, bool showDelta)
  {
    long memory = GC.GetTotalMemory(true); // passing true forces full GC...

    StringBuilder builder = new StringBuilder();
    builder.AppendFormat("{0}: {1:###,###,###,##0} bytes", caption, memory);

    if (showDelta)
      builder.AppendFormat(" (delta: {0:###,###,###,##0} bytes)",
      memory - g_LastMemory);

    Console.WriteLine(builder.ToString());

    g_LastMemory = memory;
  }

  static void Main()
  {
    DisplayMemory("Initial memory", false);

    byte[] myBigArray = new byte[1024 * 1024]; // allocate 1 megabyte...

    DisplayMemory("After allocating 1 megabyte", true);

    for (int i = 0; i < myBigArray.Length; i++)
      myBigArray[i] = (byte)(i % 255);

    DisplayMemory("After manipulating 1 megabyte", true);

    myBigArray = null;

    DisplayMemory("After releasing 1 megabyte", true);
  }
}

The sample application is fairly simple; it manipulates an array and displays the total memory size. One important detail we should pay attention to is that a garbage collection is forced when "DisplayMemory()" calls "GC.GetTotalMemory()" with a value of "true." Consequently, the application flow looks like this:

  1. Force a garbage collection and display the initial memory size.
  2. Allocate a megabyte-sized byte array.
  3. Force a garbage collection and display the current memory size.
  4. Loop through the byte array and assign values to each element in the array.
  5. Force a garbage collection and display the current memory size.
  6. Set the byte array to null.
  7. Force a garbage collection and display the current memory size.

When the application is built with Visual Studio 2005 in the "Debug" configuration and run at the command-line, the results are predictable.

Initial memory: 212,520 bytes
After allocating 1 megabyte: 1,263,748 bytes (delta: 1,051,228 bytes)
After manipulating 1 megabyte: 1,263,748 bytes (delta: 0 bytes)
After releasing 1 megabyte: 215,156 bytes (delta: -1,048,592 bytes)

However, when built in the "Release" configuration, the results are quite unexpected.

Initial memory: 212,552 bytes
After allocating 1 megabyte: 1,263,752 bytes (delta: 1,051,200 bytes)
After manipulating 1 megabyte: 215,160 bytes (delta: -1,048,592 bytes)
After releasing 1 megabyte: 215,160 bytes (delta: 0 bytes)

In the "Release" build, the byte array is reclaimed by the garbage collector before it is set to null. From a memory-optimization perspective, this makes sense because the byte array isn't actually used after its elements are assigned except to set it to null. Releasing the memory early relieves pressure on the garbage collector. However, to the old-school programmer who is accustomed to memory remaining allocated until explicitly freed, this is harder to swallow.

The lesson we learn from this example is not to expect the memory in a garbage-collected environment to be treated in predictable ways. Anyone who tries to pin down the garbage collector (excepting CLR developers) will find it to be an elusive and magical creature. I am always looking for the "why" behind the magic, so let's track down exactly how this optimization is applied under the covers.

Our first step is to check the IL emitted by the C# compiler to see if the compiler simply moves the null assignment. Here is the disassembled IL code with annotations to show exactly what each chunk of instructions does:

.method private hidebysig static void Main() cil managed
{
  .entrypoint
  .maxstack 4
  .locals init (
    [0] uint8[] myBigArray,
    [1] int32 i)

  // DisplayMemory("Initial memory", false)
  ldstr "Initial memory"
  ldc.i4.0
  call void GCTest.Program::DisplayMemory(string, bool)

  // byte[] myBigArray = new myBigArray[1024 * 1024]
  ldc.i4 1048576
  newarr uint8
  stloc.0

  // DisplayMemory("After allocating 1 megabyte", true)
  ldstr "After allocating 1 megabyte"
  ldc.i4.1
  call void GCTest.Program::DisplayMemory(string, bool)

  // int i = 0
  ldc.i4.0
  stloc.1

  // goto LengthCheck
  br.s LengthCheck

LoopBody:
  // myBigArray[i] = (byte)(i % 255)
  ldloc.0
  ldloc.1
  ldloc.1
  ldc.i4 255
  rem
  conv.u1
  stelem.i1

  // i++
  ldloc.1
  ldc.i4.1
  add
  stloc.1

LengthCheck:
  // if (i < myBigArray.Length) goto LoopBody
  ldloc.1
  ldloc.0
  ldlen
  conv.i4
  blt.s LoopBody

  // DisplayMemory("After manipulating 1 megabyte", true)
  ldstr "After manipulating 1 megabyte"
  ldc.i4.1
  call void GCTest.Program::DisplayMemory(string, bool)

  // myBigArray = null
  ldnull
  stloc.0

  // DisplayMemory("After releasing 1 megabyte", true)
  ldstr "After releasing 1 megabyte"
  ldc.i4.1
  call void GCTest.Program::DisplayMemory(string, bool)

  ret
}

The code above shows us that the byte array is explicitly set to null after "DisplayMemory()" is called. While the IL code is interesting, the optimization isn't there. That leaves only one possibility. The optimization must happen at JIT time when the IL is compiled to native instructions.

The Visual Studio 2005 debugger is the most robust tool for viewing native JIT-compiled code. However, we must set a few options first.

  1. Switch to the "Release" solution configuration.
  2. In the Solution Explorer, right-click on the project node and choose Properties from the context menu to bring up the Application Designer.
  3. On the Build tab of the Application Designer, press the Advanced button to display the Advanced Build Settings dialog. In this dialog, set Debug Info to "pdb-only" and click OK.
  4. On the Debug tab of the Application Designer, ensure that "Enabled unmanaged code debugging" is checked.

At this point, we can return to the sample code, set a breakpoint at the beginning of the "Main()" method and run the application by pressing F5. When the application enters break mode, we can view the disassembled native code by selecting Debug -> Windows -> Disassembly from the Visual Studio menu. Here's what it looks like:

    static void Main()
    {
      DisplayMemory("Initial memory", false);
00000000  push edi
00000001  push esi
00000002  cmp  dword ptr ds:[00912DC8h],0
00000009  je   00000010
0000000b  call 78F9228E
00000010  xor  esi,esi
00000012  xor  edi,edi
00000014  mov  ecx,dword ptr ds:[022B303Ch]
0000001a  xor  edx,edx
0000001c  call dword ptr ds:[00913040h]


      byte[] myBigArray = new byte[1024 * 1024]; // allocate 1 megabyte...
00000022  mov  edx,100000h
00000027  mov  ecx,7915985Ah
0000002c  call FF802120
00000031  mov  esi,eax

      DisplayMemory("After allocating 1 megabyte", true);
00000033  mov  ecx,dword ptr ds:[022B3040h]
00000039  mov  edx,1
0000003e  call dword ptr ds:[00913040h]


      for (int i = 0; i < myBigArray.Length; i++)
00000044  xor  edi,edi
00000046  nop
00000047  jmp  00000062


        myBigArray[i] = (byte)(i % 255);
00000049  mov  eax,edi
0000004b  mov  ecx,0FFh
00000050  cdq
00000051  idiv eax,ecx
00000053  cmp  edi,dword ptr [esi+4]
00000056  jb   0000005D
00000058  call 78F94671
0000005d  mov  byte ptr [esi+edi+8],dl


      for (int i = 0; i < myBigArray.Length; i++)
00000061  inc  edi
00000062  cmp  edi,dword ptr [esi+4]
00000065  jl   00000049


      DisplayMemory("After manipulating 1 megabyte", true);
00000067  mov  ecx,dword ptr ds:[022B3044h]
0000006d  mov  edx,1
00000072  call dword ptr ds:[00913040h]


      myBigArray = null;
00000078  xor  esi,esi

      DisplayMemory("After releasing 1 megabyte", true);
0000007a  mov  ecx,dword ptr ds:[022B3048h]
00000080  mov  edx,1
00000085  call dword ptr ds:[00913040h]

    }
0000008b  nop
0000008c  pop  esi
0000008d  pop  edi
0000008e  ret

Examining the native instructions reveals that the optimization isn't there, either. The byte array is stored in the ESI register at 00000031 and is explicitly cleared using the old assembly trick of XOR'ing a register with itself at 00000078. However, the registry is cleared after "DisplayMemory()" is called. In fact, when we remove the breakpoint and allow the application to finish running, we see similar results to what we saw in the "Debug" configuration.

Initial memory: 150,584 bytes
After allocating 1 megabyte: 1,211,176 bytes (delta: 1,060,592 bytes)
After manipulating 1 megabyte: 1,211,176 bytes (delta: 0 bytes)
After releasing 1 megabyte: 162,584 bytes (delta: -1,048,592 bytes)

What did we do wrong?

The problem is that, by default, Visual Studio 2005 disables the JIT optimizer during debugging. Doing so significantly simplifies managed code debugging. For those of you who are interested, Vance Morrison has written a great article about this "feature". Here is a summary of his instructions to force Visual Studio to show optimized JIT-compiled code:

  1. Select Tools -> Options from the Visual Studio main menu to open the Options dialog.
  2. In the Options dialog, choose the Debugging | General options page.
  3. Ensure that "Enable Just My Code (Managed only)" is unchecked.
  4. Ensure that "Suppress JIT optimization on module load (Managed only)" is unchecked.
  5. Click OK to close the Options dialog.

Now when we run the application, the results indicate that the optimization we saw when we ran it in "Release" configuration at the command-line is present.

Initial memory: 150,612 bytes
After allocating 1 megabyte: 1,211,176 bytes (delta: 1,060,564 bytes)
After manipulating 1 megabyte: 162,584 bytes (delta: -1,048,592 bytes)
After releasing 1 megabyte: 162,584 bytes (delta: 0 bytes)

With the additional options selected, we again set a breakpoint at the beginning of the "Main()" method and run the application. Here is the disassembled (and fully optimized) native code:

    static void Main()
    {
      DisplayMemory("Initial memory", false);
00000000  push edi
00000001  push esi
00000002  push ebx
00000003  mov  ecx,dword ptr ds:[022B303Ch]
00000009  xor  edx,edx
0000000b  call dword ptr ds:[00913040h]


      byte[] myBigArray = new byte[1024 * 1024]; // allocate 1 megabyte...
00000011  mov  edx,100000h
00000016  mov  ecx,7915985Ah
0000001b  call FF802120
00000020  mov  edi,eax

      DisplayMemory("After allocating 1 megabyte", true);
00000022  mov  ecx,dword ptr ds:[022B3040h]
00000028  mov  edx,1
0000002d  call dword ptr ds:[00913040h]


      for (int i = 0; i < myBigArray.Length; i++)
00000033  xor  esi,esi
00000035  mov  ebx,dword ptr [edi+4]
00000038  test ebx,ebx
0000003a  jle  0000004F


        myBigArray[i] = (byte)(i % 255);
0000003c  mov  eax,esi
0000003e  mov  ecx,0FFh
00000043  cdq
00000044  idiv eax,ecx
00000046  mov  byte ptr [edi+esi+8],dl


      for (int i = 0; i < myBigArray.Length; i++)
0000004a  inc  esi
0000004b  cmp  ebx,esi
0000004d  jg   0000003C


      DisplayMemory("After manipulating 1 megabyte", true);
0000004f  mov  ecx,dword ptr ds:[022B3044h]
00000055  mov  edx,1
0000005a  call dword ptr ds:[00913040h]


      DisplayMemory("After releasing 1 megabyte", true);
00000060  mov  ecx,dword ptr ds:[022B3048h]
00000066  mov  edx,1
0000006b  call dword ptr ds:[00913040h]
00000071  pop  ebx

    }
00000072  pop  esi
00000073  pop  edi
00000074  ret

This time, the byte array is stored in the EDI register (ESI is used for the index variable of the for-loop) at 00000020. However, this register isn't ever cleared! The code to set the byte array to null has been completely removed. How does the garbage collector know that it can be reclaimed? To find out, we'll have to peek into the internal data structures of the CLR. Fortunately, Microsoft has provided us with a tool to allow us to do so.

The SOS.DLL ("Son of Strike") debugger extension is one of the most powerful tools available for debugging .NET applications. With it, we can examine the inner workings of the CLR at runtime. However, because it is so low-level, many .NET developers don't know that it exists or have never tried using it. In the case of our byte array, figuring out why the memory is being reclaimed earlier than we expect would be nearly impossible without SOS. It can show us how the JIT optimizer provides the garbage collector with the information that it needs to reclaim the memory.

Loading the SOS extension is easy. While still at the breakpoint, we can open the Immediate Window by selecting Debug -> Windows -> Immediate from the Visual Studio main menu. Then, we just have type the following into the Immediate Window:

.load sos

We should be rewarded with the following message:

extension C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\sos.dll loaded

First, we need to examine the "MethodDesc" for the current method. This data structure describes important runtime details about a managed method. SOS provides a command called "ip2md" (Instruction-Pointer-To-MethodDesc). According to the documentation, "given an address in managed jitted code, ip2md attempts to find the MethodDesc associated with it." Since we're interested in the MethodDesc for the current method, we can pass the CPU's instruction pointer (stored in the EIP register) to ip2md. To display the contents of the CPU registers, we select Debug -> Windows -> Registers from the Visual Studio main menu. Using the current value of the EIP register, we type the following command into the Intermediate Window (your EIP value may differ from the one shown here):

!ip2md 01100073

SOS will display the MethodDesc information for the current method.

MethodDesc: 00912ff8
Method Name: Program.Main()
Class: 00911284
MethodTable: 00913008
mdToken: 06000002
Module: 00912c14
IsJitted: yes
m_CodeOrIL: 01100070

Now that we have the MethodDesc, its address can be used with the "gcinfo" command to gather detailed information about how the garbage collector tracks registers containing managed objects. I should point out that the gcinfo documentation states that it is really intended for CLR developers and is "very difficult for anyone else to make sense of." In other words, we're being a bit naughty, but I'm OK with that. To use the gcinfo command with the MethodDesc address, we type the following:

!gcinfo 00912ff8

And here's what comes back:

entry point 01100070
Normal JIT generated code
GC info 1120aae800911800
Method info block:
    method      size   = 0075
    prolog      size   =  3
    epilog      size   =  4
    epilog     count   =  1
    epilog      end    = yes
    callee-saved regs  = EDI ESI EBX
    ebp frame          = no
    fully interruptible= yes
    double align       = no
    arguments size     =  0 DWORDs
    stack frame size   =  0 DWORDs
    untracked count    =  0
    var ptr tab count  =  0
    epilog        at   0071
75 B1 40 |

Pointer table:
F0 49          | 0009    reg ECX becoming live
F0 08          | 0011    reg ECX becoming dead
F0 47          | 0020    reg EAX becoming live
7A             | 0022    reg EDI becoming live
4E             | 0028    reg ECX becoming live
F0 03          | 0033    reg EAX becoming dead
08             | 0033    reg ECX becoming dead
F2 3C          | 004F    reg EDI becoming dead
4E             | 0055    reg ECX becoming live
F0 0B          | 0060    reg ECX becoming dead
4E             | 0066    reg ECX becoming live
F0 0B          | 0071    reg ECX becoming dead
FF             |

The highlighted lines in the pointer table show the range of instructions where the EDI register (which holds the byte array) is live. While live, the EDI register and not a candidate for garbage collection. EDI becomes live at 00000022 and is considered dead at 0000004f. If we refer back to the method's optimized JIT-compiled code, we can highlight this range of instructions.

    static void Main()
    {
      DisplayMemory("Initial memory", false);
00000000  push edi
00000001  push esi
00000002  push ebx
00000003  mov  ecx,dword ptr ds:[022B303Ch]
00000009  xor  edx,edx
0000000b  call dword ptr ds:[00913040h]


      byte[] myBigArray = new byte[1024 * 1024]; // allocate 1 megabyte...
00000011  mov  edx,100000h
00000016  mov  ecx,7915985Ah
0000001b  call FF802120
00000020  mov  edi,eax


      DisplayMemory("After allocating 1 megabyte", true);
00000022  mov  ecx,dword ptr ds:[022B3040h]
00000028  mov  edx,1
0000002d  call dword ptr ds:[00913040h]


      for (int i = 0; i < myBigArray.Length; i++)
00000033  xor  esi,esi
00000035  mov  ebx,dword ptr [edi+4]
00000038  test ebx,ebx
0000003a  jle  0000004F


        myBigArray[i] = (byte)(i % 255);
0000003c  mov  eax,esi
0000003e  mov  ecx,0FFh
00000043  cdq
00000044  idiv eax,ecx
00000046  mov  byte ptr [edi+esi+8],dl


      for (int i = 0; i < myBigArray.Length; i++)
0000004a  inc  esi
0000004b  cmp  ebx,esi
0000004d  jg   0000003C


      DisplayMemory("After manipulating 1 megabyte", true);
0000004f  mov  ecx,dword ptr ds:[022B3044h]
00000055  mov  edx,1
0000005a  call dword ptr ds:[00913040h]


      DisplayMemory("After releasing 1 megabyte", true);
00000060  mov  ecx,dword ptr ds:[022B3048h]
00000066  mov  edx,1
0000006b  call dword ptr ds:[00913040h]
00000071  pop  ebx

    }
00000072  pop  esi
00000073  pop  edi
00000074  ret

Since the EDI register is live only until DisplayMemory() is called at 0000004f, the garbage collector is free to reclaim the byte array's memory . That's the magic. Although we thought that we were holding a strong reference to the byte array, the JIT optimizer decided otherwise.

Like most magic, this puzzle turns out to have a simple solution in the end. Instead of some convoluted reordering of code, the JIT optimizer simply marks a range of instructions.

Spelunking the JIT optimizer has been a whirlwind adventure. Each step has taken us deeper under the hood of the CLR and revealed new sparkling gems of information. Whether this information can be practically applied to daily coding is unlikely. However, the tools that we have explored can help us to investigate some of the most elusive bugs that we might encounter. Also, pinning down the garbage collector has served to strengthen our faith in that magical creature's ability to do its job. And finally, we have learned that the JIT optimizer is a sorcerer that toils especially hard to work its own magic.

posted on Wednesday, April 18, 2007 7:08:50 AM (Pacific Standard Time, UTC-08:00)  #    Comments [6]

kick it on DotNetKicks.com