Category: CrashDumps

  • Ghost in the Machine – Debugging an Ornery Windows Bluescreen Problem – Part 2

    Collecting Additional Data Points

    Continuing where I left off in Part 1, I had identified that the issue was likely either faulty hardware or a bad driver – which pretty much describes the cause of every bugcheck. I was hoping for something a bit more precise. I did have the one mini-dump so I decided to tear into it a little bit more until I could get my hands on a kernel dump.

    Deep Dive: Level 2

    The first thing I did after loading the crash dump into WinDbg was see what !analyze -v had to say about it:

    2: kd> !analyze -v
    *******************************************************************************
    *                                                                             *
    *                        Bugcheck Analysis                                    *
    *                                                                             *
    *******************************************************************************
    
    SYSTEM_SERVICE_EXCEPTION (3b)
    An exception happened while executing a system service routine.
    Arguments:
    Arg1: 00000000c0000005, Exception code that caused the bugcheck
    Arg2: fffff800031842b1, Address of the instruction which caused the bugcheck
    Arg3: fffff88007a64ef0, Address of the context record for the exception that caused the bugcheck
    Arg4: 0000000000000000, zero.
    
    Debugging Details:
    ------------------
    
    
    EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
    
    FAULTING_IP: 
    nt!ObpLookupObjectName+461
    fffff800`031842b1 f6              ???
    
    CONTEXT:  fffff88007a64ef0 -- (.cxr 0xfffff88007a64ef0)
    rax=fffffa80042bccc8 rbx=fffff80002e0f000 rcx=fffff8a00c73af12
    rdx=000000000000006c rsi=0000000000000000 rdi=0000000000000001
    rip=fffff800031842b1 rsp=fffff88007a658d0 rbp=fffffa80042bcb10
     r8=fffffa8003c6cf30  r9=fffffa8005fec268 r10=fffff8a000004ca0
    r11=fffffa80040f6890 r12=0000000000000000 r13=fffff8a000004ca0
    r14=0000000000000040 r15=fffff88007a65a40
    iopl=0         nv up ei pl nz na po nc
    cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206
    nt!ObpLookupObjectName+0x461:
    fffff800`031842b1 f6              ???
    Resetting default scope
    
    CUSTOMER_CRASH_COUNT:  1
    
    DEFAULT_BUCKET_ID:  CODE_CORRUPTION
    
    BUGCHECK_STR:  0x3B
    
    PROCESS_NAME:  svchost.exe
    
    CURRENT_IRQL:  0
    
    LAST_CONTROL_TRANSFER:  from fffff800031855f6 to fffff800031842b1
    
    STACK_TEXT:  
    fffff880`07a658d0 fffff800`031855f6 : 00000000`00000000 fffffa80`042bcb10 fffffa80`00000000 fffffa80`03d0a260 : nt!ObpLookupObjectName+0x461
    fffff880`07a659c0 fffff800`03158d8c : fffff800`03013940 00000000`00000000 fffff8a0`0c445301 00000000`00000005 : nt!ObOpenObjectByName+0x306
    fffff880`07a65a90 fffff800`03163ef2 : 00000000`03c9f980 00000000`00020019 00000000`03c9f990 000007fe`00000000 : nt!CmOpenKey+0x28a
    fffff880`07a65be0 fffff800`02e8d253 : 00000000`00000008 000007fe`e8425c34 00000000`00000001 00000000`00000000 : nt!NtOpenKey+0x12
    fffff880`07a65c20 00000000`771a143a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
    00000000`03c9f938 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x771a143a
    
    
    CHKIMG_EXTENSION: !chkimg -lo 50 -d !nt
        fffff800031842b0 - nt!ObpLookupObjectName+460
    	[ 85:95 ]
    1 error : !nt (fffff800031842b0)
    
    MODULE_NAME: memory_corruption
    
    IMAGE_NAME:  memory_corruption
    
    FOLLOWUP_NAME:  memory_corruption
    
    DEBUG_FLR_IMAGE_TIMESTAMP:  0
    
    MEMORY_CORRUPTOR:  ONE_BIT
    
    STACK_COMMAND:  .cxr 0xfffff88007a64ef0 ; kb
    
    FAILURE_BUCKET_ID:  X64_MEMORY_CORRUPTION_ONE_BIT
    
    BUCKET_ID:  X64_MEMORY_CORRUPTION_ONE_BIT
    
    Followup: memory_corruption
    ---------
    

    So we have a one-bit memory corruption that has overwritten code in the nt module. Besides the nice bucket ID keying us in to this, we can also see it in the disassembly. We know that the f6 byte failed to decode into a valid instruction, so the code corruption must be immediately before or after that point. If we disassemble starting just before the faulting IP in the crash, we get this:

    2: kd> u nt!ObpLookupObjectName+0x457
    nt!ObpLookupObjectName+0x457:
    fffff800`031842a7 488bb42428010000 mov     rsi,qword ptr [rsp+128h]
    fffff800`031842af 4895            xchg    rax,rbp
    fffff800`031842b1 f6              ???
    fffff800`031842b2 0f8544060000    jne     nt!ObpLookupObjectName+0xaac (fffff800`031848fc)
    fffff800`031842b8 b801000000      mov     eax,1
    fffff800`031842bd 488d942490000000 lea     rdx,[rsp+90h]
    fffff800`031842c5 440fb6c8        movzx   r9d,al
    fffff800`031842c9 458bc6          mov     r8d,r14d
    

    Going to a known good copy of the same OS (Windows 7 x64 SP1) and disassembling that same location, we get this:

    1: kd> u nt!ObpLookupObjectName+457
    nt!ObpLookupObjectName+0x457:
    fffff800`02bc6357 488bb42428010000 mov     rsi,qword ptr [rsp+128h]
    fffff800`02bc635f 4885f6          test    rsi,rsi
    fffff800`02bc6362 0f8544060000    jne     nt!ObpLookupObjectName+0xaac (fffff800`02bc69ac)
    fffff800`02bc6368 b801000000      mov     eax,1
    fffff800`02bc636d 488d942490000000 lea     rdx,[rsp+90h]
    fffff800`02bc6375 440fb6c8        movzx   r9d,al
    fffff800`02bc6379 458bc6          mov     r8d,r14d
    fffff800`02bc637c 498bcd          mov     rcx,r13
    

    Ignoring the different addresses due to the different module base addresses, we can see that the second instruction is incorrect. The xchg rax, rbp; ??? should be a test rsi,rsi. Looking at the code bytes, we should have 0x4885f6 but instead we have 0x4895f6. There’s the single-bit corruption!
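
    If you want to double-check that this really is a single flipped bit, WinDbg’s expression evaluator can XOR the two sets of code bytes (treated here as plain integers – nothing is read from the dump); the lone set bit in the result is the flipped one. A quick sketch:

    2: kd> ? 0x4885f6 ^ 0x4895f6
    Evaluate expression: 4096 = 00000000`00001000

    The single set bit, 0x1000, falls in bit 4 of the middle byte – exactly the 85 vs. 95 difference that !chkimg flagged.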

    Let’s Try Throwing Money at the Problem

    This was another strike in the faulty-hardware column for me. The next thing I did was have them run the built-in Windows Memory Diagnostic Tool, configured to use the Extended test with Default caching and an unlimited number of passes. I was hoping for an easy win at this point because RAM is cheap and easy to install. After a roughly 18-hour test run, however, nothing had shown up. The next thing I tried was a CPU stress test using Prime95 to see if the CPU was flaking out when it heated up. I had them use the ‘Blend’ setting, which uses lots of RAM in addition to stressing the CPU, with 4 threads since the machine has a 4-core i5 750 that lacks hyperthreading. After running for a few hours it was still going strong without any signs of trouble.

    Going back to my last visit when I cleaned the case out, the video card was the only thing that had a significant amount of dust on it. Video cards make heavy use of RAM and DMA, so a hardware fault there could directly corrupt physical memory, which would appear to show up in random places in the virtual address space of the kernel and processes on the system. That fact, plus the lack of a good GPU diagnostic tool, led me to just order them a Radeon HD 6450 to replace their existing ATI Radeon HD 4350. I’ve also had several of those same ATI Radeon HD 4350 cards from the same manufacturer die on me at work, which further added to my suspicion. A few days later it arrived and they installed it. All seemed to be well for a couple of weeks, and I was beginning to think we had nipped this one. Apparently the machine had other plans, though, as it crashed yet again.

    Deep Dive: Level 3

    Having previously enabled kernel crash dumps, I was now armed with both a mini-dump and a kernel memory dump for this latest crash. Zipped, the kernel memory dump was only 144MB, which was easily uploaded to me overnight, even on the slow DSL connection. I once again started off with !analyze -v:

    3: kd> !analyze -v
    *******************************************************************************
    *                                                                             *
    *                        Bugcheck Analysis                                    *
    *                                                                             *
    *******************************************************************************
    
    PAGE_FAULT_IN_NONPAGED_AREA (50)
    Invalid system memory was referenced.  This cannot be protected by try-except,
    it must be protected by a Probe.  Typically the address is just plain bad or it
    is pointing at freed memory.
    Arguments:
    Arg1: fffff8800764a6c0, memory referenced.
    Arg2: 0000000000000001, value 0 = read operation, 1 = write operation.
    Arg3: fffff960006b9eca, If non-zero, the instruction address which referenced the bad memory
    	address.
    Arg4: 0000000000000000, (reserved)
    
    Debugging Details:
    ------------------
    
    
    Could not read faulting driver name
    
    WRITE_ADDRESS:  fffff8800764a6c0 
    
    FAULTING_IP: 
    cdd+9eca
    fffff960`006b9eca 48898c2490000000 mov     qword ptr [rsp+90h],rcx
    
    MM_INTERNAL_CODE:  0
    
    CUSTOMER_CRASH_COUNT:  1
    
    DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT
    
    BUGCHECK_STR:  0x50
    
    PROCESS_NAME:  OUTLOOK.EXE
    
    CURRENT_IRQL:  0
    
    TRAP_FRAME:  fffff8800744a4a0 -- (.trap 0xfffff8800744a4a0)
    NOTE: The trap frame does not contain all registers.
    Some register values may be zeroed or incorrect.
    rax=fffff900c0151020 rbx=0000000000000000 rcx=0000000000000060
    rdx=fffff8800744aa50 rsi=0000000000000000 rdi=0000000000000000
    rip=fffff960006b9eca rsp=fffff8800744a630 rbp=fffff8800744a6e0
     r8=0000000000000001  r9=fffff8800744af50 r10=fffff900c0151080
    r11=fffff8800744a6e0 r12=0000000000000000 r13=0000000000000000
    r14=0000000000000000 r15=0000000000000000
    iopl=0         nv up ei pl nz na pe nc
    cdd+0x9eca:
    fffff960`006b9eca 48898c2490000000 mov     qword ptr [rsp+90h],rcx ss:0018:fffff880`0744a6c0=fffff8800744aa50
    Resetting default scope
    
    LAST_CONTROL_TRANSFER:  from fffff80002e8deec to fffff80002ee5fc0
    
    STACK_TEXT:  
    fffff880`0744a338 fffff800`02e8deec : 00000000`00000050 fffff880`0764a6c0 00000000`00000001 fffff880`0744a4a0 : nt!KeBugCheckEx
    fffff880`0744a340 fffff800`02ee40ee : 00000000`00000001 fffff880`0764a6c0 fffff6fc`80610800 fffff880`0744aa50 : nt! ?? ::FNODOBFM::`string'+0x4514f
    fffff880`0744a4a0 fffff960`006b9eca : 00000000`00000000 fffff880`0744ab70 fffff900`c0145020 fffff900`c1daf9c8 : nt!KiPageFault+0x16e
    fffff880`0744a630 00000000`00000000 : fffff880`0744ab70 fffff900`c0145020 fffff900`c1daf9c8 00000000`0000056a : cdd+0x9eca
    
    
    STACK_COMMAND:  kb
    
    FOLLOWUP_IP: 
    cdd+9eca
    fffff960`006b9eca 48898c2490000000 mov     qword ptr [rsp+90h],rcx
    
    SYMBOL_STACK_INDEX:  3
    
    SYMBOL_NAME:  cdd+9eca
    
    FOLLOWUP_NAME:  MachineOwner
    
    MODULE_NAME: cdd
    
    IMAGE_NAME:  cdd.dll
    
    DEBUG_FLR_IMAGE_TIMESTAMP:  0
    
    FAILURE_BUCKET_ID:  X64_0x50_cdd+9eca
    
    BUCKET_ID:  X64_0x50_cdd+9eca
    
    Followup: MachineOwner
    ---------
    

    This bugcheck is different from the last one we got, but we’ve seen that this machine has issued bugcheck 0x50 stop codes before, via the Windows Event Logs I investigated in Part 1. The faulting IP this time is in cdd.dll, which is the Microsoft Canonical Display Driver. Although the bugcheck name is PAGE_FAULT_IN_NONPAGED_AREA, we can see that our current IRQL is 0 (Passive), where page faults are allowed, so the problem is that the referenced memory address is simply invalid. We can see that it was a write operation to fffff8800764a6c0 and that the faulting IP was fffff960006b9eca. Viewing the trap frame to see the disassembly of the faulting instruction and the current values of the registers, we see this:

    3: kd> .trap 0xfffff8800744a4a0
    NOTE: The trap frame does not contain all registers.
    Some register values may be zeroed or incorrect.
    rax=fffff900c0151020 rbx=0000000000000000 rcx=0000000000000060
    rdx=fffff8800744aa50 rsi=0000000000000000 rdi=0000000000000000
    rip=fffff960006b9eca rsp=fffff8800744a630 rbp=fffff8800744a6e0
     r8=0000000000000001  r9=fffff8800744af50 r10=fffff900c0151080
    r11=fffff8800744a6e0 r12=0000000000000000 r13=0000000000000000
    r14=0000000000000000 r15=0000000000000000
    iopl=0         nv up ei pl nz na pe nc
    cdd+0x9eca:
    fffff960`006b9eca 48898c2490000000 mov     qword ptr [rsp+90h],rcx ss:0018:fffff880`0744a6c0=fffff8800744aa50
    

    Hrmn! All this instruction is trying to do is write the value of rcx to a stack location, and it’s essentially taking an access violation. A buffer overrun is one way this can occur, but you generally don’t see that with a static stack offset. Let’s check the math on this calculation:

    rsp=fffff8800744a630
    rsp+90 => fffff8800744a630 + 90 => fffff8800744a6c0

    So the address of the stack write should be fffff8800744a6c0; however, we see from argument one of the bugcheck that it was fffff8800764a6c0. Viewing these in 64-bit binary representation, we see:

    fffff8800744a6c0 (correct calculation)
    1111 1111 1111 1111 1111 1000 1000 0000 0000 0111 0100 0100 1010 0110 1100 0000
    fffff8800764a6c0 (actual access attempt)
    1111 1111 1111 1111 1111 1000 1000 0000 0000 0111 0110 0100 1010 0110 1100 0000
                                                        ^
    

    Notice that bit 21 is incorrectly set in the actual access attempt. We have another single-bit memory corruption!
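
    The same trick works on the addresses themselves. XORing the intended target against the address the bugcheck reported leaves only the flipped bit; a quick sketch using the values from above:

    3: kd> ? fffff880`0744a6c0 ^ fffff880`0764a6c0
    Evaluate expression: 2097152 = 00000000`00200000

    0x200000 is 1 shifted left 21 places, matching the bit highlighted above.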

    At this point I am a bit puzzled. In my experience, CPU failures are generally pretty rare, and if a CPU is going to go bad, it does so very early on (this machine is over 3 years old). I’ve already replaced the video card. I’m leaning towards this being a faulty RAM issue, although it could be a motherboard issue. However, as I recall from my computer organization courses in college, the way that mov instruction executes is that the arithmetic logic unit (ALU) will get the value from the RSP register, add 0x90, and then pass the resulting address directly to the memory-management unit (MMU) via its select lines, indicating a write operation with the value in RCX. The various processor caches (L1, L2, L3, etc.) are of course involved, but the main point is that I don’t recall main system memory (RAM) being involved in the calculation of the write address. The address of the fault is a virtual address, and the output from the MMU to the RAM should be a physical address. Said another way, although RAM is the target of the write, I don’t recall there being any RAM access required to look up the target address. The only things I could think of were:

    1. During the hardware-assisted conversion of the virtual address to a physical address, the various structures (PXE, PDE, PTE, PFN) weren’t present in the translation look-aside buffer (TLB), which resulted in a page walk, introducing RAM accesses and thus an opportunity for the target address to be corrupted.
    2. The problem isn’t RAM at all, but the motherboard or CPU or Power Supply Unit (PSU) or ….

    Let’s Throw Some More Money at the Problem

    Motherboards are fairly pricey, especially OEM boards, and they are difficult for the average Joe to replace. The CPU in this machine is a Core i5 750, which is still plenty fast for what they need. However, it uses the old LGA 1156 socket, which is no longer made. As a result, a replacement processor was going to be around $200, and it’s also difficult for the average Joe to replace. So, although the last memory test didn’t show anything, I went ahead and ordered two 4GB DIMMs to replace the existing four 1GB DIMMs.

    Driver Verifier

    I also noticed a few drivers on the system that I didn’t recognize or that were known to cause issues in the past (based on various Google searches). Windows has a built-in utility called Driver Verifier. Per MSDN – “Driver Verifier monitors kernel-mode drivers to detect incorrect function calls or actions that might corrupt the system.” I went ahead and had them enable Driver Verifier using the standard settings on the following list of drivers (a command-line sketch for doing the same thing follows the list):

    1. 000 – Associated with Cyberlink Power DVD. Fishy module name most likely associated with DRM.
    2. mcupdate – McAfee update service. McAfee isn’t installed on the box. Update: After looking at the loaded drivers in Process Explorer, the full name of this module is mcupdate_GenuineIntel.dll and it’s the Intel Microcode Update Library from Microsoft Corporation.
    3. ElbyCDIO – VirtualClone drive
    4. VClone – VirtualClone drive
    5. fowp64 – This one is really fishy. It’s a driver that injects into every process and hooks file accesses. It has a corresponding user-mode service and is part of something called FileOpen Manager Service. Google searches are turning up very little, and it’s installed into its own Program Files directory, so I don’t know what application it’s associated with.
    6. rikvm – Associated with Cyberlink Power DVD. Google searches turn up a lot of talk of this being a rootkit and people being unhappy with Cyberlink about it.
    7. psi_mf – Associated with Secunia PSI
    8. pnarp – Associated with Cisco Pure Networks
    9. k57nd60a – Broadcom Ethernet Driver
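
    For reference, the same configuration can also be applied from an elevated command prompt rather than clicking through verifier.exe’s GUI. A rough sketch – the .sys file names below are guesses based on the module names above and would need to match the actual driver files on the machine:

    verifier /standard /driver fowp64.sys elbycdio.sys vclone.sys rikvm.sys
    verifier /querysettings
    verifier /reset

    A reboot is required before the new settings take effect, and /reset turns verification back off once you’re done.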

    Wait and See

    Now that I still have kernel crash dumps enabled, Driver Verifier turned on, the video card replaced, all the drivers and applications fully updated, and some new RAM on the way, it’s time to just wait and see. Hopefully the new RAM takes care of the issue, in which case I’ll update this post with a “Hooray!”. If not, then it will be on to Part 3. Until then….

    Update 2013-03-08: Within 12 hours of installing the new RAM, the machine bugchecked 3 times. All the new bugchecks point at single-bit memory corruption. Going back to the mov qword ptr [rsp+90h],rcx with a stuck bit, I just can’t see how this could be anything other than the CPU. At this point I’m hunting eBay for a Core i5-750 LGA 1156 to give that a go. More when it’s available….

    Update 2013-06-13: After swapping out the processor almost 4 months ago, there have been zero crashes. Looks like we finally got this one beat!

  • Ghost in the Machine – Debugging an Ornery Windows Bluescreen Problem – Part 1

    Background

    I’ve been helping a family member figure out why their computer has been crashing off and on for several months. The machine was just over 3 years old when it started having issues. It’s a Dell Studio XPS 8000 that otherwise had been rock solid up to that point. During a previous visit I replaced the hard drive and upgraded the OS from Windows Vista x64 to Windows 7 x64 via a clean install. The machine case has been cleaned inside and out periodically, is relatively free of dust, and has adequate ventilation. None of the case fans or CPU fans show signs of failure. The video card fan was pretty dusty and sometimes squeaked, but seemed to settle down after a little bit of canned air.

    When the crashes first started in October 2012, I went through the usual deal. I checked the Event Log to see if anything jumped out as the culprit, and nothing did. In fact, there was neither a BugCheck (Event ID 1001) entry nor a minidump, and the Kernel-Power (Event ID 41) entry had 0 for all of its bugcheck info. I identified the most recently installed program that included a driver and uninstalled it, with no effect. I then updated all the drivers on the machine, updated the BIOS, and used Secunia’s PSI to make sure the OS and all its applications were fully up to date.

    The crashes continued to happen periodically – sometimes multiple times a day, sometimes not for several days. To up the difficulty level, the machine is located thousands of miles away on a rather slow DSL line.

    Deep Dive: Level 1

    Since I didn’t have physical access to the machine, I was going to have to be resourceful to get the information I needed to start debugging this. The first thing I tried was using Fog Creek’s Copilot to remote into the machine and poke around. I’ve generally found Copilot to work very well when I’m providing remote assistance. However, triaging a machine over a 768Kbps DSL line with latency north of 300ms ultimately proved painful. Certain operations, like scrolling through the Event Viewer logs, are just brutal on screen-diffing algorithms. I switched tactics and had them install Microsoft’s Sysinternals Suite, run Autoruns and Process Explorer, and export both of their results for me. I also had them export the Application and System logs from Event Viewer so that I could view them on my local machine. Unfortunately, almost all of the crash dumps had already been auto-submitted to Microsoft and purged from the file system. The one crash dump that was still around was a mini-dump, so I went ahead and enabled kernel memory dumps so that any future crash would be captured with more (hopefully useful) information.

    My initial inspection of the single mini-dump didn’t yield any obvious answer (more on that in Part 2), so I started looking at the event log. There had been 36 Kernel-Power (Event ID 41) events in a two-month period (October 9th – December 9th, 2012). If you view the Details tab of the event properties, you can see the bugcheck code and parameters. The vast majority of the 36 events were BugCheckCode 0 with a PowerButtonTimestamp of 0, which indicates either a hard hang of the machine or a power outage. (Had the machine been powered off by using the power button, you would generally get a PowerButtonTimestamp value.) The 8 events that did have BugCheckCode entries had several different stop conditions, as shown below. NOTE: Rather annoyingly, Event ID 41 displays the stop code in decimal, not hex. I’ve put the hex value in [] next to the decimal value for clarity.

    1x Bugcheck 80  [0x50] (0xfffff80022fb4fe0, 0x0, 0xfffff80002e5586b, 0x2)
    1x Bugcheck 59  [0x3B] (0xc000001d, 0xfffff9600007a7e0, 0xfffff8800716df50, 0x0)
    1x Bugcheck 194 [0xC2] (0x7, 0x109b, 0x4323440, 0xfffffa8003649c70)
    2x Bugcheck 209 [0xD1] (0x0, 0x2, 0x8, 0x0)
    1x Bugcheck 30  [0x1E] (0xffffffffc000001d, 0xfffff80002f00c29, 0x0, 0xfffffa8005476c00)
    1x Bugcheck 80  [0x50] (0xfffff12014be480c, 0x1, 0xfffff8800121899d, 0x7)
    1x Bugcheck 59  [0x3B] (0xc0000005, 0xfffff800031842b1, 0xfffff88007a64ef0, 0x0)
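
    Incidentally, if you have any WinDbg session open, the 0n decimal prefix makes the decimal-to-hex translation quick; for example, converting the 194 and 209 entries from the list above:

    1: kd> ? 0n194
    Evaluate expression: 194 = 00000000`000000c2
    1: kd> ? 0n209
    Evaluate expression: 209 = 00000000`000000d1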
    

    There were also 8 BugCheck (Event ID 1001) events in that time frame:

    11/05/2012 3:21:23 PM  The bugcheck was: 0x00000050 (0xfffff80022fb4fe0, 0x0000000000000000, 0xfffff80002e5586b, 0x0000000000000002).
    11/16/2012 10:16:35 AM The bugcheck was: 0x0000003b (0x00000000c000001d, 0xfffff9600007a7e0, 0xfffff8800716df50, 0x0000000000000000).
    11/18/2012 4:39:51 PM  The bugcheck was: 0x000000c2 (0x0000000000000007, 0x000000000000109b, 0x0000000004323440, 0xfffffa8003649c70). 
    11/18/2012 4:41:38 PM  The bugcheck was: 0x000000d1 (0x0000000000000000, 0x0000000000000002, 0x0000000000000008, 0x0000000000000000). 
    11/30/2012 12:09:51 AM The bugcheck was: 0x000000d1 (0x0000000000000000, 0x0000000000000002, 0x0000000000000008, 0x0000000000000000). 
    12/02/2012 3:47:26 PM  The bugcheck was: 0x0000001e (0xffffffffc000001d, 0xfffff80002f00c29, 0x0000000000000000, 0xfffffa8005476c00). 
    12/06/2012 1:33:05 PM  The bugcheck was: 0x00000050 (0xfffff12014be480c, 0x0000000000000001, 0xfffff8800121899d, 0x0000000000000007). 
    12/09/2012 5:17:29 PM  The bugcheck was: 0x0000003b (0x00000000c0000005, 0xfffff800031842b1, 0xfffff88007a64ef0, 0x0000000000000000). 
    

    Nothing initially jumped out at me because the stop codes varied and I didn’t have any crash dumps to correlate them with. I started working through the stop code documentation for each event to see if there was some commonality.

    Stop Code 0xC2

    I could see by the timestamps that the two events with stop code 0xC2 (one Event ID 41 and one Event ID 1001) were referring to the same crash. I referred to the MSDN documentation and the bugcheck parameters to see what type of violation had occurred. The answer is a double free: “The current thread attempted to free the pool which was already freed”. The address it was trying to double-free was 0xfffffa8003649c70 and its pool header contents were 0x0000000004323440. Using a kernel debugging session in WinDbg on a Windows 7 x64 target machine, we can see what a pool header looks like:

    1: kd> dt nt!_POOL_HEADER -v
    struct _POOL_HEADER, 9 elements, 0x10 bytes
       +0x000 PreviousSize     : Bitfield Pos 0, 8 Bits
       +0x000 PoolIndex        : Bitfield Pos 8, 8 Bits
       +0x000 BlockSize        : Bitfield Pos 16, 8 Bits
       +0x000 PoolType         : Bitfield Pos 24, 8 Bits
       +0x000 Ulong1           : Uint4B
       +0x004 PoolTag          : Uint4B
       +0x008 ProcessBilled    : Ptr64 to struct _EPROCESS, 138 elements, 0x4d0 bytes
       +0x008 AllocatorBackTraceIndex : Uint2B
       +0x00a PoolTagHash      : Uint2B
    

    Therefore, taking the pool header as a quadword (0x0000000004323440), we get its sequence of bytes (in hex, little-endian) as: 40 34 32 04 00 00 00 00. Placing these into the _POOL_HEADER fields, we get:

    PreviousSize: 0x40
    PoolIndex: 0x34
    BlockSize: 0x32 (size of allocation is thus 0x320 or 0n800)
    PoolType: 0x4 (session pool)
    PoolTag: 0x00000000
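
    The same decomposition can be double-checked with WinDbg’s expression evaluator by masking and shifting the quadword directly – a minimal sketch, using the bitfield offsets from the dt output above (any kd session will do; nothing is read from a dump):

    1: kd> ? 0x0000000004323440 & 0xff
    Evaluate expression: 64 = 00000000`00000040
    1: kd> ? (0x0000000004323440 >> 0x8) & 0xff
    Evaluate expression: 52 = 00000000`00000034
    1: kd> ? (0x0000000004323440 >> 0x10) & 0xff
    Evaluate expression: 50 = 00000000`00000032
    1: kd> ? (0x0000000004323440 >> 0x18) & 0xff
    Evaluate expression: 4 = 00000000`00000004

    Those are PreviousSize, PoolIndex, BlockSize, and PoolType respectively; the PoolTag field is the next 4 bytes, which are all zero here.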

    I was hoping for a PoolTag value there to help point me in the correct direction, but alas, dead end. I also tried searching the session pool for allocations of size 0x320 on one of my Windows 7 x64 VMs but didn’t turn up anything interesting. What we do know is that the double free can be caused by either a faulty driver or memory corruption and thus our problem could be either.

    Stop Code 0xD1

    I turned my attention next to the 0xD1 stop code. The MSDN documentation states that this occurs when a page fault occurred at an IRQL too high (>= DISPATCH_LEVEL). Parameter 1 is the memory referenced, Parameter 2 is the IRQL at the time of the reference, Parameter 3 is the access type (read, write, or execute) and Parameter 4 is the address that referenced the memory. Applying that to these bugchecks we see:

    0x000000d1 (0x0000000000000000, 0x0000000000000002, 0x0000000000000008, 0x0000000000000000)
    Memory Referenced: 0x0 (NULL)
    Current IRQL: 0x2 (DISPATCH_LEVEL)
    Operation: 0x8 Execute
    Faulting Address: 0x0

    At first glance it looks like we have a NULL pointer dereference from a function sitting at the NULL page, which doesn’t make sense. Referring back to the MSDN page though, we see:

    If the first parameter has the same value as the fourth parameter, and the third parameter indicates an execute operation, this bug check was likely caused by a driver that was trying to execute code when the code itself was paged out. Possible causes for the page fault include the following:

    • The function was marked as pageable and was running at an elevated IRQL (which includes obtaining a lock).
    • The function call was made to a function in another driver, and that driver was unloaded.
    • The function was called by using a function pointer that was an invalid pointer.

    This seems to indicate that a faulty driver is the most likely cause of this bugcheck, although it’s plausible that memory corruption led to this error as well.

    Stop Code 0x3B

    Moving on to the 0x3B stop code, I again refer to the MSDN documentation. Applying this info, along with the documentation on NTSTATUS values, to the 0x3B bugcheck here, we have:

    0x0000003b (0x00000000c000001d, 0xfffff9600007a7e0, 0xfffff8800716df50, 0x0000000000000000)
    Exception: 0xC000001D (STATUS_ILLEGAL_INSTRUCTION) – {EXCEPTION} Illegal Instruction. An attempt was made to execute an illegal instruction.
    Faulting Address: 0xfffff9600007a7e0
    Exception Context Record (CXR): 0xfffff8800716df50

    Since I don’t have a crash dump for this one, I can’t be sure. However, in my experience, session-specific drivers like win32k.sys and cdd.dll are generally loaded into the 0xfffff960`00000000 address range (vs. the 0xfffff880`00000000 and 0xfffff800`00000000 ranges that system-wide drivers are loaded into). I haven’t found any documentation on this behavior. Assuming that the address that caused this exception was in a Microsoft driver like win32k.sys or cdd.dll, it’s very unlikely that there is an illegal-instruction bug in the code itself. Rather, this is more likely corruption of code from a bad driver, a bad device using DMA, or bad memory.

    Stop Code 0x1E

    Next up was the 0x1E stop code. Glancing at its documentation we get:

    0x0000001e (0xffffffffc000001d, 0xfffff80002f00c29, 0x0000000000000000, 0xfffffa8005476c00)
    Exception Code: 0xC000001D (STATUS_ILLEGAL_INSTRUCTION) – {EXCEPTION} Illegal Instruction. An attempt was made to execute an illegal instruction.
    Exception Address: 0xfffff80002f00c29
    Parameter 0 of Exception: 0x0
    Parameter 1 of Exception: 0xfffffa8005476c00

    So a few things here. First of all, I believe that the first parameter translates to STATUS_ILLEGAL_INSTRUCTION because NTSTATUS values are 32 bits even on 64-bit architectures, so since 0xC000001D is a negative number, it’s being sign-extended to 0xffffffffc000001d. Second, in my experience, modules like nt and hal get loaded in the 0xfffff800`00000000 range. Again, it seems unlikely that nt or hal would have an illegal-instruction bug in them. The MSDN documentation for 0x1E speaks quite a bit about hardware issues and faulty device drivers and recommends you run a memory test. It also mentions a BIOS issue. However, back in October when I was first triaging this issue, I updated the BIOS and the release notes didn’t mention any issues that sounded related to this. Since the crashes occurred both before and after the upgrade, it’s unlikely that the BIOS is the cause here.
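
    If you want to convince yourself of that sign-extension explanation, masking the 64-bit value back down to 32 bits in WinDbg’s expression evaluator recovers the familiar NTSTATUS; a quick sketch:

    1: kd> ? ffffffff`c000001d & 0xffffffff
    Evaluate expression: 3221225501 = 00000000`c000001d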

    Stop Code 0x50

    The final stop code in the list is 0x50. Taking a gander at the MSDN page we again see discussion about faulty hardware:

    Bug check 0x50 usually occurs after the installation of faulty hardware or in the event of failure of installed hardware (usually related to defective RAM, be it main memory, L2 RAM cache, or video RAM).
    Another common cause is the installation of a faulty system service.
    Antivirus software can also trigger this error, as can a corrupted NTFS volume.

    For completeness’ sake, the particular 0x50 bugchecks we had were:

    0x00000050 (0xfffff80022fb4fe0, 0x0000000000000000, 0xfffff80002e5586b, 0x0000000000000002)
    Memory Referenced: 0xfffff80022fb4fe0
    Operation: 0x0 (Read)
    Faulting Address: 0xfffff80002e5586b (likely nt or hal module)

    0x00000050 (0xfffff12014be480c, 0x0000000000000001, 0xfffff8800121899d, 0x0000000000000007)
    Memory Referenced: 0xfffff12014be480c
    Operation: 0x1 (Write)
    Faulting Address: 0xfffff8800121899d (unknown)

    That 0xfffff12014be480c looks suspicious to me, because in my experience, I don’t see many memory allocations in the 0xfffff120`00000000 range.

    Where to next?

    Now that I had exhausted the bugcheck information from the Windows Event Logs and still didn’t have a clear answer, where do I head next? Stay tuned for Part 2 where I start digging into the crash dumps in more detail.