Ghost in the Machine – Debugging an Ornery Windows Bluescreen Problem – Part 2

Collecting Additional Data Points

Continuing where I left off in Part 1, I had identified that the issue was likely either faulty hardware or a bad driver – which pretty much describes the cause of every bugcheck. I was hoping for something a bit more precise. I did have the one mini-dump so I decided to tear into it a little bit more until I could get my hands on a kernel dump.

Deep Dive: Level 2

The first thing I did after loading the crash dump into WinDbg was see what !analyze -v had to say about it:

2: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SYSTEM_SERVICE_EXCEPTION (3b)
An exception happened while executing a system service routine.
Arguments:
Arg1: 00000000c0000005, Exception code that caused the bugcheck
Arg2: fffff800031842b1, Address of the instruction which caused the bugcheck
Arg3: fffff88007a64ef0, Address of the context record for the exception that caused the bugcheck
Arg4: 0000000000000000, zero.

Debugging Details:
------------------


EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

FAULTING_IP: 
nt!ObpLookupObjectName+461
fffff800`031842b1 f6              ???

CONTEXT:  fffff88007a64ef0 -- (.cxr 0xfffff88007a64ef0)
rax=fffffa80042bccc8 rbx=fffff80002e0f000 rcx=fffff8a00c73af12
rdx=000000000000006c rsi=0000000000000000 rdi=0000000000000001
rip=fffff800031842b1 rsp=fffff88007a658d0 rbp=fffffa80042bcb10
 r8=fffffa8003c6cf30  r9=fffffa8005fec268 r10=fffff8a000004ca0
r11=fffffa80040f6890 r12=0000000000000000 r13=fffff8a000004ca0
r14=0000000000000040 r15=fffff88007a65a40
iopl=0         nv up ei pl nz na po nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206
nt!ObpLookupObjectName+0x461:
fffff800`031842b1 f6              ???
Resetting default scope

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  CODE_CORRUPTION

BUGCHECK_STR:  0x3B

PROCESS_NAME:  svchost.exe

CURRENT_IRQL:  0

LAST_CONTROL_TRANSFER:  from fffff800031855f6 to fffff800031842b1

STACK_TEXT:  
fffff880`07a658d0 fffff800`031855f6 : 00000000`00000000 fffffa80`042bcb10 fffffa80`00000000 fffffa80`03d0a260 : nt!ObpLookupObjectName+0x461
fffff880`07a659c0 fffff800`03158d8c : fffff800`03013940 00000000`00000000 fffff8a0`0c445301 00000000`00000005 : nt!ObOpenObjectByName+0x306
fffff880`07a65a90 fffff800`03163ef2 : 00000000`03c9f980 00000000`00020019 00000000`03c9f990 000007fe`00000000 : nt!CmOpenKey+0x28a
fffff880`07a65be0 fffff800`02e8d253 : 00000000`00000008 000007fe`e8425c34 00000000`00000001 00000000`00000000 : nt!NtOpenKey+0x12
fffff880`07a65c20 00000000`771a143a : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`03c9f938 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x771a143a


CHKIMG_EXTENSION: !chkimg -lo 50 -d !nt
    fffff800031842b0 - nt!ObpLookupObjectName+460
	[ 85:95 ]
1 error : !nt (fffff800031842b0)

MODULE_NAME: memory_corruption

IMAGE_NAME:  memory_corruption

FOLLOWUP_NAME:  memory_corruption

DEBUG_FLR_IMAGE_TIMESTAMP:  0

MEMORY_CORRUPTOR:  ONE_BIT

STACK_COMMAND:  .cxr 0xfffff88007a64ef0 ; kb

FAILURE_BUCKET_ID:  X64_MEMORY_CORRUPTION_ONE_BIT

BUCKET_ID:  X64_MEMORY_CORRUPTION_ONE_BIT

Followup: memory_corruption
---------

So we have a one-bit memory corruption that has overwritten code in the nt module. Besides the nice bucket id to key us into this, we can also see it in the disassembly. We know that the f6 instruction has decoded into an invalid instruction, so the code corruption must be right before or right after that instruction. If we disassemble starting just before the faulting IP in the crash we get this:

2: kd> u nt!ObpLookupObjectName+0x457
nt!ObpLookupObjectName+0x457:
fffff800`031842a7 488bb42428010000 mov     rsi,qword ptr [rsp+128h]
fffff800`031842af 4895            xchg    rax,rbp
fffff800`031842b1 f6              ???
fffff800`031842b2 0f8544060000    jne     nt!ObpLookupObjectName+0xaac (fffff800`031848fc)
fffff800`031842b8 b801000000      mov     eax,1
fffff800`031842bd 488d942490000000 lea     rdx,[rsp+90h]
fffff800`031842c5 440fb6c8        movzx   r9d,al
fffff800`031842c9 458bc6          mov     r8d,r14d

Going to a known good copy of the same OS (Windows 7 x64 SP1) and disassembling that same location, we get this:

1: kd> u nt!ObpLookupObjectName+457
nt!ObpLookupObjectName+0x457:
fffff800`02bc6357 488bb42428010000 mov     rsi,qword ptr [rsp+128h]
fffff800`02bc635f 4885f6          test    rsi,rsi
fffff800`02bc6362 0f8544060000    jne     nt!ObpLookupObjectName+0xaac (fffff800`02bc69ac)
fffff800`02bc6368 b801000000      mov     eax,1
fffff800`02bc636d 488d942490000000 lea     rdx,[rsp+90h]
fffff800`02bc6375 440fb6c8        movzx   r9d,al
fffff800`02bc6379 458bc6          mov     r8d,r14d
fffff800`02bc637c 498bcd          mov     rcx,r13

Ignoring the different addresses due to the different module base addresses, we can see that the second instruction is incorrect. The xchg rax, rbp; ??? should be a test rsi,rsi. Looking at the code bytes, we should have 0x4885f6 but instead we have 0x4895f6. There’s the single-bit corruption!

Let’s Try Throwing Money at the Problem

This was another strike in the faulty hardware column for me. The next thing I did was have them run the built-in Windows Memory Diagnostic Tool. I had them configure it to use the Extended test with Default caching and an unlimited number of passes. I was hoping for an easy win at this point because RAM is cheap and easy to install. After about an 18 hrs test run however, nothing had shown up. The next thing I tried as a CPU stress-test using Prime95 to see if the CPU was flaking out when it heated up. I had them use the ‘Blend’ setting which uses lots of RAM in addition to stressing the CPU and 4 threads since the machine is a 4 Core i5 750 which lacks hyperthreading. After running for a few hours it will still running strong without any signs of trouble.

Going back to my last visit when I cleaned the case out, the video card was the only thing that had a significant amount of dust on it. Video cards make heavy use of RAM and use DMA so a fault in the hardware could result in direct corruption of physical memory that would appear to show up in random places in the virtual address space of the kernel and processes on the system. This fact plus the lack of a good GPU diagnostic tool led me to just order them a Radeon HD 6450 to replace their existing ATI Radeon HD 4350. I’ve also had several of those same ATI Radeon HD 4350 cards from the same manufacturer die on me at work as well, which further added to my suspicion. A few days later it arrived and they installed it. All seemed to be well for a couple of weeks and I was beginning to think we had nipped this one. Apparently the machine had other plans though, as it crashed yet again.

Deep Dive: Level 3

Having previously enabled kernel crash dumps, I was now armed with both a mini-dump and a kernel memory dump for this latest crash. Zipped, the kernel memory dump was only 144MB which was easily uploaded to me overnight, even on the slow DSL connection. I once again started off with !analyze -v:

3: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except,
it must be protected by a Probe.  Typically the address is just plain bad or it
is pointing at freed memory.
Arguments:
Arg1: fffff8800764a6c0, memory referenced.
Arg2: 0000000000000001, value 0 = read operation, 1 = write operation.
Arg3: fffff960006b9eca, If non-zero, the instruction address which referenced the bad memory
	address.
Arg4: 0000000000000000, (reserved)

Debugging Details:
------------------


Could not read faulting driver name

WRITE_ADDRESS:  fffff8800764a6c0 

FAULTING_IP: 
cdd+9eca
fffff960`006b9eca 48898c2490000000 mov     qword ptr [rsp+90h],rcx

MM_INTERNAL_CODE:  0

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

BUGCHECK_STR:  0x50

PROCESS_NAME:  OUTLOOK.EXE

CURRENT_IRQL:  0

TRAP_FRAME:  fffff8800744a4a0 -- (.trap 0xfffff8800744a4a0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffff900c0151020 rbx=0000000000000000 rcx=0000000000000060
rdx=fffff8800744aa50 rsi=0000000000000000 rdi=0000000000000000
rip=fffff960006b9eca rsp=fffff8800744a630 rbp=fffff8800744a6e0
 r8=0000000000000001  r9=fffff8800744af50 r10=fffff900c0151080
r11=fffff8800744a6e0 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
cdd+0x9eca:
fffff960`006b9eca 48898c2490000000 mov     qword ptr [rsp+90h],rcx ss:0018:fffff880`0744a6c0=fffff8800744aa50
Resetting default scope

LAST_CONTROL_TRANSFER:  from fffff80002e8deec to fffff80002ee5fc0

STACK_TEXT:  
fffff880`0744a338 fffff800`02e8deec : 00000000`00000050 fffff880`0764a6c0 00000000`00000001 fffff880`0744a4a0 : nt!KeBugCheckEx
fffff880`0744a340 fffff800`02ee40ee : 00000000`00000001 fffff880`0764a6c0 fffff6fc`80610800 fffff880`0744aa50 : nt! ?? ::FNODOBFM::`string'+0x4514f
fffff880`0744a4a0 fffff960`006b9eca : 00000000`00000000 fffff880`0744ab70 fffff900`c0145020 fffff900`c1daf9c8 : nt!KiPageFault+0x16e
fffff880`0744a630 00000000`00000000 : fffff880`0744ab70 fffff900`c0145020 fffff900`c1daf9c8 00000000`0000056a : cdd+0x9eca


STACK_COMMAND:  kb

FOLLOWUP_IP: 
cdd+9eca
fffff960`006b9eca 48898c2490000000 mov     qword ptr [rsp+90h],rcx

SYMBOL_STACK_INDEX:  3

SYMBOL_NAME:  cdd+9eca

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: cdd

IMAGE_NAME:  cdd.dll

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  X64_0x50_cdd+9eca

BUCKET_ID:  X64_0x50_cdd+9eca

Followup: MachineOwner
---------

This bugcheck is different from the last one we’ve got, but we’ve seen that this machine has been issuing bugcheck 0x50 stop codes via the Windows Event Logs I investigated in Part 1. The faulting IP this time is in cdd.dll, which is the Microsoft Canonical Display Driver. Although the bugcheck name is PAGE_FAULT_IN_NONPAGED_AREA, we can see that our current IRQL is 0 (Passive) where page faults are allowed. This indicates that the reference memory address is invalid. We can see that it was a write operation to fffff8800764a6c0 and that the faulting IP was fffff960006b9eca. Viewing the trap frame to see the disassembly of the faulting instruction and the current values of the registers, we see this:

3: kd> .trap 0xfffff8800744a4a0
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=fffff900c0151020 rbx=0000000000000000 rcx=0000000000000060
rdx=fffff8800744aa50 rsi=0000000000000000 rdi=0000000000000000
rip=fffff960006b9eca rsp=fffff8800744a630 rbp=fffff8800744a6e0
 r8=0000000000000001  r9=fffff8800744af50 r10=fffff900c0151080
r11=fffff8800744a6e0 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei pl nz na pe nc
cdd+0x9eca:
fffff960`006b9eca 48898c2490000000 mov     qword ptr [rsp+90h],rcx ss:0018:fffff880`0744a6c0=fffff8800744aa50

Hrmn! All this instruction is trying to do is write the value of rcx to a stack location and it’s essentially access violating. A buffer overrun is one way this can occur, but you generally don’t see that with a static stack offset. Let’s check the math on this calculation:

rsp=fffff8800744a630
rsp+90 => fffff8800744a630 + 90 => fffff8800744a6c0

So the address of the stack write should be fffff8800744a6c0 however we see from argument one of the bugcheck that it was fffff8800764a6c0. Viewing these in 64-bit binary representation we see:

fffff8800744a630 (correct calculation)
1111 1111 1111 1111 1111 1000 1000 0000 0000 0111 0100 0100 1010 0110 1100 0000
fffff8800764a630 (actual access attempt)
1111 1111 1111 1111 1111 1000 1000 0000 0000 0111 0110 0100 1010 0110 1100 0000
                                                    ^

Notice that bit 21 is incorrectly set in the actual access attempt. We have another single-bit memory corruption!

At this point I am a bit puzzled. In my experience, CPU failures are generally pretty rare and if the CPU is going to go bad, it does so very early on (this machine is >3yrs old). I’ve already replaced the video card. I’m leaning towards this being a faulty RAM issue although it could be a motherboard issue. However, from my recollection from my computer organization courses in college, the way that move instruction is going to execute is that that the arithmetic logic unit (ALU) will get the value from the RSP register, add 90, and then directly pass the resulting address to the memory-management unit (MMU) via its select lines, indicating a write operation and the value in RCX. The various processor caches (L1, L2, L3, etc.) are of course involved but the main point is that I don’t recall main system memory (RAM) being involved in the calculation of the write address. The address of the fault is a virtual address and the output from the MMU to the RAM should be a physical address. Said another way, although RAM is the target of the write, I don’t recall there being any RAM access required to lookup the target address. The only things I could think of were:

  1. During the hardware assisted conversion of the virtual address to a physical address, the various structures (PXE, PDE, PTE,PFN) weren’t present in the translation look-aside buffer (TLB) which resulted in a page walk, introducing RAM accesses which presents an opportunity for the target address to be corrupted.
  2. The problem isn’t RAM at all, but the motherboard or CPU or Power Supply Unit (PSU) or ….

Let’s Throw Some More Money at the Problem

Motherboards are fairly pricey, especially OEM boards and they are difficult for the average joe to replace. The CPU in this machine is a Core i5 750 which is still plenty fast for what they need. However, it’s the old LGA 1156 form factor, which they don’t make anymore. As a result, getting a replacement processor was going to be $200 and it’s still difficult for the average joe to replace. So, although the last memory test didn’t show anything, I went ahead and ordered 2 4GB DIMMS to replace the existing 4 1GB DIMMS.

Driver Verifier

I also noticed a few drivers on the system that I didn’t recognize or that have been known to cause issues in the past (based on various Google searches). Windows has a built-in utility called Driver Verifier. Per MSDN – “Driver Verifier monitors kernel-mode drivers to detect incorrect function calls or actions that might corrupt the system.” I went ahead and had them enable Driver Verifier using the standard settings on the following list of drivers:

  1. 000 – Associated with Cyberlink Power DVD. Fishy module name most likely associated with DRM.
  2. mcupdate – McAfee update service. McAfee isn’t installed on the box.Update: After looking at the loaded drivers in Process Explorer, the full name of this module mcupdate_GenuineIntel.dll and it’s the Intel Microcode Update Library from Microsoft Corporation
  3. ElbyCDIO – VirtualClone drive
  4. VClone – VirtualClone drive
  5. fowp64 – This one is really fishy. It’s a driver that injects into every process and hooks file accesses. It has a corresponding user-mode service and is part of something called FileOpen Manager Service. Google searches are turning up very little and it’s installed into it’s own Program Files directory so I don’t know what application it’s associated with.
  6. rikvm – Associated with Cyberlink Power DVD. Google searches turn up a lot of talk of this being a rootkit and people being unhappy with Cyberlink about it
  7. psi_mf – Associated with Secunia PSI
  8. pnarp – Associated with Cisco Pure Networks
  9. k57nd60a – Broadcom Ethernet Driver

Wait and See

Now that I have still have kernel crash dumps enabled, Driver Verifier turned on, the video card replaced, all the drivers and applications fully updated, and some new RAM on the way, it’s time to just wait and see. Hopefully the new RAM takes care of the issue in which case I’ll update this post with a “Hooray!”. If not, then it will be on to Part 3. Until then….

Update 2013-03-08:Within 12hrs of installing the new ram, the machine bug checked 3 times. All the new bug checks point at single bit memory corruption. Going back to the mov qword ptr [rsp+90h],rcx with a stuck bit, I just can’t see how this could be anything other than the CPU. At this point I’m hunting eBay for a Core i5-750 LGA 1156 to give that a go. More when it’s available….

Update 2013-06-13:After swapping out the processor almost 4 months ago, there have been zero crashes. Looks like we finally got this one beat!

Comments

3 responses to “Ghost in the Machine – Debugging an Ornery Windows Bluescreen Problem – Part 2”

  1. Kenzo Avatar
    Kenzo

    Thanks for this great pair of articles! This kind of detail about the detective-style process of troubleshooting helps us all. (Now, if only I had the same patience as you, perhaps I could isolate the cause of my STOP 0xC2 errors!)

  2. david Avatar
    david

    I inherited one of these machines from a client with a I7 1156 processor. He had more then the usual set of problems that he has with most machines.
    The memory tested good in 2 other machines (memtest), but would fail in in the original machine; It seemed that there was an intermittent address line
    problem. Saved the chip and the memory, recycled the motherboard.
    Was telling someone about this a few months later and they suggested reflashing the BIOS would correct the problem, don’t know. if this would have solved
    the problem.
    The entire XPS series has an interesting history with regards to what was promised, what was delivered, and what occurred.
    I find it suspicious that Dell had sold the XPS8000 for a relatively short period of time (about 3 months) and then abruptly removed it from sale.

    1. ZachB Avatar

      Interesting! Fortunately in my situation, the only issue we’ve had with the machine was this. After changing the processor, the machine has been rock solid for over two years now.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.