Tuesday, September 12, 2017

Cool Advanced Troubleshooting Technique - WinDbg

I was having some problems with my ThinkPad laptop hanging. It didn't blue-screen; it just stopped updating the screen and did not reboot. It seemed that either the entire video layer had hung, or the kernel itself had hung (for example, waiting on a spinlock forever), or a critical subsystem of my computer had failed. Since Windows uses virtual memory heavily, a hang like this is also commonly caused by a storage layer failure, such as a failing primary boot disk.

What's relevant to this Delphi blog is that it was Delphi's own debugger that most commonly triggered the hang. Exiting the debugger by terminating a debug session was the most common way to hang the system.

I've watched a few episodes of Defrag Tools, a show on MSDN Channel 9. This is some serious low-level technical know-how that these folks are dishing out. If you're into it, go check them out.

Anyway, after a bit of troubleshooting, I decided to try switching out various driver versions to see if I could make the hang go away. After installing the OEM Intel HD 520 graphics drivers from Intel's website, I was able to work inside Delphi all day today with NO crashes.

This is not my first "Delphi + Video Driver" hell experience. About two years ago, I had a crazy set of bugs in Delphi that only reproduced with certain ATI/AMD video cards, and only in the parts of my Delphi application that used Microsoft Direct2D/DirectX. Direct2D surfaces would just stop rendering, and whatever parts of my Delphi application were using Direct2D APIs just stopped working. A couple of years before that, I saw similar problems with GDI+, also on ATI/AMD video cards. And going way back, about 20 years ago in the Windows 98 era, I remember bugs in the ATI video cards of the day that broke the ability of the Win32 GDI layer to render transparent bitmaps.

But this is the first time I've had a Delphi program, or Delphi itself, plus a bad video driver, actually hang my entire system, requiring a cold reboot to recover.

More details on my question on superuser.com here.

Some suggested ways to get started:

1. Install the Windows 10 SDK to get WinDbg installed on your computer.
2. On a system which generates BSODs, enable memory dumps under System Properties -> Advanced -> Startup and Recovery. On systems which hang without a BSOD, learn how to enable the keyboard method of forcing a crash dump (hold the right Ctrl key and press Scroll Lock twice).
3. Watch a few Defrag Tools episodes to get the flavor of how to use WinDbg.
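
For what it's worth, the dump settings in step 2 come down to a few registry values. Here is a rough Python sketch that only builds the `reg add` command lines and prints them, so you can read them over and run them yourself from an elevated prompt. The helper name `build_reg_commands` is made up for this post. The keys and values (`CrashDumpEnabled`, `CrashOnCtrlScroll`) are the documented ones as far as I know, but check Microsoft's documentation before changing anything, and expect to reboot before the settings take effect.

```python
# Sketch only: builds reg.exe command lines, does not execute anything.
CRASH_CONTROL = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"
KEYBOARD_DRIVERS = ("i8042prt", "kbdhid")  # PS/2 and USB keyboard drivers

# Values of CrashDumpEnabled for each kind of dump
DUMP_TYPES = {"complete": 1, "kernel": 2, "small": 3, "automatic": 7}


def build_reg_commands(dump_type="kernel", keyboard_crash=True):
    """Return the reg.exe commands for the chosen dump type (hypothetical helper)."""

    def reg_add(key, name, value):
        return f'reg add "{key}" /v {name} /t REG_DWORD /d {value} /f'

    commands = [reg_add(CRASH_CONTROL, "CrashDumpEnabled", DUMP_TYPES[dump_type])]
    if keyboard_crash:
        # Lets right Ctrl + Scroll Lock (pressed twice) force a crash dump
        for driver in KEYBOARD_DRIVERS:
            key = rf"HKLM\SYSTEM\CurrentControlSet\Services\{driver}\Parameters"
            commands.append(reg_add(key, "CrashOnCtrlScroll", 1))
    return commands


if __name__ == "__main__":
    for line in build_reg_commands("kernel"):
        print(line)
```

Starting with "kernel" or "small" rather than "complete" keeps the dump files manageable while you are learning.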

Note that full memory dumps are huge. If you have 64 gigs of RAM, a full dump will be more than 64 gigabytes in size. On my 1 TB disk, when my total free space is less than 16 gigs, a helpful Windows feature automatically deletes my dumps so that Windows doesn't fail to boot. Try to have at least twice your total memory free on your main hard drive before you try enabling full dumps. If you have 64 gigs of RAM, make sure your primary Windows boot disk has at least 128 gigs of free space. For just a bit of learning with WinDbg, enable one of the smaller, non-full memory dumps so you can get the flavor of the tool and try entering some commands.
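
The "twice your RAM" rule of thumb is easy to check with a few lines of Python. This is just a sketch: the function name is made up, the factor of 2 is my rule of thumb from above and not an official Microsoft number, and you have to fill in your own RAM size since the standard library has no portable way to read it.

```python
import os
import shutil

GIB = 1024 ** 3


def room_for_full_dump(ram_bytes, free_bytes, factor=2.0):
    """True if free disk space is at least `factor` times the installed RAM."""
    return free_bytes >= factor * ram_bytes


if __name__ == "__main__":
    ram_gib = 64  # assumption: replace with your own installed RAM
    # Use the Windows system drive if known, otherwise the filesystem root
    drive = os.environ.get("SystemDrive")
    path = drive + "\\" if drive else "/"
    free = shutil.disk_usage(path).free
    print(f"Free on {path}: {free / GIB:.1f} GiB")
    print("Enough room for a full dump:", room_for_full_dump(ram_gib * GIB, free))
```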

There are two "sides" to WinDbg: the kernel side and the user space side. On the kernel side, you can view call stacks and thread states, change which CPU you are looking at, list NT native kernel-level processes, and walk back in time to various saved states. On the user space side, you can see the Win32-level APIs and processes, and walk back to a time prior to the last saved exception to examine the state of your system before the last first-chance exception. Besides being able to see C/C++ symbols from non-managed DLL/EXE images, you can also use some extensions to see things from inside the CLR, so you can troubleshoot problems with .NET code.
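
To give a taste of the kernel side, here is a small Python sketch that builds a command line for `cdb.exe`, the console debugger that ships alongside WinDbg, to open a dump and run a few starter commands. It only builds the argument list and does not launch anything. It assumes `cdb.exe` is on your PATH and that your dump is at the default `C:\Windows\MEMORY.DMP` location. The function name is made up, and the command list is just my starter set, not a recommended procedure.

```python
# Starter commands for a kernel dump, run in order, then quit
KERNEL_TRIAGE = [
    "!analyze -v",   # automated analysis of the crash
    "~0s",           # switch to processor 0
    "k",             # call stack of the current thread
    "!process 0 0",  # brief list of all processes
    "q",             # quit the debugger
]


def build_cdb_args(dump_path, commands=KERNEL_TRIAGE):
    """Return an argv list that opens dump_path in cdb and runs the commands."""
    return ["cdb.exe", "-z", dump_path, "-c", "; ".join(commands)]


if __name__ == "__main__":
    args = build_cdb_args(r"C:\Windows\MEMORY.DMP")
    print(args)
    # To actually run it on a machine with the debugging tools installed:
    # import subprocess; subprocess.run(args)
```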

Unfortunately, the debug information format used by WinDbg is not the same debug information format that Delphi produces, so this tool is not very helpful for debugging Delphi application crashes, unless you don't mind working with uninstrumented disassembled code with no debug symbols.

1 comment:

  1. The reason why a full memory dump is greater than the amount of RAM your system has is because during a full memory dump Windows is also saving a copy of the Pagefile.sys, which by Windows recommendation is about 150% of the size of the system RAM but, depending on your system settings, can be even larger. I have seen Pagefile.sys sizes reach 10% of the system partition.