Wednesday, August 29, 2012

The one with BSOD

A few months ago I looked into a problem with a virtual machine that would not boot up. Well, it was booting up fine, but right before the Windows welcome screen it was BSODing with this screen:
 
A problem has been detected and Windows has been shut down to prevent damage to your computer (your standard text) and ...
A process or thread crucial to system operation has unexpectedly exited or been terminated.

and STOP code 0x000000F4.
 
The VM was a Windows 2008 server running on XenServer with shared storage. Right before it started to misbehave it had been powered down so that resources like RAM and disk space could be added. Once this was done and the machine was powered back on, it started to BSOD.

I'm usually happy when a computer BSODs rather than simply restarting, because there's a dump file you can debug. Sudden restarts usually suggest problems at a lower level, like hardware. However, as always, there's a catch. If your system BSODs while it is running, that's cool: take the dump, analyze it and take measures. You have access to the system, so you can deploy a fix easily. This one was crashing before I could reach any type of interface. Normal, safe mode, command line, last known good configuration...all crashing. Nevertheless, challenge accepted.

As expected, there was a dump file on our server (booted with a live CD), and although this issue didn't look like a Citrix one to me, I had to get Citrix involved to debug it. Even though VHD is a stable format and the technology has been around for some time now, I never ruled out VHD corruption, which could have led to problems. No luck: because it was a kernel dump, it did not provide enough data to the engineers at Citrix. Sometimes you need a full memory dump to catch everything (user data included). I know how to set a Windows OS to generate a full memory dump via the GUI itself...but how do you do it without GUI or registry access? I did it by booting the VM with a Windows live CD and mounting the SYSTEM hive from c:\windows\system32\config\. I then edited the CrashDumpEnabled flag as per
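For anyone wanting to repeat this kind of offline edit, here is a minimal sketch in Python of what the change could look like as a .reg file. It is not necessarily the exact procedure used here. The mount name OFFLINE_SYSTEM and the helper crash_dump_reg are made up for illustration, and it assumes the hive was loaded under HKLM (for example with regedit's Load Hive) and that ControlSet001 is the set to edit. The value meanings (0 = none, 1 = complete, 2 = kernel, 3 = small) are the documented CrashDumpEnabled settings.

```python
# Sketch: build .reg text that switches an offline SYSTEM hive to a complete
# memory dump. ASSUMPTIONS: the hive is loaded as HKLM\OFFLINE_SYSTEM (a
# hypothetical name) and the control set number is chosen by the caller.

# Documented CrashDumpEnabled values.
CRASH_DUMP_MODES = {
    "none": 0,
    "complete": 1,
    "kernel": 2,
    "small": 3,
}


def crash_dump_reg(mode, mount="OFFLINE_SYSTEM", control_set=1):
    """Return .reg file text that sets CrashDumpEnabled for one control set."""
    value = CRASH_DUMP_MODES[mode]
    key = "\\".join([
        "HKEY_LOCAL_MACHINE",
        mount,
        f"ControlSet{control_set:03d}",  # 1 -> ControlSet001
        "Control",
        "CrashControl",
    ])
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"CrashDumpEnabled"=dword:{value:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(crash_dump_reg("complete"))
```

The generated text can be saved to a file and imported from the live CD environment while the hive is still loaded. An offline hive has no CurrentControlSet, so the control set has to be named explicitly.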
 
 
Needless to say...even though the VM continued to BSOD and the screen was saying it was writing data to disk...no dump file was being created. Sad face, dead end.
As I was implementing this reg change I could not ignore the fact that CurrentControlSet was missing and all I had was ControlSet1 and 3. So, like every normal human out there who doesn't know something but wants to know about it, I went searching and found this - http://support.microsoft.com/kb/100010
Aha. Now everything clicks. (For those who don't want to read the MS KB: CurrentControlSet exists only while Windows is running. It is nothing more than a link to the control set the system booted from, ControlSet1 in this case. ControlSetx is your last known good configuration.)
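The mapping the KB describes can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, and the sample numbers are made-up data rather than values read from this VM. The Select key holds four numeric values (Current, Default, Failed, LastKnownGood), and each number names a ControlSetNNN key.

```python
# Sketch: interpret the values of the SYSTEM\Select key from an offline hive.
# The sample numbers at the bottom are invented, not taken from the VM.

def control_set_name(number):
    """Select stores plain numbers: 1 means ControlSet001, 3 means ControlSet003."""
    return f"ControlSet{number:03d}"


def describe_select(select):
    """Map each Select value to the control set key it names.

    A value of 0 (typical for Failed) means no control set is assigned.
    """
    return {
        role: control_set_name(num) if num else None
        for role, num in select.items()
    }


def lkg_is_separate(select):
    """True when LastKnownGood names a different set than Default."""
    return select["LastKnownGood"] != select["Default"]


if __name__ == "__main__":
    sample = {"Current": 1, "Default": 1, "Failed": 0, "LastKnownGood": 3}
    for role, name in describe_select(sample).items():
        print(f"{role:14} -> {name}")
    print("LastKnownGood is a real fallback:", lkg_is_separate(sample))
```

When lkg_is_separate returns False, choosing "last known good configuration" at boot loads the same control set as a normal boot.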
I went back to my registry and, for some dumb reason, I checked the Select key (which controls which control set the system should use for normal booting and for last known good configuration booting). Hmm...something is fishy. The Default flag has the same value as the LastKnownGood flag. This means that each time I wanted to boot into the last known good configuration I was actually booting into the default, current one. I made the change and pointed the LastKnownGood flag to the backup control set. Restarted the VM and.....it booted just fine.

Argh...there's something in the registry that is causing this...but what? I went back to the registry and devised a simple trial-and-error plan: export each subkey from the working control set, import it into the non-working one, power on the VM. So on and so forth until I found the subkey that was making the VM stable. In this case it was the subkey Control. I then went one step further, exported every subkey from this key and repeated the test. Half a day later I ended up with the faulty key "hklm\controlset01\control\session manager\environment"
Right...which flag is it then? I ran the same tests as above, excluding the values one by one and.....PATH was the one.
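The trial-and-error hunt above can be modelled in Python, with nested dicts standing in for registry keys. This is a toy model, not a tool that touches a real hive. The `boots` callback is a placeholder for "import the key, power on the VM and watch", and all names are invented for the sketch. It swaps in one differing entry at a time from the working copy and descends into whichever subkey makes the system boot.

```python
import copy


def _get(tree, path):
    """Walk a nested dict along a tuple of key names."""
    for name in path:
        tree = tree[name]
    return tree


def _patched(broken, working, path):
    """Copy of `broken` with the entry at `path` taken from `working`."""
    trial = copy.deepcopy(broken)
    parent = _get(trial, path[:-1])
    parent[path[-1]] = copy.deepcopy(_get(working, path))
    return trial


def find_culprit(working, broken, boots, path=()):
    """Return the deepest path whose replacement alone makes `broken` boot.

    `boots` receives a whole candidate tree and returns True or False.
    Returns None if no single replacement at this level helps.
    """
    good_here = _get(working, path)
    bad_here = _get(broken, path)
    for name, good_value in good_here.items():
        if bad_here.get(name) == good_value:
            continue  # identical on both sides, cannot be the cause
        candidate = path + (name,)
        if not boots(_patched(broken, working, candidate)):
            continue
        # Replacing this entry fixes the boot; narrow it down if it is a subkey.
        if isinstance(good_value, dict) and isinstance(bad_here.get(name), dict):
            return find_culprit(working, broken, boots, candidate) or candidate
        return candidate
    return None
```

Each call to `boots` corresponds to one reboot of the VM, which is why the real search took half a day.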

The PATH causing the problem had some extra entries at the beginning. After removing those entries (in red in the picture) and leaving the defaults (in black in the picture), the server booted just fine. I don't know if the length of the path was causing this or something else, but it's interesting enough.
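A comparison like the one in the picture can also be done in code. The sketch below is only an illustration: the helper names are invented, and the sample strings in the `__main__` block are hypothetical, since the real values are only in the screenshot. It lists the entries that are in one PATH value but not in a baseline, and prints the total length, since length was one of the suspects.

```python
def split_path(value):
    """Split a Windows PATH string on ';', dropping empty entries."""
    return [entry.strip() for entry in value.split(";") if entry.strip()]


def _normalize(entry):
    """Compare entries case-insensitively and ignore a trailing backslash."""
    return entry.rstrip("\\").lower()


def extra_entries(current, baseline):
    """Entries present in `current` but not in `baseline`, in original order."""
    known = {_normalize(e) for e in split_path(baseline)}
    return [e for e in split_path(current) if _normalize(e) not in known]


if __name__ == "__main__":
    # Both strings are hypothetical examples, not the values from this server.
    baseline = r"%SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem"
    current = r"C:\Hypothetical\Tool\bin;D:\Made-Up\Dir;" + baseline
    print("extra entries:", extra_entries(current, baseline))
    print("total length:", len(current))
```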


 
