Dealing with crash reports

In an ideal world -- or, rather, an ideal world from the programmer's standpoint -- software is written, polished, completed, shoved out the door, and that's that. In the real world, it doesn't work that way. Software programs are now too large and complex to be pushed out reasonably bug-free the first time, at least for the waiting time and cost that consumers are willing to put up with. Even in the world of commercial packaged software, where releasing updates is an expensive process given the need to actually find and pull people who still know the code and regression test the software, you're still generally expected to release one patch. But how do you know what to patch?

Crashes are finicky. If you tell people to simply report crashes to you, you're going to get a lot of worthless reports, guaranteed. This doesn't mean your users are stupid or unmotivated, because they largely aren't, but most aren't trained as to what they should look for and report in a software failure, much less for your software. So the best thing you can do to improve your crash diagnostic capabilities is to rig your software to report what you want to know, and that can be done by having it write out a crash report. A crash report is a quick dump of the state of the program and the program's environment at the time of the failure. Since it is done programmatically, a lot of reporting bias is removed, and even if the report itself isn't useful, you can match it to other reports to obtain clues as to what is causing failures in your application.

That leaves the problem of what to put in the crash report.

I'll assume that you don't have one of the easy methods available, such as trucking the person and his machine in next to you (not generally feasible unless you're dealing with in-house QA), or taking a full memory image of the entire process, which takes forever to upload and wastes tremendous amounts of storage and bandwidth. What we're looking for here are items that (a) speed up identification of the most common failures, (b) are space efficient, and (c) are fairly easy to implement. You don't need a truckload of information in your crash report for it to be effective in helping reduce defects in subsequent releases.

What to put in the report

First, you need basic identification information: program name and version/build. You need enough information to pinpoint the exact build that failed, down to the executable with the exact checksum. If you can't determine the exact build, then it's going to be annoying to narrow down the possible changes causing a regression and even more annoying to do any sort of disassembly-based sleuthing. Ideally, this is fine-grained enough to also distinguish internal or local builds, so you can quickly throw out any reports that were inadvertently generated by pre-release or otherwise non-official builds.

You should then also dump the type of failure. Access Violation is the most common type, but you should remember to dump the address and read/write flag so you can determine if it was a null pointer dereference. Other types that will pop up are: unhandled C++ exception, privilege violation (caused by an unaligned SSE access), stack overflow, not implemented (hitting the guard page of another thread's stack), and illegal instruction.
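As a rough illustration of why the address and read/write flag matter, here is a minimal sketch of a one-line summary for an access violation. The function name and the 64 KB cutoff are assumptions made for this example: on NT-family systems the lowest 64 KB of the address space is never accessible to user code, so a fault there is usually a null pointer plus a small member offset.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats the fault address and read/write flag of an
// access violation, flagging addresses that look like null dereferences.
std::string describeAccessViolation(std::uint32_t faultAddress, bool wasWrite) {
    // Assumption: the first 64 KB is never mapped, so a fault below this
    // limit is most likely "null pointer + small offset".
    const std::uint32_t kNullRegionEnd = 0x00010000;

    char buf[96];
    std::snprintf(buf, sizeof buf, "Access violation %s %08X%s",
                  wasWrite ? "writing" : "reading",
                  static_cast<unsigned>(faultAddress),
                  faultAddress < kNullRegionEnd
                      ? " (likely null pointer dereference)" : "");
    return buf;
}
```

A read fault at 00000004, for instance, would typically be a field access through a null structure pointer.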

The next most important piece of information is the execution point of failure. Most failures are caused by stupid coding errors, and just knowing which function failed is often enough to spot the bug in the source. Therefore, you should at the very least dump the instruction pointer (EIP in x86), as well as your module's base address if you are in a DLL; if you have symbol names in the dump, or a post-process utility that quickly adds them to the report, you can speed up initial bug report triage. Line number helps even more, but don't throw away EIP in case the line number information is inaccurate.
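The reason to record the module base alongside the instruction pointer is that a DLL may be loaded somewhere other than its link-time base, and symbol lookups are keyed on link-time addresses. A minimal sketch of the arithmetic follows; the function names and report format are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: maps a crash address inside a relocated module back
// to the address it would have at the module's link-time (preferred) base.
std::uint32_t toLinkTimeAddress(std::uint32_t eip,
                                std::uint32_t loadedBase,
                                std::uint32_t preferredBase) {
    assert(eip >= loadedBase);  // the address must lie inside the module
    return eip - loadedBase + preferredBase;
}

// Formats all three values so the raw EIP is never thrown away.
std::string formatCrashAddress(std::uint32_t eip,
                               std::uint32_t loadedBase,
                               std::uint32_t preferredBase) {
    char buf[96];
    std::snprintf(buf, sizeof buf,
                  "EIP=%08X (module base %08X, link-time address %08X)",
                  static_cast<unsigned>(eip),
                  static_cast<unsigned>(loadedBase),
                  static_cast<unsigned>(toLinkTimeAddress(eip, loadedBase, preferredBase)));
    return buf;
}
```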

A stack trace is also important, because for the failures where the IP alone isn't enough, knowing the next few calls will often do the trick. Now, if you are on Win32, don't make the newbie mistake of relying on the DbgHelp StackWalk() function. It is not reliable. Win32 on x86 is such a squirrelly execution environment that it is not generally possible to determine the call stack by the disassembly alone, due to the presence of __stdcall and thiscall calling conventions, which rely on the called function to remove arguments... with the problem being that you can't determine the called function statically due to indirect calls. Even worse, the Visual C++ compiler doesn't generate nearly enough information to unwind the stack reliably from all failure points in optimized code, and in foreign code you will have no debug information at all. If you rely on StackWalk() for your stack, you have a high probability of missing critical calls near the point of failure, and even worse, not dumping enough information to manually reconstruct the correct stack. An EBP-based frame crawl is even more worthless considering that any good optimizer is going to omit the frame pointer. So what should you do?

Bite the bullet and dump the raw stack. Not only will this allow you to manually reconstruct the stack if required, you can also extract valuable parameters and local variables if you spend enough time with the disassembly. You don't need much; a few hundred bytes is frequently enough. I should note that VirtualDub didn't do this originally, and still dumps only a small portion around ESP. One reason for this is space, and another reason is that I need a crash report that is easily scannable without any additional tools. What it does do, however, is attempt to produce a call stack that doesn't lie. It scans the stack and identifies DWORDs that point to potential call sites -- executable memory with preceding data that looks like a CALL instruction. This produces some false positives, but never any false negatives, so the reported call stack is always a superset of the correct stack. More importantly, it can take advantage of information that isn't available to me, namely the instruction data of the other modules within the process, and the names and locations of DLL exports.
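The scanning idea can be sketched roughly as follows. This is not VirtualDub's actual code: the `CodeRegion` type and function names are hypothetical, executable memory is modeled as a plain byte vector, and only a handful of CALL encodings (E8 rel32 and the FF /2 indirect forms) are checked. The test is deliberately loose, so it errs toward false positives.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one executable region of the crashed process:
// `bytes` holds the code that was mapped starting at address `base`.
struct CodeRegion {
    std::uint32_t base;
    std::vector<std::uint8_t> bytes;
};

// True if `addr` lies in the region and the bytes just before it look like
// a CALL instruction, i.e. `addr` could be a return address.
bool looksLikeReturnAddress(const CodeRegion& code, std::uint32_t addr) {
    if (addr < code.base || addr - code.base >= code.bytes.size())
        return false;
    const std::size_t off = addr - code.base;
    auto at = [&](std::size_t back) { return code.bytes[off - back]; };

    // E8 rel32: direct near call, 5 bytes long.
    if (off >= 5 && at(5) == 0xE8)
        return true;

    // FF /2: indirect near call. Total length depends on the addressing
    // mode; only the ModRM reg field (bits 5..3) is checked to be 2.
    const std::size_t kIndirectLengths[] = {2, 3, 4, 6, 7};
    for (std::size_t len : kIndirectLengths) {
        if (off >= len && at(len) == 0xFF && ((at(len - 1) >> 3) & 7) == 2)
            return true;
    }
    return false;
}

// Returns every stack DWORD that could be a return address, in stack order.
std::vector<std::uint32_t> scanStack(const CodeRegion& code,
                                     const std::vector<std::uint32_t>& stackDwords) {
    std::vector<std::uint32_t> candidates;
    for (std::uint32_t value : stackDwords) {
        if (looksLikeReturnAddress(code, value))
            candidates.push_back(value);
    }
    return candidates;
}
```

A real scanner would run this check against every executable region in the process, not just one.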

A disassembly, or instruction bytes that can be used to form a disassembly, can be useful, especially in conjunction with a register dump. It's not going to be useful except for the hard cases that require machine code level grunt work, though, and requires an experienced programmer with good knowledge of compiler code generation and assembly language. It is quite useful, however, if you are dealing with a crash that is caused by interaction with a third party module, for which you likely don't have the source code or even the binary. It's also useful in that it can help identify the code involved if you don't have the symbols for the build that failed -- an unchanged routine tends to compile to the same object code even if it has moved in location. The sticking point here is that good x86 disassemblers are hard to find and harder to implement. If you think they're easy to implement, I should remind you that some opcodes change mnemonics depending on prefix (JCXZ/JECXZ and MOVSB/D/W/Q), some are aliases (NOP is actually XCHG EAX, EAX special-cased), some old prefixes have been repurposed (many SSE instructions overload the REP and REPNE prefixes, and SSE2 overloads the 66h size override), and Intel just added three-byte opcodes with the SSSE3 instructions in the Core 2 Duo. I gave up trying to find nice patterns in the x86 decoding mess and just implemented a full-blown pattern matching engine for VirtualDub's disassembler. If you're looking into doing this and don't have a disassembler already, I recommend just dumping out raw bytes and hacking up a tool to abuse DUMPBIN /DISASM on your end.
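Dumping raw bytes is the cheap half of that approach. A minimal sketch, with a made-up function name and line format, might look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats `count` instruction bytes starting at
// `address` as a single hex line that can be disassembled later.
std::string dumpCodeBytes(std::uint32_t address,
                          const std::uint8_t* bytes,
                          std::size_t count) {
    assert(bytes != nullptr || count == 0);

    char buf[16];
    std::snprintf(buf, sizeof buf, "%08X:", static_cast<unsigned>(address));
    std::string line = buf;

    for (std::size_t i = 0; i < count; ++i) {
        std::snprintf(buf, sizeof buf, " %02X", static_cast<unsigned>(bytes[i]));
        line += buf;
    }
    return line;
}
```

In a real handler the bytes would have to be read defensively, since memory around a bad EIP may not be readable at all.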

You should also consider a module (DLL) list for several reasons: identifying DLL version mismatches; spotting intrusive third-party applications that are known to interfere, particularly "window skinning" or applications that otherwise have global hooks; and identifying mystery code addresses within the report. The catch here is that the module list can be quite big.
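To show how a module list helps with mystery addresses, here is a small sketch that resolves an address to a module name plus offset. The `ModuleInfo` layout, the function name, and the module names in the example are assumptions for illustration only.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical record for one entry of the module list in the report.
struct ModuleInfo {
    std::string name;
    std::uint32_t base;  // load address
    std::uint32_t size;  // size of the mapped image in bytes
};

// Returns "name+offset" for the module containing `addr`, or a placeholder
// if the address does not fall inside any listed module.
std::string resolveAddress(const std::vector<ModuleInfo>& modules,
                           std::uint32_t addr) {
    for (const ModuleInfo& m : modules) {
        if (addr >= m.base && addr - m.base < m.size) {
            char offset[16];
            std::snprintf(offset, sizeof offset, "%08X",
                          static_cast<unsigned>(addr - m.base));
            return m.name + "+" + offset;
        }
    }
    return "(no module)";
}
```

With a list containing `hook.dll` at 10000000, an address like 10001234 resolves to `hook.dll+00001234`, which immediately points at the third-party module.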

Don't forget to dump the version of Windows that the program was running on. If you have optimized code paths, consider dumping the CPU type, and if you are multithreaded, the number of logical hardware threads.

Finally, you might want to dump a machine identifier, like the computer name, so you can tell if it's a particular machine that is giving you grief. Bad memory can and does pop up in the wild, given that the amount of memory that machines have has gone way up, but error rates haven't decreased to compensate. Dumping a machine identifier in a public scenario may have privacy implications, though, even if the identifier is hashed.

The nitty-gritty details

Don't dump thread and process IDs unless you actually write a table of them somewhere or have other TID/PID values to compare against. Otherwise, they're useless, because they change between every run. What exactly is thread FFFFFF9C? The same goes for handle values, unless you're dumping them to determine if they're null or corrupted.

Segment registers are absolutely worthless on Windows NT, because they never change. The kernel changes the descriptors that the selectors point to instead. I believe they can change on Windows 9x, but their values are still worthless.

ESP is useful, even if you have nothing to compare it against. Why? Because on Windows NT/2000/XP, the main thread's stack for an application with default linker settings always grows down from 00130000 to 00030000. From the value alone you can determine if it was the main thread that crashed, and whether deep recursion (or otherwise high stack usage) was occurring.
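That reasoning is simple enough to automate during triage. The sketch below hardcodes the 00030000-00130000 range quoted above, which only holds for default linker settings on those Windows versions; the type and function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical triage result derived from ESP alone.
struct StackEstimate {
    bool onMainThreadStack;   // ESP falls inside the default main stack range
    std::uint32_t bytesUsed;  // distance from the top of the stack, else 0
};

// Assumes the default layout described above: the main thread's stack
// grows down from 00130000 toward 00030000.
StackEstimate estimateFromEsp(std::uint32_t esp) {
    const std::uint32_t kStackTop = 0x00130000;
    const std::uint32_t kStackBottom = 0x00030000;

    if (esp < kStackBottom || esp > kStackTop)
        return {false, 0};
    return {true, kStackTop - esp};
}
```

An ESP of 0012FE40 gives a few hundred bytes of usage on the main thread, while a value near 00030000 suggests runaway recursion.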

You might as well dump all of EAX/EBX/ECX/EDX/ESI/EDI/EBP as well. In a C++ method compiled by VC++, the this pointer is in ECX on method entry, and optimized code will often move it into EBX or ESI.

Floating point, MMX, and SSE registers aren't likely to be useful. You may want to consider dumping the FPU control and tag words, though. The FPU control word will help you determine if a floating-point exception was caused by an external module mucking with the thread's floating-point exception mode, as the Borland CRT is apt to do; the tag word will quickly indicate if a crash may have been caused by a missing EMMS/FEMMS instruction in MMX code.
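Both checks come down to a few bit tests. The sketch below assumes the full 16-bit x87 tag word (two bits per register, with 11 meaning empty) as stored by FNSTENV/FSAVE, not the abridged 8-bit form that FXSAVE writes. The function names are made up for this example.

```cpp
#include <cstdint>

// Returns the x87 exceptions that are unmasked in the control word.
// Bits 0-5 are the mask bits (invalid, denormal, zero-divide, overflow,
// underflow, precision); a 0 bit means that exception is unmasked.
unsigned unmaskedFpuExceptions(std::uint16_t controlWord) {
    return static_cast<unsigned>(~controlWord) & 0x3Fu;
}

// Counts registers not marked empty in the full 16-bit tag word.
// MMX instructions mark all eight as valid and EMMS marks them all empty,
// so a count of 8 at an x87 fault hints at a missing EMMS.
int countInUseFpuRegisters(std::uint16_t tagWord) {
    int inUse = 0;
    for (int reg = 0; reg < 8; ++reg) {
        if (((tagWord >> (2 * reg)) & 3) != 3)
            ++inUse;
    }
    return inUse;
}
```

A nonzero result from the first function on a thread that should have all exceptions masked is the signature of something else changing the floating-point mode.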

In code that makes heavy use of exception handling, dumping the FS:[0] SEH chain could theoretically be useful because it's one of the elements of the call stack that Doesn't Lie(tm). I don't think I've ever had enough nested scopes to make it worthwhile, however.

Application-specific data can be extremely helpful in debugging, but be careful: the greatest sin you can commit here is to crash again in the crash handler. Limit the data structures that you crawl, protect the code with exception handlers, and dump the app-specific info last so the rest of the report survives regardless. Don't forget to flush any I/O write buffers in the process.

Oct 12, 2006 at 03:16
