Good Practice Guide Debugging

Debugging

Introduction
A note on error messages
Some useful compiler options
Basics of GDB
Using TotalView
Tips for using print
Abnormal Termination Processing (ATP)
References

Introduction

Debugging, the methodical process of identifying errors, is an essential part of writing computer programs. Programmers need to debug their code as errors appear during testing and development, but released applications typically require debugging too, when they are used with unusual input or by new users for different purposes. Bugs can arise for a variety of reasons. Common ones are accessing memory illegally (for example writing beyond the bounds of an array), performing an illegal operation such as dividing by zero, and getting stuck in a non-terminating loop.

It is impossible to produce an exhaustive list of all types of bugs one might encounter, so the purpose of this guide is to introduce some tools and techniques that are generally applicable. By far the most popular debugging tool is the print statement. The idea is to produce a commentary of what your program is doing and deduce where the program fails, either from the point at which output stops, or from the point of unexpected output, and understand why it fails from the values of printed variables. Another important ally is the compiler; changing compilation options can often reveal bugs. More sophisticated (but sometimes no more effective) tools are also available such as the GNU Debugger (GDB) and proprietary tools such as TotalView (which is available on HECToR). These tools allow programmers to stop their programs at specified points, inspect variables and step through code line-by-line.

A note on error messages

Some bugs reveal themselves through erroneous output, the likely cause of which is an incorrectly programmed algorithm (or a fundamentally incorrect one), but others are more obvious since the program crashes and produces an error message. Often the error produced is accompanied by a UNIX signal, the most common by far being SIGSEGV (segmentation fault) and SIGKILL (killed).

Application exit codes on HECToR's compute nodes are forwarded to the aprun command, which adds 128 to the number of any signal that terminated the program. Thus SIGKILL (signal 9) and SIGSEGV (signal 11) are reported as exit codes 137 and 139 respectively. On HECToR, SIGKILL usually indicates that the program ran out of memory. More precisely, Linux uses an optimistic memory allocation strategy, so it can overcommit the memory available to a program; the error occurs only when the program comes to touch the overcommitted memory, and you will also see a message that the OOM killer terminated your job. SIGSEGV indicates that an illegal memory operation has occurred, such as writing to read-only memory or touching a memory location that doesn't belong to the program.

Some useful compiler options

For compiler-specific debugging options, see the man pages for the different compilers (PGI - man pgf90/pgcc, GNU - man gcc, Cray - man crayftn/craycc) or the user manuals, which can be found on-line. This section discusses some of the common and most useful compiler options.

If your bug appears with highly-optimised code the first thing to try is running unoptimised code, compiled with -O0. If the bug disappears it may have been introduced by the compiler because of one of the optimisation options -- try eliminating the options one-by-one to identify the offending option.

The Fortran bounds checking option, typically -C, can be useful particularly if you encounter a segmentation fault. This option adds some extra code which produces a run-time error message for out of bounds array accesses.

Using the -g option adds debugging tags to object files which can be used to produce more information in the event of a crash, such as the source file and line number where the crash occurred. These tags can also be used by debugging tools (see below).

The NAG Fortran 95 compiler

The NAG Fortran compiler is a useful debugging aid because it is one of the strictest standard-conforming compilers and produces detailed error reports. Even if you ultimately want to run your code as compiled by a different compiler, it is useful to use the NAG compiler for its error reporting capabilities. See the compilation section of the userguide for information about how to load the compiler into your environment.

GDB

The GNU Debugger (GDB) is a popular and powerful tool for interactive and post-mortem debugging which relies on compiling your code with the -g compiler flag. Interactive debugging means running your program through GDB and inspecting state as it runs, and post-mortem debugging is inspecting the state of a crashed program after the event.

Note that while it is possible to debug parallel programs with GDB by attaching to running processes, this is not possible from HECToR's compute nodes since GDB is an interactive tool. However, it may be possible on your local cluster. The reader is referred to the GDB website for more information.

Of more use on HECToR is GDB's ability to perform post-mortem debugging.

Post Mortem Debugging

Upon encountering an error in a program, the operating system may produce a core file, which is a snapshot of the state of the program when the error occurred. This does not happen by default on HECToR; to enable core dumps, add the line ulimit -c unlimited before calling aprun in your batch script. With a core file and GDB it is possible to examine the state of your program after it has run. This is similar to running interactively with GDB and issuing the run command at the start. The common things to do with a core file are to display a stack trace with the bt command and to show the values of key variables with the print command. To inspect a core file with GDB, run gdb ./a.out core, where a.out is the executable that produced the core file.

It is possible to force the operating system to dump a core file on exit from your program in cases where it ordinarily would not but where having one would be useful. Link the following code into your executable and call the function at the start of your program:

#include <stdlib.h>

/* Register abort() to be called on normal exit from the program. */
void abortatexit_(void)
{
    atexit(abort);
}

This function registers another function to be called on exit from your program (unless your program receives the SIGKILL signal, in which case it is terminated immediately). The abort function registered here raises the SIGABRT signal, which by default causes the operating system to produce a core file that can be used in post mortem debugging.

Using TotalView

TotalView is an interactive debugger similar in its operation to GDB, except that it is designed to handle parallel programs, can be run from the compute nodes and has a GUI.

In order to use TotalView the first step, as with GDB, is to compile your program with the -g option. TotalView also comes with an optional application for viewing memory statistics called MemoryScape. For MemoryScape to work you must link your application with libtvheap_cnl_static.a by adding the following to your link line:

-L/opt/toolworks/totalview.8.11.0-0/linux-x86-64/lib -ltvheap_cnl_static

Next, change your batch script in the following ways:

Add your DISPLAY environment variable to the PBS options: #PBS -v DISPLAY
Invoke your program in the following way: totalview aprun -a -b -a xt -n $NPROCS -N $NTASK ./a.out (for the X2 use the option -a x2 instead)

For an example, see the Tools section of the HECToR userguide.

To enable MemoryScape you must select the radio button "Enable Memory Debugging" at the start of your job. Then, to start debugging, click the "Go" button in the main window.

The smaller window will be filled with a list of processes and the larger window is divided into four panes. The main pane in the larger window shows source code listings for the root process. To view a different process, double-click your selection in the smaller window. To navigate through your source, double-click on function/subroutine names or search for a particular file via the File menu.

The pane at the bottom of the window shows a list of action points. This will be empty initially. To set a breakpoint, for example, simply click on a line number: the number will turn into a red STOP icon and the action point will be added to the list in the bottom pane. The top two panes show the stack trace and current stack frame. The stack frame lists the contents of registers and the values of variables currently in scope. Another way to inspect the values of variables is to hover your mouse pointer over a scalar, or to double-click an array to open a new window that lists its contents.

Once action points are set, run the program by clicking the "Go" button. The "Next" and "Step" buttons correspond to GDB's next and step commands.

To view a memory profile with MemoryScape, select Debug and then Open MemoryScape. A new window will appear with a number of options for viewing different aspects of memory usage.

The TotalView GUI: source code is clickable, making setting breakpoints and inspecting variables simple.

As with GDB there are many more options beyond the basic ones discussed in this guide and the reader is referred to the TotalView userguide, available from the documentation section of the TotalView website. However, the basic commands here are more than sufficient for debugging in most cases.

Tips for using print

If your program crashes and you don't know where, print statements can be used to perform a kind of binary search for the statement that causes the crash. It is often clear which subroutines/functions are suspect and should have print statements added. If you have no idea, start by adding print statements to the main program, run the code, and then add print statements to the last subroutine/function that was called before the crash. Applying this idea recursively will narrow down the location of the error. Finding the cause of the error relies on inspecting the printed values of key variables and comparing them with your expectations.

The print function should be called sparingly. It is possible to be overwhelmed by so much debug output that you can't find the important bit of information that indicates the cause of the bug. Avoid printing every element of an array and be mindful of how many iterations a loop will take if you want to print inside the loop.

Messages sent to the standard output stream (Fortran unit 6 on HECToR) are likely to be buffered, and therefore anything printed immediately prior to the point of a crash may not appear. The standard error stream (Fortran unit 0) is unbuffered and is therefore more suitable. However, because it is unbuffered, printing takes longer, which is another reason not to print too much.

For parallel programs, producing a lot of debugging information is often unavoidable, and since the order in which processes write is non-deterministic the output can be hard to read. It is therefore sometimes a good idea to set up a separate output file per process for debugging output. To force output to disk it is possible to use the C functions fflush and fsync. For example, if ofile is the file stream (opened with fopen) then the following will send output to disk:

fflush(ofile); fsync(fileno(ofile));

For Fortran code it is possible either to call a C function that performs the above or to call the flush intrinsic subroutine: call flush(unit_number). (This flush is an extension to the F95 standard; F2003 specifies a FLUSH statement, flush(unit_number).) For example, the following code prints a message and some useful debugging data, flushes this to the system and then waits at a barrier until all processes have reached the same point:

write(300+myrank,*) "your debug message", data
call flush(300+myrank)
call mpi_barrier(mpi_comm_world, ierr)

Abnormal Termination Processing (ATP)

In preference to using print on HECToR it is possible to get a full backtrace on exit using a tool called ATP. ATP registers signal handlers that will produce backtrace information whenever an application terminates abnormally e.g. from segmentation violations or library aborts. However, it is not possible to catch out of memory (OOM) terminations since such jobs receive the SIGKILL signal, which cannot be caught by a signal handler.

In order to use ATP, make sure you have the atp module loaded (it is loaded by default) before compiling your code, and then set ATP_ENABLED=1 in your job script before calling aprun:

#!/bin/bash --login
#
# Parallel script produced by bolt
#        Resource: HECToR (Cray XE6 (32-core per node))
#    Batch system: PBSPro
#
# bolt is written by EPCC (http://www.epcc.ed.ac.uk)
#
#PBS -l mppwidth=32
#PBS -l mppnppn=32
#PBS -N atp_demo
#PBS -A z03
#PBS -l walltime=00:20:00

# Switch to current working directory

cd $PBS_O_WORKDIR

# Run the parallel program

export OMP_NUM_THREADS=1

module load atp
export ATP_ENABLED=1

aprun -n 4 ./testMPIApp some_arguments

On abnormal exit your program should now produce a backtrace on stderr plus an "atpMergedBT.dot" file, which contains backtrace information that may be visualised with the "statview" application.

References

The GNU Debugger
TotalView website
Debugging, Profiling and Optimising HECToR CSE training course