Debugging MPI Applications

Throughout this project, we did a great deal of debugging with the venerable printf family of functions. This method is tedious and involves a lot of guesswork, but we had to rely on it because our cluster does not have a distributed debugger available. We found a partial workaround:

Debugging MPI Applications with GDB

GDB is not suitable for debugging distributed applications in the usual way (launching the application under GDB). Instead, the debugger must attach to an already running process, so that proper initialization can take place and the real processes can be inspected. Unfortunately, this requires one instance of GDB per task to be debugged. The general process is as follows:

(Optional) Instrument the code: either in the MPI_Init_post wrapper or in the program itself, add a line similar to printf("Rank=%d,pid=%d\n", my_rank, getpid()); so that each rank reports its process ID. Printing the hostname as well (via gethostname()) is useful when running on multiple nodes.
Next, instrument the code with a call to sleep() that lasts long enough to attach to the process (60 seconds, for example).
The easiest way to proceed is to use the -machinefile argument to mpirun to restrict all of the tasks to a single host; this makes the processes easier to find and attach to.
Once the task is running and has been initialized, it sleeps because of the instrumentation. During that window, run the following in the GDB shell:
(gdb) attach <pid>
(add breakpoints, etc)
(gdb) continue
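
The two instrumentation steps above can be sketched in C roughly as follows. This is a minimal sketch, not the code used in the project: the helper names format_attach_line and wait_for_debugger are hypothetical, and the rank is assumed to have been obtained from MPI_Comm_rank() after MPI_Init(). The helpers deliberately avoid any MPI calls so they can be dropped into either the wrapper or the program itself.

```c
#define _POSIX_C_SOURCE 200112L  /* for gethostname() under strict C modes */
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: build the "Rank=...,pid=...,host=..." line that lets
 * us match an MPI rank to the process we want to attach to.
 * Returns the value returned by snprintf(). */
int format_attach_line(char *buf, size_t len, int rank)
{
    char host[256];

    if (gethostname(host, sizeof host) != 0) {
        snprintf(host, sizeof host, "unknown");
    }
    host[sizeof host - 1] = '\0';  /* gethostname() may not terminate on truncation */

    return snprintf(buf, len, "Rank=%d,pid=%d,host=%s",
                    rank, (int)getpid(), host);
}

/* Hypothetical helper: print the line, then sleep so there is time to run
 * "attach <pid>" in GDB. Assumed to be called right after MPI_Init(),
 * with rank taken from MPI_Comm_rank(). */
void wait_for_debugger(int rank, unsigned int seconds)
{
    char line[512];

    format_attach_line(line, sizeof line, rank);
    printf("%s\n", line);
    fflush(stdout);  /* make sure the line appears before we go to sleep */
    sleep(seconds);
}
```

A call such as wait_for_debugger(my_rank, 60) placed immediately after initialization gives each task a 60-second window in which to attach. The explicit fflush() matters because output forwarded through mpirun is often buffered, and the PID is only useful if it shows up before the sleep ends.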