Synchronizing the replay engine with the record engine for the x86 architecture

(This project is performed by Apoorva Kulkarni and Vivek Thakkar under the guidance of Dr. Frank Mueller)


Aim
The aim of this project is to fix the replay engine of the MPI trace compression source, which is currently out of sync with the record engine, and to test it on a set of small and large benchmarks.

Motivation
This project is an extension of work previously undertaken by Mark Noeth and Dr. Frank Mueller at North Carolina State University and by Martin Schulz and Bronis de Supinski at Lawrence Livermore National Laboratory. Their work, Scalable Compression and Replay of Communication Traces in Massively Parallel Environments, has been submitted to the IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2007. Their aim was to implement a trace-driven approach to analyzing MPI communication:
An analysis tool was developed that characterizes the communication behavior of large-scale, massively parallel applications that use MPI to communicate among parallel tasks. The tool has essentially two components - a record engine and a replay engine. The record engine efficiently produces the communication traces in compressed file(s), and the replay engine replays the execution and generates statistical data for analysis.

Abstract
The record engine has received several extensions over time, but the corresponding changes have not been made to the replay engine. Our task is to identify the changes that are needed and to verify them by benchmarking with small and large benchmarks. For a complete problem description and our approach, please see the initial report.

Project Update (Date: 11/7/06)
Some progress has been made with respect to compiling the old record-replay framework. Minor changes to the environment variables in the Makefiles were required, and one more change was made in mpi_wrappers.c to successfully compile the source code, including the instrumented test files.

Sample programs from "tests" were executed to check whether the record engine produced the trace files that the replay engine uses to perform the inverse operation. It was found that although the record engine generated the desired trace files, the replay engine was not handling them successfully. Good old printf() debugging showed that prsd_utils.c required a modification for the MPI_Foo calls.

The issues with MPI_Send, MPI_Recv, MPI_Reduce, MPI_Barrier, MPI_Gather, and MPI_Bcast have been solved. What remains is the handling of asynchronous calls such as MPI_Isend and MPI_Irecv: we found that the replay engine is not able to retrieve the request handle that it stored earlier in a global static array.
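
The following is a minimal sketch of the idea, with hypothetical names (request_table, replay_isend, and replay_wait are illustrative, not the actual prsd_utils.c code): the replay engine keeps the MPI_Request handles returned by MPI_Isend in a global static array so that a later MPI_Wait can retrieve them.

    #include <mpi.h>

    #define MAX_REQUESTS 1024

    /* Global static table mapping a recorded request index to the live
       MPI_Request produced during replay (illustrative sketch only). */
    static MPI_Request request_table[MAX_REQUESTS];

    /* Replay of an MPI_Isend record: issue the send and remember its handle. */
    static void replay_isend(void *buf, int count, MPI_Datatype type,
                             int dest, int tag, int req_idx)
    {
        MPI_Isend(buf, count, type, dest, tag, MPI_COMM_WORLD,
                  &request_table[req_idx]);
    }

    /* Replay of the matching MPI_Wait record: retrieve the stored handle. */
    static void replay_wait(int req_idx)
    {
        MPI_Status status;
        MPI_Wait(&request_table[req_idx], &status);
    }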

We also found that the new replay engine does not implement a basic parsing mechanism for most of the MPI_Foo calls. This means that our only option is to use the old replay engine with the old record engine, identify the minor modifications present in the new record engine, and make the corresponding changes in the old replay engine to bring it in sync with the new record engine.

Further, the planned testing with the NAS and ASCI benchmarks will have to be done to finally ensure that the old replay engine is in sync with both the old and the new record engines.

For a detailed project update please view the intermediate report.

Project Update (Date: 11/30/06)
Major progress has been made with respect to fixing the record engine and synchronizing the old replay engine with the new record engine. The solved and unsolved issues are listed below.

Solved Issues

1. Fixing the new record engine.
We have managed to successfully generate the rsd files for the IS benchmark from the NAS Parallel Benchmarks suite. This required three fixes in the code.
First fix:
While trying to run the benchmark with the new record framework, we found that most of the calls were being captured successfully by the record engine, but the MPI_Alltoallv() call was aborting in the middle of execution. Further analysis of the code showed that the initialization of the value for the sender's displacement was incorrect. The fix was simply a change in the name of the variable used to denote the sender's displacement in the mpi-spec.umpi.extract file: it was named displ in the file instead of sdispl.
Changed File: mpi-spec.umpi.extract
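
For context, the send-side displacement array (sdispls) is a separate parameter of MPI_Alltoallv from the receive-side one (rdispls), which is why the extract file must name the sender's displacement distinctly; the standard prototype is shown below for reference only.

    /* Standard MPI_Alltoallv prototype (MPI-1 style, shown for context):
       sdispls holds the send-side displacements, rdispls the receive-side ones. */
    int MPI_Alltoallv(void *sendbuf, int *sendcounts, int *sdispls,
                      MPI_Datatype sendtype,
                      void *recvbuf, int *recvcounts, int *rdispls,
                      MPI_Datatype recvtype, MPI_Comm comm);
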
Second Fix:
Secondly, the communicator size was not being retrieved into the data structure field op->data.mpi.size. It always returned 0, so the memory for parameters such as sender_count, receiver_count, sender_displacement, and receiver_displacement in MPI_Alltoallv was never allocated, and replay failed.
Changed File: mpi-spec.umpi.extract
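
A minimal sketch of the intended behavior, assuming op->data.mpi.size should carry the communicator size (the helper name and parameters below are hypothetical, not the framework's code): once the size is non-zero, the per-rank count and displacement arrays for MPI_Alltoallv can be allocated.

    #include <stdlib.h>

    /* Illustrative sketch only: the per-rank count and displacement arrays for
       MPI_Alltoallv must be sized by the communicator size. If op->data.mpi.size
       stays 0, none of these allocations happen and replay fails. */
    static int alloc_alltoallv_params(int comm_size,
                                      int **sender_count, int **receiver_count,
                                      int **sender_displacement,
                                      int **receiver_displacement)
    {
        if (comm_size <= 0)
            return -1;   /* the failure mode described above */

        *sender_count           = calloc(comm_size, sizeof(int));
        *receiver_count         = calloc(comm_size, sizeof(int));
        *sender_displacement    = calloc(comm_size, sizeof(int));
        *receiver_displacement  = calloc(comm_size, sizeof(int));
        return 0;
    }
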
Third Fix:
Here we used the stack_sig.h file that had been modified earlier by Prasun Ratn. The record framework generates a unique call-site address from the frame pointers that the setjmp call stores in its execution-environment buffer. Since the code was originally written for the BlueGene machine, which has a different architecture and different registers, the method of generating the unique sequence had to be modified for the x86 architecture. However, a limitation of this fix is that the optimization flags need to be turned off in the following files:
For the record engine: config/Makefile.config
For the replay engine: replay/Makefile
For the test MPI program: disable any optimization flags. This is necessary because, with optimization enabled, the compiler inlines call frames. For further clarification, please refer to the message board.
Changed File: stack_sig.h
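
To illustrate the general idea only (the sketch below uses GCC's __builtin_return_address rather than the setjmp/frame-pointer approach actually taken in stack_sig.h), a call-site signature on x86 can be formed from the return addresses of the active call frames; inlining under optimization removes those frames, which is why the optimization flags must be disabled.

    #include <stdint.h>

    /* Illustrative sketch only, not the actual stack_sig.h code. Assumes GCC,
       frame pointers kept (no -fomit-frame-pointer), optimization disabled,
       and at least three live call frames above this function. */
    static uintptr_t callsite_signature(void)
    {
        uintptr_t sig = 0;

        /* __builtin_return_address needs constant arguments, so unroll a few levels. */
        sig ^= (uintptr_t)__builtin_return_address(0);
        sig ^= (uintptr_t)__builtin_return_address(1) << 1;
        sig ^= (uintptr_t)__builtin_return_address(2) << 2;

        return sig;
    }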

2. Implementation of MPI_Alltoallv() in replay engine
The MPI_Alltoallv() function was not implemented in the old replay framework. We have added the implementation code (the inverse of the record engine's implementation) to prsd_utils.c in the replay source. However, we have not yet been able to perform the replay operation successfully for that call, and hence for the IS benchmark. The debug output indicates that the implementation itself works, and we are currently debugging the replay side of the benchmark, which is almost complete (one task reaches the final stage and waits for the other tasks; there seem to be some issues with asynchronous MPI operations).
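
A minimal sketch of what the replay side of MPI_Alltoallv amounts to, with hypothetical buffer handling (the function and variable names are illustrative, not the code added to prsd_utils.c): allocate dummy buffers sized from the recorded counts and displacements and issue the call, so that the communication pattern is reproduced without the original payload.

    #include <stdlib.h>
    #include <mpi.h>

    /* Illustrative sketch only: replay an MPI_Alltoallv record by allocating
       dummy buffers that match the recorded counts/displacements (assumed here
       to be in increasing order) and issuing the call. */
    static void replay_alltoallv(int comm_size,
                                 int *sendcounts, int *sdispls,
                                 int *recvcounts, int *rdispls)
    {
        int send_total = sdispls[comm_size - 1] + sendcounts[comm_size - 1];
        int recv_total = rdispls[comm_size - 1] + recvcounts[comm_size - 1];

        char *sendbuf = calloc(send_total, 1);   /* dummy payload */
        char *recvbuf = calloc(recv_total, 1);

        MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_CHAR,
                      recvbuf, recvcounts, rdispls, MPI_CHAR, MPI_COMM_WORLD);

        free(sendbuf);
        free(recvbuf);
    }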

3. Handling offsets in replay
The record engine generates source and destination identifiers for Recv/Irecv and Send/Isend in terms of offsets from the current rank. However, replay was implemented to take absolute identifiers, i.e. the rank of the node, as the send/receive identifier. Hence, we calculate the absolute ids by adding the current rank to the offsets. The fix was made in the function lookup_prsd_call(...) in the handling of MPI_Send, MPI_Isend, MPI_Recv, and MPI_Irecv. Changed File: prsd_utils.c
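
A small sketch of the conversion, with hypothetical names (absolute_rank and recorded_offset are not the actual lookup_prsd_call() identifiers):

    #include <mpi.h>

    /* Illustrative sketch only: the record engine stores the peer as an offset
       from the current rank; replay needs the absolute rank. */
    static int absolute_rank(int recorded_offset)
    {
        int my_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        return my_rank + recorded_offset;   /* e.g. offset -1 means the previous rank */
    }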

4. Correct Handling of Offset for Request handle
While debugging, we found errors such as "MPI_Isend: Unhandled Parameter". We analyzed this problem and found that the offset was incorrectly initialized in the request handle used by MPI_Wait. We have fixed and tested this, and it now works. Changed File: prsd_utils.c

5. A small Hack
If replay does not recognize some prsd_content in the generated rsd file, it exits, leaving the other tasks waiting. To prevent this from hampering the outcome, we have commented out the exit(0) call. For example, the new record engine generates a record of the total execution time, which is not recognized by our replay. With our hack, replay runs to completion and merely prints "unhandled parameter". This is not a fatal bug, and if time permits, we may resolve it later at lower priority.
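
In code terms, the hack amounts to something like the following sketch (the helper and its argument are hypothetical; only the commented-out exit(0) call and the "unhandled parameter" message come from the actual change):

    #include <stdio.h>

    /* Illustrative sketch only: on unrecognized prsd content, warn and continue
       instead of exiting, so the other tasks are not left waiting. */
    static void handle_unrecognized(const char *content)
    {
        fprintf(stderr, "unhandled parameter: %s\n", content);
        /* exit(0);  -- commented out: aborting here left the other tasks blocked */
    }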

6. Linking mpiP with the old replay engine
The old replay engine was successfully linked with the mpiP profiling framework. Thus, if the rsd files are generated properly by the record engine, the replay engine should be able to perform its job, and mpiP successfully creates a profile of the execution. Changed File: replay/Makefile

Unsolved Issues

1. Although the record engine appears to have been fixed for some common problems, a slew of tests still needs to be performed to ensure that the new record engine works correctly and generates proper rsd output files. So far, tests have been performed using some of the programs provided in the tests folder of the record source, and we were able to execute some of them successfully. As mentioned earlier, we were also able to execute the IS benchmark using the record engine. However, more benchmarking needs to be performed to verify that most of the general MPI_Foo() calls are captured.

2. Some MPI operations that seem unimportant in terms of their use in the benchmarks have been left unimplemented. We need to decide whether we can implement a few of them within the time constraints. Operations we may need to implement include MPI_Scatter, MPI_Scatterv, and MPI_Abort, but we need to reach a final agreement on this with our instructor.

Note: The above files are in debug mode, so executing them may generate a lot of debug output. The project tarball will be updated shortly.
For a detailed project update, view the report.

Project Update (Date: 12/04/06)
We turned compression off and found that the replay engine works fine and the mpiP file is also generated. This suggests that there is a bug in the record framework that prevents it from generating the rsd files properly when compression is turned on. We are currently trying to fix this problem.

Project Update (Date: 12/07/06)
We were able to locate the cause of the failure when executing the IS NAS parallel benchmark. We observed that the record engine does not write the source/destination information to the rsd files if they are at an offset of -1, and the replay engine was not initializing the structure elements into which these values are copied; this caused the inconsistency in the output results. We fixed this by initializing them to -1, the default source/destination. Further, on running the benchmark we found that the MPI_Alltoallv() call was producing a "Truncated Length" error, although small programs with MPI_Alltoallv() ran successfully. It appears that the compression in the record engine does not perform parameter matching correctly. To summarize: if we run the IS NAS parallel benchmark with compression off, the benchmark runs successfully in both the record and replay frameworks; with compression turned on, the record engine does not output correct rsd files, and the replay engine crashes.
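
A tiny sketch of the initialization fix, using a hypothetical structure (the real replay data structures differ): the source/destination fields are preset to -1 before a record is parsed, so records that omit them still replay consistently.

    /* Illustrative sketch only, not the actual replay structures: preset the
       peer fields to the default offset of -1, since the record engine omits
       src/destination from the rsd file when the offset is -1. */
    struct replay_peer_info {
        int src_offset;
        int dest_offset;
    };

    static void init_replay_peer_info(struct replay_peer_info *info)
    {
        info->src_offset  = -1;   /* default when the rsd record carries no src */
        info->dest_offset = -1;   /* default when the rsd record carries no dest */
    }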

The modified source files can be obtained by clicking on the links below:
Changed record engine files: changed_record_src.tar.gz
Changed replay engine files: changed_replay_src.tar.gz
The final project report is available here: final.pdf

Contact Information: Apoorva Kulkarni - askulkar [at] ncsu [dot] edu
                     Vivek Thakkar - vthakka [at] ncsu [dot] edu