MPI Trace Compression Tuning Project

Participants

Alex Balik and Tristan Ravitch

Background

In massively parallel applications, communication is often the cause of poor scalability. In order to better understand the communication patterns used in parallel applications and hopefully improve them, various tools such as mpiP and Vampir have been developed to record MPI usage. Unfortunately, these tools generally suffer from one of two problems:

  1. Lossy replay: In order to keep trace sizes scalable, some MPI information (such as a complete temporal ordering of events) is lost. mpiP falls in this category.
  2. Non-scalable storage requirements: Tools that maintain lossless traces of MPI events require each node to store its own trace file. This is not very scalable since increasing the number of nodes will cause an increase in the amount of space required to store the traces. Vampir falls in this category.

In an attempt to get the benefits of both types of MPI tools, Noeth et al. have developed a tool that compresses lossless MPI trace files into a single file whose size is, ideally, constant or near constant regardless of the number of nodes [1]. This is achieved using both intra-node and inter-node compression techniques. Intra-node compression uses regular section descriptors (RSDs) to represent repeated sequences of MPI calls (due to loops) in constant size. Stencil identification is also used to compress sequences of MPI calls that communicate in a fixed pattern (for example, in a 2D layout a node might repeatedly communicate with its neighbors to the north, south, east, and west). Inter-node compression takes all the trace files generated during program execution and compresses them down to a single file, grouping common MPI calls along the way.
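To make the intra-node idea concrete, the sketch below shows RSD-style compression in miniature: a trace is scanned for immediately repeated subsequences of events, and each repetition is folded into a single (count, sequence) descriptor. This is a minimal illustration, not the tool's actual algorithm or data format, and the MPI event names in the example trace are assumptions for demonstration only.

```python
# Minimal, hypothetical sketch of RSD-style intra-node compression.
# A loop that issues the same MPI calls N times is stored as one
# (count, sequence) descriptor of constant size, independent of N.
# This is illustrative only; the real tool's algorithm differs.

def compress_rsd(trace):
    """Fold immediate repetitions of a window into (count, sequence) RSDs."""
    out = []
    i = 0
    n = len(trace)
    while i < n:
        folded = False
        # Try window sizes from shortest to longest so the smallest
        # repeating unit (e.g. one loop body) is preferred.
        for w in range(1, (n - i) // 2 + 1):
            seq = trace[i:i + w]
            count = 1
            while trace[i + count * w: i + (count + 1) * w] == seq:
                count += 1
            if count > 1:
                out.append((count, seq))
                i += count * w
                folded = True
                break
        if not folded:
            out.append((1, [trace[i]]))
            i += 1
    return out

# A loop body repeated 1000 times compresses to a single descriptor.
trace = ["MPI_Isend", "MPI_Irecv", "MPI_Waitall"] * 1000
print(compress_rsd(trace))
# -> [(1000, ['MPI_Isend', 'MPI_Irecv', 'MPI_Waitall'])]
```

The same idea extends to stencil identification: if the destination ranks of the repeated sends differ only by fixed offsets (e.g. +1/-1 in each grid dimension), the offsets themselves can be recorded once instead of per iteration.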

Problem Description

There are three groups of benchmarks in the NAS Parallel Benchmark suite, grouped according to the performance of the MPI Trace Compression utility:

The poor scaling affects both trace output size and the time required to write out traces; both will need to be addressed. The effects of these scaling anomalies also appear in both task-level and cross-node compression.

The fact that there are two distinct groups of performance anomaly suggests that there are at least two different problems at work (or, in the best case, that the use of one particular idiom is aggravated by some unusual pattern in CG, BT, and FT). An initial guess is that there is a common problem shared by both groups, and that the super-linear group exhibits additional unusual usage patterns.

Another complication is that the NAS benchmark code is written entirely in Fortran.

Plan of Attack

There are four immediately obvious angles from which to approach the problem:

We will have to become familiar enough with Fortran to at least be able to examine loop constructs and MPI calls (and the destinations of those MPI calls).

Timeline

Week 1
  • Identify all of the MPI calls (and strides) in LU and CG.
  • Find the stencil identification code in the MPI Trace Compression utility (as well as the code for the task compression and the cross-node compression).
  • Examine the problematic traces on LU and CG (compared to IS).

Goal - Become familiar with the utility and benchmark code, identify areas for improvement.

We primarily tackled linking problems and identified probable areas that will need to be modified after we manage to acquire traces for our two benchmarks. We have two separate paths of inquiry in addressing these linking errors:

We managed to fix the Fortran linking through the process described in the first link above.

Week 2
  • Alex/Tristan - Investigate intra-node compression to determine if there is any room for improvement.
  • Alex - Determine if intra-node compression can be modified to reduce the trace sizes from LU
  • Tristan - The same for CG
  • Tristan - Produce a few comparative traces for IS, LU, and CG (now that we can) to see the growth curves at low node counts. Accompany with charts.

Goal - Determine whether CG and LU can be improved with just changes to stencil code.

The last fix we found allowed us to get production traces from all of the benchmarks; debug traces still seemed problematic (the benchmarks would complete but segfault at the end, before outputting anything). With a very limited window size, the traces sometimes completed but were not very informative. The first result below fixes some of this limitation and allows us to get partial debug traces (task-level compressed only).

Coming Soon

Week 3
  • Finish remaining work on the stencil code.
  • Move on to task compression code if there are still issues.

Goal - Determine if changes to the task-level compression can improve performance on CG and LU.

Coming Soon

Week 4
  • Finish any work on task compression
  • Move on to cross-node compression code if issues persist.

Goal - Determine if changes to the cross-node compression can improve performance on CG and LU.

Coming Soon

References

  1. M. Noeth, F. Mueller, M. Schulz, and B. de Supinski. Scalable Compression and Replay of Communication Traces in Massively Parallel Environments.
  2. A related thesis
  3. Assignment
  4. NAS Parallel Benchmarks (2.4-MPI)
  5. Fortran language
  6. Fortran 90 introduction
  7. LU Decomposition (LU)
  8. Conjugate Gradient (CG)
