Performance Analysis and Tuning with TAU

Kevin James Edwards
North Carolina State University
kjedward@ncsu.edu

Description

TAU is a performance analysis tool that helps developers write more efficient code. Its tools help locate inefficient portions of code so that they can be analyzed and improved. TAU can analyze parallel applications that use OpenMP, and it can be extended to work with programs that use the MPI standard.

Because efficiency matters especially for parallel applications, TAU and similar tools are important for identifying the portions of an application that hurt runtime. It can be difficult to determine where the weaknesses in a parallel application lie. With the help of TAU and other tools, one can identify where particular techniques might reduce runtime.

I plan to use TAU first to determine which inefficiencies are present in some benchmarks commonly used to test parallel architectures. I will focus on benchmarks that use OpenMP for threading across processors. I may also use other analysis tools such as mpiP, which will help me find areas for improvement in MPI benchmarks. After finding potential problem areas, I will attempt to eliminate or reduce these weaknesses. The goal is to improve the overall runtime of one or two benchmarks by using analysis tools to find their weak points and then applying efficiency-improvement techniques to them.

Progress

Week 1: The focus of the first week was getting TAU installed and running on some benchmarks. I was also to get some new benchmarks that we have not previously used this semester running on the os cluster. Unfortunately, I have yet to get TAU working with any benchmark, but I believe I am getting closer to this goal. I have also been working on building benchmarks such as SWEEP3D and those in the OpenMP version of the NAS Parallel Benchmarks.
Week 4: The project implementation and report phases are completed. My original plan included working with the SWEEP3D benchmark as well as developing and tuning an OpenMP implementation of Strassen's matrix multiplication algorithm. However, I had many problems getting TAU to profile parallel applications. After much work, I was finally able to get statistics for both MPI and OpenMP programs, but I was never able to get TAU to function with programs written in FORTRAN. Given these difficulties, combined with the poor performance of the OpenMP implementation of Strassen's algorithm, I had to modify my project slightly. Instead of focusing on the aforementioned benchmarks, I worked with OpenMP and MPI+OpenMP implementations of Strassen's algorithm. The OpenMP implementation performed quite poorly because of the large memory overhead of this algorithm. However, the MPI+OpenMP implementation performed very favorably. Specifically, with large matrices, performance is almost double that of the MPI+OpenMP three-loop algorithm previously implemented in class.
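
To illustrate where the memory overhead mentioned in Week 4 comes from, here is a minimal serial sketch of Strassen's algorithm next to the three-loop method. It is written in Python for brevity and is not the project's OpenMP or MPI+OpenMP code. The function names and the `cutoff` parameter are hypothetical, chosen only for this example. Each recursion level allocates quadrant copies, operand sums and differences, and seven product matrices, whereas the three-loop method needs only the output matrix.

```python
def add(A, B):
    """Element-wise sum; allocates a new matrix."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    """Element-wise difference; allocates a new matrix."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def three_loop_multiply(A, B):
    """Classic three-loop product; the only extra storage is C."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = A[i][k]
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C


def split(M):
    """Copy the four quadrants of a square matrix of even size."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=2):
    """Strassen product of two square matrices of equal size.

    Falls back to the three-loop method for small or odd sizes
    (the cutoff value is an assumption, not a tuned setting).
    """
    n = len(A)
    if n <= cutoff or n % 2:
        return three_loop_multiply(A, B)
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven half-size products instead of eight; each operand sum or
    # difference and each product is a freshly allocated temporary.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    # Stitch the quadrants back into one matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

The seven products do not depend on one another, so a threaded version could compute them concurrently, but each concurrent product then holds its own temporaries at the same time, which is one plausible way the memory footprint grows.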

Documentation

Project Proposal
Progress Report One
Final Report
Final Implementation