Gprof Profiling via Binary Instrumentation

Methodology

Overview
Our instrumentation of arbitrary binaries took place over the course of four stages. The first stage consisted of a familiarization with the tools we would be using (Gprof, Gnu C Compiler, and Dyninst). The second stage consisted of proving we could insert profiling code directly into the source of an application and ensuring the profile was correct. The third stage extended the second by ensuring that the inserted source code could be called from a dynamic library. This additional stage proved necessary because Dyninst calls its snippets from a dynamic library. The final stage consisted of dynamically inserting profiling instrumentation directly into arbitrary binaries using Dyninst.

Stage 1: Familiarization with tools
The first step of the project was becoming familiar with the tools we would be using to develop a method of instrumenting binaries with Gprof profiling. The tools we focused on were Gprof, the Gnu C Compiler (gcc), and Dyninst. Mike Noeth and Jyothish Varma worked to become team experts on Gprof and gcc while Tao Yang became the expert on Dyninst.

Through use of GDB and examination of the gcc source code, we discovered that a function called mcount was called at the beginning of every function to be profiled by Gprof. On the first call to mcount, all data structures for profiling were setup, and an initial call to profil (a system call to perform a stack walk and collect information every 10 milliseconds – see man profil for more information) was made. Subsequent calls to
mcount resulted in tallying calls to each individual function while the profil continued to perform its stack walk every 10 milliseconds. The final call to mcount resulted in all the data collected being written to file (gmon.out) and stopping the profil stack walk. The mcount function was extracted from the gcc source code and compiled into an object file from which we could link the mcount call into our programs.

Familiarization with Dyninst was broken down into three major tasks to be prototyped. The first task was to specify a binary for Dyninst to instrument with code. This would later allow our tool to instrument a target binary with Gprof profiling code. The second task required that we be able to identify all the functions used in a program. This would later allow our tool to identify the functions to be prototyped. The third task required that we be able to insert code at the beginning of a function. This would later allow our tool to instrument the necessary call to mcount at the beginning of each function.

Stage 2: Static compilation of mcount
The second stage consisted of creating a simple application and inserting profiling code into the existing source code by hand. This would ensure that we understood how mcount was working, and that we knew when and where to call it. Two binaries were compiled: one with the -pg option specified and the other with the profiling code inserted by hand. Both binaries generated gmon.out profiles that could be examined using Gprof.
To complete this stage, we needed to ensure the resultant profiles of both binaries matched. We began this stage by creating a micro benchmark that would serve as our application (Micro benchmark source code). The benchmark made multiple function calls to create a somewhat complex call graph. We compiled the micro benchmark and ensured it executed as expected. Next, we modified the micro benchmark with calls to mcount at the beginning of each function. The modified micro benchmark was linked to the object file containing mcount obtained in stage 1.
We compiled two binaries: the first was the unmodified micro benchmark compiled with the -pg option specified; the second was the modified micro benchmark linked with the mcount code obtained in stage 1. Both binaries generated gmon.out files which were examined using Gprof. The resultant profiles matched exactly in the number of calls made to each function as well as the time spent in each function. From here we decided
we were ready to move onto the next stage.

Stage 3: Dynamic compilation of mcount
Originally, we did not plan on a third stage. We thought that after proving our design via stage 2, we would be able to move on to stage 4 (implementing a tool to instrument binaries with Gprof profiling). After attempting stage 4 and seeing no success, we had to back track. Reviewing Dyninst documentation revealed that the code we were inserting was called from a dynamic library. Thus we decided that we had to call the mcount function from a dynamic library (by hand) to ensure this was not the cause of failure in stage 4. The mcount function discovered in stage 1 was compiled into a dynamic library. We loaded our dynamic library (appendix A – Wrapper source code) and rather than making
static calls to mcount in the modified micro benchmark, we called mcount from the dynamic library. This additional stage revealed that part of our problem was due to name collisions. When gcc is compiled and installed, it adds an additional library with mcount and its supporting
functions. Rather than calling our profiling code from the dynamic library we created, the functions from the library installed with gcc were being invoked. By renaming mcount and all the supporting functions, we were able to avoid the name collisions.

Stage 4: Dynamic instrumentation of mcount
With a new dynamic library created from stage 3 (with new names for mcount and its supporting functions) we attempted to create our tool again. We simply attempted to call the renamed mcount at the beginning of each function for an arbitrary binary. There were two complications at this phase. The first complication was that mcount requires knowing the address range of the code it is profiling, but this information was being
corrupted due to the use of dynamic libraries. The second complication was that Dyninst was messing up the stack. Thus, the stack walk code was unable to perform correctly. The first call to mcount setup data structures for tracking the profiling data. To allocate space for this data to be collected, mcount requires knowing the address range of the program being profiled. In the gcc version of mcount, a hard coded value is used to
specify the beginning address of the code segment, and a function etext (we believe this stands for end of text segment) returns the address of the last function in the text segment of a program to be used as the ending address of the code segment. The ranges were coming out incorrect due to another naming collision with etext. The etext function was not returning the correct address anymore. To resolve this issue, we used Dyninst to scan all the functions be profiled, collect their addresses, and pick the largest and smallest address found. The smallest address specified the beginning of the range while the large address specified the end of the range. With this modification, we were able to work around the name collision of etext. In order to create a call graph, Gprof requires knowledge of which function called mcount (we will refer to this function as “Caller #1”), and knowledge of which function called Caller #1 (we will refer to this function as “Caller #2”). To obtain these values, the mcount code uses a system call __builtin_addr(int level). To get Caller #1, mcount uses __builtin_addr(0), and to get Caller #2, mcount uses __builtin_addr(1). When Dyninst is used to insert calls to mcount, the stack is corrupted (see figure 1). Due to Dyninst’s method of inserting code, mcount could no longer determine Caller #2. We tried to use increasing levels of __builtin_addr, but these resulted in memory address
from the heap (where Dyninst created its own fake stack). When the top of the stack is reached, any higher levels cause the application to segfault.

To work around the Dyninst stack, we implemented our own simple stack (Fake stack source code). Upon entering a new function, we pushed that functions address onto the stack. Upon leaving a function, we popped the address off of the stack. We modified mcount so that it would use the address of the item on the top of the stack for Caller #1 (rather than __builtin_addr(0)) and the item one away from the top of the stack
for Caller #2 (rather than __builtin_addr(1)). With this modification, we were able to work around the stack corruption that Dyninst introduced.
After implementing our own range finder and stack, we were able to generate the correct profiles for arbitrary binaries. The proof of our design is shown in the next section.

For problems or questions regarding this Web site contact [jsvarma@ncsu.edu]