Homework Assignment #5 — Debugging Automation

In this assignment you will develop two debugging support tools to reduce the effort associated with software maintenance.

First, you will implement the Delta Debugging algorithm to minimize "interesting" sets.

Second, you will implement a Coverage-Based Fault Localization Algorithm to identify suspicious lines implicated in test failures.

You may work with a partner for this assignment. If you do you must use the same partner for all sub-components of this assignment. You must indicate your partner in the report. Use the Gradescope partner selection feature. Only one partner needs to submit the report on Gradescope, but if you both do, nothing fatal happens.

Delta Debugging

You must write a Python 3 program, delta.py, that implements delta debugging to find a one-minimal interesting subset of a given set. Your program takes two arguments:

  1. the size n of the set to be minimized — for example, if n is 5, the program will find an interesting subset of {0,1,2,3,4}
  2. a command that determines if a given subset is interesting — for example, if the command is is-interesting.sh, you may invoke is-interesting.sh 0 2 4 to probe if the subset 0 2 4 is interesting
    • the command returns the exit code 1 if the subset is interesting and 0 otherwise
    • note that command may be "bash ./is-interesting.sh" (or something similarly long) on the autograding server for security/permission reasons; your program should correctly handle the case where the single command string passed in contains spaces (by treating it as one command: do not break it up)
    • do not try to add your own "bash" or "./" or whatnot with string concatenation — really just simply use the command passed in and append some arguments

Your program should print out a one-minimal interesting subset in standard Python list format in sorted order. (This is the only output your program should produce. Do not do anything more. Do not submit a program with debugging output for a grade.)

Autograder Submission

We will evaluate your delta.py implementation in terms of the correct answers it generates (i.e., the one-minimal interesting subsets) and also in terms of the number of probes (i.e., calls to is-interesting.sh).

Minimizing Failure-Inducing Inputs

A standard use of Delta Debugging is to minimize failure-inducing inputs — inputs which cause a program to crash.

For example, consider this (inefficient) implementation of is-interesting.sh that checks to see if its command-line arguments contain both 3 and 6:

$ cat is-interesting.sh
#!/bin/bash
FIRST=0
SECOND=0
for i in $* ; do
  if [ $i -eq 3 ]; then FIRST=1 ; fi
  if [ $i -eq 6 ]; then SECOND=1 ; fi
done
if [ $FIRST -eq 1 ] ; then
  if [ $SECOND -eq 1 ] ; then
    exit 1 # interesting
  fi
fi
exit 0

To find a one-minimal interesting subset of [0,1,2,3,4,5,6,7,8], we would run:

$ python3 delta.py 9 ./is-interesting.sh
[3, 6]

The above example shows the only expected or required output for your program: the single line with the list of elements.

With debugging output turned on (do not submit with debugging output like this), we see:

$ python3 delta.py 9 ./is-interesting.sh
delta.py: dd(p= [] , c= [0, 1, 2, 3, 4, 5, 6, 7, 8] )
delta.py: calling ./is-interesting.sh 0 1 2 3
delta.py: calling ./is-interesting.sh 4 5 6 7 8
delta.py: dd(p= [4, 5, 6, 7, 8] , c= [0, 1, 2, 3] )
delta.py: calling ./is-interesting.sh 4 5 6 7 8 0 1
delta.py: calling ./is-interesting.sh 4 5 6 7 8 2 3
delta.py: dd(p= [4, 5, 6, 7, 8] , c= [2, 3] )
delta.py: calling ./is-interesting.sh 4 5 6 7 8 2
delta.py: calling ./is-interesting.sh 4 5 6 7 8 3
delta.py: dd(p= [4, 5, 6, 7, 8] , c= [3] )
delta.py: dd(p= [0, 1, 2, 3] , c= [4, 5, 6, 7, 8] )
delta.py: calling ./is-interesting.sh 0 1 2 3 4 5
delta.py: calling ./is-interesting.sh 0 1 2 3 6 7 8
delta.py: dd(p= [0, 1, 2, 3] , c= [6, 7, 8] )
delta.py: calling ./is-interesting.sh 0 1 2 3 6
delta.py: dd(p= [0, 1, 2, 3] , c= [6] )
[3, 6]

In this example we see that Delta Debugging made 9 probes to is-interesting.sh to find a one-minimal subset (compared to 2^9 = 512 to exhaustively find a minimal subset). Delta Debugging is usually very efficient.

Note that the debugging output shown above is only to help you see the algorithm in action and is not something you should have enabled in your final submission.

Minimizing Failure-Inducing Changes

To use Delta Debugging to minimize a different type of set, such as a set of failure-inducing changes, it suffices to change your is-interesting.sh script.

Consider the program wireworld.c from Rosetta Code. We have an original version of it as well as 10 patches:

$ cat patch.0
6a7
> this looks bad

$ cat patch.2
11c13
<     "+-----------+\n"
---
>     "+===========+\n"

$ cat patch.4
25c27
<   for (i = 0; i < w*h; i++) {
---
>   for (i = 0+0; i < w*h; i++) {

We observe that the original version compiles correctly:

$ gcc -c wireworld-original.c
$

But if we apply all of the patches, it no longer compiles:

$ cat patch.* | patch -p0 wireworld-original.c
patching file wireworld-original.c

$ gcc -c wireworld-original.c
wireworld-original.c: In function ‘next_world’:
wireworld-original.c:36:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
wireworld-original.c:36:11: note: each undeclared identifier is reported only once for each function it appears in

Given the correct definition of "interesting", Delta Debugging can find the bad subset of patches for us:

$ python3 delta.py 10 ./is-failure-inducing-change
delta.py: dd(p= [] , c= [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] )
delta.py: calling ./is-failure-inducing-change 0 1 2 3 4
patching file wireworld-original.c
delta.py: calling ./is-failure-inducing-change 5 6 7 8 9
patching file wireworld-original.c
foo.c: In function ‘next_world’:
foo.c:34:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in
delta.py: dd(p= [] , c= [5, 6, 7, 8, 9] )
delta.py: calling ./is-failure-inducing-change 5 6
patching file wireworld-original.c
foo.c: In function ‘next_world’:
foo.c:34:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in
delta.py: dd(p= [] , c= [5, 6] )
delta.py: calling ./is-failure-inducing-change 5
patching file wireworld-original.c
delta.py: calling ./is-failure-inducing-change 6
patching file wireworld-original.c
foo.c: In function ‘next_world’:
foo.c:34:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in
delta.py: dd(p= [] , c= [6] )
[6]

$ cat patch.6
34c36
<       out[i] = (hc == 1 || hc == 2) ? 'H' : '.';
---
>       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';

You must write is-failure-inducing-change (a bash shell script) that determines if given patches (named patch.0 through patch.n), when applied to wireworld-original.c, result in a GCC compilation error. Your script is responsible for restoring wireworld-original.c to its pristine state after each run. You will likely have to read up on how the Unix patch utility works.

The inputs, outputs and arguments to is-failure-inducing-change are exactly the same as is-interesting from HW5a.

The wireworld example shown above is available for local testing.

Minimizing Test Suites

We can also use Delta Debugging to find one-minimal subsets of test suites that retain the same coverage. We will return to one of our favorite programs, libpng-1.6.34.tar.gz, since many students manually minimized test suites for it in an earlier homework.

You should use Delta Debugging with the goal of finding a one-minimal subset of these 1639 test cases such that that subset has the same line coverage as the complete set as reported by gcov (just as in Homework 1). Read the PDF report guidelines carefully before beginning.

You must submit a short PDF report, via Gradescope, describing your experiences using Delta Debugging to minimize this test suite. In particular:

What if test suite minimization seems to be taking too long?

Planning your time in advance and scheduling so that you do not start too late are major themes in this course.

Most students are able to run the delta.py computing part of HW5c "overnight". If it seems to be running for too long, one possibility is that you have made a mistake in your programming (e.g., of Delta Debugging or the definition of interesting) and another is that you are running on a very slow setup (e.g., a virtual machine in an unplugged laptop with files stored on a networked drive, etc.).

You can roughly predict how long the entire operation will take by measuring how long it takes to generate coverage for the full test suite and then studying the Big-Oh running time for Delta Debugging.

If all else fails and you fear it will take too long:

Coverage-Based Fault Localization

In the second part of this assignment we will implement coverage-based fault localization atop the output of gcov.

In addition to reporting summary coverage statistics, gcov also indicates how many times each line was visited. This information can be found in the produced .gcov files (which are actually just text files!), as detailed in its documentation.

For example, consider a fragment of pngtest.c.gcov from an example run:

        6: 2020:         if (i == 1)
        2: 2021:            status_dots_requested = 1;
        -: 2022:
        4: 2023:         else if (verbose == 0)
        4: 2024:            status_dots_requested = 0;

The lines begin with whitespace. Then the first number is the "number of visits observed" and the second number is the "line number in the original code". The rest of the text is the original source code to the program. So in this example, the if on line 2020 was visited 6 times. The true branch was taken 2 times, and the other four times, verbose was always 0. (You will only need to read the first "number", the colon and the second number.)

You should generate your own .gcov files and take a look at them; the format (e.g., whitespace, etc.) might be slightly different on your machine.

In class we discussed the Tarantula coverage-based fault localization algorithm. In this assignment you will implement the Ochiai coverage-based fault localization algorithm, as summarized in Pearson et al.'s Evaluating and improving fault localization (local copy). They are very similar, but Ochiai is believed to be better than Tarantula in the real world.

You must write a Python 3 program, faultloc.py, that takes the names of multiple .gcov files as command-line arguments. The arguments will all be named passi.gcov and faili.gcov. For example, you might receive pass0.gcov, pass1.gcov, pass2.gcov, and fail0.gcov, fail1.gcov as arguments. The former encodes coverage from passing runs, the latter from failing runs. Your program must compute the Ochiai suspiciousness rating for each applicable line and print out a sorted list of the top 100 most suspicious lines (and their suspiciousness values). This printout should be formatted as a Python list of tuples using the standard Python print function. For example, if line 5 has suspiciousness 0.75 and line 9 has suspiciousness 0.2, then you would print:

[(5, 0.75), (9, 0.2)]

Details:

Submission

Via the autograder, you should:

Via Gradescope, submit the PDF report for HW5c. You must indicate your name and UM id and partner's name and UM id (if applicable) in the report. There is no explicit format (e.g., for headings or citations) required. For example, you may either use an essay structure or a point-by-point list of question answers.

FAQ and Troubleshooting

In this section we detail previous student issues and resolutions:

  1. Question: For HW5b, when minimizing failure-inducing changes, my approach appears to work locally but not on the autograder.

    Answer: Some students report that applying the patches one by one (rather than using cat to apply them "all at once") resolves this issue.

  2. Question: I'm seeing an error like ./is_interesting.sh: line 4: syntax error near unexpected token `$'do\r''.

    Answer: Some students resolved this by running dos2unix on the is-interesting.sh script, since there are extra invisible characters (the \r above) that Windows uses.

  3. Question: I'm seeing an error like sh: 1: ./is-interesting-2.sh: Permission denied on the autograder.

    Answer 1: Make sure you're using os.system to invoke the interesting function.
    Answer 2: Make sure you're treating the command passed in to you as a single command, even if it is a string that contains spaces. Don't break it up. Locally, try out both "/bin/bash ./is-interesting.sh" and ./is-interesting.sh as commands and make sure your script works with both.

  4. Question: I'm seeing /bin/bash: /bin/bash: cannot execute binary file on the autograder.

    Answer: You don't need to add an "extra" bash to the beginning of the command you run. Just run the command you are given.

  5. Question: When I run gcov it gives me coverage for each file but does not give the total coverage.

    Answer: This is a gcov version issue. You need to install gcov-5 to get the summary.