Homework Assignment #5 — Debugging Automation

In this assignment you will develop two debugging support tools to reduce the effort associated with software maintenance.

First, you will implement the Delta Debugging agorithm to minimize "interesting" sets.

Second, you will implement a Coverage-Based Fault Localization Algorithm to identify suspicious lines implicated in test failures.

You may work with a partner for this assignment. If you do you must use the same partner for all sub-components of this assignment. You must indicate your partner in the report. Use the Gradescope partner selection feature. Only one partner needs to submit the report on Gradescope, but if you both do, nothing fatal happens.

Delta Debugging

You must write a Python 3 program, delta.py, that implements delta debugging to find a one-minimal interesting subset of a given set. Your program takes two arguments:

the size n of the set to be minimized — for example, if n is 5, the program will find an interesting subset of {0,1,2,3,4}
a command that determines if a given subset is interesting — for example, if the command is is-interesting.sh, you may invoke is-interesting.sh 0 2 4 to probe if the subset 0 2 4 is interesting
- the command returns the exit code 1 if the subset is interesting and 0 otherwise
- note that command may be "bash ./is-interesting.sh" (or something similar long) on the autograding server for security/permission reasons; your program should correctly handle the case where the single command string passed in contains spaces

Your program should print out a one-minimal interesting subset in standard Python list format in sorted order. (This is the only output your program should produce. Do not do anything more. Do not submit a program with debugging output for a grade.)

Autograder Submission

We will evaluate your delta.py implementation in terms of the correct answers it generates (i.e., the one-minimal interesting subsets) and also in terms of the number of probes (i.e., calls to is-interesting.sh).

Minimizing Failure-Inducing Inputs

A standard use of Delta Debugging is to minimize failure-inducing inputs — inputs which cause a program to crash.

For example, consider this (inefficient) implementation of is-interesting.sh that checks to see if its command-line arguments contain both 3 and 6:

$ cat is-interesting.sh
#!/bin/bash
FIRST=0
SECOND=0
for i in $* ; do
  if [ $i -eq 3 ]; then FIRST=1 ; fi
  if [ $i -eq 6 ]; then SECOND=1 ; fi
done
if [ $FIRST -eq 1 ] ; then
  if [ $SECOND -eq 1 ] ; then
    exit 1 # interesting
  fi
fi
exit 0

To find a one-minimal interesting subset of [0,1,2,3,4,5,6,7,8], we would run:

$ python3 delta.py 9 ./is-interesting.sh
[3, 6]

The above example shows the only expected or required output for your program: the single line with the list of elements.

With debugging output turned on (do not submit with debugging output like this), we see:

$ python3 delta.py 9 ./is-interesting.sh
delta.py: dd(p= [] , c= [0, 1, 2, 3, 4, 5, 6, 7, 8] )
delta.py: calling ./is-interesting.sh 0 1 2 3
delta.py: calling ./is-interesting.sh 4 5 6 7 8
delta.py: dd(p= [4, 5, 6, 7, 8] , c= [0, 1, 2, 3] )
delta.py: calling ./is-interesting.sh 4 5 6 7 8 0 1
delta.py: calling ./is-interesting.sh 4 5 6 7 8 2 3
delta.py: dd(p= [4, 5, 6, 7, 8] , c= [2, 3] )
delta.py: calling ./is-interesting.sh 4 5 6 7 8 2
delta.py: calling ./is-interesting.sh 4 5 6 7 8 3
delta.py: dd(p= [4, 5, 6, 7, 8] , c= [3] )
delta.py: dd(p= [0, 1, 2, 3] , c= [4, 5, 6, 7, 8] )
delta.py: calling ./is-interesting.sh 0 1 2 3 4 5
delta.py: calling ./is-interesting.sh 0 1 2 3 6 7 8
delta.py: dd(p= [0, 1, 2, 3] , c= [6, 7, 8] )
delta.py: calling ./is-interesting.sh 0 1 2 3 6
delta.py: dd(p= [0, 1, 2, 3] , c= [6] )
[3, 6]

In this example we see that Delta Debugging made 9 probes to is-interesting.sh to find a one-minimal subset (compared to 2^9 = 512 to exhaustively find a minimal subset). Delta Debugging is usually very efficient.

Note that the debugging output shown above is only to help you see the algorithm in action and is not something you should have enabled in your final submission.

Minimizing Failure-Inducing Changes

To use Delta Debugging to minimize a different type of set, such as a set of failure-inducing changes, it suffices to change your is-interesting.sh script.

Consider the program wireworld.c from Rosetta Code. We have an original version of it as well as 10 patches:

$ cat patch.0
6a7
> this looks bad

$ cat patch.2
11c13
<     "+-----------+\n"
---
>     "+===========+\n"

$ cat patch.4
25c27
<   for (i = 0; i < w*h; i++) {
---
>   for (i = 0+0; i < w*h; i++) {

We observe that the original version compiles correctly:

$ gcc -c wireworld-original.c
$

But if we apply all of the patches, it no longer compiles:

$ cat patch.* | patch -p0 wireworld-original.c
patching file wireworld-original.c

$ gcc -c wireworld-original.c
wireworld-original.c: In function ‘next_world’:
wireworld-original.c:36:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
wireworld-original.c:36:11: note: each undeclared identifier is reported only once for each function it appears in

Given the correct definition of "interesting", Delta Debugging can find the bad subset of patches for us:

$ python3 delta.py 10 ./is-failure-inducing-change
delta.py: dd(p= [] , c= [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] )
delta.py: calling ./is-failure-inducing-change 0 1 2 3 4
patching file wireworld-original.c
delta.py: calling ./is-failure-inducing-change 5 6 7 8 9
patching file wireworld-original.c
foo.c: In function ‘next_world’:
foo.c:34:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in
delta.py: dd(p= [] , c= [5, 6, 7, 8, 9] )
delta.py: calling ./is-failure-inducing-change 5 6
patching file wireworld-original.c
foo.c: In function ‘next_world’:
foo.c:34:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in
delta.py: dd(p= [] , c= [5, 6] )
delta.py: calling ./is-failure-inducing-change 5
patching file wireworld-original.c
delta.py: calling ./is-failure-inducing-change 6
patching file wireworld-original.c
foo.c: In function ‘next_world’:
foo.c:34:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in
delta.py: dd(p= [] , c= [6] )
[6]

$ cat patch.6
34c36
<       out[i] = (hc == 1 || hc == 2) ? 'H' : '.';
---
>       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';

You must write is-failure-inducing-change (a bash shell script) that determines if given patches (named patch.0 through patch.n), when applied to wireworld-original.c, result in a GCC compilation error. Your script is responsible for restoring wireworld-original.c to its pristine state after each run. You will likely have to read up on how the Unix patch utility works.

The inputs, outputs and arguments to is-failure-inducing-change are exactly the same as is-interesting from HW5a.

The wireworld example shown above is available for local testing.

Minimizing Test Suites

We can also use Delta Debugging to find one-minimal subsets of test suites that retain the same coverage. We will return to one of our favorite programs, libpng-1.6.34.tar.gz, since many students manually minimized test suites for it in an earlier homework.

You should use Delta Debugging with the goal of finding a one-minimal subset of these 1639 test cases such that that subset has the same line coverage as the complete set as reported by gcov (just as in Homework 1). Read the PDF report guidelines carefully before beginning.

You must submit a short PDF report, via Gradescope, describing your experiences using Delta Debugging to minimize this test suite. In particular:

[interesting] What was your high level notion of interesting for this activity? How did you implement it?
[1 point for concept, 2 points for impelmentation]
[results] What was the original coverage of the test suite? What were the size and coverage of your final result? Were you able to use Delta Debugging to produce a one-minimal subset of the test suite with the same coverage? Why or why not?
- Note that both why and why not can potentially be full-credit answers, depending on the logical argument you present and the mastery of the course material you demonstrate. This is a tricky thought question.
- (You may be tempted to ask for clarifications on this point, but I'm afraid we must leave this aspect to your interpretation. Figuring out what it means to provide a high-quality writeup here is part of the assignment.)
[1 point for numbers, 5 points for explanation]
[time] How many probes did Delta Debugging make? What was the overall runtime? How does this formally compare to the expected "Big-Oh" performance? How does this informally compare to a manual effort to minimize the test suite (either your own in an earlier homework or a hypothetical situation)?
[1 point for numbers, 3 points for comparisons]
[conclusion] Would you use Delta Debugging to minimize a test suite in the future? Why or why not?
[2 point for insight]

What if test suite minimization seems to be taking too long?

Planning your time in advance and scheduling so that you do not start too late are major themes in this course.

Most students are able to run the delta.py computing part of HW5c "overnight". If it seems to be running for too long, one possibility is that you have made a mistake in your programming (e.g., of Delta Debugging or the definition of interesting) and another is that you are running on a very slow setup (e.g., a virtual machine in an unplugged laptop with files stored on a networked drive, etc.).

You can roughly predict how long the entire operation will take by measuring how long it takes to generate coverage for the full test suite and then studying the Big-Oh running time for Delta Debugging.

If all else fails and you fear it will take too long:

Consider passing multiple png files to pngtest at once. (That may give slightly different coverage results than in HW1, but as long as you do it consistently, it is fine for this assignment.)
Some students cached the per-test coverage results and simply re-used them later (e.g., if you know the lines covered by test #1 and the lines covered by test #2, simply union them to get the lines covered jointly).
Some students have reported success with using Amazon AWS instances or similar free-or-cheap cloud computing options.
If started the assignment too late and your computer will not finish in time, you may use a prefix of the test suite. That is, treat the assignment as if you had been given the first 500 tests in the zip file (for some value of 500 that will allow it to run overnight on your setup). Note that to do this you will have to change your definition of "full coverage" to match what is provided by those 500 tests. If you do this you must explicitly indicate the time taken to compute coverage on all 1639 test cases, the reduced number of tests chosen (e.g., 500), and the maximum total coverage associated with your test subset in your report.

Coverage-Based Fault Localization

In the second part of this assignment we will implement coverage-based fault localization atop the output of gcov.

In addition to reporting summary coverage statistics, gcov also indicates how many times each line was visited. This information can be found in the produced .gcov files (which are actually just text files!), as detailed in its documentation.

For example, consider a fragment of pngtest.c.gcov from an example run:

        6: 2020:         if (i == 1)
        2: 2021:            status_dots_requested = 1;
        -: 2022:
        4: 2023:         else if (verbose == 0)
        4: 2024:            status_dots_requested = 0;

The first number is the "number of visits observed" and the second number is the "line number in the original code". The rest of the text is the original source code to the program. So in this example, the if on line 2020 was visited 6 times. The true branch was taken 2 times, and the other four times, verbose was always 0. (You will only need to read the first "number", the colon and the second number.)

In class we discussed the Tarantula coverage-based fault localization algorithm. In this assignment you will implement the Ochiai coverage-based fault localization algorithm, as summarized in Pearson et al.'s Evaluating and improving fault localization (local copy). They are very similar, but Ochiai is believed to be better than Tarantula in the real world.

You must write a Python 3 program, faultloc.py, that takes the names of multiple .gcov files as command-line arguments. The arguments will all be named passi.gcov and faili.gcov. For example, you might receive pass0.gcov, pass1.gcov, pass2.gcov, and fail0.gcov, fail1.gcov as arguments. The former encodes coverage from passing runs, the latter from failing runs. Your program must compute the Ochiai suspiciousness rating for each applicable line and print out a sorted list of the top 100 most suspicious lines (and their suspiciousness values). (If there are fewer than 100 relevant lines, print all of them. If a line is marked ##### in all tests it is not relevant.) This printout should be formatted as a Python list of tuples. For example, if line 5 has suspiciousness 0.75 and line 9 has suspiciousness 0.2, then you would print:

[(5, 0.75), (9, 0.2)]

Note that you must print them in descending order of suspiciousness, with the most suspicious lines at the top. In case of a tie, break ties by printing the line with the smaller line number first.

Submission

Via the autograder, you should:

HW5a — submit delta.py (a Python 3 program)
HW5b — submit is-failure-inducing-change (a bash shell script that does not end in .sh)
HW5d — submit faultloc.py (a Python 3 program)

Via Gradescope, submit the PDF report for HW5c. You must indicate your name and UM id and partner's name and UM id (if applicable) in the report. There is no explicit format (e.g., for headings or citations) required. For example, you may either use an essay structure or a point-by-point list of question answers.