Homework Assignment #5 — Debugging Automation

In this assignment you will develop two debugging support tools to reduce the effort associated with software maintenance.

First, you will implement the Delta Debugging algorithm to minimize "interesting" sets.

Second, you will implement a Coverage-Based Fault Localization Algorithm to identify suspicious lines implicated in test failures.

You may work with a partner for this assignment. If you do you must use the same partner for all sub-components of this assignment. You must indicate your partner in the report. Use the Gradescope partner selection feature. Only one partner needs to submit the report on Gradescope, but if you both do, nothing fatal happens.

Delta Debugging

You must write a Python 3 program, delta.py, that implements delta debugging to find a minimal interesting subset of a given set. Your program takes two arguments:

the size n of the set to be minimized — for example, if n is 5, the program will find an interesting subset of {0,1,2,3,4}
a command that determines if a given subset is interesting — for example, if the command is is-interesting.sh, you may invoke is-interesting.sh 0 2 4 to probe if the subset 0 2 4 is interesting
- the command returns the exit code 1 if the subset is interesting and 0 otherwise
- note that command may be "bash ./is-interesting.sh" (or something similarly long) on the autograding server for security/permission reasons; your program should correctly handle the case where the single command string passed in contains spaces (for example, students using os.system() should treating it as one command: do not break it up)
- do not try to add your own "bash" or "./" or whatnot with string concatenation — really just simply use the command passed in and append some arguments
- students who are uncertain about how to invoke a shell script from Python are encouraged to check out Python functions such as subprocess.call(), subprocess.run(), or even os.system()

Your program should print out a minimal interesting subset in standard Python list format in sorted order. (This is the only output your program should produce. Do not do anything more. Do not submit a program with debugging output for a grade.)

If the interesting subset is empty, you would print the empty Python list.

There is no starter code, but the lecture slides provide direct pseudocode. This homework is asking you to implement the algorithm we covered in class.

Autograder Submission

We will evaluate your delta.py implementation in terms of the correct answers it generates (i.e., the minimal interesting subsets) and also in terms of the number of probes (i.e., calls to is-interesting.sh).

Minimizing Failure-Inducing Inputs

A standard use of Delta Debugging is to minimize failure-inducing inputs — inputs which cause a program to crash.

For example, consider this (inefficient) implementation of is-interesting.sh that checks to see if its command-line arguments contain both 3 and 6:

$ cat is-interesting.sh
#!/bin/bash
FIRST=0
SECOND=0
for i in $* ; do
  if [ $i -eq 3 ]; then FIRST=1 ; fi
  if [ $i -eq 6 ]; then SECOND=1 ; fi
done
if [ $FIRST -eq 1 ] ; then
  if [ $SECOND -eq 1 ] ; then
    exit 1 # interesting
  fi
fi
exit 0

To find a minimal interesting subset of [0,1,2,3,4,5,6,7,8], we would run:

$ python3 delta.py 9 ./is-interesting.sh
[3, 6]

The above example shows the only expected or required output for your program: the single line with the list of elements.

With debugging output turned on (do not submit with debugging output like this), we see:

$ python3 delta.py 9 ./is-interesting.sh
delta.py: dd(p= [] , c= [0, 1, 2, 3, 4, 5, 6, 7, 8] )
delta.py: calling ./is-interesting.sh 0 1 2 3
delta.py: calling ./is-interesting.sh 4 5 6 7 8
delta.py: dd(p= [4, 5, 6, 7, 8] , c= [0, 1, 2, 3] )
delta.py: calling ./is-interesting.sh 4 5 6 7 8 0 1
delta.py: calling ./is-interesting.sh 4 5 6 7 8 2 3
delta.py: dd(p= [4, 5, 6, 7, 8] , c= [2, 3] )
delta.py: calling ./is-interesting.sh 4 5 6 7 8 2
delta.py: calling ./is-interesting.sh 4 5 6 7 8 3
delta.py: dd(p= [4, 5, 6, 7, 8] , c= [3] )
delta.py: dd(p= [0, 1, 2, 3] , c= [4, 5, 6, 7, 8] )
delta.py: calling ./is-interesting.sh 0 1 2 3 4 5
delta.py: calling ./is-interesting.sh 0 1 2 3 6 7 8
delta.py: dd(p= [0, 1, 2, 3] , c= [6, 7, 8] )
delta.py: calling ./is-interesting.sh 0 1 2 3 6
delta.py: dd(p= [0, 1, 2, 3] , c= [6] )
[3, 6]

In this example we see that Delta Debugging made 9 probes to is-interesting.sh to find a minimal subset (compared to 2^9 = 512 to exhaustively find a minimal subset). Delta Debugging is usually very efficient.

Note that the debugging output shown above is only to help you see the algorithm in action and is not something you should have enabled in your final submission.

Minimizing Failure-Inducing Changes

To use Delta Debugging to minimize a different type of set, such as a set of failure-inducing changes, it suffices to change your is-interesting.sh script.

Consider the program wireworld.c from Rosetta Code. We have an original version of it as well as 10 patches:

$ cat patch.0
6a7
> this looks bad

$ cat patch.2
11c13
<     "+-----------+\n"
---
>     "+===========+\n"

$ cat patch.4
25c27
<   for (i = 0; i < w*h; i++) {
---
>   for (i = 0+0; i < w*h; i++) {

We observe that the original version compiles correctly:

$ gcc -c wireworld-original.c
$

But if we apply all of the patches, it no longer compiles:

$ cat patch.* | patch -p0 wireworld-original.c
patching file wireworld-original.c

$ gcc -c wireworld-original.c
wireworld-original.c: In function ‘next_world’:
wireworld-original.c:36:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
wireworld-original.c:36:11: note: each undeclared identifier is reported only once for each function it appears in

Given the correct definition of "interesting", Delta Debugging can find the bad subset of patches for us:

$ python3 delta.py 10 ./is-failure-inducing-change
delta.py: dd(p= [] , c= [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] )
delta.py: calling ./is-failure-inducing-change 0 1 2 3 4
patching file wireworld-original.c
delta.py: calling ./is-failure-inducing-change 5 6 7 8 9
patching file wireworld-original.c
foo.c: In function ‘next_world’:
foo.c:34:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in
delta.py: dd(p= [] , c= [5, 6, 7, 8, 9] )
delta.py: calling ./is-failure-inducing-change 5 6
patching file wireworld-original.c
foo.c: In function ‘next_world’:
foo.c:34:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in
delta.py: dd(p= [] , c= [5, 6] )
delta.py: calling ./is-failure-inducing-change 5
patching file wireworld-original.c
delta.py: calling ./is-failure-inducing-change 6
patching file wireworld-original.c
foo.c: In function ‘next_world’:
foo.c:34:11: error: ‘j’ undeclared (first use in this function)
       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';
           ^
foo.c:34:11: note: each undeclared identifier is reported only once for each function it appears in
delta.py: dd(p= [] , c= [6] )
[6]

$ cat patch.6
34c36
<       out[i] = (hc == 1 || hc == 2) ? 'H' : '.';
---
>       out[j] = (hc == 1 || hc == 2) ? 'H' : '.';

You must write is-failure-inducing-change (a bash shell script) that determines if given patches (named patch.0 through patch.n), when applied to wireworld-original.c, result in a GCC compilation error. Your script is responsible for restoring wireworld-original.c to its pristine state after each run. You will likely have to read up on how the Unix patch utility works.

The inputs, outputs and arguments to is-failure-inducing-change are exactly the same as is-interesting from HW5a.

The wireworld example shown above is available for local testing.

Minimizing Test Suites

We can also use Delta Debugging to find minimal subsets of test suites that retain the same coverage. We will return to one of our favorite programs, libpng-1.6.34.tar.gz, since many students manually minimized test suites for it in an earlier homework.

HW5c is a puzzle!

Many students find that they have questions about aspects of HW5c. However, HW5c is intentionally a puzzle — a thought question in which you are being assessed on your creativity and your ability to synthesize and apply course concepts.

As a result, we are not in a position to answer questions about HW5c. (Student questions about HW5c on the forum are usually directed to a uniform thread that simply restates, with sympathy, that we are not in a position to answer questions about this intentional puzzle.)

For example, students may run into tricky situations when working on HW5c, and debugging can be difficult when you don't know the right answer (as in real life). We definitely acknowledge that that is a challenge. However, we believe that a smaller number of open-ended challenge questions like this one provide a net benefit in terms of what you learn, internalize and retain from the course.

The most direct hint we can offer is to read the requirements for the written report for HW5c, including the hint about how multiple answers could result in full credit.

You should use Delta Debugging with the goal of finding a minimal subset of these 1639 test cases such that that subset has the same line coverage as the complete set as reported by gcov (just as in Homework 1). Read the PDF report guidelines carefully before beginning.

A key sub-action in HW5c is "If someone tells me to only look at A.png, B.png and C.png, can I compute the coverage associated with just running those three files as test inputs?" We recommend that you first make sure you can do so manually. Once you are certain you can do it correctly manually, you can code it up as the heart of your notion of interesting.

(Students sometimes ask about how to access gcov output from inside a script. There are many possible answers here. For example, you might redirect the output to a file and then read in that file. We also note that is-interesting does not have to be a bash script: you could write a Python script instead, which may simplify text processing and reasoning. Similarly, sometimes students are tempted to rerun make or configure inside their is-interesting scripts. Double-check HW1 and HW2 for a refresher on coverage instrumentation and why you do not need to do that here.)

You must submit a short PDF report, via Gradescope, describing your experiences using Delta Debugging to minimize this test suite. In particular:

[interesting] What was your high level notion of interesting for this activity? How did you implement it?
[1 point for concept, 2 points for implementation]
[results] What was the original coverage of the test suite? What were the size and coverage of your final result? Were you able to use Delta Debugging to produce a minimal subset of the test suite with the same coverage? Why or why not?
- Hint: Note that both why and why not can potentially be full-credit answers, depending on the logical argument you present and the mastery of the course material you demonstrate. This is a tricky thought question.
- (You may be tempted to ask for clarifications on this point, but I'm afraid we must leave this aspect to your interpretation. Figuring out what it means to provide a high-quality writeup here is part of the assignment.)
[1 point for numbers, 5 points for explanation]
[time] How many probes (i.e., calls to "is-interesting") did Delta Debugging make to determine if a set was interesting? What was the overall runtime? How does this formally compare to the expected "Big-Oh" performance? How does this informally compare to a manual effort to minimize the test suite (either your own in an earlier homework or a hypothetical situation)?
- (As a possible hint, if you feel like you don't have enough information to reason about Big-Oh notation, consider running your program on inputs of smaller and larger sizes. For example, if you ran your program on the first 50 tests, then separately on the first 100, then separately on the first 250 (noting that you'll have to change the definition of "same coverage as the original" each time), etc., you could start making a plot of "number of probes" as a function of "size of input". That sort of graph would go a long way to helping you reason about the asymptotic complexity.)
[1 point for numbers, 3 points for comparisons]
[conclusion] Would you use Delta Debugging to minimize a test suite in the future? Why or why not?
[2 point for insight]

What if test suite minimization seems to be taking too long?

Planning your time in advance and scheduling so that you do not start too late are major themes in this course.

Most students are able to run the delta.py computing part of HW5c "overnight". If it seems to be running for too long, one possibility is that you have made a mistake in your programming (e.g., of Delta Debugging or the definition of interesting) and another is that you are running on a very slow setup (e.g., a virtual machine in an unplugged laptop with files stored on a networked drive, etc.).

You can roughly predict how long the entire operation will take by measuring how long it takes to generate coverage for the full test suite and then studying the Big-Oh running time for Delta Debugging.

If all else fails and you fear it will take too long:

Consider passing multiple png files to pngtest at once. (That may give slightly different coverage results than in HW1, but as long as you do it consistently, it is fine for this assignment.)
Some students cached the per-test coverage results and simply re-used them later (e.g., if you know the lines covered by test #1 and the lines covered by test #2, simply union them to get the lines covered jointly).
Some students have reported success with using Amazon AWS instances or similar free-or-cheap cloud computing options.
If started the assignment too late and your computer will not finish in time, you may use a prefix of the test suite. That is, treat the assignment as if you had been given the first 500 tests in the zip file (for some value of 500 that will allow it to run overnight on your setup). Note that to do this you will have to change your definition of "full coverage" to match what is provided by those 500 tests. If you do this you must explicitly indicate the time taken to compute coverage on all 1639 test cases, the reduced number of tests chosen (e.g., 500), and the maximum total coverage associated with your test subset in your report.

Coverage-Based Fault Localization

In the second part of this assignment we will implement coverage-based fault localization atop the output of gcov.

In addition to reporting summary coverage statistics, gcov also indicates how many times each line was visited. This information can be found in the produced .gcov files (which are actually just text files!), as detailed in its documentation.

For example, consider a fragment of pngtest.c.gcov from an example run:

        6: 2020:         if (i == 1)
        2: 2021:            status_dots_requested = 1;
        -: 2022:
        4: 2023:         else if (verbose == 0)
        4: 2024:            status_dots_requested = 0;

The lines begin with whitespace. Then the first number is the "number of visits observed" and the second number is the "line number in the original code". The rest of the text is the original source code to the program. So in this example, the if on line 2020 was visited 6 times. The true branch was taken 2 times, and the other four times, verbose was always 0. (You will only need to read the first "number", the colon and the second number.)

You should generate your own .gcov files and take a look at them; the format (e.g., whitespace, etc.) might be slightly different on your machine.

In class we discussed the Tarantula coverage-based fault localization algorithm. In this assignment you will implement the Ochiai coverage-based fault localization algorithm, as summarized in Pearson et al.'s Evaluating and improving fault localization (local copy). They are very similar, but Ochiai is believed to be better than Tarantula in the real world.

Note that passed and failed and the like are defined in the paragraph just above the formula in the paper on page 3. Just look a few centimeters up from the bottom right! Totalpassed is the number of passed runs, or in our case, the number of "pass.gcov" files, while passed(s) is the number of "pass" files that visit line (s), and so on.
Note that passed(s) is not the number of times the line was visited in one single run. Instead, it is the number of test cases that visit that line at all. Even if a line inside a loop was visited a million times on one test case, that single test case still just counts for "1" in this math.

You must write a Python 3 program, faultloc.py, that takes the names of multiple .gcov files as command-line arguments. The arguments will all be named passi.gcov and faili.gcov. For example, you might receive pass0.gcov, pass1.gcov, pass2.gcov, and fail0.gcov, fail1.gcov as arguments. The former encodes coverage from passing runs, the latter from failing runs. Your program must compute the Ochiai suspiciousness rating for each applicable line and print out a sorted list of the top 100 most suspicious lines (and their suspiciousness values). This printout should be formatted as a Python list of tuples using the standard Python print function. For example, if line 5 has suspiciousness 0.75 and line 9 has suspiciousness 0.2, then you would print:

[(5, 0.75), (9, 0.2)]

Details:

If the beginning of a line is marked #####, - or ===== in all tests it is not relevant.
Note that lines can actually start with whitespace (see example above). You may find python functions such as split, strip or rstrip helpful.
If there are fewer than 100 relevant lines, print all of them.
When you are computing the fault localization formula (which involves division):
- If the numerator is zero, then you should include that statement with a suspiciousness of 0.0 in your final list. (Example: 0/5 is 0.0.)
- If the denominator is zero, then that value is undefined, so you should not include that statement with a suspiciousness value of 0.0 in your final list. (Example: 5/0 is undefined.)
You must print the answers in descending order of suspiciousness, with the most suspicious lines at the top. In case of a tie, break ties by printing the line with the smaller line number first.
To print the output, please use Python's print function on the list-of-up-to-100-pairs. Do not individually iterate through the elements and print them or cast values from one number format to another.

Submission

Via the autograder, you should:

HW5a — submit delta.py (a Python 3 program)
HW5b — submit is-failure-inducing-change (a bash shell script that does not end in .sh)
HW5d — submit faultloc.py (a Python 3 program)

Via Gradescope, submit the PDF report for HW5c. You must indicate your name and UM uniqname (i.e., UM email address) and partner's name and UM uniqname (if applicable) in the report. There is no explicit format (e.g., for headings or citations) required. For example, you may either use an essay structure or a point-by-point list of question answers.

FAQ and Troubleshooting

In this section we detail previous student issues and resolutions:

Question: For HW5b, when minimizing failure-inducing changes, my approach appears to work locally but not on the autograder.

Answer: Some students report that applying the patches one by one (rather than using cat to apply them "all at once") resolves this issue.
Question: I'm seeing an error like ./is_interesting.sh: line 4: syntax error near unexpected token `$'do\r''.

Answer: Some students resolved this by running dos2unix on the is-interesting.sh script, since there are extra invisible characters (the \r above) that Windows uses.
Question: I'm seeing an error like sh: 1: ./is-interesting-2.sh: Permission denied on the autograder.

Answer 1: Make sure you're using os.system to invoke the interesting function.
Answer 2: Make sure you're treating the command passed in to you as a single command, even if it is a string that contains spaces. Don't break it up. Locally, try out both "/bin/bash ./is-interesting.sh" and ./is-interesting.sh as commands and make sure your script works with both.
Question: I'm seeing /bin/bash: /bin/bash: cannot execute binary file on the autograder.

Answer: You don't need to add an "extra" bash to the beginning of the command you run. Just run the command you are given.
Question: When I run gcov it gives me coverage for each file but does not give the total coverage.

Answer: This is a gcov version issue. You need to install gcov-5 to get the summary.