Homework Assignment #3 — Mutation Testing

In this assignment you will write a tool to carry out simple mutation testing on Python programs. Given a program, your tool must generate mutants of that program such that the mutation adequacy score correctly rank-orders a number of held-out test suites for that program in terms of their quality.

You may work with a partner for this assignment. If you do you must use the same partner for all sub-components of this assignment. Use the Gradescope feature to select your partner. Only one partner needs to submit the report on Gradescope, but if you both do, nothing fatal happens.

Cross-Platform Compatibility and Python

It is your responsibility to submit Python code that works on the grading server. This is true even if (indeed, especially if) the grading server uses a different version of Python than your development environment.

This assignment uses Python because the focus is on the mutation algorithm and not the compiler front-end. In theory it would be just as easy to do this assignment on C++ code using LLVM (e.g., via clang) or the like. In practice that would just add out-of-scope overhead without reinforcing course concepts. If you are interested in such topics, consider the Compilers and Programming Languages electives.

Python 3 This Time

The grading server uses Python 3.5.2 for this assignment. (Yes, this is a change from other Homeworks.)

HW3a — Mutation Testing (Python)

You mutation testing tool will take the form of a Python program called mutate.py. Your mutate.py accepts two command line arguments: first, the name of a Python source program to mutate; second, the number of mutants to produce (see below).

Your mutate.py must create a number of new Python source files in the same directory that are mutants of the input program. The mutants must be named 0.py, 1.py, 2.py, and so on. Each mutant is a copy of the original program with one or more mutation operators (see below) applied. Any other output (e.g., logging to standard output) your mutate.py produces is ignored.

The data structure used to represent Python source files is the abstract syntax tree (AST).

You should use the ast module (to read Python source files into abstract syntax trees) as well as the astor module (to serialize abstract syntax trees out to Python source files).

Many students report that the so-called "missing Python AST docs" are quite helpful here.

Other students and instructors have found this quick Python AST Tutorial to be quite helpful (thanks S. Ajami) for the basic concepts.

Your program must be deterministic in the sense that if it is run with the same arguments it should produce the same outputs. (So if you want to use "random numbers", either set the seed to zero or somesuch at the start or use the index of the file you are creating as the random number/seed.)

The ast.parse, ast.NodeVisitor, ast.NodeTransformer, and astor.to_source portions of the interface are all likely to be quite relevant for this assignment.

Mutation Operators

At minimum, you must implement and support the following three mutation activities:

  1. Negate any single comparison operation type (e.g., >= becomes <).
    • (You should strongly consider handling more than one comparison type. You have free choice for what you do, if anything, for other "corner-case" booleans such as is or True.)
  2. Swap binary operators + and -, as well as * and //.
    • (You have free choice of which AST node classes are relevant here, such as how or whether you deal with FloorDiv vs. regular division. Any non-empty choice is full credit. You should strongly consider handling more than binary operators as well.)
    • (You can swap operators "as you like". The most obvious approach is to swap + with -, and so on, but you can try something more creative if you like.)
  3. Delete an assignment or function call statement (i.e., cause it to have no effect).
    • (You have free choice of which AST node classes are relevant here. For example, you might decide that ``assignment'' means Assign, AnnAssign and AugAssign. You may also choose to deal with assignments, function calls, or both. Any non-empty choice is full credit.)
    • (Other edits that have the "effect" of deleting the assignment or function call but are not "officially" deletion are totally fine and are full credit.)

Note that implementing those three actions may not be enough to get full points on the autograder, but you do have to implement those three and describe them in your report to get full credit on the report. You may also need to implement more operators or heuristics to get full credit on the autograder.

Note also that you do not have to use all three with the same frequency (or even use them at all — you could set their probabilities to zero). You do, however, have to implement them as a minimal baseline.

Mutation Testing

For this assignment we will make use of the FuzzyWuzzy string matching library as a subject program. A version of the library merged into one file is available here; it is 686 lines long. You could also do git clone https://github.com/seatgeek/fuzzywuzzy.git, but the merged-into-one-file version prevents you from having to write a mutator that handles multiple files, and is thus highly recommended.

In addition, two public test suites are provided: publictest-full.py and publictest-half.py. The latter was produced by removing every other test from the former. They are thus not independent high-quality test suites, and you are encouraged to make your own higher-quality test suites (e.g., using the techniques from Homeworks 1 and 2).

Recall that one goal of mutation testing is to assess the adequacy of a test suite. We would thus like the automatic results of our mutation testing to agree with our intuitive assessment that the full test suite is of higher quality than the half test suite.

Here is how an example run might go:

$ python mutate.py fuzzywuzzy.py 10

$ cp fuzzywuzzy.py saved.py ; for mutant in [0-9].py ; do rm -rf *.pyc *cache* ; cp $mutant fuzzywuzzy.py ; python publictest-full.py 2> test.output ; echo $mutant ; grep FAILED test.output ; done ; cp saved.py fuzzywuzzy.py 
0.py
FAILED (failures=2)
1.py
FAILED (errors=24)
2.py
3.py
FAILED (errors=35)
4.py
FAILED (failures=5)
5.py
6.py
FAILED (failures=22, errors=1)
7.py
FAILED (failures=1)
8.py
9.py
FAILED (errors=40)

$ cp fuzzywuzzy.py saved.py ; for mutant in [0-9].py ; do rm -rf *.pyc *cache* ; cp $mutant fuzzywuzzy.py ; python publictest-half.py 2> test.output ; echo $mutant ; grep FAILED test.output ; done ; cp saved.py fuzzywuzzy.py 
0.py
1.py
FAILED (errors=11)
2.py
3.py
FAILED (errors=17)
4.py
FAILED (failures=3)
5.py
6.py
FAILED (failures=11)
7.py
FAILED (failures=1)
8.py
9.py
FAILED (errors=21)

In this example we first create ten mutants. We then run each of those ten mutants against the full test suite, and notice that seven of the mutants are "killed" by at least one test (i.e., fail at least one test). Finally, we run each of those mutants against the half test suite, and notice that only six of the mutants are killed. Thus, in this example, the mutation adequacy score for the full suite is 0.7 and the mutation adequacy score for the half suite is 0.6, so mutation testing has assigned test suite qualities that order the test suites in a way that matches our intuition.

(If you don't ever see FAILED when you run this locally, it typically means that your local installation doesn't have some Python libraries or modules installed. View the contents of gthe file test.output for more information.)

In the example above, note FAILED (failures=22, errors=1). The dinstinction between "failures" and "errors" is made by Python's Unit Testing framework and does not matter for our grading. Basically, a "failure" refers to failing the unit test while an "error" refers to any other exception. You are likely doing a better job if you get all failures and no errors, but our grading server just looks for FAILED (which includes both). See Python's Documentation for more information.

Ten mutants is not a large number for randomized testing. We could gain additional confidence by using a larger sample:

$ cp original_saved_fuzzywuzzy.py fuzzywuzzy.py 
$ python mutate.py fuzzywuzzy.py 100

$ cp fuzzywuzzy.py saved.py ; for mutant in [0-9]*.py ; do rm -rf *.pyc *cache* ; cp $mutant fuzzywuzzy.py ; python publictest-full.py ; done >& test.output ; grep FAILED test.output | wc -l ; cp saved.py fuzzywuzzy.py 
46

$ cp fuzzywuzzy.py saved.py ; for mutant in [0-9]*.py ; do rm -rf *.pyc *cache* ; cp $mutant fuzzywuzzy.py ; python publictest-half.py ; done >& test.output ; grep FAILED test.output | wc -l ; cp saved.py fuzzywuzzy.py 
38

In the example above, 46 of the mutants are killed by the full suite but only 38 are killed by the half suite.

Held-Out Test Suites

Writing a mutation analysis that can tell high-quality test suites from low-quality test suites is the purpose of this assignment. If we give too many details about the held-out test suites there is a significant danger that students will "overfit" their answers, making something that satisfies the autograder rather than something that imparts useful software engineering knowledge.

As a result, we intentionally give only vague information about what is in the held-out test suites:

The most common student questions for this assignment are along the lines of "I'm not sure how to improve my score on the autograder. My mutation analysis cannot tell test suites X and Y apart — it gives them the same score. What's in those tests? What should I do?" We explicitly will not tell you more that what you see on this page.

Hints, Advice and Doing Well

Many students report that they find it easy to get a few points on this assignment but hard to get a high score. That is, it is hard to write a mutate.py that produces mutants that rank-order the high- and low-quality test suites.

If you find that your program is not producing enough high-quality mutants to assess the adequacy of a test suite:

You may want to avoid generating "useless" or "stillborn" mutants, such as those that do not parse, always exit with an error, are equivalent to a previously-generated mutant, etc. See Sections III and IV of the Jia and Harman survey for ideas. Here are some common student pitfalls. You should check locally to see if you are making any of these.

HW3b — Written Report

You must also write a short three-paragraph PDF report reflecting on your experiences creating a mutation tesitng tool for this assigment. Consider addressing topics such as:

Rubric:

The grading staff will select a small number of excerpts from particularly high-quality or instructive reports and share them with the class. If your report is selected you will receive extra credit.

Sanity Checking Your Submission

Historically, many students find it difficult to produce a mutation testing program that works as well on our grading server as it does locally. Ultimately, it is your responsibility to overcome this challenge. However, to provide additional clarity, this assignment features a ``Sanity Check'' autograder submission. You may submit to the Sanity Check option as often as you like; your score there has no influence on your final grade (for good or for ill).

The Sanity Check submission does include significantly more debugging output than the normal submission. Click on the right arrow icons to expand them and read the output carefully. You can use this to debug issues such as bad imports, invalid mutants, and so on.

The Sanity Check submission setup is just like the final submission setup, except that it uses the AVL tree Python program from a previous homework, coupled with a very simple test suite. You can get all of those files here (hw3-sanity.zip) for your own local testing.

Note that when you do the sanity checking submission, it is normal to see debugging output about the AVL program failing tests. That is the point of mutation testing: you create mutants in the hopes that they will fail the tests. You want to create mutants that are just strong enough so that they distinguish between test suites of varying quality, but not mutants that are so strong that they fail everything and every test suite gets the same score.

Submission

Submit a single mutate.py file to the autograder. Submit a single PDF report via Gradescope. In your PDF report, you must include your name and UM email id (as well as your partner's name and email id, if applicable).

For the autograder submission, we have manually prepared a number of private (i.e., held-out) test suites of known quality. For example, Private Test Suite A is known to be better than Private Test Suite B. You receive points if your mutant adequacy score shows that A is is better than B (i.e., if A kills more of your mutants than B). We check "A > B" and "A >= B" separately; the former is worth more points.

While there are visible tests with private content (in that you don't get to see what they contain), there are no hidden tests (here or on any other homework in this course), so the score you see on the Autograder is your final score.

FAQ and Troubleshooting

In this section we detail previous student issues and resolutions:

  1. Question: When I try to create mutants, even locally, the to_source function complains.

    Answer: If you create a Python AST that is missing critical information (like an important tree child, or location information) or otherwise create something totally invalid, the library won't even let you print it out. For example, if you had "x = y + z" and you remove the "z" node entirely without doing anything else, the library might not even print out the result at all because it is not valid Python. Even something as simple as saying ast.GtE instead of ast.GtE() can cause this.

  2. Question: When I submit my mutate.py to the sanity checker, I get:

    Compute Mutation Score
    Exit status: 0
    Stdout:
    
    =============================================
    = Your mutate.py did not run on the server. =
    =============================================
    Inspect the output nearby carefully to debug the problem.
    
    Stderr:
    
    Traceback (most recent call last):
      File "mutate.py", line 2, in 
        import strangelibrary
    ImportError: No module named 'strangelibrary'
    

    Answer: In such cases you will want to read the output carefully. In this particular example, the problem is that the submitted mutate.py tries to import strangelibrary, but that library is not available on the grading server. Remove the import and try again.

  3. Question: When I submit my mutate.py to the sanity checker, it somewhat works, but I see this if I scroll down to the bottom:

    Traceback (most recent call last):
      File "privatetest-a.py", line 6, in 
        import avl 
      File "/home/autograder/working_dir/avl.py", line 42
        else:
           ^
    IndentationError: expected an indented block
    

    Answer: You are creating mutants that are not valid Python programs. In this example, the student is removing an entire statement right after an else: — but then Python gets confused because it is expectint a tabbed-over statement there. So the mutant is not a valid Python program. Check out the hints on how to deal with this sort of issue.

  4. Question: My submission does not seem to create any mutants on the autograder and/or it can't seem to find them:

    ls: cannot access '[0-9]*.py]': No such file or directory
    

    Answer: This almost always means that your mutate.py has an error (such as importing a library not found on the autograder, or raising an exception) that means it is not running to completion.

    This issue is, in some sense, the most common student problem. There is no silver bullet here: test locally and make use of the sanity check submission.

  5. Question: It feels like the autograder is non-deterministic. I feel like I am submitting the same mutate.py twice and getting different results.

    Discussion: There are many possible issues here that all result in the same observed symptom.

    Answer 1: Your mutate.py may be non-deterministic. (This is rare, since student are careful about it.)

    Answer 2: Your mutate.py may be creating mutants that are, themselves, non-deterministic. This is much trickier, and happens surprisingly often.

    Answer 3: Your test automation may be using stale information or cached values. Even with the rm -rf *.pyc *cache*, you may be testing on mutants created by a previous run of mutate.py if you are not careful.

    Answer 4: There may be differences between the autograder setup and your local setup. For example, the autograder uses export PYTHONHASHSEED=0 to try to determinize the order in which hashmaps are iterated.

    Discussion: Ultimately, it may not be worth your time to track this down. Remember that we use your best submission, not your last one. Start early.

  6. Question: When I run locally, I get

    ImportError: No module named 'astor'
    

    Answer: Make sure you install (again!) with pip3 and python3 rather than pip and python.

  7. Question: I am trying to delete things, but I get this error:

    assert node is None, node
    AssertionError: 
    

    Answer: You have ast.Pass but want ast.Pass() instead (note the parentheses). However, you may actually not want to "delete" things like this at all (look around for other hints).

  8. Question: Is fuzzywuzzy.py being modified? When I run all of my mutants to test them it really looks like fuzzywuzzy is being changed.

    Answer: Good observation. Your mutate.py does not change it directly, but the example shell script commands we have above do (look for cp which means "copy"). Just copy a saved version back over the original.

  9. Question: When I run my mutate.py, I get this error when trying to print out my mutants:

    "AttributeError: 'Assign' object has no attribute 'value' "
    

    Answer: This error occurs when the AST library tries to pretty-print your abstract syntax tree data structure back out to (ASCII) Python source. The AST library expects each node object to have certain fields (properties) filled out. Sometimes a node operation that can look totally reasonable to you (e.g., making a new node, setting it as the child of some other node) can result in a node object not having all of the fields set. If you are changing the types of nodes directly, I would recommend that you instead use their friendly constructors — that may well end up fixing this problem for you. (This is explicitly a bit of a challenge; dig around first and get used to the AST library before asking for help here.)

  10. Question: How can I get the parts of an AST node? Suppose I want to figure out the identified on the left-hand side of an assignment:

    Assign(targets=[Name(id='x', ctx=Store())], value=Num(n=9))
    

    Answer: Check out the node documentation. In this particular example, some students report success with node.targets[0].id to extract the id from an instance of the the Assign class.

    Similarly, suppose you want to change the right-hand side of the assignment to "None", but leave the left-hand side alone. Have your node transformer do something like:

    return ast.copy_location(ast.Assign(targets=node.targets, value=ast.NameConstant(value=None)), node)
    
  11. Question: I've just started deleting statements, but now I get:

    AttributeError: module 'string' has no attribute 'strip'
    

    Answer: If you look closely, you'll see that fuzzywuzzy.py has if PY3: string = str on lines 14-15. If you make big changes there you end up changing which module the code will use, which can have far-reaching effects. Be careful!

  12. Question: I'm getting errors like

    AttributeError: module 'ast' has no attribute 'Constant' 
    
    or
    AttributeError: 'Classdef' object has no attribute 'starargs'
     

    Answer: The autograder uses Python 3.5 with AST 8.1. You'll want to make sure you're using the same versions.