Overview

This webpage lays out potential ideas for future research, experiments and methodologies related to the intersection of medical imaging and software engineering.

Future fMRI Experiments

Here are some subsequent fMRI experiments we are considering (in no particular order). Please let me know if you might be interested in collaboration.

Program Creation

With our approved fMRI-safe keyboard, we could study program or patch creation. Just as our first paper contrasted code review, code comprehension and prose review, we might contrast patch/program creation, prose/poetry writing, and mathematical equation solving. Programming tasks might be taken from classroom exercises or real bug reports and feature requests. In addition to contrasting code creation with prose creation, we could also correlate with expertise. Key challenges on the software side include communicating the context required to fix a defect and assessing the quality of the resulting patch. A key challenge on the medical imaging side involves experimental design for short time windows (see below).

Feedback creation: We have been viewing code review as a binary-valued activity (i.e., would you accept this or not). In practice, people give feedback for improvement. How does that feedback vary between code and prose?

Longitudinal Learning

Following longitudinal fMRI learning studies in psychology (e.g., which correlate learning rates with patterns of brain connectivity or activation), we could carry out a longitudinal fMRI learning study in which participants are exposed, over days or weeks, to MOOC videos and quizzes on a new topic. The MOOC video style controls for instructor quality and what the student is exposed to, as well as providing a standard functional outcome approach to measure learning. "What correlates with effective learning of CS concepts? What predicts rapid CS learning? Which parts of the brain are active during CS learning?" We might also investigate the relationship between student background, educational effectiveness, and neural activation.

If we go in this direction, we should check in with the UMich MIDAS Center, which is quite interested in the use of data analytics techniques that relate to education.

Code Review and Provenance

Following psychology research suggesting that our brains behave differently in the presence of social hierarchies, perceived cheating, or mental-vs.-emotional aspects of trolley problems, we could ask participants to perform code review on patches of various labeled provenances. If participants are working on security-critical software, how do they judge patches apparently written by trusted colleagues, untrusted enemies, "think of the children / we have to ship soon to make money" emotional appeals, or the lab-coated experimenter? In addition, by seeding defects in the patches, we can relate code review "accuracy" to patterns of neural activation and patch provenance labels.

Expertise

Working with partners in industrial research labs, we could replicate our protocol (either code review or code creation, etc.) with experts who have years of post-education job experience (i.e., not students). How do their patterns of neural activation differ from those of novices? In addition, we might more formally ask them to justify or explain their decisions and correlate those with imaging results. For example, when participants say they were paying attention to feature X of the problem, can we find any evidence of that, or was the justification perhaps subconsciously generated post facto? Additionally, we might more directly construct stimuli with correct answers and then compare two dimensions of expertise (years of experience as well as correctness) with imaging results.

Technical Interviews

We might investigate patterns of neural activation in "CS interview questions" of the type used by Google, Facebook and Microsoft (e.g., solve the Dutch flag problem, write code to insert a node into this doubly-linked list). Trained interviewers (like Adam Brady or Pieter Hooimeijer) might be willing to conduct the interviews as well as give time-stamp annotations (e.g., at this point I thought the participant was doing well, at this point I thought the coding answer was weak, etc.) and final grades. Are there patterns of neural activation that correlate with interview success? Are the patterns of neural activation in technical interviews similar to those in CS tasks like code review or code creation? These interviews are a huge part of modern CS employment but are relatively under-studied. One challenge might be to contrast interview neural activation with "actual software engineering" neural activation (cf. the criticism that there is only a tenuous relationship between CS technical interview performance and subsequent employee performance).

Data Structure Representations

Following psychology results that show patterns of neural activation correlate with the angular difference in mental rotation tasks, we might present participants with data structure tasks (e.g., rebalance this red-black tree) and look for corresponding spatial patterns of activation. If your brain appears to make a model and sweep out the angle to solve mental rotation tasks, does it also make an identifiable model for graphs, trees, skip lists, etc.? If certain problems admit model making and others do not, does that correlate with accuracy or speed?

Aging

Following psychology results that suggest more diffuse patterns of activation with age (as if the brain recruits nearby areas to help solve problems that used to be solved locally), we could replicate our comprehension and review protocol with more senior participants. Finding that senior participants recruit visual-spatial centers to help with CS problems (as opposed to verbal ones, etc.) has implications for retraining an aging workforce. Do we then see more rapid or effective learning with learning materials tailored to visual-spatial reasoning (as opposed to verbal reasoning, etc.)?

Distractions

We might study the effects of visual and audio distractions on CS tasks such as code review, code comprehension or patch creation. What are the relative effects of various audio distraction stimuli (e.g., people talking about unrelated code, people talking about exactly this code, people talking about non-CS topics) on accuracy and speed? Similarly for visual stimuli, such as IM or email notifications in the corner of the screen. These questions are increasingly relevant as companies move to open cube farm floor plans and mandate the use of particular IM suites (cf. IBM).

Future Medical Imaging Experiments

We are also interested in using other medical imaging or psychology techniques, including tachistoscopes, transcranial magnetic stimulation, functional near-infrared spectroscopy, eye tracking, and event-related potentials.

Priming

We could adapt the "priming" paradigm from psychology to investigate which concepts are related in the brains of participants for certain SE tasks. For example, "table" and "house" prime each other in standard psych task formulations, but random words do not. We might examine whether some concepts prime for a given task regardless of the subject code at hand and whether others are subject-code specific. We could adapt the "masking" paradigm from psychology to see which sorts of information influence response times in CS tasks (and are thus presumably at least partially processed subconsciously). For example, psychology research has found that realistic chess endgames help experts but random chess boards do not. Do we see a similar pattern for well-, non- or randomly-indented programs presented quickly and masked? In the same way that words like "anger" reduce the time to identify a face as unhappy, but also generate detectable but below-threshold activation patterns in the arm about to press that button, can we see similar results in CS? Concretely, for both masking and priming we might study rapid fault localization or code review tasks and prime/mask with information like a relevant type, a verbal description of the bug, the color of the syntax highlighting near the bug, the spatial location of the bug, etc. Which ones result in more rapid activation times and thus presumably have subconscious processing?

Transcranial Magnetic Stimulation

We might augment a "read-only" fMRI analysis with transcranial magnetic stimulation. Suppose previous experiments have determined that certain brain regions are associated with particular CS tasks (e.g., code review, code comprehension, code creation, etc.). How do subjects carry out those tasks when those regions of the brain are temporarily disabled? For example, what sort of patches can you create without the "verbal center" of your brain (imagine for simplicity there is such a thing)? Without the "analytic reasoning center"? Without the "moral judgment center"? How do those relate to qualities of the produced patches, such as correctness, commenting quality and clarity, and security? One challenge would be to frame the novelty and scientific argument for a computer science audience (cf. "if we turn off the obvious parts of your brain you cannot read code or write programs") or otherwise construct a careful experimental control.

Functional Near-Infrared Spectroscopy

This technique is cheaper and more flexible than fMRI and may allow for more ecologically valid experimental settings (i.e., outside of a small tube). fNIRS offers better spatial resolution than EEG, but still only measures signals relatively close to the cortical surface. Since the signal diffuses through the scalp, localization is not as precise as fMRI. Resolution is influenced by how many sources and detectors the cap has. Temporal resolution is theoretically closer to that of an EEG, but since fNIRS would still be measuring hemodynamic response, it would be similar to fMRI for our purposes. For certain questions this might be a better or worse fit than using an EEG.

Event-Related Potentials

Following research in psychology suggesting that college students (but not young children) process text abbreviations as automatically as they do regular reading, we might use event-related potentials to further investigate expertise and automaticity. Event-related potentials could be a great way to add more detail to what an fMRI can show. For example, does a line of clearly incorrect code (e.g., the last keyword is word) elicit an N400 in expert (but not novice) programmers? If so, it further supports the notion that skilled programmers process code like language. ERPs may also help elucidate what level of programming experience it takes to be fluent in code. For example, since ERPs enjoy greater time sensitivity than fMRI, we could evaluate whether differences in skill translate to automaticity at the millisecond level.

Eye Tracking

Following work done by Dror Feitelson and others relating code comprehension to eye tracking (e.g., programmers may look over code in a very non-linear pattern), we might combine eye tracking with fMRI. We could investigate using fMRI to get more information related to what is going on and why in such tasks (e.g., which parts of the brain are engaged when the eye is focused on a particular syntactic or semantic element of a program). In practice, most imaging centers have the necessary equipment to combine eye tracking with fMRI, allowing us to present a participant with a stimulus and simultaneously record patterns of neural activation and eye tracking information, and thus informally link up what participants are looking at with what they are thinking.

Experimental Design

We note a critical issue associated with experimental design for medical imaging and software engineering.

A major challenge for future fMRI studies in this area will be correct stimulus design and trial length. The assumption of linearity in the BOLD signal breaks down after about twenty seconds. A related issue is the possibility of more trialwise variability in the signal itself. That is, the evoked response may vary not only with the content of individual trials, but also over time within a particular trial (especially if participants read code in a non-linear fashion, etc.). For standard GLM-based analyses, such variability can dilute the estimated response when we average over many trials. Addressing this variability will be necessary if we want to investigate things longitudinally or across groups (e.g., age-related or expertise-related comparisons, where we statistically require the within-subject variance to be smaller than the between-subjects variance).
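To make the dilution concern concrete, here is a minimal sketch of how trial-to-trial timing variability can flatten an averaged response. It is a toy simulation, not a GLM analysis and not our pipeline. The Gaussian bump is only a stand-in for an evoked response (not a real hemodynamic response function), and the function names, peak time, width, jitter values and trial counts are all hypothetical choices made for illustration.

```python
import math
import random


def evoked_response(t, shift=0.0, peak_time=5.0, width=2.0):
    """Toy evoked response: a unit-height Gaussian bump peaking at peak_time + shift.

    Assumption: this shape is illustrative only, not a canonical HRF.
    """
    return math.exp(-((t - shift - peak_time) ** 2) / (2 * width ** 2))


def averaged_peak(n_trials, jitter_sd, seed=0):
    """Average n_trials toy responses with random latency shifts; return the peak."""
    rng = random.Random(seed)
    times = [0.5 * i for i in range(61)]  # sample 0..30 seconds in 0.5 s steps
    shifts = [rng.gauss(0.0, jitter_sd) for _ in range(n_trials)]
    mean_curve = [
        sum(evoked_response(t, shift=s) for s in shifts) / n_trials
        for t in times
    ]
    return max(mean_curve)


# Same response on every trial versus responses whose timing varies by trial.
aligned = averaged_peak(n_trials=200, jitter_sd=0.0)
jittered = averaged_peak(n_trials=200, jitter_sd=3.0)

print(f"peak of average, no timing variability:   {aligned:.2f}")
print(f"peak of average, with timing variability: {jittered:.2f}")
```

In this sketch every individual trial has the same unit peak, yet the averaged curve is visibly lower once the timing varies across trials. Real analyses would face the same effect with noise and a fitted response model on top.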

Acknowledgments

This webpage reflects research discussions and brainstorming with, or summarizes ideas presented or inspired by: Claire Le Goues, Dror Feitelson, Emily Jasinski, Kevin Leach, Tyler Santander.

These researchers should not be construed as endorsing these ideas, but we wish to thank them for sharing their insights.