A video showing an interpretation of the output as spatially resolved: although the system predicts only a single value per channel, the output can be interpreted as an H×W image.
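One possible reading of this interpretation can be sketched in NumPy. The sketch assumes that the number of output channels equals H·W and that channels map to pixels in row-major order; neither assumption is stated in the source, and `channels_to_image` is a hypothetical helper introduced only for illustration.

```python
import numpy as np


def channels_to_image(per_channel, height, width):
    """Reinterpret a flat vector of per-channel values as a (height, width) image.

    Assumption (not from the source): there are exactly height * width channels,
    and channel k corresponds to pixel (k // width, k % width).
    """
    values = np.asarray(per_channel)
    if values.ndim != 1 or values.size != height * width:
        raise ValueError(
            f"expected a flat vector of {height * width} values, got shape {values.shape}"
        )
    # reshape uses row-major (C) order by default, matching the assumed mapping.
    return values.reshape(height, width)


# Hypothetical example: 12 per-channel predictions viewed as a 3x4 image.
H, W = 3, 4
output = np.arange(H * W, dtype=np.float32)
image = channels_to_image(output, H, W)
```

Under these assumptions the reshape only changes how the values are indexed, not the values themselves. If the system's channels are ordered differently, the mapping from channel to pixel would need to change accordingly.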