Method
Our model consists of two components, a program generator and an execution engine:
The program generator reads the text of the question
and outputs a program that can be executed to answer the question.
The program generator is implemented as an LSTM sequence-to-sequence
model.
The execution engine, implemented as a neural module network [1], executes
the predicted program on the image to answer the question. It learns a separate
module for each basic function; these modules are assembled according
to the predicted program, giving a customized neural network
architecture for each question.
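
The assembly step can be illustrated with a minimal sketch. It is not the implementation used here: it assumes that a program is a prefix-ordered list of function names and that each function has a fixed arity, and it replaces the learned neural modules with hand-written stand-ins that operate on a toy list of objects instead of image features. The names MODULES, assemble, and the function names (scene, filter_red, filter_cube, count, intersect) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "scene". In the model this would be image features.
Scene = List[dict]

# One module per basic function: name -> (arity, callable).
# Assumption: in the model each entry is a learned neural module;
# here each is a hand-written stand-in so the sketch runs without training.
MODULES: Dict[str, Tuple[int, Callable]] = {
    "scene": (0, lambda scene: list(scene)),
    "filter_red": (1, lambda scene, objs: [o for o in objs if o["color"] == "red"]),
    "filter_cube": (1, lambda scene, objs: [o for o in objs if o["shape"] == "cube"]),
    "count": (1, lambda scene, objs: len(objs)),
    "intersect": (2, lambda scene, a, b: [o for o in a if o in b]),
}


def assemble(program: List[str]) -> Callable[[Scene], object]:
    """Build a question-specific network from a prefix-ordered program."""

    def build(pos: int):
        # Look up the module for the current token, then recursively
        # build one sub-network per input the module expects.
        arity, fn = MODULES[program[pos]]
        pos += 1
        children = []
        for _ in range(arity):
            child, pos = build(pos)
            children.append(child)

        def node(scene: Scene):
            # Run the children first, then feed their outputs to this module.
            return fn(scene, *[c(scene) for c in children])

        return node, pos

    root, end = build(0)
    if end != len(program):
        raise ValueError("program has trailing tokens")
    return root


# Usage: "How many red cubes are there?" as a hypothetical program.
toy_scene = [
    {"color": "red", "shape": "cube"},
    {"color": "blue", "shape": "cube"},
    {"color": "red", "shape": "sphere"},
]
network = assemble(["count", "filter_red", "filter_cube", "scene"])
answer = network(toy_scene)
```

Because the network is built from the program, two questions with different programs run through different compositions of the same shared modules, which is the sense in which the architecture is customized per question.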