YouCook: An Annotated Data Set of Unconstrained Third-Person Cooking Videos

Primary Contributors: Chenliang Xu, Pradipto Das, Richard F. Doell, Philip Rosebrough, and Jason Corso (Email Contact)

Overview: The YouCook data set was prepared from 88 open-source YouTube cooking videos of people cooking various recipes. All videos are shot from a third-person viewpoint; they represent a significantly more challenging visual problem than existing cooking and kitchen data sets, because the background kitchen or scene differs across many videos and most videos have dynamic camera changes. In addition, frame-by-frame object and action annotations are provided for the training data, along with a number of precomputed low-level features. Finally, each video has a number of human-provided natural language descriptions (on average, eight different descriptions per video). The data set was created to serve as a benchmark for describing complex real-world videos with natural language.

Examples (with three selected human descriptions each):

  • The woman has all the ingredients ready for making muffins.She shows all the ingredients like flour,chocolate chips essence eggs etc. She first mixes a little flour with the chocolate chips.Then she mixes the other ingredients like eggs flour and the rest of the items.Then she pours the batter into muffin trays and bakes it.Muffins are ready!
  • In this video, a woman pours ingredients into a large metal mixing bowl from several smaller bowls of various colors and sizes. A toaster oven, microwave and coffee maker can be seen in the background. She prepares the mixture on a counter-top near a sink. She then adds sour cream to the bowl from a plastic container. She then measures some dark liquid, possibly vanilla extract, and places it into the bowl.
  • A well organised kitchen with a microwave and cooking range in the background,the chef begins by pointing to the contents of all the bowls lined up in front of her. Then picking up the mixing bowl a whisk and a spatula to explain the basic tools used, she puts them aside and picks up the bowl that has flour in it. She mixes perhaps, baking powder and cinnamon and empties part of this powder mixture into the flour mixing it well with the spatula. Part of this flour goes into the bowl with chocolate chips and dry coats the chips well mixing with her fingers. Then picking up the mixing bowl she pours melted butter,milk,2 eggs and perhaps cream with a dash of vanilla essence and then maybe almond meal and some white liquid. Picking up the whisk, she stirs the liquid mixture a little. The flour is poured in next and whisked well. Then using the spatula, she drops in the flour coated chocolate chips and folds it in the batter well. Then using what looks like an ice cream scoops, she cleans the spatula of all cake batter. Then picking up a scoop of cake batter, she drops it into the paper cup cake cases that are set in a cup cake pan. Then taking a can of some spray, she sprays the white liquid in between the cakes on the pan and places the pan in the oven. After a while, the baked chocolate chip cup cakes are removed from the oven and arranged neatly on a serving plate.
  • In this video, the man cook some food on the direct heating pan. Two pieces of fish is already cooking on the direct heating pan. He place two more toasted food, possibly fish from the greenish yellow pate, on the pan. He take cooked fish from the pan and transferred into green plate.Two towel, one spoon, we can see on the cooking table.
  • A barbecue stove is placed in a garden.A man places slices of fish and two other yellow color pieces on it.He takes the fish slices as soon as it is ready.He places it on a plate. The other two slices are still on the grill.
  • In this video, the man placed two slices of fish seasoned with herbs and spices onto a hot grill. Leave it to cook. Meanwhile two slices of salmon was placed onto the grill to cook. He then removed the salmon from the grill and placed them on the plate.
  • In this video, a man is using 2 pieces of wheat bread slices as base. He is spreading butter in bread slices. Also spreading some brown cream(seems to be chocolate cream)in that. Spreading light brown colour paste in another bread slice. Keeping the bread slices one on other. Cutting into 4 pieces and served in a white colour plate.
  • In this video, a cook kept two pieces of bread over the cutting board and spreads butter,then he spreads jam over one bread piece and cream colored cream over the other and cuts into four pieces. Then he keeps all four pieces in a plate. there were 6 bowls with ingredients.
  • Two pieces of brown bread are taken and butter is applied on one slice.. Some kind OF jam is spread over it.All the ingredients are kept in bowls around the tray.PEANUT BUTTER IS APPLIED ON ONE SLICE.Both the slices are joined together and it is cut into triangular pieces for serving.A man appears on the screen.


  • README File for the data set.
  • Main data set: please fill in this form to be redirected to the download link. (If you have some problems with your browser, just email us for the link.)
  • Precomputed HOG3D Features: youcook_hog3dfeatures.tbz (2.6GB)
  • Extracted video frames (as jpg): youcook_videoframes.tbz (6GB)
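
The two .tbz downloads above are bzip2-compressed tar archives. A minimal sketch for unpacking one with Python's standard library follows. The helper name extract_tbz and the destination directory are assumptions made for illustration, and the internal layout of the archives is not described here, so the sketch only returns the member names rather than assuming any folder structure.

```python
import tarfile
from pathlib import Path


def extract_tbz(archive_path, dest_dir):
    """Extract a bzip2-compressed tar archive and return its member names.

    Hypothetical helper for illustration; not part of the YouCook release.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        names = archive.getnames()
        # Use the safer "data" extraction filter when this Python provides it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        archive.extractall(dest, **kwargs)
    return names


# Example call (assumes the archive was downloaded to the working directory):
# members = extract_tbz("youcook_videoframes.tbz", "youcook_videoframes")
# print(len(members), "entries extracted")
```

Given the archive sizes (2.6GB and 6GB), extraction may take a while and needs at least that much free disk space.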


[1] P. Das, C. Xu, R. F. Doell, and J. J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.

This work is supported in part by NSF CAREER IIS-0845282 and DARPA Mind's Eye W911NF-10-2-0062.

