EECS 542: Winter 2024

Schedule

Lecture	Date	Topic	Materials
Lec. 1	Wed, Jan. 10	Introduction	Logistics
	Mon, Jan. 15	No class
Lec. 2	Wed, Jan. 17	Classification by synthesis	Tian et al., Learning Vision from Models Rivals Learning Vision from Data Li et al., Your Diffusion Model is Secretly a Zero-Shot Classifier
Lec. 3	Mon, Jan. 22	Correspondence	Doersch et al., TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement Tang et al., Emergent Correspondence from Image Diffusion
Lec. 4	Wed, Jan. 24	Diffusion models	Ho et al., Denoising Diffusion Probabilistic Models Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ (tutorial) https://calvinyluo.com/2022/08/26/diffusion-tutorial.html (tutorial)
Lec. 5	Mon, Jan. 29	Efficient image generation	Rombach et al., High-Resolution Image Synthesis with Latent Diffusion Models Song and Dhariwal, Improved Techniques for Training Consistency Models Sauer et al., Adversarial Diffusion Distillation (optional) Dao, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (optional)
Lec. 6	Wed, Jan. 31	Image manipulation	Brooks et al., InstructPix2Pix: Learning to Follow Image Editing Instructions Hertz et al., Prompt-to-Prompt Image Editing with Cross Attention Control
Lec. 7	Mon, Feb. 5	Compositional generation	Geng et al., Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models Du et al., Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC
Lec. 8	Wed, Feb. 7	Video generation (Guest instructor: Jeongsoo Park)	Blattmann et al., Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets Bahmani et al., 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
Lec. 9	Mon, Feb. 12	Neural radiance fields	Barron et al., Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields Kerbl et al., 3D Gaussian Splatting for Real-Time Radiance Field Rendering
Lec. 10	Wed, Feb. 14	Generating 3D models	Poole et al., DreamFusion: Text-to-3D using 2D Diffusion Wu et al., ReconFusion: 3D Reconstruction with Diffusion Priors
Lec. 11	Mon, Feb. 19	Architectures	Dosovitskiy et al., An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Liu et al., A ConvNet for the 2020s
Lec. 12	Wed, Feb. 21	Emerging architectures	Jabri et al., Scalable Adaptive Computation for Iterative Generation Nguyen et al., S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces Gu and Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces (optional)
	Mon, Feb. 26	No class
	Wed, Feb. 28	No class
Lec. 13	Mon, Mar. 4	Pose estimation	Teed and Deng, DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras Smith et al., FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow
	Wed, Mar. 6	No class
Lec. 14	Mon, Mar. 11	Recognition	Kirillov et al., Segment Anything Li et al., Exploring Plain Vision Transformer Backbones for Object Detection
Lec. 15	Wed, Mar. 13	Language models	Alayrac et al., Flamingo: a Visual Language Model for Few-Shot Learning Tschannen et al., Image Captioners Are Scalable Vision Learners Too Sharma et al., A Vision Check-up for Language Models (optional)
Lec. 16	Mon, Mar. 18	Vision and language	Surís et al., ViperGPT: Visual Inference via Python Execution for Reasoning Girdhar et al., ImageBind: One Embedding Space To Bind Them All
Lec. 17	Wed, Mar. 20	Vision and sound	Chen et al., Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation Ng et al., From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Lec. 18	Mon, Mar. 25	Self-supervision	He et al., Masked Autoencoders Are Scalable Vision Learners Bar et al., Visual Prompting via Image Inpainting Oquab et al., DINOv2: Learning Robust Visual Features without Supervision (optional) Mizrahi et al., 4M: Massively Multimodal Masked Modeling (optional)
Lec. 19	Wed, Mar. 27	Human pose estimation	Pavlakos et al., Reconstructing Hands in 3D with Transformers Bogo et al., Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image
Lec. 20	Mon, Apr. 1	Datasets	Grauman et al., Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives Schuhmann et al., LAION-5B: An open large-scale dataset for training next generation image-text models
Lec. 21	Wed, Apr. 3	Ethics and explainability	Buolamwini and Gebru, Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification Liang et al., Benchmarking Algorithmic Bias in Face Recognition: An Experimental Approach Using Synthetic Faces and Human Evaluation
Lec. 22	Mon, Apr. 8	Vision and robotics	Loquercio et al., Learning Visual Locomotion with Cross-Modal Supervision Brohan et al., RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Lec. 23	Wed, Apr. 10	Vision for science	Presenters pick, based on their interests: Yu et al., Inferring Hybrid Neural Fluid Fields from Videos Xia et al., DREAM: Visual Decoding from Reversing Human Visual System Zeni et al., MatterGen: a generative model for inorganic materials design
Lec. 24	Mon, Apr. 15	New work	Several short presentations on work that has come out over the past few months.
Lec. 25	Wed, Apr. 17	Class presentations
Lec. 26	Mon, Apr. 22	Class presentations

Course information

This is a seminar-style graduate-level class covering very recent advances in computer vision. The main focus of the class will be on reading and critiquing recent research papers. In each lecture, you will present and critique several recent research papers. You will also explore ideas covered in the course via a final self-directed project. This class is not intended to be an introduction to computer vision or deep learning. There will be no problem sets.

Lectures: Lectures will take place on Monday and Wednesday, 3:00 - 4:30pm. Since this is a discussion-based class, your attendance is required. Missing more than two classes without an excuse will negatively affect your grade. We will take attendance on randomly chosen days.

Attending: The class will be in-person only. Lecture recordings will be available until the end of the course.

Prerequisites: This is an advanced vision course. Students are expected to have taken an introductory vision course before enrolling (EECS 442, 504, or equivalent), so that they will be prepared to read and discuss recent research.

Paper reviews: You'll be required to submit short paper reviews each week (one per class), beginning the week of Lecture 2. Your review should be based on the paper itself, rather than the discussion. It is therefore due before the paper is presented in class (i.e. at 3pm on Monday or Wednesday).

Summarize the paper. For most papers, this means explaining technical contributions, such as key mathematical insights, algorithms, and architectures.
Briefly explain how the paper relates to previous work, and why its contributions might be (or might not be) important.
Summarize the key experiments.
Discuss the paper's shortcomings: e.g. limitations to the methods, unconvincing aspects of experiments, presentation issues.

Reviews will be graded as: ✓+, ✓, ✓−, 0. We will not accept late submissions without a valid excuse. However, we will drop your 5 lowest review scores.

Q&A: This course has a Piazza forum, where you can ask public questions. We also appreciate it when you respond to questions from other students. If you have an important question that you would prefer to discuss over email, you may email the course staff (eecs542-w24-staff@umich.edu), or you can contact the instructor by email directly.

Textbooks: In this class, we'll mostly be reading research papers, rather than textbooks. The following might be useful as reference, though:

Goodfellow, Bengio, Courville. Deep Learning. (available for free online)
Szeliski. Computer Vision: Algorithms and Applications, 2nd edition draft (available for free online)

If you have feedback for the author of the Szeliski book draft, please submit it here, and we'll pass it along!

Presentations: You will give 1-2 presentations throughout the course (depending on enrollment). We will distribute one list of papers during the first week of class, and another a few weeks into the semester. You will rank the list of papers, and we will assign one to you. Most classes will also contain a background talk, which will provide background material to help you understand the papers.

Peer reviews: Your presentation grade will in part be based on peer reviews from other students. You will be randomly assigned to do peer reviews for 3 papers.

GPU computing: For the project, you may require GPUs to train deep learning models. One common option is to use Google Colab. Please note, however, that the free version comes with significant limitations.

Grading: Final grades will be computed as follows:

Final project	45%
Reviews	25%
Class presentations	20%
Participation and peer reviews	10%
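
As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale. The numeric values assigned to the review marks, and all function and key names, are hypothetical, since the syllabus does not define them. The sketch drops the 5 lowest review scores, as described under "Paper reviews", and then averages the rest.

```python
# Weights (in percent) taken from the grading table above.
WEIGHTS = {
    "final_project": 45,
    "reviews": 25,
    "class_presentations": 20,
    "participation_and_peer_reviews": 10,
}

# Hypothetical point values for the four review marks (check-plus, check,
# check-minus, 0). The course does not publish a numeric mapping.
REVIEW_POINTS = {"check+": 100.0, "check": 90.0, "check-": 75.0, "0": 0.0}


def review_score(marks, num_dropped=5):
    """Average the review marks after dropping the lowest `num_dropped`."""
    points = sorted(REVIEW_POINTS[mark] for mark in marks)
    kept = points[num_dropped:]
    if not kept:
        raise ValueError("need more reviews than the number dropped")
    return sum(kept) / len(kept)


def final_grade(component_scores):
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS) / 100
```

For example, a student with marks `["check+"] * 3 + ["check"] + ["0"] * 5` would have all five zeros dropped under this sketch, leaving only the four submitted reviews to be averaged.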

Academic integrity: While you are encouraged to discuss homework assignments with other students, your programming work must be completed individually. You must also write up your solution on your own. You may not search for solutions online, or use existing implementations of the algorithms. Please see the Michigan engineering honor code for more information.

Support: The counseling and psychological services center (CAPS) provides support for a variety of issues, including mental health and stress.

Presentation guidelines

You will be in charge of teaching one class, as part of a group of 3 people (starting Lec. 4). Each class will be organized around a topic of ongoing research. We'll send out a sign-up sheet after the first class, where you will rank your preferences.

Organization: We suggest organizing most classes as follows:

Background (20 mins)
Paper 1 (20 mins)
Paper 2 (20 mins)
Discussion (20 mins)

The background section is usually the most important part of the class. It should resemble a mini-lecture, covering the "basics" that students will need to understand the paper presentations. For example, if the class is covering recent transformer papers, this section should review what a transformer is, and it should touch on any relevant findings that are necessary to understand the papers. Often, this will also involve describing prior attempts to solve the problems that the (much more recent) papers address.

Each paper section should be a critical presentation of the work in the paper. You should explain what problem the researchers were addressing, the motivation for their solution, and how well they succeeded at that goal. Unlike introductory courses, where methods are largely well-understood and have passed the "test of time", the papers in this class will often have important limitations. We therefore encourage you to take a critical approach to reading the papers, and to describe possible shortcomings. We also encourage you to discuss things in the paper that you do not think were well-justified, and choices by the authors that you did not understand.

Finally, for the (optional) discussion, you will lead a brief interactive session, where students can debate the issues at stake in the papers. For example, you might run a Q&A session where you ask: should we really consider language-based supervision to be "unsupervised", or do we need to interact with the world to learn good representations?

Slides: You are allowed to use existing figures, but please clearly credit the authors. Since part of the purpose of the presentation is to have you go through the process of making slides, we would like you to make the slides yourself. However, you are permitted to reuse a small number of existing slides (just make it clear that the whole slide was taken from another source). Please submit your slides to us in PDF form. By default, we will post your slides only on Canvas, so that they are only visible to those enrolled in the class. We'd also be happy to post them publicly if you'd like.

Signing up: We'll assign people to presentation timeslots in two phases (i.e. the first and second halves of the class). You'll fill out a questionnaire indicating which classes you'd like to participate in. If you happen to have a group of 3 in mind already, please indicate this on the form, and we will try to assign you to a single topic (we unfortunately cannot accommodate groups of other sizes).

Project guidelines

You'll do a self-directed group project, due at the very end of the course. Groups should have at most 4 students, unless you are given permission by the instructor. Deliverables include:

Project proposal
Report
Presentation (a 5-min talk)

Please sign up for a presentation timeslot here.

Staff & Office Hours

Andrew Owens

Instructor

Jeongsoo Park

GSI

Name	Office hours time
Andrew Owens	Monday 4:30 - 5:00pm in EECS 4231
Jeongsoo Park	Tuesday 1:00 - 1:30pm over Zoom

EECS 542: Advanced Topics in Computer Vision

Instructor: Andrew Owens Winter 2024
