EECS 598-008: Special Topics, Winter 2019 Advanced Data Mining
This course will cover a number of advanced topics in data mining. A mix of lectures and readings will familiarize the students with recent methods and algorithms for exploring and analyzing large-scale data and networks, as well as applications in various domains (e.g., web science, social science, neuroscience). The focus will be on scalable and practical methods, and the students will have the chance to analyze large datasets. The advanced topics will include: ranking, classification, clustering and community detection, summarization, similarity, anomaly detection, node representation and deep learning in the graph setting.
Objectives
This course aims to introduce students to advanced data mining, with emphasis on interconnected data (graphs or networks). Students will become familiar with the challenges of processing large amounts of data, state-of-the-art methods and algorithms for analyzing them, and applications of data mining in various domains. We expect that by the end of the course, students:
will have a thorough understanding of the graph mining foundations, and
will be able to:
critique data mining methods,
formulate and solve new problems, and
analyze large-scale datasets (in distributed and other settings).
Prerequisites
Students are expected to (1) have basic knowledge of linear algebra, (2) be familiar with probability theory and statistics, and (3) have good programming skills (e.g., Python, Java, C, MATLAB, R, or any programming language of their preference). Basic knowledge of machine learning is helpful.
** Advanced-standing undergraduates or other students who do not meet the prerequisites may enroll with permission of the instructor.
Instructor: Danai Koutra
Office Hours: after class & by appointment
Teaching Assistant: Jiong Zhu
Office Hours: Tue 3-4pm @ Duderstadt Center 2nd Floor Vizhub 7; Fri 12-1pm @ BBB Learning Center
!! The topics and dates of the lectures are subject to change. The following schedule outlines the topics that we will be covering in this course. The paper readings have been updated!
[Background paper] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '14). [CODE]
[02/07 - 30 mins] Leonardo F. R. Ribeiro, Pedro H. P. Saverese, Daniel R. Figueiredo. Struc2vec: Learning Node Representations from Structural Identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17).
Manish Purohit, B. Aditya Prakash, Chanhyun Kang, Yao Zhang, and V.S. Subrahmanian. Fast Influence-based Coarsening for Large Networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '14).
Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. 2008. Graph summarization with bounded error. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (SIGMOD '08).
[30 mins] Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, Vivek Ramavajjala. Smart Reply: Automated Response Suggestion for Email. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '16).
[45 mins] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18).
Other readings:
Mihajlo Grbovic and Haibin Cheng. 2018. Real-time Personalization using Embeddings for Search Ranking at Airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 311–320. KDD 2018 Best Applied Data Science Paper
Evangelia Christakopoulou and George Karypis. Local Item-Item Models For Top-N Recommendation. RecSys 2016. Best Paper Award
Cheng Li, Michael Bendersky, Vijay Garg and Sujith Ravi. Related Event Discovery. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM 2017).
[Instructor] Mark Heimann, Haoming Shen, Tara Safavi, and Danai Koutra. 2018. REGAL: Representation Learning-based Graph Alignment. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18).
[45 mins] Giannis Nikolentzos, Polykarpos Meladianos, Michalis Vazirgiannis. 2018. Matching Node Embeddings for Graph Similarity. AAAI Conference on Artificial Intelligence.
[Background] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt. Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research (JMLR) 12(Sep):2539−2561, 2011.
[45 mins] Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. 2018. Adversarial Attacks on Neural Networks for Graph Data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 2847-2856. Best Research Track Paper Award.
Other readings:
Hanjun Dai, Hui Li, Tian Tian, Xin Huang, Lin Wang, Jun Zhu, and Le Song. 2018. Adversarial Attack on Graph Structured Data. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80.
Aleksandar Bojchevski, Oleksandr Shchur, Daniel Zügner, and Stephan Günnemann. 2018. NetGAN: Generating Graphs via Random Walks. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research)
Hongwei Wang, Jia Wang, Jialin Wang, Miao Zhao, Weinan Zhang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018. GraphGAN: Graph Representation Learning With Generative Adversarial Nets. AAAI Conference on Artificial Intelligence.
Other topics that may be of interest (not covered in class, but potentially related to your projects)
Danai Koutra, Abhilash Dighe, Smriti Bhagat, Udi Weinsberg, Stratis Ioannidis, Christos Faloutsos and Jean Bolot. PNP: Fast Path Ensemble Method for Movie Design. In Proceedings of the 23rd ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '17). Video
Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, Thomas S. Huang. Heterogeneous Network Embedding via Deep Architectures. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15).
Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. Collaborative Knowledge Base Embedding for Recommender Systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16).
Shiyu Chang, Yang Zhang, Jiliang Tang, Dawei Yin, Yi Chang, Mark A. Hasegawa-Johnson, and Thomas S. Huang. 2017. Streaming Recommender Systems. In Proceedings of the 26th International Conference on World Wide Web (WWW '17), 381-389.
Jimeng Sun, Dacheng Tao, Christos Faloutsos. Beyond Streams and Graphs: Dynamic Tensor Analysis. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’06).
Check the course website on Canvas to find pointers to datasets, code, and tools that will be useful for your assignments and projects.
Assignments
The coursework will comprise at most three short, practical assignments that will familiarize the students with the challenges of large-scale graph analysis. Each assignment will be done individually.
Semester-long Project
The most important component of this course is a semester-long project (related to topics discussed in class) that will be selected by students. The projects will be done in groups of 3-4 students. We will arrange brainstorming sessions to facilitate group formation. Feel free to use Piazza to pitch ideas and find groupmates.
For the project deliverables, each group is required to make only one submission. To post on Canvas on behalf of a group, first go to the "People" tab, then to the Group tab, and then search for the relevant homework or project. Make sure that you and your groupmates join the same group.
Ideas for Projects:
You might find ideas for your projects by exploring the topics of various data science competitions:
Yelp Dataset Challenge (deadline passed, but you can still get the data): open task, data (including 200K photos) from local businesses in 12 cities across 4 countries
Survey (1-2 pages, ACM format, 15% of the project grade).
You will need to pick a research topic for your project and read 6-8 relevant papers. Ideally the survey will help you identify the specific problem you want to address, and will lead to the project proposal naturally. The survey will be part of your final report. It should be a well thought-out synthesis of the papers that you will read, not just a repetition of the papers' abstracts / introductions.
Your survey should provide answers to the following questions:
What is the common theme of the papers you read? Give the problem definition(s).
What are the challenges of the area?
How do the papers relate to each other?
Are they solving a new problem or improving an existing method?
What are the main techniques that they are using?
What are 3 strengths and 3 weaknesses of each paper?
What are the limitations of each method?
Think about some future directions. What would you do better? Think about scalability issues, generality (e.g., weighted, directed, time-evolving, attributed networks), applicability to various domains.
>> Don't forget to include the names of all the group members in the pdf. If you want to submit a longer survey, please ask me first.
Project Proposal (2 pages in PDF format, 15% of the project grade).
Your proposal should include the following sections:
Problem definition
Challenges
Most related prior work and its shortcomings
Proposed approach
Data that you will use
Evaluation plan
>> Don't forget to include the names of all the group members in the pdf.
Mid-term Report (4-5 pages, ACM format, 20% of the project grade).
See below for the sections that your final report should have. At this point, for your midterm report, you should start editing the following sections:
Section 2. Data: Describe the synthetic and real data that you will use, and explain the data collection process (if applicable).
Section 3. Proposed Method: Introduce the method that you propose, give the necessary definitions, potentially give proof of concept.
Section 4. Experiments: Give some preliminary experiments (on synthetic or real data).
Section 5. Progress and Next Steps (temporary section): Outline your next steps and whether you are on track. Now that you have had time to work on your projects, if anything has changed with respect to your proposal, mention it.
Section 6. Division of work (your grade will depend on your contribution to the project)
>> Don't forget to include the names of all the group members in the pdf.
Final Report (8 pages excluding citations, ACM format + CODE, 50% of the project grade).
A. Report Structure: Your report should have the form of a paper with (at least) the following sections:
Section 0. Abstract
Section 1. Introduction
Section 2. Data
Section 3. Proposed Method
Section 4. Experiments
Section 5. Related Work
Section 6. Conclusions (include what you learned)
Section 7. Division of work (your grade will depend on your contribution to the project)
B. Code: Organize your code in a folder called "CODE". Include a README file and a Makefile. Your code should run on horton.eecs.umich.edu.
>> Submit a zip file with the pdf and the CODE/ folder.
>> Don't forget to include the names of all the group members in the pdf.
For more information, look out for the announcements on Canvas.
Grading
Class Participation: 7%
Class Presentations: 1 presentation (15%) + 2 discussions (7%)
For the assignments and project submissions, check out the schedule on the website.
For assignments, you will have 4 late days in total (no questions asked). If needed, you can use all the late days for one assignment or split them between the three assignments. Late days are rounded up to the nearest integer. For example, a submission that is 4 hours late will count as one day. Beyond that, you will get a zero for that assignment.
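As a rough illustration of the rounding rule above, the late-day arithmetic can be sketched as follows. This is only a sketch under stated assumptions: the function names are hypothetical, a "day" is assumed to mean a 24-hour period after the deadline, and a submission exactly 24 hours late is assumed to count as one day. The teaching staff's own bookkeeping may differ.

```python
import math

TOTAL_LATE_DAYS = 4  # total budget stated in the policy


def late_days_used(hours_late):
    """Whole late days charged for one submission (partial days round up)."""
    if hours_late <= 0:
        return 0  # on time or early: no late days charged
    # Assumption: one late day is a 24-hour period after the deadline.
    return math.ceil(hours_late / 24)


def late_days_remaining(hours_late_per_assignment):
    """Late days left after the given submissions; negative means over budget."""
    used = sum(late_days_used(h) for h in hours_late_per_assignment)
    return TOTAL_LATE_DAYS - used


print(late_days_used(4))                 # 1, matching the 4-hour example in the policy
print(late_days_remaining([4, 30, 0]))   # 4 - (1 + 2 + 0) = 1
```

A negative result from `late_days_remaining` would correspond to the "beyond that" case, where the assignment that exceeds the budget receives a zero.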
Since the projects require coordination of 3-4 students, there will be NO late days. If you submit AFTER the deadline, you will get a zero on that component of the project. Please submit at least 30 minutes before the regular deadline as a safety measure.
We have run into situations in the past (rare) where students miss the regular deadline by 2-3 minutes for a project. Sometimes, this is because of last-minute project work or slow servers. We will give a one-time waiver of the penalty if you miss the regular submission deadline for a project by 5 minutes or less. Beyond that, your project submission will not be graded and you will receive a zero. Don't forget that this is less strict than what happens with conference deadlines; if you miss the deadline even by a few seconds, you will need to submit to another conference or wait for a year until the next submission cycle :)
For extreme circumstances, like medical emergencies, no-penalty extensions will be granted. Email eecs598dm-w19 [AT] umich.edu with written documentation (e.g. doctor's note).
Honor Code
All students (including LS&A and Engineering) are required to observe the Engineering Honor Code in all assignments. A copy of the honor code can be found here. Please make sure that you clearly understand what constitutes cheating. If you are not sure in any specific case, you should ask the teaching staff. The University takes honor code violations seriously, and penalties can be severe. You are not allowed to make use of assignment solutions by others, including solutions from previous semesters.
Any suspected violations of the honor code will be reported.
Disabilities and Conflicts
Students with disabilities that are documented with the Services for Students with Disabilities (SSWD) Office should contact the professor during the first three weeks of class to make appropriate arrangements.
In class:
[30 mins] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake News Detection on Social Media: A Data Mining Perspective. SIGKDD Explor. Newsl. 19, 1 (September 2017), 22-36. DOI: https://doi.org/10.1145/3137597.3137600
[30 mins] Srijan Kumar, Justin Cheng, Jure Leskovec, and V.S. Subrahmanian. 2017. An Army of Me: Sockpuppets in Online Discussion Communities. In Proceedings of the 26th International Conference on World Wide Web (WWW '17). Best Paper Award.
Other readings:
Tara Safavi, Maryam Davoodi, and Danai Koutra. 2018. Career Transitions and Trajectories: A Case Study in Computing. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18).
Allison Morgan, Dimitrios Economou, Samuel Way, and Aaron Clauset. Prestige drives epistemic inequality in the diffusion of scientific ideas. EPJ Data Science.
Xi Chen, Yiqun Liu, Liang Zhang, and Krishnaram Kenthapadi. 2018. How LinkedIn Economic Graph Bonds Information and Product: Applications in LinkedIn Salary. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18).
Alexandre Bovet and Hernán A Makse. 2019. Influence of fake news in Twitter during the 2016 US presidential election. Nature communications 10, 1 (2019), 7.
C. Danescu-Niculescu-Mizil, R. West, D. Jurafsky, J. Leskovec, C. Potts. No Country for Old Members: User lifecycle and linguistic change in online communities. ACM International Conference on World Wide Web (WWW), 2013. Best paper award.
A. Anderson, D. Huttenlocher, J. Kleinberg, J. Leskovec. Engaging with Massive Online Courses. ACM International Conference on World Wide Web (WWW), 2014. Best paper runner-up.
J. McAuley, J. Leskovec. Discovering Social Circles in Ego Networks. ACM Transactions on Knowledge Discovery from Data (TKDD), 2014.
L. Backstrom, J. Kleinberg. Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook. Proc. 17th ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW), 2014.
Justin Cheng, Lada A. Adamic, Jon M. Kleinberg, Jure Leskovec. Do Cascades Recur? WWW 2016.
I. Kloumann, L. Adamic, J. Kleinberg, S. Wu. The Lifecycles of Apps in a Social Ecosystem. Proc. 24th International World Wide Web Conference, 2015.
S. Myers, J. Leskovec. The Bursty Dynamics of the Twitter Information Network. ACM International Conference on World Wide Web (WWW), 2014.
L Backstrom, P Boldi, M Rosa, J Ugander, S Vigna. Four Degrees of Separation. Proc. 4th ACM Int'l Conf. on Web Science (WebSci), 2012. Best Paper Award.