EECS 584: Advanced Database Systems, Fall 2011

Final Project


Your final research project comprises a substantial part of your 584 educational experience, and a major component of your final grade for the class. Most of the work you put into this class, and much of the benefit you derive from it, will be centered on your final project. It's worth your time to make it an interesting project that you will be proud of.

Your formal project proposal is due October 12, but I encourage you to start thinking about it far before then. You may want to do some reading into several ideas before choosing a single one. Sometime in the next couple of weeks I will describe what your project proposal document should contain.

Projects should contain original database-oriented research, broadly construed. In scope, they should be comparable to a workshop paper; projects may go on to become conference papers or parts of students' dissertations. Consider choosing a topic that combines well with your other research interests. For example, someone interested in natural language processing might focus on text search topics, while someone in networking research might focus on distributed database topics.

All projects will be done in groups of two. In rare circumstances I will consider allowing people to work alone, but experience shows that two-person projects are usually more successful.

I've listed a few ideas below to get you thinking. These ideas are "big" enough that multiple groups could pursue the same one and still produce distinct, interesting projects. You are welcome and encouraged to suggest your own ideas; if you like, please come to office hours and discuss them.

Information Extraction and Data Integration

Managing non-traditional data, in particular Web-derived data, is my primary research focus. You may like reading some of the research papers posted on my own site.

Data Integration for Spreadsheets - Spreadsheets are the "databases of the people" and often contain surprisingly sophisticated data. However, even very high-quality spreadsheets carry messy or unclear structural information. For example, what is the "schema" for the data in a spreadsheet? In this project, you will attempt to write software that can automatically combine information from different spreadsheets, with little or no user assistance. Last-year Ph.D. student Bin Liu, whose dissertation focuses on spreadsheet management, has offered to be an unofficial advisor to anyone interested in this project.
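To make the "schema" problem concrete, here is a minimal sketch of the kind of heuristic a first prototype might start from. Everything here is an assumption, not an established algorithm: it guesses that the first all-non-numeric row is the header, and merges two sheets on whatever column names they share.

```python
def infer_header(rows):
    # Naive heuristic (an assumption, not a known algorithm): treat the first
    # row whose non-empty cells are all non-numeric as the "schema" row.
    for i, row in enumerate(rows):
        if row and all(not c.replace('.', '', 1).isdigit() for c in row if c):
            return i, row
    return 0, rows[0]

def merge(sheet_a, sheet_b):
    # Combine two sheets (lists of rows) on the column names they share.
    ia, header_a = infer_header(sheet_a)
    ib, header_b = infer_header(sheet_b)
    shared = [c for c in header_a if c in header_b]
    out = [shared]
    for rows, header, start in ((sheet_a, header_a, ia), (sheet_b, header_b, ib)):
        idx = [header.index(c) for c in shared]
        for r in rows[start + 1:]:          # data rows below the header
            out.append([r[j] for j in idx])
    return out
```

A real system would need to handle multi-row headers, merged cells, and columns whose names differ but mean the same thing; the point of the sketch is how quickly even the "easy" case becomes heuristic.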

Fast and Derivative Website Maker - People often want to build a site that integrates data from many other sources. For example, the DBLife site (http://dblife.cs.wisc.edu/) combines database papers and people derived from many different university sites. At Michigan, we would like to build a system that monitors many different department and faculty pages and integrates any relevant updates into a Twitter feed. Unfortunately, using data from other sources usually involves lots of manual effort to build the relevant extractors. Is it possible to build a fast and dirty site integrator that requires very little human guidance? We might be willing to put up with an integrated site that's not quite as nice as DBLife, if the task were easy enough. You will need to read some related work before you start here.

Instant Search for Datasets - Google recently introduced Google Instant, a tweak to the Google search engine that uses your partially-typed query to guess the full query, and then presents query results for that estimated query. You will try to build a system that does something similar for structured datasets instead of Web pages. Structured datasets can include spreadsheets, extracted tables from the Web, information derived from a Web service like Amazon, or data drawn from governmental sources. Your search engine will take a keyword query and then rank all of the structured datasets according to relevance to your query. The first research challenge is to make a ranking mechanism that can handle this different kind of data. The second research challenge is to introduce helpful query commands that would not be possible in a text-only setting. For example, you might include "AVG(price)>100" to mean that the average of values in the price column must be more than 100.
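As a starting point for the second challenge, here is a toy sketch of mixing keyword relevance with an aggregate predicate. The dataset format, the scoring-by-term-overlap idea, and the `AVG(col)>value` parsing are all illustrative assumptions; a real ranking mechanism is the research problem.

```python
import re
from statistics import mean

def matches_predicate(dataset, pred):
    # Parse a toy predicate of the form AVG(col)>value (hypothetical syntax).
    m = re.fullmatch(r"AVG\((\w+)\)>(\d+(?:\.\d+)?)", pred)
    if not m:
        return True
    col, threshold = m.group(1), float(m.group(2))
    values = [float(row[col]) for row in dataset["rows"] if col in row]
    return bool(values) and mean(values) > threshold

def rank(datasets, keywords, predicate=None):
    # Score each structured dataset by keyword overlap with its title and
    # column names, then filter by the optional aggregate predicate.
    scored = []
    for d in datasets:
        terms = set(w.lower() for w in d["title"].split())
        terms |= set(c.lower() for r in d["rows"] for c in r)  # column names
        score = sum(1 for k in keywords if k.lower() in terms)
        if score and (predicate is None or matches_predicate(d, predicate)):
            scored.append((score, d["title"]))
    return [title for _, title in sorted(scored, reverse=True)]
```

Note how the predicate requires actually touching the data, not just its metadata; doing that fast enough for instant, per-keystroke feedback is where it gets interesting.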

Mobile Data

The following projects involve data management on smartphones. We have a number of Windows Mobile phones donated by Microsoft for you to use, and we may be able to get our hands on some Android devices if your project requires them. You are, of course, welcome to use your own phone if you have one. It may also be possible to do a joint project with students in the graduate networking class.

Dynamic Data Logging for Deployed Mobile Apps - Mobile apps benefit from data logging as much as Web apps do: logs build unusual datasets that can then become the target of various data mining algorithms. But app store policies make deploying mobile app updates very difficult, so mobile app developers have very little flexibility when deciding what app-level data to log. If the developer wants to modify the app even slightly to log a single novel piece of data, she must resubmit a whole new version to the app store for a lengthy approval process. In this project, you will try to build a "hassle-free" infrastructure for changing data logging behavior in mobile applications. The goal is to avoid the error-prone and time-consuming cycle that dominates small app updates today. Perhaps the compiler could use a data-mining workload to examine a mobile app and automatically determine the most useful data to log. Perhaps the app could use a dynamic instrumentation framework similar to Sun's DTrace.
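One very simple shape for such an infrastructure is to decide at each log point whether to record the event, based on configuration the developer can update server-side rather than by shipping a new binary through review. The sketch below is a hypothetical minimal version; the research work is in making this safe, efficient, and expressive enough to capture genuinely new data.

```python
def make_logger(config):
    # config maps event names to on/off flags; in a deployed app this dict
    # would be fetched from the developer's server (hypothetical design),
    # so logging behavior can change without resubmitting the app.
    log = []
    def record(event, payload):
        if config.get(event, False):   # only record remotely enabled events
            log.append((event, payload))
    return record, log
```

Usage: instrument every interesting point with `record(...)` once, then flip flags remotely to choose what actually gets logged.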

Handheld OLAP - Handheld networked phones have become a standard tool. However, these devices impose extreme constraints on screen size and on the number of keystrokes a user can be expected to enter. OLAP queries are traditionally considered very important but also relatively difficult to write. Is it possible to build a high-quality data analysis tool that works well on mobile phone form factors? This will require a deep reexamination of OLAP interfaces, and perhaps aggressive automatic assistance from the data system. Students who pursue this project will be expected to conduct user studies that yield quantitative data comparing possible interface approaches.

Data Mining with Mobile Sensor Streams - Modern smartphones are studded with sensors: GPS, camera, orientation, microphones, and so on. Although most phones turn on these sensors only when an application demands them, they could also be used to gather long-running descriptions of the user's daily activity. When combined with online data sources like Facebook and Twitter, it might be possible to use data mining techniques to learn interesting things about the user's life. For example, if the GPS indicates lots of movement, does the user post more often or less often to Facebook? Is it possible to discover the "loud" and "quiet" parts of the day? Of the user's geography? This project is relatively open-ended: we will have to choose a specific data mining task to explore.
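As a feel for the movement-versus-posting question, here is a toy sketch that buckets GPS displacement and post counts by hour. The input format (hour, lat, lon) and the Manhattan-distance proxy for movement are assumptions for illustration; a real study would use timestamps, proper geodesic distance, and a statistical test.

```python
from collections import Counter, defaultdict

def movement_vs_posts(gps_fixes, post_hours):
    # gps_fixes: chronological (hour, lat, lon) samples; post_hours: hours at
    # which the user posted. Returns per-hour (movement, post count) pairs,
    # a crude first step toward finding "loud" and "quiet" parts of the day.
    movement = defaultdict(float)
    for (h1, la1, lo1), (h2, la2, lo2) in zip(gps_fixes, gps_fixes[1:]):
        # Attribute the displacement between consecutive fixes to the later hour.
        movement[h2] += abs(la2 - la1) + abs(lo2 - lo1)
    posts = Counter(post_hours)
    hours = sorted(set(movement) | set(posts))
    return [(h, round(movement[h], 6), posts[h]) for h in hours]
```

From output like this, one could then ask whether high-movement hours correlate with higher or lower posting frequency.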

Data Management for Ad Hoc Mobile Workgroups - Groups of people can now easily form temporary electronic collaboration groups using their smartphones. Although group interaction often involves working on a shared document or database, modern smartphones give very little support for local group management, and almost none for the data products that groups create. It should be possible for physically-nearby groups to create temporary data storage areas with some standard features: reliable storage, versioning, preserved user identity, report generation, query processing of the shared data, and so on. Students who pursue this project will choose several implementation methods and measure their relative advantages and disadvantages.

Large-scale Data Processing

Working with extremely large datasets, often using the MapReduce framework, is another research interest of mine. You may want to jump ahead of the reading schedule and read Dean and Ghemawat, 2004, to get a flavor for some of this work.

Opaque Binary Optimization - There are times when we want to optimize data-processing programs, but we may not have the source code, or may not want to change it. For example, there may be an ancient and reliable numerical analysis package that no one understands well enough to modify. This project aims to improve a program's performance even though the code cannot be edited. By instrumenting the program, perhaps with a VM, it should be possible to gather information about its data-access patterns. Does it stream through a large data file? Does it scan a small number of bytes over and over again, but write out a huge amount of data? Does it perform random accesses over a large region of memory? Depending on the answer, we can modify the program's environment for improved performance. For example, a small input file could be loaded into a RAM disk; a large incoming stream could be prefetched; and so on.
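To make the three questions above concrete, here is a toy classifier over a trace of byte offsets, as might be gathered by VM-level instrumentation. The category names and the thresholds (90% near-sequential, 10% distinct) are placeholder assumptions, not established heuristics.

```python
def classify_access_pattern(trace):
    # trace: byte offsets in access order, from hypothetical instrumentation.
    distinct = len(set(trace))
    # Count forward steps of at most 64 bytes as "near-sequential" accesses.
    sequential = sum(1 for a, b in zip(trace, trace[1:]) if 0 <= b - a <= 64)
    if sequential >= 0.9 * (len(trace) - 1):
        return "stream"      # streams through a file: prefetch aggressively
    if distinct <= 0.1 * len(trace):
        return "hot-loop"    # few bytes touched repeatedly: pin in RAM disk
    return "random"          # scattered accesses: little locality to exploit
```

Each label maps to one of the environment modifications mentioned above; the real project would refine both the taxonomy and the actions taken.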

MapReduce-Backed Spreadsheets - MapReduce and HDFS/GFS have become very popular in corporate environments, but most administrators and users in such environments are not sophisticated programmers. It would be better to have a spreadsheet-like interface to data that is stored in a MapReduce cluster, thereby giving far more people access to "big data" functionality. Spreadsheet user operations would be implemented as MapReduce jobs, and the contents of the spreadsheet might reflect an approximation of the MapReduce job's eventual output. Since spreadsheet users expect immediate feedback, and MapReduce jobs generally take a long time to complete, you will need to use some combination of user interface tricks, approximation, and sampling to create the illusion of immediate responsiveness. Experiments may examine which combination of techniques is most effective at maintaining the illusion.
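One of the approximation techniques hinted at above can be sketched very simply: estimate an aggregate from a random sample of data blocks while the full MapReduce job runs. The `fetch_block` API and the uniform-blocks assumption are hypothetical simplifications; real data is skewed, which is exactly what makes the estimate interesting to get right.

```python
import random

def approximate_sum(fetch_block, n_blocks, sample_frac=0.1, seed=0):
    # Estimate SUM over a huge column by sampling blocks, standing in for
    # the full job's answer so the spreadsheet can show something instantly.
    # fetch_block(i) returns the values in block i (hypothetical API).
    rng = random.Random(seed)
    k = max(1, int(n_blocks * sample_frac))
    sampled = rng.sample(range(n_blocks), k)
    partial = sum(sum(fetch_block(i)) for i in sampled)
    # Scale the sampled total up to estimate the full sum.
    return partial * (n_blocks / k)
```

The spreadsheet could display this estimate immediately, refine it as more blocks arrive, and replace it with the exact value when the MapReduce job finishes.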

Miscellaneous

Is it possible to do a project that uses this thing? I don't know what the research angle would be yet, but it would definitely be awesome.