North Michigan, 2018

Wenjia He
Ph.D. student at the University of Michigan

4957 BBB Building
2260 Hayward Street
University of Michigan, Ann Arbor
Ann Arbor, MI 48109

Email: wenjiah [at] umich.edu


I am a Ph.D. student in Computer Science and Engineering at the University of Michigan, Ann Arbor, advised by Prof. Michael Cafarella. Currently, I'm visiting MIT CSAIL as a research intern in the Data Systems Group. My research interests lie in data management for video streams, ML-based and statistical model-based query optimization, large language model applications, approximate query processing, and causal inference.

Prior to UMich, I received my B.S. in Mathematics and Applied Mathematics from the School of the Gifted Young, University of Science and Technology of China (USTC), in 2018.

You can find my CV here.


What's New

June 2023
Our demo paper was accepted to VLDB '23: PAINE Demo: Optimizing Video Selection Queries With Commonsense Knowledge.
March 2023
I gave a short talk and presented a poster about "Optimizing Video Selection Queries With Commonsense Knowledge" at The 13th Annual North East Database Day (NEDB Day).
September 2022
I started visiting the Data Systems Group at MIT CSAIL.
August 2022
I was invited to present on the topic "Database Management System for Videos" at ByteDance.
June 2022
I gave a talk on our paper at the 2022 ACM SIGMOD/PODS Conference.
May 2022
I gave a lightning talk and a poster presentation about "Opaque Filter Query Optimization for Video Analytics" at The Workshop on Video Analytics (WoVA).
May 2022
I started as a Software Engineer Intern at Meta, supervised by Bin Zhang.
December 2021
Our paper was accepted to SIGMOD '22: Controlled Intentional Degradation in Analytical Video Systems.
June 2020
I gave a talk on our paper at the 2020 ACM SIGMOD/PODS Conference.
March 2020
Our paper was accepted to SIGMOD '20: A Method for Optimizing Opaque Filter Queries.
January 2020
I passed the prelim exam and became a Ph.D. candidate.
August 2018
I moved to Ann Arbor and started my Ph.D. life at UMich.
June 2018
I received the Excellent Graduation Thesis Award at USTC (top 5%) and was named an Outstanding Graduate of USTC.
April 2018
Our paper was accepted to USENIX ATC '18: Metis: Robustly Optimizing Tail Latencies of Cloud Systems.
September 2017
I was awarded the National Scholarship by the Ministry of Education of China (top 1% nationwide).
July 2017
I started my internship in the Systems and Networking Research Group at Microsoft Research Asia (MSRA), supervised by Lead Researcher Chieh-Jan Mike Liang.
August 2014
I started my college life at the University of Science and Technology of China.

Research Projects

Paine
Paine provides a novel indexing mechanism for optimizing video selection queries that select videos containing target objects. This mechanism builds a lossy index to save resources at index time, and then predicts the missing information at query time. This prediction process relies on our probabilistic models constructed from commonsense knowledge.

Smokescreen
Smokescreen addresses the problem of choosing an appropriate amount of degradation in analytical video systems. It offers administrators a profile that illustrates the tradeoff between analytical accuracy and the amount of video degradation, and it incorporates approximation algorithms to provide tight upper bounds on analytical error for the degradation-accuracy tradeoff curves.

Voodoo Indexing
Voodoo indexing is an efficient two-phase mechanism for optimizing opaque filter queries: queries whose selection predicates are implemented with user-defined functions (UDFs). Before any query arrives, it builds a hierarchical index structure that groups similar objects together. At query time, it builds a map of how much each group satisfies the predicate and exploits this map to avoid processing irrelevant data.

Metis
Metis is an effective service for robustly auto-tuning configurations of modern cloud systems, used by several Microsoft services. It implements a customized Bayesian optimization method, including diagnostic models and novel acquisition functions, to optimize tail latencies.

Publications

1

PAINE Demo: Optimizing Video Selection Queries With Commonsense Knowledge VLDB '23

Wenjia He, Ibrahim Sabek, Yuze Lou, Michael Cafarella.
49th International Conference on Very Large Data Bases (VLDB '23 Demo)

Because video is becoming more popular and constitutes a major part of data collection, we have the need to process video selection queries --- selecting videos that contain target objects. However, a naive scan of a video corpus without optimization would be extremely inefficient due to applying complex detectors to irrelevant videos. This demo presents Paine, a video query system that employs a novel index mechanism to optimize video selection queries via commonsense knowledge. Paine samples video frames to build an inexpensive lossy index, then leverages probabilistic models based on existing commonsense knowledge sources to capture the semantic-level correlation among video frames, thereby allowing Paine to predict the content of unindexed video. These models can predict which videos are likely to satisfy selection predicates so that Paine avoids processing irrelevant videos. We will demonstrate a system prototype of Paine for accelerating the processing of video selection queries, allowing VLDB'23 participants to use the Paine interface to run queries. Users can compare Paine with the baseline, the SCAN method.
@inproceedings{he2023paine, title={PAINE Demo: Optimizing Video Selection Queries With Commonsense Knowledge}, author={He, Wenjia and Sabek, Ibrahim and Lou, Yuze and Cafarella, Michael}, journal={Proceedings of the VLDB Endowment}, volume={16}, number={12}, year={2023} }
2

Controlled Intentional Degradation in Analytical Video Systems SIGMOD '22

Wenjia He, Michael Cafarella.
2022 ACM SIGMOD International Conference on Management of Data (SIGMOD '22)

It is increasingly affordable for governments to collect video data of public locations. This video can be used for a range of broadly valuable analytical tasks, such as counting traffic, measuring commerce, or detecting accidents. Governments also have a range of policy goals --- preserving privacy, reducing bandwidth use, and legal compliance --- that may be obtained by degrading the video at some potential cost to analytical accuracy. Ideally, public administrators could employ controlled intentional video degradation to achieve policy goals while still obtaining the required analytical accuracy. Unfortunately, the optimal amount of induced degradation is data- and query-dependent, and so is difficult to determine even when public policy preferences are well-known. We propose a video degradation-accuracy profiling model for the problem of controlling the appropriate amount of degradation. It offers administrators a profile that illustrates the tradeoff between increased analytical accuracy and increased amounts of degradation. Computing the true tradeoff curves requires full access to the non-degraded video stream, so a primary technical contribution of this work lies in methods for accurately approximating the curves with only limited information. In addition, we propose a profile repair policy to further improve tradeoff curves' accuracy. We describe our prototype system, Smokescreen, plus experiments on two video datasets, two detection models and four aggregate query types. Compared with competing methods, we show our upper bound estimation of analytical error is up to 155% tighter, and Smokescreen enables 88% more accurate tradeoffs.
@inproceedings{he2022controlled, title={Controlled Intentional Degradation in Analytical Video Systems}, author={He, Wenjia and Cafarella, Michael}, booktitle={Proceedings of the 2022 International Conference on Management of Data}, pages={2105--2119}, year={2022} }
3

A Method for Optimizing Opaque Filter Queries SIGMOD '20

Wenjia He, Michael R. Anderson, Maxwell Strome, Michael Cafarella.
2020 ACM SIGMOD International Conference on Management of Data (SIGMOD '20)

An important class of database queries in machine learning and data science workloads is the opaque filter query: a query with a selection predicate that is implemented with a UDF, with semantics that are unknown to the query optimizer. Some typical examples would include a CNN-style trained image classifier, or a textual sentiment classifier. Because the optimizer does not know the predicate's semantics, it cannot employ standard optimizations, yielding long query times. We propose voodoo indexing, a two-phase method for optimizing opaque filter queries. Before any query arrives, the method builds a hierarchical "query-independent" index of the database contents, which groups together similar objects. At query-time, the method builds a map of how much each group satisfies the predicate, while also exploiting the map to accelerate execution. Unlike past methods, voodoo indexing does not require insight into predicate semantics, works on any data type, and does not require in-query model training. We describe both standalone and SparkSQL-specific implementations, plus experiments on both image and text data, on more than 100 distinct opaque predicates. We show voodoo indexing can yield up to an 88% improvement over standard scan behavior, and a 79% improvement over the previous best method adapted from research literature.
@inproceedings{he2020method, title={A Method for Optimizing Opaque Filter Queries}, author={He, Wenjia and Anderson, Michael R and Strome, Maxwell and Cafarella, Michael}, booktitle={Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data}, pages={1257--1272}, year={2020} }
4

Metis: Robustly Optimizing Tail Latencies of Cloud Systems USENIX ATC '18

Zhao Lucis Li, Chieh-Jan Mike Liang, Wenjia He, Lianjie Zhu, Wenjun Dai, Jin Jiang, Guangzhong Sun.
Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC '18)

Tuning configurations is essential for operating modern cloud systems, but the difficulty arises from the cloud system’s diverse workloads, large system scale, and vast parameter space. Building on previous space exploration efforts of searching for the optimal system configuration, we argue that cloud systems introduce challenges to the robustness of auto-tuning. First, performance metrics such as tail latencies can be sensitive to nontrivial noises. Second, while treating target systems as a black box promotes applicability, it complicates the goal of balancing exploitation and exploration. To this end, Metis is an auto-tuning service used by several Microsoft services, and it implements customized Bayesian optimization to robustly improve auto-tuning: (1) diagnostic models to find potential data outliers for re-sampling, and (2) a mixture of acquisition functions to balance exploitation, exploration and re-sampling. This paper uses Bing Ads key-value store clusters as the running example – compared to weeks of manual tuning by human experts, production results show that Metis reduces the overall tuning time by 98.41%, while reducing the 99-percentile latency by another 3.43%.
@inproceedings{li2018metis, title={Metis: Robustly tuning tail latencies of cloud systems}, author={Li, Zhao Lucis and Liang, Chieh-Jan Mike and He, Wenjia and Zhu, Lianjie and Dai, Wenjun and Jiang, Jin and Sun, Guangzhong}, booktitle={2018 {USENIX} Annual Technical Conference ({USENIX} {ATC} 18)}, pages={981--992}, year={2018} }