I am a Ph.D. student in the Computer Science Department at the University of Illinois at Urbana-Champaign (UIUC), working with Professor Aditya Parameswaran. I am broadly interested in developing systems that reduce data analysts’ effort in dealing with large datasets. My current research focuses on OrpheusDB, a version control system for collaborative data analytics. I also work on an "any-k" sampling engine named Needletail.
I graduated with a B.S. in Computer Science and Mathematics from the University of Wisconsin-Madison in 2015 (Go Badgers!). Before I came to the U.S., I studied at Nanjing Foreign Language School, China.
Publications
OrpheusDB: Bolt-on Versioning for Relational Databases
Silu Huang, Liqi Xu, Jialin Liu, Aaron Elmore, Aditya Parameswaran
43rd International Conference on Very Large Data Bases (VLDB), Munich, Germany. September, 2017
Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. While git and svn are highly effective at managing code, they are not capable of managing large unordered structured datasets efficiently, nor do they support analytic (SQL) queries on such datasets. We introduce OrpheusDB, a dataset version control system that “bolts on” versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database “for free”, while the database itself is unaware of the presence of dataset versions. We develop and evaluate multiple data models for representing versioned data, as well as a light-weight partitioning scheme, Lyresplit, to further optimize the models for reduced query latencies. With Lyresplit, OrpheusDB is on average 10³× faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to 20× relative to schemes without partitioning. Lyresplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by 10× on average.
OrpheusDB: A Lightweight Approach to Relational Dataset Versioning
Liqi Xu, Silu Huang, Sili Hui, Aaron Elmore, Aditya Parameswaran
International Conference on Management of Data (SIGMOD), Chicago, USA. June, 2017
(Best Demo Honorable Mention)
We demonstrate OrpheusDB, a lightweight approach to versioning of relational datasets. OrpheusDB is built as a thin layer on top of standard relational databases, and therefore inherits much of their benefits while also compactly storing, tracking, and recreating dataset versions on demand. OrpheusDB also supports a range of querying modalities spanning both SQL and git-style version commands. Conference attendees will be able to interact with OrpheusDB via an interactive version browser interface. The demo will highlight underlying design decisions of OrpheusDB, and provide an understanding of how OrpheusDB translates versioning commands into commands understood by a database system that is unaware of the presence of versions. OrpheusDB has been developed as open-source software; code is available at http://orpheus-db.github.io.
Optimally Leveraging Density and Locality to Support LIMIT Queries
Albert Kim*, Liqi Xu*, Tarique Siddiqui, Silu Huang, Samuel Madden, Aditya Parameswaran
Under review for VLDB'18
(* Equal Contribution)
Existing database systems are not optimized for queries with a LIMIT clause—operating instead in an all-or-nothing manner. In this paper, we propose a fast LIMIT query evaluation engine, called Needletail, aimed at letting analysts browse a small sample of the query results on large datasets as quickly as possible, independent of the overall size of the result set. Needletail introduces density maps, a lightweight in-memory indexing structure, and a set of efficient algorithms (with desirable theoretical guarantees) to quickly locate promising blocks, trading off locality and density. In settings where the samples are used to compute aggregates, we extend techniques from survey sampling to mitigate the bias in our samples. Our experimental results demonstrate that Needletail returns results 4× faster on SSDs and 9× faster on HDDs on average, while occupying up to 23× less memory than existing techniques.
An Empirical Evaluation of Machine Learning Approaches for Angry Birds
Anjali Narayan-Chen, Liqi Xu, Jude Shavlik
International Joint Conference on Artificial Intelligence (IJCAI) Symposium on AI in Angry Birds, Beijing, China. August, 2013
Angry Birds is a popular video game in which players shoot birds at pigs and other objects. Because of complexities in Angry Birds, such as continuously-valued features, sequential decision making, and the inherent randomness of the physics engine, learning to play Angry Birds intelligently presents a difficult challenge for machine learning. We describe how we used the Weighted Majority Algorithm and Naive Bayesian Networks to learn how to judge possible shots. A major goal of ours is to design an approach that learns the general task of playing Angry Birds rather than learning how to play specific levels. A key aspect of our design is that the features provided to the learning algorithms are a function of the local neighborhood of a shot's expected impact point. To judge generality we evaluate the learning algorithms on game levels not seen during training. Our empirical study shows our learning approaches can play statistically significantly better than a baseline system provided by the organizers of the Angry Birds competition.
The Generation Google Scholarship | Google | 2014
The Grace Hopper Celebration (GHC) Scholarship | Palantir | 2014
Clarice Cox Scholarship | University of Wisconsin - Madison | 2014