Adaptive Query Processing for Crowd-Powered Database Systems
Current project, NSF funded. See funding section below for details.
Goals and Challenges
Hybrid human-machine query processing systems, such as crowd-powered database systems, aim to broaden the scope of questions users can ask about data by incorporating human computation to support queries that may be subjective and/or require visual or semantic interpretation. In this work, we focus on filtering data with predicates evaluated by people. While traditional database systems use cost-based optimization to devise a low-cost query execution plan, in a hybrid system statistics that an optimizer would use, e.g., predicate cost and selectivity, are not available at run time.
The goal of this project is to develop an adaptive query processing approach for filter queries that learns the relevant optimization statistics as the query is running and adjusts the execution plan accordingly. In such an adaptive approach, the choice of which predicate a human worker should evaluate next for an item is decided on-the-fly, with the aim of minimizing the number of tasks that workers must complete. The challenges include devising an effective strategy for estimating statistics that considers the characteristics of human computation and crowdsourcing platforms. For example, query cost can include both latency and number of worker votes to reach concensus; crowd platforms are a dynamic environment in which workers come and go and can choose how many tasks they want to complete.
In our HCOMP 2017 paper, we describe an adaptive approach called Dynamic Filter for dynamically reordering filtering predicates. Adapted from work on Eddies by Avnur and Hellerstein, Dynamic Filter uses a lottery scheduling algorithm to learn the relative predicate selectivities and costs as the query is running and adjusts the order in which it chooses the next task to give to a human worker; a task consists of asking the worker if a given item satisfies a given predicate. We use simulations to show that Dynamic Filter requires fewer tasks to process queries than a baseline algorithm that randomly chooses a predicate for each task, as well as algorithms that use a fixed predicate ordering for the duration of the query. For example, the figure below shows the distribution of total tasks required for Dynamic Filter and the baseline algorithms for a two-predicate query on hotel data. The baselines are: random, fixed order of predicate 1 then predicate 2, and fixed order of predicate 2 then predicate 1. In this query, the two predicates have similar cost but different selectivities. See the paper for more!
Our simulations use a data set of real worker responses for predicates about hotels and restaurants that we previously collected using Amazon's Mechanical Turk. Our current results reflect an algorithm that issues one task at a time to workers. Next steps include investigating the impact on Dynamic Filter if there are a multiple outstanding worker tasks.
The code for this project is available at github.com/trush/dynamicfilter
Presentations and Publications
Students have shared their progress on this project at a variety of venues. On campus at Harvey Mudd, they have given a research talk during the summer as well as participating in the campus-wide poster session in the fall. They have also presented their work at an academic conference, Human Computation 2017, describing the following publication:
Doren Lan, Katherine Reed, Austin Shin, Beth Trushkowsky. Dynamic Filter: Adaptive Query Processing with the Crowd. In Fifth AAAI Conference on Human Computation, pp.118-127, 2017.
Project Funding and Personnel
This material is based upon work supported by the National Science Foundation under Grant No. 1657259, CRII: III: RUI: Adaptive Query Processing for Crowd-Powered Database Systems.
Principal Investigator: (Katherine) Beth Trushkowsky
Duration (expected): June 2017-May 2019
A number of undergraduate students have been involved:
- Flora Gallina-Jones
- Israel Jones
- Doren Lan1,2
- Mahlet Melaku
- Jake Palanker
- Katherine Reed2
- Ashley Schmit
- Austin Shin2
- Matthew Simon
1. Previously supported by an REU, grant no. 1359170
2. Also worked on project before current grant began