CS133 Lab 3: Query Optimization

Deadlines:

Part 1: Exercises 1-3: Wednesday, October 30 11:59 PM PT
Final: Exercise 4-5: Wednesday, November 6 11:59 PM PT

Lab Description

In this lab, you will implement a query optimizer on top of SimpleDB. The main tasks include implementing a selectivity estimation framework and a cost-based optimizer. You will be implementing something similar to the Selinger cost-based optimizer discussed in class. The remainder of this document describes what is involved in adding optimizer support and provides a basic outline of how you might add this support to your database. As with the previous labs, we recommend that you start as early as possible.

Quick jump to exercises:

Section 2.1 for Exercise 1
Section 2.2.2 for Exercise 2-3
Section 2.2.3 for Exercise 4
Section 2.3 for Exercise 5

Jump to Submission instructions.

1. Getting started

You should begin with the code you submitted for Lab 2. (If you did not submit code for Lab 2, or your solution didn't work properly, contact us to discuss options.) We have provided you with extra test cases as well as source code files for this project that are not in the original code distribution you received. We reiterate that the unit tests we provide are to help guide your implementation along, but they are not intended to be comprehensive or to establish correctness. You will need to add these new test cases to your release. The easiest way to do this is to untar the new code in the same directory as your top-level simpledb directory, as follows:

You will need to add these new files to your release. The easiest way to do this is to untar the new code in the same directory as your top-level simpledb directory, as follows:

Make a backup of your Lab 2 solution into a new directory for Lab 3 by typing :

$ cp -r cs133-lab2 cs133-lab3

Download the new tests and skeleton code for Lab 3 from https://www.cs.hmc.edu/~beth/courses/cs133/lab/cs133-lab3-supplement.tar.gz:
```
$ wget https://www.cs.hmc.edu/~beth/courses/cs133/lab/cs133-lab3-supplement.tar.gz
```
Extract the new files for Lab 3 by typing:
```
tar -xvzf cs133-lab3-supplement.tar.gz
```

Now all files from lab 2 and lab 3 will be in the cs133-lab3 directory.

To work in Eclipse, create a new java project named cs133-lab3 like you did for previous labs.

1.1. Implementation Overview

You will be writing code in three Java files: IntHistogram.java, TableStats.java, and JoinOptimizer.java.
Here's a rough outline of how you can proceed with this lab. More details on these steps are given in Section 2 below.

Implement a class for creating a histogram over integer data; a skeleton IntHistogram is provided for this purpose.
Implement the methods in the TableStats class that allow it to estimate selectivities of filters and cost of scans, using histograms (the IntHistogram class).
Implement methods in the JoinOptimizer class that allow it to estimate the cost and selectivities of joins.
Write the orderJoins method in JoinOptimizer. This method produces an optimal ordering for a set of joins (using a dynamic programming strategy), given the statistics computed in the previous two steps.

As before, we will grade your assignment by looking at your code and verifying that you have passed the test for the ant targets test and systemtest. See Section 3.3 for a complete discussion of grading and the tests you will need to pass.

2. Optimizer outline

Recall that the main idea of a cost-based optimizer is to:

Use statistics about tables to estimate "costs" of different query plans. Typically, the cost of a plan is related to the cardinalities of (number of tuples produced by) intermediate joins and selections, as well as the selectivity of filter and join predicates.
Use these statistics to order joins and selections in an optimal way.

In this lab, you will implement code to perform both of these functions.

2.1 Overall Optimizer Structure

Before getting started with the implementation, you need to understand the overall structure of the SimpleDB optimizer.

When you run SimpleDb, it runs the main() method in Parser.java. When the Parser is invoked, it will compute statistics over all of the tables (using statistics code you will write). When a query is issued, the parser will convert the query into a logical plan representation and then call your query optimizer to generate an optimal physical query plan.

The overall control flow of the SimpleDB modules of the parser and optimizer is shown in Figure 1.

Figure 1: Diagram illustrating classes, methods, and objects used in the parser and optimizer.

The key at the bottom explains the symbols; you will implement the components with double-borders. The classes and methods you will implement will be explained in more detail in the text that follows (you may wish to refer back to this diagram), but the basic operation is as follows:

Parser.java constructs a set of statistics (stored in the statsMap container) for each table in the Catalog when it is initialized. It does this by calling TableStats.computeStatistics(), which instantiates a new TableStats for each table: TableStats s = new TableStats(tableid, IOCOSTPERPAGE); Note that the SimpleDb catalog is parsed from the catalog text file provided by the user, as described in Section 2.7 of Lab 2.

It then waits for a query to be input from the user, and calls the method parseQueryLogicalPlan on that query.

parseQueryLogicalPlan first constructs a LogicalPlan that represents the parsed query. It then calls the method physicalPlan on the LogicalPlan instance it has constructed. The physicalPlan method returns a DBIterator object that can be used to actually run the query.

In the exercises to come, you will implement the methods that help physicalPlan devise an optimal plan.

Exercise 1: Parser.java

This exercise will ask you to briefly read some of the code that formulates a physical query plan after parsing a declarative SQL query, and answer a few questions. You should include your responses in your writeup.

Skim parseQueryLogicalPlan() in Parser.java to see how a LogicalPlan object is created. You should focus mainly on reading the comments throughout this method and note to yourself when tables, joins, filters, and aggregates are added to the LogicalPlan. For seeing joins and filters added, you'll end up looking at processExpression() in Parser.java. You should also read the method descriptions for addJoin, addFilter, and addScan in LogicalPlan.java

Question 1: How many LogicalJoinNode, LogicalScanNode, and LogicalFilterNode objects would you expect to be added to the LogicalPlan if the query was:

SELECT * 
FROM Students S1, Students S2, Enrolled E 
WHERE S1.sid=E.sid 
	AND S2.sid=E.sid 
	AND E.cid = 133;

Question 2: The method physicalPlan() in LogicalPlan.java creates a physical query plan using the logical query plan and statistics about each table in the plan. Read through the method description. Note that the return type of physicalPlan() is DbIterator. Why is the physical query plan a DbIterator?

Question 3: Now read physicalPlan() up until you see the call to orderJoins(). The first two while-loops iterate over the LogicalScanNodes and LogicalFilterNodes. Note the updates to the local variable filterSelectivities. After these first two while-loops, what do you expect each entry in the instance variable subPlanMap to represent? You can ignore entries that weren't added in physicalPlan(). Hint: look for the gets and puts on subplanMap.

Finally, skim the rest of physicalPlan() to get a sense of how joins and other operators are added to the plan.

2.2. Statistics Estimation

Accurately estimating plan cost is quite tricky. In this lab, we will focus only on the cost of sequences of joins and base table accesses. We will assume the Selinger-style heuristics of pushing selections down and evaluating them on the fly. We won't worry about access method selection (since we only have one access method, table scans) or the costs of additional operators (such as aggregates). You are only required to consider left-deep plans for this lab although we provide methods that will help you search through a larger variety of plans.

2.2.1 Overall Plan Cost

We will write join plans of the form

p=t1 join t2 join ... tn

, which signifies a left deep join where t1 is the left-most join (deepest in the tree). Given a plan like p, its cost will include the I/O costs of scanning and joining each relation, as well as the CPU cost of the joins.

Scan I/O cost

To make I/O and CPU cost comparable, typically a constant scaling factor is used to calculate the I/O cost per page, rather than just counting the number of pages in the table. For this lab, you can ignore the effects of caching in the buffer pool (e.g., assume that every access to a table incurs the full cost of a scan). Thus scancost(t1), the I/O cost of scanning table t1, is simply number_of_pages_in_t1 x SCALING_FACTOR.

Join I/O cost and CPU cost

When using nested loops joins, recall that the cost of a join between two tables t1 and t2 (where t1 is the outer) is simply:

  
  scancost(t1) + ntups(t1) x scancost(t2) // IO cost for tuple-nested-loop join
                 + ntups(t1) x ntups(t2)  // CPU cost

Here, scancost(t1) is the I/O cost of scanning table t1 and ntups(t1) is the number of tuples in table t1.

2.2.2 Filter Selectivity

Note that the value of ntups can be directly computed for a base table by scanning that table. However, estimating ntups for a table after one or more selection predicates have been applied can be trickier -- this is the filter selectivity estimation problem. The approach we will use is using equi-wdith histograms. The next two exercises will walk you through creating and using histograms to estimate selectivity.

Exercise 2: IntHistogram.java

You will need to implement a way to record statistics used for selectivity estimation for an attribute of a relation. We have provided a skeleton class, IntHistogram that will do this. You should complete the constructor and the methods estimateSelectivity and addValue for IntHistogram; hints and details are below. You may also wish to implement the toString method to help with debugging.

Constructor: You should create a data structure you will use for the histogram's buckets, as well as any other instance variables you deem necessary. There are a fixed number of buckets; each bucket represents the number of records in a fixed range of the domain of the attribute of the histogram. For example, if a field f ranges from 1 to 100, and there are 10 buckets, then the first bucket might contain the count of the number of records between 1 and 10, the second bucket has a count of the number of records between 11 and 20, and so on. Note that the min and max values are inclusive and that the last bucket may represent a different number of values than other buckets. Be careful: this is a histogram over integer data, so your bucket width should be integer-valued (and not zero).

addValue(): To populate a histogram with information about an attribute, the TableStats class will iteratively call addValue with an attribute's value from each tuple in a relation. Implement addValue to update the count of values in the correct bucket.

estimateSelectivity(): After TableStats finishes adding values, the histogram is ready to provide selectivity estimates for a given value and operation (e.g., determine selectivity of the filter "age = 40" or "age > 7" in a histgram on age). You can compare op to the possible Predicate.Op values (see Predicate.java) to compute the appropriate selectivity. You can treat Predicate.Op.LIKE the same as Predicate.Op.EQUALS. Watch out for boundary cases such as v < min.
Here is the approach to calculating selectivity that we discussed in class:
- To estimate the selectivity of an equality expression, f=const, compute the bucket that contains value const. Suppose the width (range of values) of the bucket is w, the height (number of tuples) is h, and the number of tuples in the table is ntups. Then, assuming values are uniformly distributed throughout the bucket, the selectivity of the expression is roughly (h / w) / ntups , since (h/w) represents the expected number of tuples in the bucket with value const.
- To estimate the selectivity of a range expression f>const, compute the bucket b that const is in, with width w_b and height h_b. Then, b contains a fraction b_f = h_b / ntups of the total tuples. Assuming tuples are uniformly distributed throughout b, the fraction b_part of b that is > const is (b_right - const) / w_b, where b_right is the right endpoint of b's bucket. Thus, bucket b contributes (b_f x b_part) selectivity to the predicate. In addition, buckets b+1...NumB-1 contribute all of their selectivity (which can be computed using a formula similar to b_f above). Summing the selectivity contributions of all the buckets will yield the overall selectivity of the expression. Figure 2 illustrates this process.
- Selectivity of expressions involving less than can be performed similar to the greater than case, looking at buckets down to 0.
Figure 2: Diagram illustrating the histograms you will implement in Lab 3.

We have provided a class StringHistogram that uses IntHistogram to compute selectivites for String predicates. You may modify StringHistogram if you want to implement a better estimator, though you do not need to in order to complete this lab.

After completing this exercise, you should be able to pass the IntHistogramTest unit test. Note that your histogram will be more thoroughly tested after finishing Exercise 3.

Exercise 3: TableStats.java

The class TableStats contains methods that compute the number of tuples and pages in a table and that estimate the selectivity of predicates over the fields of that table. The query parser we have created creates one instance of TableStats per table, and passes these TableStats structures into your query optimizer.

You should fill in the following methods and classes in TableStats:

Constructor: The TableStats constructor should create and populate a histogram for each field in the relation specified by tableid. Depending on the type of the field, you should create either an IntHistogram or a StringHistogram. Note that you do not need a min or max value to construct a StringHistogram. You will likely need to scan the table multiple times to accomplish this, e.g., this approach:
- Compute the minimum and maximum values for every attribute in the table (by scanning it once).
- Construct a histogram for every attribute in the table. Note that TableStats.java has an instance variable NUM_HIST_BINS for the number of buckets.
- Scan the table again, selecting out all fields of all of the tuples and using them to populate the counts of the buckets in each histogram.
To create a SeqScan to iterate over tuples in a relation with id tableid, you will need to use a Transaction. You can do something like the following:

Transaction t = new Transaction(); t.start(); SeqScan s = new SeqScan(t.getId(), tableid, "t"); // do stuff with s t.commit();

You will also likely make use of the classes HeapFile, TupleDesc, and Type to determine number of pages in the table, the fields of the table, and to distinguish types. Note: unfortunately, the only parent class of IntHistogram and StringHistogram is Object, so a data structure that contains both will need to have elements of type Object.

estimateSelectivity(int field, Predicate.Op op, Field constant): Using the histogram for the field with index field, estimate the selectivity for that field given the operation op and the given constant. You will likely need to cast constant as an IntField or StringField to get its value.

estimateScanCost(): This method estimates the cost of sequentially scanning the file, given that the cost to read a page is ioCostPerPage, the value passed into the constructor. You can assume that there are no seeks and that no pages are in the buffer pool. This method may use costs or sizes you chose to compute in the constructor.

estimateTableCardinality(double selectivityFactor): This method returns the number of tuples that would be in the relation after a predicate with selectivity selectivityFactor is applied. This method may use costs or sizes you computed in the constructor.

totalTuples(): Returns the total number of tuples in the relation.

After completing these tasks you should be able to pass the unit tests in TableStatsTest. Note that these tests have randomness; it is ok to fail it once every ten to twenty times you run it.

2.2.3 Join Cardinality and Cost

Finally, recall that the output of one join can be the input to the next join. To evaluate this expression, you need some way to estimate the size (i.e., ntups) of t1 join t2. This join cardinality estimation problem is harder than the filter selectivity estimation problem. For this lab, you can use simple heuristics such as those described below in Exercise 4.

Exercise 4: Join Cost and Cardinality Estimation

The class JoinOptimizer.java includes all of the methods for ordering and computing costs of joins. In this exercise, you will write the methods for estimating the selectivity and cost of a join, specifically:

estimateJoinCost(LogicalJoinNode j, int card1, int card2, double cost1, double cost2): This method estimates the cost of join j, given that the left input is of cardinality card1, the right input of cardinality card2, that the cost to access the left input is cost1, and that the cost to access the right input is cost2. You can assume the join is a tuple-nested-loops join, and use the formula mentioned earlier in Section 2.2.1. Note that cost1 and card1 represent the cost and cardinality of the whole left subtree, respectively, so pipelining is being accounted for.

estimateJoinCardinality(LogicalJoinNode j, int card1, int card2, boolean t1pkey, boolean t2pkey): This method estimates the number of tuples output by join j, given that the left input is size card1, the right input is size card2, and the boolean flags t1pkey and t2pkey indicate whether the left and right (respectively) field is unique (a primary key). This method calls estimateTableJoinCardinality(), which is where you will write your code.
When implementing your simple solution, you should keep in mind the following:
- For equality joins, when one of the attributes is a primary key, the number of tuples produced by the join cannot be larger than the cardinality of the non-primary key attribute.
- For equality joins when there is no primary key, it's hard to say much about what the size of the output is. It's fine to use a simple heuristic: the size of the larger of the two tables.
- For inequality joins, it is similarly hard to say anything accurate about sizes. The size of the output should be proportional to the sizes of the inputs. It is fine to assume that a fixed fraction of the cross-product is emitted by range scans (say, 30%). In general, the cost of an inequality join should be larger than the cost of a non-primary key equality join of two tables of the same size.
To implement the simple calculations, you likely won't need many of the arguments to sestimateTableJoinCardinality(). You will be comparing joinOp to the possible operations in Predicate.Op; you can treat Predicate.Op.LIKE the same as Predicate.Op.EQUALS.

After implementing these methods, you should be able to pass the unit tests in JoinOptimizerTest.java, excluding the tests that involve orderJoins.

2.3 Join Ordering

Now that you have implemented methods for estimating costs, you will implement a Selinger-style optimizer. For these methods, note that joins are expressed as a list of join nodes (i.e., a logical join between two relations; see LogicalJoinNode.java) as opposed to a list of relations to join as described in class.

Translating the algorithm to the join node list form mentioned above, an outline in pseudocode would be as follows.

Hint: We discussed this algorithm in detail in class!

1. j = set of join nodes
2. for (i in 1...|j|):  // First find best plan for single join, then for two joins, etc. 
3.     for s in {all length i subsets of j} // Looking at a concrete subset of joins
4.       bestPlan = {}  // We want to find the best plan for this concrete subset 
5.       for s' in {all length i-1 subsets of s} 
6.            subplan = optjoin(s')   // Look-up in the cache the best query plan for s but with one relation missing
7.            plan = best way to join (s-s') to subplan  // Now find the best plan to extend s' by one join to get s
8.            if (cost(plan) < cost(bestPlan))
9.               bestPlan = plan // Update the best plan for computing s
10.      optjoin(s) = bestPlan
11. return optjoin(j)

As an example, a size i=2 subset s of join nodes might be the two joins t1 join t2 and t3 join t1. The pseudocode in lines 4-10 would compare (1) joining t2 as the inner with t3 join t1, and (2) joining t3 as the inner with t1 join t2.

To help you implement this algorithm, we have provided several classes and methods to assist you:

First, the method enumerateSubsets(Vector v, int size) in JoinOptimizer.java will return a set of all of the subsets of v of size size. This method is not particularly efficient, so using it a lot will surely slow your code down!

Second, we have provided the method:
```
    private CostCard computeCostAndCardOfSubplan(HashMap<String, TableStats> stats, 
                                                HashMap<String, Double> filterSelectivities, 
                                                LogicalJoinNode joinToRemove,  
                                                Set<LogicalJoinNode> joinSet,
                                                double bestCostSoFar,
                                                PlanCache pc) 
```
Given a subset of joins (joinSet), and a join to remove from this set (joinToRemove), this method computes the best way to join joinToRemove to joinSet - {joinToRemove}. It returns this best method in a CostCard object, which includes the cost, cardinality, and best join ordering (as a vector). computeCostAndCardOfSubplan may return null, if no plan can be found (because, for example, there is no linear join that is possible), or if the cost of all plans is greater than the bestCostSoFar argument. The method uses a cache of previous joins called pc (optjoin in the psuedocode above) to quickly lookup the fastest way to join joinSet - {joinToRemove}. The other arguments (stats and filterSelectivities) are passed into the orderJoins method that you must implement as a part of Exercise 4, and are explained below. This method essentially performs lines 6--8 of the psuedocode described earlier.

Note: While the original Selinger optimizer considered only left-deep plans, computeCostAndCardOfSubplan considers all linear plans.

Third, we have provided the method:

    private void printJoins(Vector<LogicalJoinNode> js, 
                           PlanCache pc,
                           HashMap<String, TableStats> stats,
                           HashMap<String, Double> selectivities)

This method can be used to display a graphical representation of the join costs/cardinalities (when the "explain" flag is set via the "-explain" option to the optimizer, for example).

Fourth, we have provided a class PlanCache that can be used to cache the best way to join a subset of the joins considered so far in your implementation of the Selinger-style optimizer (an instance of this class is needed to use computeCostAndCardOfSubplan).

Exercise 5: Join Ordering

In JoinOptimizer.java, the orderJoins method operates on the joins class member, returning a new Vector that specifies the order in which joins should be done. Item 0 of this vector indicates the bottom-most join in a linear plan. Adjacent joins in the returned vector should share at least one field to ensure the plan is linear.

The method signature looks like this:

  Vector orderJoins(HashMap<String, TableStats> stats, 
                   HashMap<String, Double> filterSelectivities,  
                   boolean explain)

Here stats is an object that lets you find the TableStats for a given table name that appears in the FROM list of the query. filterSelectivities allows you to find the selectivity of any predicates over a table; it is guaranteed to have one entry per table name in the FROM list. Finally, explain specifies that you should output a representation of the join order for informational purposes.

To implement orderJoins, you can use the helper methods and classes described above. Roughly, your implementation should follow the psuedocode above, looping through subset sizes, subsets, and sub-plans of subsets, calling computeCostAndCardOfSubplan and building a PlanCache object that stores the minimal-cost way to perform each subset join.

Some things to keep in mind, particularly for using the helper methods/classes:

CostCard.java: This class encapsulates the cost and cardinality information for a given ordering of join nodes (the ordering is the Vector).
PlanCache.java: You will likely create a PlanCache and update entries in it via its addPlan method.
computeCostAndCardofSubPlan(): This method will be helpful! It has a number of arguments, so be sure to read those carefully. Note that this method can return NULL; for example it will refuse to do a cross-product. Be sure to check for that.
enumerateSubsets(): This method is very helpful! However it can be tempting to use it more than necessary, which can really slow down your code.
The biggest double value is DOUBLE.MAX_VALUE
Be careful with your for-loop range for when you implement line 2 of the pseudocode.
There are different ways to return the Vector ordering at the conclusion of this method, but one way would be to look up the ordering from the PlanCache using getOrder.

After implementing this method, you should be able to pass the test orderJoins tests in JoinOptimizerTest. You should also pass the system test QueryTest.

You have now completed the lab. Good work!

3. Submission and Grading Details

You must submit your code (see below) as well as a short (1 page, maximum) writeup describing your approach. This writeup should:

Include your answers from Exercise 1
Describe any design decisions you made, including any data structure(s) used for your histograms.
Discuss and justify any changes you made to the API.
Describe any missing or incomplete elements of your code.
Describe how long you (and your partner) spent on the lab, and whether there was anything you found particularly difficult or confusing.

3.1. Collaboration

This lab can be completed alone or with a partner. Please indicate clearly who you worked with, if anyone, on your writeup. Only one person needs to submit. On Gradescope, click "Group Members" at the bottom of the page after uploading your files to add your partner.

3.2. Submitting your assignment

For the first deadline, you should submit a writeup containing your answers for Exercise 1 (you will add to this document for the remaining parts of the writeup for the final submission), as well as your code tarball so far. For the final deadline, submit your updated writeup and final code.

Generating Tarball

You can generate the tarball by using the ant handin target. This will create a file called cs133-lab.tar.gz that you can submit. You can rename the tarball file if you want, but the filename must end in tar.gz. If you prefer, you can create the tarball with just your source code on the command line:

  $ cd cs-133lab1
  $ tar czvf cs133-lab.tar.gz src/

The autograder won't be able to handle it if you package your code any other way!

Submitting on Gradescope

Click Lab 3 on your Gradescope dashboard. For the first deadline, upload cs133-lab.tar.gz and writeup.txt with your responses for Exercise 1.
For the final version: click Lab 3 and then click the "Resubmit" button on the bottom of the page ; upload both cs133-lab.tar.gz and writeup.txt containing your updated writeup.

If you worked with a partner, be sure to enter them as a group member on Gradescope after uploading your files.

3.3 Grading

Your grade for the lab will be based on the final version after all exercises are complete.

Exercise 1 is 12.5% of your grade and 62.5% of your grade will be based on whether or not your code passes the test suite. Before handing in your code, you should make sure it produces no errors (passes all of the tests) from both ant test and ant systemtest.

Important: before testing, we will replace your build.xml and the entire contents of the test directory with our version of these files. This means you cannot change the format of .dat files! You should also be careful changing our APIs. You should test that your code compiles the unmodified tests. In other words, we will untar your tarball, replace the files mentioned above, compile it, and then grade it. It will look roughly like this:

$ tar xvzf cs133-lab.tar.gz
[replace build.xml and test]
$ ant test
$ ant systemtest

If any of these commands fail, we'll be unhappy, and, therefore, so will your grade.

An additional 25% of your grade will be based on the quality of your writeup, our subjective evaluation of your code, and on-time submission for the intermediate deadlines.

ENJOY!!

Acknowledgements

Thanks to our friends and colleagues at MIT and UWashington for doing all the heavy lifting on creating SimpleDB!