Review for Paper: 8-An Overview of Query Optimization in Relational Systems by Chaudhuri

Review 1

The paper gives insight onto the topic of query optimization which has been researched upon since early 1970s. It is based on SQL processing since it has been the standard for relational databases. There are two key components for query evaluation – query optimizer and query execution engine. Query execution engine defines operators for the execution of operations such as sort, sequential scan, index scan etc whereas the query optimizer is responsible for generating the input for the execution engine so that it executes efficiently.

The author discusses and example of query optimizer – System-R Optimizer. It has been incorporated into many commercial optimizers and is the most widely used optimizer currently. It works well for less than 10 joins. Its cost of operations and result size are estimated using statistics and is approximate at best. The search space used is too large and only left-deep plans are considered.

Some important algebraic transformations have been discussed in the paper such as outerjoin and join, Group-By and Join. Transformations do not necessarily reduce cost and therefore must be applied in a cost-based manner by the enumeration algorithm to ensure a positive benefit. Techniques have been shown to reduce multi-block queries to single-block by merging views or nested subqueries.

Cost estimation of a query optimizer is its major focus since the queries are optimized to reduce costs. The parameters used for estimation has been derived from the System-R approach i.e. collecting statistical summaries of data that has been stored and computing costs of output data stream along with cost related to the execution of the operations. However, the data is not substantiated with figures or calculations and estimation methodologies are just speculated upon.

The paper succeeds in providing some of the fundamentals of query optimization. It also explains enumeration architectures such as Starburst and Volcano. A sufficient comparison has been done of similar techniques showcased. Some issues prevalent even today have been addressed too. But not many solutions are provided for the same. It provides a good introductory read.



Review 2

What is the problem addressed?
Because relational query languages provides high-level declarative(vague) interface to access data, there is much room for query evaluation optimization. In SQL database system there are two key components of the query evaluation component: query optimizer and the query execution engine.
The query execution engine implements a set of physical operators, for examples sort sequential scan, index scan, nested loop join,... and these physical operator build up possible execution of SQL queries. Just like the relation between axioms and proof of theorem, there should be multiple execution plan and the query optimizer is responsible for generating the "efficient" input for the execution engine. This paper presents some basic component about this optimization and corresponding techniques.
1-­‐2 main technical contributions? Describe.
In general query optimization can be viewed as a search problem among all possible execution plans. We can separate this task into three parts: a space of plan(search space), cost estimation, and enumeration algorithm. The search space for optimization is the set of possible plans which can be related to some algebraic transformation. The algebraic transformations can be either improve or hider the performance, so it is necessary to estimate if the transformation is desirable or not. Cost estimation is the tricky part and depends on what resources usage is measured: CPU time, memory, I/O cost... the second part is that the estimation should be bother accurate and efficient(or we can just run through all candidate). The basic estimation framework is derived from system-R: statistical summaries of data and statistical summary of the output data stream and estimated cost of executing certain operation. An enumeration algorithm can search through the execution space. More recent optimization can adapt changes in the search space with addition new transformation and change in the cost estimation technique, and it is called extensible optimizers The paper presents two cases of extensible optimizers: Starburst and volcano/cascade.
1-­‐2 weaknesses or open questions? Describe and discuss
Optimization is much more than transformation and query equivalence. "Designing effective and correct SQL transformations is hard,developing a robust cost metric is elusive, and building an extensible enumeration architecture is a significant undertaking." I believe that many different optimization are out of the article, and this short version gives a compact overview.



Review 3

This paper is fine!


This paper is fine!


Review 4

This paper provides an overview of query optimization in Relational Databases Systems, focusing on the optimization of SQL queries. There are two main components for a SQL database system, query optimizer and query execution engine. And in general, query optimization can be viewed as a complex search problem, which contains 3 main components: a search space (space of plans), a cost estimation (estimate cost for each plan in search space), and an enumeration algorithm (search through the execution space). This paper first takes one optimizer, System-R Optimizer, as an example to address optimization issues, and then talking more details on all these 3 components one by one. Finally, it represents some other issues such as parallelism, user-defined functions and materialized views.

The problem here is that it is not easy to implement a good optimizer. It is hard to design effective SQL transformations, to develop a robust cost metric, and build an extensible enumeration structure. Thus the author summarizes the foundations and present works in this area, and hopes that future contribution can be done with a better understanding of the existing work. In general, an optimizer is considered to be good, if

the plans in search space have low cost
cost estimation is accurate
enumeration algorithm is efficient

The major contribution in this paper is that it provides plenty of detailed examples when discussing optimization issues. For example, it provides graphs and query plans to demonstrate difference between linear join and bushy join, and explains why we often optimizes linear join (it will reduce the total number of plans in search space). Similarly, it provides an example of a nested query plan to explain how to reduce the nested query to be a single “flattened” query. The concrete examples make it easy to understand the concept behind it.

One interesting observation: I think this paper also provides several good open questions in the end. Take materialized views for example, it is hard to do optimization when taking into account the materialized views. First, the problem of reformulating the query to use one or more materialized views is hard due to the complexity of SQL. Second, when we generate all logically equivalent expressions and optimized them individually, the optimization cost will be increased a lot.


Review 5

In this paper, the author provides an overview of query optimization methods and related issues. Role of query optimizer and it's connections and interactions with other parts of DBMS is described. Also, the author addresses the dificalities and challenges in different aspects of query optimization, such as search space optimization and cost estimation for different plans in the search space. Further more, some real world examples and implementations of query optimizers are introduced, and an overview of how they deal with the challenges is provided.

Upsides:
-The main strength of this paper is its clear language in explaining different techniques and concepts, and also, well organized examples. This helps the researchers to understand the fundamental concepts and problems.
Also, there is a part about future research trends in the topic. I think the author did a good job in foreseeing the future and most of the prediction came to reality.

Flaws:
-The author simplifies some very complex situations without describing why we can do this simplification.
-There no performance evaluation of different techniques. It is important because otherwise readers can not have a good understanding of both merits and flaws of a technique.


Review 6

This paper attempts to summarize 30+ years of work in query optimization, and to show the current state of query optimizers. This is an extremely important aspect of database management systems - because users interact with the system in a declarative way, it is up to the system to optimize for performance. The user poses a query, and the database must come up with an efficient way to retrieve the data - this is the primary selling point of a DBMS. However, the system must be able to come up with an efficient plan quickly - it doesn't make sense to spend more time coming up with a plan than executing it. The author covers how the search space can be limited by limiting join order possibilities, and reducing multi-block queries into a single query, as well as using semijoins. The author then move on to cost estimation , and the statistics that are maintained by the DBMS. He then touches on some more advanced issues, such as using materialized views to further speed up queries.

This paper does a good job of providing an overview of query optimization... much like the title says. It presented aspects of the optimizer that I hadn't really considered - primarily, how efficient the query optimizer must be to function properly. It showed how large the plan space of query optimization really is, and why its so hard to find the best solution. I hadn't realized how much architecture had been built up to support query optimization, and this paper really made me appreciate the work that's been done.

I would have liked to have seem more on object-relational optimization, as well as the optimization of stored procedures. I think this would have been especially interesting, given that it is worth spending significantly more time finding the fastest plan for each query within the procedure. The optimization of object-relational mapping would prevent an entirely new issue - could queries be sped up by keeping the objects in their in memory form, and how could their mapping to relation form be optimized?



Review 7

The paper discuss the query optimization in relational system with some basic knowledge and samplings. The paper is proposed because the query optimization is an important work of database area for a long time.

The two most important parts in query optimization are the query optimizer and the query execution engine. In this paper, it mainly talks about query optimizer. It start from an elegant optimizer named System-R optimization framework to show the process of optimization. Then it discuss more detailed things about search space for optimizers. The paper researches detailedly about all the frequently used operations including different joins, group-by, merging, nested and so on.

The strength of this paper is that it uses a lot of figures and pseudo code to demonstrate the idea, which makes the paper much more readable since a large amount of text makes readers confused and makes the idea hard to understand. I appreciate the behavior of concept visualization.


Review 8

SQL is a high-level “declarative language” and leaves the problem of how the work should be done to the DBMS. And since different query plans can significantly affect execution efficiency, a good design of a query optimizer is very important, and, as stated in the paper, very challenging.

The paper introduced current works in the field of query optimization, and basically divide it into three basic problems: search space, cost estimation and enumeration.

The basic idea of System-R optimizer is first given as an example. The search space is set as Select-Project-Join(SPJ) queries. The optimizer represents a query into a query tree, and uses statistics based on the data to assign an estimated cost for each operator. Then a dynamic programming approach is adapted to give an optimized query plan. This optimize is simple but cannot extend to other logics easily, but it leads to many other extensible optimization architectures.

The first problem is search space. It should be all algebraic transformation that are equivalent in terms of query result. The paper then introduce how current approaches deal with generalized space containing operators such as outer joins and group by, and how to reduce multi-block queries to single block, and how to use semijoin like techniques to optimize multi-block queries.

The paper then discussed key problems and methods in cost estimation. Cost estimation should be both accurate since it determines the upper bound of the whole optimization, and it should also be efficient since such procedure can be invoked many times. A cost estimator should evaluate statistical summary of output data and cost of the operator, given statistical summary of the input data. Such statistical data can be histograms, both one-dimensional and two-dimensional, to show data distribution and correlation.

An enumeration algorithm is responsible for exploring the search space. Starburst and Volcano/Cascades are gave as two examples. Finally the paper mentioned some advanced topics such as parallel and distributed databases. The author concluded that query optimization is a very hard task, and exploring more deeply into the field may requires an understanding of existing frameworks.

The main contribution of the paper is to introduce major tasks for query optimizer and wrap up basic ideas of current query optimization architectures, and address many open questions unsolved.


Review 9

This paper goes over the various query optimization techniques that have emerged over the years, specifically for the relational database system. The paper visits each aspects of optimizing a query engine, and describes the major algorithms and techniques that have emerged in each area.

The various class of algebraic transformations that can occur define the size of the search space for optimization. This search space can be narrowed down through the following:
Optimizing commutativity among operators: join operations get restricted to linear sequences. join and outerjoin predicates can be combined to a single join.
Reducing multi-block queries: a multi-block query can be collapsed into a single block query by merging views or merging nested subqueries.
optimizing multi-block queries: selectivity of predicates across blocks can be exploited to minimize redundant computations.

Given the search space defined by the possible transformation, the cost estimation computes the how good a given operator tree is through the statistical summaries of the data. The cost computation includes the various resources that are consumed, such as CPU time, I/O cost, memory, communication bandwidth, etc.




Review 10

This paper illustrates the basics in Query Optimization in Relational Systems. It focuses on three main areas in the system-R optimizer: search space, cost estimation and enumeration algorithm. In addition, some hot topics in the research area are mentioned.

The query optimization is responsible for generating efficient query plans from the given query to query execution engine. It is a non-trivial task and consists of three main part:

search space: it is constructed by the equivalent algebraic transformation. The optimizer may use different representation of the query. It first constructs parse tree from given query, then transform it to logic operator trees(query trees). Some system uses query graph, instead of tree structured representation. It is simple in concept but does not support asymmetric operator. In addition, unnesting query is also important.
cost estimation: optimizer should efficiently evaluate the cost of a given operator tree. In system-R, it collect statistical summary of data stored, and estimates the output and the cost to execute each operation. The statistical summary usually stored in histogram and are estimated through sampling base data. In large databases, the accuracy and efficiency in this sampling becomes an issue. To compute the cost, it needs a costing model and lots of factor should be taken into consideration, like selectivity and modeling buffer utilization.
enumeration algorithm: one of the desired feature is extensible, that is, it can adapt to changes in the physical operator and cost estimation technique. This paper introduced two kind of extensible optimizer, Starburst and Volcanol/Cascades.

This paper illustrates the basic techs in query optimization in system-R full of details. However it focuses more on fundamentals but does not show the development in these eras, so limits the view of the big picture.



Review 11

This paper attempts to summarize the foundations of the field of query optimization. It presents techniques for navigating the search space of different query execution plans, estimating the cost of different plans, and creating extensible enumeration architectures to account for future evolutions of query optimization.

Search space:
In optimizing a query, the database will choose from many different query plans that will produce logically equivalent results. The database does this by mutating the query in a way that is logically equivalent. Some options include:

The database may try to switch the execution order of join and group-by operations, which may result in a more performant plan. For example, executing a group-by operation first may result in fewer records that need to be scanned when a later join occurs.

The database may also collapse nested queries or views into a single-level query, which could eliminate unnecessary joins.

The database may also choose to use views (or nested queries) when it knows that it can limit the record count for a more complex operation down the line.

Cost Estimation:
Sometimes the database can estimate how many results will be returned by certain operations. It does this by keeping histograms of the data distribution of key columns. This can help the database limit the search space for queries, as described above. A challenge arises because one-dimensional histograms of columns cannot be used to discern correlation between data of different columns. This means that databases may be blind to correlations that it could use to perform optimizations.
Actually using these estimates to guess the cost of an operation is hard, because factors of data attributes, I/O, buffer pool, operation properties etc. all play a role in determining the cost.

Enumeration Architecture:
The main goal of an enumeration architecture is to allow the searching of different plans to be extensible when new operations and estimation/optimization techniques need to be inserted. The two architectures explored by the paper both include “rules” that the database can apply to a candidate plan to mutate it into another valid candidate plan. New rules may be added when extensions are needed.



Review 12

The paper outlines the topic of query optimization in relational database systems, signifying several important work in the past. “Declarative” nature of relational query languages had a large contribution to what has made relational database system successful. Relational query languages enable a user to simply input what he/she wants to see without having to worry how it is done. Database systems are responsible for figuring out “how” and the magic behind it is called query optimization. There has been extensive work in query optimization since the early 70s as the paper states. This is not surprising at all considering its direct impact on the performance of database and the complexity of the problem itself.

A query is generally executed by a relational database system with two key components, query optimizer and query execution engine. The query optimizer is responsible for generating an execution plan that generates answers to the query. The query execution engine is responsible for the given execution plan. The query optimizer takes an SQL query as an input and aims to produce an efficient execution plan for the given SQL query. An execution plan is usually represented in a tree structure, called operator tree. Implementing an efficient query optimizer is a hard problem because 1) there are a large number of possible operator trees that are equivalent given a SQL query; 2) it needs to accurately estimate the cost of each plan; 3) it has to calculate the cost of each plan efficiently. This makes query optimization a difficult search problem.

The paper discusses many techniques that have been invented for query optimization. The use of dynamic programming and interesting orders are important techniques for an enumeration algorithm to efficiently search through the execution space. Query optimizers exploit commutativity of operators to reduce the search space. The paper also explains about how database systems compute statistical information from its data and query optimizers estimate the cost of an operator tree utilizing such information. For enumeration algorithms, the author provides two real-world examples, Starburst and Volcano/Cascades. The paper concludes with its remark on topics that go beyond the fundamentals, such as distributed database, user-defined functions, etc.

To sum it up, the paper provides a good overview over the topic of query optimization, which has been popular research area since the advent of declarative relational query languages. The author has done a great job of explaining the fundamentals and foundations of query optimization. Personally, I think that it is too bad that the scope of the paper has been limited to relational databases due to the large body of work in the area. It would have been more interesting if the paper includes the techniques used in more recent non-relational databases, such as HBase.


Review 13

Query optimization is an extremely important component of any database system. In this paper, Chaudhuri describes the different components that of a good query optimizer and also displays some of the work done in query optimization. The query optimization discussion is in the context of relational databases in order to all the author to hone the focus of the paper.

The paper begins with a short introduction into query optimization lays out three key tasks an optimizer should accomplish. It should examine a search space which includes low cost plans, possess an accurate cost evaluation technique, and use an efficient enumeration algorithm. The paper, then, introduces the System-R optimizer. The System-R optimizer introduced many useful ideas and techniques for query optimization such as the use of dynamic programming and the idea of interesting orders.

After the discussion of System-R, the paper examines each of the three tasks an optimizer should accomplish. Some of the techniques and ideas from the System-R framework are re-introduced as well as issues to consider when designing an optimizer. The paper finally concludes with a few, more advanced issues to consider in query optimization design, especially in relation to commercial systems.

The paper did not have any original research, due to it being an overview of query optimization in relation databases. However, it did a decent job of presenting the fundamental ideas in designing a query optimizer as well and present some of the work done in the field. I though the examples and diagrams helped to support some of the explanations in the paper. Without them, I feel like I would have had more trouble understanding some of the techniques used in query optimization.


Query optimization is an extremely important component of any database system. In this paper, Chaudhuri describes the different components that of a good query optimizer and also displays some of the work done in query optimization. The query optimization discussion is in the context of relational databases in order to all the author to hone the focus of the paper.

The paper begins with a short introduction into query optimization lays out three key tasks an optimizer should accomplish. It should examine a search space which includes low cost plans, possess an accurate cost evaluation technique, and use an efficient enumeration algorithm. The paper, then, introduces the System-R optimizer. The System-R optimizer introduced many useful ideas and techniques for query optimization such as the use of dynamic programming and the idea of interesting orders.

After the discussion of System-R, the paper examines each of the three tasks an optimizer should accomplish. Some of the techniques and ideas from the System-R framework are re-introduced as well as issues to consider when designing an optimizer. The paper finally concludes with a few, more advanced issues to consider in query optimization design, especially in relation to commercial systems.

The paper did not have any original research, due to it being an overview of query optimization in relation databases. However, it did a decent job of presenting the fundamental ideas in designing a query optimizer as well and present some of the work done in the field. I though the examples and diagrams helped to support some of the explanations in the paper. Without them, I feel like I would have had more trouble understanding some of the techniques used in query optimization.


Query optimization is an extremely important component of any database system. In this paper, Chaudhuri describes the different components that of a good query optimizer and also displays some of the work done in query optimization. The query optimization discussion is in the context of relational databases in order to all the author to hone the focus of the paper.

The paper begins with a short introduction into query optimization lays out three key tasks an optimizer should accomplish. It should examine a search space which includes low cost plans, possess an accurate cost evaluation technique, and use an efficient enumeration algorithm. The paper, then, introduces the System-R optimizer. The System-R optimizer introduced many useful ideas and techniques for query optimization such as the use of dynamic programming and the idea of interesting orders.

After the discussion of System-R, the paper examines each of the three tasks an optimizer should accomplish. Some of the techniques and ideas from the System-R framework are re-introduced as well as issues to consider when designing an optimizer. The paper finally concludes with a few, more advanced issues to consider in query optimization design, especially in relation to commercial systems.

The paper did not have any original research, due to it being an overview of query optimization in relation databases. However, it did a decent job of presenting the fundamental ideas in designing a query optimizer as well and present some of the work done in the field. I though the examples and diagrams helped to support some of the explanations in the paper. Without them, I feel like I would have had more trouble understanding some of the techniques used in query optimization.


Review 14

This paper discuss about the process of optimization of SQL queries in relational database systems. A query optimizer takes a SQL query as input and generate an efficient execution plan for the execution engine. There are so many possible choices for the optimizer given a SQL queries since an algebraic representation can be re-written into many other algebraic representations and each of them can still be transformed into many different operator trees (execution plans). As a result, the query optimizer is an extremely important component in a DBMS.

The task of optimization is formulated into a search problem in this paper with the following three components. The author uses the System-R optimization framework as an example and then discusses the three components in detail.

1. A search space that contains possible plans:
It is important to define which plans should be in the space since they are candidates of the final decision. There should be a balance between numerous choices of plans and an acceptable cost for enumerating the space. For example, bushy join sequences might provide rather good plans but they also significantly expand the cost of exploring the plan space; so most systems still focus on linear join sequences and only subsets of bushy join sequences. The author introduces several transformations that allow different orderings of operators such as natural join versus outer join and group-by versus join, using associative and commutative properties. Views and nested subqueries can also be unfolded of flatten to transform the original query.

2. A cost model that estimates the amount of resources needed for a given plan in the search space:
Cost estimation is a critical part that affects the accuracy and efficiency of a query optimizer since it is repeatedly invoked. To do the estimation, we need the statistical summary of data (logical property) and also the estimated cost of operations (physical property).

3. An enumeration algorithm that goes through the space:
The enumeration algorithm explores the plan space and picks an execution plan among all the plans in the space. Many recent query optimizers (note that this paper is from 1998) exploit an adaptive scheme and are called "extensible optimizers". The author provides two examples to illustrate such extensible optimization.

The main contribution of the paper is that it formulates query optimization into three clear sub-problems, provides some typical solutions to these sub-problems, and also takes some real world systems as examples. The author also use some SQL queries to explain how the approaches are applied.

The weakness of this paper in my point of view is that the author does not discuss much about the cost computation part in the cost model. There are many resources involved and it would be nice to provide some examples on how the recent query optimizers are computing the aggregate cost of different resources. However, the author does mention that this is one of the difficult open issues.


Review 15

This paper covers the optimization of SQL queries in relational database systems, and covers the foundations and present work in the area. The goal of the optimizer is to have a search space that contains plans of low cost, to have an accurate costing technique, and to have an efficient enumeration algorithm.

Relational query languages provide an interface that is high-level and declarative to access data stored in relational databases, with SQL as the standard. The key components of the SQL DB system are:
1. The query optimizer – takes parsed SQL query as input and generats an efficient execution plan to input to the query execution engine.
2. The query execution engine – implements a set of physical operators (sort, sequential scan, index scan, nested loop join), which takes multiple data streams and outputs a data stream. The query execution engine also implements the execution plan (aka the physical operator tree) that generates answers for the query.

To come up with the most efficient plans, the following are needed:
1. A search space for plans
2. A cost estimation technique to assign each plan in the search space with a cost
3. An enumeration algorithm to allowing searching through the execution space

Overall, I enjoyed the paper and it’s clear descriptions of the query engine and the query optimizer. I felt that it could’ve been more clear and provided numerical data on how the three tree views example under semijoin like techniques is more efficient than just going through the query.



Review 16

This paper discusses at a high level some of the prominent past research on query optimization in relational database systems. The author admits that there is much past research on query optimization outside the scope of SQL in RDBMS systems and attempts to restrict himself to this area.

The author starts by formalizing the set of problems he refers to as query optimization. This includes the definition of a search space for query plans, a cost function for estimating the resources needed by a given query, and the algorithm used to search through the search space and be optimized for query performance. The paper then goes on to talk about specific types of queries and situations that allow significant optimization as well as the method by which to measure the effect of the optimization, namely by measuring CPU time, I/O cost, memory and communication bandwidth. The author goes into some explanation of System-R, Starburst, Volcano, and Cascades technologies, as well as some other issues with distributed, parallel databases, user-defined functions, materialized views and a couple other subjects.

This paper provides a good first insight into optimization in RDBMS systems and the types of queries that have received lots of attention in previous work. The objective of this paper is stated to be an overview of the foundations with some samplings of certain query optimizations which are explored in the paper. Some of the specific examples in the paper were easy to follow and had clear motivation for other research in this area.

The author uses a few concerning terms in his objective statement. He says that he wants to present his bias, which is not the way that research should be presented and then he immediately apologizes to contributors to the research area and say that he may have failed to include important findings due to his oversight. If he wants to present a clear introduction and explanation of problems in this field he should do a thorough review of literature in this area. I struggled to understand a few of the examples in the paper and I think they would have been easier to understand if the author also provided the goal of the user in designing the query rather than just showing short variable names that the reader is supposed to infer an understanding of.


Review 17

Part 1: Overview

This paper summarized query optimization procedures for relational database systems. Relational databases provide declarative programming interface to its users. Query processing components includes query optimizer and query execution engine. Physical operator tree is used for query execution engine and it is also called an execution plan. The query optimizer takes care of generating an efficient execution plan for the query execution engine.

Query optimization can be viewed as an complex search problem. We need search space, estimated cost assigned to different execution plans, and enumeration algorithm to iterate through the search space. A query optimizer would always want to reduce the search space by combine query blocks or semijoin for improved selectivity. The statistics and cost estimation are still hot spots in research fields and leave many open question to researchers as for example, secret of correlations between query costs are still unfold. Three enumeration architectures are introduced in this work, Starburst Project, Volcano, and Cascades.

Part 2: Contributions

This paper provides a great overview of query optimizer designs for relational databases. It decomposes the database query processing into modules, query optimizer and query execution engine, and further divide query optimization problem into three aspects, search space, cost estimation, and enumeration algorithm respectively.

This paper points out some crucial ideas for query optimization. Keeping the search space small and compact is always good for performance improvement. Join operations can be utilized for reducing the search space we are going to feed the query optimizer. Combining query blocks can also efficiently reduce the workload. Views and nested subqueries can be merged in order to shrink the search space. Semijoin performs as a good filter along the way of doing selection.

Part 3: Possible ways to improve

This is an high level overview of query optimizer design for relational database system. There is no simulation or statistics from the literature to strongly support author’s ideas. Examples are not so convincing for showing the ability of reducing search space by merging blocks in the way that the environment could vary and there may be limitations of this method.

User defined functions may be an interesting topic. The reader may wonder why the cost estimation remains a difficult problem in this part. However limited by space, the author did not get a chance to explicitly explain the difficulty of analyzing such as WHERE clauses in user defined functions.



Review 18

The paper discussed an overview of SQL query optimization. A query optimizer generates operator tree from a parsed SQL query input which further is used as an input for the execution engine. In this regard, the paper covered important steps including: finding search space of operator trees, cost estimation using statistical data, and enumeration algorithm that can search through the search space in cost effective manner. For query optimization, the first step is understanding how the search space for all possible operator trees can be built which is done by transforming a query to another equivalent query. One of the main technique in transformation is commuting among operators. For example, since join operations are commutative and associative, the sequence of join in an operator tree can be altered accordingly. The other technique is reducing multi-block queries to single block query. This includes merging views and merging nested subqueries which effectively simplify the query and result in a potentially low cost query.

After being able to generate the search space, the second question is which query trees to select. In this regard, accurate and efficient cost estimation techniques are important. In order to do cost estimation, it is necessary to capture statistical information including: number of tuples and number of physical pages in a table. These numbers determine costs associated with table scan and memory requirement. In a large database system, it is important to construct a histogram per column bases in order to have a more detailed information for the cost estimation. The actual cost computation is then calculated in different categories including: CPU, I/O, and communication (for parallel database) costs based on the above statistical data. The other aspect of query optimization is the enumeration algorithm which is used to explore the search space. The purpose of the enumeration algorithm is to pick a cost effective execution plan as it traverse the search space. While the System R optimizer chooses only an optimal linear join order, the paper discussed that recent enumeration algorithms are able to adapt to new changes in the design including: due to addition of new transformation rules, new physical operators and new cost estimation techniques.

The strength of the paper lies in its identification of pioneering ideas from different contributors. In addition, it categorically isolated the advanced techniques from fundamental techniques. It also has an interesting discussion section on future optimization issues including considering additional resources like memory in cost estimation process.

Although the author start by apologizing to the contributors who are not mentioned in the paper, the ones which are already mentioned are more than I expected considering the size of the paper. Query optimization is a broad subject. I believe the author could have concentrated on one of the areas either cost estimation or enumeration algorithm or search space. The paper is packed with many technical contributions where some of them are left without an in depth explanation.



Review 19

The main goal of this paper is to provide an overview of the foundations and some important work in query optimization for relational systems. Query optimizer is one of the two components in query evaluation. The responsibility for query optimizer is to generate an efficient execution plan. A desired optimizer should be low cost, accurate and efficient. The discussion includes aspects like: search space, cost estimation and enumerating the search space.

The paper starts with an optimizer example from System-R. It contains an elegant query optimizing mechanism but failed to be freely scaled. Then the paper dives into sub topics of the above three: how to efficiently and effectively get search space, how to accurately estimate cost and finally how to enumerate the search space. Different techniques are introduced in each section.

We use dynamic programming in building search space. Limited to SPJ queries, the key assumption is that K join steps equals to (K-1)joins plus the final join. And the order of the first K-1 doesn’t influence the efficiency of the last join. This assumption greatly shrink the search space compared to the naive one. Our first priority here is to keep correctness during query transformation, and at the same time not too large to enumerate later on. Joins can be reordered. Non commutative operations is hard to handle, but different techniques can be applied to work it around.For cost estimation. The paper introduced the current methods for estimations. It is also pointed out that correlations between columns should be considered but remain as an open topic in this area.

This paper provides a great overview of the key aspects for relational systems query optimization. But it would be better to provide more related data to support the arguments.


Review 20

This paper surveyed query optimization. It mainly discussed query optimizer, which consists of three difficult questions: 1) search space contains low cost execution plans, 2) cost estimation for each execution plan is accurate, and 3) search algorithm for execution plans is efficient.
Then this paper used System-R Optimizer as a good example to illustrate current solutions to problems above.
After that, each of the three problems is explored in depth.
For search space:
There are a few known method to generate another query based the original query: commuting between operators, reduce multi-block query to single block ones and using semijoin like techniques to optimize query. Each of these methods suit different set of queries and can be used to generate a large set of new candidate queries.
For cost estimation:
The most important thing for cost estimation is accurate using a relatively simple algorithm. Method used in system-R is to use the stored statistical data to estimate the expected cost for each operator.
For enumeration:
Two algorithms, the starburst and Volcano/Cascade are studied.
Other optimization topics such as parallel database and user-defined function are also mentioned.

Contribution:
This paper gives a good overview of query optimizer, about what it does and how it does it. As the conclusion mentioned, query optimization is not only about theoretical parts such as transformation and query equivalent.

Weakness:
This paper briefly discussed several different optimizations. I think it might help future research if the paper not only lists these optimizations, but gives some relations between these optimizations.



Review 21

This paper aims to capture the progress of work done in query optimization since the beginning and emphasizing that query optimization is not a trivial problem.

The query optimization problem essentially requires the following
1. A space of plans (Search space)
2. A cost estimation technique to assign a cost to each plan
3. An enumeration algorithm to search through the different plans

The author starts off with the System-R implementation where the joins are represented as a linear sequence of join operations. In case there are k join operations, the optimal plans of (k-1) joins are considered and weighted in order to decide the final best plan. The System-R implementation also explores interesting orders where even if the cost of one subplan x is lesser than the subplan y, if the optimizer determines that using y will reduce the cost of subsequent operations. However

The paper also speaks about how in certain specific queries, evaluation of a group-by operator preceding the other operations such as join could significantly reduce the cost of the query. I think the group and join query tree was not really clear and this idea could have been explained better by expanding the example the author mentions at the end of this section. (Calculating total sales for all products in each division)

One of the significant ideas in this paper that I liked was the idea of flattening nested queries into a single query. There is definitely complexity involved when there are aggregate functions in the sub-query. The complementary approach focuses on avoiding redundant computation in the views or nested sub-queries which can definitely improve the cost of query executions.

This paper majorly seems to concentrate on examples concerning joins and cost and efficiency of implementing them but more examples regarding optimization involving aggregate operators and databases with varying structures could have been explored in further detail.



Review 22

This paper presents a general introduction to the basics of relational database query optimization. It illustrates the fact that relational query optimization is a not a trivial with many of its components. In this paper, many of the possible query optimization methods are explored and analyzed.

The most important step is the transformation of the query plans, which are usually done by searching through the ‘search space’ of the query for optimization. The most easily seen optimization is based on commuting between the operators. The trivial example in the paper is to alter the sequence of the group-by operator and the join operator to reduce computation. Another interesting technique is Merging multiple-block queries to single block. The practical use can be seen as transform the related sub queries into a single query. The last approach mentioned is to utilize semijoin like idea to optimize multiblock queries. Since subqueries can take a long time to calculate, the author suggested to break down the unrelated subqueries into many small independent queries by using views.

As important as the enumeration used to find the optimal query plan in the search space, statistics for cost estimation is also essential in getting the best query plan. Based on the metadata provided by the database, the query optimizer calculate an approximate cost of each plan that is being explored, and choose the best among them to decide the operation order of the query executor. Those data are originally on the table scale, but they will be eventually propagate up through the operators in the query plan. And the operational cost in terms of CPU time, I/O will be considered in the final evaluation of a query plan.

In summary, this paper serves well as an overview of how the query optimization is done and what are the supporting material like the using statistical information and dynamic programming algorithm to help the optimizer making a better choice. But one observation interesting is that, this paper only provides conceptual evidence to prove the effectiveness of the explored optimization techniques. If further performance test result can be append to end, it might help better persuade the reader to understand the strength of those techniques.




Review 23

The paper gives a overview of the optimizations of SQL queries in relational database systems. The query optimizer is responsible for generating executing efficient plan for execution engine. The query optimizer have three parts: search space, cost estimation technique and enumeration algorithms. The paper first briefly talks about query optimizer in System-R and then talks about techniques in these three part of optimizer in details.
Search Space is find the transformations to generate different operator trees. The paper talks about commuting between operators(especially about join sequence, outer join and group-by), and some potential benefit when commuting such operators. Then talks about how to unfolding some kind of multi-block query into a single block query. At last a complementary approach of collapsing multi-blocks query, semijoin like technique.
Then the paper introduce the method for cost estimation. Query optimizer use statistic data of database to estimate the cost and output of each query. The technique used in this part is how to choose statistic data and how to get the statistic data, cost computation.
Enumeration algorithm is for picking an inexpensive execution plan for a given query by exploring the search space. The paper introduces two representative examples of extensible optimizers, starburst and Volcano/cascades, which is adapt to changes in search space and cost estimation techniques.
At last, the paper introduces some further topics about query optimizer, issues of optimizer in distributed and parallel databases and optimization of stored procedure.
The paper briefly gives us a introduction about the query optimizer and some techniques in the query optimizer with detailed example. This paper is a good material to have a first view of query optimizer and have clearly point out the reference in different techniques. Further, the paper introduce some advanced issues which is important in commercial systems.
Drawback:
1. As the three component of query optimizer are not three totally separate steps and interweave with each other. The author should first point out this and then talks about these part separately. Otherwise would be confusing.
2. When talking about the trade-offs, the author just points out that there is tradeoff between two performance(i.e. In semijoin like technique, tradeoff between cost of computing views and the benefit of having the views to reduce cost of computation). It is better to give some specific example of under what circumstances should have such choice.


Review 24

The author discusses contemporary query optimization in relational databases. He presents the problem of query optimization as a search problem, where the search space of all possible queries is enormous (possibly infinite). He discusses the System-R optimizer as an example which uses a cost function that takes into account various statistics about the state of the system, estimates about the sizes of intermediate and final data, and CPU and I/O costs. The System-R optimizer reduces the search space of the optimization problem through the use of dynamic programming. However, as the search space is still too large to enumerate, the optimizer relies on pruning to reduce the size of search space further.

The author then moves on to discussing more general query optimization (that System-R cannot handle). In this section, he cites many examples. E.g. reducing queries that have nested queries to a single, flat query. The author then discusses the impact of cost estimation on the performance of a query optimizer. In order to have a good cost estimator, it is important to gather statistics about the data that has been stored. The author then discusses two extensible query optimizers that can adapt to new query ("physical") operations and changes to the cost estimations. The first has two stages: the first stage applies appropriate rewrite rules to the query, the second tries to combine operators into higher-level operators. The second query optimizer has two sets of rules, the first maps algebraic expressions onto algebraic expressions, and the second set maps algebraic expressions onto operator trees. The author ends with a discussion of advanced topics in commercial databases.

This paper presents a well-written overview of query optimization in modern databases. It moves forward by gradually introducing complexity and addressing the limitations of simple systems. It also highlights the importance of having accurate statistics and cost estimation which I feel is something that is easy to overlook in academic research.

I felt that the paper should have presented some empirical data. Optimization techniques require measurement, and considering that this paper is supposed to provide an overview of existing work, it should be possible to find representational workloads or configurations that demonstrate the different performance characteristics of different query optimizers.


Review 25

This paper describes the software concepts behind the query optimizer in relational database systems. It is important for the query optimizer to run efficiently and give a near optimal answer because an optimized query can run much faster than a non-optimized query. Most modern DBMS have query optimizers that turn declarative queries into query plans that the query executor can fulfill.

The query optimizer usually consists of three parts: generating the search space, giving costs to all of the operations, enumerating the search space while bounding using the costs calculated for the operations. An important consideration when creating the search space is to decide whether joins can be done in any order and whether group by clauses can be executed before or after joins. Queries can then be further optimized by looking at the sub-queries and doing one of the following: flattening the sub-queries so that it is one query or splitting up the sub-query and the query into multiple views and joining them.

Finally, the paper talks about calculating the cost while enumerating through the search space. We can use the database’s schema to calculate statistics about certain queries that may be able to help with calculating the cost. However, because this cost is calculated in each inner loop, we want it to be as efficient as possible.

The following are some concerns and weaknesses that I found with the paper:

1. The author talks about how under certain situations group by clauses can be executed before joins to make the query faster. Can the same be done with sort by clauses and what are the performance implications of that? How come the author did not mention any other clauses like LIMIT and SORT BY?

2. In the section that describes the semijoin like technique for multi-block queries, the author explains the technique and goes through an example, but it is not clear what the difference is between this technique and flattening. I would have liked to see experiments comparing the two techniques to explain which is useful under what circumstances. Can we take advantage of both to make a better decision when the table sizes are small or big? Does indexing play a role in the performance of both these techniques?

3. While the author does mention that the schema can help determine the cost, no other metric was mentioned. What are the pros and cons of also taking into account the size of the database as well? Can we use machine learning to refine the statistics and optimize future queries?



Review 26

This paper gives an overview of many of the difficulties involved in writing efficient query optimizers, how these problems have been dealt with to this point, and which problems are still open questions. Humans do not always write efficient queries, so it is necessary to have query optimizers that can recognize when more efficient queries are available and rewrite them as such. Furthermore, it is infeasible for a computer to enumerate all possible plans in the search space for a given query, so it is critical to define heuristics that efficiently narrow the search space to only the most cost-effective queries.

One of the issues addressed by the paper is trying to optimize nested queries in which the subquery contains aggregate operators. While some nested queries can be flattened by pulling join conditions up into the main block, this does not work when the nested query contains aggregates, since we must preserve all tuples that would be output by the nested query, even ones for which the predicate fails. This scenario highlights how difficult it is to rewrite queries for efficiency while preserving the same results.

The paper also emphasizes the difficulty of maintaining accurate statistics about data streams. The author notes that even the estimation of base data in a table is provably error prone. It is then even more difficult to propagate this statistical data through several steps of a query plan and obtain accurate statistics about the intermediate results that will be used in later operations. This is an important area of focus, because the availability of statistics about intermediate results is crucial to developing efficient query plans. As the author states, "optimization is only as good as its cost estimates", so if our cost estimates lose accuracy, the entire optimization engine can fall apart.

My main complaint about the paper is that author begins to rush through topics in the last few pages, giving only cursory examples. He goes into significantly detail on topics like merging views and flattening nested queries, but he essentially skips over interesting topics such as estimating the cost of user-defined functions and materialized views. UDFs, in particular, represent a very interesting problem, and I would have hoped for a more thorough discussion of the topic.


Review 27

The paper summarized important components of the query optimizer and some advanced issues related to it that are important in commercial systems. The query optimization has been researched for many years. These extensive works demonstrated its importance in DBMS. This paper explained the foundations of query optimizer and some recent advanced topics on it.

The paper first identifies the optimization problem. It says that the query optimization can be viewed as a search problem, which consists of a space of plans, a cost estimation technique, and an enumeration algorithm. The work is to make the search space include low-cost plan, estimate the cost accurately, and have an efficient enumeration algorithm. The example of System-R optimizer demonstrated these components. The search space is huge because the commuting between operators. The basic estimation framework is derived from the System-R approach, which includes collecting statistical summaries of stored data and estimating the cost of executing the operation. The foundation of query optimizer is the first part of the paper.

In addition to the foundation mentioned above, the paper talks about some recent advanced topic related to query optimization, including distributed and parallel databases, user-defined functions and materialized views. Distributed databases introduce issues of communication costs, which will have an impact on cost estimation. User-defined functions provide a powerful mechanism to reduce client-server communication; so new optimization questions arise when these stored procedures appear in the query. Also, taking materialized views into account, the goal is to optimize the query given the present views.

To sum up, this paper starts from the foundation of query optimization, and then discuss some advanced topics mentioned above related to it. In my viewpoint, this paper is good for understanding because it starts from introducing basic components of the topic, and then goes deeper.



Review 28

The goal of this paper was to summarize the basics of query optimization in relational databases and discuss what the problems are and how they are solved. It does a good job of giving an overview and explaining the basics but at times lacks in depth analysis in my opinion.

It does a good job of explaining what query optimization is and why it is important. It provides a good summary of the 3 goals a good query optimizer has and some of the challenges faced. The 3 goals are:
1.) Creating a search space with low cost plans. The main focus of search space was different types of joins and varying which order they go in. It also talked about using semijoin for optimizing multi-block queries and avoid unnecessarily moving data from block to block when it can be selected out earlier.
2.) Having an accurate costing technique. This is important because you want the costing technique to be accurate so when you think you are picking the lowest cost query from your plans you actually are. This is accomplished by collecting statistics on stored data and using those to estimate CPU I/O.
3.) Having an efficient enumeration algorithm for the enumeration of various plans. Obviously you want this to be quick because the longer this takes the more time you waste up front before running a query. This is tough to make quick while still providing the optimal plans however so tradeoffs must be made.

A few of the downsides of this paper are:
1.) There was no quantitative analysis. It talks a lot about the importance of improving performance and would have been nice to see some analysis of how some optimizations improved performance quantitatively.
2.) The graphics were poorly notated. It was not clear what the letters meant in some of the graphics (particularly the join graphics) and I did not feel they were explained well.
3.) The queries that they used as example queries were not easy to decipher. I don’t know how solvable this problem was but some of the detail in the queries felt unnecessary and made it harder to read and understand the query. The reason I say this was a hard problem to solve is because most of these queries were nested queries which are hard to understand if you didn’t write them anyways, but I can’t help but feel like some of the WHERE clause arguments were unnecessary.

Overall I would say this paper did a good job of giving a basic overview and helped provide an understanding of the goals and problems of query optimization, however it severely lacked quantitative analysis and visuals.



Review 29

The purpose of this paper is to give an overview of the components involved in optimizing queries in relational database systems. The optimization of a query can be formulated as finding the best execution path within a search space (which is vastly big and requires heuristic estimation in order to perform well). By streaming statistical information from past queries and using other information in order to predict the cost of certain execution paths, pruning can be implemented in order to trim the boundaries of the search space.

As this paper’s intent is to give a broad overview of query optimization, it does a good job of describing advanced features and potential future directions for query optimization. For example, distributed and parallel databases are discussed in that they may be exploited to optimize with higher efficiency. In addition, the use of examples (i.e. System-R) helped to elucidate the machinations of an optimizer. This paper is not drenched in technical details, making it easy to read and understand high-level concepts of query optimization, especially with the framing of ideas within the realm of SQL, a readily available and understandable language.

One thing that could help improve this paper is a more in-depth discussion of the strengths/weaknesses of each optimization technique. This may be partly intentional, as it is mentioned that the scope of the paper was limited to fundamental issues in query optimization. While this may be why there is little discussion of solutions to many issues discussed, there may have been a way to provide an introductory context of such problems to the reader without leaving one to do their own research.


Review 30

This paper outlined the various issues and approaches involved in the optimization of SQL queries. The paper explains the two core components of query optimization (query optimizer and query execution engine) by analyzing the approaches taken by the System-R Optimizer.

The author breaks down the problems in query optimization beginning with defining the search space. The search space depends on the possible transformations that result in equivalence and the physical operators that are supported. In order to both reduce the search space and discover meaningful transformations many different techniques are outlined. To make use of the search space, an efficient enumeration algorithm is imperative to constructing an optimal operator tree. Finally, for an enumeration algorithm to be useful, the cost estimation of operators must be accurate. Defining cost, and subsequently cost estimation are non trivial problems. Cost can be defined as CPU resources including CPU time, memory or I/O latencies. Cost estimation is derived mostly through statistical summary of data that has been stored.

While this paper does a good job of outlying the issues in query optimization, it felt disorganized and lacked explanation in some areas. There is a lot of explanation regarding the complexity of the search space and its importance, however there is no dedicated section to enumeration algorithms. Rather, it is sprinkled in with the discussion of the search space with no comprehensive discussion of enumeration techniques. The paper also lacked literature on cost estimation. The author states how cost estimation is paramount to query optimization, yet it is only lightly discussed in a small paragraph at the end of the paper. It would have been helpful if there was more explanation as to how/what statistics are used to compute cost and what approaches are common.


Review 31

This paper discusses query optimization in relational database system (particularly in SQL Server). There are two key components in query evaluation: query optimizer and query execution engine. The paper focuses on query optimizer. It takes the example of System-R Optimizer, in the context of SPJ queries and uses cost model. The paper further explains about the enumeration algorithm employed in the System-R: dynamic programming and interesting orders. Next, it elaborates on the transformation process techniques, like commuting between operators, reducing multi-blocks queries to single-block, and using semijoin like technique to optimize the multi-blocks queries.

The cost model assigns an estimated cost to any partial or complete plan in search space and also determines the estimated size of data stream for output of every operator in the plan. Thus, it relies greatly on statistical data of relations and indexes, on formulas to estimate of predicates, and finally on formulas to estimate the CPU I/O. The collected statistical summaries for the input data stream are then used to determine the statistical summary of its output data stream and to estimate cost of its execution. It is important that this cost estimation to be accurate and efficient. In latter section, the paper discusses the enumeration architecture, taking the example of Starburst and Volcano/Cascade. The paper also briefly touched some implementation issues such as distributed and parallel databases, user-defined functions, and materialized view.

The strength of this paper lies in the detailed explanation of the working of query optimizer. It shows us that query optimizer is not only about query transformation. One vital point of query optimization is the statistic and cost estimation to determine the one that consumes the least resources. The paper emphasizes that transformation do not necessarily reduce cost and therefore must be applied in a cost-based manner by the enumeration algorithm to ensure a positive benefit.

Query optimization is a vast subject. Even the writer admits that the paper would only touch the fundamental technicalities.



Review 32

The purpose of this paper is to provide an overview of query optimization and how it has worked in a specific setting. The author does admit that he has greatly restricted himself to the discussion of SQL queries in relational databases, but this is so that he can present a more complete overview of query optimization in a reasonable amount of space without getting bogged down in the details of how it works differently in many different systems.

The technical contributions of this paper are to run through some basics in Query Optimization as they occur in SQL queries in relational databases. The paper provides a summary and brief justification for certain techniques and approaches to query optimization. It also runs through some specific types of query optimizations that are frequently employed for different types of queries (such as Single- and Multi-block queries). The paper does do a good job of pointing to relevant resources for all of the different techniques discussed. The paper has an extensive references section that the interested reader can explore for more in-depth information about any given presented topic. I think this aggregation of other more in-depth resources might be the strongest contribution of this paper. As I discuss below, I didn’t not get a satisfactory feel for query optimization in such a short paper where many key concepts seem to be glossed over (and explained in references) due to space restrictions. In other words, I think the author could have more carefully picked which topics to explain more thoroughly.

As far as weaknesses go, I don’t actually think that this paper reaches its goal of being an effective survey paper. Even at the end, the author states the importance of understanding the existing engineering framework, but I don’t believe that the paper achieves this goal. Even though the author tries to restrict the scope of the paper, there is still too little detail to grasp the mechanisms of much of the query optimization details presented in the paper.



Review 33

Title: An Overview of Query Optimization in Relational Systems
This is a summary paper on the topic of query optimization of SQL queries in relational database systems. This paper focuses on SQL since it has become the standard relational query language. The paper first provides a brief summary of the two key components of the SQL database system, the query optimizer and the query execution engine. It then uses System-R optimizer as an example and explained the concept of query optimization and how it is done in the case of R optimization. In the following sections the paper provides detailed description of how the optimization problem is solved in the form of a constrained search problem. As the same time it also discusses about some advanced topics such as distributed system and database parallelization.

In section 5.2 the paper states the necessity of pre-computing/estimating the cost of a task, which is the cost estimation step. Conventionally there are many ways of doing such estimation. However the paper also states that it still remains a challenging problem of making accurate estimation in such situations. With the recent and rapid development of machine learning techniques, such estimation problem may as well be formulated into a regression problem, which, given enough training data, can be solved with a satisfying accuracy by a deep neural network or simply a regression model depending on the difficulty/complexity of the data.

In the second last section of User-Defined Functions, I am curious to know whether providing such freedom and accesses to users will raise security hazards besides the discussions and analysis of how to optimize the query in the existence of such user-defined functions. It would be an interesting discussion topic if it had been mentioned in the paper.



Review 34

Summary:
This paper gives a general overview of Query Optimization in relational systems. It first given an introduction of optimizer, then defines a desirable optimizer as:
1. The search spaces as well as plans have low cost
2. The technique to provide the estimation of cost is accurate
3. The enumeration algorithm is efficient.
After that, it introduces System-R framework for the basis of a query optimization framework. Then it presents the search space and algebraic transformations. In the next section it introduces the cost estimation techniques, especially using the histograms. Then it discusses the enumeration architectures, with an emphasis on two case studies of extensible optimizer architectures: the starburst and the volcano/cascades. At the end the paper discuss some future development of query optimization, including those in parallel or distributed databases and the issue with applying cost estimation to UDFs.

Strengths:
1. This paper gives a concise high level overview as well as some detail information of query optimization which covers most of the aspects.
2. The structure of the paper is well organized to help its reader to understand the concepts. It first introduces the optimizer architecture with an real world example: system R and then introduces general techniques of each component of the optimizer.

Weaknesses:
1. Though this paper gives a good overview of the query optimization techniques, it does not provide any data or experimental results to show the performance improvement with these techniques. It would be more convincing if the paper could provide those information.




Review 35

This paper introduces the common architecture and techniques used in query optimization of relational database. The query optimization usually takes a parsed SQL query (logical plan), selects the optimal one from a set of logically equivalent physical plans by estimating their cost, and then gives an efficient enumeration algorithm. Taking the popular System-R as an example, it search through equivalent join plans of Select-Project-Join queries. It generates equivalent plans by changing join sequences, use different join operators and move predicates as early as possible. It assigns cost to each plan by the help of statistics data and estimated predicate selectivity and operator cost. It also use dynamic programming to reduce cost estimation overhead and interesting orders to accommodate the effect of sorting on records.
The search space of plans depends on the set algebraic transformations and physical operators. Joins can be transformed by manipulating the sequence. Non-linear (bushy) joins are sometimes considered in some systems. In some situations, group-by can be carefully moved around join operations. It’s alos possible to merge some multi-block queries into single-block. Semijoin like techniques can be applied to multi-block queries.
The cost estimator is a key component in query optimization as it largely determines the quality. It calculates statistics data on columns. A popular way of record data distribution on a column is to divide the values on a column into k buckets and generate a histogram on top of that. To provide information of correlations among columns, it keep records of distinct pairs of selected columns. On large databases with extensive amount of data, sampling is used to provide estimated data distribution. The statistical information are propagated through operators in the process of cost estimation.
After searching through the search space, an enumeration algorithm is used to pick an inexpensive execution plan. As new transformation techniques are continuously introduced, current optimization architectures are extensive. They use generalized cost functions and physical properties with operator nodes. They also exposes knobs for finer tuning the system.
At the end the author discussed more advanced techniques such as support for user-defined functions and materialized views.