This work introduces aspects of consideration when designing a query optimizer. The paper first presents an example of system-r optimizer and how dynamic programming concept can help optimize SPJ (Select-Project-Join) queires; it then introduces optimization for space of plans, cost estimation, and enumeration algorithm. - For search space optimization, the paper first introduces how join, outerjoin, and group by operations can be re-organized; it then introduces multi-block query optimization techniques including merging and using semijoins. - For cost estimation, the paper points out the estimation needs statistical summaries of data, and estimations for output data and operation costs. Statistical information on base data focuses on data histograms and column correlations. Computation costs include CPU, I/O, and communication costs. - Enumeration algorithm is used to pick inexpensive execution plan for a given query by exploring the search space. The paper introduces Starbust and Volcano/Cascades. They are similar in that they both used cost punctions and physical properties with operator nodes, userule engine to modify query, and use knobs to tune system behaviors; they differ in the optimiaztion phases, whether do mapping from algebra to physical operators in single step, and how to apply optimization rules. The paper also points out things to consider when using distributed and parallel databases (scheduling being the most important), and when using user-defined functions (which is hard to estimate cost). Some other issues such as considering memory and object-oeriented system optimizer are mentioned. In general, the paper provides a broad view of aspects to consider in a system to oprimize queires. However, for many optimization methods, the paper only gives some trivial examples and doesn't provide detailed algorithm for the methods mentioned. This lack of algorithmatic details leaves many points not clearly explained. |
This paper focusses on the query optimization of relational database management systems. Optimization is more than simple query transformations. Query optimizer can make decision on choosing the most efficient SQL query execution plan among all the possible plans differs in algebraic representation and underlying implementations. This paper discusses following factors which are important for a good optimizer: 1. Low cost query plans in the search space 2. Accurate costing estimation 3. Efficient enumeration algorithm. Some of the strengths of the paper are: 1. The example of Select-Project-Join query is a good choice on explaining the context of the search space, cost estimation and enumeration algorithm. 2. The application of dynamic programming in enumeration algorithm significantly increases the efficiency of enumerating plans. 3. The discussion of the rising importance in commercial products on parallel database and user-defined functions is profound. Some of the drawbacks of the paper are: 1. The objective section of the paper didn’t clearly present what is the problem that this paper works on and what’s the methods this paper is using. Although they are mentioned in following parts, it’s better to have an overview in the beginning. 2. The author used so many passive words like “my biased and incomplete view”, “not to be comprehensive”, which makes the paper less convincing. 3. It’s better to introduce the statistics and cost estimation part before the search space part so that we can analyze how each technique reduces the query cost. 4. There’s no quantitative example and analysis in cost estimation and enumeration part, which make it hard to compare the effectiveness of different methods. |
The paper gave an overview of the optimization of SQL queries in relational database systems. The author explained the foundations and presented samplings of significant work, as it is necessary to understand the existing framework to make effective contribution to the area of query optimization. Query optimizer is to generate an efficient execution plan. A desirable optimizer is one where (1) the search space includes plans that have low cost (2) the costing technique is accurate (3) the enumeration algorithm is efficient. System-R optimization framework uses cost-based optimization, dynamic programming and interesting orders, which strongly influenced subsequent developments in optimization. However, it cannot be easily extended to incorporate other logical transformations (beyond join ordering) that expand the search space. The search space for optimization depends on the set of algebraic transformations that preserve equivalence and the set of physical operators supported in an optimizer. The Transformations include how to exploit commutativity among operators, how to collapse a multi-block SQL query into a single block SQL query under some conditions, and how to exploit the selectivity of predicates across blocks. The basic estimation framework consists of statistical summary(logical property) and the cost estimation of a plan(physical property). The statistical summary consists of statistical parameters for the stored data used in cost optimization, efficient ways of obtaining such statistical information, and how to propagate such statistical information. And the cost estimation focuses on how to choose an appropriate set of parameters to determine cost, and it is still difficult to do accurate cost estimation and propagate statistical information on data streams. In enumeration architecture design, we would want the extensible optimizers, which are adapt to changes in the search space. In this case, generality in the architecture must be balanced with the need for efficiency. Two representatives are Starburst and Volcano/Cascades. Optimization issues are (1)Distributed databases introduce issues of communication costs and an expanded search space. (2)Parallel databases raise the need for communication among the processors to exchange data. (3)Stored procedures have issues in determining the cost model and enumeration algorithm. (4)Materialized views introduce problems of reformulating the query and individually optimization which leads to increased cost. The paper is well-structured and author clearly summarized the milestones of query optimizer and related issues. Besides, this paper reminded people of the importance of fully understanding the existing framework before trying to make contribution to this area. The paper has several drawbacks: 1). It only discussed query optimizations of relational databases, but nowadays, for real-life application we also use NoSQL databases, and clearly the techniques are quite different. 2) For issues that are not currently solved, the author just presented the issues without giving some personal opinions on how we may solve them. |
Query optimization for relational database systems started out in the late 1970s, and was arguably a strong reason for the success of relational database systems, making these systems much easier to use and maintain compared to their predecessors (IMS and CODASYL). Even today, it continues to be an active field. As with any field which has been worked on for so long, a great deal of knowledge has been accumulated that can be initially daunting for someone new to the field to approach. In this vein, this paper “An Overview of Query Optimization in Relational Systems” by Surajit Chaudhuri seeks to summarize the foundations of query optimization, as well as present some works significant in shaping the field over the years. The paper starts by framing the problem of converting a query into actual instructions for the database system as a complex search problem, with three issues to be addressed: search space, cost estimation, and the enumeration algorithm. Of course, the goal is to have an optimizer that is good at all three aspects (low cost search space, accurate estimation, efficient enumeration algorithm). As an example of this line of thinking, Chaudhuri describes the System-R optimizer, an early framework that had a great deal of influence on future work in the field. Its search space consists of operator trees that correspond to a linear sequence of join operations. The cost is estimated by using various formulas for output data stream size, CPU and I/O costs for each operator, and statistics on relations and indices. Finally, the enumeration algorithm utilizes two techniques: dynamic programming and the use of interesting orders. The rest of the paper goes in further detail on each of these aspects. For example, commutative properties for operators can be taken advantage of to reduce the search space, while statistics can be used to more cheaply estimate cost (rather than enumerating everything). Also, the author mentions two enumeration architectures, Starburst and Volcano/Cascades. He concludes with a quick treatment on additional issues in query optimization beyond the basics like user-defined functions. This paper is meant to summarize the basics of query optimization to inform readers not familiar with the field, rather than communicate any novel research findings. For the most part, it was well written; breaking down the problem into the three components made it easy to follow along with the author’s points. Also, the decision to teach these concepts using a real world model helps in showing the motivations for subsequent developments in the field. One weakness that I believe this paper suffers from is that there is only one example being explored, perhaps due to restrictions in paper length. As a result, it does not achieve its original stated goal of providing an overview of the major developments (since only addressed one system). Showing off a succeeding system would have helped in seeing how newer system(s) tackled the same three fundamental issues. If further detail cannot be provided in a single paper, then perhaps it could have been better served by narrowing the scope to something more appropriate. Regardless, it does not detract from the value of this paper, which I still found to be easily worth reading. |
Lee Rutledge Barzan Mozafari - EECS 584 An Overview of Query Optimization in Relational Systems - Reading Review The purpose of this paper is to analyze how to optimize declarative queries in relational systems. We can think of this as an operator tree or execution plan, which is how we accomplish the goal of the query. This problem is important in understanding the challenges associated with optimizing the goal-based queries that this database management system model accepts. Since there are many ways to get to the goals we must efficiently evaluate that space of potential plans and determine the most optimal plan of execution. The paper’s approach to this analysis is first looking at the basics of the software components of the optimizer, including evaluating the search space of the query based on order of operations, possible merging of nested subqueries to a single block, statistics gathering and cost estimation. Next the author introduces more advanced issues like distributed databases which have additional sub-operations to take into consideration when optimizing, and optimizing stored (user-defined) procedures. I thought a strength of this paper was that it started with an example. By introducing us to the System R optimizer I immediately had some confusion clarified and got new questions about the issue, which were later answered in the paper, like the join optimization example which could use linear or bushy join. I thought a drawback of this paper was immediately saying that he may not be complete or precise in this paper in order to cover more information in a way that is easily digestible to the reader. Coming in the objective I felt as though I was getting myself into a less careful or less thorough overview. I still found that it ended up being helpful but this introduction caught my attention. |
This paper focus primarily on the optimization of SQL queries in relational database systems, it is important because it gives us an overall of the query optimization in relational database system which makes people easy to grasp all the stuff. query optimization can be viewed as a difficult search problem. In order to solve this problem, we need to provide: 1)search space 2)cost estimation so that we may be able to choose the optimized one. 3) An enumeration algorithm that can search through the execution space. A optimizer should satisfy (1) the search space includes plans that have low cost (2) the costing technique is accurate (3) the enumeration algorithm is efficient. The author gives us a overall analysis on System-R optimization. It presents a subset of those important ideas here in the context of Select-Project-Join (SPJ) queries. The enumeration algorithm for System-R optimizer demonstrates two important techniques: use of dynamic programming and use of interesting orders. It assumes that in order to obtain an optimal plan for a SPJ query, it needs to consider only the optimal plans for subexpressions and extend those plans with an additional join. However, the framework cannot be easily extended to incorporate other logical transformations (beyond join ordering) that expand the search space. The search space depends on the set of algebraic transformations but transformations do not always reduce cost. So that it must be used in cost-based manner to ensure a positive benefit. The optimizer may use different representations of a query from parse tree to operator tree during the lifecycle of optimizing. Some systems also use a “calculus-oriented” representation for analyzing the structure of the query. The article also gives us examples of transformations that need operators commuting. There are three examples here: 1) generalizing join sequencing, operations is syntactically restricted to limit search space. The sequence of the join need not to be linear because this operations are commutative and associative 2) outer join and join, if Join(R, S LOJ T) = Join (R,S) LOJ T applies, the joins will be freely reordered among themselves. 3) Group-by and join: if evaluated carefully, such as first grouping-by, the cost will be reduced significantly. There are also some examples to collapse a multi-block SQL query to a single block one in 1) merging views 2) merging nested subqueries: If the nesting structure among query blocks is linear, we can use the associative rule to compute all the joins first and then do all the outer-joins in sequence. We can also transform a query into one that uses table-expressions or views. It also introduces a method in an opposite operation by using semi-join for optimizing multi-block queries, to exploit the selectivity of predicates across blocks. There are many expressions of one query and we need to evaluate the resource it uses. In the evaluation process, two steps are given for estimation: 1) collection of statistical summaries of stored data 2)determine the statistical summary of the output data stream and estimated cost of executing the operation. The latter one includes CPU, I/O and communication costs in the distributed system. An enumeration algorithm need to select an best execution plan for a given query by exploring the search space. And it can also has the adaptability to the changing search space. Two examples are Starburst and Volcano/Cascades. They both use generalized cost functions and rule engine that allows transformation to modify the query expression, and exposes 'knobs' to tune the behavior. However, they are different in enumeration: the first algorithm uses two distinct optimization phases and applying rules in a forward chaining fashion which is different from the second. Also in Volcano/Cascades, The mapping from algebraic to physical operators occurs in a single step. The article also gives introduction of optimization problems in distributed databases, user-defined functions and Materialized views. This paper gives no contribution since it's a summary paper. Strengths are clear explanations and thorough conclusions. |
In this paper, the author mainly talked about query optimization in relational database. There are a lot of room in queries to be optimized because SQL is declarative language. The difference or performance between different query plan would be huge, and traditional query optimization can only be applied on SPJ. Therefore, the author write this paper to introduce general query optimization plans. The two key components are query optimizer and query execution engine. Talking about query optimization can be viewed as a search problem, the three main components of a search problem would be search space, cost estimation and enumeration algorithm. The search space is the space of all query plan, the author introduced current multiple ways to reduce multiblock queries into single block. The cost estimation is the estimation of computation cost of the plan. The optimizer should evaluate the cost both accurately and efficiently. It should collect statistical summaries of data that has been stored and determine the statistical summary of output data stream and estimate the cost of executing the plan. The enumeration algorithm would pick an inexpensive execution plan for a given query and explore the search space utilizing the dynamic programming and interesting orders. The paper introduced two Beyond those three foundations there are problems related to parallel, user defined functions and materialized views.Then the author concluded that the infrastructure for optimization is significant. The main contribution of this paper is that it provide a clear overview of existing query optimization. The paper presenting query optimization as a search problem is accurate and the paper is well structured based on this idea. The author also elaborate three subproblem clearly and examples are provided as well. The weakness of this paper is obvious, for a research paper, it does not have its own research. If the author chose not to include his own research, the paper might want to be more comprehensive. |
Dr. Chaudhuri presents a summary of some core aspects of query optimization in RDBMSs. Query optimization is one of the central pieces of a database system - it is quite literally what makes database systems usable in an efficient manner by application developers who don’t know the details of database system implementation. The paper follows a fairly logical structure: first, it discusses System-R, using it as an example as it was a significant milestone in the history of query optimization. The next three sections go into detail about subproblems in query processing. Finally, the paper addresses some current (at the time of writing) research topics related to query optimization. I thought that the author did really well framing query processing as a search problem. Although I admittedly haven’t read much about query processing, I hadn’t thought of it this way before. The sections of the paper are logically laid out as a search problem as well - search space, cost function, and enumeration. This made it easy to understand how all of the pieces fit together, which is a useful property for a survey paper. The paper is short, and therefore did not go into a ton of depth in some areas. The author clearly stated multiple times that he didn’t feel as though he was spending enough time on some issues, and that he was trying to balance length with providing a good summary. For a survey paper, I generally thought that the approach of adding in advanced topics near the end was okay. Clearly the author was trying to give the reader some idea about future directions in the area of research. However, having a hierarchical structure to present this information - the “Beyond the Fundamentals” section and the “Other Optimization Issues” subsection - seemed unnecessary. Overall, it didn’t fit too well into the structure of the paper, which was primarily describing query optimization as a search problem. |
Problem Addressed & Why it is so important: The author summarize some fundamental concepts and common practice in query optimizer for modern relational database system and significant work in the area. He/She also addressed some of the latest popular techniques beyond the scope of traditional relational database systems like distributed and parallel database, UDFs and materialized views. It also leaves the open questions and main difficulties in each section of the topics. In the author’s words “the goal is ... to explain the foundations and present samplings of significant work ” Main approach: The author starts by introducing a well-known classic System-R optimizer. This system leaves us the idea of cost-based optimization, dynamic programming and interesting orders strongly influence the later development of optimizers. Then the author introduce the 3 basic component of optimizers. First one is Search Space. This component determines what are the possible logical query plans do we consider. Exhausting all possible transformations is impossible so we only pick the Search space for optimal plan that is most likely to occur. Important topics in search space including commuting between operators like joins and group by, reducing multi-block queries to single-block ones and using semijoin to optimize multi-block queries. Second component is cost estimation. With the search space we need statistics on base relation and propagate the statistics via operators to estimate the cost of a plan. The last component is enumeration architecture. This component describe an algorithm to pick inexpensive execution plans from the search space. The author introduces two extendable infrastructure with rule-based engine for transformation. Last topic is some popular topics beyond the fundamentals of optimizers like distributed database and UDFs. Strength and technical contribution: The paper reviews principles and common practices of optimizers for modern relational database system and point out the main difficulties and open problems of each components in the optimizer. The paper organized the topics well to guide the readers through the components of optimizers with concrete examples and explanation at appropriate detail level. Drawbacks of the paper: This paper is more like an introduction or tutorial than a paper. As a tutorial, it does a good job in guiding the readers to learn main components of relational database query optimizer. As a paper, it technical contribution is limited in summarization and pointing out open problems or possible research directions. |
This paper is a brief overview of SQL query optimization. It tries to explain the foundations of query optimization and point out significant works in this field for readers who want to have a deeper understanding. As any search and optimization problem, query optimization needs three basic components: a search space of query plans, an accurate cost estimation for a given query plan and an efficient enumeration algorithm. All of them are hard problems and need to take lots of factors into consideration when designing the system. The paper began by the case-study of System-R, whose ideas significantly influenced the development of optimization. System-R’s search space contains only linear sequence of join operations and evaluates predicates as early as possible. Its cost model considers the estimated size of a data stream, ordering of tuples and execution cost of an operator when calculating the cost. In terms of the enumeration algorithm, System-R uses the idea of dynamic programming and interesting orders. One downside of System-R is that it’s hard to extend the search space by allowing transformations other than join ordering. After the case study, each optimization component is discussed in more details. Search space: Besides linear join, commutativity allows us to consider ideas including bushy join trees, perform Cartesian product early, transform outer join, early group-by, etc. The technique of how to transform multi-block queries to single block query is also discussed. Cost estimation: The discussion of cost estimation focuses on how to get statistical summaries of data and just touch the topic of how to estimate the cost of an operation. In DBMS, information such as the number of tuples in a data stream, number of physical pages, distribution on a column is often recorded to cost estimation purpose. Sampling method can also be used to get statistics on base data. Once we have the statistical summaries of base data, we can propagate them through operators and obtain information for intermediate data streams. A number of questions arise in the process, for example, no correlation information available, independence assumption for multiple predicates, etc. Any of the problems can lead to inaccurate cost estimation. Enumeration: The author gives two examples of extensible optimizers (Starburst and Volcano/Cascades) and describes how they work on a very high-level. Finally, the author gives a brief introduction of several advanced issues in query optimization. As an example, distributed and parallel database expanded the search space brings the question of how to search in the expanded space and estimate costs. UDF and materialized view also posed new problems in optimization and remain unsolved. This paper does a good job of describing the high-level optimizer components and optimization ideas. However, whenever getting into technical details, I found it’s extremely to follow to paper. Sometimes only specific examples are given in the paper for a problem without a generation solution. I think this paper suits people who already have some knowledge of query optimization and wants to make up for missing knowledge by reading the referenced papers listed in this one. |
In the paper "An Overview of Query Optimization in Relational Systems", Surajit Chaudhuri discusses the extensive work that has been done in the field of query optimization. He examines both the foundations and present work in order to highlight the potential growth for query optimization. Since relational query languages enable users to write high-level declarative statements, it is up to the query evaluation component to pick up the slack and give a quick and reliable answer. Thus, SQL programmers who are ignorant of the internals of a DMBS indirectly burden the small subset of programmers that deal with query optimization. Since the problem is complex and has observable consequences (performance wise) there has been, unsurprisingly, many decades of research on it. The query evaluation component is comprised of two parts: the query optimizer and query execution engine. The query execution engine implements a set of physical operators. These operators are essentially the building blocks of high level code that execute SQL queries. One could think of these sets of operators as many imperative plans that they could take to complete the job. Consequently, it is the query optimizer's responsibility to generate the input for the execution engine based on the most efficient execution plan. Thus, one can see how hard building such an optimizer could be - it involves a space of plans, a cost estimation technique to estimate the resources needed, and an enumeration algorithm to search through this space. Desiring a low cost, accurate, and efficient solution forces this problem to be nontrivial. The major technical contributions of the paper are consequently related to techniques used to meet these constraints. Chaudhari explains that the search space must be carefully chosen since the selected plan will exist in that space. Optimally, the number of existing plans and cost to expand these plans must be fair - skewing one side may result in bad plans or expansive search spaces. The cost estimation is derived from summary statistics and operation costs. This is by far the most vital part of the optimizer because it solely relies on this data in order to give an accurate diagnosis on certain plans. Lastly, the enumeration algorithm needs to search through this space quickly. In order to do so, a extensible optimizers were created to be more flexible and adapt to changes in both the set of physical operators and cost estimation techniques. Unfortunately, this paper does not attempt to explore anything new in this field - it is merely a recap of decades of research. As a result, Chaudhuri neglects to mention other recent extensions such as parametric query optimization and multi-objective query optimization. These approaches attempt to compare how much money their execution costs in a cloud computing scenario which seems relevant to many companies. Most importantly, in the modern day setting, machine learning has taken over many fields successfully. I would be interested in how machine learning could be implemented to polish statistical techniques to achieve better optimizations. |
This paper looks at some methods for optimizing complex queries. As an example, it initially describes System-R optimizers, which focus on optimizing linear join trees by noting that optimal trees of large size are necessarily built from optimal trees of smaller sizes. The paper looks into the space of all possible query plans by detailing how joins can and cannot be reordered, and how to simplify views and nested queries when possible. Outer joins are also considered, which often cannot be reordered with inner joins, as they create extra null values in specific rows. In addition, it briefly touches on grouping; while grouping is usually done after all joins, grouping can be done earlier in some cases, which reduces the cost of subsequent joins. Once possible query plans are enumerated, table statistics such as row count and value frequency are needed to help estimate query costs. Once these costs are available and possible query plans are known, an optimal query plan can be chosen by an enumeration algorithm. The Starburst, Volcano, and Cascades algorithms are very briefly detailed. After this the paper notes some other effects on optimization. Parallelism can allow unique forms of optimization, while user-defined predicates can be expensive, and not as easy to shift around a query plan as built-in predicates. This paper provides a general, not specific, overview of query optimizers. As such, it’s able to cover a full optimization strategy in a small amount of space. It does go into specifics at some points, notably with example for determining possible query plans. However, other parts of the paper, like the section on enumeration architectures, tend to be much less specific, which can make them harder to understand. |
This paper mainly introduce and discuss the query optimization of relational database systems. It mentioned that a desirable optimizer is one where the search space includes plans that have low cost, the costing technique is accurate and the enumeration algorithm is efficient. System-R is introduced as it significantly advanced the state of query optimization of relational systems. Some lessons were kept including the use of cost-based optimization, dynamic programming and interesting orders. The paper talks about the search space in different scenario including join, outerjoin and group by. To reduce the complexity of search space, under some conditions like merging views and merging nested subqueries, we could reduce multi-block queries to single-block query. Besides those conditions, a complementary approach using semi-joins are adopted for optimizing multi-block queries. In terms of cost estimation, resources may be CPU time, I/O cost, memory, communication bandwidth, or a combination of these, given an operator tree (partial or complete) of a query, being able to accurately and efficiently evaluate its cost is of fundamental importance. The basic idea of estimating cost is to collect statistical summaries of base data involving CPU, I/O and, in the case of parallel or distributed systems, communication costs and buffer utilization and accurate propagation of statistical information via operators. An enumeration algorithm must pick an inexpensive execution plan for a given query by exploring the search space. More recent optimization architectures have been built with the paradigm that an enumerator should be able to adapt to changes in the search space including addition of new transformation, addition of new physical operators and changes in cost estimation. Starburst and Volcano/Cascade are two main examples of such extensible optimizer. Some advanced issues in DBMS bring new questions to optimization including distributed and parallel Databases, UDF, materialized views, etc |
This paper focuses essentially on optimization of relational query language (SQL). Paper starts off by explaining what is the goal of an optimizer and what is a desirable optimizer with a graph of operator tree. This section is very helpful for readers to understand motivation of optimizing queries and the structure of the paper. Author then gives System-R optimizer as an example to explain search space, logical equivalency, estimation on selectivity of predicates, and estimation on cost of query execution. This section emphasizes on System-R's techniques of dynamic programming and interesting order. The following sections then go more into details of search space, algebraic transformation, cost estimation, and enumerating the search space. Each section of this paper builds a good foundation or reasoning for the following sections, which is helpful for reader to follow along. One thing that I hope this paper elaborates more on is details of some approaches. For example, section 5.2 only introduces the key components to have a good cost computation and leaves it as an open question. It would be great to include at least one approach to solve this issue in more details. |
“An Overview of Query Optimization in Relational Systems” by Surajit Chaudhuri provides an overview of query optimization in relational DBMSs. Query optimization is the process of turning a parsed SQL query into an efficient execution plan, or in other words, a physical operator tree; the execution plan is then executed by the query execution engine. The paper discusses some important factors and components in query optimization: 1) the search space of valid and efficient execution plans that can be chosen from, 2) techniques for accurately estimating the cost of a particular execution plan, and 3) the enumeration algorithm that explores the search space and chooses a low-cost execution plan. The paper starts by discussing the System-R project, which was fundamental in query optimization research, and demonstrates the above three optimization components. Here I provide a brief overview of the query optimizer components as discussed in the paper. The search space of execution plans contains all execution plans equivalent to the plan corresponding to the default parsed SQL query representation. These plans are often algebraic transformations of the original query representation, for example reordering joins operations to achieve a lower-cost sequence of joins, reducing a multi-block query to a single query by unfolding views then reordering joins, merging nested subqueries, or extracting parts of a query command into separate views to avoid duplicate computation. Resources considered when estimating a plan’s cost include computation time, reads and writes, memory required, and communication bandwidth. In general, use of these resources should be minimized. Cost of a plan is estimated based on the plan’s operator tree in conjunction with statistics about the data, e.g., number of tuples, number of physical pages, column information, as how the data pass through operators matters. As cost estimation is performed for each candidate execution plan, the cost estimation procedure should also be efficient. The paper discusses enumeration algorithms through existing optimizer examples: Starburst and Volcano/Cascades. Some characteristics of these enumeration algorithms are that they used generalized cost functions, support transformation of both initial query expression and later operator trees, and they allow a database administrator to tune their behavior (but this is dangerous). As an overview paper, rather than providing research contributions it provides good background on the components of a query optimizer, common and successful approaches in their design, and examples of these approaches as instantiated in prior systems (e.g., System-R, Starburst, Volcano/Cascades). Low-level, concrete examples explaining how different joins expressions are equivalent and how to transform queries were very helpful. Finally, it was valuable that the authors discussed different approaches for cost estimation for huge databases (e.g., sampling), as techniques are not one-size-fits-all. I think the enumeration algorithms section could have benefited from more concrete examples of how the Starburst and Volcano/Cascades architectures work. I appreciate that there was a subsection on distributed and parallel databases, and acknowledgement of challenges that arise for optimization, but I think this section also could have benefited from concrete examples of execution plans for such databases. |
In this paper, the author presents an overview of foundations and significant works in the optimization of SQL queries in relational DBMS. This is a summary like paper focus on query optimization problems, this problem is very important since given a SQL query, there can be a large number of execution plans, however, the throughput or the response time for these plans can be quite different. Thus, choosing a smart plan by query optimization is critical. That’s the reason why we need to discuss the query optimization problem. The author models the query optimization problem as a search problem and gives three key factors of the problem: search space, cost estimation and enumeration algorithm. They first give an example of System-R optimizer and then discuss the three key factors of search problem and finish this paper with developments in query optimization around 1998. First of all, a System-R optimizer example is given, in this example, the author presents his idea in the context of Select-Project-Join (SPJ) and illustrates how search space, cost estimation and enumeration algorithm are presented in a k-join problem. Next, the author discusses the search space of query optimization by using some algebraic transformations. These transformations include commuting between operators, multi-block queries reduction and multi-block queries with semijoin like techniques. With the discussion of these concepts, the author also gives rich examples illustrating how these transformations work. Then, the cost estimation is discussed which is critical for deciding the best operator trees with least resources. It needs to be done accurately and efficiently. For cost estimation, the author first introduces the important statistical parameters, including the histograms, estimation by sampling and static information propagation. Besides, a brief cost computation is discussed. Next, enumeration architecture is discussed including two representative examples of extensible optimizers: Starburst and Volcano/Cascades. Finally, the author discusses some advanced issues which are important in commercial systems. These issues include distributed and parallel databases, UDFs and materialized views. Although this paper was written almost 20 years ago, it is still worthwhile to read it nowadays. The main contribution of this paper is that it summarizes the crux in query optimization from the view of the classical search problem. This paper gives rich examples to help people understand how things work in query optimization. Also, it includes the judgments when comparing different techniques. For example, when discussing Starburst and Volcano/Cascades, the author shows the shared features and differences between them. The weakness of this paper is that I think the author should penetrate deeper into cost computation, in the paper, the author only uses few lines describing the cost computation process in a high-level idea, so that I do not have a clear idea about how this part works. I think it will be better if a detailed example can be given for cost computation. |
This paper was a survey on query optimization, specifically focusing on the design&implementation of the query optimizer and its subparts. The paper explains each of the subparts by providing specific examples and reasoning why these examples are important/noteworthy: the System-R optimization framework is particularly favored by the author due to its influence on modern query optimization.. The most important aspects of this are the cost model, which is calculated using statistics and assigns a cost to every plan present in the search space; as well as formulas for estimating selectivity of predicates and the actual CPU/I/O costs of executing queries for each operator in a tree. The author also provides several examples of algebraic transformations of queries, and discusses the advantages/disadvantages of each. Cost estimation & statistics are revisited in greater detail, and the paper discusses how statistical information about each column (but not about interaction between columns) is stored and used for cost estimation. Finally, two examples of real-world enumeration algorithms are given, as well as some brief discussion on more advanced topics, such as parallel/distributed databases. Because this was a survey, and thus the paper did not do any real synthesis of new ideas, it is slightly difficult to think of any real weaknesses. Everything seemed to be explained fairly well, and the author seemed to provide a reasonable number of examples to illustrate the paper’s main points/topics. |