Review for Paper: 15-Efficiently Compiling Efficient Query Plans for Modern Hardware

Review 1

As main memory grows, query performance depends more and more on the raw CPU cost of query processing. However, traditional iterator-style query processing is naïve and performs badly on modern CPUs, and current techniques still cannot outperform hand-written execution plans. Therefore, the paper proposes a new compilation strategy that translates queries into efficient machine code, which results in better performance.

Some of the strengths of this paper are:
1. Data-centric processing takes advantage of fast CPU registers to increase performance
2. The paper is well organized, flowing from background to architecture to detailed techniques to advanced techniques to evaluation, which helps readers follow it.
3. The paper uses detailed examples to illustrate how the query is tuned and optimized

Some of the drawbacks of this paper are:
1. The paper didn’t directly compare the performance of its techniques against alternatives like batch-oriented processing, which would have illustrated its advantages more intuitively.
2. The technique is not compatible with disk-resident data because it eliminates block-oriented processing. In fact, most large databases still use disk-resident data, so it would be better to also develop compilation techniques that combine main-memory-resident and disk-resident tables.



Review 2

Classical iterator-style query processing shows poor performance in CPU cost, which now determines query performance due to the growth of main memory. Existing techniques like batch-oriented processing or vectorized tuple processing are frequently outperformed by hand-written execution plans. Therefore, this paper proposes a compilation strategy that translates a query into compact and efficient machine code using the LLVM compiler framework.

This novel framework has 3 distinct features:
1. Processing is data centric instead of operator centric.
2. Data is not pulled by operators but pushed towards the operators.
3. Queries are compiled into native machine code using the optimizing LLVM compiler framework.

The basic idea behind this design is to consider spilling data to memory as a pipeline-breaking operation. The principle is that during query processing, all data should be kept in CPU registers as long as possible. To achieve that, this design reverses the direction of data flow control: it keeps pushing tuples towards the consumer operators until they reach the next pipeline-breaker. Therefore, data is always pushed from one pipeline-breaker into another. This push-based architecture reduces CPU cost and register pressure by keeping tuples in registers and moving complex control-flow logic outside of tight loops. Since tuples must be materialized at some point anyway, the framework compiles queries so that all pipelining operations are performed purely in the CPU (i.e., without materialization), and execution itself goes from one materialization point to another.
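The push-based produce/consume flow described above can be sketched in miniature. This is an illustrative Python interpretation only, not the paper's generated LLVM code; the operator classes and method names here are invented for the sketch:

```python
# Minimal push-based (produce/consume) pipeline sketch. Each operator
# pushes tuples to its consumer; the hash-table build is the pipeline
# breaker, the only place where tuples are materialized.

class Scan:
    def __init__(self, rows, consumer):
        self.rows, self.consumer = rows, consumer

    def produce(self):
        for row in self.rows:            # tight loop: the tuple stays "in registers"
            self.consumer.consume(row)

class Filter:
    def __init__(self, predicate, consumer):
        self.predicate, self.consumer = predicate, consumer

    def consume(self, row):
        if self.predicate(row):          # no materialization, just pass the tuple along
            self.consumer.consume(row)

class HashBuild:                         # pipeline breaker: materializes tuples
    def __init__(self, key):
        self.key, self.table = key, {}

    def consume(self, row):
        self.table.setdefault(self.key(row), []).append(row)

build = HashBuild(key=lambda r: r["dept"])
plan = Scan([{"dept": "a", "x": 1}, {"dept": "b", "x": 2}, {"dept": "a", "x": 3}],
            Filter(lambda r: r["x"] > 1, build))
plan.produce()
print(build.table)   # tuples appear in memory only at the pipeline breaker
```

Note how control flows top-down from the scan: no operator ever "pulls" with a next() call, which is exactly the reversal of data flow control the review describes.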

By exposing the operator structure, this framework can generate near optimal assembly code, as it generates exactly the instructions that are relevant for the given situation, and it can keep all relevant values in CPU registers.

To compile the query into machine code, the LLVM compiler framework is used to generate portable assembly code: it is robust, portable across machine architectures, and strongly typed, it produces extremely fast machine code, and it usually requires much less query compilation time. The model is actually a mixed execution model: complex query-processing logic is written in C++, while the connections between operators are generated in LLVM. Note that it is neither possible nor desirable to compile a complex query into a single function. Therefore, it is necessary to keep track of all attributes and remember whether they are currently available in registers.

In terms of more advanced processing techniques, the principle is to process more than one tuple at once while keeping the whole block in registers.
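The block-wise idea can be sketched as follows. This is a hedged illustration in Python, not the paper's SIMD code: the function name and block size are invented, and the point is only that the predicate is evaluated branch-free over a block, with one branch per block instead of one per tuple:

```python
# Sketch of block-wise processing: evaluate the predicate over a whole
# block of tuples at once (SIMD-friendly), branching only once per block.

def filter_blockwise(rows, pred, block_size=4):
    out = []
    for i in range(0, len(rows), block_size):
        block = rows[i:i + block_size]
        mask = [pred(x) for x in block]   # branch-free predicate evaluation
        if any(mask):                     # one branch per block, not per tuple
            out.extend(x for x, m in zip(block, mask) if m)
    return out

print(filter_blockwise(list(range(10)), lambda x: x % 3 == 0))
```

On real hardware the mask computation maps onto SIMD compare instructions, which is what makes delayed branching pay off.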

The main advantages of this framework are as follows:
1. The framework has better code and data locality and a predictable branch layout. Therefore, it produces code which is very friendly to modern CPU architectures and hence rivals, and sometimes even outperforms, hand-written code.
2. By using an established compiler framework, it can benefit from future compiler, code optimization, and hardware improvements, whereas other approaches that integrate processing optimizations into the query engine itself will have to update their systems manually.

The main drawback of this paper is that its heavy use of technical terms makes it harder to understand thoroughly. It would be better if the paper gave simpler examples to help readers grasp the ideas.


Review 3

Problem & Motivation
Given the growth of main memory, the raw CPU cost of query processing plays an increasingly important part in database performance. However, even recent techniques like batch-oriented processing are not satisfying and are frequently out-performed by hand-written execution plans. It is essential to propose better query compilation techniques.

Main Achievement
The author presents a compilation strategy that can translate a query into compact and efficient machine code using the LLVM compiler framework. Some insights of this system include:
1. How can query processing be organized such that data is kept in CPU registers as long as possible? By reversing the direction of data flow control (so as not to break the pipeline; cf. EECS 370).
2. Using a code generation process built on a produce/consume interface once the final algebraic plan is obtained.


Drawback
Very abstract, especially Section 3.1, though that may be because I have not used mathematical symbols for a long time.




Review 5

With the falling cost of memory, a greater proportion of data is being held in main memory as opposed to being stored in disk. As a result, the performance bottleneck is shifting from disk I/O to CPU cycles involved in query processing, i.e. computing query plans. The paper “Efficiently Compiling Efficient Query Plans for Modern Hardware” attempts to address this issue by presenting a new compilation strategy that uses the LLVM compiler framework to turn a query into machine code efficiently. They claim that their method results in very efficient execution plans without requiring very long compilation times. The authors argue that the classical iterator model used for query evaluation, which is the most popular execution strategy, emphasizes flexibility and simplicity at the cost of performance. In the past, when disk access times dominated everything else, this was not an issue, but this is no longer the case. In many cases, handwritten execution plans greatly exceed the performance of classic iterator systems like Volcano.

In their approach, the authors emphasize the concepts of data and code locality, with the goal of keeping data in CPU registers as long as possible. Towards this end, they push tuples down towards consumer operators rather than pulling them up. As a result, data is always pushed from pipeline breakers (i.e. operators that remove tuples from registers) to other pipeline breakers. This way, code fragments are strongly pipelined, which improves performance since data stays in registers until the last possible moment. These fragments/algebraic expressions are then compiled together into an overall plan to be turned into machine code. The authors considered converting to C++ code to take advantage of the fact that the database system and data structures can be accessed using C++, but they found that compiling to C++ was too slow. Instead, they turned to LLVM to generate assembler code, which is then executed by using an optimizing JIT compiler from LLVM. The authors also take advantage of parallelism in order to process multiple tuples at a time, further increasing performance.

The strength of this paper lies in their new approach to optimizing for query processing performance and the impressive results they reported. As their experimental results show, their implementation (done on the HyPer memory resident DBMS), achieved speedups of 2-12x compared to the VectorWise and MonetDB systems, depending on the query. When compared to a commercial system that they called DB X, the improvement was even more dramatic, with speedups of 15-140x. Besides that, the authors show that their system generated high quality code that consistently resulted in far fewer branch mispredictions and cache misses. In essence, not only did the authors bring forward a novel approach to query processing that is more in line with current technological realities (compared to the venerable but aging Volcano method), but they also strongly suggested that their method was capable of generating better quality code than other established systems.

While the authors did compare their system to other competing query processing approaches like MonetDB, they do not provide a direct comparison between HyPer + LLVM and a handwritten execution plan, even though they claimed that their generated execution plans are competitive with handwritten C++ code. I also wonder how well this method will work with progressively more complex queries containing more and more dependencies, and whether it will always outperform conventional approaches like Volcano.


Review 6

The contributions this paper presents are compile-time optimizations on queries in three distinct ways: processing instructions so as to keep data in CPU registers as long as possible, improving data locality by keeping operators and data close to one another, and using the LLVM framework to compile queries into native machine code. The paper describes how queries are typically translated into an “algebra” expression, a tree of operators evaluated by repeatedly calling a “next” function. The contributions of this paper are important because it optimizes the compiled instructions in a way that rivals the speed of expert handwritten solutions, which are typically the most optimal, and the approach maps well onto modern computer architectures.

The author then presents the architecture of the solution as well as the code generation for the compiled output. He then details how to integrate this solution with different processing techniques and evaluates the results of tests of this compilation technique.

I did like how the author broke down simpler concepts and did not assume the reader’s familiarity to be very high. I still found the material a bit dense, as I was encountering a lot of new information, but the paper was decently gentle in presenting all of these new concepts.

I found this text to be quite a bit more low level than many of the other past readings. Since my background in databases as well as the hardware implications of querying databases is not as deep as the typical reader, I found it much harder to follow this paper. The higher level papers are easier to conceptualize but quite a few of the more technical sections in this paper sort of lost me.



Review 7

This paper proposes a data-centric query processing method that results in excellent query performance with small compilation time. The method addresses the poor performance of traditional query processing on modern CPUs. The work uses the LLVM compiler framework, focusing on data locality and predictable branch layout.

This paper starts by introducing the traditional iterator model, called Volcano-style processing. It is good, but only suitable for previous generations of CPUs. Therefore, the author proposes methods that differ from existing approaches: 1) keep the data in CPU registers (data-centric processing); 2) push data toward operators rather than pulling it; 3) use the LLVM compiler framework.

The query compiler is built around the following ideas. The paper uses a definition of pipeline-breaker that is more restrictive than in other standard DBMSs, in order to maximize data and code locality. The architecture is designed so that data is kept in CPU registers for as long as possible, by pushing data toward operators; data is always pushed from one pipeline breaker to another. This design reduces register pressure. Another innovative design choice is that queries are compiled by ensuring that all pipelining operations are performed purely in the CPU. When compiling algebraic expressions, the compiler keeps all relevant values in CPU registers by generating exactly the instructions that are relevant for the given situation. The operators still offer an interface that is as simple as in the iterator model: the produce and consume interface is only used during code generation, when converting the algebra into pseudo-code.

With regard to machine-code compilation, the authors first attempted to compile to C++, but it is slow and does not offer total control over the generated code. LLVM was chosen instead for its advantages: 1) portable across machine architectures; 2) strongly typed; 3) produces extremely fast machine code; 4) achieves results rivaling hand-written code. For complex operators, it is not easy to compile a complex query into a single function, so multiple functions are needed in this situation. The paper then introduces several ways to improve performance, identifying two bottlenecks (hashing and branches) and resolving them by tuning branch-prediction performance.

Tuples are processed more than one at a time while keeping the whole block in registers. This has two advantages: 1) it allows the use of SIMD instructions (inter-tuple parallelism) on modern CPUs; 2) it helps delay branching. The techniques were implemented in the HyPer main-memory DBMS and in a disk-based DBMS, and they work well. The evaluation is based on code analysis and microbenchmarks of specific operator behavior.


Review 8

Neumann starts off by discussing a significant problem: query performance has become CPU-bound. Researchers have not been able to develop techniques that consistently beat (or match) handwritten query plans. The paper outlines a strategy that reimagines how query processing works.

Everything is optimized to form a “data-centric” model. A main goal is to not move data out of registers, to speed up processing, and to avoid a significant number of function calls. While these may seem like small things, repeated millions of times they become significant overhead. This is partially achieved by using a push model instead of the traditional pull model. The pushes happen until a pipeline breaker is reached (when a tuple needs to be removed from registers). Additionally, instead of turning a query into physical algebra to be executed, the query is compiled into a program. For extra efficiency, it is compiled to machine code using the Low Level Virtual Machine (LLVM) compiler.

I felt as though the paper did a good job explaining and justifying their choices and how they fit into modern-day requirements. They explain why the iterator model was traditionally used, and why it is not ideal or acceptable today (non I/O-bound workloads). It is clear that the proposed solution is quite complex, so it is important for the reader to understand the value of this work - especially when more complex code can theoretically lead to more errors. There was also a method used on page 544 that I liked as an explanation: the author wrote some C++ code, and mentioned that C++ wouldn’t have actually been used, it would have been LLVM. The technique of showing it in a simpler language was much appreciated.

I was confused about the role of the HyPer system as discussed in the results section. The authors mention on page 546 that a bad result was “basically a suboptimal behavior of HyPer and not related to LLVM.” My impression from how it was introduced was that HyPer was a full-fledged system that the techniques were integrated with. Therefore, I wondered why a baseline for HyPer was not included, which presumably would have shown this behavior.


Review 9

This paper proposes a new strategy for translating queries into efficient machine code, thus improving the performance of query processing.

Traditional database systems execute algebraic query expressions using the iterator model. However, that method is only suitable for situations where query processing cost is dominated by I/O, not the CPU. The reason is that function calls are used extensively in the model, and each data tuple needs several of them. For main-memory systems, these function calls become a bottleneck. In addition, the iterator model results in poor data and code locality.
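The per-tuple function-call overhead is easiest to see in a toy rendering of the pull-based iterator model. This is an illustrative Python sketch with invented operator names (real systems issue virtual calls in C++):

```python
# Toy Volcano-style (pull-based) iterator model: every tuple is fetched
# through a chain of next() calls, one call per operator per tuple.

class Scan:
    def __init__(self, rows):
        self.it = iter(rows)

    def next(self):
        return next(self.it, None)       # None signals end of stream

class Select:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

plan = Select(Scan([1, 2, 3, 4, 5]), lambda x: x % 2 == 1)
out = []
while (row := plan.next()) is not None:  # the driver pulls tuples one at a time
    out.append(row)
print(out)
```

Every tuple here crosses at least two next() calls; with a realistic plan of many operators and millions of tuples, that call overhead is exactly the CPU bottleneck the review describes.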

This paper suggests that query processing should be data-centric instead of operator-centric. This means that the boundaries between operators are blurred, while data stays in CPU registers as long as possible and is processed by possibly many operators. This achieves good data locality. For good code locality, each generated code fragment works on large amounts of data in tight loops.

Thus the idea here is quite simple, but it is hard to actually translate a query into machine code that follows it. The paper suggests writing LLVM code to connect the different operators (tuple access, filtering, materialization in a hash table, etc.) and C++ code for more complex parts such as data structure management. This ensures that data stays in registers all the time. To achieve the best performance, care must be taken with complex operators and with the way the code is written. For complex queries, it is not desirable to compile everything into a single function, which would lead to exponential growth in code size; it is better to write functions in LLVM and call them from other LLVM code. The code should also be friendly to branch prediction, as a misprediction wastes several CPU cycles.
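The LLVM/C++ split described above can be caricatured as follows. This is a hedged sketch in Python, not LLVM IR: the class and function names are invented, with a plain function standing in for the generated tight loop and a class standing in for the precompiled C++ runtime:

```python
# Sketch of the mixed execution model: a "generated" tight loop
# (stand-in for LLVM code) calls into a precompiled runtime
# (stand-in for C++) only for complex work such as hash-table management.

class RuntimeHashTable:
    """Stand-in for a complex C++ data structure in the DBMS runtime."""
    def __init__(self):
        self.groups = {}

    def insert(self, key, value):
        self.groups[key] = self.groups.get(key, 0) + value

def compiled_fragment(rows, ht):
    # The "generated" part: scan and filter fused into one tight loop,
    # crossing into the runtime only at the pipeline breaker.
    for dept, salary in rows:
        if salary > 100:
            ht.insert(dept, salary)

ht = RuntimeHashTable()
compiled_fragment([("a", 90), ("a", 120), ("b", 150), ("b", 80)], ht)
print(ht.groups)
```

The design point is that the hot loop contains no virtual dispatch at all, while the cold, complicated logic stays in hand-written, reusable runtime code.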

One possible improvement for this paper is how to leverage multi-core machine and generate parallel machine code for query processing. It will be complicated, but definitely necessary.


Review 10

In the paper "Efficiently Compiling Efficient Query Plans for Modern Hardware", Thomas Neumann tackles the classic problem of query performance as it scales with growing main memory and raw CPU costs. He proposes a novel compilation strategy that translates a query into compact and efficient machine code rivaling hand-written code. What is the result? Excellent query performance with only modest compilation time. Current models translate queries to algebraic expressions and repeatedly call the "next" function of each operator. The problem with this method is that it was designed in an era when overhead was dominated by I/O operations. "Next" is a virtual call for every tuple, which results in poor code locality and complex book-keeping. Noting this as a means for improvement, Neumann approaches the problem with the following in mind:
1) Processing is data-centric. It should be kept in CPU registers as long as possible.
2) Data should be pushed towards operators, not pulled.
3) Queries should be compiled into native machine code.
There are several implementations before this that try to tackle the same problem, but pale in comparison to hand-written code. Neumann makes a strong claim that his system, at times, runs faster than hand written code.

Neumann's techniques are outlined in three main components: the query compiler, code generation, and parallelization techniques.
1) Query Compiler: The goal is to organize query processing so that data stays in CPU registers as long as possible. Thus, he reverses the direction of data flow, pushing tuples towards consumer operators until the next pipeline breaker is reached. The execution plans that are produced minimize the number of memory accesses. Furthermore, by exposing the operator structure, the system can generate near-optimal assembly code, emitting exactly the relevant instructions and keeping all relevant values in CPU registers.

2) Code Generation: Rather than relying on slow C++ compilation with limited control over the generated code, Neumann opts for the LLVM compiler framework. LLVM takes only a few milliseconds for query compilation and hides the register allocation problem by providing an unbounded number of virtual registers. Producing code this way is much more robust than writing it manually. However, since the generated code is so fast, the layout of the code fragments themselves and good branch prediction become the bottlenecks of the design.

3) Parallelization Techniques: SIMD instructions are employed to process multiple tuples with one instruction. Parallelism across cores is enabled by partitioning the input of operators, processing each partition independently, and finally merging the results from all partitions. Due to the nature of this system, almost no code needs to be changed. Deciding how to parallelize is a problem not tackled in this paper.
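The partition/process/merge pattern can be sketched compactly. This is an illustrative Python sketch only, with invented helper names; the paper's system parallelizes compiled machine-code fragments, not Python functions:

```python
# Sketch of partition/process/merge: split the input, aggregate each
# partition independently, then merge the partial results.

from concurrent.futures import ThreadPoolExecutor

def partial_count(rows):
    counts = {}
    for key in rows:
        counts[key] = counts.get(key, 0) + 1
    return counts

def parallel_group_count(rows, partitions=4):
    chunks = [rows[i::partitions] for i in range(partitions)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partials = list(pool.map(partial_count, chunks))
    merged = {}
    for p in partials:                    # merge phase
        for k, v in p.items():
            merged[k] = merged.get(k, 0) + v
    return merged

print(parallel_group_count(["a", "b", "a", "c", "a", "b"]))
```

Because the per-partition work is the unchanged single-threaded operator, this pattern fits an existing engine without rewriting operator code, which is the review's point.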

This paper also has a couple of drawbacks, much like other technical papers. Primarily, the lack of graphs and the heavy use of tables made the results hard to comprehend; if the information had been visualized, a better general understanding could have been achieved. Second, I felt that a lot of the vital information needed to understand some of the topics was pushed into the appendix. Even though there were references to these parts, discussing them in tandem with the current discussion would have been more effective. Lastly, I felt there was room for discussion of future work. Modern hardware includes SSDs instead of HDDs. What effect would this have?


Review 11

This paper describes how to make query execution more efficient in main memory DBMS’s. A main memory DBMS has the advantage that speed isn’t limited by disk IOs. This means that the CPU’s processing time is more important for optimizations.

Ordinarily, a DBMS uses an iterator model for executing its query plans. A “next” function can be called for every tuple in a result set, and that tuple can be processed and sent to the next step of the query plan. This also allows the DBMS to pipeline its execution, as each tuple can be sent to the next stage of execution independent of other tuples. However, iterating on a tuple by tuple basis instead of a coarser granularity incurs significant CPU costs.

In order to fix this, the DBMS can define certain parts of the query plan as pipeline breakers, which force tuples to be removed from CPU registers. The query plan will push tuples from one pipeline breaker to another, which keeps them in CPU registers for as long as possible. This will, as much as possible, prevent the cost of constantly switching tuples in and out of registers. In addition, the interface isn’t much more complicated. Instead of using “next” to iterate through tuples, this uses “produce” to get tuples from one stage and “consume” to send them to another stage.

For best speed, the query plan is compiled into machine code. This is done with a combination of LLVM and C++. LLVM has more efficient optimizers when compiling, but C++ can be easier to use, and both languages can call functions from the other. However, because of LLVM’s compilers, for best efficiency, the most executed parts of the query plan should all be compiled with LLVM.

The experimental results showed LLVM compilation to generally work better than another option, MonetDB. It tends to be faster, use fewer instructions, and predict branching behavior better. However, the experiments didn’t show comparisons with any other baselines, which limits the kinds of conclusions we can draw from this performance.



Review 12

The paper mainly discusses a new design for efficiently compiling query plans for modern hardware with more memory. As main memory grows, query performance is more and more determined by the raw CPU costs of query processing itself. The classical iterator-style query processing technique is very simple and flexible, but shows poor performance on modern CPUs due to lack of locality and frequent instruction mispredictions. Several techniques like batch-oriented processing or vectorized tuple processing have been proposed in the past to improve this situation, but even these techniques are frequently out-performed by hand-written execution plans. The paper presents a novel compilation strategy that translates a query into compact and efficient machine code using the LLVM compiler framework. The strategy rests on three ideas:
1. Processing is data-centric and not operator-centric. Data is processed such that it can be kept in CPU registers as long as possible; operator boundaries are blurred to achieve this goal.
2. Data is not pulled by operators but pushed towards the operators. This results in much better code and data locality.
3. Queries are compiled into native machine code using the optimizing LLVM compiler framework.
One drawback is that writing code in LLVM is more tedious and less convenient than writing C++.







Review 13

The classical iterator model for query evaluation is the most commonly used execution strategy because it is flexible and simple. However, CPU consumption becomes an issue. Some systems try to reduce the high calling cost by passing blocks of tuples between operators, but this causes additional materialization costs. This paper proposes a very different architecture for query processing to maximize performance. The key to the architecture is keeping data in CPU registers as long as possible. This is achieved by reversing the direction of data flow control: instead of pulling tuples up, push them towards consumer operators. The consequence is that data is always pushed from one pipeline-breaker to another. Queries are also compiled in a way that all pipelining operations are performed purely in the CPU, with execution going from one materialization point to another. This ensures good code locality and therefore good performance.

To compile algebraic expressions into code fragments, the paper proposes exposing the operator structure so as to generate near-optimal assembly code and keep relevant data in CPU registers. Each operator offers two functions: produce and consume. This produce/consume interface applies only to the final step of translating an algebraic expression into an imperative program. The code fragments operate on certain pieces of data at a time, which results in efficient execution. To generate machine code, it is better to use a mixture of LLVM and C++, since they have advantages in different aspects and can complement each other. For code generation of complex operators, multiple functions are needed. There are some issues that slow down and complicate code generation, but the effort needed to avoid these pitfalls is not large. The paper then presents parallel techniques for processing tuples; some possible approaches are introduced, but parallelizing decisions remain a difficult problem and are out of scope for this paper.

The paper also shows experimental results for OLTP and OLAP workloads compared against other systems. The results show that data-centric query processing is a very efficient query execution model.

As a reader who does not have much background knowledge in query compilation, I found it hard to follow all the concepts. A section introducing the high-level structure and some terminology would have been very helpful for understanding and clarity.


Review 14

This paper proposes compiling efficient query plans for the CPU. The background is that main-memory database systems have become more and more popular, and query performance is determined more by the raw CPU cost of query processing than by the disk I/O that dominates conventional database systems.

The main contribution of the paper is building a query compiler. The author defines the notion of a pipeline breaker to help illustrate the approach. The main point is to treat spilling data to memory as a pipeline-breaking operation, so that during query processing all data is kept in CPU registers as long as possible. The paper also gives functions that translate a query into an algebraic expression. In addition, it presents a new way to compile a query into machine code, using the Low Level Virtual Machine (LLVM) compiler framework, which is much faster than compiling to C++. Complex operators are also considered.

The evaluation part of this paper is also complete. They did experiments on both a main-memory database system and a disk-based database system, and the results are very good. The proposed compilation to LLVM is very fast.


Review 15

This paper presents a novel compilation strategy which translates a query into compact and efficient machine code using the LLVM compiler framework. As mentioned in the paper, as main memory grows, query performance is determined much more by the CPU. The problem is that traditional iterator-style query processing suffers from poor performance on modern CPUs, as do other techniques like batch-oriented processing and vectorized tuple processing. The main issues with these traditional methods are the lack of locality and frequent instruction mispredictions, with the result that hand-written execution plans can outperform them. This problem is significant to database vendors because they need to follow CPU improvements to optimize their query processing models, and doing so can improve the performance of modern DBMSs. In this paper, the authors provide an innovative strategy aiming at good locality and predictability that yields excellent query performance. They integrate these techniques into the HyPer main-memory DBMS; experimental results indicate that their strategy achieves excellent query performance while requiring only modest compilation time.

Compared with traditional query processing approaches, their strategy differs in three ways. First, query processing is data-centric rather than operator-centric: data is kept in CPU registers as long as possible and operator boundaries are blurred. Second, data is not pulled by operators but pushed towards the operators, yielding better code and data locality. Third, queries are compiled into native machine code by LLVM. In their design, they want all data to be kept in CPU registers as long as possible; to achieve this, they use a push-based model in which data is always pushed from one pipeline-breaker into another. This reduces CPU cost and register pressure. Although compiling to C++ is attractive, it has overheads, such as the slowness of an optimizing C++ compiler and the potential for suboptimal generated code. They choose instead to use the Low-Level Virtual Machine (LLVM) framework to generate assembler code, which can be executed directly using an optimizing JIT compiler. LLVM provides better robustness and faster query compilation. Apart from this, they also introduce advanced parallelization techniques which fit naturally into the general framework. Since the whole block is in registers, block-wise processing can be used; this utilizes the SIMD instructions of modern CPUs and helps delay branching. Promising experimental results indicate that their strategy works well.

It’s a nice paper discussing a novel query processing strategy, though a little hard to understand. There are several technical contributions. The authors treat spilling data to memory as a pipeline-breaking operation, so that during query processing all data is kept in CPU registers as long as possible. This is a good design for modern CPUs because register access is the fastest, which greatly improves performance. Another contribution is that they do not implement the complete query processing logic in LLVM assembler; on the contrary, they use a mixed execution model combining LLVM and C++, and by leveraging the advantages of both, the strategy achieves very low query compilation times. Besides this, the paper gives a rich example which helps readers understand the concepts.

I didn’t find any major downside to the paper. It introduces some concepts I was not familiar with before, like LLVM and SIMD; providing more background on these techniques would help.



Review 16

This paper introduces a new compilation strategy for queries that replaces the iterator model. The chief goals of this new strategy are to maximize data & code locality in order to provide compilations (using LLVM) that can rival and sometimes even surpass hand-written code. Some example techniques used to achieve this are keeping data in CPU registers as long as possible (which the iterator model does not do well), doing profile-based analysis using LLVM branch analysis, and identifying “pipeline breakers” for organizing when tuples are moved.
This paper’s advantages seem fairly straightforward: using this compilation strategy, the generated code offers much higher performance than the iterator model. One disadvantage, however, is that the tradeoff seems to be complexity for performance. One of the main advantages of the iterator model is that it is simple and easy to understand and use (according to the authors, anyway), but this new strategy discards that simplicity and instead aggressively optimizes for the highest performance.