Reviews for Paper 17: Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation

Review 1

This paper presents a solution to reduce the main-memory footprint in high-performance hybrid OLTP & OLAP database systems. The contributions include:
1. Data Block, a compressed columnar storage format.
2. Positional SMA, a light-weight intra-block indexing method to improve scan performance.
3. SIMD operations for predicate evaluation.
4. Integration of JIT-compiled query execution with vectorized scans to achieve high performance in both compilation and execution.

In hybrid database systems, it is crucial to manage hot and cold data efficiently. This paper proposes dividing relations into fixed-size chunks and individually compressing chunks into read-optimized, immutable Data Blocks once they are identified as cold. Data Blocks are essentially compressed cold data.

Lightweight compression schemes and PSMAs are used to achieve efficient access to the data. The PSMA narrows the scan range using the attributes' min/max information. Several compression methods, including single value compression, ordered dictionary compression, and truncation, are introduced.

Vectorization enables the use of SIMD instructions to access data belonging to multiple tuples in parallel to improve performance. Also, a JIT-compiling query engine is integrated into the HyPer main-memory DBMS to support query processing. The authors mention that their SIMD-optimized algorithms only work for integer data; it would be interesting to see an extension to other data types.

One little thing I like in this paper is the idea that, since data is stored in separate Data Blocks, each block can be compressed differently, which creates an opportunity for an optimized compression scheme for each column in each block. I find this idea interesting: not complicated, but very useful.


Review 2

DBMSs designed for OLAP workloads have proliferated in the past few years. Most of them use a column-store design. Query efficiency is increased significantly by operating on blocks of values instead of interpreting query expressions one tuple at a time. This paper introduces the evolution of HyPer, a main-memory database. The paper aims at reducing the main-memory footprint of HyPer through compression while maintaining the high OLTP and OLAP performance of the original system.

There are four main approaches proposed by the paper. The first is Data Blocks, the hybrid database system's novel compressed columnar storage format. The second is Positional SMA, a light-weight index on compressed data. The third is SIMD-optimized algorithms. The last is the integration of multiple storage layouts into a compiling tuple-at-a-time engine by using vectorization.

Some of the strengths of this paper are:
1. Data Blocks storage conserves memory while retaining both high OLTP and high OLAP performance.
2. PSMAs narrow scan ranges and can be applied to both compressed and uncompressed data.
3. HyPer refrains from using compression and PSMA indexes on hot data to avoid degrading transaction-processing performance on those hot parts.


Some of the drawbacks of this paper are:
1. The SIMD-optimized algorithms are limited to integer data. Other data types still use scalar implementations.
2. The paper provides expected OLAP performance but does not actually conduct the experiments and show the results.



Review 3

High-performance analytical systems use either vectorized query execution or "just-in-time" (JIT) query compilation, neither of which can guarantee the same high performance in OLTP. Therefore, this paper proposes the novel Data Block format, which allows efficient scans and point accesses on compressed data, and addresses the challenge of integrating multiple storage layout combinations in a compiling tuple-at-a-time query engine by using vectorization, accommodating high-performance OLTP alongside OLAP running against the same database state and storage backend.

This paper addresses the fundamental difference between OLTP and OLAP workloads and their optimizations by dividing relations into a read-optimized and a write-optimized partition; updates to the immutable (frozen) compressed cold data are performed by invalidating the affected records and moving them to the hot region. Besides this, the framework uses light-weight compression schemes in Data Blocks to allow highly efficient access to individual records, and introduces PSMAs to speed up scans on Data Blocks.

The main contributions are as follows:
(i) Data Blocks, a novel compressed columnar storage format for hybrid database systems
(ii) light-weight indexing on compressed data for improved scan performance
(iii) SIMD-optimized algorithms for predicate evaluation
(iv) a blueprint for the integration of multiple storage layout combinations in a compiling tuple-at-a-time query engine by using vectorization.

The main advantage of this paper is that it integrates the best features of the two query-processing approaches to obtain high query performance and transaction throughput, achieving high performance in both OLTP and OLAP. The proposed framework also reduces the main-memory footprint, which in turn reduces CPU overhead and accelerates query processing.

The main drawback is that, for traditional OLTP workloads, the SMAs/PSMAs proposed by this paper are not a general replacement for a traditional index structure, as compression and decompression can cost much more time than directly accessing uncompressed data.


Review 4


Problem & Motivation:
Recently, two main approaches have been proposed to increase the efficiency of query evaluation specialized for OLAP workloads. The first approach is storing data in a compressed columnar format, which enables the database to evaluate a query on blocks of values. The second approach to accelerate query evaluation is "just-in-time" (JIT) compilation of SQL queries directly into executable code. However, neither method serves OLTP and OLAP workloads simultaneously. Therefore, the authors propose extensions to HyPer that reduce the main-memory footprint of high-performance hybrid OLTP & OLAP processing.

Main achievement:
Basically, HyPer.
HyPer is a full-fledged main-memory database system, and there are two key highlights inside it.

1. Data Blocks
The first one is the compressed columnar storage format, "Data Blocks". Given the amount of data nowadays, it is essential for a main-memory DBMS to compress the data so that more entries fit into main memory; the Data Block is used to reduce memory usage. A Data Block is self-contained: it holds all the data needed to reconstruct the stored attributes, along with PSMA index structures, without external metadata. Data Blocks are applied only to "cold" data and therefore do not influence the performance of the hot data. Once a record has been packed into a Data Block, it can only be deleted (or read, from my understanding), which implies that to update a record, you need to delete it and reinsert it. Refer to Section 3.1 for the storage layout.

2. PSMA (Positional Small Materialized Aggregates)
PSMA is adapted from the SMA. It contains the basic SMA data structure: one minimum and one maximum value for each attribute stored in a Data Block. The way to use an SMA is to apply this domain knowledge to rule out unqualified Data Blocks. However, it offers little benefit under two conditions: (a) a single outlier exists in the data block, or (b) the desired data is uniformly distributed. Therefore, in addition to the simple min/max values of each attribute, a concise lookup table is added to further narrow the scan range. The lookup table contains 2^8 entries for each byte of the data type. One interesting detail is that the table is indexed by the distance of a value from the SMA's minimum rather than by the value itself. Refer to the end of page 4 for details of how a lookup proceeds.
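To make the lookup concrete, here is a minimal C++ sketch of a positional lookup table in this spirit, assuming 32-bit unsigned values and a GCC/Clang builtin for bit scanning. All names (`PSMA`, `Range`, `slot`) are illustrative, not taken from the HyPer code base:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative PSMA sketch for 32-bit unsigned values. Each table slot maps
// the most-significant non-zero byte of a value's delta (value - SMA min)
// to the range of row positions whose values fall into that slot.
struct Range { uint32_t begin = 0, end = 0; };  // [begin, end), empty if begin == end

struct PSMA {
    uint32_t min = 0;
    Range table[4 * 256] = {};  // 2^8 entries per byte of the data type

    // Slot index: value of the leading non-zero delta byte, offset by 256
    // for each more-significant byte position.
    static uint32_t slot(uint32_t delta) {
        uint32_t byte_pos = (delta == 0) ? 0 : (31 - __builtin_clz(delta)) / 8;
        return ((delta >> (byte_pos * 8)) & 0xFFu) + byte_pos * 256;
    }

    void build(const std::vector<uint32_t>& column) {
        min = *std::min_element(column.begin(), column.end());
        for (uint32_t row = 0; row < column.size(); ++row) {
            Range& r = table[slot(column[row] - min)];
            if (r.begin == r.end) { r.begin = row; r.end = row + 1; }  // first hit
            else                  { r.end = row + 1; }                 // grow range
        }
    }

    // Rows matching "attr == v" can only lie inside the returned range.
    Range lookup(uint32_t v) const {
        return (v < min) ? Range{} : table[slot(v - min)];
    }
};
```

The returned range may still contain non-matching rows (different values can share a slot), so it narrows the scan rather than replacing the predicate check.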


Drawbacks:
Where is the JIT explanation? I expected to see some background on JIT compilation.



Review 5

In recent years, the rising popularity of column-store-based database systems, along with vectorized query execution, has greatly improved the performance of OLAP databases. It is often the case, however, that design considerations and optimizations for OLAP do not completely translate to OLTP, and vice versa. In many cases, their requirements can be contradictory, which makes designing a high-performance hybrid database system difficult. The paper "Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation" presents some important innovations to the authors' main-memory database system HyPer, with the aim of achieving high performance for both OLTP and OLAP while using the same database state and storage backend. In order to do so, they contribute the following:
1. A new compressed column-based storage format called Data Blocks
2. Light-weight indexing on compressed data to improve scan performance
3. The use of optimized parallel algorithms (SIMD) to evaluate predicates
4. Use of vectorization to integrate multiple storage layout combinations in a compiling tuple-at-a-time query engine.

In their design, the authors divide the data into hot data, which is left uncompressed, and cold data, which is stored in the compressed Data Block format. These Data Blocks are self-contained containers which store attributes in a byte-addressable compressed format. Compression methods are chosen based on the distribution of the attribute, allowing for high compression ratios. Also, SARGable scan restrictions are evaluated directly on the compressed data representation to find matches. Additionally, all data contained inside a Data Block is immutable, and in order to update information, records are deleted and followed up by an insert. Positional Small Materialized Aggregates (Positional SMAs) add low-overhead intra-block indexing, which further improves scan performance and predicate evaluation. In essence, Data Blocks combined with Positional SMAs bring the advantages of compression best utilized by OLAP while maintaining the fast point accesses (through byte-addressability) needed for efficient OLTP.

The system uses LLVM to compile queries into machine code which can be executed quickly. In order to keep compile times under control, especially with the use of compression, the system calls pre-compiled interpreted vectorized scan code for vectors of tuples, which are then processed a tuple at a time. This scales much better for large numbers of potential storage combinations. Finally, SIMD-optimized code is used to efficiently find matches in the data.
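To illustrate this division of labor, here is a simplified C++ sketch, under my own assumptions, of precompiled vectorized scan code handing fixed-size vectors of matching positions to a consumer that stands in for the JIT-compiled tuple-at-a-time pipeline. None of these interfaces (`scan_vector`, `run_pipeline`, `kVectorSize`) are HyPer's actual ones:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

constexpr size_t kVectorSize = 1024;  // tuples examined per scan call

// Precompiled, interpreted scan primitive: evaluate a range predicate over
// one vector of a column and collect the positions of matching tuples.
size_t scan_vector(const int32_t* column, size_t offset, size_t count,
                   int32_t lo, int32_t hi, std::vector<uint32_t>& matches) {
    matches.clear();
    for (size_t i = 0; i < count; ++i)
        if (column[offset + i] >= lo && column[offset + i] <= hi)
            matches.push_back(static_cast<uint32_t>(offset + i));
    return matches.size();
}

// Driver: the generic scan produces match vectors; the consumer (standing in
// for compiled pipeline code) then processes the matches one tuple at a time.
void run_pipeline(const std::vector<int32_t>& column, int32_t lo, int32_t hi,
                  const std::function<void(uint32_t)>& consume_tuple) {
    std::vector<uint32_t> matches;
    matches.reserve(kVectorSize);
    for (size_t off = 0; off < column.size(); off += kVectorSize) {
        size_t n = std::min(kVectorSize, column.size() - off);
        scan_vector(column.data(), off, n, lo, hi, matches);
        for (uint32_t pos : matches)
            consume_tuple(pos);  // tuple-at-a-time hand-off to the pipeline
    }
}
```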

The main strength of this paper is the presentation of a working database system that is able to deal with both OLTP and OLAP workloads without sacrificing too much performance. According to the results presented by the authors, it seems that their HyPer solution strikes the right balance of tradeoffs with regard to compression ratios, query performance, and more. For example, while Vectorwise (another system they compared their results against) achieves on average 25% better compression than HyPer, it comes at the cost of poorer query performance. As for OLTP workloads, their system using compressed Data Blocks performed only around 9% worse than an uncompressed TPC-C database. In other words, it achieves a good balance for both OLTP and OLAP workloads and makes a convincing case that it is capable of handling both reasonably well.

As for the weaknesses of this paper, I felt that the authors did not explicitly discuss why a hybrid system for OLAP and OLTP workloads was even necessary. While they did make a good case about why their compression scheme is better than competing approaches, why not just specialize database systems towards one workload or another, rather than create a system that is not as good at either? As one of the motivating factors was the relative lack of advances in OLTP systems compared to OLAP, this seemed somewhat strange, since Data Blocks, etc. were made to improve OLAP at as little cost to OLTP, rather than increasing OLTP performance.


Review 6

This paper’s main contributions are Data Blocks (compressed cold data), vectorized scans, SIMD-optimized predicate evaluation, and the integration of the compressed Data Blocks format into their JIT-compiling query engine. The authors present their contributions implemented in HyPer, their main-memory database system. HyPer supports high-performance OLAP and OLTP on the same data. These contributions are important for use cases where the database can fit in memory and we want to optimize query performance.

Data Blocks store compressed chunks of attributes, a count of the items in the block, and a Small Materialized Aggregate (SMA) which records the minimum and maximum value stored in the block. The information in the SMA is extended to a Positional SMA index, which contains a lookup table whose entries are scan ranges pointing into the cold compressed data in the Data Block. Compression methods are selected per column in a block: single value compression when all entries are identical, ordered dictionary compression when the dictionary does not need to be updated or grow, and truncation when integer data can be stored as the delta between each value and the minimum value in the range. When searching a Data Block for matches, the scan range is narrowed by the min/max information in the SMA as well as other checks; in the case of dictionary compression, for example, a binary search over the ordered dictionary translates the predicate into a range of codes.
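As a small illustration of the dictionary case, the following C++ sketch (types and names are my own, chosen for this example) shows how an immutable, sorted dictionary lets a string range predicate be translated once, by binary search, into a range of integer codes that the scan then compares against:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Illustrative ordered-dictionary column: the block is frozen, so the
// dictionary is sorted once and codes preserve the value ordering.
struct DictColumn {
    std::vector<std::string> dict;  // sorted, distinct values
    std::vector<uint32_t> codes;    // per-row index into dict

    // Translate "lo <= attr <= hi" into the code interval [c_lo, c_hi);
    // the scan afterwards only compares integer codes.
    std::pair<uint32_t, uint32_t> code_range(const std::string& lo,
                                             const std::string& hi) const {
        auto l = std::lower_bound(dict.begin(), dict.end(), lo);
        auto h = std::upper_bound(dict.begin(), dict.end(), hi);
        return {static_cast<uint32_t>(l - dict.begin()),
                static_cast<uint32_t>(h - dict.begin())};
    }
};
```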

The article also contributes pre-compiled vectorized scan code, which helps compile time scale well with the number of storage layout combinations, as opposed to JIT-compiling a specialized scan for every combination. The authors then describe challenges with integrating vectorized scans into HyPer, which include determining whether a block is cold/compressed or not, since that affects how matches are found in the data chunk. Next, the paper discusses how SIMD instructions make multiple comparisons in parallel and produce a bit-mask indicating all the matching elements. The bit-mask is then efficiently mapped to a match-position vector using the movemask instruction and a precomputed table, a constant-time operation. Evaluation of the contributions of this paper was done on the main-memory database system HyPer. We saw how HyPer generally used more space as a tradeoff for query efficiency.
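The following C++ sketch with SSE intrinsics shows the general shape of that technique, assuming 32-bit keys and a GCC/Clang popcount builtin; the table contents and function name are mine, not the paper's code:

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>  // SSE2 intrinsics

// positions_table[mask] lists the indices of the set bits of mask, in order,
// so a 4-bit comparison mask maps to lane offsets without per-lane branching.
static const uint8_t positions_table[16][4] = {
    {0,0,0,0},{0,0,0,0},{1,0,0,0},{0,1,0,0},{2,0,0,0},{0,2,0,0},{1,2,0,0},{0,1,2,0},
    {3,0,0,0},{0,3,0,0},{1,3,0,0},{0,1,3,0},{2,3,0,0},{0,2,3,0},{1,2,3,0},{0,1,2,3}};

// Collect the positions of elements equal to `key`, four lanes at a time.
size_t find_equal(const int32_t* column, size_t count, int32_t key,
                  uint32_t* out_positions) {
    const __m128i needle = _mm_set1_epi32(key);
    size_t n_matches = 0;
    for (size_t i = 0; i + 4 <= count; i += 4) {
        __m128i vals = _mm_loadu_si128(reinterpret_cast<const __m128i*>(column + i));
        __m128i eq   = _mm_cmpeq_epi32(vals, needle);
        int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));  // 4-bit match mask
        int hits = __builtin_popcount(mask);
        for (int j = 0; j < hits; ++j)                     // table-driven decode
            out_positions[n_matches++] =
                static_cast<uint32_t>(i) + positions_table[mask][j];
    }
    // A real implementation would handle the (count % 4) tail with scalar code.
    return n_matches;
}
```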

I liked how this paper successfully improved both the compile time and runtime of queries. I think that under the assumption that memory is abundant, these contributions are very attractive. I would naturally criticize the comparatively larger memory footprint of their implementation, which limits the audience who could fully benefit from these contributions.



Review 7

Data Blocks
This paper proposes Data Blocks, a compressed columnar storage format for cold data. This method significantly reduces the main-memory footprint while maintaining high query performance and transactional throughput in a hybrid OLTP and OLAP system. Data Blocks retain high transaction performance because their compression format keeps access to individual tuples efficient. Data Blocks also integrate the good features of vectorized query execution and "just-in-time" query compilation using an interpreted vectorized scan subsystem.

The authors share their motivation for designing Data Blocks in the introduction. Recent OLAP systems have already increased CPU efficiency by storing data in a compressed columnar format. The key idea is vectorized execution, a kind of batch processing, rather than interpreting one tuple at a time. However, the method proposed by the authors is quite different from previous systems: it accommodates and takes advantage of both OLAP and OLTP by introducing compression, which reduces the main-memory footprint. Building a hybrid database system is hard, but the authors address this problem by dividing relations into a read-optimized and a write-optimized partition. To obtain efficient access to individual records by tuple position, the system uses light-weight compression schemes, and to speed up scans, it uses a 'positional' type of small materialized aggregates. Combining all of this with vectorized query execution and JIT compilation, it achieves high performance in both OLAP and OLTP.

A Data Block is a self-contained container that stores attribute chunks of varying size in a byte-addressable compressed format. The authors choose Data Blocks because they can achieve high OLTP and OLAP performance while conserving memory. In this method, Data Blocks are used as the compressed RAM storage format for cold data. The compression schemes for cold data also serve the aforementioned performance goals, although point accesses to cold data take more time than to hot data. The requirement is therefore that compressed data remain byte-addressable for efficient point access. Three kinds of compression are used: 1) single value compression, 2) ordered dictionary compression, and 3) truncation. Together they ensure that performance does not drop drastically when a point access lands in the cold data, while still achieving a considerable compression rate.

Vectorized scans are introduced to efficiently integrate multiple storage layouts into the 'tuple-at-a-time' JIT-compiling query engine. Unlike previous query execution models, the query engine used in this paper generates code for entire query pipelines, using LLVM for compilation. When the system receives a query, the query is broken down into multiple pipelines, and each pipeline applies the logic of all its operators before materializing its output. The authors conduct thorough experiments on compression performance, query performance, and OLTP performance, and show that the proposed method achieves relatively better results than others.

The main contribution, as well as the main advantage, of this paper is that it successfully reduces the main-memory footprint while maintaining high performance in OLTP and OLAP. One thing that could be improved is that the method does not achieve a very strong compression ratio.


Review 8

This paper proposes a compressed columnar storage format for cold data. This compression solution aims at reducing main-memory usage while still providing high query performance and transactional throughput. The Data Blocks proposed in this paper are an evolution of HyPer and can be applied to a hybrid OLTP and OLAP database system. People tend to use a row store for OLTP and a column store for OLAP. To achieve a hybrid, a storage solution is needed that is compatible with both memory and disk, and whose stored tuples can still be used for transaction processing.


JIT compilation keeps data in CPU registers, while vectorization passes data through memory. However, vectorized scans can be pre-compiled and perform well with SIMD algorithms, which can be further optimized. Each of the two methods thus has its own obvious benefits.
The authors use vectorized predicates to take advantage of new hardware features such as Advanced Vector Extensions (AVX) and Streaming SIMD Extensions (SSE). The authors also introduce a new type of small materialized aggregate (SMA) called the PSMA. Since the best compression technique is chosen for each column in the block, a light-weight SMA index provides the minimum and maximum values of the block, allowing Data Blocks to be skipped during a scan. If a block cannot be skipped, the scan range can be narrowed down by the PSMA index.

Overall, the authors made four major changes to the HyPer database. First, the Data Block storage structure. Second, support for Data Block compression. Third, SIMD-optimized vectorized predicate evaluation. Fourth, the integration of multiple storage layouts.

The main contribution of this paper is that it proposes Data Blocks, a novel compressed columnar storage format for hybrid database systems. Secondly, this paper proposes light-weight indexing on compressed data for scans. Moreover, the paper also proposes SIMD-optimized algorithms for predicate evaluation.

One point that this paper might want to explore more is whether this process should also support stored procedures. Also, I am wondering whether there is anything that can be improved on the CPU side.



Review 9

The authors aim to produce a system that can be used effectively for both OLAP and OLTP workloads. The proposed solution, data blocks, is used to store “cold” data that is infrequently accessed. This essentially becomes a solution for supporting data that is used primarily for analytical queries within a database where writes may happen. Instead of having a separate read and write store like C-Store, chunks of data are moved to immutable, optimized blocks. If the data ends up being written to at some point in the future, it is deleted and moved back to the area with hot data. As opposed to previous work, the optimizations are all done for a main-memory database.

Some important contributions made by the paper are the compression and indexing strategies for Data Blocks. I found the compression strategies to be particularly well explained and justified. Three different types of compression are used: single value compression (where all values in a block are the same), ordered dictionary compression, and truncation. The authors do a good job explaining some drawbacks of compressing at the block level - for instance the same string could be in multiple blocks and therefore multiple dictionaries - and explain why they still chose to use their compression scheme. They stuck with it because they felt that the added flexibility of being able to use the appropriate compression scheme for each block outweighed the negative implications.

There are multiple levels of indexing for Data Blocks. SMAs define the minimum and maximum value for an entire block, and can be used to rule out entire blocks in a scan. PSMAs, meanwhile, provide lookup tables for the data within the block. JIT query compilation and vectorized scanning techniques are both taken advantage of to best access the data. To the user, data access seems the same for both compressed Data Blocks and uncompressed hot data. Overall, I found that the authors clearly take advantage of the simplified problem of working with frozen data within data blocks to create clever solutions that wouldn’t be possible on writable data.

The paper seems to focus on Q1 and Q6 of TPC-H. Although some specific attributes of those queries were described, the queries themselves were never written in the paper. I found this to be a small weakness. Additionally, as someone who is not an expert in hardware, some of the language was hard to follow, especially in the second half of the paper. The authors use a significant number of acronyms and domain-specific words without defining them first. I even noted that SARGable was defined on page 4 - the 4th time it was used in the paper. With small things like this, I believe that the authors could have tried harder to accommodate people who might not be experts in this area.


Review 10

The goal of this paper is to reduce the main-memory footprint for hybrid OLTP and OLAP databases. Besides the main goal, the authors also want to retain high throughput (for OLTP) and query performance (for OLAP). Their solution is to invent a new data storage format and combine this format with vectorized query execution and “just-in-time” query compilation for high performance.

When designing the new storage format, we should notice that the needs of OLTP and OLAP conflict. An OLAP workload, on one hand, suggests using a complex compression scheme to reduce bandwidth usage. On the other hand, a complex compression scheme results in slow tuple processing and lowers the performance of OLTP workloads. Therefore, the authors design a light-weight compression scheme and avoid bit-packing techniques. The main idea is that for each data chunk, the system individually finds the encoding scheme (single value, ordered dictionary, truncation) for each attribute based on its data type and distribution. It then constructs the Data Block by first storing the number of records, and then, for each attribute, the compression method, offsets, and finally the actual data.
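A toy version of that per-column decision might look like the C++ sketch below. The thresholds are my own illustration of the idea; the paper's actual rules pick the most compact applicable scheme:

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

enum class Scheme { SingleValue, OrderedDictionary, Truncation };

// Choose an encoding for one attribute chunk based on its value distribution.
Scheme choose_scheme(const std::vector<uint32_t>& column) {
    auto [mn, mx] = std::minmax_element(column.begin(), column.end());
    if (*mn == *mx)
        return Scheme::SingleValue;      // every value identical: store it once
    std::set<uint32_t> distinct(column.begin(), column.end());
    // Few distinct values but a wide range: one-byte dictionary codes beat
    // the wider deltas that truncation would need (illustrative threshold).
    if (distinct.size() <= 256 && (*mx - *mn) >= 256)
        return Scheme::OrderedDictionary;
    return Scheme::Truncation;           // store value - min in a narrow type
}
```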

Now the compression scheme can be used to create Data Blocks for all cold data chunks. Once a Data Block is created, it is immutable. An update operation is internally converted into a deletion in the cold data partition and an insertion into the hot data partition.
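A minimal sketch of that delete-plus-insert update path, with entirely illustrative types standing in for the frozen blocks and the hot partition, could look like this:

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Illustrative record and table shells; a Data Block never changes in place.
struct Record { uint64_t key; std::string payload; };

struct Table {
    std::vector<Record> cold;                 // stand-in for compressed Data Blocks
    std::unordered_set<size_t> cold_deleted;  // tombstones for frozen rows
    std::vector<Record> hot;                  // uncompressed, write-optimized part

    // Updating a frozen record = tombstone the cold copy + insert the new
    // version into the hot partition, where it can be modified in place.
    void update_cold(size_t cold_pos, Record new_version) {
        cold_deleted.insert(cold_pos);
        hot.push_back(std::move(new_version));
    }
};
```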

Another contribution made by this paper is the Positional SMA index. A basic SMA consists only of the min and max values of an attribute. However, if values are uniformly distributed across the relation, this information is no longer useful, as it is unlikely to rule out any Data Blocks. Therefore, the authors suggest a concise lookup table which maps values to scan ranges over the compressed data inside a Data Block. This further narrows down the range during a scan and improves performance.

Finally, the paper also proposes the combined use of vectorized scans and JIT compilation. Vectorized scans exploit SIMD instructions and also fit into the JIT query pipelines. By incorporating the advantages of both methods, the system yields the highest performance for both OLTP and OLAP workloads in the experiments.

One question I have: is a unified solution for both OLTP and OLAP workloads really a good idea?


Review 11

In the paper "Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation", Harald Lang and Co. introduce "Data Blocks", a compressed columnar storage format for cold data that aims at reducing main-memory footprint in high performance hybrid OLAP and OLTP databases. With the emergence of new architectures specialized for OLAP workloads, we have new columnar storage formats and query evaluation efficiencies that are granted by vectorized execution. Particularly, this paper discusses the evolution of HyPer which is developed with the JIT-compiling query engine to incorporate vectorization in the context of Data Blocks. In order to achieve high performance for both OLAP and OLTP workloads, they establish and define the following:
1) Data Blocks
2) Light-weight indexing on data blocks for improved scan performance
3) SIMD optimized algorithm for predicate evaluation
4) A blueprint for integration of multiple storage layout combinations using vectorization
Attempting to create a single system that can optimize for two very different workloads is an arduous task - many fundamental assumptions are contradictory. Thus, Lang et al. try to fuse JIT compilation and vectorization using an interpreted vectorized scan subsystem to get the best of both worlds. Developing a system that can satisfy the use cases of any entity is both a hard and an interesting problem.

The two major contributions to the paper are the following:
1) Data Blocks: Data blocks are self-contained containers that store one or more attribute chunks in a byte-addressable compressed format. This enables them to conserve memory via compression in a high OLTP and OLAP setting for cold data and persistence. Once records have been packed into data blocks, they become immutable and only susceptible to deletion; if one wishes to modify a record, a deletion followed by an insertion would need to occur. Data blocks also incorporate a new index structure called Positional SMA in order to narrow index scans within data blocks, even if the entire block can't be eliminated.
2) Vectorized scans in compiling query engines: Since Data Blocks effectively determine the best suitable compression scheme per block, they pose problems for JIT-compiling tuple-at-a-time engines. Naturally, one would want to make their own engine; Lang et al. describe a data-centric compilation approach that compiles relational algebra trees into native machine code using LLVM. Thus, code is generated for the entire query pipeline instead of for each tuple passing from one operator to the other. This greatly reduces overhead and works well with the Data Block scheme.

One drawback that I noticed from the beginning of the paper was the inherent difficulty of comparing this new system to others. Since this is a system designed to be efficient in both OLAP and OLTP workloads, the lack of specialization will not allow it to outperform its specialized counterparts, so benchmarks lack a valid control to test against. Another drawback was the missing "future work" section. I felt that this paper could have expanded on industry use cases and how the system could possibly be tweaked to adapt to them (although companies would most likely use a DBMS exclusively for OLTP or OLAP workloads). Lastly, I did not feel much motivation for solving such a problem, since there are existing systems that work just as well, if not faster (again, why use a general system when you can build a system for YOUR NEEDS?).


Review 12

This paper describes the DBMS HyPer, and specifically how it changed to accommodate Data Blocks, a new storage method for improving memory usage. The general idea is to find a midway point between storage that works for OLAP, which favors compressed read-only storage, and OLTP, which favors uncompressed read-write storage.

Data in memory is stored in a structure called a Data Block. Data Blocks are relatively lightweight; they contain no pointers and no schema, so they can be written to disk easily. In order to facilitate reads, memory is divided into a read-only storage and a write storage. Blocks in the read storage cannot be changed; blocks in the write storage are periodically moved to read storage. Each Data Block is stored in a columnar format to allow compression, but each block must contain all of the columns for its data, so that blocks remain independent.

Each attribute can use its own compression scheme, but in order to optimize reads, all entries in the data block must remain byte-accessible. The methods of compression are:
- When all values of an attribute are the same, they can be stored as a single value
- Since values cannot be changed, they can be stored in an ordered dictionary, preserving the ordering of values through their keys.
- A single base value can be chosen, and every other value can be stored as its delta from the base instead of the original value (see the sketch below).
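Here is a minimal C++ sketch of that delta scheme, assuming for simplicity that every delta fits in one byte (a real implementation would pick the narrowest width that fits the largest delta); the struct name is illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Truncation: store each value as its byte-addressable delta from the block
// minimum. Point access stays a single add, with no bit-unpacking.
struct TruncatedColumn {
    uint32_t min = 0;
    std::vector<uint8_t> deltas;  // assumes max delta < 256 for this sketch

    static TruncatedColumn compress(const std::vector<uint32_t>& values) {
        TruncatedColumn c;
        c.min = *std::min_element(values.begin(), values.end());
        c.deltas.reserve(values.size());
        for (uint32_t v : values)
            c.deltas.push_back(static_cast<uint8_t>(v - c.min));
        return c;
    }

    uint32_t get(size_t row) const { return min + deltas[row]; }  // one value
};
```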

In addition to the data, each Data Block stores small materialized aggregates (SMAs) in order to speed up searches on the stored data. An SMA is stored for each column in the Data Block. The simplest are the min and max of each attribute; when searching an attribute, any Data Block whose range cannot contain a match can be ignored.
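In code, that block-skipping test is just two comparisons; a tiny C++ sketch (the struct is illustrative, and the paper keeps one SMA per attribute per block):

```cpp
#include <cstdint>

struct SMA { int32_t min; int32_t max; };  // per-attribute block aggregate

// For a range predicate "lo <= attr <= hi", the whole block can be skipped
// when its value range and the predicate range do not overlap.
bool can_skip_block(const SMA& sma, int32_t lo, int32_t hi) {
    return sma.max < lo || sma.min > hi;
}
```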

However, a more powerful version of this is the Positional SMA, or PSMA. It stores a small table of position ranges, and a range is expanded as more values fall into it. When a value must be looked up, the difference between the value and the block's minimum is used to find the table entry holding the range of positions where that value may reside.

This paper is able to show how and why using data blocks can be done efficiently, going over how space can easily be saved with compression and reads optimized by searching blocks effectively with SMAs. Being generic enough to help OLAP and OLTP workloads should help this to be more easily adopted.

However, the explanation of the lookup tables for PSMA was hard to follow. In addition, by optimizing for OLAP and OLTP workloads, this implementation runs the risk of being outperformed by specialized DBMSs and storage systems.



Review 13

The paper introduces HyPer, a full-fledged main-memory database system originally built with a JIT-compiling query engine, and shows how to incorporate vectorization in the context of the novel compressed columnar storage format presented in the paper - Data Blocks. Different from previous database systems, it aims at a hybrid design with high performance for both OLTP and OLAP. Most importantly, it reduces the main-memory footprint of HyPer by introducing compression while retaining the high OLTP and OLAP performance of the original system.

The contributions of this paper can be divided into three parts: (1) the design of Data Blocks, a novel compressed columnar storage format for hybrid database systems; (2) light-weight intra-block indexing, called Positional SMA, for improved scan performance, together with SIMD-optimized predicate evaluation; and (3) the integration of JIT-compiled query execution with vectorized scans in order to achieve the highest possible performance for compilation as well as query execution.

The drawback of the system may be its inadaptability to dynamic and complex scenarios where frequent changes happen. The idea of separating ‘hot’ and ‘cold’ is brilliant, but it is based on the assumption that ‘hot’ and ‘cold’ data are easy to recognize and do not change over time. In dynamic and complex scenarios, there may be a medium area between ‘hot’ and ‘cold’ which performs poorly under both the ‘hot’-optimized and ‘cold’-optimized methods. Also, chances are that ‘hot’ data switches to ‘cold’ frequently; what is the cost of converting the ‘hot’ format to ‘cold’? The paper does not seem to mention it.


Review 14

This paper aims to reduce the main-memory footprint in high-performance hybrid OLTP & OLAP databases. To achieve this goal, the paper contributes (i) Data Blocks, a novel compressed columnar storage format for hybrid database systems, (ii) light-weight indexing on compressed data for improved scan performance, (iii) SIMD-optimized algorithms for predicate evaluation, and (iv) a blueprint for the integration of multiple storage layout combinations in a compiling tuple-at-a-time query engine by using vectorization. Data Blocks are self-contained containers that store one or more attribute chunks in a byte-addressable compressed format. The system determines the best suitable compression scheme on a per-block and per-column basis. Its advantages include high compression ratios, efficient point accesses, SIMD algorithms that provide higher speedups than on uncompressed data, the ability to determine whether a Data Block can be skipped during a scan, and narrowed scan ranges through the PSMAs. The paper discusses how Data Blocks provide these features by presenting implementation details of the PSMA and of attribute compression. Data Blocks constitute a challenge for JIT-compiling tuple-at-a-time query engines, so there needs to be a way of integrating multiple storage layouts. The paper presents a query engine that generates code for entire query pipelines: a query is broken down into multiple pipelines, each pipeline performs the logic of all its operators, and finally the output is materialized for the next pipeline breaker. With this engine, compile times can be kept low, even with many storage layout combinations. The paper then discusses how this feature is integrated in HyPer. Lastly, the paper explores finding matches using SIMD instructions. With SIMD, comparisons are performed on multiple attributes in parallel, yielding a bit-mask which identifies the qualifying elements. Although there are challenges, like computing the positions of matching records from the given bit-mask, this feature provides much higher speedups than on uncompressed data. The paper evaluates the implementation of interpreted vectorized scans, SIMD-optimized predicate evaluation, and the compressed Data Blocks format in the authors' JIT-compiling query engine, and the results look good.

The advantage of this paper is that the explanation is very thorough. It explains the high-level concepts and then goes into low-level details. However, readers with no background in this area can still get lost in the low-level explanations. Introducing the terminology a little more would help readers understand the paper better.


Review 15

The paper proposes a hybrid OLTP and OLAP system integrating several novel techniques. The main goal of the proposal is to reduce the main-memory footprint. To achieve the goal, the authors first propose an innovative compressed columnar storage format named Data Blocks. They also propose a light-weight index structure named Positional SMA to narrow down the scan range and improve performance. Attribute compression is proposed to meet the requirement that compressed data remain byte-addressable for different kinds of accesses. Also, vectorized scans are developed to integrate multiple storage layouts. Code is generated by the query engine at the level of entire query pipelines, which differs from traditional query execution. Finally, experiments evaluate the integrated system with the aforementioned techniques.

I think the main contribution of this paper is exactly the Data Blocks storage format. The idea is that cold data is not frequently accessed, so it can be stored in a compressed format to reduce the main-memory footprint.

The paper looks very good to me. There are many figures and tables in the paper to help readers understand. The writing of the paper is also complete and understandable, the system details are described very completely, and each technique is presented in an informative way.

The main drawback of the paper is that it doesn't run experiments on other systems; only their HyPer is evaluated. I think more experiments on other database systems would make the proposal more attractive. Another drawback, to me, is that the machine used for evaluation is too powerful. I wonder whether a practical database would be built on such a strong and expensive machine.


Review 16

In this paper, they propose a novel database system which hybridizes OLTP and OLAP workloads. This is an interesting but also challenging topic, since the technical demands of the two types of workloads are very different and many fundamental physical optimizations are contradictory. In order to reduce the main-memory footprint in high-performance hybrid OLTP and OLAP databases while keeping high query performance and transactional throughput, they introduce a novel data format called Data Blocks, which compresses columnar storage for cold data. Making a hybrid DBMS is an important issue because it is easier if people can store their data in one place while supporting both OLTP and OLAP workloads; it reduces the cost of maintenance and the unnecessary data replication from an OLTP engine to an OLAP engine. Next, I will summarize the crux of this paper from my point of view.

First of all, they discuss the Data Blocks model: self-contained containers that store one or more attribute chunks in a byte-addressable compressed format. Data Blocks are used to conserve memory while retaining high OLTP and OLAP performance; the format also allows scans and point accesses on compressed data and addresses the challenge of integrating multiple storage layout combinations in a compiling tuple-at-a-time query engine by utilizing vectorization. As mentioned in the paper, the format also has several pleasant features, such as high compression ratios, byte-addressable-only compression, a SIMD approach for SARGable scan restrictions, and the adoption of SMAs and PSMAs. In order to integrate multiple storage layouts, they combine their Data Blocks with a JIT-compiling query engine optimized for OLTP and OLAP. In HyPer, they use a data-centric compilation approach that compiles relational algebra trees to highly efficient native machine code using the LLVM compiler infrastructure. Through experiments, they show that their method outperforms others; the performance of Data Blocks + PSMA is very impressive.

There are several advantages to their method. First of all, they use a light-weight compression scheme in their Data Blocks, which is a good choice for balancing the OLTP and OLAP workloads: as they say in the paper, the design keeps positional access on compressed data cheap for OLTP, while for OLAP it allows scan-based workloads to reap the benefit of early filtering without losing the benefit of sparse tuple decompression. The second contribution is the combination of the JIT-compiling query engine and vectorization in the context of Data Blocks, which is very efficient; they also use vectorized scans to handle ad-hoc queries and transactions. Besides, for predicate evaluation, they apply automatic generation of SIMD instructions, which is optimized and efficient. Also, they propose novel light-weight PSMA indexes which narrow the scan range within a block to speed up the search. To sum up, it is really a nice paper and I learned a lot from it.

There are some downsides to their method. First of all, since they partition their data into cold and hot parts, there are definitely some overheads for maintaining this; for example, if the user wants to modify some frozen data, the system needs to move it to the hot region, and these extra operations may reduce the performance of the system. Besides, they mention in the paper that Data Blocks can be designed with a secondary-storage solution; I wonder about the use-case scenario of such a product in industry. When do people need to do OLTP and OLAP in one system simultaneously? I want to know whether this kind of product can succeed in the market.



Review 17

This paper introduces a storage technique used for hybrid OLAP / OLTP database systems. The main contribution is the “Data Block”, which is a unit of compressed columnar storage that is meant to be used for cold data (as opposed to hot data). Data Blocks are immutable and avoid heavy compression, so that individual tuples can still be accessed by OLTP workloads and so that the system avoids bit-unpacking, which is an expensive operation. Hot data is left uncompressed, as it is usually accessed by OLTP transactions. Query execution is a variation of vectorized query execution that feeds into LLVM-based JIT compilation, which can leverage SIMD-optimized algorithms for scans. Special indexes on compressed data also allow for increased scan performance.
The biggest strength of the paper is the idea that one database can deal with both OLTP and OLAP workloads by using these Data Blocks. Compressed data can still be leveraged without severely hurting OLTP performance. Also, the high-level concepts of Data Blocks were very intuitive and well explained.
There are two weaknesses I can think of for the paper. The first is that I’m not convinced that a hybrid database for OLTP / OLAP workloads is very necessary. I don’t have empirical evidence for this, but it seems intuitive to me that it should be relatively easy to know whether your workload will be OLTP or OLAP, so you can choose a database that is heavily optimized for one or the other. Also, the paper’s explanation of the query execution / JIT compilation was confusing to me; there seemed to be a lack of consistency at times, with the paper referring to it sometimes as tuple-at-a-time and sometimes as vectorized query execution, and I’m still not 100% sure that I understand this section.