Review for Paper: 27-C-Store: A Column-oriented DBMS

Review 1

Conventional RDBMSs were designed largely for OLTP, which means a write-heavy workload. Typical databases use lock-based concurrency control and store data row-wise, where each row of a table is stored as a unit. This architecture is not ideal for databases that are read-mostly, such as OLAP data warehouses. If data is periodically updated by bulk load and otherwise read mostly, a different kind of database can provide better performance. In particular, a read-optimized database should use optimistic concurrency control and store data in a layout that makes reads fast.

“C-Store: A Column-oriented DBMS” proposes a column store DBMS and presents early results from its implementation. The C-Store design differs from traditional databases in that it stores data column-wise, and also stores redundant “projections,” which are similar to materialized views. This allows C-Store to answer queries that scan entire columns of a table more quickly than a row store could. To see this, consider a query like “SELECT SUM(Sales) FROM Table1.” In a row store, the entire Table1 table would have to be loaded in memory to run this query. In a column store, however, the DBMS could scan just the Sales column without loading the rest of the columns from disk. The trade-off is that when a new tuple is added to Table1, the column store would have to insert a value into each column separately.
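The scan-versus-insert trade-off described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table contents, the column names, and the function names are hypothetical, and "fields read" is a crude stand-in for disk I/O, not a measurement of C-Store itself.

```python
# Hypothetical three-column table; names and values are illustrative only.
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 75},
]

def sum_sales_row_store(rows):
    """Row layout: each row is fetched as a unit, so every field is touched."""
    fields_read = 0
    total = 0
    for row in rows:
        fields_read += len(row)
        total += row["sales"]
    return total, fields_read

# Column layout: one contiguous list per attribute.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def sum_sales_column_store(columns):
    """Column layout: only the 'sales' column is touched."""
    sales = columns["sales"]
    return sum(sales), len(sales)

def insert_column_store(columns, new_row):
    """The cost of the layout: one separate append per column."""
    for name, value in new_row.items():
        columns[name].append(value)
    return len(new_row)  # number of separate column writes
```

Both scans return the same total, but the row-wise version touches nine fields where the column-wise version touches three. The insert, in turn, needs three separate appends in the column layout.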

The paper presents a clever implementation, which is highly available and fast for reads, yet also allows writes at any time. C-Store has a writeable store that allows small amounts of data to be written via UPDATE or INSERT commands. A larger read-optimized store holds most of the database. Periodically, something called the tuple mover transfers data from the writeable store to the read-optimized (column) store. Timestamps are used to provide snapshot isolation, a form of optimistic concurrency control that fits well with the design goals of C-Store.

One shortcoming of the C-Store paper is that its design does not break much new ground, although the combination of features in C-Store seems appropriate for a read-mostly transactional database. Other column stores like Sybase IQ existed before C-Store, as noted in the paper. C-Store does add a few novel features, such as densely packing data and using compression to make it feasible to store multiple sorted versions of some columns.


Review 2

This paper focuses on C-Store, a database management system that stores data in a column orientation. The authors wanted to build a read-optimized database system instead of the then-prevalent write-optimized systems. It caters to ad-hoc queries, which form a major part of transactional systems along with other inquiry-related systems.

Storage in C-store uses only materialized views. The columns are compressed to save space. Values are not aligned in storage, and big disk blocks are used. It is optimized for cluster computing and relies mainly on sorting rather than indexing. The shared-nothing architecture uses horizontally partitioned segments that store record numbers. It has different encoding schemes for different columns, based on ordering and value distribution. Enough projections are stored redundantly to allow recovery from elsewhere in the network. It has an automatic physical DBMS design tool that accepts a training set of queries and a space budget, chooses the projections and join indices automatically, and re-optimizes periodically based on a log of interactions. It has its own query optimizer and executor. Transactional updates are made for error corrections. It uses a hybrid store consisting of a write-optimized store and a read-optimized column store to reduce the delay between the OLTP system and the warehouse.

The paper is successful in highlighting the design and implementation of C-store. Its read-optimized environment coupled with cluster parallelism provides better performance than the then-prevalent DBMSs. The argument is supported by relevant performance comparisons against row-store DBMSs.

The paper doesn’t provide insight into how it works on OLAP-based systems. The recovery process takes too much time to reconstruct lost data.



Review 3

Most major DBMS vendors implement record-oriented storage systems, where the attributes of a record (or tuple) are placed contiguously in storage. With this row store architecture, a single disk write suffices to push all of the fields of a single record out to disk. Hence, high performance writes are achieved, and we call a DBMS with a row store architecture a write-optimized system. These are especially effective on OLTP-style applications. In contrast, systems oriented toward ad-hoc querying of large amounts of data should be read-optimized. Data warehouses represent one class of read-optimized system, in which periodically a bulk load of new data is performed, followed by a relatively long period of ad-hoc queries. Other read-mostly applications include customer relationship management (CRM) systems, electronic library card catalogs, and other ad-hoc inquiry systems. In such environments, a column store architecture, in which the values for each single column (or attribute) are stored contiguously, should be more efficient. In this paper, the authors discuss the design of a column store called C-Store that includes a number of novel features relative to existing systems.

1. A hybrid architecture with a WS component optimized for frequent insert and update and an RS component optimized for query performance.
2. Redundant storage of elements of a table in several overlapping projections in different orders, so that a query can be solved using the most advantageous projection.
3. Heavily compressed columns using one of several coding schemes.
4. A column-oriented optimizer and executor, with different primitives than in a row-oriented system.
5. High availability and improved performance through K-safety using a sufficient number of overlapping projections.
6. The use of snapshot isolation to avoid 2PC and locking for queries.

I think a column store is similar to bitmap and projection indexes, which have been around for a long time. Updating a record is quite complicated, which might limit the applications of a column-oriented database.



Review 4

This paper presents a new relational DBMS called C-Store, which is read-optimized, in sharp contrast with most current write-optimized systems. In short, C-Store is a column-oriented DBMS that is architected to reduce the number of disk accesses per query, and it is designed to achieve high performance on warehouse-style queries and reasonable speed on OLTP-style transactions. The paper first presents the data model implemented by C-Store. Then it introduces some design details of C-Store, including the RS portion, the WS component, and the allocation of data structures. Next comes a discussion of the query optimizer and executor, as well as how C-Store performs in comparison with a row store system. Finally, it discusses related and future work.

The problem here is that traditional DBMSs use record-oriented storage systems in which the fields of a record (tuple) are placed contiguously in storage (row store architecture), which is designed to achieve high performance for writes. However, some applications need to be read-optimized, for example ad-hoc querying of large amounts of data, customer relationship management (CRM) systems, and electronic library card catalogs. Therefore, C-Store is designed to be read-optimized and to achieve high performance on those systems while retaining reasonable performance on write-oriented workloads.

The major contribution of the paper is that it provides design details and examples for the new C-Store design. By evaluating performance on a traditional benchmark, TPC-H, it shows that C-Store is substantially faster than popular commercial products. Here we summarize the key elements of C-Store:

1. storage of data by column (rather than by row)
2. hybrid architecture with a WS component (for frequent insert/update) and an RS component (for query performance)
3. redundant storage of elements of a table in several overlapping projections
4. heavily compressed columns
5. column-oriented optimizer and executor
6. high availability and improved performance through K-safety
7. use of snapshot isolation to avoid 2PC and locking for queries

One interesting observation: this paper is very innovative, and the idea of storing data by column instead of by row is very interesting. The performance evaluation provided in this paper shows that it performs very well on read-only queries. It might be better if it also provides performance analysis on queries with some writes, which can show the performance on traditional OLTP-style applications.


Review 5

In what seems to be a common trend, this paper introduces yet another radical database architecture reimagining, again spearheaded by the prolific Michael Stonebraker. C-Store is a shared nothing column oriented database, designed for OLAP workloads with a focus on high read throughput and low read latency. This is in contrast to traditional DBMSs which are instead optimized for write efficiency. Written in 2005, this paper borrows heavily from other ideas in the database space, but is the first design to present a comprehensive architecture for a column store system that leverages modern computational systems such as shared nothing designs.

The core of C-store is the use of a column oriented data model, which eschews traditional row indexing in order to provide better read access in cases where a user is accessing a small subset of columns over a large quantity of tuples. C-Store accomplishes this by using projections, which are sorted subsets of attributes of a table. The database may contain any number of projections, but each column in a table must be present in at least one. By using column based projections, C-Store offers faster read performance for queries that read data mainly from columns, as many ad-hoc queries tend to do in analytic processing. For example, it’s more likely for a client to request the first names of all clients than it is to request a client by name when doing data analysis. The paper gives an overview of other parts of the system, including a look at the snapshot isolation supported by the DBMS and the performance enhancements such as column compression and column-oriented optimizers and executors. One component of interest is the hybrid architecture for reading and writing, which consists of two separate storage utilities - one for reading and one for writing. The other interesting discussion is that of k-safety and the implementation of a shared nothing architecture. The paper ends with performance analysis (which is eye opening if taken at face value) as well as a look at systems that inspired C-Store.

Strengths:
C-Store’s use of column based projections offers an innovative solution to column based storage. While offering the obvious advantages of fast read access of OLAP workloads, the key insight of these projections is the use of overlap to facilitate k-safety. Because each projection (which is in essence a restricted materialized view) contains data that is non-dependent on the other projections, a given column can be represented multiple ways. By doing this the database can contain redundant columns and allow K nodes to fail while staying operational.

The use of read stores and write stores is another clever innovation. C-Store compromises between read performance and writability by offering a small column store for writes to take place in. This store can be thought of as a staging area, or buffer for writes. Stored in memory, this buffer is eventually written to the database. By offering only snapshot isolation, the potential for lock contention and isolation failure is reduced with this eventually consistent system. This hybrid is especially useful because the column store requires many operations to do small updates.

Weaknesses:
As with many shared nothing systems, the problem of how to distribute redundant data across nodes is a hard one. Determining the physical layout in C-Store requires the database administrator to maintain K-safety while also keeping performance optimal. This is devastating for a real world system, and an automated tool would be required for this system to function in a real world scenario. Another glaring weakness is the inability to operate efficiently on workloads unforeseen by the installer. If there is a scenario where we must do a large write, we will be horribly inefficient. In the same way, executing a transactional processing workload will cause the system to slow dramatically.



Review 6

This paper attempts to solve a problem with traditional database systems - most of these systems are optimized for writes, because they were designed for OLTP workloads. To optimize for writes, it makes sense to store everything for one tuple contiguously - then all updates to the tuple can be reflected with one disk write. However, there is also wasted space in this arrangement - values are padded to word or byte alignments, so that the CPU does not need to move them to work on the data. These design decisions were made decades ago. Moore's law has led to an exponential increase in CPU speeds, but much lower increases in disk speed. Because of this, it is now worth spending many more CPU cycles if doing so can save us a disk access.

Because of the hardware architecture changes over the past several years, a software architecture change was also required. The most dramatic change that this paper proposes is to store data in columns, as opposed to rows. This means that in a table with three columns (id, name, salary), all of the ids are on disk, then all the names, then all the salaries. This can dramatically improve performance if we are doing a read and don't need all the columns in the table. While a row store would have to read all the data from disk and then throw out what it didn't need, a column store can scan only the columns that it needs. Aggregates, such as an AVG() on the salary column, would also be much faster. However, this does mean that a write would require three pages to be written to disk. The system also uses compression to minimize the number of disk pages that must be read, which imposes a CPU overhead to decompress - or the system can attempt to work directly with compressed data.
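To make "working directly with compressed data" concrete, here is a minimal Python sketch. It assumes run-length encoding over a sorted salary column, which is chosen here purely for illustration and may differ from the encodings the paper actually uses. The function names and the sample values are hypothetical.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def avg_on_runs(runs):
    """Compute the average from the runs without expanding them."""
    total = sum(value * length for value, length in runs)
    count = sum(length for _, length in runs)
    return total / count

# Sorting first makes equal values adjacent, so the runs are as long as possible.
salaries = sorted([50, 60, 50, 70, 60, 50])
runs = rle_encode(salaries)
```

The six salaries compress to three runs, and the AVG() is computed from those three pairs rather than from six decompressed values. The saving grows with the length of the runs, which is why sort order and compression are closely tied in this kind of design.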

In addition to these differences, this system runs on a cluster. Because of that, it must keep replicas of data. However, unlike other systems, which take a hit on performance as a result, this system uses the replicas as an advantage. While it has to keep replicas of all the data, it does not have to replicate the order - the data could be sorted one way at one site, and the other way at another.

The authors discuss many of the technical issues in creating their cluster-column store, such as separate RS and WS systems, join indexes, snapshot isolation, concurrency control, and recovery. The paper does a great job of explaining all of these technical details. The performance gains obtained by C-store, even in comparison to another columnar store database, are impressive.

The system was obviously very successful - after finishing the open source version, Stonebraker and his colleagues went on to found Vertica, which they then sold to HP for a ton of money.



Review 7

The paper is motivated by the observation that most current database systems are write-optimized, while a read-optimized relational DBMS is also needed. C-Store is the system built to solve this problem. The traditional write-optimized system has a row store architecture and is effective on OLTP applications, while a read-optimized system uses a column store and is oriented toward ad-hoc querying of large amounts of data. It is frequently used in read-mostly applications. In the column-store architecture, the database can read just the column that holds the needed values. C-Store provides snapshot isolation for high performance, and strict two-phase locking concurrency control for serializability of read-write transactions.

The weakness of the paper is that the performance analysis is too rough, which makes it not convincing enough.


Review 8

This paper introduces C-Store, a column-oriented DBMS that is read-optimized as opposed to write-optimized. Most traditional DBMSs focus on write optimization, but the diversity of workloads we see today presents a case for a read-optimized system. The main factors that distinguish C-Store from other DBMSs are:

1) storage of data by column rather than by row,
2) careful coding and packing of objects into storage during query processing,
3) storing overlapping collection of column-oriented projections instead of using indexes and tables,
4) high availability and snapshot isolation for transactions,
5) use of bitmap indexes to complement B-trees

While C-Store is an interesting concept and the details of the implementations are well outlined, the paper is weak on evaluation of the technique. The analysis of storage and runtime performance of C-Store feels quite weak and looks at a fairly limited set of workloads.



Review 9

This paper introduces a column-oriented relational DBMS, C-store, which is read-optimized, in contrast to row-oriented relational DBMSs, which are optimized for writes. C-store stores data as projections. For each relation, it stores several projections, each containing one or more attributes. In addition, those projections are stored in a compact form. C-store has two main components: the read store (RS), which is read-optimized, and the write store (WS), which also stores relations as columns but does not compact them and has an associated B-tree index to make writes faster. A tuple mover keeps the RS up-to-date by moving recent updates from WS to RS in the background.

C-store shows a great advantage in read performance. C-store stores all the relations by column, which not only saves much disk space but also increases the effective disk throughput, since the compact storage allows more useful data to be read per disk block. In addition, a given query may not need to access all the attributes of a tuple. Hence, the query can be sped up, as it only needs to read a projection of the relation. Moreover, one attribute can show up in several projections of a relation, which gives high availability and throughput in a distributed system with less disk usage than the replication approach in row-oriented relational DBMSs. The more compact data can also fit in main memory more easily.

C-store has good fault tolerance. As mentioned above, an attribute in a relation can exist in several projections. This lets it tolerate at most k - 1 failures, where k is the number of projections that contain this attribute. In addition, there is a copy of a relation stored in RS and in WS, which naturally helps fault tolerance. Meanwhile, the projections can even be horizontally partitioned efficiently. Each segment is assigned a segment id and key ranges for fast lookup. This approach gives high availability with less space usage.

Though C-store shows good performance in practice, there are still some potential drawbacks:
1. There are more disk seeks to insert a row, especially for incremental insertion. To insert a row, it has to insert an entry in each projection. Updates are in a similar situation. In an OLTP workload, the updates and insertions may become a bottleneck for the overall performance of C-store.
2. For a lightweight query, high disk bandwidth may not make it faster, and more CPU cycles are needed to decode the compact raw data read from the disk. This can potentially make some queries even slower. Batch processing might be a solution to this.
3. C-store has an overhead in the tuple mover, i.e., an overhead to keep the RS up-to-date. Though the tuple mover runs in the background, it may consume many resources if RS is required to closely catch up with WS, which might slow down the whole system.


Review 10

Problem/Summary

Many RDBMS are write-optimized. Because of this, these databases store records contiguously on disk, so that updating a record will only require a write to a single place on disk. However, this practice can be non-optimal for read-only queries that only are interested in a few attributes of each record. In this case, storing the columns for these attributes contiguously would be more efficient. This paper introduces a database called C-store, which uses this idea.

C-store does not store whole tables. Rather, it stores multiple projections of tables - that is, it stores groups of columns together. Deciding which projections to choose is analogous to physical design in a traditional database. Each projection is sorted on one or more of the columns in the projection, decided by the user. The projections are then horizontally partitioned into segments which can reside on different nodes. Join indexes map the ids of one projection onto the ids of another, so that the original table information is not lost.
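A small Python sketch may help show how two differently sorted projections can still be stitched back into the original rows. Everything here is an assumption made for illustration: the sample table, the helper names, and the representation of the mapping as a plain list of positions. It ignores segmentation across nodes entirely.

```python
# Hypothetical logical table; row position doubles as the row id.
table = [
    {"name": "Ann", "dept": "ops", "salary": 70},
    {"name": "Bob", "dept": "dev", "salary": 50},
    {"name": "Cy", "dept": "dev", "salary": 90},
]

def build_projection(table, attrs, sort_key):
    """Return the chosen columns sorted on sort_key, plus the row-id order."""
    order = sorted(range(len(table)), key=lambda i: table[i][sort_key])
    cols = {attr: [table[i][attr] for i in order] for attr in attrs}
    return cols, order

p1, order1 = build_projection(table, ["name", "dept"], "dept")
p2, order2 = build_projection(table, ["salary"], "salary")

# Mapping from each position in p1 to the position of the same row in p2.
pos_in_p2 = {row_id: pos for pos, row_id in enumerate(order2)}
join_index = [pos_in_p2[row_id] for row_id in order1]

def reconstruct(p1, p2, join_index):
    """Rebuild full rows in p1's sort order by following the mapping."""
    return [
        (p1["name"][i], p1["dept"][i], p2["salary"][j])
        for i, j in enumerate(join_index)
    ]
```

Here p1 is ordered by dept and p2 by salary, so neither order matches the other. The position mapping is what allows a query that needs both dept and salary to combine the two projections.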

C-store is split into a read component RS and a write component WS. RS is compressed, which allows for even less disk I/O, and querying is optimized for working on compressed data. WS is not compressed, so that records can be inserted or updated more quickly. Once in a while, a Tuple Mover will propagate changes made in WS to RS. To ensure isolation, read-only transactions are run using snapshot isolation, so that they only see records that exist at the logical time that they run.
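The visibility rule behind "they only see records that exist at the logical time that they run" can be sketched as follows. This is a simplified model, not the paper's data structures: the Record class, its field names, and the use of integer logical times are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    value: str
    inserted_at: int                  # logical time of insertion
    deleted_at: Optional[int] = None  # logical time of deletion, if any

def visible(record, snapshot_time):
    """A record is visible if it was inserted at or before the snapshot
    time and has not been deleted at or before that time."""
    if record.inserted_at > snapshot_time:
        return False
    return record.deleted_at is None or record.deleted_at > snapshot_time

def read_snapshot(records, snapshot_time):
    """Return the values a read-only query would see at snapshot_time."""
    return [r.value for r in records if visible(r, snapshot_time)]

records = [
    Record("a", inserted_at=1),
    Record("b", inserted_at=2, deleted_at=4),
    Record("c", inserted_at=5),
]
```

A reader at time 3 sees "a" and "b", while a reader at time 5 sees "a" and "c". Neither reader takes a lock, because each one only filters on timestamps that writers have already recorded.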

C-store achieves better performance for read-mostly workloads. Storing columns allows less non-pertinent data to be read, and storing multiple projections allows columns to be pre-sorted on certain attributes, which speeds up retrieval. Compression also allows disk I/O to be traded for CPU cycles, which increases performance.

Strengths:

This paper is very good at explaining the reasoning behind its design decisions. The authors recognized a need that was not covered by traditional databases.

Weaknesses:

In many places, the authors say that C-store is not complete. In fact, some of the features and ideas discussed are not even implemented yet. Why write a paper before these ideas are confirmed to work? This detracts from some of the credibility of the performance tests.



Review 11

The paper describes C-Store, a column-oriented DBMS. While the concept of a column-oriented DBMS was already discussed in the early days of DBMSs, C-Store successfully combines different features to make an efficient column-oriented DBMS. Later, C-Store was commercialized as Vertica, which was then acquired by HP.

C-Store combined numerous features to implement an efficient column-oriented DBMS. In order to support column-oriented storage, it uses separate stores for writes and reads (WS and RS), which are optimized for frequent updates and query performance respectively. The system also takes advantage of the fact that in a column store the data from the same attribute is stored contiguously on disk. Because of this, it can compress data better than a row store, where each row has heterogeneous attributes. With operators that can work on compressed data and the implementation of snapshot isolation to avoid locking, the paper claims that C-Store performs substantially better than commercial row store and column store databases.

As the paper itself mentions, its main strength is the combination of innovative features to create a better-performing system in a less explored problem domain. It is clear to readers that most of the features included in C-Store and discussed in the paper are necessary for its performance benefits. I suspect that if any one of these features were taken out, performance would degrade significantly. Without “projections” that redundantly store the needed data, or without the compression schemes and the operators that work on compressed data, C-Store would not have been able to achieve the performance reported in the paper.

I was slightly disappointed that the experiments only included results from a single-node setup. This is especially so because the authors emphasize a “grid” computing environment in earlier sections and how well C-Store fits it, yet the evaluation is restricted to a single-node system. The paper explains the architecture of C-Store in detail and does so well, and I have no complaints there. I just wanted to see how well it works in a grid environment, but the paper fails to provide an evaluation in that regard.

In conclusion, C-Store is an efficient column-oriented DBMS that combines many innovative features to improve performance over commercial row-store and column-store database systems. I think it is remarkable that the authors have successfully integrated the features discussed in the paper into a system that works so effectively, yet I am also slightly disappointed that the paper does not evaluate the system in a grid environment.



Review 12

This paper introduces C-Store. C-Store differs from traditional databases in that it stores data in a column rather than a row representation. The motivation behind C-Store is to support read-heavy workloads. C-Store supports normal transactions and makes an effort to compress data when possible. It splits data into two components: a write store and a read store. The write store is responsible for tracking any changes that are made to the database. An update is split into a delete and an insert. The read store is typically orders of magnitude larger than the write store and contains all the data that can be viewed. C-Store uses snapshot isolation as its isolation mechanism. To propagate changes from the write store to the read store, C-Store contains a tuple mover. The tuple mover moves inserts from the write store into the read store and removes records marked as deleted from the read store.

One of the key ideas of the paper is the compression of the read store. The paper notes that within a column there may be many duplicate values, such as a name appearing more than once. It is a waste of storage to keep every instance of the value. To handle this, C-Store compresses the repeated value into a single representation. C-Store also keeps join indices so that data split across multiple projections can be rejoined without loss.
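
One way to picture the compression described here is run-length encoding over a sorted column, where each distinct value is stored once together with where its run starts and how long it is. The following is a minimal sketch under that assumption; the function names and the 0-based positions are illustrative choices, not taken from the paper.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse runs of equal values into (value, start_position, run_length) triples."""
    triples = []
    position = 0
    for value, run in groupby(column):
        length = sum(1 for _ in run)
        triples.append((value, position, length))
        position += length
    return triples

def rle_decode(triples):
    """Expand the triples back into the original column."""
    column = []
    for value, _start, length in triples:
        column.extend([value] * length)
    return column

def count_equal(triples, target):
    """Answer COUNT(*) WHERE col = target without decompressing the column."""
    return sum(length for value, _start, length in triples if value == target)

names = ["Ann", "Ann", "Ann", "Bob", "Bob", "Cy"]
encoded = rle_encode(names)
print(encoded)
print(count_equal(encoded, "Ann"))
```

The `count_equal` helper hints at why operating directly on compressed data is attractive: the answer comes from three triples rather than six values.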

From my understanding, C-Store's join mechanism uses an idea similar to pointers, in that the join indices point to where information is stored for a tuple. An interesting extension of this idea would be to allow flexible schemas. Current relational databases do not support changing a schema dynamically. Suppose a schema consisted of (name, age, gender) and a new row was added with the fields (name, age, gender, salary). I think C-Store, with its pointer-like setup, could easily support this. It would require a new join index to be created for the new row, which would point to the new field. The difficulty in supporting this idea would be figuring out how to avoid creating unnecessary fields. For example, inserting (name, age, gender, salary), (name, age, gender, pay), (name, age, gender, city) should only result in creating two new fields, but it might be difficult to determine whether salary and pay are the same field. It also might be difficult to tell whether salary and city are really different fields, as opposed to the user having provided a string where an int was intended.


Review 13

The DBMSs we have seen so far in class are all row-oriented; this paper proposes C-Store, a column-oriented DBMS. Traditional row-oriented DBMSs are write-optimized for OLTP-style applications, since it is easier to insert or update a record. However, data warehouse workloads often consist of intensive reads, and not all attributes of each record are useful for such queries. A column-oriented DBMS is read-optimized: large data sets can be stored more densely, and queries avoid reading irrelevant attributes.

This paper proposes C-Store, a refined column-oriented DBMS. It introduces the data model of a column-oriented DBMS and then describes the architecture and concurrency control of C-Store. C-Store keeps several projections for each logical table. A projection is a set of columns anchored on a logical table T and can overlap with other projections. The records in a projection are sorted by a column (the sort key) and can be partitioned into several segments covering different ranges of the sort key.

To reconstruct a logical table, we need a covering set of projections as well as "storage keys" and "join indexes". C-Store assigns a storage key to each row in a segment, which serves as the identifier when joining projections. A join index is a mapping from one segment to another segment across different projections.
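
The reconstruction step can be sketched with two projections of one small table, each sorted on a different attribute. This is an illustrative model only: it assumes a single segment per projection, uses the ordinal position within a projection as the storage key, and the table contents and function names are made up.

```python
# Logical table EMP(name, age, salary); the data is hypothetical.
ROWS = [("Bob", 25, 10_000), ("Bill", 27, 50_000), ("Jill", 24, 80_000)]

def build_projections(rows):
    """Build P1 = (name, age) sorted by age, P2 = (salary) sorted by salary,
    and a join index mapping each storage key in P1 to a storage key in P2."""
    by_age = sorted(range(len(rows)), key=lambda i: rows[i][1])
    by_salary = sorted(range(len(rows)), key=lambda i: rows[i][2])
    p1 = [(rows[i][0], rows[i][1]) for i in by_age]
    p2 = [rows[i][2] for i in by_salary]
    # Storage key = position of the logical row inside P2.
    key_in_p2 = {row_id: sk for sk, row_id in enumerate(by_salary)}
    join_index = [key_in_p2[row_id] for row_id in by_age]
    return p1, p2, join_index

def reconstruct(p1, p2, join_index):
    """Follow the join index to stitch full records back together, in P1 order."""
    return [(name, age, p2[join_index[sk]]) for sk, (name, age) in enumerate(p1)]

p1, p2, join_index = build_projections(ROWS)
print(join_index)
print(reconstruct(p1, p2, join_index))
```

The result comes back in the sort order of the first projection, which is why the choice of anchor projection matters for a query that needs a particular ordering.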

In C-store, there are two major components: Writable Store (WS) and Read-optimized Store (RS), and a tuple mover that serves as the bridge between the two. WS is a smaller component. It shares the same physical design as RS but is updatable. RS is the major storage component. This paper gives some data compression schemes for different types of data.

The concurrency control in C-Store also takes advantage of its architecture. It adopts snapshot isolation for read-only transactions. It also uses strict two-phase locking on data objects and manages a distributed lock table for read-write transactions. If the system is configured to be K-safe, C-Store can recover as long as no more than K sites fail within time t.

Contribution:
This paper proposes a hybrid architecture for a column-oriented DBMS. The data compression schemes for RS and the ability to process queries directly over compressed data are the major reasons why C-Store can use less space while incorporating column redundancy to achieve better performance. The experiments show that C-Store outperforms not only a row-oriented DBMS but also a commercial column-oriented DBMS.

Drawbacks:
1. The authors have not yet implemented WS and the tuple mover in their system. As a result, the experiments they conduct are confined to specific types of queries and are thus incomplete.
2. The experiments show that C-Store outperforms a commercial column-oriented DBMS. However, there is no explanation of the differences between C-Store and that commercial system, or of the cause of the performance gap.



Review 14

Motivation for C-Store

C-Store is important because, unlike most DBMSs, which are write-optimized, it is read-optimized. A read-optimized DBMS is good for ad-hoc querying systems with large amounts of data. Warehouse databases, for example, involve aggregates performed over a large number of data items. With C-Store, the DBMS only needs to read the values of the columns required for processing the given query, since the values of a single column are stored contiguously, and can thus avoid bringing irrelevant attributes into memory. This paper describes the implementation and execution of C-Store.
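
The I/O argument in this paragraph can be illustrated with a toy in-memory model. This is only a sketch: the table, the column names, and the helper functions are hypothetical, and counting the values touched is a crude stand-in for disk reads.

```python
# Toy table; column names and data are made up for illustration.
COLUMNS = ("order_id", "customer", "region", "sales")
ROWS = [
    (1, "acme", "east", 120),
    (2, "bolt", "west", 80),
    (3, "acme", "east", 200),
]

def to_column_store(rows, names):
    """Pivot row tuples into one contiguous list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def sum_row_store(rows, names, target):
    """Row layout: every field of every row is fetched to reach one attribute."""
    idx = names.index(target)
    touched = sum(len(row) for row in rows)
    return sum(row[idx] for row in rows), touched

def sum_column_store(columns, target):
    """Column layout: only the requested column is scanned."""
    values = columns[target]
    return sum(values), len(values)

row_total, row_touched = sum_row_store(ROWS, COLUMNS, "sales")
col_total, col_touched = sum_column_store(to_column_store(ROWS, COLUMNS), "sales")
print(row_total, row_touched)
print(col_total, col_touched)
```

Both layouts return the same aggregate, but the column layout touches one value per row instead of one value per field, and the gap widens as tables get wider.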


Details about C-Store

C-Store stores data by column rather than by row. It also packs objects carefully into storage, including main memory during query processing, and stores an overlapping collection of column-oriented projections rather than the tables and indexes of a traditional DBMS. C-Store implements transactions with high availability and provides snapshot isolation for read-only transactions. The DBMS uses bitmap indexes extensively to complement B-tree structures. C-Store also spends CPU cycles to save disk bandwidth by coding data elements into a more compact form and densepacking values in storage, giving a column store compressibility advantages over a row store. C-Store contains a collection of columns, each sorted on some attribute, and groups of columns sorted on the same attribute are referred to as “projections.” C-Store is horizontally partitioned across the disks of various nodes in a “shared nothing” architecture. Columns in the read-optimized store are compressed using one of four encodings: self-order, few distinct values (clustered B-tree indexes over the value fields; the index can be densepacked because there are no online updates, so its height can be kept small); foreign order, few distinct values (run-length encoded to save space, with offset indexes, which are B-trees that map positions in a column to the values contained in that column); self-order, many distinct values (densepacked B-trees at the block level are used to index the coded objects); and foreign order, many distinct values (values are left unencoded). Join indexes are used to connect the various projections anchored at the same table. Snapshot isolation is provided by turning updates into an insert and a delete. Concurrency control is implemented with strict two-phase locking, and deadlocks are resolved via timeouts by aborting one transaction. Recovery happens by running queries against other projections.

In the experimental results, C-Store is much faster than the row store because the column representation avoids reads of unused attributes, storing overlapping projections rather than the whole table allows multiple orderings of a column to be stored, better compression of data allows more orderings in the same space, and query operators working on the compressed representation mitigate the storage barrier problem of current processors.


Strengths of the Paper

I enjoyed reading the paper because it introduces a very interesting DBMS with very significant performance benefits. I liked that, rather than creating a dataset of their own, the authors test with the gold-standard TPC-H benchmark, making their results more credible and less biased.


Limitations of the Paper

I felt that the paper could have done a better job of explaining why the encoding schemes are divided the way they are. Why are they divided based on self-order/foreign order rather than sparse/dense data, for example? I would’ve also liked to see examples of workloads that a row store performs well on, such as OLTP, a comparison of C-Store’s performance on those workloads, and a discussion of the trade-offs of using C-Store. Lastly, I would’ve liked to see a discussion of whether data warehouses use C-Store now and, if so, how it is performing.



Review 15

This paper presents a DBMS architecture named C-Store, which differs from past architectures in that it stores data in columns rather than rows and is optimized for read operations rather than writes. The paper motivates the use of read-optimized systems by pointing to the prominence of data warehousing in recent years. The paper outlines its contributions and data model. It then discusses indexes and encoding schemes and how it maintains consistency. Lastly, the paper discusses how queries are executed in C-Store and concludes by restating its innovations.

The strengths of this paper come from its innovation and the performance increase for specialized applications. The paper presents a novel column store representation and architecture that stores information efficiently and allows an optimized shared-nothing environment to run queries and implement snapshot isolation. The paper motivates the use of this type of system well and offers intuitive examples and a thorough discussion of how such a system should be implemented. The paper evaluates the system using the TPC-H data set, which consists of business-oriented ad-hoc queries. C-Store shows superior performance to the row store and column store it is compared against, while also using substantially less space (roughly 40% of the space used by the row store).

The authors state that their write-optimized store and the tuple mover were not working at the time the paper was written, so they cannot be shown for comparison. This is disappointing and doesn’t give us the full picture of how this system would compare to other possibilities for data warehouse situations. The number of queries used in the evaluation was only seven. This is not a thorough way to evaluate the system. TPC-H contains many more queries than they chose to use. The authors could have simply chosen a few queries that worked well for their system. Though their results do show that their system is superior in every case, a more thorough evaluation and possibly an analysis of the differences in query performance within TPC-H would have made for a more convincing paper.



Review 16

Part 1: Overview

This paper presents C-Store, a new relational DBMS design that is optimized for read-heavy workloads. Despite the popularity of OLTP workloads, there are ad-hoc querying workloads over large-scale data for which the system should be read-optimized rather than write-optimized. C-Store provides high availability and snapshot isolation for read-only transactions. C-Store has a hybrid architecture that includes a write-optimized component, which helps it maintain good performance on OLTP-style updates. C-Store also stores overlapping projections of the data redundantly, which helps optimize read transactions. C-Store is column-oriented and equipped with a column-oriented optimizer and executor. Because snapshot isolation is implemented, C-Store can avoid 2-PC and locking schemes for read-only transactions, which reduces some overhead.

C-Store supports the standard relational logical data model, and SQL is the intended query language. To speed up access, C-Store implements projections. Locking-based concurrency control is used for read-write transactions. A NO-FORCE, STEAL policy is used for recovery, together with write-ahead logging. Distributed commit processing is used for large-scale data. There are ten node types in query plans, which accept operands and produce projections, columns, bitstrings, predicates, joins, attribute names and expressions.

Part 2: Contributions

Experiments were run on a TPC-H workload, and C-Store outperformed existing popular database systems.

The read-optimized store chooses its encoding scheme by whether a column is self-ordered or foreign-ordered and whether it has few or many distinct values, and the system provides snapshot isolation for read-only transactions.

Part 3: Drawbacks
Column oriented databases are faster at certain analytical operations. Queries on large datasets, which don't fit in memory, involving joins will be significantly faster. Conversely, column oriented databases are significantly slower at handling transactions. The advantages and disadvantages of column oriented databases only really become apparent at some degree of scale either in terms of data size or transaction volume. In practice they are most useful for data warehousing applications on "big" relational data.



Review 17

===overview===
In this paper, a read-optimized relational DBMS called C-Store is presented. It contrasts sharply with most current systems, which are write-optimized. There are many other differences, and the most important ones include:
1. storing data by column rather than by row
2. careful coding and packing of objects into storage while processing queries
3. storing the same column in multiple overlapping column-oriented projections
4. wide use of different data structures to improve availability, such as bitmap indexes to complement B-tree structures
5. a unique combination of different techniques

The design focuses on reducing disk access for reads while still supporting general updates. The paper also presents the data model implemented by C-Store and explains the two major parts of the design: RS and WS, which stand for Read-optimized Store and Writeable Store, respectively. They are connected by a high-performance tuple mover. The new storage and partitioning scheme changes how query optimization and locality must be handled, and the paper also addresses those problems.

In summary, C-Store provides a whole new view of how to optimize a database for read queries. At the end, the paper also shows that on TPC-H style queries, C-Store outperforms alternative systems. However, the authors also mention that the potential overhead of WS might be large. Even though the complete form of C-Store was not finished, the new perspective is inspiring.

===strength===
The paper provides some strong arguments for where column partitioning might be better than a traditional relational DBMS.

===weakness===
However, not many experiments are provided to show the potential. As the paper says, the project is not complete yet. Maybe we should wait for a more complete system and more experimental results.


Review 18

C-Store is a database based on a column-store architecture, with a set of design optimizations over existing column-store databases.

Major optimizations are:
1) Trade CPU cycles for disk bandwidth, because CPU cycles are abundant. C-Store uses compact storage that needs more CPU cycles to manipulate data but saves disk bandwidth.
2) A combination of a write-optimized store and a read-optimized store is used to improve write speed. Although queries then need to read “historical” data, this efficiently removes the write bottleneck. It is natural to support snapshot isolation in this case.
3) A specially designed column-oriented optimizer and executor are used to fit the storage strategy of C-Store, since column and row storage clearly differ.

Given the optimization directions mentioned above, the components described in the C-Store paper are not difficult to understand.

Besides the conceptual design, there are also important things to notice in the system itself. To support transactions and concurrency control in a system that has both RS and WS, the authors use timestamps and the Tuple Mover.

Contribution:
As the paper mentions, many of the individual ideas are not unique; it is their combination that makes C-Store interesting and unique. I also think the idea of supporting reads and writes using two different stores, RS and WS, and connecting them using the Tuple Mover is quite new and interesting.

Weakness:
I actually have questions about joins. If someone writes a query that uses several columns, I wonder how much time is spent on joining these columns.



Review 19

C-Store is a database system that, unlike other column stores, which are only read-optimized, supports write-optimized structures as well.

One of the key structures of C-Store is that it has a read-optimized column store and an update/insert-oriented writeable store connected by a tuple mover. C-Store implements redundancy using multiple projections of the data, with no actual base table. The authors ensure fault tolerance by introducing the property of K-safety, where K is the number of sites that can fail within the time to recover while the system still maintains consistency.
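
A simplified way to reason about K-safety is to require that every column survive on at least K+1 distinct sites, so that any K failures still leave one copy. The sketch below checks only that condition. It ignores join indexes and segment placement, which matter in the real system, and the site names and placement map are hypothetical.

```python
def tolerated_failures(column_sites):
    """Largest number of site failures after which every column still has a copy.

    column_sites maps a column name to the set of sites storing it
    (in any projection).
    """
    return min(len(sites) for sites in column_sites.values()) - 1

def is_k_safe(column_sites, k):
    """True if every column remains available after any k site failures."""
    return tolerated_failures(column_sites) >= k

# Hypothetical placement of three columns across four sites.
placement = {
    "name": {"site1", "site2"},
    "age": {"site1", "site3"},
    "salary": {"site2", "site3", "site4"},
}
print(tolerated_failures(placement))
print(is_k_safe(placement, 1), is_k_safe(placement, 2))
```

The least-replicated column sets the bound: here `name` and `age` each live on two sites, so the placement tolerates one failure but not two.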

They use snapshot isolation in this system, which is implemented with a high water mark (HWM) and a low water mark (LWM). The LWM is the earliest effective time at which a read-only transaction can run, whereas the HWM is the time at which the latest data was committed, so reads with an effective time at or before the HWM are safe. The data is stored in columns in a projection, and all columns in a projection are sorted on the same sort key, from left to right.
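
The role of the two water marks can be sketched as a window check on the effective time of a read-only transaction. This is a minimal illustration, assuming effective times are integer epochs; the function names are made up and the default of reading at the HWM is an assumption of the sketch.

```python
def valid_effective_time(et, lwm, hwm):
    """A read-only query may run at effective time et only if LWM <= et <= HWM."""
    return lwm <= et <= hwm

def choose_effective_time(lwm, hwm, requested=None):
    """Default to the freshest safe snapshot (the HWM) when no time is requested."""
    if requested is None:
        return hwm
    if not valid_effective_time(requested, lwm, hwm):
        raise ValueError(f"effective time {requested} is outside [{lwm}, {hwm}]")
    return requested

print(choose_effective_time(lwm=10, hwm=42))
print(choose_effective_time(lwm=10, hwm=42, requested=20))
```

Requests below the LWM are rejected because older history may no longer be retained, and requests above the HWM are rejected because writes at those times may still be uncommitted.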

One of the important aspects of this implementation is that the write store is also a column store and is mapped to the read store using the storage key, so this database can work well for all kinds of data loads. One of the authors' major contributions is the idea that, because data is stored in columns, it can be compressed to save a lot of storage, and for most operators the database system does not even have to decompress the data.

C-Store definitely saves space by compressing the data; however, since it has multiple projections and multiple indexes over those projections, I think storage could be an issue for huge databases. Also, considering that data is duplicated across the read store and write store, if many updates are made there can be overhead from propagating them to the read store.



Review 20

C-Store: A Column-oriented DBMS

This paper introduces a problem in database management system design: there is a trade-off between row-oriented and column-oriented storage architectures and their associated performance. Traditionally, DBMSs use a row-based storage design by default, because it is easy to operate on a logical record, especially when processing high-frequency updates. However, most traditional DBMSs using row storage suffer when executing ad-hoc queries, because so much unneeded data is read from disk and considerable time can be consumed by locking during concurrent execution.

C-Store presents a new DBMS architecture to support the execution of such ad-hoc queries. C-Store stores each table as a set of small projections, each containing only a few columns, and horizontally partitions each projection into segments. Columns are sorted by a predefined sort key in each projection, and compression methods are applied to further decrease the amount of data read for each column. The hybrid system consists of two parts, a read store and a write store, both of which use column-store techniques. The write store uses a B-tree on the storage key to allow fast insertion and deletion of individual column values, and the read store is fed by the tuple mover with data from the write store. In terms of concurrency control, the write store uses two-phase locking to protect the B-tree index, and the read store maintains an IV and DRV for each projection to help determine the high water mark and low water mark between which snapshot isolation can run.

To sum up, the advantage of the new column storage design is that it makes good use of the hybrid system to support both fast writes and quick reads. In the read store especially, more orderings of a projection can be stored thanks to compression and truncation. The compressed column data can greatly reduce disk access time since a memory storage approach is used. In addition, the use of snapshot isolation in the read store eliminates the wait time of lock-based concurrency control, which makes concurrent execution of multiple ad-hoc queries possible. Another notable merit of C-Store is that it uses the column duplication inherent in its design to achieve recovery without logging to disk-based persistent storage.

But the paper still has some weaknesses. First, the paper states that the tuple mover runs as a background task, but it never mentions how this affects concurrency control for the read store and write store, so readers may be concerned about a negative influence on performance. Second, the paper says that C-Store maintains K-safety for recovery, but it does not explain how to tune the value of K to adjust the overall fault tolerance to the user's needs. Last but not least, the trade-off made by the column storage design leaves the whole system vulnerable under workloads that are not read-dominated, or at least requires OLTP jobs to be handled on a different system, in which case the synchronization between the two may be worth considering.



Review 21

This paper describes a column-oriented DBMS that stores data by column instead of by row and is thus read-optimized. C-Store needs to read only the values of the columns required for processing a given query and thus avoids bringing irrelevant information into memory. It compresses the columns to save space.

Rather than relying on tables and indexes, C-Store physically stores a collection of columns sorted on different attributes. Each projection is horizontally partitioned into segments and stored in order of its storage keys. It needs multiple join indexes to reconstruct a base table. Because the columns have many replicas, it has enough projections to ensure K-safety and thus avoids the heavy load of redo from the log. It uses snapshot isolation to provide concurrency control. The paper then discusses the query executor and optimizer.

Strength:
The paper describes C-Store, a column-oriented DBMS, and how it differs from traditional databases in performance and architecture. It also discusses some optimizations in C-Store. The system is very suitable for read-mostly environments, but it is worse for inserts, updates, and selections of whole records (because the table must be reconstructed).

Weakness:
Because C-Store compresses the data, inserts and deletes can be hard to handle. For example, when many new distinct values arrive, the compression may no longer work for the current column, and the system would need to “re-hash” and change the whole physical storage. I think this is a very bad scenario for C-Store, but the paper doesn’t mention it.


Review 22

The paper presents a new column-store DBMS, C-Store. The purpose for building a column-store database is to optimize for a read-heavy workload. C-store supports both fast writes/updates and reads by having a small writeable store and a large read-optimized store for each column. Updates are batch moved from the writable store to the read store to minimize work. By focusing on a read-heavy workload, the authors are able to build a system that (without the writable store) performs faster than two commercial database systems (one row-store and one column-store) while using less disk space.

The main contribution of this work is in putting together a set of techniques to build a new system that provides substantial performance and storage gains over existing database systems. In particular, by making assumptions about the workload that the database will see, the authors can make optimizations that would otherwise be impossible.

Providing a workload-optimized database is useful and (as demonstrated) leads to impressive performance gains. However, the writeable store and tuple mover were not implemented in the system used in the performance comparison, so the overhead of this feature is unknown. Given that it is important to how C-Store performs under updates, a performance evaluation that executes occasional updates and has a writeable store is necessary. Additionally, nowhere in the evaluation was fault-tolerance tested. Failures are common on large clusters and commercial systems are likely to have mechanisms in place to allow for fast recovery. The paper touched on the issue of system recovery but it did not provide a performance comparison.


Review 23

This paper introduces a column-oriented DBMS called C-Store that can outperform many row-oriented DBMSs. In the past, relational DBMSs were all row-oriented, meaning that the rows of the tables were stored contiguously. This is optimal for writes because we will usually update one row or contiguous rows, or add rows to tables. However, workloads dominated by reads have become more and more common, so column-oriented DBMSs are becoming more common as well. With the C-Store implementation, we can get a significant improvement in query runtime compared to row-oriented DBMSs.

Since C-Store is designed to tolerate failures, we need redundancy in the database. With this redundancy, we can store the same column of the database in multiple different orders, which helps certain queries run faster. For example, we can store a column in alphabetical order and also store it ordered by whether the string has symbols. Both of these orderings may be useful in different situations. C-Store can also compress columns based on their ordering and number of distinct values to reduce the disk space used. There are four cases: self-order, few distinct values; foreign order, few distinct values; self-order, many distinct values; and foreign order, many distinct values.

Finally, we can compare this implementation with the row oriented DBMS. C-Store managed to use less space than row store and run queries a lot faster than row store. A 4.5GB database in row store took only 2.0GB in C-Store. Furthermore, C-Store is about 164 times faster than the commercial row store DBMS.

Overall, this paper does introduce a very useful column store DBMS that can outperform current row store DBMS, but I still have some concerns about aspects that were not mentioned in the paper:

1. When comparing C-Store to the row store DBMS, all that the authors mention is that it is a commercial row store DBMS. This DBMS could have the worst runtime of all row store DBMSs, which would make their numbers look better.
2. The test queries all had very few items in the SELECT part of the statement, which is optimal for a column store DBMS. I would have liked to see some queries that did SELECT * to see whether C-Store can still outperform a row store DBMS in that case.
3. The authors explained some encoding schemes that can reduce the space used, but they never explained how we would insert a value that would split up an encoding. Would there be a significant runtime hit when we want to insert using these encodings?


Review 24

This paper discusses C-Store, a column-oriented DBMS that is designed to support a “read-mostly” query workload. One of the primary benefits of a column-oriented database is that only the data in the columns needed to answer a query needs to be read in from disk. This avoids the additional work of reading in every field of every record involved in the query, which can be a substantial saving for tables that have large records.

C-Store also optimizes the storage space required for a database in several critical ways. First, column-aligned databases have no need to pad attributes to word boundaries on the disk, so data can be packed densely, making full use of disk capacity. Second, C-Store allows for several different types of encoding/compression of data, so storage capacity is further increased. Because C-Store is so space efficient, data can be replicated several times and still incur less storage cost than an equivalent row store DBMS. These replicas provide increased reliability and availability and can be sorted on different attributes, increasing the performance of C-Store across a wider range of queries.

Because C-Store is designed for “read-mostly” rather than “read-only” workloads, the creators decided to maintain transaction support for read/write queries. This is accomplished by maintaining both a small Writeable Store optimized for high performance inserts and updates as well as a larger Read-optimized Store. The architecture also includes a tuple-mover which uses a merge-out operation to move ordered sets of tuples from the write store to the read store efficiently.
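
The merge-out step can be pictured as a merge of two sorted runs that produces a fresh read-store segment. The sketch below is a simplified illustration, not the paper's algorithm: records are modeled as hypothetical (sort_key, record_id) pairs, deletions are given as a set of record ids, and timestamps are ignored.

```python
import heapq

def merge_out(rs_segment, ws_segment, deleted_ids):
    """Build a new sorted RS segment from the old one plus pending WS inserts.

    rs_segment is already sorted by sort key; ws_segment is sorted here first.
    Records whose id appears in deleted_ids are dropped during the merge.
    """
    ws_sorted = sorted(ws_segment, key=lambda rec: rec[0])
    merged = heapq.merge(rs_segment, ws_sorted, key=lambda rec: rec[0])
    return [rec for rec in merged if rec[1] not in deleted_ids]

old_rs = [(1, "a"), (4, "b"), (9, "c")]
pending_ws = [(7, "e"), (2, "d")]
new_rs = merge_out(old_rs, pending_ws, deleted_ids={"b"})
print(new_rs)
```

Writing a new segment rather than updating the old one in place keeps the read store sorted and densely packed, at the cost of rewriting data during each merge.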

My chief complaint about this paper is that much of it has not been tested. The Write Store and tuple mover were not actually implemented at the time the paper was written, so any claims made about the efficiency of these components amount to little more than speculation. While the authors were able to demonstrate marked improvements over existing databases in both storage space required and the latency of read-only queries, they have no evidence to support their claims that the write store and tuple mover will be able to deliver high efficiency for inserts and updates.



Review 25

This paper presents the design of C-Store, an architecture aimed at the read-mostly DBMS market. C-Store is a read-optimized relational DBMS in which data is stored by column rather than by row. This read-optimized architecture is oriented toward querying large amounts of data and can avoid bringing irrelevant attributes into memory. The authors implemented this design as a column-oriented DBMS called C-Store.

First, the paper introduces the design of C-store, including the read-optimized store (RS) and the writable store (WS). Columns in the RS are compressed using one of four encodings: “self-order, few distinct values”, “foreign-order, few distinct values”, “self-order, many distinct values”, and “foreign-order, many distinct values”. In addition, in order to avoid writing two optimizers, WS is also a column store and implements the same physical DBMS design as RS. For simplicity and scalability, WS is horizontally partitioned in the same way as RS, and thus there is a 1:1 mapping between RS and WS segments. Therefore, the C-store implementation contains both the RS and WS portions.
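
To make the first of these encodings concrete, here is a minimal sketch of the “self-order, few distinct values” case as I understand it: a sorted column is stored as (value, first position, count) triples. The function names and the 1-based positions are my own assumptions, not code from the paper.

```python
def encode_runs(sorted_column):
    """Collapse a sorted column into (value, first_position, count) triples."""
    triples = []
    for pos, value in enumerate(sorted_column, start=1):  # 1-based positions (assumption)
        if triples and triples[-1][0] == value:
            v, first, count = triples[-1]
            triples[-1] = (v, first, count + 1)  # extend the current run
        else:
            triples.append((value, pos, 1))      # start a new run
    return triples

def decode_runs(triples):
    """Expand the triples back into the original column."""
    column = []
    for value, _first, count in triples:
        column.extend([value] * count)
    return column
```

For a sorted column with few distinct values the number of triples equals the number of distinct values, regardless of how many rows the column holds.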

Second, the paper talks about the management of the C-store, including query execution, update transactions, snapshot isolation, locking-based concurrency control, recovery, and query optimization. For updates, all inserts corresponding to a single logical record have the same storage key. A record is visible if it was inserted before the effective time and was either not deleted or deleted after the effective time. For concurrency control, strict two-phase locking is used, and timeouts are used for resolving deadlock. For query optimization, the authors use a Selinger-style optimizer that uses cost-based estimation for plan construction. Thus, C-store is a data architecture that is read-optimized while still supporting transactions.
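
The visibility rule can be sketched as follows. This is a minimal sketch assuming integer epochs; the `RecordVersion` class, the function names, and the treatment of the boundary case (an event in the same epoch as the effective time counts as having happened) are my assumptions rather than details taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecordVersion:
    value: str
    insert_epoch: int
    delete_epoch: Optional[int] = None  # None means the record was never deleted

def is_visible(rec, effective_time):
    """Visible if inserted by the effective time and not yet deleted at that time."""
    inserted = rec.insert_epoch <= effective_time
    deleted = rec.delete_epoch is not None and rec.delete_epoch <= effective_time
    return inserted and not deleted

def snapshot(records, effective_time):
    """Values a read-only query would see when run as of effective_time."""
    return [r.value for r in records if is_visible(r, effective_time)]
```

Because a read-only query only compares epochs, it needs no locks; it simply ignores anything inserted or deleted after its effective time.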

The strength of this paper is that it provides many examples during each section. It assists readers in understanding the ideas of C-store more clearly. The paper provides many example records, tables, and queries to help illustrate these ideas.

The weakness of this paper is that it does not provide many details about concurrency control and recovery. It does not compare these algorithms in a column-oriented DBMS against those in a row-oriented DBMS.

To sum up, this paper presents the design of C-store, a read-optimized architecture that aims at the read-mostly DBMS market.



Review 26

This paper was an introduction to C-Store, which is a column-oriented, read-focused DBMS. Most DBMSs at the time were row based in storage, so a whole tuple would be together in memory. C-Store is column stored, so if you wanted to grab a whole tuple you would need to read from many different places in storage, but if you wanted to aggregate over a whole column the storage would have much higher locality.
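
A small sketch of that locality difference, using made-up data and hypothetical function names: both layouts hold the same three tuples, and each function reports how many stored values it had to touch to sum one attribute.

```python
# Hypothetical sample table: (id, region, sales)
rows = [
    (1, "east", 100),
    (2, "west", 250),
    (3, "east", 75),
]

# Row layout: all values of one tuple sit next to each other.
row_store = [value for row in rows for value in row]

# Column layout: all values of one attribute sit next to each other.
col_store = {
    "id": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "sales": [r[2] for r in rows],
}

def sum_sales_row_store(flat, width=3, offset=2):
    """Sum the sales attribute; every whole record is read to reach it."""
    total, touched = 0, 0
    for start in range(0, len(flat), width):
        record = flat[start:start + width]
        touched += len(record)
        total += record[offset]
    return total, touched

def sum_sales_col_store(cols):
    """Sum the sales attribute; only the sales column is read."""
    sales = cols["sales"]
    return sum(sales), len(sales)
```

Both functions return the same total, but the row layout touches every value in the table while the column layout touches only one value per tuple.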

C-Store has two different parts to it, RS and WS. RS is a heavily read-optimized storage setup, and WS is more focused on writes. RS is a much larger portion of the database because most data is historical and only read. WS is small, so they don’t make much of an effort to compress its data values, as they do for RS to save storage and improve speed.

The results section showed that C-Store was a full order of magnitude faster than standard Row Store mechanisms, and significantly faster than other column store algorithms! It was also able to compress the data much more than the other setups so it was faster and more space efficient.

I did have a problem with the results section though as they only focused on reads (SELECT statements). Obviously a DBMS like C-Store that is focused on reads and heavily optimized for reads is going to perform better than other setups for reads. I would have liked to see how much worse it was for writes and updates than the other systems as I think this is a very important consideration for picking a DBMS. It felt like a very selective group of queries to run for the results section.

Another strange downside to this paper was that the introduction was far too long. The introduction was the longest section in the paper, which doesn’t make any sense as it is supposed to be used to just introduce, not explain everything about C-Store. I was surprised by this because normally I love the way Stonebraker formats his papers.

Overall, I think this was a solid paper that was a good introduction to C-Store and it gave a good high level overview as well as more in depth technical details. I think C-Store is a great way to store data that is going to be read-mostly (though I’m sure there are other great ways to store this data as well), but I don’t know enough about most databases to know if they fall under a read-mostly setup, and C-Store did not convince me that it would perform well for databases with many writes.



Review 27

The motivation behind this paper is that many traditional DBMS systems have been previously optimized for writes; updating a single record only involves locating the physical region on disk and modifying a small contiguous location. This row-oriented architecture is nice for being able to modify multiple parts of a record at a time, but when information across specific attributes needs to be aggregated or summarized, record-based storage incurs lots of unnecessary overhead by having to read through attributes that are irrelevant.

One of the intuitions that was not immediately clear to me was exactly why packing columns takes less memory. The reason is that there is almost no variation in data type within a column (i.e. a row can contain ints, strings, and mixtures of different data types, while a column usually holds values of a single type), which makes it easier to densely pack many values within a range into memory. Furthermore, sorted values make range queries even easier for similar reasons. The ability to cleverly compress and pack data, I think, opens up the opportunity to perform operations directly on compressed data in certain cases, since unpacking and interpreting decompressed information also incurs overhead.
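
To check this intuition I found it helpful to write a small sketch. It uses dictionary encoding, which is a technique I am swapping in for illustration rather than one of the paper's four encodings, and all function names are hypothetical.

```python
def dictionary_encode(column):
    """Replace each value with a small integer code into a sorted dictionary."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    codes = [code_of[value] for value in column]
    return dictionary, codes

def bits_per_value(dictionary):
    """Bits needed per code when every value comes from the same small domain."""
    return max(1, (len(dictionary) - 1).bit_length())

def count_equal(dictionary, codes, value):
    """Evaluate an equality predicate on the codes, without decoding any value."""
    if value not in dictionary:
        return 0
    target = dictionary.index(value)
    return sum(1 for code in codes if code == target)
```

A column of state abbreviations with three distinct values needs only two bits per entry, and a predicate such as `state = 'MI'` becomes an integer comparison on the packed codes.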

While none of the ideas presented alone seem to be worth a published paper, the fact that they designed a comprehensive system for effective optimization and column-based storage makes it a powerful medley of tools. There is a lot of room for optimization (e.g. operating on compressed data, optimized caching and online in-memory updates) but the limitation is that, in the end, this is optimized for aggregate read-only data, so it will not be as efficient for many random writes across different columns.


Review 28

The paper talks about C-Store, a design for a read-optimized relational DBMS that stores data by column. A column-store architecture allows the DBMS to read only the values that matter for a given query, and therefore avoids bringing irrelevant attributes into memory. The paper introduces us to a radical departure from the architecture of current DBMSs and presents preliminary performance data on a subset of TPC-H.

The paper starts with the motivation behind C-Store and its architecture. C-Store combines a read-optimized column store (RS) and an update/insert-oriented writable store (WS) in a single piece of software, connected by a tuple mover. Inserts are sent to WS, while deletes are marked in RS for later purging by the tuple mover (an update is an insert plus a delete). Read-only queries are run in historical mode, at a timestamp between the Low Water Mark (LWM) and the High Water Mark (HWM). In C-Store, data is stored as segments of projections of a table, where each projection contains one or more attributes (columns). When queried, “tables” are reconstructed by joining the segments based on storage keys and join indexes. Columns in the RS are encoded using one of four encoding techniques (self-order, few distinct values; foreign-order, few distinct values; self-order, many distinct values; foreign-order, many distinct values). The WS is also a column store and implements the same physical DBMS design as RS. The Storage Key (SK) is explicitly stored in each WS segment. There is a 1:1 mapping between RS and WS segments. The WS does not use encoding, since it is assumed to be much smaller than RS. Next, the paper talks about storage management and update transactions. C-Store provides snapshot isolation, which defines the “visible data” for read-only transactions. Read-write transactions use strict two-phase locking with distributed COMMIT processing and transaction rollback to avoid deadlock. The next section explores the design of the tuple mover as the connector between RS and WS. The paper continues with the implementation of C-Store query execution, including query operators, plan format, and query optimization. C-Store performance is compared to a row-store DBMS and another column-store DBMS. The results show that C-Store is ahead of the other two, even when the two add materialized views to improve performance.
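
The reconstruction step can be sketched roughly as follows. This is a simplified, single-segment illustration under my own assumptions: the function names are hypothetical, rows are assumed to be unique, and the join index maps a position in one projection directly to a position in the other.

```python
def build_projections(rows):
    """Store name and age as two projections with different sort orders."""
    by_name = sorted(rows, key=lambda r: r[0])
    by_age = sorted(rows, key=lambda r: r[1])
    names = [r[0] for r in by_name]  # projection 1: names, sorted by name
    ages = [r[1] for r in by_age]    # projection 2: ages, sorted by age
    # Join index: for position i in the name projection, the position of the
    # same logical row in the age projection (assumes rows are unique).
    position_in_age = {row: i for i, row in enumerate(by_age)}
    join_index = [position_in_age[row] for row in by_name]
    return names, ages, join_index

def reconstruct(names, ages, join_index):
    """Rebuild (name, age) tuples by following the join index."""
    return [(names[i], ages[join_index[i]]) for i in range(len(names))]
```

Each lookup through the join index is effectively a random access into the other projection, which is presumably why reconstructing full tuples is the expensive case for a column store.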

A major contribution of this paper is that it provides a new alternative database architecture that supports read-mostly systems with ad hoc queries. Current DBMSs mostly focus their optimization on write operations and usually put read operations in the backseat. However, since data warehouses are getting bigger and there is a growing need for analysis (e.g., aggregating data), it is important to have an architecture that improves read operations.

However, I think it would be better if the writers also explained how C-Store could be deployed along with a mainstream row-oriented DBMS. Because, face it, there will be no OLAP without OLTP. For OLTP, I think row-oriented DBMSs are still very much of use, since C-Store is still quite heavy for transactional operations (insert, update). It is more likely that the two kinds of DBMS will be deployed alongside each other. In this case, how would C-Store interact with a row-oriented DBMS?



Review 29

The purpose of this paper is to introduce a column-based database where records are not stored as rows, but rather columns of values are stored together in a compressed form. Because of this storage mechanism, the DBMS is optimized for read-only queries such as those found in an OLAP workload.

The technical contributions of this paper are mainly in their presentation of a hybrid architecture of a column store database that allows for fast updates, but is a read optimized system. They present storage compression systems to allow for fast access of data in a smaller amount of space. Additionally, they present a column-oriented query optimizer and query executor that is different from those that are typically used in a record-based DBMS. Another important technical contribution of this paper is that they present a database with high availability by using overlapping projections to store the various versions of their data in different sorted orders. These different sorted orders allow this system to kill two birds with one stone. They get additional projections through their data replication which makes query execution faster as well as providing availability!

I think this paper has quite a few weaknesses. Overall, I do not think it is a well-written paper. I also struggle with the sections of the paper where there are parts of their system that are still “under construction”. I would be more satisfied if they could say with more confidence that their To Do list was not going to significantly negatively affect performance, but they leave that possibility on the table. I also don’t think it is clearly explained why the WS system is not storage optimized even though they state that it has a one-to-one mapping with the RS system which must be storage optimized. Perhaps there’s something I’m missing, but this was not clear to me at all.

As far as strengths go, I do think they’re presenting an innovative system that shows promise for optimizing read-dominated OLAP workloads. Their ideas have a grounding in related work, but they are presenting an interesting combination of ideas. I think their biggest strength is in their presentation of a novel query execution engine that is optimized for use in column store databases.



Review 30

Review: C-Store: A Column-oriented DBMS
Paper Summary:
This paper proposes the design of a read-optimized relational DBMS that stores data column-wise. A traditional DBMS stores data row-wise so that a single write operation can push an entire record to disk, and is thus write-optimized. The proposed model, in contrast to write-optimized systems, stores data by column, which means the values of each attribute are stored close together. Therefore fewer reads are required to fetch an attribute across a large number of data records, which makes the system read-optimized.

Paper Review:
This paper provides a detailed description of the data model through a number of examples. At the same time, the idea behind the model’s construction is intuitive and straightforward, which makes the explanation simple and reader-friendly. By repeatedly comparing the read-optimized system with the write-optimized system, the paper also makes the core idea of the proposal easy to grasp. In section 6.1 the paper also proposes an effective way of providing snapshot isolation given the situation’s constraints. However, the figures in this section are not well presented, which makes the idea they try to depict a little confusing.

The experiments are presented clearly and in detail. The configurations of the experiments make the experimental results very convincing in demonstrating the advantages of the proposed model. The comparison could be more thorough if it included the performance of a write-intensive task, to show how much the system sacrifices in write operations to optimize read operations. This is not a critical evaluation criterion, though, since users can choose a system according to the properties of their tasks.

One more thing that may be interesting to discuss is the impact of this implementation on the difficulty and feasibility of distributing data. Would distributing a column-oriented DBMS require the same or a similar implementation as distributing a row-oriented DBMS?


Review 31

This paper introduces C-Store, a column-based RDBMS. Previous DBMSs are row based, which makes them easy to partition horizontally to meet OLTP workloads. Column-oriented databases, on the other hand, are more suitable for data warehouses and Customer Relationship Management systems. By storing data in columns, C-Store can avoid bringing attributes that a query does not need into memory. C-Store combines a write-optimized column store (WS) with a read-optimized column store (RS), linked by a tuple mover, and stores the elements of a table redundantly in overlapping projections with different sort orders to make access more efficient and availability higher. C-Store uses compression to make disk I/O more efficient. Moreover, C-Store uses snapshot isolation so that read-only transactions see a consistent view of the data.

Strengths:
1. This paper proposes an architecture in which each component is highly optimized for certain operations, which is a good way to divide the work and could greatly improve performance and reduce latency.
2. This paper captures the problem with traditional row-based RDBMSs and moves forward from there with the C-store DBMS to solve it, which I think is very educational for its readers. Great discoveries always come from great observations, and great observations come from challenging problems.

Weaknesses:
1. This paper does not talk much about the communication cost of coordination in a large-scale system. Based on the design of C-store, different nodes are responsible for different tasks, which may take a lot of work to coordinate in a large-scale system.
2. This paper only tested seven queries to evaluate the difference between C-store and row-based RDBMSs. I believe it would be more convincing to provide more test cases when testing the performance of a column-store database.



Review 32

This paper introduces C-Store, a column-store DBMS that is optimized for the “read-mostly” market. Unlike traditional DBMSs designed for write-heavy workloads, C-Store achieves high read performance by storing data in a columnar fashion. It is designed for ad-hoc querying of large-scale data, especially in data warehousing. With a column-store architecture, C-Store aggressively compresses the data in a column, trading more CPU cycles for fewer disk I/Os. There are two ways a column store can use CPU cycles to save disk bandwidth. First, it can code data elements into a more compact form. Second, it can dense-pack values in storage. C-Store implements data replication by storing multiple projections of the data; namely, it stores a column sorted on different keys and materializes each such instance. To deal with the tension between providing updates and optimizing data structures for reading, the C-Store architecture consists of a writable store (WS) and a readable store (RS). Updates go to WS and propagate to RS in batches. Read-only queries are run against RS in historical mode. In this mode, the query selects a timestamp and runs on a snapshot taken at that timestamp. The authors also propose a query optimizer designed for a column-store DBMS.

The main contributions of this paper include:
1. Proposes a column-store database design and a novel architecture of WS and RS that allows transactions in a column-store DBMS.
2. Reduces disk I/O by coding data values and dense-packing the data.
3. Achieves data replication by using overlapping projections of tables, which also speeds up read queries at the same time.
4. Implements distributed transactions and snapshot isolation without a redo log or 2PC.

One weakness of this paper is that a read query can only be executed on historical data, thus limiting the freshness of query result.