This paper presents the design, implementation, and lessons learned from Bigtable, a distributed storage system for managing structured data. The paper first introduces the data model (rows and column families) and the building blocks (mainly the SSTable, an immutable storage format, and Chubby, the distributed lock service the system depends on). It then discusses the implementation of Bigtable, covering tablet location (the location hierarchy), tablet assignment (how tablets are assigned to tablet servers), tablet serving (how tablet servers handle read/write operations), and compactions (which shrink the memory usage of tablet servers and reduce the amount of data needed for recovery). After this high-level implementation description, several refinements are introduced, including locality groups (to speed up SSTable access), compression for locality groups, caching and Bloom filters to accelerate reads, and a merged commit log for performance. The performance evaluation measures random/sequential read/write operations, and scalability is analyzed by varying the number of tablet servers. The paper also describes how Bigtable is used in real applications. I like this paper very much. The first reason is that it presents the design motivations and implementation very clearly; with knowledge of the Google File System, the paper is not hard to read, and it provides a reasonable amount of detail when discussing its design choices. The second reason is that, beyond reporting performance on simple benchmarks, the paper discusses how Bigtable serves Google's applications, such as Google Earth. The third reason is that it also relates itself to papers we read previously, including MapReduce, C-Store, and the Google File System, which helps me refresh what I read before and understand those designs better. I would like this paper even more if it gave a more detailed account of how Bigtable interacts with MapReduce; the paper provides only a brief idea, and it would be interesting to learn more on that topic. |
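The compactions summarized in this review can be illustrated with a small sketch. This is only an illustration under my own simplifications, not the paper's implementation: each SSTable is modeled as a sorted dict keyed by (row, column, timestamp), and deletions are marked with a `DELETED` tombstone that a major compaction may drop; the names `DELETED` and `major_compaction` are invented for the example.

```python
# Minimal sketch of a merging/major compaction: several immutable "SSTables"
# (here: sorted dicts) plus the memtable are merged into one new SSTable.
# A major compaction can drop deletion tombstones entirely; the names below
# are illustrative, not Bigtable's real API.
DELETED = object()  # tombstone marker for deleted cells

def major_compaction(memtable, sstables):
    """Merge the memtable and all SSTables into a single sorted mapping.

    Newer sources win for identical (row, column, timestamp) keys, and
    tombstones are discarded because a major compaction produces an
    SSTable that contains no deletion entries.
    """
    merged = {}
    # Iterate from oldest SSTable to the newest data (the memtable),
    # so later writes overwrite earlier ones.
    for source in list(sstables) + [memtable]:
        merged.update(source)
    # Drop tombstones and return keys in sorted order, as an SSTable would.
    return {k: v for k, v in sorted(merged.items()) if v is not DELETED}

# Tiny usage example with made-up cells.
old_sstable = {("com.cnn.www", "contents:", 1): "<html>v1</html>"}
new_sstable = {("com.cnn.www", "contents:", 2): "<html>v2</html>"}
memtable = {("com.cnn.www", "anchor:cnnsi.com", 3): DELETED}
print(major_compaction(memtable, [old_sstable, new_sstable]))
```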
Data processing and storage at Google have grown to a very large size, on the petabyte scale. To meet the storage demands of Google services including web indexing, Google Earth, and Google Finance, the authors' team implemented and deployed Bigtable, a distributed storage system for managing structured data at Google. Bigtable shares many implementation strategies with parallel databases and main-memory databases to achieve scalability and high performance. The paper first describes the data model in detail, then provides an overview of the client API and the architecture, implementation, and refinements of Bigtable. The next sections present the evaluation and applications of BigTable, and the paper ends with lessons learned, related work, and conclusions. Some of the strengths and contributions of this paper are: 1. BigTable provides clients with a simple model that supports dynamic control over data layout and format. 2. BigTable allows clients to group column families together into a locality group, which leads to more efficient reads because applications do not need to read through irrelevant column families. 3. Performing a minor compaction on a tablet before moving it from one tablet server to another saves recovery time by reducing the amount of uncompacted state in the source server's commit log. Some of the drawbacks are: 1. BigTable does not support advanced indexing such as secondary indices. 2. BigTable is strongly consistent for single-row updates, but it offers no consistency guarantees for multi-row or cross-row updates. 3. Bigtable has no built-in support for SQL. |
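The locality-group benefit mentioned in strength 2 above can be sketched as follows. This is an illustration under my own assumptions: column families are plain strings, each locality group is modeled as its own dict standing in for a separate SSTable, and the group names and families are invented.

```python
# Sketch: column families are assigned to locality groups, each stored in its
# own "SSTable" (a dict here), so a read of one group never touches the
# files of the others.  All group/family names are made up for illustration.
LOCALITY_GROUPS = {
    "metadata_group": {"language:", "checksum:"},   # small, frequently read
    "contents_group": {"contents:"},                # large page contents
}

def group_for(column_key):
    family = column_key.split(":", 1)[0] + ":"
    for group, families in LOCALITY_GROUPS.items():
        if family in families:
            return group
    raise KeyError(f"no locality group for family {family!r}")

def write(storage, row, column, value):
    storage.setdefault(group_for(column), {})[(row, column)] = value

def read_group(storage, group, row):
    """Read every column of one row from a single locality group only."""
    return {c: v for (r, c), v in storage.get(group, {}).items() if r == row}

storage = {}
write(storage, "com.cnn.www", "language:", "EN")
write(storage, "com.cnn.www", "contents:", "<html>...</html>")
# A scan that needs only page metadata never reads the contents SSTable.
print(read_group(storage, "metadata_group", "com.cnn.www"))
```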
Petabytes of structured data of different types, including URLs, web pages, and satellite imagery, need to be stored across thousands of commodity servers at Google, and must meet latency requirements ranging from backend bulk processing to real-time data serving. This paper therefore proposes BigTable, a distributed storage system for managing large-scale structured data, which gives clients dynamic control over data layout and format. Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes. The Bigtable API provides functions for creating and deleting tables and column families, and for changing cluster, table, and column family metadata, such as access control rights. Bigtable uses the distributed Google File System to store log and data files. A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed applications, and Bigtable processes often share the same machines with processes from other applications. The contribution of this paper is that Bigtable not only introduces an interesting data model (rows, columns, column families, timestamps, atomic row updates), it also combines a large number of interesting and useful data representation techniques (mutable stacks of immutable SSTables, Bloom filters, compressed tablets), some of them new. The paper offers a deep set of systems techniques and obviously good engineering; the Chubby/master/tablet-server interactions show that single-master systems can avoid bottlenecks and scale tremendously. The main advantages of BigTable are as follows: 1. Impressive scalability: Bigtable's aggregate throughput grows nearly in proportion to the number of machines in the cluster. 2. A special query language is not needed, and thus query-language optimization is not necessary. 3. Operations are performed only at the level of a single row, so join operations are not required. 4. There is no fixed limit on row size, and an arbitrary number of columns can be kept for each record. 5. With this approach, disk access is reduced and cost is low in contrast to an RDBMS, and it offers high availability. The main disadvantages of BigTable are as follows: 1. Data loss can occur, BigTable lacks advanced features for data security, and secondary indexes are not supported. 2. Because of the distributed nature of a Bigtable database, performing a join between two tables would be terribly inefficient; instead, the programmer has to implement such logic in the application, or design the application so as not to need it. 3. BigTable-style deployments (such as the open-source HBase) require setting up multiple components, including Hadoop's HDFS and ZooKeeper, which is not trivial, so getting a project started with BigTable is not particularly easy. |
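A minimal sketch of the map described above, assuming nothing beyond the paper's statement that the map is indexed by (row key, column key, timestamp) and that values are uninterpreted bytes; the helper names and the use of a plain Python dict are mine.

```python
# Sketch of Bigtable's logical data model: a sorted map from
# (row key, column key, timestamp) to an uninterpreted byte string.
# A Python dict plus sorting stands in for the lexicographic ordering that
# real tablets maintain on disk; this is an illustration, not the engine.
from typing import Dict, Tuple

Cell = Tuple[str, str, int]          # (row, column, timestamp)
table: Dict[Cell, bytes] = {}

def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    table[(row, column, timestamp)] = value

def read_latest(row: str, column: str) -> bytes:
    """Return the value with the largest timestamp for (row, column)."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    if not versions:
        raise KeyError((row, column))
    return max(versions)[1]

put("com.cnn.www", "contents:", 1, b"<html>old</html>")
put("com.cnn.www", "contents:", 2, b"<html>new</html>")
print(read_latest("com.cnn.www", "contents:"))   # b'<html>new</html>'
```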
Problem & motivations: Google, again, wants a storage system for all of its data; it should be easy to manage and able to scale out to petabytes of data. Also, since many Google applications, such as Google Earth, require rapid response times, the storage system should achieve low latency. Main contribution: To solve the problem above, Google proposed Bigtable, a NoSQL database. The name is a bit confusing because Bigtable is not a single table of data; rather, it is organized as a hierarchy of tablets. The highest level of the hierarchy fits entirely in memory, and it stores pointers to the tablets at the next level. The data rarely changes, and all keys within a tablet are ordered, so it is efficient for a program to look up particular values. The tablets at each level are ordered as well: for example, the maximum key of tablet 1 is smaller than the minimum key of tablet 2. In addition, Bigtable supports crash recovery by writing a commit log. Main drawback: join operations. The paper never mentions how a join could be performed; I searched online and found that Bigtable does not support join operations. |
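The ordered, non-overlapping hierarchy this review describes can be made concrete with a small lookup sketch. Assumptions are mine: each level is modeled as a sorted list of (last-row-key, pointer) entries and `bisect` stands in for the search within a tablet; the real system encodes this in the METADATA table and stores the root location in Chubby.

```python
# Sketch of the tablet-location lookup: a root tablet points to METADATA
# tablets, which point to user tablets.  Each level is a sorted list of
# (end_row_key, target) pairs; bisect finds the first tablet whose end key
# is >= the requested row, mirroring ordered, non-overlapping row ranges.
# All names and keys are illustrative.
import bisect

def locate(level, row_key):
    """Return the target of the first entry whose end key covers row_key."""
    end_keys = [end for end, _ in level]
    i = bisect.bisect_left(end_keys, row_key)
    if i == len(level):
        raise KeyError(f"row {row_key!r} beyond last tablet")
    return level[i][1]

# Level 1: the root tablet (small enough to keep in memory).
root = [("m", "meta_tablet_A"), ("\xff", "meta_tablet_B")]
# Level 2: METADATA tablets, mapping end row keys to user tablets.
metadata = {
    "meta_tablet_A": [("g", "user_tablet_1"), ("m", "user_tablet_2")],
    "meta_tablet_B": [("t", "user_tablet_3"), ("\xff", "user_tablet_4")],
}

def find_user_tablet(row_key):
    meta = locate(root, row_key)              # hop 1: root tablet
    return locate(metadata[meta], row_key)    # hop 2: METADATA tablet

print(find_user_tablet("com.cnn.www"))        # -> user_tablet_1
```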
Storing large amounts of data is a difficult task; finding a way that scales to petabytes of data and more is even more difficult. This is the reality facing companies today, however, as the amount of data being produced and collected continues to explode. To deal with this need, Google has introduced Bigtable, which is a distributed storage system that manages data across thousands of machines. It is very scalable and reliable, spans a wide range of configurations, and can handle a variety of workloads, from ones where throughput is important, like batch processing, to others where latency is paramount. At its core, Bigtable is a sparse, distributed, persistent multidimensional sorted map, where the map is indexed by a row key, column key, and timestamp. As the authors note, Bigtable shares many characteristics with database systems, with important differences. For example, it does not support a full relational data model, instead opting for a simple data model that gives users control over data layout and format, as well as over data locality in the underlying storage. Data is indexed using row and column names that can be arbitrary strings (row keys can be up to 64 KB in size), and the data itself is treated as uninterpreted strings of bytes. Data is ordered by row key, and row ranges are dynamically partitioned, allowing users to improve data locality with careful row key selection. Row ranges are called tablets in this paper. Column keys are grouped into column families, where all data in a column family is usually of the same type. Disk and memory accounting, as well as access control, are all handled at the column-family level. Each Bigtable cell can contain multiple versions of the data, which are differentiated by their timestamps. Bigtable is accessed through an API that allows for creating and deleting tables and column families, as well as other functionality such as modifying metadata and access control rights. Additionally, Bigtable supports features such as single-row transactions, using cells as integer counters, execution of client-supplied scripts in the server address spaces, and integration with MapReduce. As with other Google systems, Bigtable uses the Google File System, and its data is stored in the SSTable file format. An SSTable provides a persistent, ordered, immutable key/value mapping, where keys and values can be arbitrary byte strings. Locking is provided by Chubby, which is a distributed lock service. Tablet location information is structured similarly to a B+ tree, where the first level is the location of the root tablet, followed by a second level of metadata tablets, and finally user data tablets in the third level. Tablets are assigned to one tablet server at a time, and updates are committed to a log that stores redo records for recovery purposes. The most recent commits are stored in memory in a sorted buffer called a memtable; when the memtable reaches a certain size, it is frozen, converted to an SSTable, and written to disk, and a new memtable is created in its place (a rough sketch of this write path follows this review). Besides these core features, Bigtable offers other refinements such as the ability for the user to specify a locality group of multiple column families, data compression and/or caching, Bloom filters, and other performance optimizations. The main strength of this paper is that it introduces a successful implementation of a distributed data management system that is very scalable and reliable. 
This is underscored by its widespread adoption across most, if not nearly all, of Google’s major services, like Google Analytics, Personalized Search, Google Finance, and Google Earth, to name a few. That it is able to handle such a wide array of requirements and workloads speaks to the adaptability of the Bigtable system. Devising and implementing a system on such a large scale, especially within an established company, is a very impressive feat. Some weaknesses of this paper include Bigtable’s sublinear speedup when it comes to parallel execution across multiple machines. In the results, increasing the number of tablet servers by a factor of 500 increases aggregate random read throughput by only about a factor of 100, and while the other benchmarks scale better, none of them scale linearly. The authors identify some factors behind this loss of performance, such as imbalanced load and saturation of the network between tablet servers and GFS. Finally, the authors could have done a slightly better job of discussing the specific steps they took to improve the fault tolerance of Bigtable, especially since it has to run on a distributed array of commodity servers that could fail at any time. |
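The tablet-serving write path summarized in this review (commit log, memtable, minor compaction into an SSTable) can be sketched roughly as follows. The class name, size threshold, and in-memory structures are my own simplifications, not the actual implementation.

```python
# Rough sketch of a tablet server's write path: every mutation is appended
# to a commit log (for redo after a crash) and inserted into an in-memory
# sorted buffer (the memtable).  When the memtable grows past a threshold,
# a minor compaction freezes it and writes it out as an immutable SSTable.
class TabletServerSketch:
    MEMTABLE_LIMIT = 3  # absurdly small, just to trigger a flush in the demo

    def __init__(self):
        self.commit_log = []   # stands in for a redo log in GFS
        self.memtable = {}     # recent writes, sorted when flushed
        self.sstables = []     # list of immutable, sorted "files"

    def write(self, row, column, timestamp, value):
        self.commit_log.append((row, column, timestamp, value))  # redo record
        self.memtable[(row, column, timestamp)] = value
        if len(self.memtable) >= self.MEMTABLE_LIMIT:
            self.minor_compaction()

    def minor_compaction(self):
        """Freeze the memtable, write it as a sorted SSTable, start fresh."""
        frozen = dict(sorted(self.memtable.items()))
        self.sstables.append(frozen)
        self.memtable = {}

server = TabletServerSketch()
for ts in range(4):
    server.write("com.cnn.www", "contents:", ts, f"<html>v{ts}</html>")
print(len(server.sstables), "SSTable(s);", len(server.memtable), "entry in memtable")
```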
The paper introduces Bigtable, Google's distributed storage system designed for managing structured data. The contributions of this paper were to make Bigtable a widely applicable and scalable tool, with performance and availability as high as possible. Bigtable is used by a large number of Google products, and it provides a simple data model that supports client control over the structure of the data. The paper describes Bigtable as a “sparse, distributed, persistent multi-dimensional sorted map” indexed by a row key, a column key, and a timestamp. The authors arrived at this model by analyzing the problems a system of this kind would face, and as a result the model is well suited to indexing specific elements of resources that were fetched at a certain time. An example of a row key is the URL of a fetched page (a row range is called a tablet), and examples of column families are the language the page was written in (that family uses only one column key) or the anchors of links pointing at the page. Timestamps are used to keep track of versions of the indexed item, such as the state of a webpage fetched at different times. Bigtable also provides an API for modifying the data: one can create and delete tables, and also change permissions on cluster, table, and column family metadata. Bigtable is compatible with MapReduce, which we read about previously, and it is built on top of the distributed storage system GFS. It uses a cluster management system for scheduling distributed jobs, managing shared resources, monitoring machine status, and handling failures. It also uses the Google SSTable format to store the map data, since it provides an ordered, immutable map from keys to values, and it relies on a distributed lock service called Chubby, which keeps five active replicas (one of which is elected master to serve requests) and uses the Paxos algorithm to keep the replicas consistent. Bigtable is implemented in three components: a library linked into each client, one master server, and multiple tablet servers. The master manages the tablet servers and the garbage collection of files in GFS. Each tablet server manages a set of tablets, handling reads and writes as well as splitting tablets that become too large. The location of the root tablet is stored in Chubby; the root tablet is the first tablet of a METADATA table that records the locations of all the other tablets. Chubby also keeps track of the tablet servers: when a server starts, it acquires an exclusive lock on a uniquely named file in a Chubby directory. The persistent state of the system is stored in GFS, and updates are committed to a commit log. The authors also implemented ways of compacting data into fewer SSTables to keep the data footprint from getting out of hand as commits accumulate. In keeping with the goals of a high-performance and highly available system, the authors then detail some of the smaller refinements they made on top of the core features, like keeping data local and compressing data where possible. I liked the graphics in this paper. I did feel that it was a bit dense, but the figures at times provided a welcome break and helped visualize things, especially the tablet location hierarchy image. One minor thing I did not like is that related work is not covered until the end of the paper; I prefer that context and point of comparison upfront. |
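The Webtable example this review alludes to (URL-derived row keys, a contents family, anchor columns, timestamped versions) can be shown concretely. The specific rows, anchor values, and timestamps below are invented for illustration; only the general shape follows the paper's example.

```python
# Illustration of the Webtable example: the row key is the page URL with the
# domain reversed so pages from the same domain sort next to each other, the
# "contents:" family holds page bodies, and "anchor:<referrer>" cells hold
# the anchor text of links pointing at the page.  Values are invented.
def reverse_domain(url):
    host, _, path = url.partition("/")
    return ".".join(reversed(host.split("."))) + ("/" + path if path else "")

webtable = {
    # (row key, column key, timestamp) -> value
    (reverse_domain("www.cnn.com/index.html"), "contents:", 5): "<html>...</html>",
    (reverse_domain("www.cnn.com/index.html"), "contents:", 6): "<html>newer</html>",
    (reverse_domain("www.cnn.com/index.html"), "anchor:cnnsi.com", 9): "CNN",
    (reverse_domain("www.cnn.com/index.html"), "anchor:my.look.ca", 8): "CNN.com",
}

# Rows for the same domain share the prefix "com.cnn.", so a range scan over
# that prefix touches a small, contiguous set of tablets.
print(sorted({row for (row, _, _) in webtable}))   # ['com.cnn.www/index.html']
```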
This paper introduces Bigtable, a distributed storage system for structured data. It is designed for the many Google applications that need to use petabytes of data distributed across thousands of servers, and it provides a flexible, highly efficient solution. Bigtable is designed like a database system but provides a very different interface: it does not support a full relational data model, but instead gives clients a simple data model that supports dynamic control. Bigtable is a sparse, distributed, persistent multi-dimensional sorted map indexed by a row key, a column key, and a timestamp. The Bigtable API provides functions for creating and deleting tables and column families, and for changing cluster, table, and column family metadata. Bigtable is built on several other basic components from Google. 1: Bigtable uses the Google File System to store log and data files, and depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status. 2: Bigtable also relies on the SSTable format, a persistent, ordered, immutable map from keys to values. 3: Bigtable relies on Chubby, a highly available and persistent distributed lock service. Chubby uses Paxos to keep its replicas consistent. Bigtable uses Chubby to ensure that there is at most one active master at any time, to store the bootstrap location of Bigtable data, to discover tablet servers and finalize tablet server deaths, and to store Bigtable schema information and access control lists. To ensure high performance, availability, and reliability, a number of optimizations are implemented in Bigtable: 1: locality groups; 2: compression over locality groups; 3: caching for read performance; 4: Bloom filters; 5: the commit-log implementation; 6: speeding up tablet recovery; 7: exploiting immutability. An interesting observation about scaling is that as the number of tablet servers in the system increases, throughput increases dramatically but not perfectly linearly, because individual tablet servers become bottlenecked (for example on CPU or network). The paper also introduces some applications that use Bigtable: Google Analytics, Google Earth, and Personalized Search. The main contribution of Bigtable is that it introduces an innovative data model and brings in useful representation techniques such as immutable SSTables, Bloom filters, and compressed tablets; it also reports implementation experience in scaling and in avoiding bottlenecks. The advantages of Bigtable: 1: it provides strong consistency for single-row operations; 2: it is built on a distributed file system, separating the database from storage, which makes it easier to implement; 3: ordered row keys; 4: it supports multiple timestamped versions of each cell; 5: it provides an excellent compression strategy. The only disadvantage I see is that each tablet is served by a single server, so when that server crashes, other servers must help recover its tablets by replaying the commit log, which may hurt performance and stability. |
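The Bloom filter optimization listed above can be sketched in a few lines. This is a generic Bloom filter under my own parameter choices (bit-array size, number of hashes), not the one Bigtable uses; it only shows why a negative answer lets the tablet server skip a disk read.

```python
# Minimal Bloom filter sketch: a bit array plus a few hash functions.
# Before reading an SSTable from disk for a (row, column) pair, the tablet
# server asks the filter; "definitely not present" answers skip the disk
# read entirely, while "maybe present" can be a false positive.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

bloom = BloomFilter()
bloom.add(("com.cnn.www", "contents:"))
print(bloom.might_contain(("com.cnn.www", "contents:")))   # True
print(bloom.might_contain(("org.example", "anchor:x")))    # almost surely False
```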
This paper proposes BigTable, Google's way of handling large amounts of structured data. BigTable is a distributed system built on top of GFS that provides a subset of database semantics. Bigtable stores data in several tables, and each cell in a table has the form (row: string, column: string, time: int64) -> string. When storing data, Bigtable sorts the table by the cells' row keys and provides support for row-level transactions. As a distributed storage engine, Bigtable divides a table into several adjacent tablets by row and distributes the tablets to different tablet servers for storage. As a result, when a client queries row keys that are close to each other, the cells are more likely to fall on the same tablet, and the query is more efficient. A complete Bigtable cluster consists of two types of nodes: the master and the tablet servers. The master is responsible for detecting the membership of tablet servers in the cluster and their join and exit events, assigning tablets to tablet servers, balancing the storage load between tablet servers, and reclaiming useless files from GFS. In addition, the master is responsible for managing schema modifications such as table and column family creation and deletion. Each tablet server manages the tablets assigned to it by the master, handles read and write requests to these tablets, and is responsible for splitting a tablet when it becomes too large. The Bigtable cluster manages several tables, each consisting of several tablets; each tablet is associated with a specified row key range and contains all the data of that table within the range. Initially a table has only one tablet, and as tablets grow they are automatically split by the tablet server, so the table comes to contain more and more tablets. The main contribution is that the paper proposes the BigTable design, which works with GFS and solves the problems of Google's applications. There are also several bright points in this paper, such as coordination between nodes using the Chubby distributed lock service and using an LSM-tree-style design to convert random writes into sequential writes. One weak point of Bigtable is that it is not well suited for workloads with a lot of random row reads and writes over a large dataset. |
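The automatic tablet split this review mentions can be sketched simply. I am assuming a tablet is just a sorted dict of rows and that the split point is the middle row key; the real policy splits on size (roughly 100-200 MB by default), not row count.

```python
# Sketch of a tablet split: when a tablet grows past a size threshold, the
# tablet server splits it at a middle row key into two tablets that cover
# adjacent, non-overlapping row ranges.  Threshold and split point here are
# simplifications of the real size-based policy.
SPLIT_THRESHOLD = 4   # number of rows; real tablets split at ~100-200 MB

def maybe_split(tablet):
    """Return [tablet] unchanged, or two tablets split at the middle row key."""
    if len(tablet) < SPLIT_THRESHOLD:
        return [tablet]
    rows = sorted(tablet)
    mid = rows[len(rows) // 2]
    left = {r: v for r, v in tablet.items() if r < mid}
    right = {r: v for r, v in tablet.items() if r >= mid}
    return [left, right]

tablet = {f"com.site{i}.www": f"page {i}" for i in range(6)}
for part in maybe_split(tablet):
    print(sorted(part))
```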
Bigtable is a distributed storage system built by Google on top of the Google File System (GFS). It is meant to handle “web-scale” data: petabytes spread over thousands of individual machines. Bigtable has its own client code and does not support a relational data model or query language. Without knowing too much about DBMS history, I would say that it was probably one of the first popular systems in the NoSQL wave. The Bigtable data model includes rows that can be indexed by a row key. The keys are stored in lexicographic order, and sets of row ranges are called tablets. Bigtable is made up of a single master and many tablet servers that store the tablets; it also has a client library and uses an internal Google system called Chubby. Some additional major differences from other systems are that Bigtable uses a “column family” as the unit of access control, which can include many columns, and that multiple versions of the same data can be stored, distinguished by timestamp. This has been referred to as “time travel” in other papers, although it should be noted that Bigtable uses garbage collection so that users can get rid of older versions. I have found from the three Google papers we have read (GFS, MapReduce, Bigtable) that they follow a certain pattern. First, both of the later papers make use of GFS. Second, these papers describe systems that are already heavily used in production at Google. This is a key distinction when comparing them to other papers: there are fewer questions that remain unanswered and fewer ideas that remain unimplemented. The systems tend to be a somewhat revolutionary way of thinking about things, so the results sections don't compare against various existing baselines in the same way other papers do. Still, I thought that this paper's results did a good job showing what Bigtable works well for. The fact that these papers are very fleshed out makes them easy to read, in my opinion. A downside to these Google papers is that the code is very closed off. As far as I understand, at the time these papers were written, Google did not share the implementation of these technologies and did not provide them as a service either, as Microsoft does with SQL Server. Now, of course, there is a cloud version of Bigtable. At the time, this meant that the papers led to a bunch of copycats and, in my opinion, redundant work. A downside of Bigtable itself is that it does not provide the consistency guarantees of a traditional RDBMS, or a recognizable query language. Additionally, many optimization decisions seem to be left up to the programmer, which makes it significantly harder to adopt than other systems. |
This paper introduces Bigtable, which is a distributed storage system for managing structured data. It is designed to scale to petabytes of data across thousands of machines. The goal of Bigtable is to provide high performance, high availability, and wide applicability. Google uses Bigtable for a variety of different workloads, for example Google Analytics, Google Earth, Google Finance, etc. Bigtable is built on other pieces of infrastructure, including GFS for file storage, Chubby for distributed locking, and a cluster management system. Bigtable relies on Chubby to ensure there is only one active master, to store the root location of Bigtable metadata, to discover tablet servers, and so on. This, however, also means that if Chubby is unavailable for an extended period, Bigtable is unavailable too. Internally, Bigtable stores all data as key-value pairs. Each key consists of a row key, a column key, and a timestamp; the value can be any string. Bigtable, as its name suggests, stores these key-value pairs in tables. Each table is further partitioned into multiple tablets based on ordered row key ranges and stored on different tablet servers. Different columns in a table can be grouped into sets called column families, which form the basic unit of access control. Bigtable also supports multiple versions of data for the same row and column key through timestamps; the user can specify how many versions, or only versions that are new enough, should be kept, and the rest will be garbage-collected. Bigtable stores tablet location information in a three-level hierarchy whose root is stored in a Chubby file. Clients cache location information, but when a client's cache is empty it takes three network round-trips to find a tablet. The master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers, including which tablets are unassigned. The master periodically checks each tablet server to ensure it is still alive; if not, the master reassigns the tablets managed by that server. One drawback of the system, as I pointed out above, is the strong dependence on Chubby: if that service is down, the whole system cannot work properly. |
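The per-column-family version garbage collection described above (keep the last n versions, or only versions newer than some age) can be sketched as follows; the function and parameter names are mine, not Bigtable's API.

```python
# Sketch of Bigtable-style version garbage collection for one cell: the
# client can ask to keep only the last n versions, or only versions written
# within the last max_age_seconds.  Parameter names are illustrative.
import time

def garbage_collect(versions, keep_last_n=None, max_age_seconds=None):
    """versions: list of (timestamp, value), any order.  Returns survivors."""
    survivors = sorted(versions, reverse=True)          # newest first
    if keep_last_n is not None:
        survivors = survivors[:keep_last_n]
    if max_age_seconds is not None:
        cutoff = time.time() - max_age_seconds
        survivors = [(ts, v) for ts, v in survivors if ts >= cutoff]
    return survivors

now = time.time()
versions = [(now - 10, "v3"), (now - 3600, "v2"), (now - 86400, "v1")]
print(garbage_collect(versions, keep_last_n=2))                 # v3, v2
print(garbage_collect(versions, max_age_seconds=7200))          # v3, v2
```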
In the paper "Bigtable: A Distributed Storage System for Structured Data", Fay Chang and other Google employees present Bigtable, a flexible, distributed storage system for managing structured data. BigTable is designed to scale to very large sizes: petabytes of data across thousands of commodity servers. Bigtable supports workloads from many Google products such as Google Earth and Google Finance, two very different and demanding applications in terms of data size and latency requirements. Despite the varied demands, Bigtable has achieved wide applicability, scalability, high performance, and high availability. Bigtable differs from parallel databases, main-memory databases, and full relational data models. Rather, it offers a simple data model and supports control over data layout and format. Row and column names are strings, data is treated as uninterpreted strings (although the values may be structured), the locality of data can be controlled by clients, and clients have a choice of serving data out of memory or from disk. Since such a storage layer is used as the infrastructure for many Google applications, finding a balance between throughput-oriented batch-processing jobs and latency-sensitive jobs serving end users is an important problem. The paper is subdivided into several categories (which makes it much easier to explain): 1) Data model: a sparse, distributed, persistent, multi-dimensional sorted map, indexed by row key, column key, and timestamp. Row keys are arbitrary strings, and every read or write of data under a single row key is atomic. The rows are kept in lexicographic order and partitioned into tablets, which makes reads of particular row ranges much more efficient. Column keys are grouped into sets called column families, and the assumption is that the set of column families rarely changes. Access control and both disk and memory accounting are performed at the column-family level. Since multiple versions of the same data can exist, timestamps are needed; versions are stored in decreasing timestamp order so the most recent are read first, and garbage collection is occasionally used to get rid of old entries. 2) BigTable API: allows creating and deleting tables and column families; clients can write or look up values; there is also support for more complex operations, and built-in support for previous work we read about: MapReduce. 3) Building blocks: the Google File System is used to store logs and data files, the Google SSTable file format is used internally to store BigTable data, and BigTable relies on Chubby, a highly available distributed lock service, for master election, bootstrap location, and schema information. Unfortunately, since they are so intertwined, if Chubby is unavailable, BigTable becomes unavailable as well. 4) Implementation: there are three components: a library linked into every client, a master server, and many tablet servers. The master takes care of metadata and of balancing the tablet load across tablet servers. Each tablet server handles the local work of reads and writes on its assigned tablets. Clients almost never communicate with the master, but talk to the tablet servers directly, so the master is under much less "stress". 5) Refinements: compression techniques -> reduce space overhead. Caching -> better performance for repeated reads of the same data. Bloom filters -> reduce the number of disk I/Os. A single commit log per tablet server -> used to reconstruct state in the case of failures. When the master moves a tablet between servers -> the tablet is compacted and handed over without requiring any recovery of log entries. 
Immutability of SSTables -> forces us to garbage-collect obsolete SSTables, but lets tablets be split efficiently and lets reads be served from the memtable and SSTables without locking (a read-path sketch follows this review). This paper, much like others, has some drawbacks. The first thing I noticed is Google's reliance on master/worker architectures. I understand that the master server is in charge of metadata operations and the tablet servers do the local work, but this can make the master server a performance bottleneck; in the case of failure, or if the workload grows large enough, there may be a need for multiple masters. Another drawback I noticed is that even though they did an excellent job showing that their storage layout supports Google's operations, they did not benchmark it against any other system. Perhaps this is because Google likes creating its own infrastructure rather than relying on third parties, but it would be helpful to see the gains compared with other approaches. |
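The read side of the design discussed in this review is worth sketching: a read resolves against a merged view of the memtable and the stack of immutable SSTables, with newer layers shadowing older ones. The layer structures and the linear scan below are my own simplifications of the real merged iterators.

```python
# Sketch of a tablet read: the value for (row, column) is resolved against a
# merged view of the memtable and the immutable SSTables, newest layer first.
# Because SSTables are immutable, the layers can be searched and cached
# without locking; this listing is an illustration, not the real merge.
def read(row, column, memtable, sstables):
    """Return the newest value for (row, column) across all layers."""
    # memtable first, then SSTables from newest to oldest.
    for layer in [memtable] + list(reversed(sstables)):
        versions = [(ts, v) for (r, c, ts), v in layer.items()
                    if r == row and c == column]
        if versions:
            return max(versions)[1]          # newest timestamp in this layer
    return None

old_sstable = {("com.cnn.www", "contents:", 1): "<html>old</html>"}
new_sstable = {("com.cnn.www", "contents:", 2): "<html>newer</html>"}
memtable    = {("com.cnn.www", "anchor:cnnsi.com", 3): "CNN"}
print(read("com.cnn.www", "contents:", memtable, [old_sstable, new_sstable]))
# -> '<html>newer</html>' from the newest SSTable that has the column
```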
This paper describes Bigtable, a storage system for structured data that can scale to extremely large sizes. Bigtable is a Google system, so it is built on top of GFS and uses Chubby for handling locks. Tables are represented as a multi-dimensional map, where a row-column combination maps to a cell containing an uninterpreted array of bytes. Each row is identified by a unique string, and any updates to any number of columns in a single row are performed atomically. Each column is also identified by a string, but columns are grouped into “families”, and the full column key is the family name followed by the column’s qualifier. Any cell can have several versions, each with its own timestamp. The timestamps can be set automatically by Bigtable or manually by the user, and Bigtable can automatically delete old versions after a certain amount of time, or if there are too many. Each group of rows is organized into a tablet, and tablets are kept around the same size. For storage purposes, one master server manages everything, and an arbitrary number of tablet servers store tablets; the number of tablet servers can change at any time to accommodate increased load. Whenever a tablet server starts, it creates a uniquely named file in a Chubby directory and acquires an exclusive lock on it. The master server can list these files to find the tablet servers. If a tablet server ever loses its lock, the master assumes that it is dead and deletes the file. As such, a server is considered alive only as long as it holds its Chubby lock, and the master can rearrange tablets among tablet servers as needed. The advantages of Bigtable lie mostly in its high scalability: it stores structured data like a DBMS, but is able to distribute that data among arbitrarily many servers. According to the experimental data, increasing the number of servers does increase the number of reads and writes performed on the system overall, but the per-server throughput decreases. On the downside, the experiments only show the performance of Bigtable in a vacuum; it is not compared to any other storage system, which means fewer conclusions can be drawn about its performance. As well, Bigtable requires that no more than one master is active at a time; the paper does not address whether this could become a bottleneck as the number of tablet servers increases and the master has to do more work moving tablets around. |
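The liveness protocol described in this review (a tablet server creates a uniquely named file in a servers directory and holds an exclusive lock on it; the master lists the directory and treats a lost lock as a dead server) can be sketched with an in-memory stand-in for Chubby. The `FakeLockService` class and all of its method names are entirely my own construction and are not Chubby's API.

```python
# In-memory stand-in for the Chubby-based liveness protocol: each tablet
# server creates a uniquely named file in a "servers" directory and holds an
# exclusive lock on it; the master lists the directory to discover live
# servers and reassigns tablets when a lock is lost.  Names are invented.
import uuid

class FakeLockService:
    def __init__(self):
        self.locks = {}                       # file name -> owner

    def create_and_lock(self, directory, owner):
        name = f"{directory}/{uuid.uuid4().hex}"
        self.locks[name] = owner
        return name

    def lose_lock(self, name):                # e.g. network partition, crash
        self.locks.pop(name, None)

    def list_locked(self, directory):
        return [n for n in self.locks if n.startswith(directory + "/")]

chubby = FakeLockService()
file_a = chubby.create_and_lock("servers", "tablet-server-a")
file_b = chubby.create_and_lock("servers", "tablet-server-b")
print(len(chubby.list_locked("servers")))     # master sees 2 live servers

chubby.lose_lock(file_a)                      # server A loses its lock
# The master now treats server A as dead and reassigns its tablets.
print(len(chubby.list_locked("servers")))     # master sees 1 live server
```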
Although Google has GFS to store files, applications have higher-level requirements. GFS only provides data storage and access, but applications may need version control or access control (such as locks), and GFS's master may be too burdened to handle requests from multiple large-scale distributed systems. So Google designed a database system to manage structured data: Bigtable, which builds on other pieces of infrastructure such as GFS and Chubby. Bigtable's logical structure is a sparse multi-dimensional map. Rows describe entities (such as URLs or sessions), and a range of rows is called a tablet, while columns describe an application's attributes (such as anchors, contents, and authors) and are grouped into column families. The data in each cell carries a timestamp to track versions. The column family is the basic unit of access control, and data with locality is fetched together to improve performance. In the Webtable example, only the three most recent versions of each cell are kept. APIs are provided for applications to control the data in tables, and clients can also use scripts to process data. There are three components in Bigtable: the client library, the master server, and the tablet servers. The master assigns tablets to tablet servers, keeps track of them, and splits or merges tablets according to their sizes. Tablet servers process read/write requests. The client library provides the API used to contact servers and caches tablet location information. Tablet locations form a three-level hierarchy: the root tablet describes the locations of the METADATA tablets, and the METADATA tablets describe the locations of the user tablets. Membership and master election are coordinated through the Chubby lock service to avoid conflicting concurrent operations. Strengths: (1) As in GFS, clients communicate directly with tablet servers for read/write operations, which helps the system avoid the bottleneck of centralized coordination at the master. (2) There are two different layers of centralization in this system, and both are handled well enough to survive failures: the master of the tablet servers is supported by the Chubby lock service, which ensures there is at most one master at any time and eliminates the possibility of inconsistency, while the GFS master is replicated and backed up periodically. Weak points: (1) While the use of different systems (GFS, SSTable, Chubby) decouples the different layers and aspects of Bigtable (GFS is the low-level file storage, SSTable is the actual data structure, and Chubby is responsible for metadata, cluster membership, and status monitoring), the interactions between these systems could lead to overhead and to complexity in maintenance. |
This paper introduces Bigtable, which is a distributed storage system for managing structured data, designed to scale to a very large size. Google projects like Google Earth and Google Finance store their data in BigTable. These applications place very different demands on BigTable, both in data size and in latency requirements, and BigTable turns out to provide a flexible solution for all of them. First of all, Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes. The row keys in a table are arbitrary strings, and Bigtable maintains data in lexicographic order by row key. The column keys are grouped into sets called column families, which form the basic unit of access control; a column key consists of a family and a qualifier. Furthermore, each cell in a Bigtable can contain multiple versions of the same data, and these versions are indexed by timestamp so that different versions do not collide. For example, in Webtable the timestamp is the time at which the page was crawled. The paper then briefly introduces the Bigtable API, which provides functions for creating and deleting tables and column families, and for changing cluster, table, and column family metadata such as access control rights. One thing to note is that Bigtable can be used with MapReduce, so it can participate in large-scale parallel computations. Bigtable does not stand alone but is built on several building blocks: it uses the distributed Google File System to store log and data files, the Google SSTable file format is used internally to store Bigtable data, and Bigtable relies on a highly available and persistent distributed lock service called Chubby. The paper then discusses the implementation of Bigtable, which has three major components: a library that is linked into every client, one master server, and many tablet servers. The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS. Each tablet server manages a set of tablets, handles read and write requests to the tablets it has loaded, and splits tablets that have grown too large. Clients communicate directly with tablet servers for reads and writes. A Bigtable cluster stores a number of tables; each table consists of a set of tablets, and each tablet contains all data associated with a row range. The paper goes into technical detail on each major component. To achieve high performance, there are a few refinements: clients can group multiple column families together into a locality group; clients can control whether or not the SSTables for a locality group are compressed; tablet servers use two levels of caching; a Bloom filter lets the server ask whether an SSTable might contain any data for a specified row/column pair; a single commit log is used per tablet server (sketched after this review); and the source tablet server performs a minor compaction on a tablet before moving it, to reduce recovery time. Lastly, the paper evaluates the performance of Bigtable and describes various Google applications that use it. The advantage of this paper is that it is very thorough; it covers essentially all aspects of Bigtable and the applications that actually use it, whereas most papers I have read this semester do not cover the application part. The paper also uses many figures, which helps with understanding. 
The paper also presents lessons learned from building Bigtable, which adds further perspective on the system. |
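The single commit log per tablet server mentioned in the review above can be sketched as follows: mutations from many tablets are appended to one physical log, and at recovery time the entries are filtered and sorted so a recovering tablet's mutations can be replayed contiguously. The record layout and sequence numbering below are my simplifications, not the actual format.

```python
# Sketch of the shared commit-log refinement: a tablet server appends the
# mutations of all its tablets to one physical log file (avoiding many
# concurrent writers to GFS).  During recovery the log is sorted by
# (tablet, sequence number) so each recovering tablet's mutations can be
# read back contiguously.  The record layout here is a simplification.
shared_log = []        # one physical log for the whole tablet server
seq = 0

def append(tablet, row, column, value):
    global seq
    seq += 1
    shared_log.append((tablet, seq, row, column, value))

append("tablet-1", "com.cnn.www", "contents:", "<html>a</html>")
append("tablet-2", "com.foo.www", "contents:", "<html>b</html>")
append("tablet-1", "com.cnn.www", "anchor:x", "CNN")

def recover(tablet):
    """Replay only this tablet's mutations, in order, from the shared log."""
    entries = sorted(e for e in shared_log if e[0] == tablet)
    return [(row, col, val) for (_, _, row, col, val) in entries]

print(recover("tablet-1"))   # both tablet-1 mutations, tablet-2's are skipped
```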
“Bigtable: A Distributed Storage System for Structured Data” by Chang et al. describes a new system at Google called Bigtable, which is a distributed storage system for structured data, designed to support a wide variety of data storage and processing use cases. Bigtable uses a simple data model, allowing users to choose nearly arbitrary row and column names, and encourages them to choose names in such a way that related records are stored near each other. Bigtable keeps track of multiple versions of a given table cell, and therefore allows clients to index not only by row or column key, but also by timestamp. Bigtable is built on the Google File System (GFS) for storage and Chubby as a distributed lock manager. A row range of data is stored in a tablet. Tablet servers host tablets, and the master server assigns tablets to tablet servers and monitors tablet server status. The authors evaluated Bigtable by measuring its performance as they varied the number of tablet servers, in particular measuring the rates of random reads, random writes, sequential reads, sequential writes, and scans. Next the authors discuss how Bigtable fares for Google's own internal use cases: Google Analytics, Google Earth, and Personalized Search. Finally, they discuss related work in distributed storage solutions and parallel databases. It is compelling to see that dozens of Google services use Bigtable. It also makes sense that the related work section discusses distributed and parallel databases; however, I wonder how these prior distributed or parallel DBMSs fare empirically against Bigtable, and it might have been nice to see an experiment comparing them. |
In this paper, engineers at Google propose a novel distributed storage system for structured data called Bigtable. The problem they set out to solve is to design and implement a distributed storage system that manages structured data at scale. This problem is very important for Google, one of the largest internet companies in the world. Google has tons of structured data, including URLs (contents, crawl metadata, links), per-user data (preference settings, recent queries), and geographic data (physical entities, roads, satellite image data). Beyond the variety of data, the scale is very large: billions of URLs with many versions and pages, hundreds of millions of users, and more than 100 TB of satellite imagery. At the time, this scale was too large for most DBMSs available in 2006, so they had to build their own system. As a result, they successfully built a distributed storage system featuring high scalability, performance, availability, and flexibility. Next, I will summarize the important techniques used in Bigtable. First of all, Bigtable is a distributed multi-dimensional sparse map: it maps from a key to an arbitrary byte array, where the key contains three attributes, a row name, a column name, and a timestamp; note that Bigtable is a multi-version system by virtue of the timestamp. The Bigtable API provides basic functions such as creating and deleting tables and column families, and changing cluster, table, and column family metadata. It also supports single-row transactions, using cells as integer counters, and execution of client-supplied scripts. The API can be easily accessed from C++ programs. There are three main components in Bigtable: the master server, the tablet servers, and a client library linked into every client. There is only one master, elected via Chubby, and there can be many tablet servers that are dynamically added or removed to handle the workload. One important feature of Bigtable is that it is built on several other pieces of Google infrastructure: GFS for log and data file storage, a cluster management system that handles job scheduling, resource management, failure handling, and monitoring, the Google SSTable file format (a persistent, ordered, immutable map used for data storage), and Chubby for distributed locking. Chubby is a highly available and persistent distributed lock service; its namespace consists of directories and files that serve as locks, it uses the Paxos algorithm to achieve consensus among its replicas, and reads and writes in Chubby are atomic. For locating tablets, Bigtable uses a three-level, B+-tree-like structure of special tablets that store tablet location information in Bigtable itself. The persistent state of a tablet is stored in GFS; before data reaches the immutable SSTable files, writes go to a commit log and then to a memtable, an in-memory sorted buffer, and reads are served from a merged view of the memtable and the SSTables. Bigtable uses compaction, both minor and major, so SSTables can be cleaned up and merged into new ones. The paper also introduces some optimization mechanisms. Clients can group multiple column families together into a locality group. Compression is applied to each SSTable block separately, using Bentley and McIlroy's scheme followed by a fast compression algorithm (a sketch of per-block compression follows this review). Caches are also provided for read performance, including the scan cache and the block cache. 
Bloom filters are introduced to reduce the number of disk accesses, and a single commit log per tablet server is used to reduce overhead. The experiments show that the performance and scalability of Bigtable are very good as the number of tablet servers increases. There are many advantages to Bigtable. First of all, it uses lots of commodity hardware to achieve high performance at low cost, following the same idea as GFS, which was quite pioneering. The framework performs very well, and it also achieves high scalability, availability, and flexibility. Second, Bigtable builds on many existing techniques developed within Google itself, which makes the design of Bigtable simpler and the system more robust; building a system internally means it can be reused across many projects and reduce cost, and Bigtable achieves this. The third advantage is that Bigtable is very configurable: the paper describes several projects within Google that use Bigtable, each requiring different features, yet Bigtable can be adapted simply by changing its configuration, and the results show it can handle almost any workload this way, which is very impressive! I think there are some downsides to this paper. First, in the refinements, they use Bloom filters to reduce the number of disk accesses; however, Bloom filters can return false positives, and the paper does not dwell on the extra disk reads incurred when a filter incorrectly reports that an SSTable may contain the requested data. Second, I think they do not provide many experiments in this paper; they only show how performance changes with the number of servers, and some additional experiments would make the results more convincing. Finally, although this paper was written more than ten years ago, Bigtable is still an important Google product. As far as I know, Bigtable has evolved rapidly and is now even offered to customers outside of Google, and I hope to learn more about today's Bigtable and the new features that have been added since. |
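The per-block compression scheme mentioned in the review above (each SSTable block compressed separately so a small read does not decompress the whole file) can be sketched with a generic compressor. zlib here merely stands in for the Bentley-McIlroy plus fast-compressor pipeline described in the paper, and the block size is chosen arbitrarily for the demo.

```python
# Sketch of per-block SSTable compression: the file body is cut into blocks
# and every block is compressed independently, so reading one block only
# decompresses that block.  zlib stands in for the two-pass scheme the paper
# uses; the 64-byte block size is only to make the demo visible (real
# SSTable blocks are on the order of tens of kilobytes).
import zlib

BLOCK_SIZE = 64

def compress_blocks(data: bytes):
    """Return a list of independently compressed blocks."""
    return [zlib.compress(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]

def read_block(blocks, index):
    """Decompress only the requested block."""
    return zlib.decompress(blocks[index])

data = b"row=com.cnn.www column=contents: value=<html>...</html> " * 20
blocks = compress_blocks(data)
print(len(blocks), "blocks; first block starts with:", read_block(blocks, 0)[:30])
```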