Review for Paper: 21-Chapter 22 (Database Management Systems by Ramakrishnan and Gehrke)

Review 1

This chapter covers topics related to parallel and distributed databases. It starts by defining parallel and distributed database systems.

Detailed topics discussed include:
- Parallel database architecture: shared nothing, shared memory, shared disk.
- Parallel query evaluation: how operators can be processed in parallel, either by executing independent operators concurrently or by evaluating partitioned data in parallel.
- Parallelizing individual operations: how bulk loading, scanning, sorting, and joins are evaluated in parallel in a shared-nothing architecture.
- Parallel query optimization: introduces two kinds of inter-operation parallelism: pipelining the result of one operator into another, and executing independent operations concurrently. An optimizer should consider such scenarios when optimizing queries.
- Distributed DBMS architectures: client-server systems, where server processes execute queries from clients; collaborating server systems, where servers run transactions against local data and cooperatively execute transactions spanning multiple servers; and middleware systems, where a layer of middleware software coordinates queries that span servers.
- Data storage in a distributed DBMS: data are fragmented (horizontally or vertically) and replicated (for increased availability and faster query execution).
- Query processing in a distributed DBMS: for joins, several methods are introduced and compared, including shipping data to one site for evaluation, semijoin, and bloomjoin.
- Updating distributed data: synchronous and asynchronous replication are introduced; asynchronous replication is preferred since synchronous replication can be too expensive or even infeasible. Primary site (capture-apply) and peer-to-peer replication methods are introduced.
- Distributed concurrency control: three lock management methods are introduced - centralized, primary copy, and fully distributed. Global deadlock detection, hierarchical deadlock detection, and timeouts are used to detect deadlocks.
- Distributed recovery: the two-phase commit (2PC) protocol is illustrated in detail.

There are several points I like about this chapter. First, it has a clear structure, clearly points out important aspects of both parallel and distributed DBMSs, and clearly differentiates the two concepts. Second, it has good examples to illustrate ideas, and it is easy to read as textbook material.

I would have enjoyed this chapter more if it gave more information on how distributed query optimization is carried out; the material offers only brief comments on this topic.


Review 2

Motivated by performance, increased availability, distributed access to data, and analysis of distributed data, parallel evaluation techniques and data distribution have become increasingly popular in databases.

1. Parallel databases:
The basic idea is to carry out evaluation steps in parallel whenever possible in order to improve performance.
Three main architectures exist: shared-memory, shared-disk, and shared-nothing systems. The problem with the former two is interference, which motivated the development of the shared-nothing system. A shared-nothing system provides linear speed-up and linear scale-up.

In shared-nothing systems, parallel relational query evaluation can be achieved by data partitioning, which can be done via round-robin partitioning, hash partitioning, or range partitioning. Round-robin partitioning is suitable for efficiently evaluating queries that access the entire relation, while hash partitioning and range partitioning are better for queries accessing only a subset of the tuples. Range partitioning is superior when range selections are specified, but is more likely to lead to data skew, whereas hash partitioning keeps data evenly distributed even as the data grows and shrinks over time.
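
To make the three schemes concrete, here is a small illustrative sketch in Python (my own example, not code from the chapter), assuming tuples are represented as dictionaries and we partition on a single field:

```python
import hashlib

def round_robin_partition(tuples, n_partitions):
    """Assign the i-th tuple to partition i mod n; good for full-relation scans."""
    parts = [[] for _ in range(n_partitions)]
    for i, t in enumerate(tuples):
        parts[i % n_partitions].append(t)
    return parts

def hash_partition(tuples, key, n_partitions):
    """Hash the partitioning field; keeps data spread evenly as it grows or shrinks."""
    parts = [[] for _ in range(n_partitions)]
    for t in tuples:
        h = int(hashlib.md5(str(t[key]).encode()).hexdigest(), 16)
        parts[h % n_partitions].append(t)
    return parts

def range_partition(tuples, key, split_vector):
    """split_vector = sorted boundary values; good for range selections,
    but a poorly chosen vector leads to data skew."""
    parts = [[] for _ in range(len(split_vector) + 1)]
    for t in tuples:
        idx = sum(1 for boundary in split_vector if t[key] >= boundary)
        parts[idx].append(t)
    return parts
```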

The basic idea of parallelizing existing code for sequentially evaluating a relational operator is to use parallel data streams. Streams (from different disks or the output of other operators) are merged as needed to provide the inputs for a relational operator, and the output of an operator is split as needed to parallelize subsequent processing.

Two operations that can be parallelized effectively are sorting and joins. Sorting can be parallelized by first redistributing all tuples in the relation using range partitioning, where a splitting vector is used to achieve an evenly distributed range partition, and then sorting each range locally. Joins can be optimized by executing the smaller joins one after another, with each join executed in parallel using all processors.
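
A minimal sketch of that parallel sort idea (illustrative only; the splitting vector and the thread pool standing in for separate processors are assumptions, not the book's code):

```python
from multiprocessing.pool import ThreadPool

def parallel_sort(values, split_vector, n_workers):
    # Route each value to the range it falls in (range partitioning).
    ranges = [[] for _ in range(len(split_vector) + 1)]
    for v in values:
        idx = sum(1 for b in split_vector if v >= b)
        ranges[idx].append(v)
    # Each worker sorts one range; the thread pool stands in for the
    # per-processor sorts of a shared-nothing system.
    with ThreadPool(n_workers) as pool:
        sorted_ranges = pool.map(sorted, ranges)
    # Ranges are disjoint and ordered, so the result is just their concatenation.
    return [v for r in sorted_ranges for v in r]

print(parallel_sort([7, 3, 42, 19, 8], split_vector=[10, 30], n_workers=3))
# [3, 7, 8, 19, 42]
```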

Parallel query optimization exploits two forms of inter-operation parallelism: (1) the result of one operator can be pipelined into another, and (2) multiple independent operations can be executed concurrently. However, issues still need to be considered, including operation cost estimation, the tradeoff between speed and cost, and parameters that are known only at run time.

2. Distributed databases.
The desired properties are distributed data independence and distributed transaction atomicity, although guaranteeing them can be prohibitively expensive in some settings due to administrative overhead. Distributed databases can be homogeneous distributed database systems or heterogeneous distributed database systems (also called multidatabase systems). Note that distributed data management comes at a significant cost in terms of performance, software complexity, and administration difficulty.

Distributed DBMS architectures are Client-Server, Collaborating Server, and Middleware.
In Client-Server systems, clients are responsible for user-interface issues, and servers manage data and execute transactions. In Collaborating Server systems, we have a collection of database servers, each capable of running transactions against local data, which cooperatively execute transactions spanning multiple servers. The Middleware architecture is designed to allow a single query to span multiple servers without requiring all database servers to be capable of managing such multi-site execution strategies: only one database server needs to be capable of managing queries and transactions spanning multiple servers, while the remaining servers only need to handle local queries and transactions.

To reduce the overhead of accessing remote sites, relations are either fragmented or replicated. Fragmentation can be horizontal or vertical. Replication means that we store several copies of a relation or relation fragment, and is motivated by increased availability of data and faster query evaluation. Replication can be categorized into synchronous and asynchronous replication.

When updating distributed databases, transactions should continue to be atomic actions, regardless of data fragmentation and replication. Synchronous replication means that before an update transaction commits, it synchronizes all copies of the modified data. Asynchronous replication, which is more widely used in distributed databases, means that copies of a modified relation are updated only periodically, so a transaction that reads different copies of the same relation may see different values. Asynchronous replication compromises distributed data independence, but it can be implemented more efficiently than synchronous replication.

The main contributions and advantages of this paper are:
1. It gave the motivation for parallel and distributed DBMSs, and their alternative architectures respectively.
2. It introduced the pipelining and data partitioning used to gain parallelism, and the dataflow concepts used to parallelize existing sequential code.
3. It described how data is distributed across sites, and how to evaluate and optimize queries over distributed data.
4. It explained the tradeoffs of synchronous vs. asynchronous replication and transaction management in a distributed environment.

The main disadvantage is that although this paper tried to use examples to illustrate the fundamental ideas, it could use more figures instead of text descriptions to help readers better understand the concepts and processing pipelines. Besides, it could compare parallel databases to distributed databases, both in terms of concepts and techniques used in these systems.




Review 4

This chapter in the textbook deals with parallel and distributed databases. While many discussions of database management systems treat transaction processing as essentially sequential for simplicity, being able to run operations in parallel can lead to significant performance gains. Additionally, a distributed system has advantages such as increased availability, since multiple machines can store copies of the data. Parallel databases can be built as Shared Nothing, Shared Memory, or Shared Disk architectures, which differ in the level of access that a given CPU has to data, as well as the method of communication between CPUs. In shared-disk and shared-memory architectures, each CPU has its own memory or disk, respectively, while sharing the common disk or memory. These architectures have lower communication overhead, but suffer in scaling due to increased contention for the shared memory/disk. For shared nothing, each CPU has its own memory and disk, and communicates with other processors through an interconnection network. It often requires the DBMS code to be adapted more extensively, but it can provide linear speedup, and it is often the model of choice for large parallel database systems.

In order to perform parallel query evaluation in a shared-nothing architecture, the data must be partitioned and the operators parallelized. Some operations like scanning and loading can be done in bulk, while sorting can be done with range partitioning. Joins, though, can be slightly more complicated in how they are partitioned and executed. Of course, pipelining and parallel execution of independent operations can further optimize this process.

The other major topic of this chapter is distributed databases. Some key features for any distributed database system include data independence, i.e. users should be able to request data without having to specify where the data is located, and transaction atomicity, where users should be able to write transactions that access and update data just as if the data was stored locally. Distributed database systems can be either homogeneous, where all servers run the same DBMS software, or heterogeneous, where they can vary. To deal with heterogeneous systems, a gateway protocol can be used, but this adds performance penalties due to overhead, etc. Three types of architectures for distributed database systems are Client-Server, Collaborating Server, and Middleware. In a client-server system, there are one or more client processes and one or more server processes, and a client process can send a query to any server process. The client process manages the user-facing side, while server processes manage the data and execute transactions. For collaborating server systems, a collection of database servers cooperatively execute transactions spanning multiple servers. This is in contrast to the client-server model, where a single query cannot span multiple servers. Finally, a middleware system allows a single query to span multiple servers, without requiring all servers to be able to manage multi-server execution plans. Once this architecture is decided, many other issues need to be dealt with, like whether to partition or fragment relations across various servers, keeping track of this distributed data, and processing queries in a distributed manner. Additionally, concurrency control, redundancy, and recovery are all more complicated when dealing with a distributed system. For example, a distributed DBMS has more modes of failure, such as communication links or a specific server failing while a transaction is executing, so it can be challenging to guarantee that, despite these potential failures, the transaction will be committed (or rolled back in case of problems).

Overall, this chapter is a strong and comprehensive overview of the major issues involved in the implementation of parallel and distributed database systems. The writing is easy to follow and complements the textbook’s mission of communicating this knowledge well. Major concepts are delineated clearly, and covered more extensively in the text.

In terms of weaknesses, I cannot identify any, especially given that this is a textbook meant to teach fundamentals rather than a research or review article. As such, some things that would make it better, like examples of actual real-world systems employing these concepts, would not be as appropriate, unless they were used as historical examples, since technology is progressing so rapidly.


Review 5

The purpose of this reading was to provide the reader with an overview of parallel processing and data distribution in database management systems. This issue is important because it is motivated not only by the performance concerns of sequentially executed transactions, but also by a desire for safety and availability, locally accessible data for systems distributed geographically far from one another, and analysis of data that is already distributed. The author introduces the different types of memory arrangements for a parallel system, including shared nothing, shared disk, and shared memory. They explain how sharing memory and disk resources does not scale well because of contention caused by different processors operating on the same data. The author then focuses on shared nothing, which does not have this problem. In shared-nothing systems, processing of a single transaction can be parallelized across multiple processors. When partitioning the data, the author introduces the idea of hashing tuples to different processors versus storing ranges of tuple values on certain processors.

Next we cover loading horizontally partitioned relations as well as scanning in bulk. We look into how sorting is performed on partitions, which is easily parallelizable if range partitioning is used properly. The author then introduces joins, which can be made parallelizable by hash-partitioning the relations. Next the author covers parallel query optimization, which is the idea of either pipelining the output of one operator into the next, or executing operations concurrently if they are independent of one another. Next the author speaks about distributed databases, where we can operate on data as if it were local, when it may be distributed over multiple systems that are remotely located. The author covers the three different types of distributed DBMSs: Client-Server systems, where a client can query any single server; Collaborating Server systems, where servers cooperatively execute queries spanning multiple servers; and Middleware systems, where a single query can span multiple servers without requiring every server to manage multi-site execution. We then look into distributing data with methods like fragmenting relations across different places in the system.

I liked how this text was brief in explaining each concept. It laid each idea out in a logical order. I do wish, however, that they had more visuals to help with the newer ideas. It was still easily understandable, I thought, and high level.



Review 6

This paper gives a comprehensive overall introduction to parallel and distributed DBMSs.

There are three classical architectures for parallel databases: 1) shared nothing, 2) shared memory, and 3) shared disk. In a shared-nothing system, each CPU has local main memory and disk space, but no two CPUs can access the same storage area, and all communication between CPUs is through a network connection. In the shared-memory architecture, multiple CPUs are attached to an interconnection network and can access a common region of main memory. In the shared-disk architecture, each CPU has private memory and direct access to all disks through an interconnection network. Shared memory and shared disk have the problem that as more CPUs are added, existing CPUs are slowed down because of increased contention for memory accesses and network bandwidth. The shared-nothing architecture requires more extensive reorganization, but it has been shown to provide linear speed-up and linear scale-up.

Parallel query evaluation is considered in the shared-nothing architecture. A relational query execution plan is a graph of relational algebra operators, and the operators in a graph can be executed in parallel. In addition to evaluating different operators in parallel, we can evaluate each individual operator in a query plan in a parallel fashion. This is done by first partitioning the data and then combining the results. To achieve parallel reads and writes, a large dataset is partitioned horizontally across several disks. Ways of implementing this include assigning tuples to processors in round-robin fashion, by hashing, or by ranges of field values.

When parallelizing a database, we need to parallelize individual operations. When scanning or loading a relation, pages can be read in parallel and bulk loading can be done in parallel. For sorting, range partitioning is important for parallel sorting; the principle is much the same as sorting in map-reduce. For parallel joins, take parallel hash join as an example: the basic idea is to divide the join into smaller joins by partitioning the joining relations into k logical buckets or partitions. As queries can be executed in parallel, query optimization must also account for parallelism. There are two kinds of inter-operation parallelism: first, the result of one operator can be pipelined into another; second, multiple independent operations can be executed concurrently.
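
As a rough sketch of that idea (hypothetical relation and field names, not the book's code), both inputs can be hash-partitioned on the join column into k buckets; bucket i of one relation only ever needs to meet bucket i of the other, so each of the k smaller joins can run on its own processor:

```python
from collections import defaultdict

def hash_partition(rel, key, k):
    buckets = defaultdict(list)
    for t in rel:
        buckets[hash(t[key]) % k].append(t)
    return buckets

def local_hash_join(r_part, s_part, r_key, s_key):
    # Ordinary in-memory hash join running on one processor.
    table = defaultdict(list)
    for r in r_part:
        table[r[r_key]].append(r)
    return [{**r, **s} for s in s_part for r in table.get(s[s_key], [])]

def parallel_hash_join(R, S, r_key, s_key, k):
    r_parts, s_parts = hash_partition(R, r_key, k), hash_partition(S, s_key, k)
    out = []
    for i in range(k):  # in a real system, bucket i would go to processor i
        out.extend(local_hash_join(r_parts[i], s_parts[i], r_key, s_key))
    return out

# Example (hypothetical keys): parallel_hash_join(sailors, reserves, "sid", "sid", k=4)
```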

In a distributed database, what we want is distributed data independence and distributed transaction atomicity. The first means users should be able to ask queries without specifying where the referenced relations, or copies or fragments of the relations, are located. The second means users should be able to write transactions that access and update data at several sites just as they would write transactions over purely local data. Two types of distributed database systems are introduced. If data is distributed but all servers run the same DBMS software, it is a homogeneous distributed database system. If different sites run under the control of different DBMSs and are connected somehow to enable access to data from multiple sites, it is a heterogeneous distributed database system. Overall, in running a distributed DBMS, we need to consider performance, software complexity, and administration difficulty.

There are three different kinds of distributed DBMS architectures: the client-server system, the collaborating server system, and the middleware system. A client-server system has one or more client processes and several server processes, and a client process can send a query to a server process. This architecture is relatively simple to implement, and users can have a friendly interface on the client. However, a single query cannot span multiple servers, because the client process would have to be capable of breaking such a query into appropriate subqueries to be executed at different sites and then piecing together the answers to the subqueries; the collaborating server architecture addresses this by letting the servers cooperatively execute transactions spanning multiple servers. The middleware architecture is designed to allow a single query to span multiple servers, without requiring all database servers to be capable of managing such multi-site execution strategies.

For distributed storage in a DBMS, relations are stored across several sites. Accessing a relation stored at a remote site incurs message-passing costs; to reduce this overhead, a single relation may be partitioned or fragmented across several sites, with fragments stored at the sites where they are most often accessed, or replicated at each site where the relation is in high demand. Both horizontal and vertical fragmentation are considered: for the first, the union of the horizontal fragments equals the original relation, while for the second, the collection of vertical fragments forms a lossless-join decomposition. We may also want to store several copies of a relation or of relation fragments; the entire relation can be replicated at one or more sites. This has two advantages: 1) increased availability of data and 2) faster query evaluation.

The main advantage of this paper is that it is easy to read and logically organized, with rich background. It walks from the motivation and architectures, through pipelining and how the data is organized, to how query optimization is done in parallel. The main contribution of this paper is its systematic overview and summary. The only drawback is the lack of diagrams and some comparisons.




Review 7

This chapter mainly investigates the issues of parallelism and distribution in DBMSs. A parallel database system seeks to improve performance through parallelization of operations, while a distributed database system physically stores data across sites and is motivated by distributed access and distributed analysis.

There are three main architectures for parallel databases: shared nothing, shared memory, and shared disk. The interference bottleneck motivated the development of the shared-nothing architecture, which provides linear scale-up. The main approach to parallelism in a shared-nothing system is to partition the input data and process the partitions in parallel. Round-robin partitioning cannot accelerate queries on a subset of the data, and range partitioning may lead to data skew. Obtaining good parallel versions of algorithms for sequential operator evaluation requires careful consideration. Many operations can be executed in parallel; the author uses the example of hash join to elaborate.

A homogeneous distributed system means that the data is distributed but all servers run the same DBMS. A heterogeneous distributed system, also known as a multidatabase system, means that different sites run different DBMS software but are connected somehow. A gateway protocol is essential in a heterogeneous distributed system, which comes at a high cost in terms of performance, software complexity, and administration difficulty. There are three main types of distributed DBMS architectures: the client-server system, the collaborating server system, and the middleware system. In the latter part of the chapter, the author discusses fragmentation and replication, query optimization and evaluation in a distributed DBMS (such as bloomjoin and semijoin), and distributed transactions.

The main contribution of this chapter is that it gives a good overview of parallel and distributed databases; many essential topics are covered. Basic concepts and materials in these two topics are well organized. The author clearly states the problem in each part and discusses the tradeoffs between solutions.

One limitation of this chapter is that, when discussing distributed DBMSs, concepts are only given at a high level. Also, the author mainly extends the traditional DBMS to the distributed setting, and does not introduce more modern distributed database designs.




Review 8

The textbook chapter provides an overview of parallel and distributed database systems. The chapter starts with parallel databases. Important topics covered are architecture, pipelining, and partitioning. The authors explain a variety of alternatives for architectures (shared-disk, shared-memory, shared-nothing) and partitioning (round-robin, range, hash). From there, they describe applications for some of the alternatives. The second half of the chapter is about distributed databases. Similarly, the chapter starts out with architectures. Then it dives into distributed DB specific content, such as fragmentation, distributed query processing, and distributed transactions/concurrency control.

Many of the concepts are very well laid-out. I found that I understood the material pretty well after reading the chapter, although because it is a textbook chapter, it’s described at a fairly basic level. Overall, I found the chapter to be very thorough, describing most of the basic concepts that one would want to learn about before diving in to more advanced material. Two concepts that were particularly interesting were the sorting/joining algorithms that were described and two-phase commit. I had heard of 2PC before, but didn’t really know what it was, and the chapter did a good job describing how it works when failures happen, in addition to describing certain optimizations. The different distributed join algorithms were described very well, with the authors helping the reader to understand the pros and cons of each.

One thing that I found lacking was that there weren’t very many diagrams (maybe one on every 4th page), even though they would have been useful for explaining some concepts. I also noted that there did not seem to be much in terms of real-world examples. The authors could have reinforced concepts by describing some real-world parallel and/or distributed systems, along with their applications. While some of this would have ended up seeming dated in 2018, I think it would still be useful to see how the concepts are applied in the real world.



Review 9

This chapter introduced parallel and distributed databases. The motivation for a parallel database system is to improve performance through parallelization of various operations, while for distributed databases the goal is to increase availability, exploit locality of data access, and gain the ability to analyze already-distributed data.

The first half of the chapter focuses on parallel databases. There are three main architectures for parallel databases: shared memory, shared disk, and shared nothing. Among the three choices, only shared nothing architecture can provide linear speed-up and scale-up. To actually run queries in parallel, data should somehow be partitioned. The book introduces three methods: round-robin, hash and range partitioning. Besides data, parallel implementations for query operations are also necessary. The book gives examples for loading & scanning, sorting, and hash join. Here the key idea is to split and send data to multiple processors when necessary and later combine the results. For example, to do hash join, we can first use a hash function (on the join column) to split data into multiple small, disjoint parts. Then each processor can calculate one small join, and finally, we can simply combine the output of all processors to get the result.

Distributed database systems are much more complex than parallel ones, as we need to provide distributed data independence and distributed transaction atomicity. Since there are now multiple server nodes, the system architecture is quite different. In terms of which part handles the coordination of sub-queries, there are three architectures: client-server, where the client must issue the sub-queries itself if it wants to access multiple servers; collaborating server, where the servers handle query splitting and merging; and middleware, where specialized servers handle coordination. The book then talks about different designs for data storage: horizontal or vertical fragmentation or both, and synchronous or asynchronous replication. It also mentions catalog management and how database systems generate a globally unique name for each replica.

Query processing in distributed databases should also take the cost of transferring data into account, especially for the join operation. The book shows how "fetch as needed" can result in very poor performance, and how we can use more advanced algorithms such as semijoin and bloomjoin to get better results. Finally, for distributed updates and transactions, the book describes several different designs. For example, we can use primary site asynchronous replication to handle updates; in this design, only the master copy can be updated, and the other copies (called secondaries) are read-only. As another example, distributed concurrency control needs new mechanisms for lock management and deadlock detection. One solution is to use a fully distributed lock system and hierarchical deadlock detection. Recovery is also a tough question in distributed databases; the book briefly explains how to use two-phase commit to solve the problem and points out several ways to improve it.
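
As an illustration of the bloomjoin idea (a toy sketch with assumed names, not the book's implementation), the site holding R ships only a small bit vector, and the other site ships back only the tuples whose join value hashes to a set bit, so most non-matching tuples never cross the network:

```python
NUM_BITS = 1024  # assumed size of the bit vector

def build_bit_vector(R, key):
    bits = [False] * NUM_BITS
    for r in R:
        bits[hash(r[key]) % NUM_BITS] = True
    return bits

def filter_with_bit_vector(S, key, bits):
    # False positives are possible (hash collisions), false negatives are not,
    # so the final join at the first site still produces the exact result.
    return [s for s in S if bits[hash(s[key]) % NUM_BITS]]
```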

One weakness of this chapter is that it touches on many different aspects of parallel and distributed databases without going deep into any of them. It’s good for someone who only wants a high-level understanding, but to really understand the topic, readers may need to find other resources.


Review 10

In Chapter 22 of "Database Management Systems", Ramakrishnan and Gehrke discuss the benefits and issues of parallelism and data distribution in a DBMS. Parallelism is a field that is growing in popularity due to its potential for greater performance in operations such as loading data, building indexes, and evaluating queries. Likewise, data distribution is also trending due to its ability to let databases reflect organizational structures, improve sharing and local autonomy, and grow modularly. With a trade-off in complexity and cost, parallel and distributed DBMSs can outperform their sequential counterparts. It is not far-fetched to say that the future of DBMSs will work on a scale-out model rather than a scale-up model. Since progress towards building machines with faster computation has slowed down, we need to be smarter and use our current resources effectively. Thus, it is clear that this is an important and interesting field to research.

Ramakrishnan and Gehrke split up the chapter into several subsections:

1) Architecture for Parallel Databases: There are three main architectures that are considered: shared nothing, shared memory, and shared disk (described in previous papers). It is also important to note the scale-up vs. speed-up relationship. Adding more CPUs enables more instructions per second and, consequently, the ability to process larger problems.
2) Parallel Query Evaluation: Pipelined parallelism is limited by the presence of blocking operators, which cannot produce output until they have consumed all their input. The best way to evaluate an operator in parallel is to partition the input data and combine the results. This approach is called data-partitioned parallel evaluation.
3) Parallelizing Individual Operations: We can parallelize bulk loading, bulk scanning, sorting, and joins. Pages can be read in parallel while scanning a relation, and the retrieved tuples can then be merged. When sorting, we need to make sure that processors receive tuples in roughly equal amounts; a disproportionate amount can act as a bottleneck for the system. Lastly, for joins, we can implement an improved parallel hash join that greatly reduces the cost by breaking the work into smaller joins.
4) Parallel Query Optimization: The optimizer needs to consider the cost of running queries in parallel rather than sequentially. It should also consider bushy trees, with their larger search space, and parameters such as the available buffer space.

5) Distributed DBMS Architecture: There are three different architectures: client-server systems, collaborating server systems, and middleware systems. Each has separate functionality, with client-server systems being the most popular. Such a system has several client processes and one or more server processes. A client process can send a query to any server process; clients are responsible for the UI while the servers manage data and handle transactions.
6) Storing and Updating Data in a Distributed DBMS: We can horizontally or vertically fragment data to store subsets of the rows or columns of an original relation. Furthermore, replication allows for increased availability of data and faster query evaluation. To propagate updates, we can use asynchronous replication, where copies of a modified relation are updated only periodically and transactions that read different copies of the same relation may see different values.
7) Distributed Query Processing: In this case, we need a better way to evaluate the cost of running queries, which means changing our cost model. Optimization proceeds essentially as in a centralized DBMS, with the cost metric extended to account for communication costs, so the planning process itself is not fundamentally different.
8) Distributed Transactions, Concurrency Control, and Recovery: We need to manage locks across several sites and deal with deadlock situations that can arise. Furthermore, transaction atomicity must be guaranteed: when a transaction commits or aborts, this is reflected across all sites.

Even though Ramakrishnan and Gehrke described parallelism and distributed DBMSs in a logical manner, there are still some drawbacks. The first is the lack of examples when describing parallelized operations. I feel that the best way to understand a topic is to deal with a specific case and then make an abstraction from there. The second is that I would have liked to see a list of some companies that use data distribution effectively. Even if something sounds good on paper, industry will sometimes be reluctant to adopt it for extraneous reasons. Lastly, I felt they could have given more detail about the runtime of parallel databases versus sequential databases. In particular, some graphs with generalized results would convince me that this is worth pursuing, despite the complexity that would arise.


Review 11

This chapter describes various difficulties and implementations of parallel and distributed DBMSs. With multiple CPUs, DBMSs can parallelize several kinds of operations. Data reads can be sped up, with several processors reading in parallel. Sorting can be parallelized by range-partitioning the data; each processor sorts its partition, and the sorted results are concatenated afterwards. Joining can also be parallelized: each processor can receive one partition from each of the two tables and join them independently. Because execution can be parallel, optimizers need to take this into account and look for what can be parallelized, like pipelined or independent operations that multiple processors can work on.

On the other hand, data can be distributed across multiple locations. This makes it easy to add storage and can speed up some operations. We would like distributed data access to be both atomic and independent of location, but this isn’t always possible. Relations can be fragmented, i.e., split into pieces stored at different locations, and replicated, i.e., stored as multiple copies.

In order to access replicas consistently, each replica should have a unique name. Since keeping track of names across all locations would be a bottleneck, each location can maintain its own catalog of the names of every replica it stores. Any query must then look up the locations of the appropriate fragments in various catalogs. Copying fragments from one location to another adds cost to the query, so an optimizer must take these costs into account when choosing a query plan.

Updating replicas requires extra machinery to keep them consistent. One option is to update replicas asynchronously, so that updating is fast, but some independence and consistency are lost. For synchronous updates, quorum-based replication can be used, where reads and writes must each touch enough replicas that a read always overlaps the most recent write. To speed up reads, this can be set up so that any write must update all replicas and a read need only read from one. Another way to update data is to have one or more master replicas and make all others secondary; the masters must be updated, and the changes are propagated to the secondary replicas later on.
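
The quorum rule itself can be stated compactly; here is a small sketch (my own illustration, not from the chapter) of the two conditions, with read-one-write-all as the special case mentioned above:

```python
def valid_quorums(n, qr, qw):
    """With n copies, a read quorum qr and write quorum qw are safe if every
    read overlaps the latest write (qr + qw > n) and two writes cannot both
    succeed on disjoint sets of copies (2 * qw > n)."""
    return qr + qw > n and 2 * qw > n

assert valid_quorums(4, qr=1, qw=4)           # read-one-write-all
assert valid_quorums(4, qr=3, qw=2) is False  # two writes could each grab 2 disjoint copies
```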

When data is distributed, concurrency control and recovery must also be distributed. A standard locking scheme won’t always protect replicated data, since different transactions could acquire locks on different copies of the same object, so lock management can be distributed, and deadlock detection must be distributed as well.

For recovery, commit decisions are logged by agreement. A single location is designated the coordinator of a transaction, and all other locations involved are subordinates. The coordinator must receive acknowledgements from all subordinates in order to commit, and it has the final say as to whether a commit or an abort happens. As such, recovery can determine the fate of a transaction just by looking at the log of the coordinator.
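
Below is a much-simplified sketch of this coordinator/subordinate exchange (invented class and method names, no real messaging or force-writing of logs), just to show the two phases and why the coordinator's decision record determines the outcome:

```python
class Subordinate:
    """Toy stand-in for a remote site: votes on PREPARE and obeys the decision."""
    def __init__(self, will_commit=True):
        self.will_commit = will_commit
        self.log = []

    def prepare(self):
        vote = "YES" if self.will_commit else "NO"
        self.log.append("PREPARE/" + vote)   # a real site force-writes this before voting YES
        return vote

    def finish(self, decision):
        self.log.append(decision)

def two_phase_commit(subordinates):
    coordinator_log = []
    votes = [s.prepare() for s in subordinates]                      # phase 1: collect votes
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    coordinator_log.append(decision)                                 # coordinator decides and logs it
    for s in subordinates:                                           # phase 2: announce the decision
        s.finish(decision)
    coordinator_log.append("END")                                    # written after all acks
    return decision

# Example: one reluctant site forces a global abort.
print(two_phase_commit([Subordinate(), Subordinate(will_commit=False)]))  # ABORT
```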

This is a textbook chapter and not a published paper, though it does share some similarities with one. On the positive side, it’s well organized: each section has a clear purpose, and the division between sections is clean. In general, it’s very easy to read. It doesn’t really have technical contributions, but the descriptions are useful for getting a general understanding of distributed DBMSs.

On the negative side, since the chapter covers many aspects of distributed databases, the chapter as a whole isn’t as focused. Different sections don’t have that much to do with each other, so the flow of the chapter is worse off. In addition, because it’s generalized, there generally aren’t specific examples or data that show the benefits of the described techniques or how exactly they work.



Review 12

This chapter mainly discusses parallel and distributed databases.
Parallelism is motivated by the desire to improve performance through parallelization of various operations, such as loading data, building indexes, and evaluating queries. Distribution, on the other hand, is motivated by several distinct goals, including increased availability, distributed access to data, and analysis of distributed data.
The three main architectures of parallel database systems are shared-nothing, shared-memory, and shared-disk. The shared-memory architecture is closest to a conventional machine, and many commercial database systems have been ported to shared-memory platforms. Communication overhead is low because main memory can be used for this purpose, and operating system services can be leveraged to utilize the additional CPUs. But shared-memory and shared-disk share the same scalability problem: as more CPUs are added, existing CPUs are slowed down because of the increased contention for memory accesses and network bandwidth. The shared-nothing architecture does not have this scalability problem, since there is no resource contention, but it raises issues of data partitioning and overall system reliability.
A parallel evaluation plan consists of a dataflow network of relational, merge, and split operators. Good use of split and merge in a dataflow software architecture can greatly reduce the effort of implementing parallel query evaluation algorithms.
We have three main alternative architectures for a distributed DBMS: client-server systems, collaborating server systems, and middleware systems. A client-server system has one or more client processes and one or more server processes, and a client process can send a query to any one server process. Clients are responsible for user-interface issues, and servers manage data and execute transactions. This architecture has become very popular for several reasons: it is simple to implement, client machines are inexpensive, and clients can offer a friendly graphical user interface. In a collaborating server system, we have a collection of database servers, each capable of running transactions against local data, which cooperatively execute transactions spanning multiple servers. The idea of middleware is that we need just one database server capable of managing queries and transactions spanning multiple servers; the remaining servers need to handle only local queries and transactions.
Data is distributed via fragmentation or replication. Fragmentation can be horizontal or vertical. Replication is motivated by increased availability and faster query evaluation.
In general, a query involves several operations, and optimizing queries in a distributed database poses the following additional challenges: Communication costs must be considered. If we have several copies of a relation, we must also decide which copy to use. If individual sites are run under the control of different DBMSs, the autonomy of each site must be respected while doing global query planning. 

An alternative approach to replication, called asynchronous replication, has come to be widely used in commercial distributed DBMSs. Copies of a modified relation are updated only periodically in this approach, and a transaction that reads different copies of the same relation may see different values. Thus, asynchronous replication compromises distributed data independence, but it can be implemented more efficiently than synchronous replication.
The activity of a transaction at a given site is referred to as a subtransaction. When a transaction is submitted at some site, the transaction manager at that site breaks it up into a collection of one or more subtransactions that execute at different sites, submits them to the transaction managers at the other sites, and coordinates their activity; both concurrency control and recovery require additional attention because of data distribution.
In conclusion, the book is helpful in understanding the basics of parallel and distributed DBMSs, and the ‘review’ part is quite helpful in understanding the structure of the chapter. It is classic material, well suited to consolidating this knowledge after reading several state-of-the-art papers. It could have more examples or figures to illustrate some ideas.



Review 13

This chapter introduces parallel and distributed databases. The purpose of having a parallel database is to improve performance through parallelization of various operations. In a distributed database, data may be stored in a distributed fashion. The basic idea behind parallel databases is to carry out evaluation steps in parallel, so a parallel architecture is needed. The three basic structures are Shared Nothing, Shared Memory, and Shared Disk. The Shared Nothing architecture provides linear speed-up, so the time taken for operations decreases in proportion to the increase in CPUs and disks. For the Shared Memory architecture, communication overhead is low, but memory contention becomes a bottleneck as the number of CPUs increases. The problem with the Shared Disk architecture is that as more CPUs are added, existing CPUs are slowed down because of increased contention for memory access and network bandwidth.

Data is partitioned horizontally across several disks, for example by assigning tuples to processors in round-robin fashion. Round-robin partitioning is suitable for efficiently evaluating queries that access the entire relation. If only a subset of tuples is required, hash partitioning and range partitioning are more appropriate. To reduce skew in range partitioning, one approach is to take samples from each processor, collect and sort all samples, and divide the sorted set of samples into equally sized subsets.

Next, the chapter introduces how to parallelize individual operations. For bulk loading and scanning, pages can be read in parallel and then merged. For sorting, tuples are redistributed using range partitioning; each processor uses sequential sorting to sort the elements in its range, and the sorted ranges are then combined. The main challenge here is to partition the data so that each processor receives roughly the same number of tuples. For joins, the better approach is to execute the smaller joins one after another, each in parallel using all processors. Not only individual operations but whole queries can be run in parallel by exploiting the parallel architecture.

The next section introduces the different types of distributed databases: homogeneous and heterogeneous, the difference being whether all servers run the same DBMS software. Specifically, there are three architectures: Client-Server, Collaborating Server, and Middleware. The first has one or more client processes and one or more server processes, and a client can send a query to any single server, so a query cannot span multiple servers. The second lets the servers cooperatively execute queries spanning multiple servers. The last allows a single query to span multiple servers without requiring every server to manage multi-site execution. Also, replication is used to increase the availability of data and make query evaluation faster. To update distributed data synchronously, all copies of a modified relation must be updated before the modifying transaction commits. Synchronous replication comes at a significant cost: before an update transaction can commit, it must obtain exclusive locks on all copies. Asynchronous replication instead allows different copies of the same object to have different values for short periods of time. The chapter also discusses concurrency control issues in distributed systems and how to handle them.
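
A small sketch of that sampling trick (illustrative only; the sample size is an assumed parameter): pool a sample of key values from each processor, sort the pooled sample, and read the splitting vector off at evenly spaced positions so each range should hold roughly equal amounts of data:

```python
import random

def build_splitting_vector(per_processor_values, n_processors, sample_size=64):
    # Pool a sample of key values from every processor (assumes the pooled
    # sample is larger than n_processors).
    samples = []
    for values in per_processor_values:
        samples.extend(random.sample(values, min(sample_size, len(values))))
    samples.sort()
    step = len(samples) // n_processors
    # One boundary between each pair of adjacent target ranges.
    return [samples[i * step] for i in range(1, n_processors)]
```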

This chapter has a thorough explanation of distributed database systems and basically covers all of the topics, including architecture, operations, recovery, and concurrency control, that should be discussed in a chapter of a book. Since this is not a paper, I don't have any dislikes about this reading.


Review 14

Chapter 22 “Parallel and Distributed Databases” of “Database Management Systems” by Ramakrishnan and Gehrke provides an overview of parallel and distributed DBMSs, including different architectural choices, approaches for query evaluation and optimization, different storage options, and tradeoffs to consider. The chapter starts with discussing the difference between parallel DBMSs and distributed DBMSs, namely that the motivation for a parallel DBMS is to improve performance and the motivations for a distributed DBMS are more to increase availability and strategically distribute access to data. Parallel DBMSs may have data distributed, and distributed DBMSs may consider performance issues, but these are generally not the respective focuses.

Architectural options for parallel DBMSs include shared-nothing, shared-memory, and shared-disk; shared-nothing tends to be the most effective for large parallel database systems. A parallel DBMS could process a query by evaluating different operations in parallel and/or evaluating different parts of an individual operation in parallel. To take advantage of parallelizing an individual operation, input data needs to be partitioned; the chapter discusses round-robin, hash, and range partitioning options and their abilities to evenly distribute data in a meaningful manner. It also discusses data and operation partitioning and merging approaches for supporting parallel bulk loading and scanning, sorting, and join operations.

Next the chapter discusses distributed databases. Often desirable properties in a distributed database are 1) distributed data independence (that users can make queries without needing to worry about where the data is actually stored), and 2) distributed transaction atomicity (that users can compose transactions which are automatically atomic; that users do not need to specifically worry about atomicity). There are a few options for distributed DBMS architectures: homogeneous vs heterogeneous distributed DBMSs (whether all servers run the same DBMS software), and client-server, collaborating server, and middleware arrangements. Additionally, data can be stored in distributed DBMSs as fragmented (a single relation is split across multiple servers) and/or replicated (multiple copies of a single relation or fragment are stored on multiple servers). Relations may be fragmented strategically for local needs or to support expected parallel workloads. Replication is important for increased availability of data (in case a server goes down or there is a communication failure) and faster query evaluation (to use a local copy rather than communicate with another server). With data distributed, one must consider where data catalog information is stored, and when it is updated. There are also challenges in supporting distributed query processing, in particular considering the cost of communication across servers: which relations or parts of a relation to send and to which servers, and where to perform the operations; semijoins and bloomjoins are different techniques discussed to address these questions. Another concern is updating distributed data via synchronous or asynchronous replication, with the two competing factors of commit efficiency vs. data consistency to consider. Finally, the chapter discusses distributed transactions and approaches for distributed concurrency control and recovery.

I found the writing to be easy to digest, probably because this is a textbook written for university students. Examples with particular data made it easier to understand how the different data partitioning and join approaches work, and the communication costs associated with them.

Some important themes I noticed in the chapter for DBMS designers or someone choosing a DBMS were the importance of considering communication costs across servers, and transaction commit efficiency vs atomicity/consistency. These were nicely explained at separate points throughout the chapter, but I think it would have been nice to have a summary section at the end of the chapter of these and other important considerations.


Review 15

In Chapter 22 of Ramakrishnan and Gehrke’s book, the authors mainly talk about the issues of parallelism and data distribution in a DBMS. Although the centralized DBMS has been developed for many years, the great demands on DBMS performance (speed, reliability, availability, and scalability) led to the introduction of parallel and distributed DBMSs. In this chapter, they discuss the details of parallel and distributed DBMSs. This issue is quite important because parallel DBMS structures and distributed data storage are now almost everywhere, so it is necessary to learn the basics of these techniques. Next, I will summarize the key points I learned from this chapter.

A parallel DBMS seeks to improve performance through parallelization of various operations. In a distributed DBMS, data is physically stored across several sites, and each site is typically managed by a DBMS capable of running independently of the other sites. Apart from solving performance issues, the distribution of data is governed by factors such as local ownership and increased availability. Distributed data storage also enables distributed access to data and the analysis of distributed data. There are three kinds of architectures for a parallel DBMS: shared nothing, shared memory, and shared disk. One interesting point is that in the shared-memory and shared-disk scenarios, one cannot keep improving system performance by increasing the number of CPUs, because doing so increases contention for memory accesses and network bandwidth. A big advantage of the shared-nothing architecture is that it provides both linear speed-up and linear scale-up. For evaluation, they introduce an approach called data-partitioned parallel evaluation; the shared-nothing structure is very amenable to it. They introduce three kinds of partitioning methods: round-robin partitioning, hash partitioning, and range partitioning. In the section on parallelizing individual operations, they cover bulk loading and scanning, sorting, and joins. Apart from optimizing a single operation in a query plan, we can also parallelize different operations within a query and execute multiple queries in parallel. Within a query, there can be two kinds of inter-operation parallelism: first, the result of one operator can be pipelined into another; second, multiple independent operations can be executed concurrently. An important observation here is that a parallelizing optimizer should not restrict itself to left-deep trees and should also consider bushy trees, which significantly enlarges the space of plans to be considered.

For a distributed DBMS, one of the most important properties is transparency: users can keep using the service, unaware that some parts of the system have failed. Distributed data independence and distributed transaction atomicity are also preferred properties. Distributed DBMSs can be divided into two categories, homogeneous and heterogeneous; the key to implementing a heterogeneous DBMS is the use of gateway protocols. They introduce three kinds of distributed DBMS architectures: client-server, collaborating server, and middleware. To reduce overhead in a distributed DBMS, a single relation may be partitioned or fragmented across several sites; fragmentation ensures that data can be stored where it is most often accessed, or replicated where the relation is in high demand. One can perform horizontal and vertical fragmentation of relations. The two motivations for replicating data are 1) increased availability of data and 2) faster query evaluation. To keep track of how relations are partitioned and replicated, we need a distributed catalog. To reduce the number of Reserves tuples to be shipped in the join example, they introduce two techniques, semijoin and bloomjoin.

Updates in the distributed scenario can use synchronous or asynchronous replication. As the book says, synchronous replication can be undesirable and expensive, so asynchronous replication has become popular. Peer-to-peer replication is one such strategy, but it is subject to conflict-resolution requirements. Primary site replication is another asynchronous method; its crux is capturing changes at the primary copy and applying them to the secondary copies. Log-based capture and procedural capture are the two kinds of capture methods. Next, they discuss concurrency control issues in distributed databases. Lock management can be done in three ways: centralized, primary copy, and fully distributed. A problem that must be solved is distributed deadlock detection. In the end, they discuss recovery issues in a distributed setting; generally speaking, recovery in a distributed system is harder because more kinds of failures can occur. Commit protocols address this by guaranteeing that all processes either commit or abort together. The classical commit protocols 2PC and 3PC are discussed.
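
To make the deadlock-detection part concrete, here is a hedged sketch (my own illustration, not the book's algorithm) of centralized global detection: each site periodically ships its local waits-for edges to one site, which unions them and checks for a cycle; any cycle means a possibly cross-site deadlock, and some transaction in it must be aborted:

```python
def has_deadlock(local_waits_for_graphs):
    # Union the edge sets from every site: edges are (waiter, holder) pairs.
    global_edges = {}
    for site_edges in local_waits_for_graphs:
        for waiter, holder in site_edges:
            global_edges.setdefault(waiter, set()).add(holder)

    # Standard DFS cycle detection on the combined waits-for graph.
    visiting, done = set(), set()

    def dfs(txn):
        visiting.add(txn)
        for nxt in global_edges.get(txn, ()):
            if nxt in visiting:
                return True
            if nxt not in done and dfs(nxt):
                return True
        visiting.discard(txn)
        done.add(txn)
        return False

    return any(dfs(t) for t in list(global_edges) if t not in done)

# Example: T1 waits for T2 at site A, T2 waits for T1 at site B -> global deadlock.
print(has_deadlock([[("T1", "T2")], [("T2", "T1")]]))  # True
```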

I think the main advantage of this chapter is that the content is pretty comprehensive; it covers several important issues in distributed and parallel DBMS design. I learned a lot about how to apply traditional relational DBMS algorithms in this setting. For example, we can partition the data across different processors so that we can run a parallel hash join between two relations. Besides, when introducing techniques such as synchronous and asynchronous replication, they discuss the pros and cons of each, which gives the reader a better sense of these approaches.

However, I think there are some downsides. The content of this chapter is a little theoretical; the authors do not give rich examples of how parallel and distributed DBMSs are used in real-world production. I think they could give some advice to help people understand how to make tradeoffs in choosing different kinds of systems to solve their problems.