Review for Paper: 22-Chapter 22

Review 1

Parallel databases use parallel execution with the goal of improved performance. Distributed databases, in contrast, use servers at different sites to make data available in spite of local outages, and to allow data to be processed close to where it is requested. Parallel databases present issues such as which parallel architecture to choose and how to execute queries like joins and sorts efficiently in parallel. Distributed databases pose their own problems, including which client-server topology to choose, and how to perform distributed joins with minimal data shipping.

Chapter 22 on parallel and distributed databases covers many details unique to these non-centralized or non-serial database systems. Parallel databases can be architected as shared-nothing, shared-memory, or shared-disk. The chapter explains how shared-nothing scales up better than shared-memory or shared-disk, because those other designs produce heavy contention for memory access, while shared-nothing (where each blade server has its own memory and storage) has performance that improves linearly with the number of CPUs. Distributed databases can be designed primarily as client-server, collaborating server, or middleware systems. Client-server systems are limited by the fact that each server works with only one client at a time; collaborating server systems remedy this by letting a server break a client query into subqueries, which are delegated to other servers. Middleware systems take this to the extreme, by having one server receive all client requests, and delegate them to slave servers that execute subqueries only on local data.

The chapter contributes excellent descriptions of methods like semijoin and Bloom join, for executing joins in a distributed database with little data shipping. The worst way to perform a distributed join across two sites might be a nested-loop join, where one site “fetches as needed” the data from the remote, using many passes of data shipping. A slightly better way is to “ship to one site” one of the tables, so the join can be performed locally. The semijoin is an improvement on ship-to-one-site, which ships fewer rows or columns in typical cases. In a semijoin, a site A first ships the columns needed for a WHERE predicate to site B; site B selects the rows that match the received data and ships them back to A; finally, A completes the join. This reduces data shipping, relative to having B ship its entire table to A.
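
The three semijoin steps described above can be sketched in a few lines. This is only an illustration under simple assumptions: each site's table is a Python list of dicts, "shipping" is an ordinary function call, and the function and variable names (`semijoin`, `sailors`, `reserves`) are made up for the example rather than taken from the chapter.

```python
def semijoin(site_a_rows, site_b_rows, key):
    """Join rows held at site A with rows held at site B on `key`."""
    # Step 1: site A projects the join column and ships only those values to B.
    shipped_keys = {row[key] for row in site_a_rows}

    # Step 2: site B keeps only the rows that can match and ships them back.
    reduced_b = [row for row in site_b_rows if row[key] in shipped_keys]

    # Step 3: site A completes the join locally against the reduced rows.
    b_by_key = {}
    for row in reduced_b:
        b_by_key.setdefault(row[key], []).append(row)
    joined = [{**a, **b} for a in site_a_rows for b in b_by_key.get(a[key], [])]
    return joined, reduced_b


# Hypothetical data: only two of the three rows at site B need to be shipped.
sailors = [{"sid": 1, "sname": "Dustin"}, {"sid": 2, "sname": "Lubber"}]
reserves = [{"sid": 1, "bid": 101}, {"sid": 3, "bid": 102}, {"sid": 1, "bid": 103}]
joined, shipped_back = semijoin(sailors, reserves, "sid")
print(len(shipped_back), "of", len(reserves), "rows shipped back")
```

The saving comes from step 2: the row with sid 3 never crosses the network, at the price of first shipping the join column from A to B.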


Review 2

The chapter throws light on the issues of parallelism and data distribution in a DBMS. Centralized database management systems maintain all the data at one site and execute operations sequentially. For better performance, DBAs are employing parallel evaluation techniques and data distribution. These provide increased availability of data and faster reads through distributed access to data.

The chapter provides an overview of the three architectures for parallel databases (shared memory, shared disk and shared nothing), with shared nothing being the leader since it avoids the costs of resource contention among processes. It stresses how individual operations like sort and join are handled on parallel systems, along with a brief introduction to data partitioning (horizontal partitioning and vertical partitioning). Distributed architectures like client-server systems (many clients and servers in a network), collaborating server systems (a query is able to span multiple servers) and middleware systems (one database server manages all multi-site execution strategies) are explained, in addition to the storage techniques used by distributed databases. For recovery purposes, synchronous (all sites are updated immediately) and asynchronous (all sites are updated eventually) replication techniques are in use. Lastly, distributed transactions are handled using distributed concurrency control and distributed recovery, with the use of the two-phase commit (2PC) technique. In 2PC, a coordinator sends a prepare message to each subordinate, after which each subordinate votes yes or no on committing the transaction. After receiving all the “votes”, the coordinator sends the commit or abort message to all subordinates and waits for an acknowledgement from every subordinate before it writes an end log record for the transaction.
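
The 2PC message flow summarized above can be traced with a small simulation. This is a sketch under strong assumptions: there are no failures or timeouts, every message is delivered, each subordinate's vote is given up front as a boolean, and the function name and log format are invented for the example.

```python
def two_phase_commit(subordinate_votes):
    """Simulate one failure-free 2PC round.

    subordinate_votes maps a subordinate name to True (vote yes) or False (vote no).
    Returns the decision and the ordered list of (record, site) events.
    """
    log = []

    # Phase 1: the coordinator sends prepare and collects one vote per subordinate.
    votes = {}
    for name, vote in subordinate_votes.items():
        log.append(("prepare", name))
        votes[name] = vote

    # The transaction commits only if every subordinate voted yes.
    decision = "commit" if all(votes.values()) else "abort"
    log.append((decision, "coordinator"))

    # Phase 2: the decision is sent to every subordinate, which acknowledges it.
    acks = set()
    for name in subordinate_votes:
        log.append((decision, name))
        acks.add(name)

    # The end record is written only after all acknowledgements have arrived.
    if acks == set(subordinate_votes):
        log.append(("end", "coordinator"))
    return decision, log


decision, log = two_phase_commit({"site1": True, "site2": False})
print(decision)
```

A real implementation would also force-write log records at each site and handle timeouts, which this sketch leaves out.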

The chapter is successful in providing the readers with the basic concepts of parallel and distributed systems in simple language. These are supported with variants of the terms involved, which address different issues faced by database systems.

The reading had a large number of typos, which made it painful to read at some points. Apart from that, not many real-life uses and examples support the concepts provided in the chapter.



Review 3

This chapter investigates the issues of parallelism and data distribution in a DBMS. A parallel database system seeks to improve performance through parallelization of various operations, such as loading data, building indexes, and evaluating queries. In a distributed database system, data is physically stored across several sites, and each site is typically managed by a DBMS capable of running independently of the other sites. Three main architectures have been proposed for building parallel DBMSs: shared-memory, where multiple CPUs are attached to an interconnection network and can access a common region of main memory; shared-disk, where each CPU has a private memory and direct access to all disks through an interconnection network; and shared-nothing, where each CPU has local main memory and disk space. In a shared-nothing architecture, the key to evaluating an operator in parallel is to partition the input data so that the data shipped is minimized and most of the processing happens at individual processors. The authors propose three ways: round-robin partitioning, hash partitioning, and range partitioning. The chapter then discusses how to parallelize individual operations such as bulk loading and scanning, sorting, and joins. After these, it studies two properties that are desirable in a distributed database: distributed data independence and distributed transaction atomicity. For data independence, several aspects are considered: architectures (client-server, collaborating server, middleware systems), storage (fragmentation, replication), and distributed query processing. For transaction atomicity, it studies updating data (synchronous, asynchronous), concurrency control, distributed deadlock, and distributed recovery.
I like the way it presents these topics and how it runs through different aspects of an ordinary DBMS, from architecture to transactions. A drawback is that the span is rather too wide, and it sometimes made me lose track.
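
The three partitioning schemes named in this review (round-robin, hash, and range) can be compared side by side with a minimal sketch. The function names, the `key` extractor, and the `boundaries` list are assumptions made for illustration; rows are plain Python values.

```python
import bisect


def round_robin_partition(rows, n):
    """Send row i to partition i mod n; spreads rows evenly regardless of content."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Send each row to the partition chosen by hashing its partitioning key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts


def range_partition(rows, boundaries, key):
    """Send each row to the range its key falls in; boundaries must be sorted."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_right(boundaries, key(row))].append(row)
    return parts


ages = [23, 41, 35, 67, 18, 52]
print(round_robin_partition(ages, 3))
print(range_partition(ages, [30, 50], key=lambda a: a))
```

Round-robin balances load for full scans, while hash and range partitioning let a selection on the partitioning key touch only the relevant partitions; range partitioning is also the one most exposed to skew when the boundaries are chosen badly.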






Review 4

This paper (more precisely, a chapter from the book Database Management Systems by Ramakrishnan and Gehrke) provides an overview of parallel and distributed databases. These systems arise from the increased use of parallel evaluation techniques and data distribution. In general, a parallel DBS improves performance by parallelizing operations such as loading data and evaluating queries, while a distributed DBS physically distributes data across multiple sites, where each site can be managed by an independent DBMS. This paper first presents the general architectures for parallel databases. Then it introduces data partitioning and how it can influence parallel query evaluation and the parallelization of relational operations. Next, it presents a general overview of distributed databases, including distributed catalog management, query optimization and evaluation, updating, and transaction management.

The general problem here is that there are some situations where the traditional centralized database management system needs to perform better. In order to improve performance, there is a strong need to make use of parallelism. For example, if a single query can be executed in parallel at different sites, the total throughput will be increased. There are several motivations for distributed DBS, including increased availability, distributed access to data, and analysis of distributed data.

The major contribution of the paper is that it provides a detailed summary of parallel DBS and distributed DBS. It provides examples and graphs to illustrate new concepts and architectures. The key components are summarized below:

1. Parallel DBS
a. Physical architectures (shared nothing, shared memory, shared disk)
b. Data Partitioning (range partitioning vs hash partitioning, data skew)
c. Different sorting ideas

2. Distributed databases
a. distributed data independence, distributed transaction atomicity
b. gateway protocols
c. architectures (client-server, collaborating server, middleware)
d. fragmentation and replication (catalog management)
e. join (semijoin, bloomjoin)
f. updating (synchronous/asynchronous replication)
g. distributed concurrency control (distributed deadlock, distributed recovery, 2PC)

One interesting observation: I like the way the authors introduce new concepts and terms. The writing is clear and accurate, with detailed examples, and I think it is a good example of a textbook. I found the algorithms for handling distributed deadlock very interesting, especially the hierarchical algorithm, which groups sites into a hierarchy. The idea is very simple and comes from real life, but it works well in such a complicated system. We could make much progress if we applied more ideas from real life to system design.


Review 5

In this chapter, the authors introduce distributed databases and the issues related to parallelism and data distribution in database systems. After introducing distributed databases, they discuss alternative configurations for a distributed DBMS. The three physical architectures mentioned are 'shared-nothing', 'shared-memory' and 'shared-disk'. After that, they introduce the concept of data partitioning and its effects on parallel query processing. They show how several relational operations can be run on partitioned data and talk about parallel query optimization. In the rest of the chapter, they discuss more advanced issues in the area of distributed databases, for example catalog management, updating distributed data and transaction management.

I think this chapter is a really good starting point for anyone who wants to know more about distributed databases. It describes complex concepts in very clear and easy language.


Review 6

This chapter provides an overview of scaling databases out to multiple processors, disks, and machines. The authors first cover parallel databases and then move on to distributed databases, but many of the same ideas apply. They begin by reviewing the different architecture choices that can be made; as we have seen from other papers, shared-nothing systems are the winner for systems that scale. The authors then discuss how multiple machines can be used to speed up query evaluation. The output of one relational operator can be piped into another operator on another machine. This is a very natural way to improve performance; however, relational operator trees are rarely very deep, so the speedup possible here is limited. In addition, data can be partitioned so that queries doing a full table scan can get their results in less time than it would take one disk to do a full scan of the data. They then discuss how individual operators, such as joins and scans, can be sped up by using multiple machines. They discuss many of the complications of creating a database for a distributed system, as well as many optimizations that can be used, such as Bloom joins and semijoins.

This chapter does a good job of providing an overview of distributed and parallel databases. The authors present each problem clearly, and then discuss a few possible solutions, as well as the trade-offs between them. There are a lot of different options for how to store, partition, update, and distribute data in a large system, and this chapter provides a solid background in the mainstream options.

This chapter really does try to cover a lot. Distributed databases are an exploding field, and we've seen many papers recently that present novel ideas to deal with issues of scale. This chapter seems to take a "port traditional DBMS code to DS" approach, rather than a "create DBMS for a DS" approach. The discussion of distributed transactions, concurrency control and deadlock detection seems driven by a desire to make existing code work on a distributed system. I think the Amazon Dynamo paper showed that there are some dramatically different design decisions that you can make to increase the performance of a system at such scale. However, this is a textbook, and not an academic paper, so I can understand that it's not proposing any profound new ideas.



Review 7

Problem and solution:
The paper is a chapter in a textbook. It talks about a new trend in database development: parallelism and data distribution. Parallelism is used to improve performance, and distribution is also used to increase availability.

For parallelism, there are three main architectures: shared nothing, shared memory and shared disk. The shared-memory architecture has low communication overhead, but both the shared-memory and shared-disk architectures hit a bottleneck when more CPUs are added. The shared-nothing architecture is more scalable, with linear speed-up. The aim of data partitioning is to minimize data shipping during processing. Round-robin partitioning suits queries that access the entire relation, while hash partitioning and range partitioning suit queries that access subsets.

For distribution, there are two types of distributed databases, homogeneous and heterogeneous, and three architectures: client-server, collaborating server and middleware. A client-server system has multiple client processes and multiple server processes. It is easy to implement and has good performance. A collaborating server system has a collection of servers that execute transactions collaboratively. A middleware system lets a query be executed across multiple servers without requiring every server to manage multi-site execution. A distributed database needs to handle communication and transactions over replicas, so locks are needed for concurrency control. Fault tolerance and recovery also need to be considered.

Contribution:
The contribution of the paper is conveying the ideas behind parallel and distributed databases. It helps us think outside the box of the regular database to improve performance. Unlike the centralized serial database, the parallel database follows quite a distinct pattern. The core is to find the best partitioning of the data to balance the workload and minimize communication between processors. The paper shows the parallelization of single operations, like sort and join. It also covers the distributed database, which can not only improve performance but also increase availability. It creates replicas of partitioned data and manages them, which makes it different from the parallel database.

Weakness:
Though the paper is good, some points remain confusing. One is that I could not clearly distinguish parallel databases from distributed databases, or see when a distributed database is necessary. Take the queries in 22.10.1 as an example: it makes no big difference whether they run in a parallel database or a distributed database. Another minor shortcoming is that the number of typos in the paper is really too high, which made the paper very difficult for me to read.


Review 8

This chapter of the textbook gives us a high-level view of parallel and distributed databases: how the architecture differs from a regular DBMS, what operations need to behave differently, and various aspects of the DBMS that have to be changed or adapted for better performance and correct data management.

A parallel database seeks to improve performance and scalability by parallelizing various operations and data management across different machines. Other than performance improvement, distribution of data is motivated by increased availability, distributed access to data, and analysis of distributed data. The DBMS tries to parallelize various operations in the following manner:
* Sorting - distribute the data through range partitioning, sort the partitions individually, then merge.
* Joins - there are a couple of ways to do this. One could perform local joins based on the existing partitioning, or use a hash function to partition the data for the join.
In distributed DBMS, data partitioning can happen in two ways:
* horizontal fragmentation: subset of rows of the original data
* vertical fragmentation: subset of columns of original data.
In addition to partitioning, data is often replicated to increase availability and to speed up queries by using a local copy. However, managing replicated data can be costly, as it can result in large coherency overhead depending on the workload or management scheme. To keep track of the location of distributed data, a centralized catalog structure is used to record which data is replicated where.
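
The two fragmentation styles listed above can be made concrete with a small sketch. It assumes rows are Python dicts and that vertical fragments carry a key column so the original relation can be rebuilt by a join; the function names and the sample `employees` data are hypothetical.

```python
def horizontal_fragments(rows, predicate):
    """Split a relation into two row subsets; their union is the original relation."""
    matching = [r for r in rows if predicate(r)]
    rest = [r for r in rows if not predicate(r)]
    return matching, rest


def vertical_fragments(rows, key, left_cols, right_cols):
    """Split a relation into two column subsets, each keeping the key column."""
    left = [{key: r[key], **{c: r[c] for c in left_cols}} for r in rows]
    right = [{key: r[key], **{c: r[c] for c in right_cols}} for r in rows]
    return left, right


def rejoin(left, right, key):
    """Rebuild the original rows by joining two vertical fragments on the key."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key[l[key]]} for l in left]


employees = [
    {"eid": 1, "name": "Ann", "city": "Madison"},
    {"eid": 2, "name": "Bob", "city": "Chicago"},
]
madison, elsewhere = horizontal_fragments(employees, lambda r: r["city"] == "Madison")
names, cities = vertical_fragments(employees, "eid", ["name"], ["city"])
print(rejoin(names, cities, "eid") == employees)
```

Keeping the key in both vertical fragments is what makes the decomposition lossless in this sketch; without it the join back could not pair the columns correctly.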

This paper provides a good overview of the different aspects of parallel and distributed databases, and serves as a good intro to the topic. However, the material presented is often too high level and doesn’t quite delve into the tradeoffs; where an aspect has multiple options, it doesn’t give us any insight into which gets used more or why one is more advantageous than another.



Review 9

This paper is an overview of parallel databases and distributed databases. A summary and some thoughts follow.
Parallel databases:
1. The main goal of parallel databases is to boost performance; ideally, with n CPUs, a single transaction can be done n times faster than on a single CPU. But this linear speedup is usually not achievable.
2. There are three main architectures for parallel databases: shared memory, shared disk and shared nothing. Shared memory is the easiest to port to, but as the number of CPUs increases, memory contention becomes a bottleneck. Shared disk has a similar problem with disk contention. Hence, the shared-nothing architecture, which is highly scalable, is widely used.
3. There are two ways to parallelize: pipelining and data partitioning. To exploit the benefits of a parallel DB, operators should also be parallelized. In addition, data partitioning is significant to performance, since a skewed partition becomes the bottleneck.
4. The query optimizer is different: pipelining and concurrent operator execution should be considered. Compared to the System R approach, the search space is expanded; e.g., bushy join trees may be taken into consideration. In addition, the multi-user environment becomes an issue for parallelism.
Distributed databases:
1. Data is distributed for several reasons: high availability, distributed access, the demand for analysis of distributed data, and so on. For distributed databases, the commonly desired goals are distributed data independence and distributed transaction atomicity. However, in some situations these properties may not even be desirable: when the network is slow and the systems are heterogeneous, it would be too expensive to maintain them, and the resulting performance might not be satisfactory.
2. In a distributed DBMS, data is partitioned horizontally, vertically, or sometimes both among different sites. The fragments are often replicated for availability or faster queries. To access the data, the name of each fragment contains a local name, a birth site name and a replica id. The catalog is stored in a distributed fashion. One good approach is for the catalog entries of the replicas to be maintained by their respective birth sites, even if the data is moved away from the birth site. When a user wants to access a relation, he only needs to specify its local name. The distributed database finds the global name from the user's site id and the local name. To allow a user to access other users' data, synonyms are implemented.
3. Basic query operations in a distributed DB are different: a) only replicas that fit the predicates need to be considered; b) when choosing a query site among replicas of the same data, the network topology cost should be considered; c) the cost of shipping data should also be considered. In joins, semijoins and Bloomjoins can be used to minimize the total cost. Compared to a semijoin, a Bloomjoin has a lower processing cost, but its effectiveness is sensitive to the hashing method.
4. The query optimizer should take into consideration the communication cost and the autonomy of individual sites. To achieve the second requirement, the centralized query optimizer generates an overall plan that contains local plans for each site involved. Each site takes its local plan as a suggestion: it can ignore it and devise its own local query plan based on more detailed local information.
5. To update distributed data, there are two mechanisms: synchronous and asynchronous replication. For synchronous replication, the first technique is voting: write a majority of copies and read enough copies. Reads are relatively expensive in this approach. The second technique is read-any write-all. Reads are inexpensive, but writes take a long time and are blocking. The asynchronous approach trades data consistency for performance. There are two common ways: primary site and peer-to-peer replication.
6. To make distributed transactions atomic, lock management and recovery are both necessary. The lock manager can be distributed among all the sites. Each site is responsible for locking its own replicas, and some are assigned the responsibility of detecting distributed deadlocks. For deadlock detection, there are two approaches: centralized and hierarchical. The latter works well if deadlocks occur among closely related sites. However, both suffer from phantom deadlocks.
7. Recovery needs to deal with network failures and multi-site transactions. The two-phase commit (2PC) protocol is introduced: to perform a commit, the coordinator and subordinates must agree with one another. After a crash, a subtransaction may be aborted or may retry the commit based on the log records. Notably, recovery is blocked by a coordinator failure. Three-phase commit is free from this problem, but it introduces a huge overhead in normal execution, hence it is impractical.
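
The voting scheme in item 5 ("write majority and read enough") can be sketched as follows. This is a toy model under stated assumptions: each replica is a `(version, value)` tuple in a list, the caller chooses which sites form the quorum, and all names are invented for the example. "Read enough" is taken to mean that the read set must overlap every possible write set.

```python
def min_read_quorum(n_copies, write_quorum):
    """Smallest read set guaranteed to overlap any write set of the given size."""
    if not (n_copies // 2 < write_quorum <= n_copies):
        raise ValueError("write quorum must be a majority of the copies")
    return n_copies - write_quorum + 1


def quorum_write(replicas, sites, value):
    """Write value to the chosen sites with a version newer than any they hold."""
    # Any two majorities overlap, so the chosen sites include the latest version.
    new_version = max(replicas[s][0] for s in sites) + 1
    for s in sites:
        replicas[s] = (new_version, value)


def quorum_read(replicas, sites):
    """Read the chosen sites and return the value with the highest version."""
    return max((replicas[s] for s in sites), key=lambda copy: copy[0])[1]


replicas = [(0, "old")] * 10
quorum_write(replicas, range(7), "new")          # write 7 of 10 copies
print(min_read_quorum(10, 7))                    # read set size needed
print(quorum_read(replicas, [6, 7, 8, 9]))       # overlaps the write set at site 6
```

Read-any write-all is the extreme case of the same rule: with all copies written, `min_read_quorum` drops to one copy, which is why reads become cheap and writes expensive.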

Discussions:
1. This paper shows the big picture and basic concepts of parallel databases and distributed databases. Both are really big topics, as you can see in the above summary, but the authors organize them fairly well. For parallel databases, it introduces how they boost performance in several respects: architecture, query parallelization and optimizer concerns. For distributed systems, concepts concerning data partitioning, replication, metadata management, query evaluation, and atomic transactions are introduced to give a good picture of how a distributed system acts like a single one. I read an earlier version of this chapter from 1994, which neither included asynchronous replication approaches nor recognized the varied requirements of distributed databases. Considering all this, this paper (the new version) is a big advance.
2. In parallel databases, the discussion of speedup is not that accurate. First, as shown in the first graph of Figure 22.2, linear speedup is presented as the ideal in many cases. However, super-linear speedup is sometimes achievable. When a parallel DB scales out in a shared-nothing architecture, the size of main memory also scales. It then becomes possible to fit the relations involved in a query in memory! This may improve performance by orders of magnitude. Second, only sub-linear speedup is achievable in most cases. The reasons should be discussed in detail. In my personal opinion, one is that communication overhead slows down the parallel query. The other is that Amdahl’s law limits the speedup, i.e., the sequential part of the parallel operators limits the overall performance.
3. In distributed databases, the treatment of data independence is not clear. In this chapter, it seems that a user can only create and access data at a single site (the default user site). What if a user can log in at different sites and create relations locally? Under data independence, a user should be able to use a local name to access data without worrying about where the relation is stored. But this section does not cover the situation where a user wants to query relations created at different sites.
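
The Amdahl's law point in discussion item 2 can be checked numerically. The sketch below uses the standard form of the law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of processors; the function name and the sample fractions are chosen for illustration only.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Upper bound on speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# With 10% sequential work, adding processors gives quickly diminishing returns.
for n in (2, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With a 10% sequential part the speedup can never reach 10, no matter how many processors are added, which is the sub-linear behavior the item describes. The formula does not model the super-linear case, since that comes from the extra aggregate memory rather than from extra CPUs.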



Review 10

Summary:

This chapter covers the topics of parallel and distributed databases. Parallel databases use parallelization methods (multiple CPUs/other resources) to improve performance. Distributed databases store their data across multiple machines, for the sake of availability and locality in addition to performance.

Parallel databases can use any of the three main architectures (shared memory, shared disk, shared nothing), but shared nothing provides the best scalability. Parallelization can happen between two operators, or within a single operator. In a bushy query plan, multiple operations that serve as inputs to a later operation are independent and can be performed in parallel. Within an operation, the input can be partitioned and each partition processed by a different processor. In a shared-nothing system, this partitioning may differ from the way data is partitioned across blades. In this situation, each blade must partition its own data and send each of its partitions to the processor that is responsible for it. Query optimizers for parallel databases must make accurate estimates of the effect of parallelization in candidate plans.

In a distributed database, data can be fragmented horizontally/vertically, or replicated as needed. The goal of fragmentation and replication is to adapt to the expected workload, such that many queries will only need to access one or a few nodes. This is especially important because communication between nodes is expensive. Query optimization must take into account the cost of inter-node communication, especially in cases like joins where potentially many tuples may need to be sent back and forth. Joins (let’s say between A and B) can be optimized using semijoins or Bloomjoins, which send an abridged form of A to do a preliminary join with B (which may eliminate much of the joined-to table), and then send this result back to join with A. Consistency is also a big consideration in distributed databases, and usually an asynchronous replication scheme (like eventual consistency) is used because synchronous replication (strong consistency) is too expensive. Deadlock is managed by reporting dependencies to a centralized location. Recovery and atomicity are obtained through voting and acknowledgement schemes among replicas.
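
The "abridged form of A" used by a Bloomjoin can be sketched as a bit vector. The code below is a minimal illustration, assuming integer join keys, a single hash function (`hash(value) % n_bits`), and tables held as lists of dicts; the names `bloom_join`, `site_a` and `site_b` are hypothetical.

```python
def bloom_join(a_rows, b_rows, key, n_bits=8):
    """Join A and B on `key`, shipping a bit vector instead of A's join column."""
    # Site A hashes each join value into a small bit vector and ships the vector.
    bits = [False] * n_bits
    for row in a_rows:
        bits[hash(row[key]) % n_bits] = True

    # Site B ships back only rows whose hash bit is set (false positives allowed).
    reduced_b = [row for row in b_rows if bits[hash(row[key]) % n_bits]]

    # Site A completes the join, which discards any false positives.
    a_by_key = {}
    for row in a_rows:
        a_by_key.setdefault(row[key], []).append(row)
    joined = [{**a, **b} for b in reduced_b for a in a_by_key.get(b[key], [])]
    return joined, reduced_b


site_a = [{"sid": 1, "sname": "Dustin"}, {"sid": 2, "sname": "Lubber"}]
site_b = [{"sid": 1, "bid": 101}, {"sid": 9, "bid": 102}, {"sid": 4, "bid": 103}]
joined, shipped_back = bloom_join(site_a, site_b, "sid")
print(len(shipped_back), len(joined))
```

In this example sid 9 collides with sid 1 in an 8-bit vector, so it is shipped back unnecessarily and only dropped in the final join; a larger vector or a better hash reduces such false positives, which is the sensitivity to hashing that separates a Bloomjoin from a semijoin.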

Strengths:
This chapter covers a wide array of issues with parallel and distributed databases and gives many options for solutions. It is fairly easy to read and gives helpful examples.

Weaknesses:
I don’t know why parallel and distributed DBMS wasn’t split into two chapters. Also, since this is a textbook there was not as much logical flow between sections. Rather, the chapter focused on giving a summary of many different topics.



Review 11

The article is a chapter on parallel and distributed databases from the textbook “Database Management Systems” by Ramakrishnan and Gehrke. This chapter, rather long compared to the previous chapters we have read on transaction management and concurrency control, provides an overview of parallel and distributed databases with broad coverage of sub-topics.

In the era of big data that we live in, parallelism and scalability have become essential for many database systems to support. It has long been established that a single-node database system cannot keep up with the parallelism and scalability required to handle large-scale data in practice. This necessitated the development of parallel and distributed database systems. This article provides a broad overview of such databases and explains various techniques for implementing them.

The article is very rich in terms of the range of sub-topics it covers in the area of parallel and distributed databases. It starts with basic architectures of distributed databases and moves on to touch on many sub-topics, including parallelizing individual operations, parallel query optimization, data partitioning, data fragmentation/replication, distributed catalog management, distributed concurrency control and so on. One may say that the article is not perfect in its coverage of topics, but I think that the authors have put in enough effort to discuss most of the relevant topics in order to provide a good overview of parallel and distributed databases.

Even though the article has done a good job of discussing most relevant topics, it falls short in providing practical examples for each architecture or technique mentioned. Since it is basically a textbook, it focuses on explaining each subject drily without giving much reference to real-world examples. The article does mention a few times that some techniques are not used in practice due to their obvious flaws or disadvantages, but on the other hand, it does not address how other useful techniques are actually implemented, which could have engaged readers more in the topics and helped them appreciate the techniques discussed.

In conclusion, the article provides a comprehensive overview of parallel and distributed databases. In my opinion, the article could have been better with more practical examples or case studies, but the authors have done a good job of covering many relevant topics in the area. It is a great introductory text for readers who are starting to look at topics in parallel and distributed databases.


Review 12

This was a textbook chapter covering the many aspects of parallel database design and management. The chapter assumes a shared-nothing architecture because of its linear speed-up and scale-up properties in a parallel or distributed environment. Parallel databases are used to improve performance, especially when the data is large. Many operations performed by a centralized database manager can be parallelized, which reduces the overhead imposed by a centralized system. Large companies can also distribute data more efficiently by spreading it across multiple sites. In doing so, they can increase data availability and exploit access pattern locality.

The chapter starts with a discussion of parallel databases and how certain operations, such as bulk loading and relation scans, can be parallelized. It then spends considerable time on distributed databases, since they are a form of parallel database. The performance of these databases is affected by how the data is partitioned and which fields it is partitioned on. Data partitioning can take two forms, horizontal or vertical. Furthermore, distributed databases may also employ replication for high availability and fast query evaluation.

Although replication seems like it would always be a performance gain, the chapter points out that distributed database design needs to take network communication cost into consideration more than ever. Communication between sites should be kept low so as not to overload the network. Replication hurts this because sites must make sure they have up-to-date copies of each object to maintain consistency. Concurrency control and recovery are also issues. Since objects are distributed, the locking protocols must be distributed as well, but this makes reasoning about common locking issues such as deadlock more difficult. The chapter ends with a discussion of two-phase commit and three-phase commit, a modification of two-phase commit that tolerates coordinator failure.

A major part missing from this chapter was a more in-depth discussion of concurrency control. The chapter only mentioned locking and how it could be implemented in a distributed database, but in practice many systems avoid locking protocols because they do not scale well under high load. This problem is exacerbated in a distributed environment, where transactions need to lock objects not only locally but also globally. The increased communication overhead of maintaining global wait-for graphs makes locking much less attractive. However, the chapter does not discuss any other protocol, such as timestamp-based concurrency control, which is presumably better suited to a distributed environment where a site can fail and leave its transactions holding locks until the lock manager times them out.
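
To make the wait-for graph overhead concrete: each site keeps a local graph, and some site has to collect and merge them to find cycles that no single site can see. The following is a minimal Python sketch under stated assumptions, not the chapter's algorithm. Each local graph is assumed to be a dict mapping a transaction id to the set of transactions it waits for, and the names union_graphs and has_cycle are hypothetical.

```python
def union_graphs(local_graphs):
    """Merge per-site wait-for graphs (txn -> set of txns it waits for)."""
    merged = {}
    for graph in local_graphs:
        for txn, waits_for in graph.items():
            merged.setdefault(txn, set()).update(waits_for)
    return merged


def has_cycle(graph):
    """Depth-first search; a back edge to an in-progress node means deadlock."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


# Hypothetical example: T1 waits for T2 at site A, T2 waits for T1 at site B.
site_a = {"T1": {"T2"}}
site_b = {"T2": {"T1"}}
local_deadlocks = [has_cycle(site_a), has_cycle(site_b)]
global_deadlock = has_cycle(union_graphs([site_a, site_b]))
```

Neither site sees a cycle on its own; only the merged graph reveals the deadlock, which is why the local graphs have to be shipped somewhere periodically.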


Review 13

This chapter gives an introduction to parallel and distributed databases, in contrast to centralized databases. Parallel databases parallelize operations to improve performance; distributed databases consider more factors such as availability and data distribution.

Parallel Databases:

The shared-memory and shared-disk architectures both suffer from memory contention and interference on the interconnection network as the number of CPUs grows. Thus the shared-nothing architecture, which provides linear speed-up and linear scale-up, is considered the best architecture for parallel DBMSs.

In short, there are three kinds of parallel execution of queries:
1. execute multiple queries in parallel
2. execute multiple operations in a single query in parallel
3. execute a single operation in parallel
The chapter states that parallelism usually focuses on the second and third kinds; that is, the DBMS does not consider other queries when optimizing a query. In the second kind, independent operations within a query are executed concurrently; even if an operator takes the output of another operator as its input, pipelined parallelism is still possible. In the third kind, a single operation is executed in parallel by partitioning the input data, evaluating each partition, and combining the results. There are several ways to partition the data:
1. round-robin partitioning: good when the operator accesses the entire relation
2. hash partitioning: good when only a subset of data is being accessed; data can be distributed evenly
3. range partitioning: good when a range selection is specified; needs extra effort to avoid data skew
To achieve parallel execution, split and merge operators are needed in addition to the relational operators; together they form the parallel execution plan of a query. Query plan optimization also raises some issues: cost estimation differs from the serial case, and bushy tree plans may have a higher total cost but a shorter execution time.
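
The three partitioning schemes listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rows are plain dicts, the function names are hypothetical, and MD5 stands in for whatever hash function a real system would use.

```python
import hashlib
from bisect import bisect_right


def round_robin_partition(rows, n):
    """Row i goes to partition i mod n, so every partition gets an even share."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Rows with equal key values always land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        digest = hashlib.md5(str(row[key]).encode()).digest()
        parts[int.from_bytes(digest[:4], "big") % n].append(row)
    return parts


def range_partition(rows, boundaries, key):
    """Sorted boundaries split the key domain into len(boundaries) + 1 ranges."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row[key])].append(row)
    return parts


# Hypothetical data: six rows with an id and an age.
rows = [{"id": i, "age": a} for i, a in enumerate([18, 25, 31, 47, 52, 64])]
rr = round_robin_partition(rows, 3)
hp = hash_partition(rows, 3, "id")
rp = range_partition(rows, [30, 50], "age")  # ages < 30, 30 to 49, 50 and up
```

A range selection such as age between 30 and 49 touches only one partition of rp, whereas under round-robin it has to scan all three. The data skew mentioned above appears when the boundaries are chosen badly and most rows fall into one range.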

Distributed Databases:

There are some desirable properties of a distributed database. One is that the system should provide an abstraction so that users do not need to specify where the data is located when writing queries. The other is that transactions should run atomically, just as they do over purely local storage. If all the servers holding the distributed data run the same DBMS, the system is called a homogeneous distributed database system; otherwise, it is a heterogeneous distributed database system. There are three architectures:
1. Client-Server Systems: multiple client processes and multiple server processes; easy to implement; simple client-server interaction
2. Collaborating Server Systems: consists of a collection of servers that execute against local data, generate appropriate sub-queries for other servers, and then combine the results
3. Middleware Systems: there is only one server that is capable of managing multi-site execution strategies, and the other servers simply execute against local data
The chapter also discusses how the data is distributed over multiple sites, for example through horizontal and vertical fragmentation and through synchronous and asynchronous replication. Managing the distributed data also requires some catalog management approaches.

There are a lot of issues for distributed systems. For query processing and optimization, for example, the cost of shipping data must be considered, which motivates additional techniques such as Semijoin and Bloomjoin. The optimizer also has to consider additional information, such as the communication cost based on the location of sites. The replication and updating of data is another issue. As for transaction processing in distributed databases, concurrency control and recovery techniques are discussed.



The major contribution of this chapter is that it gives an overview of the basic ideas, architectures, and standard techniques of parallel and distributed databases. It also goes through the differences between them and centralized databases by introducing how data partitioning and replication, operator execution, and query optimization work. For today's data warehouses, it is necessary to store large amounts of data in a distributed fashion. This chapter discusses several different approaches and system designs for managing the data and executing queries and transactions efficiently. It is very well written and provides some figures that are quite helpful. However, the chapter mainly focuses on the general concepts of parallel and distributed systems; it would be better if it included real-world examples and analysis of how these system designs and techniques are adopted in specific parallel and distributed systems in industry.



Review 14

This chapter discusses parallelism and distributed databases. Distributed databases are important because of increased availability and distributed access to data. If one site holding a relation goes down, the relation can still exist on another site, resulting in increased availability. The locality of access patterns is considered during distributed access to data. This chapter approaches the discussion by first covering parallel database architectures and distributed database architectures, and then addressing transactions and concurrency control in distributed databases.

DBMSs are a prominent example of parallel systems. There are three main architectures for parallel DBMSs: shared-memory, shared-disk, and shared-nothing. In shared-memory, each CPU accesses a common region of main memory through an interconnection network. Because communication can be done through main memory, communication overhead is low. However, as more CPUs are added, memory contention becomes a bottleneck. In the shared-disk system, each CPU has its own private memory and can access all disks through an interconnection network. Similar to shared-memory, contention is a problem because the interconnection network ships large amounts of data. In the shared-nothing architecture, each CPU has its own main memory and disk space, and cannot access another CPU's storage. Unlike the previous two systems, the shared-nothing architecture does not struggle with interference, where CPUs slow down because of memory access and network bandwidth contention. In addition, this architecture provides linear speed-up, so that the time computations take decreases in proportion to the number of CPUs, and linear scale-up, so that performance is sustained if the number of CPUs grows in proportion to the amount of data. Some operations that can be parallelized are bulk load, where the sorting of data entries for index construction is parallelized, and scanning, where the pages of a relation can be read in parallel.

There are two types of distributed database systems: heterogeneous, in which different sites run different DBMSs, and homogeneous, in which all servers run the same type of DBMS. Alternative ways of distributing functionality are Client-Server (in which a client process, which manages user-interface issues, can send a query to any one server process, which executes transactions and processes data), Collaborating Server (which eliminates the need to distinguish between clients and servers and allows servers to run transactions against local data across multiple servers), and Middleware (which is a layer that can make the execution of queries and transactions compatible across multiple database servers). In distributed transactions, lock management can be distributed across sites in different ways. One way is centralized, so that a single site handles lock and unlock requests for all objects. This is a vulnerable scheme, because we are in trouble if the site that controls the locking goes down. Another way is primary copy, so that one copy of each object is designated the primary copy, and the lock manager at that site handles lock and unlock requests for that object. This solves the vulnerability issue of the previous scheme, but reading an object requires communication between the site of the primary copy and the site of the copy being read. Lastly, fully distributed systems have the lock manager at the site of each copy of an object handle lock and unlock requests for that copy. This avoids the extra communication when reading, although a write must lock every copy it modifies. Recovery is achieved through the commit protocol, in which either all parts of a transaction commit or none of them do.
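
The all-or-nothing behavior of the commit protocol can be illustrated with a toy two-phase commit. This is a rough Python sketch under stated assumptions, not the chapter's full protocol: it leaves out logging, timeouts, and site failures, and the Participant class and two_phase_commit function are hypothetical names.

```python
class Participant:
    """One site running a subtransaction of the distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote yes only if the local subtransaction is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes yes."""
    votes = [p.prepare() for p in participants]      # phase 1: collect votes
    decision = "commit" if all(votes) else "abort"
    for p in participants:                           # phase 2: broadcast decision
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision


# Hypothetical runs: one where all sites agree, one where a single site refuses.
happy = [Participant("A"), Participant("B")]
happy_decision = two_phase_commit(happy)
unhappy = [Participant("A"), Participant("B", can_commit=False)]
unhappy_decision = two_phase_commit(unhappy)
```

In the second run a single no vote forces every site to abort, which is the atomicity guarantee the paragraph above describes.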

Overall, I thought the chapter was very comprehensive in its discussion of distributed database systems. It was a nice review of concepts we’ve learned previously this semester, such as parallel architectures and distributed transactions, and it also introduced new concepts, such as the types of distributed systems and how to make communication between different types of DBMSs compatible.


Review 15

This book chapter is about parallel and distributed database systems, how they are implemented, and what advantages they give over centralized database systems. These systems can provide increased availability, and decreased latency.

There are different ways to partition data across disks to allow increased I/O speed through parallel operations. If parallel operations can be successfully implemented, substantial gains in performance can be achieved. The authors describe bulk scanning, sorting, joining, and loading, which can be done in parallel relatively easily. They state that distributed systems should maintain atomicity and data independence, but this cannot always be achieved. They should also maintain concurrency control and distributed recovery.

The authors describe different kinds of distributed systems. Client-Server systems are easy to set up and easy for people to use and make good use of client processing power. A collaborating server adds to this the ability to run queries on multiple servers. The middleware architecture adds an additional server that coordinates the distribution of tasks to local and distributed servers for processing queries.

There are a bunch of tricks for distributed queries that use caching or union for joins or just compute and reduce for aggregate functions. Semi-join and Bloom join are methods for reducing the cost of transmitting rows across servers. The authors then go on to talk about synchronization and error handling. Synchronous replication is not always possible or easy to implement. Primary site and peer-to-peer replication paradigms offer alternatives. I thought that these sections were the most interesting, and the strengths of this chapter. The paradigms of distributed algorithms were interesting to see in the context of joins. The recovery procedures at the end were also quite insightful. We’ve seen some of these ideas earlier in the class but not the extensions to distributed systems. We see, as we often do in database systems, that the ideal solution isn’t always practical to implement. The chapter ends by saying that 3PC is not often used because of overhead and usually 2PC is used instead even though it has more flaws.

The drawbacks of the paper can be seen in sections of the paper that are less clear. The chapter looks like it was run through a bad optical character recognition program and some of the examples are harder to follow because of this. Though I enjoyed the material in 22.10 I thought it was hard to follow, especially 22.10.2. Section 22.7 didn’t offer enough details to see the real significance of the differences between architectures. I think if 22.7 had an example or two, and if 22.10 was more clear it would have been a more effective chapter.



Review 16

Part 1: Overview

This chapter discusses parallel and distributed databases. Parallel databases try to improve performance by pipelining and parallelizing operations, and they rely on data partitioning. Distributed databases store data across several sites, each of which is actually a small, independent database. Together these mechanisms can increase availability and scalability, but they also raise issues such as concurrency control and analyzing distributed data.

Recall the three common parallel database architectures: shared-nothing, shared-memory, and shared-disk. Each leads to a different parallel scheme. For the shared-memory and shared-disk models, increasing the number of CPUs can actually slow down overall performance, because all processes contend for memory or disks. For the shared-nothing model, which is best for parallelism, the execution of database queries needs to be reorganized. The ideal case would be linear scale-up, but the overhead of reorganizing makes it hard to reach in practice. Besides evaluating different operators in parallel, we can also partition the data to be processed, so that a single operation is also carried out in parallel. Round-robin partitioning is suitable for queries that need to scan the entire table. Scanning, sorting, and joins can all be parallelized.

Distributed databases should provide certain properties, including distributed data independence and distributed transaction atomicity. Gateway protocols mask the differences between databases and provide an external API for applications. Alternative system architectures are client-server systems, collaborating server systems, and middleware systems. One of the key problems for distributed databases is fragmentation. If a relation gets big, is stored in pieces across the system, and is also replicated, it becomes hard to keep track of. Distributed catalog management requires objects to have local names so that they can be located quickly. Note that a centralized catalog structure is vulnerable to failure. When doing distributed query processing, we need to determine carefully which data to fetch and which data to ship to a single site. Updates to a distributed database should also be able to use asynchronous replication. Whether the structure is a centralized site or decentralized peer-to-peer should always be taken into consideration in query optimization and execution, including deadlock handling and failure recovery.

Part 2: Contributions

This chapter gives a clear overview of parallel and distributed databases and also touches on many detailed areas. As big data analysis thrives in industry, parallel and distributed databases will become more and more important.

Part 3: Possible drawbacks

Parallel and distributed databases are hot topics; however, many principles in this area still need to be established theoretically. Some concepts, including eventual consistency, are still not well defined.



Review 17

The paper gives an overview of concepts in parallel and distributed database systems. A parallel database system seeks to improve performance through parallelization of various operations, such as reading data, building indexes, and evaluating queries, whereas a distributed database system primarily focuses on physically distributing data across several sites, thereby increasing availability and performance.

The paper explains that, among the different parallel database architectures, the shared-nothing architecture provides better speed-up, because both the shared-memory and shared-disk architectures suffer from resource contention as more CPUs are added to the system. Given that the database is horizontally partitioned, parallelism can be exploited at different levels: executing individual operations of a query, such as sorting and joining, in parallel; executing different operations in a query in parallel; or even running multiple queries in parallel. In addition, the paper discusses different issues related to distributed database systems. In a distributed database, data independence is important, i.e., users should be able to ask queries without specifying where the referenced relations are located. Moreover, even though the data is distributed, the atomicity of a transaction should be maintained. Distributing data across multiple sites can be done either by replicating data or by partitioning it vertically or horizontally. One important point the authors make is that in a distributed database the communication cost matters. Finally, the distributed DBMS should handle network failures, which are not a main concern in centralized DBMSs.

The paper’s main strength is that it draws a good distinction between parallel and distributed database systems. Whereas a parallel database system is primarily motivated by improving performance across various operations, both intra-query and inter-query, a distributed database is motivated by improving availability and performance by distributing data across multiple nodes. In addition, it addresses the main optimization techniques and implementation issues in both kinds of systems, including improved hash join in parallel database systems, efficient joins across multiple sites, and commit protocols such as two-phase and three-phase commit in distributed database systems.

The main limitation of the paper is that it doesn’t address techniques for emerging distributed database systems. For example, eventual consistency is becoming increasingly common in NoSQL-based distributed databases in place of strict two-phase concurrency control. In addition, the paper only addresses disk-based database systems. However, there are emerging storage media, such as main memory and flash-based memory, that can hold a complete database. An evaluation of the shared-memory and shared-disk parallel architectures on these media would have been very insightful. Furthermore, it would have been great if the authors had included a section discussing the disadvantages of both parallel and distributed database systems. For example, distributed databases are known to be complex in terms of maintaining integrity and security.



Review 18

The issues of parallelism and data distribution are discussed in this chapter. Parallelism and data distribution have different focuses, but they do not conflict with each other in database systems. Parallel database systems emphasize parallelization of various operations, while in distributed database systems, data is physically stored across several sites to achieve data availability.

After a brief introduction, the text dives into both in more detail. The key idea of parallel databases is to carry out evaluation steps in parallel whenever possible. There are three options for the parallel architecture: shared-memory, shared-disk, and shared-nothing. Among those three, the last is the most preferred because it provides linear speed-up, meaning that existing CPUs don't slow down when more CPUs are added to the system. It also supports linear scale-up. Communication won't be a bottleneck, although more extensive reorganization of the DBMS code is required.

When it comes to query evaluation under the shared-nothing scheme, there are 2 kinds of parallelism that the system can achieve:
1. evaluating different operators in a relational query execution plan in parallel
2. evaluating each individual operator in parallel
a. key: partition the input data
b. (called data partitioned parallel evaluation)

Data-partitioned parallel evaluation focuses on parallelism within an operator. There are several partitioning options: round-robin, hash partitioning, and range partitioning. Each has its pros and cons and should be selected accordingly.

Then the chapter talks more about distributed databases. They can be categorized into two branches: homogeneous and heterogeneous systems. The latter is more attractive, but it needs well-accepted standards for gateway protocols. Sometimes, to achieve high data availability for reads, consistency and physical data independence are violated. In particular, when it comes to replication, asynchronous replication is often preferred even though its guarantees are weaker than those of synchronous replication.

==highlight==
This chapter provides a systematic overview of parallel and distributed databases, and touches on some typical techniques and topics in both.

==weakness==
As said, these two topics are not completely separated, and some features, goals and techniques are shared. But the chapter didn't do a great job summarizing some of the common points in these two systems.


Review 19

This book chapter discusses distributed databases.
It begins with architectures for parallelism at the hardware level: shared-nothing, shared-memory, and shared-disk. Then it talks about how parallelism is achieved in these parallel systems. Data partitioning is used to store data on different machines, and many operations are redesigned, such as loading, sorting, and joins.
After providing the necessary background, the chapter begins to talk about distributed databases. The first thing discussed is the architecture. According to the functionality of the different processes, there are three architectures: Client-Server Systems, Collaborating Server Systems, and Middleware Systems. Then it talks in detail about how data is stored and queried in a distributed database. After that, distributed concurrency control and recovery are discussed.

Good:
This chapter provides a comprehensive introduction to distributed database systems, which is very useful background knowledge for someone trying to study a specific problem in distributed systems. What I like about this paper is that it presents many interesting problems to look at, such as distributed concurrency control and recovery.

Weakness:
The paper is somewhat difficult to read because of the character errors in the text.
Apart from that, I am a little curious about the bandwidth of the network that connects the different nodes. For example, when the paper discusses the distributed method for joins, it says that we can partition the data and then send each partition to the correct machine for the join. If the table is large, then network bandwidth might be a big problem. I wish to know more about how this problem is solved.



Review 20

This chapter highlights the various properties of parallel and distributed database systems.

A parallel database system essentially is meant to improve performance through parallelization of various operations such as querying, building indexes and loading data. A distributed database system is where the data is physically stored across different sites and each site is managed independently by a database system.

Parallel database systems:
There are three kinds of architectures in parallel DBMSs: shared memory, shared disk, and shared nothing. It is relatively easier to port a conventional system to shared memory or shared disk than to shared nothing.

However, in these two cases, one of the major disadvantages is that as the number of CPUs increases, memory contention becomes a major issue. Using shared nothing does require more changes to the codebase, but the architecture scales as CPUs and disks are added, provided they grow in proportion to the data.

Partitioning horizontally helps exploit the I/O bandwidth of the system by parallelizing reads and writes across the database. It can also help because indexes are built for each partition, so only the partition that was affected needs to update its index. Range-based partitioning is good for assigning an equivalent number of tuples to all the systems; however, if a query spans multiple partitions for a given range, it is inefficient.

The authors present the idea of parallelizing sequential operator evaluation code by splitting streams so that they can be processed by multiple database systems and then merged to return the result, as in parallel joins. However, this might be very inefficient for complex queries that are not equijoins, and it relies heavily on the network, which can really slow it down. The data must also be partitioned well for the queries the system expects; otherwise, the advantages of sort-merge join are seriously reduced, especially in a shared-nothing architecture.

Distributed database systems:
These database systems were essentially built for high availability and distributed access to data. High availability can be achieved using replication.

The data can be fragmented either vertically (by columns) or horizontally (by rows). Each vertical fragment is extended with the tuple id so that the original tuples can be reconstructed without losing information. One clear advantage is that fragments can be replicated based on their use. For example, fragments that are considered hotspots can be replicated on all the sites, whereas a fragment that is only used locally need not be replicated at all. This takes advantage of the fact that the DBMS that manages a fragment runs independently of the others.

Two important properties that are typically expected of a distributed database system are:
1. Distributed data independence: The user does not have to know where the data is physically located to query the database.
2. Distributed transaction atomicity: The user should be able to write transactions that access multiple sites just as they would write transactions that access only a local site.
However, these properties are very difficult to support and might not, in practice, be desirable.

One of the main difficulties with distributed databases is providing integrated access to all the different locations at a given time. Keeping track of how data is distributed can be done using a site catalog, where the main server has an index of which data is stored and replicated on which site, and the user can then be redirected to the corresponding sites.

Since joins, as in parallel database systems, can be very expensive, the authors introduce two strategies, Semijoin and Bloomjoin, in which a projection of the first table onto the join columns (in the case of Bloomjoin, a bit vector built from hash values of that projection) is shipped to the second site, and the matching tuples are then shipped back to the query site. If the databases are huge, this can still result in a large cost and can be quite an overhead if the network is not reliable.
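The two strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relations `sailors` and `reserves`, the function names, the 8-bit vector, and the toy hash (value modulo vector size) are all hypothetical, and "shipping" is simulated by passing values between functions rather than over a network.

```python
# Hypothetical toy relations: sailors lives at site A, reserves at site B.
sailors = [(22, "Dustin"), (31, "Lubber"), (58, "Rusty")]   # (sid, name)
reserves = [(22, 101), (58, 103), (95, 104), (22, 102)]     # (sid, bid)


def finish_join(left, reduction):
    """Final step at site A: join the local table with the shipped-back rows."""
    return [l + r[1:] for l in left for r in reduction if l[0] == r[0]]


def semijoin(left, right):
    """Join on the first column; returns (result, rows shipped back to A)."""
    # Step 1: site A ships only the projection onto the join column.
    shipped_keys = {row[0] for row in left}
    # Step 2: site B keeps the rows that can join and ships them back.
    reduction = [row for row in right if row[0] in shipped_keys]
    # Step 3: site A completes the join locally.
    return finish_join(left, reduction), len(reduction)


def bloomjoin(left, right, bits=8):
    """Same idea, but site A ships a bit vector instead of the projection."""
    # Step 1: site A sets one bit per join value (toy hash: value mod bits).
    vector = [False] * bits
    for row in left:
        vector[row[0] % bits] = True
    # Step 2: site B ships back every row whose bit is set. Rows with a
    # colliding hash are false positives that are shipped unnecessarily.
    reduction = [row for row in right if vector[row[0] % bits]]
    # Step 3: site A completes the join, which discards the false positives.
    return finish_join(left, reduction), len(reduction)


semi_result, semi_shipped = semijoin(sailors, reserves)
bloom_result, bloom_shipped = bloomjoin(sailors, reserves)
```

With this data both functions return the same join result, but the Bloomjoin ships back one extra row (sid 95 collides with sid 31 in the 8-bit vector). That is the trade-off: a much smaller first message in exchange for possible false positives in the second.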

Updating data in distributed databases can be done using synchronous or asynchronous replication. Since synchronous replication can result in a lot of contention because of the locks needed, various forms of asynchronous replication are used. MongoDB is an example of a distributed data system that uses asynchronous replication, where a modification can be committed before all sites have been updated.

Overall, there should be specific use cases where a parallel or distributed database system is more advantageous than a single-site system, such as the Amazon store, where integrated access is often not needed and partitions are built with the most common queries in mind.






Review 21

chapter 22: parallel and distributed database

The main topic of this chapter is parallel and distributed databases. To begin with, it puts forward the idea of using distributed machines to process queries in parallel to increase performance and availability. Among the possible alternatives, the best physical architecture for the desired speedup is the shared-nothing design, and in a shared-nothing structure each operation can be implemented in parallel.

There are three types of distributed DBMS architecture: client-server, collaborating server, and middleware. In the last one, the middleware software layer is used only as a coordinator that manages the execution of a query across multiple machines. Data (tables) are usually stored as replicated fragments in the distributed system.

When processing queries in a distributed fashion, the fragmentation of data is usually the major concern, especially in the handling of joins. There are three basic ideas: first, fetch as needed; second, ship to one site; and last, semijoins and Bloomjoins. The most efficient approach is usually semijoins and Bloomjoins, which use a projection or a hash bit vector to determine in advance which tuples need to be sent.

When updating distributed data, one must always decide between synchronous and asynchronous replication, since the former serves read operations better and the latter lowers the overhead of writes. For distributed concurrency control, the choice among centralized, primary copy, and fully distributed designs also depends on the read/write ratio of the system workload. Lastly, 2PC and 3PC are introduced to support recovery from failure.

To summarize, this chapter serves well as a high-level overview of the most important issues in the design and implementation of distributed and parallel DBMSs. Nevertheless, there are still some weaknesses. The first is that the chapter lacks detailed analysis of some design ideas. For example, in the part on distributed concurrency control, not enough real-world examples are given for any of the three designs; the chapter only briefly introduces the design logic. The second weakness is that the chapter only considers high-level design ideas instead of in-depth implementation topics. For instance, in the parts on data fragmentation and updating distributed data, the authors never go into depth about failure detection and recovery for replicas. The last is that, across the 40 pages of the chapter, only three pictures or sketches are provided. Considering that this chapter mainly talks about design ideas, there should be more pictures or design diagrams to help the reader understand the overall shape of each system or design.




Review 22

This chapter talks about parallel and distributed databases.

Parallel Database:
A parallel database system seeks to improve performance through parallelization of various operations. There are three architectures for parallel databases: shared nothing, shared memory, and shared disk.
The chapter talks about three ways to partition data horizontally: round-robin, hash partitioning, and range partitioning. Round-robin partitioning is suitable for queries that access the entire relation, and the other two are good for accessing part of the relation when the partitioning is based on the attribute in the query predicate. Range partitioning has a problem with data skew; one way to address this is to use sampling.
Then the chapter goes on to how to parallelize individual operators. Scanning and bulk loading are simple: just execute them in parallel. For sorting, the data is first range partitioned on the sort attribute into n pieces, and each piece is then sorted locally. For joins, the same hash function is used to redistribute both tables, and each processor then does a local join. If a partition cannot fit in memory after the first partitioning, a second hash function is needed to split it into smaller pieces for the local join. The optimizer for a parallel database is more complex because there are several kinds of parallelism (within one operation, within one query, or across multiple queries) and the space of execution plans is larger than in a traditional database.
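The join strategy in the paragraph above can be sketched as follows. This is a minimal single-process simulation, assuming integer join keys in the first column of each tuple; the function names and the default of three workers are hypothetical, and the loop over partitions stands in for work that would run on separate processors.

```python
from collections import defaultdict


def partition(rows, n_workers):
    """Redistribute rows by hashing the join key (first column)."""
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[hash(row[0]) % n_workers].append(row)
    return parts


def local_hash_join(r_part, s_part):
    """Ordinary in-memory hash join run by one worker on its own partitions."""
    table = defaultdict(list)
    for r in r_part:
        table[r[0]].append(r)
    return [r + s[1:] for s in s_part for r in table.get(s[0], [])]


def parallel_hash_join(r, s, n_workers=3):
    # Both relations use the SAME hash function, so tuples that can join
    # always land on the same worker and no cross-worker matches are missed.
    r_parts = partition(r, n_workers)
    s_parts = partition(s, n_workers)
    result = []
    for r_part, s_part in zip(r_parts, s_parts):  # one iteration per worker
        result.extend(local_hash_join(r_part, s_part))
    return result


r = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
s = [(2, "x"), (4, "y"), (4, "z"), (5, "w")]
joined = sorted(parallel_hash_join(r, s))
```

The sketch leaves out the second-level hashing for partitions that do not fit in memory; that step would apply a different hash function inside `local_hash_join`.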

Distributed database
Data in a distributed database is stored across several sites, and each site is managed by a DBMS that can run independently. The traditional view is that a distributed database should be queried in the same way as a traditional database and should run transactions in the same way as a local database. But sometimes this is too expensive or even impossible to achieve. There are two types of distributed database: a homogeneous distributed database runs the same DBMS at each site, and a heterogeneous one runs different DBMSs.
Then the chapter talks about three architectures for distributed databases:
Client-Server: Clients are responsible for the user interface, and the server manages data and executes transactions. But it cannot allow a single query to span multiple servers, because that would require the client to know about the system.
Collaborating Server system: When a server receives a query that needs access to other servers, it decomposes the work and hands it to those servers.
Middleware System: One server manages queries and decomposes the work across multiple servers, and the other servers only do their own local jobs.
There are two ways to partition the data: horizontal, which partitions the tuples, and vertical, which partitions the attributes. The partitions may have several replicas, which provide better availability and better access performance.
In the distributed database, the name of an object is username + localname + birthsite. This allows a user to name an object without consulting a global namespace, and thus helps achieve distributed data independence.
For distributed query processing, the cost of executing a query includes the communication cost, so the optimizer needs to consider which site to use to access data. The chapter talks about the join operator in detail and gives three strategies. One is to fetch tuples as needed, which causes too much communication. Another is to ship the whole table to another site and do the join locally. The third is to ship only the part of the table that is useful for the join to the other site.
For updating the data, there are two ways. One is to commit only after all sites have been changed; the other is to update copies of the relation periodically rather than immediately.
A distributed database also has to make transactions atomic, as in a traditional database, and detect deadlocks. The chapter talks about 2PC for achieving atomicity in multi-site transactions.

Strength:
The chapter first talks about parallel databases and how parallel computing is applied to databases to provide better performance. It also talks about the parallel algorithms used to implement the operators.
For distributed databases, the chapter covers many of the concerns and the different implementations that address them. It starts with how to build a distributed database, including partitioning and the catalog, and then talks about how the implementation makes distribution transparent for transactions and queries.

Weakness:
When it discusses the join operator, the chapter only covers the natural join. It does not explain how a parallel or distributed database makes other kinds of join more efficient. The chapter also doesn't talk about multi-table joins or pipelining in the join operator.


Review 23

This chapter motivates the need for parallel database systems and discusses possible implementations. Three high-level designs are possible: shared-nothing, shared-memory, and shared-disk, but shared-nothing systems have the best scaling characteristics. The chapter takes a deeper look at parallel execution within a query, primarily through taking advantage of partitioning to distribute work and shuffling data around when that is best for the operation. Another technique for parallelizing work is to split it into subproblems; this can be done for hash joins, for example. Besides parallelizing individual operations, parallelization can also be achieved by pipelining operators and concurrently executing independent operators.

The chapter also discusses distributed database architectures. It names three types: client-server, collaborating-server, and middleware systems. The client-server system is essentially just a single-system database, as only one machine does all of the work. In the collaborating-server system, servers must coordinate transactions among themselves. In the middleware system, we have a similar situation, but coordination is pushed into a central server while individual servers worry only about consistency of their own data. Data can be fragmented (or sharded) either horizontally or vertically (row-wise or column-wise) across machines. The chapter also discusses naming (a global nameserver can be a bottleneck and is undesirable, and there are ways to name objects locally and still achieve global uniqueness), distributed data independence (an extension of data independence so that users are not aware of where their data is stored), and query execution in distributed databases and how it is affected by fragmentation. The chapter also discusses replication (synchronous vs asynchronous) and possible consistency issues, as well as distributed transactions.

The chapter did an excellent job of providing the motivation for parallel and distributed databases and the implementations of these systems. It also provided enough depth that a reader could have some initial thoughts about how to design such a system without getting into excessive depth. It also made numerically clear that shared-nothing architectures have the best scaling properties, as with shared-memory systems lock contention will eventually dominate CPU consumption. I thought that some performance measurements would have been useful to illustrate how far shared-nothing systems can scale, but this is not a major omission seeing that this is a textbook chapter.


Review 24

This paper gives an introduction to parallel and distributed databases, what the differences are between the two, and which algorithms are used to make distributed systems as efficient as possible. In the past we were able to get away with centralized databases because databases were not as big and transaction throughput did not need to be that high. However, with the increasing trend of distributed systems, parallel computing and the need to increase transaction throughput, parallel and distributed database systems and the algorithms behind them become more important.

With parallel databases, there are three architectures for the system: shared nothing, shared memory and shared disk. Shared memory is much closer to a traditional machine, so most DBMS can just be ported. The best performance increase, though, would be with the shared nothing architecture. The authors then explain how to parallelize individual operators such as scanning, sorting and joining. For joining, the authors focus on parallel hash join and how to improve the existing parallel hash join.

With distributed databases, there are also three types of architectures: client-server, collaborating server and middleware. One of the biggest considerations for distributed databases is how to fragment the database to improve performance and availability. The database can be distributed based on similarities in the data such as index ranges, but we then have to consider hotspots where certain rows will be accessed by many queries. Another challenge with distributed databases is how to replicate data to ensure availability while also preserving performance. Usually, the DBMS will require writing to multiple copies to maintain some consistency. We can either enforce immediate consistency and sacrifice performance, or write to fewer than all of the replicas and then have to read from multiple copies when reading.
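The last trade-off can be made concrete with a small voting sketch. It assumes each copy carries a version number and that a read returns the copy with the highest version it sees; the names `quorums_overlap`, `write`, and `read` are hypothetical, and for simplicity the writer peeks at every copy to pick the next version number, which a real system would not do.

```python
def quorums_overlap(n_copies, read_quorum, write_quorum):
    """A read is guaranteed to see the latest write when R + W > N."""
    return read_quorum + write_quorum > n_copies


def write(replicas, targets, value):
    """Write a new version to the chosen subset of copies only."""
    # Simplification: look at all copies to choose the next version number.
    new_version = max(version for version, _ in replicas) + 1
    for i in targets:
        replicas[i] = (new_version, value)


def read(replicas, targets):
    """Read the chosen copies and return the value with the highest version."""
    _, value = max((replicas[i] for i in targets), key=lambda copy: copy[0])
    return value


# Three copies of one object, each stored as (version, value).
replicas = [(0, "old")] * 3
write(replicas, targets=[0, 1], value="new")   # W = 2 of N = 3

fresh = read(replicas, targets=[1, 2])         # R = 2, overlaps the write set
stale = read(replicas, targets=[2])            # R = 1, may miss the write
```

Here `fresh` is the new value because any two copies must include at least one that was written, while `stale` shows what can happen when the read set is too small to overlap the write set.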

Overall, the paper does a great job of explaining the challenges in implementing a distributed database and many methods to improve performance. However, the following are some weaknesses I see with the paper:

1. In many places, the authors describe multiple solutions to some challenge and indicate the tradeoffs, but they never fully explain how the tradeoffs will affect the performance of the system. For example, when they explain the difference between synchronous and asynchronous replication, they never explain how much more time is needed to wait for all copies to be written to.



Review 25

This paper discusses many elements of parallel database systems, such as architecture, parallelization of queries, and distribution of data. As databases become larger, and businesses become more global, it will be important to take advantage of the many benefits of parallel databases. However, implementing parallel databases effectively is a very challenging topic, and can actually decrease performance if it is not managed carefully.

The three main architectures used in parallel databases are shared-nothing, shared-memory, and shared-disk. Communication overhead is fairly low in both the shared-memory and shared-disk architectures; however, these systems do not scale well. As the number of systems in the network increases, contention for memory and disk access increases rapidly. Past a certain point, adding more nodes can actually decrease performance below that of a single computer. Shared-nothing architectures require more complex communication protocols, since all database coordination must be managed via the network connection; however, speed-up and scale-up both increase linearly, making this the most popular choice for high-end parallel systems.

Some other benefits of parallel systems include data partitioning and replication. Replicating data across several servers increases the availability of that data and can allow for live takeover in the event of a server failure. Replication introduces some additional overhead, however, as all replicas of the data must be updated. Synchronous updates can be prohibitively expensive in distributed systems, as locks must be coordinated at a global level, which can cause increased deadlock and severely decreased performance. Asynchronous updates can be made much more quickly, but allow for windows of inconsistency after data is updated. Data partitioning splits data horizontally (by rows) or vertically (by columns) between several servers. This can allow for more rapid processing of queries in cases where data is partitioned by range, since fewer servers may need to be contacted, and tuples may need to be shipped from fewer locations to complete queries. However, if data is skewed such that certain ranges contain far more values than others, the servers holding those ranges can become bottlenecks for related queries.

The process of committing transactions across parallel databases that make use of data replication can also be very complicated. The most common protocol used is 2-phase commit, in which a master server sends messages to all servers containing replicas of the data being updated, telling them to prepare for commit. If any server responds with an abort message or does not respond at all, the master server aborts the transaction. If all replica servers respond with success, the master server tells them to commit. Once it has received confirmation that all servers have committed successfully, it writes an end record to the log. This scheme works well for the most part, but can be vulnerable if the master server fails at a certain point. In these cases, the replica servers may be forced to block until the master comes back online.
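The message flow described above can be sketched from the master's point of view. This is a simplified illustration, not a full protocol: the function name and log strings are hypothetical, votes are passed in as a dictionary instead of arriving over a network, a missing response is modeled as `None`, and acknowledgements are assumed to always arrive.

```python
def two_phase_commit(votes):
    """Simulate the master's side of 2PC.

    votes maps each site name to "yes", "no", or None (no response).
    Returns (decision, log), where log lists the master's log records.
    """
    log = []

    # Phase 1: the master asks every site to prepare and collects the votes.
    log.append("prepare sent")
    all_yes = all(vote == "yes" for vote in votes.values())

    # A single "no" vote or a missing response forces an abort.
    decision = "commit" if all_yes else "abort"
    log.append(decision)  # the decision is logged before it is sent out

    # Phase 2: the master sends the decision and waits for acknowledgements.
    # Assumption: every site acknowledges, so the end record can be written.
    acks = list(votes)
    if len(acks) == len(votes):
        log.append("end")
    return decision, log


ok_decision, ok_log = two_phase_commit({"site_a": "yes", "site_b": "yes"})
no_decision, _ = two_phase_commit({"site_a": "yes", "site_b": "no"})
timeout_decision, _ = two_phase_commit({"site_a": "yes", "site_b": None})
```

The blocking problem mentioned in the review corresponds to a site that has voted "yes" but never learns the decision: in this sketch, that would be the master failing between logging the decision and sending it.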

My chief complaint with this chapter is that it gives little data to support the performance increases of parallel database systems. In each section of the chapter, it shows how parallelism can improve performance in certain cases but can decrease performance in other cases. For example, partitioning data can incur high communication costs if subqueries must be computed at different locations then ship their results to yet another location. Enforcing consistency and atomicity between distributed systems can incur high overhead for ensuring that all data is updated in a timely fashion or recovers from crashes in a correct manner. There are severe costs for poor implementation when parallelism gets involved, and I would have liked a more detailed description of how real world parallel systems perform in comparison to single-server systems.



Review 26

This article introduced parallel and distributed databases and discussed many aspects of them. Parallel databases are designed to improve performance through parallelization of operations. In a distributed database system, data is physically stored across several sites. Distributed database systems are more and more popular nowadays. The motivations for distributed database systems are increased availability, distributed access to data, and analysis of distributed data.

First, the article talked about parallel databases. Three different kinds of the architecture are shared-memory, shared-nothing, and shared disk. Many operations can be done more efficiently using the parallel techniques, such as bulk loading, scanning, sorting, and joins. The key idea is to use many CPUs to cooperate on the individual operation. In addition, queries can also be executed in parallel for optimization. For example, the result of one operator can be pipelined into another, and multiple independent operations can be executed concurrently. Thus, both operations and queries of the DBMS can be improved by parallelism.

Second, the article talked about distributed databases. Three different types are homogeneous, heterogeneous, and multidatabase systems. The data in distributed database systems is stored through fragmentation and replication. Data consistency therefore becomes an issue, because the data is stored across several sites. The updating techniques are synchronous replication and asynchronous replication. The article also talked about concurrency control and recovery in distributed database systems. The algorithms are a little different since the data is stored at distributed sites and they have to deal with communication problems.

The strength of this article is the completeness of the discussion about parallel and distributed database systems, including motivation, architecture, implementation, and related algorithms. The ideas are illustrated by many examples, which make readers understand more easily.

The weakness of this article is the lack of detail on certain subtopics. Because the article covers many different aspects of parallel and distributed database systems, it does not discuss them in detail. For example, for recovery in distributed systems, it does not provide a detailed description of the algorithms.

To sum up, this article introduced parallel and distributed databases, which are becoming more and more popular and important in the database area.



Review 27

This chapter is an overview of parallel and distributed database systems. It is a high level overview that gives a good introduction to the goals and problems faced in that area of study. The main goal of making a database parallel or distributed is to increase availability and speed of access.

The three main types of parallel databases are: shared nothing, shared memory, and shared disk. In shared nothing each CPU has its own memory and disk corresponding to it and no data is shared between the machines. In shared memory the whole system has a global shared memory and disk and each CPU must communicate through a network to access that. In shared disk each CPU has its own memory but all the disk is shared. The main downside to shared disk and shared memory is the more CPUs that are added the slower they all become because there is more disk and memory contention. This speed up is better in shared nothing but consistency is a harder issue in shared nothing as all changes need to be pushed to everything due to nothing being shared.

Another issue with these forms of databases is partitioning the data. Partitioning the data optimally is an NP problem and one that I know lots of research has gone into. This paper gives a very brief overview of partition styles, but I would have liked a more in-depth analysis of partitioning and what makes it difficult, especially given how in depth this paper talked about joining algorithms.

This paper also discussed the three main distributed DBMS architectures:
1.) Client-Server Systems: Client sends requests and all computations are done on server. Good for simplicity to implement and client machines being inexpensive. Bad because everything must communicate over a network.
2.) Collaborating Server Systems: Similar to client-server but there are multiple servers which are responsible for managing multi-site execution strategies.
3.) Middleware Systems: Similar to collaborating server but the servers are not responsible for managing multi-site execution strategies, the middleware is.

Next this paper discussed fragmentation, which is similar to partitioning in that it is a way of splitting up data across multiple machines. There are two main types of fragmentation:
1.) Horizontal Fragmentation: each machine will store a whole row (or tuple) and the whole table will be distributed over the machines on a tuple based system.
2.) Vertical Fragmentation: each machine will store a whole column or columns of data and the whole table will be distributed over the machines on a column based system.
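A small sketch may help make the two fragmentation styles concrete. The `employees` table, its columns, and the function names are hypothetical; the sketch assumes, as is usual for vertical fragmentation, that every vertical fragment keeps a tuple id (`tid`) so the original rows can be rebuilt by joining on it.

```python
# Hypothetical relation; "tid" is the tuple id carried by every fragment.
employees = [
    {"tid": 1, "name": "Ann", "city": "Chicago", "salary": 50},
    {"tid": 2, "name": "Bob", "city": "Madison", "salary": 60},
    {"tid": 3, "name": "Cy", "city": "Chicago", "salary": 70},
]


def horizontal_fragment(rows, predicate):
    """Keep whole rows that satisfy the predicate (a subset of the tuples)."""
    return [row for row in rows if predicate(row)]


def vertical_fragment(rows, columns):
    """Keep only some columns of every row, plus the tuple id."""
    return [{"tid": row["tid"], **{c: row[c] for c in columns}} for row in rows]


def reconstruct(fragment_a, fragment_b):
    """Rebuild the original rows by joining two vertical fragments on tid."""
    by_tid = {row["tid"]: row for row in fragment_b}
    return [{**row, **by_tid[row["tid"]]} for row in fragment_a]


# Horizontal: each site stores the rows for one city.
chicago = horizontal_fragment(employees, lambda r: r["city"] == "Chicago")
madison = horizontal_fragment(employees, lambda r: r["city"] == "Madison")

# Vertical: one site stores names and cities, another stores salaries.
names = vertical_fragment(employees, ["name", "city"])
pay = vertical_fragment(employees, ["salary"])
rebuilt = reconstruct(names, pay)
```

The union of the horizontal fragments gives back the table, and the join of the vertical fragments on `tid` does the same, which is why neither style loses information.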

The paper also touches on query optimization, updating data, distributed transactions, concurrency control, and lastly recovery. We have hit on almost all of these topics already in lecture so I will not go more in depth on any of them (and this review/summary is already a little long).

The final downside to this paper I will mention is that the OCR used to convert the page images to text for the PDF did not do a very good job. It missed a lot of letters and converted them to other characters, which made the paper a great deal harder to read. This is not a problem with the chapter itself, but it is a problem with the PDF.

Overall I think this was a solid paper and it did a good job introducing distributed and parallel database systems. I think many of these things have already come up in lecture but this gave a good high level overview/refresher.



Review 28

This reading was relatively straightforward, giving an overview of the requirements and varieties of architecture of both distributed and parallel database systems. Parallel systems are meant to handle operations in parallel, while distributed systems can be thought of as handling the data in parallel by partitioning data up to some replication factor.

There are several different ideas behind parallel DBMS systems: shared-memory, shared-disk, and shared-nothing. Shared-nothing seems to be the most popular because it doesn’t suffer from the performance bottlenecks introduced by hotspots (i.e. CPUs attempting to access the same piece of information) as the other two do. Shared-nothing is easily scalable through the process of simply adding nodes to handle processing, though there is an additional overhead incurred by communication between processors. Many industry-level applications value scalability over this fact, however, or have implemented optimizations to circumvent it. In addition, the parallelism itself can take multiple forms, such as pipelined parallelism (where operations are broken up into “stages” which each thread handles separately) and data-partitioned parallelism, which takes advantage of multiple data sites, where operations are executed simultaneously across different data stores.

Data partitioning itself can also be present in a variety of forms: round-robin, hashing, and range partitioning. Round-robin partitioning simply rotates where data is inserted at every given iteration, range partition places new data into “buckets” depending on some index or attribute value. Hashing performs essentially the same operation, but via a hash function. Round-robin and hashing are good for when there isn’t an obvious partitioning key, especially since range partitioning can be susceptible to data skew (many values in one attribute range). The architecture choice also changes the way that distributed databases operate (middle-man, collaborative serving, client-server), but some of these lend themselves to parallelizing operations more efficiently than others, such as scanning/sorting/join operations. Separately, it is discussed how hashing and sorting for partitioning can make scanning and join queries more efficient. The discussion is followed by a description of concurrency control (replication, P2P replication, asynchronicity) in distributed systems, which is more of an encompassing view of the concepts of concurrency control we have seen in other papers read in the class.
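The three partitioning schemes named at the start of the paragraph above can be compared on a deliberately skewed data set. This is only a sketch: the `ages` list, the range boundaries, and the function names are hypothetical, and Python's built-in `hash` stands in for whatever hash function a real system would use (for small non-negative integers it returns the integer itself).

```python
from bisect import bisect_right


def round_robin(rows, n):
    """Send row i to partition i mod n, ignoring the row's contents."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Send each row to the partition chosen by hashing its key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts


def range_partition(rows, boundaries, key):
    """Send each row to the range its key falls in (boundaries are sorted)."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, key(row))].append(row)
    return parts


# Skewed toy data: most values fall into the lowest range.
ages = [21, 22, 23, 24, 25, 26, 27, 61, 75]


def identity(value):
    return value


rr_sizes = [len(p) for p in round_robin(ages, 3)]
hash_sizes = [len(p) for p in hash_partition(ages, 3, identity)]
range_sizes = [len(p) for p in range_partition(ages, [30, 60], identity)]
```

On this data round-robin gives perfectly even partitions, hashing is close to even, and range partitioning with boundaries at 30 and 60 puts seven of the nine rows in one partition, which is the data skew the review mentions. Choosing boundaries from a sample of the data is one way to reduce it.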

I am confused about a few topics, however, such as the advantages and operations of bushy trees vs left-deep trees mentioned in the summary, as well as how the “capture” operation works in asynchronous replication. Overall, the chapter was, as they usually are, broad in its description, but could have used more visual examples to help understand the material (I found myself googling many of the concepts to solidify understanding), as well as concrete use-cases to make the ideas more clear.


Review 29

This chapter discusses parallel and distributed databases: the motivation behind the concepts, the architectures of parallel and distributed databases, how database operations are executed in these two concepts as well as how the executions differ from the single database concept.

The first part of the chapter talks about parallel databases. The basic idea of a parallel database is to improve performance by carrying out evaluation steps in parallel whenever possible (i.e.: parallelization of loading data, building indexes, query evaluation, or other DB operations). Even though data may be stored in a distributed fashion, the focus is on performance. The chapter continues with the architectures for parallel databases: shared nothing, shared memory, and shared disk. Among these three, shared nothing is considered the best architecture. Next is query evaluation, in which the chapter takes into account data partitioning and parallelizing sequential operator evaluation code. For the operations (assuming that the data is horizontally partitioned), the chapter discusses bulk loading and scanning, sorting, joins, and improved parallel hash join. Last, it discusses parallel query optimization. It highlights that the plan that returns answers quickest may not be the plan with the least cost. Several parameters (i.e.: available buffer space and number of available processors) are also known only at run-time.

The second part of the chapter talks about distributed databases, in which the distribution of data is governed mostly by local ownership and increased availability, rather than by performance issues as in parallel databases. Two types of distributed database are homogeneous (same DBMSs) and heterogeneous (different DBMSs). For the architecture, there are 3 types: client-server systems, collaborating server systems, and middleware systems. The client-server system is the simplest, while the middleware system is designed for a complex system, since the middleware server does not contain any data but only serves as an execution coordinator. Next, the chapter talks about storing data using fragmentation and replication. Of course, distributed systems need catalog management to keep track of the distributed data. The chapter explains the object naming convention, catalog structure, and independence of distributed data in relation to changes and how they affect the catalog. Distributed query processing is also explained, taking examples from nonjoin queries and join queries. Join queries can be very expensive to execute, so this section focuses on strategies for optimizing query execution cost, building on the “fetch as needed” and “ship to one site” principles. The strategies are semijoins and bloomjoins. Between these two, bloomjoins seem to be the more optimal one. Next, the reading considers updating distributed data. There are two alternatives: synchronous (which comes at significant cost) and asynchronous replication. Since peer-to-peer replication is not discussed further, asynchronous replication is elaborated through the practice of primary site replication, in which changes are propagated in two steps: capture (log based) and apply. It mentions a bit about data warehousing as an example of replication. The last three sections explain distributed transactions, distributed concurrency control, and distributed recovery.
In concurrency control, it explains that distributed deadlock detection can be a bit tricky, since the waits-for graph should be viewed globally, and there is also the fact that the graph may not be quickly updated in the event of an aborted transaction. The recovery part is focused on restart after failure and the Two-Phase Commit (and the refinement of it). Last, it mentions Three-Phase commit, which – while careful – is not used in practice.
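The point about viewing the waits-for graph globally can be illustrated with a small sketch. It assumes each site reports its local waits-for graph as a dictionary from a transaction to the set of transactions it waits for; the site graphs, transaction names, and function names here are hypothetical.

```python
def merge_graphs(local_graphs):
    """Union the local waits-for graphs into one global graph."""
    merged = {}
    for graph in local_graphs:
        for txn, waits_for in graph.items():
            merged.setdefault(txn, set()).update(waits_for)
    return merged


def has_cycle(graph):
    """Depth-first search; a cycle in a waits-for graph means deadlock."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True          # reached a node already on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


# At site A, T1 waits for T2; at site B, T2 waits for T1.
site_a = {"T1": {"T2"}}
site_b = {"T2": {"T1"}}

local_deadlock = has_cycle(site_a) or has_cycle(site_b)
global_deadlock = has_cycle(merge_graphs([site_a, site_b]))
```

Neither site sees a cycle on its own, but the merged graph contains one. The sketch does not model the delay in propagating graph updates, which is what can cause a deadlock to be reported after one of the transactions has already aborted.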

The main contribution of the reading is that it gives a thorough but easy-to-understand explanation of parallel and distributed databases. Not only does it explain how each concept works, it also explains the consequences for executing operations, especially in a distributed database system, where they differ from those in a single-site database. This reading is well suited as an introduction.

Unfortunately, this also means that the concepts discussed here are rather general. The chapter briefly mentions heterogeneous and homogeneous distributed database systems, and although it notes that these systems come at significant cost (especially heterogeneous ones), it never really discusses the cost difference between the two (that is, how the execution of operations in one system compares to the other). It is also unfortunate that the reading does not discuss peer-to-peer replication further, although it mentions situations where peer-to-peer replication will not lead to conflicts. I find this somewhat ironic, because nowadays more and more architectures use peer-to-peer replication. However, this chapter seems to dismiss it and chooses to assume a system based on a primary site instead.




Review 30

The purpose of this chapter is to discuss parallel and distributed databases. It provides the reader with an introduction to these two concepts, relating them back to what the reader already knows about them from general computer science and pointing out the differences that must be considered when they are implemented in a DBMS.

This article walks through the various components of parallel databases, then distributed databases. In each section, it breaks down a major concept and then provides brief overviews of specific ways that the concept is implemented in a DBMS. For example, consider the section on Distributed concurrency control in DBMSs. The section first discusses the challenges of distributed concurrency control and then proceeds to introduce several kinds of lock management: centralized, primary copy, and fully distributed. It then discusses the pros and cons of each of the three presented methods before moving into a discussion of distributed deadlocks.

I don’t think there are any main contributions of this textbook as it is not presenting any new information or insights. Its strengths are in its methodical presentation of the various DBMS components in parallel and distributed environments and how they differ from a normal DBMS.

As far as weaknesses go, again, this is a textbook, so there are no weaknesses in the methods used. However, I wish that, for the sake of the learning reader, more concrete examples had been provided. I wish that there had been more real-world application discussion as well, as it would be interesting to know just what commercial DBMSs are using out of these techniques. I also found it interesting to consider this discussion after many of the recent papers we’ve read have discounted parallel DBMSs in particular as not being worth the overhead in the trend toward in-memory databases with high computing power.



Review 31

Review: Parallel and Distributed Databases

Chapter Summary:
This chapter focuses on the issues of parallelism and distributed DBMS. In the beginning it introduces parallel and distributed database systems and the alternative hardware configurations for a parallel DBMS as well as data partitioning and its influence on parallel query evaluations. Following that it shows how data partitioning can be used to parallelize several relational operations. To wrap up, the chapter discusses parallel query optimization.

The later part of the chapter focuses on distributed databases. After an overview of distributed databases, it discusses the alternative distributed DBMS architectures and the options for distributing data across a distributed database system.

Parallelizing a database system aims to improve performance by parallelizing operations such as loading data, building indexes, and evaluating queries. In a distributed database system, data is physically stored at different sites, each managed by an independent DBMS. This brings the advantages of increased availability, distributed access to data, and analysis of distributed data, all of which lead to improved performance w.r.t. multiple query operations. The key ideas in implementing such distributed database systems are how to partition data so that it is stored at different sites, and how to optimize queries so that they can be executed in a parallel manner.

In a distributed database system, to make the impact of data distribution transparent, the following properties are desirable: distributed data independence and distributed transaction atomicity. The former means that users can pose a query without specifying where the referenced data is located, and receive the same answer no matter where it is executed in the distributed system. The latter means that a transaction that updates data at several sites behaves like a local one, without being tied to a particular location/site: either all of its changes persist or none do.

The chapter then finishes with a discussion of how distributed systems are implemented and managed with the goal of preserving these desirable properties.



Review 32

Summary:
This chapter mainly introduces parallel databases and distributed databases. It clarifies the nuanced differences between a parallel database system and a distributed database system: a parallel DB tries to distribute computation among multiple cores to improve performance, whereas a distributed DB tries to distribute data among multiple nodes to reduce latency and provide better availability. It also introduces 3 different architectures for distributed and parallel DBs, namely shared-nothing, shared-memory, and shared-disk, and indicates that the main architecture used to deploy current parallel and distributed DBs is shared-nothing.

It introduces the main concepts of parallel DBs by first presenting two principles of parallelism: 1. Pipelined parallelism, also known as the dataflow model: the output of the first operator is the input of the second operator, which can start evaluating it immediately. Operators that are not pipelined with one another can be evaluated independently. 2. Data-partitioned parallel evaluation: by partitioning the input data and sending the partitions to individual processors, we parallelize the evaluation of each operator. It also introduces 3 different partitioning techniques and their pros and cons: 1. Hash partitioning, which is good for queries with equality ("=") selections and for join queries; 2. Range partitioning, which suits range selection queries but may cause data skew; 3. Round-robin partitioning, which suits queries that access the entire relation. It also mentions a way to resolve the data skew problem in range partitioning: use a sample of the data to choose ranges that partition the data evenly. This paper also introduces a variety of parallel operators, including scan and load, sort, and join. It also introduces the essential building blocks of these operators, namely the split operator (the Mapper in MapReduce) and the merge operator (the Reducer in MapReduce). It pays particular attention to the join operator: it presents the parallel hash join and an improved version that reduces disk I/O overhead by keeping the partitions small enough to stay in memory, and it also mentions the sort-merge join algorithm in the parallel DB setting.
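
The three partitioning techniques summarized above can be sketched as follows. This is an illustration in Python under simplifying assumptions, not the chapter's own code: partitions are in-memory lists, and the function names (`round_robin`, `hash_partition`, `range_partition`, `sample_boundaries`) are hypothetical.

```python
from bisect import bisect_right
from typing import Callable, List, Sequence


def round_robin(rows: Sequence, n: int) -> List[list]:
    """Row i goes to partition i mod n; good for scans of the whole relation."""
    parts: List[list] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows: Sequence, n: int, key: Callable) -> List[list]:
    """Rows with equal keys land in the same partition; good for '=' and joins."""
    parts: List[list] = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts


def range_partition(rows: Sequence, boundaries: Sequence, key: Callable) -> List[list]:
    """Partition j holds keys between boundaries[j-1] (inclusive) and boundaries[j] (exclusive)."""
    parts: List[list] = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, key(row))].append(row)
    return parts


def sample_boundaries(sample: Sequence, n: int) -> list:
    """Choose n-1 splitting values from a sample so the ranges hold roughly equal shares."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // n] for i in range(1, n)]
```

For example, `range_partition([1, 5, 9, 12, 20], [5, 10], key=lambda r: r)` yields `[[1], [5, 9], [12, 20]]`. If most keys were clustered in one range, fixed boundaries would overload one partition; picking the boundaries with `sample_boundaries` is one way to address that skew.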

It introduces the concepts of distributed systems by first presenting the two desirable properties of a distributed DB: 1. Distributed data independence: the distribution of data is transparent to clients, and 2. Distributed transaction atomicity: the same atomicity property as in a local DB. It describes the two kinds of distributed database systems: 1. the homogeneous distributed system, in which every node runs the same DBMS, and 2. the heterogeneous system, in which each node may run a different DBMS from the others. The latter can be achieved using a gateway protocol, for example JDBC. It also introduces the 3 different architectures of a distributed DB: 1. Client-server systems, where the client is responsible for the user interface and the server is responsible for executing queries. 2. Collaborating server systems, in which each server is also responsible for splitting queries across data partitions and nodes. 3. Middleware systems, where one database system is mainly responsible for the distributed service of the whole system and coordinates between the local databases and the clients. It also mentions the 2 ways of storing data in a distributed database system: 1. Horizontal fragmentation, where the data is partitioned by rows, and 2. Vertical fragmentation, where the data is split by columns. In addition, it gives a detailed introduction to replication. It shows that replication both increases the availability of data and speeds up query evaluation. It explains the differences between synchronous and asynchronous replication: 1. Synchronous replication has two approaches, voting and read-any write-all; 2. for asynchronous replication, it introduces primary site and peer-to-peer (P2P) replication, with an emphasis on primary site replication.
It introduces the Capture and Apply steps for implementing asynchronous replication in a distributed DB, and shows that log-based Capture with continuous Apply has lower cost, whereas procedural Capture with application-driven Apply provides more flexibility. Before that, it also introduces different catalog structures: it first explains that a centralized catalog suffers from a single point of failure as well as the large overhead of maintaining the global catalog; it then introduces the catalog system used in R*, where each site keeps a local catalog describing the data stored at that site, and the catalog at a relation's birth site keeps track of where its replicas are stored, which solves the issues of the centralized catalog. It also introduces two join techniques for distributed databases. The first is the semijoin: one site sends the projection of its relation on the join column to the other site, which computes the join with that projection and sends the resulting reduced set of tuples back, so that the first site can compute the final join locally. The second is the Bloomjoin, which applies a Bloom-filter-style bit vector to the join column to reduce the data shipped. It also gives a good introduction to distributed concurrency control, discussing a centralized lock manager, primary copy locking, and a fully distributed lock manager. It also introduces deadlock detection in distributed databases, and shows the problem of phantom deadlocks, which arise because the global waits-for graph has not yet captured an abort recorded in a local waits-for graph; in other words, it is a consensus problem of the kind common in distributed systems. It gives a step-by-step introduction to Two-Phase Commit (2PC) for solving the recovery problem in distributed systems, and by examining the different scenarios in which the coordinator or subtransactions abort, it revisits 2PC and presents Two-Phase Commit with Presumed Abort.
Moreover, it also mentions the method of unique naming in distributed DBs, the concepts of cost-based optimization and data warehousing, and Three-Phase Commit.
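
The semijoin and Bloomjoin described in this summary can be sketched as below. This is a simplified illustration in Python, assuming both "sites" are just in-memory lists of rows and that the bit vector uses a single hash function; the names `local_join`, `semijoin`, `bloomjoin`, and the parameter `k` (bit-vector size) are hypothetical. Each function also returns how many rows the second site had to ship back, which is the quantity both techniques try to reduce.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def local_join(left: List[Row], right: List[Row], col: str) -> List[Row]:
    """Plain in-memory hash join on one column."""
    index: Dict[object, List[Row]] = {}
    for r in right:
        index.setdefault(r[col], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[col], [])]


def semijoin(site_a: List[Row], site_b: List[Row], col: str) -> Tuple[List[Row], int]:
    # Step 1: site A ships only the projection on the join column to site B.
    projection = {row[col] for row in site_a}
    # Step 2: site B ships back only the rows that match the projection.
    reduction = [row for row in site_b if row[col] in projection]
    # Step 3: site A completes the join locally.
    return local_join(site_a, reduction, col), len(reduction)


def bloomjoin(site_a: List[Row], site_b: List[Row], col: str, k: int = 8) -> Tuple[List[Row], int]:
    # Step 1: site A ships a k-bit vector instead of the projection itself.
    bits = [False] * k
    for row in site_a:
        bits[hash(row[col]) % k] = True
    # Step 2: site B ships back rows whose join value hashes to a set bit.
    # Hash collisions may let some non-matching rows through.
    reduction = [row for row in site_b if bits[hash(row[col]) % k]]
    # Step 3: the local join at site A discards any such false positives.
    return local_join(site_a, reduction, col), len(reduction)
```

Both functions produce the same join result. The Bloomjoin ships a smaller message in step 1 (a bit vector rather than a list of values) at the price of possibly shipping a few extra rows in step 2.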



Strengths:
1. This paper gives a great introduction to parallel and distributed database systems, with detailed information on their structure, differences, principles, and concepts. It helps its readers build a big picture of parallel and distributed DBs and understand the implementation details as well as the design concerns for these two emerging kinds of DBMS.
2. This paper covers almost every aspect of distributed and parallel databases, and it discusses important concepts such as concurrency control, join operations, and recovery with step-by-step introductions to the algorithms, which would be really helpful to readers trying to build those modules.
3. In many parts of its introduction, it shows the trade-offs behind design decisions, for example the trade-off between flexibility and efficiency in Capture and Apply.


Weaknesses:
1. Although this paper gives a comprehensive and comprehensible introduction to parallel and distributed DBs, it fails to provide any industrial examples of these two important kinds of DBMS. It would be great if this chapter also introduced some widely used related projects, such as Hive and SparkSQL for parallel DBs and Dynamo and Cassandra for distributed DBs.
2. At some points this paper fails to use the widely accepted terminology for the technology behind parallel DBs. For example, it introduces two essential operators for the dataflow model, the split and merge operators; in the Big Data and distributed computing area, however, these two operators are called the Mapper and the Reducer, in the terms of MapReduce as proposed by Google. Likewise, the read-any write-all and voting techniques are essentially the quorum technique used in Dynamo. It would be great if this paper at least mentioned this terminology to give its readers the big picture.
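
As a rough illustration of why voting and read-any write-all can be viewed as quorum schemes, here is a small sketch in Python. It assumes N copies with version numbers, a write quorum of W copies, and a read quorum of R copies; the well-known overlap condition is R + W > N. The class `Copy` and the functions `quorums_overlap`, `write`, and `read` are hypothetical names, and the sketch assumes each write reaches a majority of copies so that it always sees the latest version number.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Copy:
    version: int = 0
    value: object = None


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Every read quorum intersects every write quorum exactly when r + w > n."""
    return r + w > n


def write(copies: List[Copy], sites: Sequence[int], value: object) -> None:
    """Write to the chosen copies, stamping them with a new highest version.

    Assumes `sites` is a majority, so it includes a copy with the latest version.
    """
    new_version = max(copies[i].version for i in sites) + 1
    for i in sites:
        copies[i] = Copy(new_version, value)


def read(copies: List[Copy], sites: Sequence[int]) -> object:
    """Read the chosen copies and return the value with the highest version."""
    newest = max((copies[i] for i in sites), key=lambda c: c.version)
    return newest.value
```

With N = 3, voting with R = 2 and W = 2 satisfies the condition, and so does read-any write-all (R = 1, W = 3). Reading a single copy after a two-copy write (R = 1, W = 2) does not, and may return a stale value.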



Review 33

This article briefly introduces the motivation, design, and implementation of distributed databases. As the size of data grows and the requirements on performance and availability become stricter, traditional centralized databases are no longer feasible for such workloads. In a distributed database, the data is physically stored across multiple sites, with each site managed by a DBMS instance. Many factors can significantly affect distributed database systems, such as data locality and site autonomy. Given this observation, the design of a distributed system differs from that of a traditional one.

The architecture of distributed databases can be classified as shared-nothing, shared-memory, or shared-disk, among which shared-nothing has been most widely adopted because it is more scalable. The article mainly talks about the shared-nothing architecture. In distributed databases, data is partitioned so that each site holds a part of the whole data. To execute queries faster, parallelism is exploited at both the single-operator level and the query-plan level. For example, the performance of a join can be improved by first partitioning the tables on the join column and then distributing the partitions across the processors for joining, so that hopefully each piece fits into memory. Given a set of operators to be executed, they can either be pipelined or, if they are independent of each other, simply executed concurrently.

Distributed databases can be implemented in several models: client-server, collaborating server, or middleware systems. The middleware model is suitable for systems that consist of heterogeneous databases. To store data in a distributed fashion, data fragmentation and replication are applied. Managing the catalog involves allocating universally unique names and having catalog coordinators. Cost-based optimization is more complex because the cost includes both disk I/O and network communication. Careful trade-offs have to be made when doing joins. Updating data is also different from traditional DBMSs. A distributed concurrency control and recovery protocol is applied.

The most interesting technique introduced in this paper is Two-Phase Commit. It’s quite similar to the Paxos algorithms, which are also used to reach consensus in distributed systems. It’s very elegant to use one round of preparing and another round of final decision to coordinate the sites and make them all agree on commit or abort. The implementation is simple and effective.
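
The two rounds mentioned above can be sketched as follows. This is a minimal, failure-free illustration in Python, not the full protocol: it omits timeouts, crashes, and forced writes to stable storage, and the class `Subordinate`, the function `two_phase_commit`, and the log entries are hypothetical simplifications.

```python
from enum import Enum
from typing import List


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Subordinate:
    def __init__(self, name: str, can_commit: bool) -> None:
        self.name = name
        self.can_commit = can_commit
        self.log: List[str] = []

    def prepare(self) -> Vote:
        """Phase 1: log a prepare (or abort) record and vote."""
        if self.can_commit:
            self.log.append("prepare")
            return Vote.YES
        self.log.append("abort")
        return Vote.NO

    def decide(self, decision: str) -> bool:
        """Phase 2: record the coordinator's decision and acknowledge it."""
        if self.log[-1] != "abort":  # a site that voted no has already aborted
            self.log.append(decision)
        return True  # ack


def two_phase_commit(coordinator_log: List[str], subs: List[Subordinate]) -> str:
    # Phase 1: ask every subordinate to prepare and collect the votes.
    votes = [s.prepare() for s in subs]
    # Commit only if every vote is yes; a single no forces a global abort.
    decision = "commit" if all(v is Vote.YES for v in votes) else "abort"
    coordinator_log.append(decision)
    # Phase 2: broadcast the decision and wait for acknowledgements.
    acks = [s.decide(decision) for s in subs]
    if all(acks):
        coordinator_log.append("end")
    return decision
```

In this sketch a transaction commits only when every site votes yes, so all sites end up with the same outcome, which is the agreement property the reviewer highlights.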

One weakness is the lack of an introduction to the front-end side, that is, how query requests are allocated to a specific site and how the distributed query plan is generated.