Review for Paper: 27-Spark SQL: Relational Data Processing in Spark

Review 1

This paper introduces Spark SQL, a new module in Apache Spark that integrates relational processing with Spark's functional programming API. It makes two main additions over previous Spark systems: tight integration between relational and procedural processing, and a highly extensible optimizer.

Spark SQL achieves the following goals: supporting relational processing both within Spark programs and on external data sources; providing high performance; supporting new data sources, including semi-structured data and external databases; and enabling extensions with advanced analytics algorithms such as machine learning.

The paper introduces the details of the components of the Spark SQL system. The DataFrame API is the major abstraction that lets users intermix procedural and relational code, and the data model and operations associated with DataFrames are illustrated. The extensible optimizer, Catalyst, is the other major component of the system. This optimizer makes it easy to add new optimization techniques and features to Spark SQL, and it enables external developers to extend the optimizer. Spark SQL also supports developer extensions, such as custom data sources and user-defined functions.
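
To make the intermixing concrete, here is a minimal sketch against the Spark 1.x DataFrame API the paper describes. It assumes a SQLContext named sqlContext is already in scope (as in the spark-shell of that era), and the file name and column names are hypothetical.

    // Load a DataFrame; the schema is inferred from the JSON records (Spark 1.4-era reader API).
    val employees = sqlContext.read.json("employees.json")

    // Relational side: a filter and an aggregation expressed through the DataFrame DSL,
    // which Catalyst can optimize as a single logical plan.
    val counts = employees
      .where(employees("age") > 30)
      .groupBy("deptId")
      .count()                        // result column is named "count"

    // Procedural side: drop back to an RDD of rows and apply an arbitrary Scala function.
    val report = counts.rdd.map(row =>
      s"dept ${row.getAs[Long]("deptId")}: ${row.getAs[Long]("count")} people over 30")

Because the relational part is only a logical plan until an output operation runs, Catalyst is free to optimize it (for example, pruning unused columns) before execution.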

The evaluations performed show both SQL query processing performance and Spark program performance. For SQL performance, Spark SQL is shown to be faster than Shark and competitive with Impala. Besides the performance benefits, Spark SQL also helps non-SQL developers write simpler and more efficient Spark code with the DataFrame API.

I like this paper very much for two reasons. First, it uses pipeline diagrams to illustrate its system design very well. For example, Figure 1 clearly shows the interfaces to Spark SQL and the interaction with Spark, which helped me understand the system's architecture; Figure 3 shows the pipeline of phases when planning a query, which corresponds well with the accompanying text and made the paper much easier to read. Second, in Section 7 the paper provides several examples of how Spark SQL can be integrated into real-life research applications, such as computational genomics; I like this part because it provides insight into both the application and the proposed system.


Review 2

Systems like MapReduce were designed for big data applications that require a mix of processing techniques, data sources, and storage formats. Those systems are not easy to program and require manual optimization. Therefore, newer systems try to provide a relational interface to big data along with richer automatic optimizations. However, relational systems cannot meet user demand for advanced analytics like machine learning and graph processing.

This paper describes Spark SQL, a new component in Apache Spark that tries to combine relational and procedural systems to take advantage of both. The paper first introduces the background of Spark SQL, then presents the important components: the DataFrame API, the Catalyst optimizer, and advanced features. Finally, the paper ends with the experimental evaluation and conclusion.

Some of the strengths of this paper are:
1. Spark SQL easily supports new data sources, including semi-structured data and external databases amenable to query federation.
2. Spark SQL provides high performance using established DBMS techniques.
3. Spark SQL enables extensions with advanced analytics algorithms such as graph processing and machine learning.

Some of the drawbacks of this paper are:
1. Spark SQL has no support for transactional tables.
2. The paper says little about future work and possible improvements to Spark SQL.



Review 3

Many systems involving a mix of processing techniques, data sources, and storage formats take advantage of declarative queries to provide richer automatic optimizations. However, the relational approach is insufficient for many big data applications. In practice, most data pipelines would ideally be expressed with a combination of relational queries and complex procedural algorithms, but these two paradigms have remained largely disjoint, forcing users to choose one or the other.

Therefore, this paper combines both models in Spark SQL, a major new component in Apache Spark, which builds on the earlier SQL-on-Spark effort called Shark. Rather than forcing users to pick between a relational or a procedural API, Spark SQL lets users seamlessly intermix the two.

Spark SQL runs as a library on top of Spark and exposes SQL interfaces, which can be accessed through JDBC/ODBC or through a command-line console, as well as the DataFrame API integrated into Spark’s supported programming languages. The DataFrame API lets users intermix procedural and relational code, and advanced functions can also be exposed in SQL through UDFs.
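
A hedged sketch of how a UDF bridges the two worlds in the Spark 1.x API; sqlContext is assumed to be in scope, and the "users" table is hypothetical and must already be registered.

    // Register an ordinary Scala function under a SQL-visible name.
    sqlContext.udf.register("strLen", (s: String) => s.length)

    // The same function can now be called from a SQL string...
    val longNames = sqlContext.sql("SELECT name FROM users WHERE strLen(name) > 10")

    // ...and the result is just another DataFrame that procedural code can keep working with.
    longNames.rdd.map(_.getString(0)).take(5)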

The key contributions of Spark SQL are as follows:
1. Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark’s built-in distributed collections.
2. To support the wide range of data sources and algorithms in big data, Spark SQL introduces a novel extensible optimizer called Catalyst. Catalyst makes it easy to add data sources, optimization rules, and data types for domains such as machine learning.

The main advantages of Spark SQL are as follows:
1. Support relational processing both within Spark programs (on native RDDs) and on external data sources using a programmer-friendly API.
2. Provide high performance using established DBMS techniques.
3. Easily support new data sources, including semi-structured data and external databases amenable to query federation.
4. Enable extension with advanced analytics algorithms such as graph processing and machine learning.
5. Overall, Spark SQL improves developer productivity for mixed procedural and SQL applications.

The main disadvantages of Spark SQL are as follows:
1. No support for union types: using Spark SQL, we cannot create or read a table containing union fields.
2. To be processed by Spark SQL, data must be in a structured, tabular format that can be queried. This forces Spark to be treated as if it were a relational database, which limits many of the advantages of big data technology.


Review 4

Problem & motivations:
Big data processing has become more popular in recent years. MapReduce gave users a powerful yet low-level procedural programming interface, while traditional declarative queries are not suitable for some advanced analytics, such as machine learning and graph processing.

Main Contribution:
First, it provides an integration of two ways to manipulate data: one is the DataFrame/Dataset API, and the other is the normal SQL query. The Spark SQL module works on structured data processing, with distributed collections of data organized into named columns. It also provides in-memory caching, as Spark does. Second, it provides the Catalyst optimizer, which greatly enhances the performance of user queries; the reason is that Catalyst contains a general library for representing trees and applying rules to manipulate them.



Review 5

This paper introduces Spark SQL, which is a module in the widely used Apache Spark that adds support for relational processing and integrates it with Spark’s functional programming API to get the best of both worlds. Spark users now have access to the advantages of relational processing, like declarative queries, while SQL users can leverage the capabilities of Spark to perform complex tasks like machine learning. Through a declarative DataFrame API, it integrates relational and procedural processing, and its optimizer, named Catalyst, is highly extensible, making it straightforward to build new customized features to handle the various requirements of modern data processing. In contrast to pure relational approaches, which are insufficient for many big data applications, and low-level MapReduce-style systems, which can be complex to optimize and program, Spark SQL lets users use the best features from both paradigms without having to choose one or the other. It follows a previous effort to build relational support on Spark, which was called Shark. Unfortunately, Shark suffered from some limitations, such as the inability to query data inside a Spark program, the fact that Shark could only be called using a SQL string, and the difficulty of extending the Hive optimizer to cover new features such as new data types for machine learning. Spark SQL aims to fix these issues in order to meet the following goals:

1. Support relational processing on RDDs and on external data sources
2. Provide high performance using established DBMS techniques
3. Make it easy to add support for new data sources, like semi-structured data
4. Enable extensions for advanced algorithms such as graph processing and machine learning.

Spark SQL is built on top of Apache Spark, which is a general cluster computing engine with APIs in Scala, Java, and Python, where data is organized into Resilient Distributed Datasets (RDDs): collections of Java or Python objects partitioned among many machines. They can then be operated on with map, filter, reduce, and other functions. Spark SQL uses many of the same concepts from RDDs in its own abstraction, the DataFrame, which is a distributed collection of rows with the same schema and can be considered equivalent to a table in a relational database. It is lazy in that a DataFrame object is a logical plan to compute a dataset, but no execution occurs until a user calls an output operation such as save. Users can call relational operators such as projection, filter, join, and aggregation, much like in a relational database. In order to interoperate with procedural code in Spark, Spark SQL allows users to construct DataFrames against RDDs of objects native to the programming language. Additionally, Spark SQL uses a columnar cache to reduce memory footprint, and it offers user-defined functions. Besides the new data model, the authors implemented a new extensible optimizer, Catalyst. It supports both rule-based and cost-based optimization. The main data type of Catalyst is a tree made up of node objects, where each node has a node type and zero or more children. Trees are manipulated using rules, which are functions from one tree to another (a toy sketch of such a rule appears after the list of phases below). The general tree transformation framework runs in four phases:

1. Analyzing a logical plan to resolve references
2. Logical plan optimization
3. Physical planning
4. Code generation to compile parts of the query to Java bytecode
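
To make the trees-and-rules machinery concrete before those phases, here is a toy version of the constant-folding rule from the paper; the case classes below are simplified stand-ins, not Catalyst's actual expression types.

    // Toy expression tree in the spirit of the paper's x + (1 + 2) example.
    sealed trait Expr
    case class Literal(value: Int) extends Expr
    case class Attribute(name: String) extends Expr
    case class Add(left: Expr, right: Expr) extends Expr

    // A rule is a partial function from tree to tree, applied bottom-up at every node.
    def transform(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
      val rewritten = e match {
        case Add(l, r) => Add(transform(l)(rule), transform(r)(rule))
        case leaf      => leaf
      }
      if (rule.isDefinedAt(rewritten)) rule(rewritten) else rewritten
    }

    // Constant folding, written in the style of the paper's example rule.
    val constantFold: PartialFunction[Expr, Expr] = {
      case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
    }

    // x + (1 + 2)  becomes  x + 3
    val folded = transform(Add(Attribute("x"), Add(Literal(1), Literal(2))))(constantFold)

Catalyst's real transform method works in the same spirit, except that rules are grouped into batches and run repeatedly until the tree stops changing.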

Besides these features, Spark SQL allows user-defined types, schema inference for semistructured data, integration with Spark’s machine learning library, and other ways that allow it to adapt well to new use cases as they appear.

The main strength of this paper is the introduction of a new module on a widely used system, Spark, that finally offers support for relational processing while greatly improving on the functionality and performance of a previous attempt, Shark. Besides the improvements in extensibility and flexibility over Shark, the performance results shared by the authors show significant performance improvements for various queries on the order of 2x improvements nearly across the board. Additionally, the use of DataFrames allowed Spark SQL to outperform handwritten versions of the code (that do not use DataFrames) by up to 12x. This has important implications for the various applications that currently use Spark and could benefit from being able to work with data in a relational manner.

While the results for this system are certainly noteworthy, the test conditions are somewhat arbitrary: the authors created a random dataset of 1 billion integer pairs on which they tested various queries. It would be interesting to see how Spark SQL performs against real-life workloads, especially ones with different characteristics; after all, one of their goals was to create an extensible optimizer that handles varying use cases. Also, a performance comparison against lower-level computing frameworks such as a MapReduce system would have been interesting, to see the tradeoffs of Spark SQL.


Review 6

The main contribution of this paper was the Spark SQL module in Apache Spark. Apache Spark has APIs in Scala, Java, and Python and includes libraries for streaming, graph processing, and machine learning. Spark SQL's predecessors, such as MapReduce, were difficult to program, which led the field to seek a more user-friendly experience by offering a relational interface to the underlying data. This made certain things difficult, however, like working with semi-structured or unstructured data and performing advanced analytics. Spark SQL attempts to combine the ideas of procedural and relational systems, which were historically separate in previous distributed data processing systems. Spark SQL builds on a previous contribution of the authors, called Shark, which was SQL-on-Spark, except that this more recent contribution combines the relational and procedural options.

Spark SQL features a DataFrame API that can perform relational queries, as well as an optimizer called Catalyst that eases the addition of data sources and optimization rules. Spark itself has a functional programming API that lets us create Resilient Distributed Datasets (RDDs), which are fault tolerant because their data can be recovered by following lineage logs, as we saw in previous readings. DataFrames have the advantage of storing data in a compact columnar format, using SQL to perform multiple aggregates at once, and using the Catalyst relational optimizer.

Catalyst uses features of the Scala programming language to help with its query optimization rules. The DataFrame API allows users to mix relational and procedural code. A DataFrame is like a table in a relational database, but with rows distributed across multiple nodes. DataFrames are also lazy in that the data itself is not stored; rather, a plan to construct the data from stored data is kept and executed only when the result is needed. Spark SQL also allows user-defined functions for working with big data.
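
A small, hedged sketch of that interoperation against the Spark 1.3+ API; sc and sqlContext are assumed to be in scope, and the User class is made up for the example.

    // An ordinary RDD of Scala objects...
    case class User(name: String, age: Int)
    val rdd = sc.parallelize(Seq(User("ann", 34), User("bob", 29)))

    // ...becomes a DataFrame whose schema is taken from the case class's fields.
    val users = sqlContext.createDataFrame(rdd)

    // Laziness: this only builds a logical plan...
    val adults = users.filter(users("age") > 30).select("name")

    // ...and nothing is computed until an output operation is called.
    adults.count()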

I like how this paper explicitly stated the shortcomings of the predecessors of Spark SQL and told a story of what motivated the current paper’s contribution. The story was thorough and covered all the challenges at each stage of the problem’s history. I did not like, however how the



Review 7

This paper gives an introduction to the Spark SQL interface. There are two main advantages of Spark SQL: (1) it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code; (2) it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, which makes it easy to add composable rules, control code generation, and define extension points.

The DataFrame is the main abstraction defined in Spark SQL; it is a distributed collection of records that all share the same schema. The Catalyst optimizer defines a tree data type and uses it to represent and optimize queries. Spark SQL can support semi-structured datasets like JSON and also supports machine learning components.
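
As a hedged illustration of the JSON support (Spark 1.4-era reader API; sqlContext, the file, and its fields are assumptions):

    // The schema, including nested fields, is inferred by scanning the JSON records.
    val tweets = sqlContext.read.json("tweets.json")
    tweets.printSchema()

    // The inferred columns can then be queried relationally, nested fields included.
    tweets.select("user.name", "text").show()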

The main contribution of this paper is the presentation of Spark SQL, a new module in Apache Spark providing rich integration with relational processing. It offers benefits such as automatic optimization and lets users write complex pipelines that mix relational queries and complex analytics. It also supports a range of features.

The advantage of this paper is that Spark SQL has many great features, such as analyzing data at large scale and supporting many data types for machine learning.

A drawback of the paper is hard to find, but when I searched for more resources, Spark SQL is said to lack advanced security features.


Review 8

The paper proposes Spark SQL, a module that integrates relational processing with Spark. In this paper, two major components are introduced: the DataFrame API and Catalyst.

A DataFrame is a collection of records with a unified structure that is logically equivalent to a table in a relational database. The difference from an RDD is that a DataFrame also records the schema information of the data.

The DataFrame API supports Spark's existing procedural APIs and also adds new relational operation APIs such as select, groupBy, and so on. Users can use the DataFrame API by writing code, or they can issue Spark SQL queries via JDBC/ODBC.
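
A hedged sketch of that choice in the Spark 1.x API; sqlContext is assumed, and the Parquet file and its columns are hypothetical.

    val logs = sqlContext.read.parquet("logs.parquet")

    // Expose the DataFrame under a name so it is visible to SQL queries.
    logs.registerTempTable("logs")

    // The same question asked as a SQL string...
    val errorsSql = sqlContext.sql("SELECT msg FROM logs WHERE level = 'ERROR'")

    // ...or as equivalent DataFrame operations; both go through the same Catalyst planner.
    val errorsDsl = logs.where(logs("level") === "ERROR").select("msg")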

Catalyst is an extensible query optimizer. As can be seen from the Spark SQL architecture diagram in the paper, the logical plan corresponding to a DataFrame is optimized and converted by Catalyst until it finally becomes a physical plan over RDDs, which is then submitted to Spark for computation.

Catalyst's query optimization for DataFrames can be divided into the following steps. First, the information in the DataFrame object constructed by the user through the DataFrame API is used to build an abstract syntax tree (AST) describing the computation over the corresponding data. Second, the resulting logical plan represents the logical execution plan for computing the DataFrame, with the schema information for all source data loaded; Catalyst then performs logical optimization, which optimizes this logical execution plan. In the third step, physical planning, Catalyst converts the optimized logical plan into a number of corresponding physical plans and selects the best physical plan based on their computational cost (cost-based optimization). Finally, for the selected physical plan, Catalyst uses the quasiquotes feature provided by Scala to generate the corresponding RDD computation code and submits it to the Spark engine for execution.
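
The last step can be sketched with a simplified version of the paper's quasiquote example; the Expr classes here are stand-ins for Catalyst's real expression types.

    import scala.reflect.runtime.universe._   // provides the q"..." quasiquote interpolator

    sealed trait Expr
    case class Literal(value: Int) extends Expr
    case class Attribute(name: String) extends Expr
    case class Add(left: Expr, right: Expr) extends Expr

    // Translate an expression tree into a Scala AST
    // (type coercion is omitted, as in the paper's simplified example).
    def compile(node: Expr): Tree = node match {
      case Literal(value)   => q"$value"
      case Attribute(name)  => q"row.get($name)"
      case Add(left, right) => q"${compile(left)} + ${compile(right)}"
    }

    // Produces an AST equivalent to: row.get("x") + 3
    val ast = compile(Add(Attribute("x"), Literal(3)))

Because quasiquotes are type-checked, malformed generated code is caught early rather than at run time, which is one of the reasons the paper gives for using them.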

The contribution of the paper mainly lies in proposing Spark SQL to integrate relational processing with Spark. Second, it provides relational processing over RDDs and external data sources through a programmer-friendly API. Moreover, it supports extensions with advanced analysis algorithms, including graph processing and machine learning algorithms. Finally, Spark SQL supports more external data sources, including semi-structured and unstructured data sources.



Review 9

Spark SQL is a system that integrates a relational query language with the Spark API. The authors had previously released another system, "Shark," which was a modification of Hive and forced users to choose between the traditional procedural Spark API and the SQL interface. In addition to SQL and the procedural Spark API, DataFrames are added. These are a lot like what users would work with in Python's Pandas framework. Spark SQL is useful because many types of data processing can be used together, and it is easy to pull in multiple data sources - from Spark RDDs to external databases.

One big contribution is the Catalyst optimizer, which handles optimization for the queries and DataFrames. One piece of Catalyst is code generation - specifically generation of Java bytecode. The authors mentioned that this is a large part of what makes Spark SQL competitive in terms of performance with Impala, which uses C++ and LLVM. I found it impressive that Spark SQL was able to be competitive with a system taking advantage of LLVM.

This idea of having a SQL interface for Spark is important because it allows users who might not be functional programming experts to interact with data in Spark. This is the same thing that has happened with Hive - people who don't know about the underlying data sources and have no idea how to write a MapReduce program are able to gain insights from data. Moreover, people who can do both are able to choose the correct tool for the job, with seamless integration from Spark SQL. This is the main promise of the system.

My biggest complaint about this paper is that it seemed like the Spark SQL system was trying to do *everything*. It almost felt as if there were too many features, and they were all mentioned in the paper. I got the idea that it would be somewhat difficult to read and understand code written with the Spark SQL module, and that it might be hard to decide how to write code efficiently. For instance, it was mentioned that the DSL for DataFrames can be used to interact with data, but also you can use SQL. I think it would be difficult to learn about all of the possible use cases of Spark SQL and how to best take advantage of it for your use case.



Review 10

This paper introduces Spark SQL, a new module built on top of Spark. It aims to bridge the gap between relational queries and complex procedural algorithms, since a purely relational approach is insufficient for today's big data applications. In addition, a major problem of the existing relational interface on Spark is that it can only query external data stored in Hive. However, many data-intensive applications need to access data from multiple sources. Therefore, the main goals of Spark SQL include supporting relational processing over multiple data sources (both internally from Spark RDDs and externally, for example from Hive) and enabling the simple addition of new data sources and advanced analytics algorithms. Spark SQL realizes these goals through two components: the DataFrame and an extensible query optimizer called Catalyst.

A DataFrame is a distributed collection of rows with the same schema. It keeps track of its own schema and supports various relational operations. It can be constructed from a Spark RDD or from external data sources. Just like RDDs, DataFrames are lazy: operations on them only construct a logical plan, and no execution occurs until a special "output operation" is called.

Catalyst is an extensible optimizer and is the heart of Spark SQL. Catalyst uses a standard feature of the Scala programming language, pattern matching, to specify optimization rules; using this feature, complex rules can be expressed in only a few lines of code. Catalyst contains a library for representing trees and rules. A tree can represent an expression, a logical query plan, a physical plan, and so on, while a rule is just a function that maps one tree to another. Catalyst is used to perform four tasks: resolving references in a logical plan, optimizing the logical plan, creating physical plans, and generating Java bytecode. It also allows users to define new data sources and types as long as certain rules are followed.

Overall, I think this is a well-written paper that is not very difficult to understand. The only suggestion I have is that the authors could explain more about the DataFrame. The paper only mentions its API and the operations that can be applied to it; there is no information about how it is implemented and how it is represented internally.





Review 11

In the paper "Spark SQL: Relational Data Processing in Spark", Michael Armbrust and co-authors discuss Spark SQL, a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Current big data applications require a combination of processing techniques, data sources, and storage formats. To meet these needs, MapReduce was created. However, MapReduce still had many limitations; namely, its low-level procedural programming interface burdened users with many manual optimizations in order to gain performance. In response, systems such as Pig, Hive, Dremel, and Shark rose to answer these concerns. However, Armbrust et al. want to take it a step further and provide support for both declarative queries and advanced analytics such as machine learning and graph processing. Spark SQL enables users to get the best of both worlds - the benefits of relational processing and the benefits of complex analytics libraries in Spark. Specifically, there are two main contributions of Spark SQL: it offers a tighter integration between relational and procedural processing, and it includes a highly extensible optimizer that makes it easy to add composable rules, control code generation, and define extension points. If one had to label Spark SQL, it would be an evolution of SQL-on-Spark and of Spark itself.

Spark SQL has four goals that it wants to achieve in order to avoid the original problems that its ancestor, Shark, had:
1) Support relational processing on Spark programs
2) Provide high performance with standard DBMS techniques
3) Easily support new data sources
4) Enable an extension to advanced analytics algorithms (ML, Graph Processing)
Keeping these goals in mind, the structure of Spark SQL (and the paper) is the following:
1) Programming Interface: Spark SQL runs as a library on top of Spark. It exposes the SQL interface and allows users to intermix procedural and relational code. Furthermore, Spark SQL's main abstraction, the DataFrame, is equivalent to a table in a relational database and can be manipulated in a similar way. DataFrames keep track of their schema and support various operations with optimized execution. Spark SQL also adopts the nested data model from Hive for DataFrames and supports all major SQL data types. Some additional important points are that Spark SQL can materialize "hot data" in memory using columnar storage and supports user-defined functions (a vital extension for database systems).
2) Catalyst Optimizer: The typical workflow looks like the following: SQL query or DataFrame -> tree construction -> optimized logical plan -> physical plans -> cost model to select the best physical plan -> RDDs.
3) Advanced Analytics Features: 3 features were added to support analytics:
i) Spark SQL includes a schema inference algorithm for JSON and other semi-structured data
ii) Spark SQL is incorporated into a new high-level API for Spark’s machine learning library (a brief sketch follows this list)
iii) Spark SQL supports query federation, allowing a single program to efficiently query disparate sources
4) Evaluation: As one might expect, Spark SQL performs better than Shark and is generally competitive with Impala across the benchmarked declarative SQL queries, although on the UDF query the results are more or less the same (Spark SQL is very slightly faster). Additionally, to justify the use of DataFrames, they were evaluated against the Python and Scala APIs on a distributed aggregation task, and DataFrames greatly outperformed them.
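
To make point 3-ii above concrete, here is a hedged sketch of the DataFrame-based spark.ml pipeline API; training is an assumed DataFrame with hypothetical "text" and "label" columns.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Each stage reads and writes DataFrame columns, so Spark SQL manages the schema end to end.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr        = new LogisticRegression().setMaxIter(10)

    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)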

Much like other papers, this paper had some drawbacks. The first is the placement of the "Related Work" section near the end of the paper rather than the beginning. Since Shark, extensible optimizers, and advanced analytics are discussed in both that section and the introduction, it would have been better to use that material to facilitate the discussion earlier on. Another drawback is my disappointment at the lack of future work. Spark SQL seems to be at an early stage and does not yet fully support all current DBMS techniques, something that could be handled in the future. To circumvent this, the authors made the system open source, but it seems like they were simply avoiding the issue by doing so.


Review 12

This paper describes Spark SQL, which is an extension of Spark that allows for easy processing of relational data. One of the disadvantages of Spark is that it requires a decent amount of coding to run properly; Spark SQL rectifies that by introducing a DataFrame object, which allows the user to run standard relational operations, and its own optimizer, Catalyst, which processes these operations. In addition, Spark SQL supports easy methods for users to customize their own extensions to the program.

A Spark DataFrame is essentially the equivalent to a table in an RDBMS. A DataFrame can be built from any one of several sources, like external datasets or other internal structures. The user can perform standard relational operations on a DataFrame, like filtering, joining, and aggregation. However, the DataFrame lazily executes these operations; it doesn’t actually apply them until it needs to output some result. The DataFrames can be manipulated with Scala, so users can define their own functions on DataFrames using Scala or Java functions.

The Catalyst optimizer is used to evaluate operations on DataFrames. It represents a plan as a tree of nodes, and it applies rules to these nodes. Each rule is usually some kind of pattern matching; node structures matching a certain pattern are changed into something else. However, rules can be implemented as any arbitrary code. Catalyst groups rules into batches, and runs each batch of rules on the tree until it stops changing.

Spark SQL has several benefits. Notably, it contains several of the advantages of Spark, while allowing users to interact with standard relations instead of Spark RDDs, so that users don’t have to learn too much of a foreign language to use it. On the other hand, since Spark is built in Scala, it’s very easy to extend; users can easily define new data sources, functions, and optimizer rules. They can even define their own types, as long as they can map them back and forth between Catalyst structs.
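
To give a flavor of how small a new data source can be, here is a minimal, hedged sketch against the Spark 1.x data sources API; RangeRelation is a made-up relation that exposes the integers 0 to 99 as a one-column table.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, TableScan}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    class RangeRelation(override val sqlContext: SQLContext) extends BaseRelation with TableScan {
      // Catalyst sees this schema and plans relational operators over it like any other table.
      override def schema: StructType =
        StructType(Seq(StructField("value", IntegerType, nullable = false)))

      // Spark calls buildScan() to obtain the rows as an ordinary RDD.
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(0 until 100).map(Row(_))
    }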

The downside of Spark SQL seems to be mostly that its speed improvements are not as impressive: in the experiments, Spark SQL outperformed Shark but was only on par with Impala. Aggregation with DataFrames does seem to be faster than with custom code, however. These were the only experiments run, which seems to be a downside of the paper itself; there are not many other comparisons for Spark SQL.



Review 13

Spark SQL solves two interesting problems simultaneously: better support for declarative / SQL programming on nested / semistructured / big data, and better integration of procedural with declarative programming.

The main idea is to provide a DataFrame API which abstracts a data source (RDD, Hive table, CSV file, etc.) and provides relational-style operations which can be optimized. This allows developers to seamlessly operate on many different data stores, including Java/Python objects. They also introduce the Catalyst optimizer, a highly extensible query optimizer. One important thing to notice, in my opinion, is how flexibly the entire system was built: it is easy to define new data sources, optimizations, UDTs, and UDFs, all of which play nicely together.

The contribution is that it provides another way to manipulate data, which is actually a much-demanded feature for data processing frameworks like Spark and MapReduce. Thanks to the tight integration of Spark SQL and Spark, developers can enjoy the rich features provided by relational models while not losing the flexibility of procedural processing.

Spark SQL shares fate with Spark and inherits its weakness: it performs worse on the kinds of intensive, non-iterative workloads where Spark itself is weak.





Review 14

This paper introduces Spark SQL, a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Spark SQL makes two main additions compared with previous systems. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. DataFrames are collections of structured records that can be manipulated using Spark's procedural API or using new relational APIs that allow richer optimizations. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. The goals of Spark SQL are to support relational processing both within Spark programs and on external data sources using a programmer-friendly API, to provide high performance using established DBMS techniques, to support new data sources, including semi-structured data and external databases amenable to query federation, and to enable extension with advanced analytics algorithms such as graph processing and machine learning.

The paper then goes into the details of the programming interface. The main abstraction in Spark SQL's API is the DataFrame. DataFrames keep track of their schema and support various relational operations that lead to more optimized execution. Spark SQL uses a nested data model based on Hive for tables and DataFrames, and it also supports user-defined types. Operation-wise, DataFrames support all common relational operators, including projection, filter, join, and aggregation. Although DataFrames provide the same operations as relational query languages like SQL, they are significantly easier for users to work with. As noted earlier, Spark SQL supports user-defined functions, and its DataFrame API supports inline definition of UDFs, without the complicated packaging and registration process found in other database systems.

The second major component of Spark SQL is the Catalyst optimizer. Catalyst supports both rule-based and cost-based optimization. The main data type in Catalyst is a tree composed of node objects; each node has a node type and zero or more children. Trees can be manipulated using rules, which are functions from a tree to another tree, and the paper presents examples to illustrate how rules work. Catalyst's general tree transformation framework has four phases: (1) analyzing a logical plan to resolve references, (2) logical plan optimization, (3) physical planning, and (4) code generation to compile parts of the query to Java bytecode. Several advanced analytics features are also discussed in the paper: schema inference for semistructured data, integration with Spark's machine learning library, and query federation to external databases.

The advantage of this paper is that it provides enough examples to illustrate how Spark SQL works. When explaining the technical details of DataFrames, the paper uses many concrete examples to illustrate each feature.

The disadvantage of this paper is its structure and font size. The layout is very compact, which is not comfortable to read. Also, the examples are set at different sizes, and some fonts are very small, which makes them hard to read.




Review 15

“Spark SQL: Relational Data Processing in Spark” by Armbrust et al. describes Spark SQL, a new component in Apache Spark that allows users to both write relational queries and create complex procedural algorithms over their data. Data is stored in DataFrames and can be accessed either using SQL or procedural API calls.



Review 16

In this paper, the authors propose a new library on top of Spark, called Spark SQL, which provides relational processing both within Spark programs and over external data sources. The problem they set out to solve is to create a platform that provides both relational and procedural functionality in Spark. This problem is important for several reasons. First of all, the demands of big data applications are growing very quickly, and the traditional MapReduce programming interface is limited by being low-level and procedural. On the declarative-query side, we also face insufficiency problems, such as Extraction-Transformation-Loading (ETL) across various data sources and great demand for advanced analytics like graph processing and machine learning. At the time, there was a big gap between relational and procedural systems. The key idea of Spark SQL is to never force users to pick either relational or procedural, but to let them seamlessly intermix the two. Next, I summarize the crux of Spark SQL in my own words.

Before the introduction of Spark SQL, Shark was the relational interface on Spark: a modified Hive system that implemented traditional RDBMS optimizations. However, Shark suffered from limitations such as limited data sources, being error-prone, and being hard to extend. To avoid these drawbacks and support more desirable features, the authors introduced Spark SQL, which supports relational processing both within Spark programs (on native RDDs) and on external data sources, and provides high performance using established DBMS techniques. It is also flexible enough to support new data sources, including semi-structured data and external databases, and is extensible with advanced analytics algorithms. Spark SQL uses a data model called the DataFrame, which is a distributed collection of rows with the same schema. DataFrames can be constructed from external data sources or from existing RDDs; they support relational operators as well as Spark operations and are evaluated lazily. The model supports relational operations via a DSL which builds up an AST of expressions that is then optimized by Catalyst. It also provides in-memory caching, a columnar storage scheme with hot columns cached in memory; this caching is useful for interactive queries and iterative algorithms. Besides, DataFrames support UDFs.
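
A small, hedged illustration of the caching point (Spark 1.x API; sqlContext and the file are assumptions):

    val events = sqlContext.read.json("events.json")

    // cache() only marks the DataFrame; it is materialized in the compressed,
    // columnar in-memory format the first time an action touches it.
    events.cache()

    // Later queries over the same data are then served from the in-memory columns,
    // which is what makes interactive and iterative workloads fast.
    events.groupBy("country").count().show()
    events.where(events("country") === "US").count()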

Another important component is the optimizer, Catalyst, with an extensible design. It is flexible because it is easy to add new optimization techniques and features to Spark SQL and to enable external developers to extend the optimizer. Catalyst provides rule-based and cost-based optimization and utilizes standard features of Scala. The main data type in Catalyst is a tree, and its rules use pattern-matching functions that transform subtrees into specific structures. For query processing, Catalyst uses a general tree transformation framework with four phases: analysis, logical optimization, physical planning, and code generation. An attribute is unresolved if its type is not known or it has not been matched to an input table; to resolve attributes in the first phase, Catalyst looks up relations by name in the catalog, maps named attributes to the input provided by the given operator's children, generates unique IDs for references to the same value, and finally propagates and coerces types through expressions. In logical optimization, it applies standard rule-based optimizations. In physical planning, it generates one or more physical plans from a logical plan, applying both kinds of optimization. Finally, it generates Java bytecode to run on each machine, which speeds up execution. Furthermore, Catalyst can be extended by adding new data sources and UDTs. Besides, Spark SQL also provides some advanced analytics features: a schema inference algorithm for JSON and other semi-structured data, incorporation into a new high-level API for Spark's machine learning library, and query federation, allowing a single program to efficiently query disparate sources.

There are several technical contributions of Spark SQL. First of all, it achieves ETL across various data sources through the DataFrame API, which is able to perform relational operations on both external data sources and Spark's built-in RDDs. Second, it supports advanced analytics by using Catalyst, which utilizes features of Scala to add composable rules, control code generation, and define extension points. Apart from these contributions, there are many advantages of Spark SQL. Since Spark SQL is mainly based on Spark, it is friendly to Spark users because they can easily extend their programs and enjoy the new features of Spark SQL. Also, it supports both procedural and relational APIs, which is flexible for different use cases. Besides, Spark SQL utilizes many good features of the Scala language, and the code in the query optimizer is quite short, which makes it easy to maintain.

I think the downsides of this paper are minor. First, the authors could extend the schema inference techniques with novel entity resolution techniques that can handle more diverse types of data, even unstructured data. Second, in the physical plan generation phase of query planning, they use cost-based physical optimization; however, this optimization is only used to select join algorithms, which I think is the main limitation of the query optimizer. They should explore further uses of cost-based optimization and add more support for it.



Review 17

This paper introduces Spark SQL, which is a module integrated into Apache Spark. Spark SQL's main goal is to bridge the gap between relational queries and procedural algorithms (commonly used in analytics). People aren't happy with just relational queries because many analytics applications (such as machine learning with big data) involve things like semi-structured or unstructured data, which relational systems do not handle well. Spark SQL bridges this gap with two main parts:

1) Spark SQL provides a DataFrame API. DataFrames in Spark SQL are similar to data frames in R and Python's Pandas; they hold structured data while providing an API to run relational queries *as well as* procedural algorithms. Because they are embedded in a full programming language, they are also easier to work with than declarative query languages like SQL. The API is also much more optimized, using techniques such as in-memory columnar caching to improve performance.

2) Spark SQL provides an optimizer called Catalyst. Catalyst's primary feature and advantage is its extensibility: the authors were very concerned with the future usability of this optimizer and thus made it extensible by avoiding the complicated domain-specific languages that other optimizers have required in the past. Catalyst's main structure is a tree, to which rules (functions built around pattern matching) are applied.

The paper's main strength/contribution is similar to the main advantage of the hybrid-workload data blocks paper: it recognizes that there are many situations where people want the advantages of two different models in the same DBMS, and it provides a solution for this. I don't see any theoretical weakness, but I do wish they had compared Spark SQL against more than just Shark and Impala so we could see performance against other architectures as well.