Review for Paper: 30-Spark SQL: Relational Data Processing in Spark

Review 1

Apache Spark allows client programs to run distributed code whose dataflow is a directed acyclic graph, a more general model than MapReduce. But Spark does not innately provide a declarative, SQL-style interface, so much effort is required from native Spark programmers to write efficient algorithms. To fix this problem, a variety of declarative solutions have been implemented, such as Shark, an early SQL-on-Spark interface. Unfortunately, Shark made it difficult for users to mix SQL code with Scala code, such as conventional Spark programs, and it lacked an optimizer that could manage the overall dataflow of a Spark project.

The Spark SQL project addresses these problems with a SQL-like query language that is integrated with a Scala interpreter. In this way, select, group by, aggregate, and other query operations can be mixed with code in Scala and Python, similar to how object-oriented data models allow data manipulation code to be called directly from C++. Spark SQL has an optimizer called Catalyst, which users can extend via a Scala library. If users add user-defined data types or functions to a Spark SQL fork, they can make corresponding extensions to the optimizer, providing rule-based or cost-based guidance for constructing query plans over their new data types.
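
Concretely, the mixing looks something like the following minimal sketch (Spark 1.x-era SQLContext; an existing SparkContext sc is assumed, and table and column names are illustrative): a DataFrame built by ordinary Scala code is registered as a table, queried with SQL, and the result, which is again a DataFrame, is handed back to Scala code.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Build a DataFrame from ordinary Spark/Scala code.
    val logs = sc.textFile("events.log")
      .map(_.split("\t"))
      .map(p => (p(0), p(1).toInt))
      .toDF("user", "latencyMs")

    // Query it with SQL, then keep transforming the result in Scala.
    logs.registerTempTable("logs")
    val slow = sqlContext.sql(
      "SELECT user, AVG(latencyMs) AS avgMs FROM logs GROUP BY user")
    val worst = slow.filter(slow("avgMs") > 500).collect()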

Spark SQL performs dramatically faster than naive implementations of certain types of Spark queries, due to optimizations in storage methods and query plans. For example, the DataFrame storage format used by Spark SQL automatically stores data as columns, which allows only the needed attributes to be read from storage for certain queries, and allows for more compact representations than the Scala objects used in native Spark code. Another feature of DataFrames is that they are lazily evaluated, so computation is not wasted generating data that is not required. For example, an intermediate table in a computation may be pipelined by the optimizer into an output table to improve performance, or may never be created at all if there is a more efficient way to compute the result (such as using an index to count matching rows).
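
As a small illustration of that laziness (names are illustrative; events is assumed to be an existing DataFrame), the transformations below only build up a logical plan, and nothing executes until the final action, which gives the optimizer the whole pipeline to work with:

    val errors = events.filter(events("status") === "error")   // no job runs yet
    val perDay = errors.groupBy("day").count()                  // still only a plan
    perDay.cache()                   // stored in columnar form once computed
    val n = perDay.count()           // the action: execution actually happens here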

Spark SQL is slower than Impala, a rival SQL-on-Hadoop engine for relational queries. Impala is implemented in C++ with LLVM code generation, and is thus expected to be somewhat faster. But it appears that Impala produces better query plans than Spark SQL in some cases, due to effective selectivity estimation. This means there is room for improvement in Catalyst.


Review 2

This paper focuses on Spark SQL, a new module in Apache Spark which integrates relational data processing with Spark's functional processing. Since big data applications require various processing techniques, data sources, and storage formats for different types of workloads, a new model combining both procedural and relational approaches was required, catering to multiple needs. This motivated the authors to integrate the two in Spark.

Spark SQL supports relational processing both within Spark programs and on external data sources using a programmer-friendly API. It supports new data sources, including semi-structured data, all the while providing high performance using DBMS techniques. The programming interface of Spark SQL comprises a DataFrame API along with an optimizer called Catalyst, built on top of Spark. A DataFrame is a distributed collection of rows with the same schema. It is analogous to a table in a relational database and to Spark's distributed collections (RDDs). Spark SQL uses a nested data model based on Hive for tables and DataFrames, and it supports in-memory caching and user-defined functions. Catalyst, a new extensible optimizer written in Scala, was devised to implement Spark SQL. Its extensible design makes it easy to add new optimization techniques and features to Spark SQL, allowing even third-party developers to modify it according to their needs. Features such as schema inference for semi-structured data, integration with Spark's machine learning library, and query federation to external databases address the various problems in big data workloads.

The paper is successful in providing a comprehensive study of Spark SQL. A performance evaluation is provided based on different types of queries across various database systems. It also provides insight into practical research applications, rounding off the paper nicely.

However, the performance evaluations show it is competitive with Impala rather than the sole winner. Even though this system has many applications, it is yet to be established in the market since it is a new system. It is reported to be buggy and unsupportive of some third-party applications. The paper should have discussed third-party application support more extensively.



Review 3

Big data applications require a mix of processing techniques, data sources and storage formats. The earliest systems designed for these workloads, such as MapReduce, gave users a powerful, but low-level, procedural programming interface. Programming such systems was onerous and required manual optimization by the user to achieve high performance. As a result, multiple new systems sought to provide a more productive user experience by offering relational interfaces to big data. However, the relational approach is insufficient for many big data applications.
First, users want to perform ETL to and from various data sources that might be semi- or un-structured, requiring custom code.
Second, users want to perform advanced analytics, such as machine learning and graph processing, that are challenging to express in relational systems.

Therefore, the authors argue that a combination of both relational queries and complex procedural algorithms is the answer, and they built Spark SQL, which lets users seamlessly intermix the two.

This paper makes two key contributions: an API called DataFrame and an extensible optimizer, Catalyst.
First, Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark’s built-in distributed collections. This API is similar to the widely used data frame concept in R, but evaluates operations lazily so that it can perform relational optimizations.
Second, to support the wide range of data sources and algorithms in big data, Spark SQL introduces a novel extensible optimizer called Catalyst. Catalyst uses features of the Scala programming language, such as pattern matching, to express composable rules in a Turing-complete language. It offers a general framework for transforming trees, which the system uses to perform analysis, planning, and runtime code generation. Through this framework, Catalyst can also be extended with new data sources.
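
The paper illustrates this with a constant-folding rule written as a Scala pattern match; a sketch of the same idea adapted to Spark 1.x-era Catalyst classes (internal APIs, shown only to convey the flavor) might look like:

    import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.types.IntegerType

    // Fold the addition of two integer constants into a single constant,
    // wherever the pattern appears in the plan's expressions.
    def foldIntAdds(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
      case Add(Literal(c1: Int, IntegerType), Literal(c2: Int, IntegerType)) =>
        Literal(c1 + c2)
    }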

I think it's a great paper, and the result lets developers enjoy the rich features provided by relational models while not losing the flexibility of procedural processing.


Review 4

This paper presents a new module called Spark SQL, which is a major new component in Apache Spark. In short, Spark SQL integrates relational processing with Spark's functional programming API and is designed to offer richer APIs and optimizations while keeping the benefits of the Spark programming model. Spark SQL has two main additions: a declarative DataFrame API that provides tighter integration between relational and procedural processing, and a highly extensible optimizer, Catalyst, that makes it easy to add composable rules, control code generation, and define extension points. This paper first gives an overview of Spark. Then it describes the DataFrame API and the Catalyst optimizer. Then it provides a performance evaluation. Finally, it presents external research built on Catalyst and related work.

The problem here is that relational systems are sometimes insufficient, especially for many big data applications. For example, users want to perform ETL on data sources that might be semi-structured or unstructured, which requires custom code. Also, users want to perform advanced analytics, such as machine learning and graph processing. Thus this paper provides a new approach for such applications called Spark SQL.

The major contribution of the paper is the Spark SQL module itself, which provides rich integration with relational processing while maintaining the properties of Spark. It has been implemented in Apache Spark and shows great performance according to the test results. Here we summarize the key components of Spark SQL below:

1. A DataFrame API that performs relational operations on both external data sources and Spark's built-in distributed collections
2. An extensible query optimizer, Catalyst, which uses features of Scala to express composable rules in a Turing-complete language
3. The capability of being extended with new data sources (semi-structured data, user-defined functions, user-defined types for domains)
4. High performance using established DBMS techniques

One interesting observation: this paper is in general good and presents a well-performing module. It also gives a detailed description of the main additions, the DataFrame API and the Catalyst optimizer. However, the paper does not say much about how the design ideas for those two key components arose. It would be better to add the motivations and where the ideas came from, which would help others understand and innovate.


Review 5

This paper introduces Spark SQL and describes its features and components. Although the Spark API is easy to use and programmers can express simple SQL-style queries with it, a SQL engine becomes very important for complex queries that are hard to write with the Spark API and/or hard to optimize manually.

Spark SQL provides a declarative way to run SQL queries on Spark.
The most essential part of Spark SQL is the DataFrame class. In Spark SQL, users can write their queries as SQL strings, and they can also use the declarative DataFrame API. I've used the DataFrame API in Spark SQL and can confirm its ease of use and good performance.

Catalyst, which is the query processor and optimizer, is another important part of Spark SQL. Catalyst brings automatic query optimization to Spark SQL. It is designed to be very flexible and extensible.


Review 6

Spark SQL is an incredibly bold idea - it attempts to combine Spark procedural processing with the relational model, and the declarative nature that accompanies that. The platform seeks to leverage the benefits of both relational processing - declarative queries and optimized storage - with the ability to do complex analytics that require a procedural language to describe the algorithm. Spark SQL achieves this through the DataFrame API - a dataframe is the Spark SQL equivalent of a table - which integrates with procedural code. By combining this with Catalyst - the optimizer - we have a very powerful system.

The system is motivated by the recent "big data" movement. Traditional RDBMSs do not scale, but the MapReduce approach that has become popular also has serious limitations. Spark solves some of those problems by allowing more iterative computations, among other things. Spark SQL is the next evolution of Spark, but it is based heavily on the lessons learned from the authors' implementation of Shark, an earlier SQL-on-Spark attempt.

The DataFrames API is similar to the query builders seen in modern web development frameworks - it provides an object-oriented way to build queries - .join(), .groupBy(), etc. - but users can also just enter raw SQL statements. The API can also be used to query data stored in Scala, Java, and Python RDDs - the system will automatically infer the schema. DataFrames can also be used directly with Spark's machine learning library, which is built around the "pipelines" concept.
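
A minimal sketch of both points (Spark 1.x-era API; class, field, and variable names are illustrative): the schema of an RDD of case classes is inferred by reflection, after which the builder-style operators apply.

    case class Visit(userId: Int, url: String, durationMs: Long)

    import sqlContext.implicits._
    val visits = visitRdd.toDF()        // visitRdd: RDD[Visit]; schema inferred
    val topUrls = visits
      .groupBy("url")
      .count()
      .orderBy($"count".desc)
      .limit(10)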

Catalyst, the optimizer for Spark SQL, is also built in Scala, like the rest of the Spark system. Catalyst is designed to be extensible, so developers can add support for new data types. It supports both cost-based and rule-based optimizations. The optimizer first comes up with the logical plan, which it optimizes, before then considering physical plans. It selects the 'best' physical plan based on a cost model.

Spark SQL's performance seems roughly on par with Impala, which is impressive - Impala is an MPP (massively parallel processing) SQL-on-Hadoop solution written in C++. For a system combining relational and procedural styles while being very easy to use, this is impressive. It will be interesting to see which of these "big data" architectures ends up winning. Spark's support of Scala makes it very powerful, but the language is different enough that it might prove to be a steep learning curve for new users.


Review 7

Problem and solution:
The problem mentioned in the paper is that the earlier generation of big data systems with procedural models, like MapReduce, require the user to do manual optimization, which is complex, while current relational databases do not work well for big data applications. So a new system with relational interfaces and procedural performance is needed. Spark SQL is proposed as a solution to this problem as an extension to Spark. The user does not need to choose between a relational model and a procedural one: the module can optimize relational queries and run them in Spark as procedural processes. In this way, not only do Spark users get the benefits of query optimization, but SQL users can also use the complex libraries in Spark easily without doing low-level programming. To accomplish this, Spark SQL uses the DataFrame API for relational operations and uses the extensible optimizer Catalyst to add data sources, rules, and data types. It is an important improvement to Spark and became popular in the big data community in a short time.

Main contribution:
The main contribution of Spark SQL is that it creates Catalyst. Catalyst is built in Scala and makes rule writing easy compared to other extensible optimizers. It uses trees to organize query plans. One of its strengths is optimization, and the other is its extensibility. The optimization function is implemented by doing logical plan analysis and optimization, then physical planning and compilation of parts of queries to Java bytecode. The extensibility is more interesting: it enables the wide capability of Spark SQL. It allows users to add some extensions without even understanding the existing rules. New data sources can be added with corresponding APIs, and developers can define new types that Catalyst maps to its built-in types for easy use.

Weakness:
Though Spark SQL is quite powerful, there are still some weaknesses. One significant weakness is that, since it uses extra data structures and machinery like DataFrames and Catalyst to enable relational processing, its performance can be lower than that of hand-tuned, low-level big data systems, because time is spent analyzing and optimizing the queries.


Review 8

This paper introduces Spark SQL, a component built on top of Spark Core. Spark SQL supports the SQL language through command-line interfaces and JDBC. It provides a data abstraction called DataFrames, which supports structured and semi-structured data. Spark SQL also provides an extensible rule-based optimizer, Catalyst, to generate efficient execution code from the logical plans of DataFrame operations. In addition, Spark SQL integrates with MLlib, Spark's machine learning library, to make data analysis simpler and more productive.

Spark SQL makes great contributions, as it adds a declarative API to Spark (like SQL) and designs a corresponding optimizer for it. DataFrames are more convenient than Spark's procedural API because they simplify data manipulation; for instance, onerous procedural coding is eliminated when a complex operation can be expressed once against DataFrames using simple SQL. In addition, the lazy approach to operating on DataFrames expands the potential for optimization, such as pipelining. Besides the more productive API, Spark SQL provides a framework for further, rule-based optimization. The optimizer is extensible; hence, not only can external users add features to fit more specific workloads, but the optimizer itself can also be easily upgraded. This makes the interface provided by Spark SQL not only easy to code against but also efficient in execution.

Despite these strengths, there are still some concerns about Spark SQL:
1. Compared to the optimizers in relational DBMSs, Catalyst is not yet mature. The metadata of DataFrames is limited compared to the variety of information in relational metadata, such as histograms. In addition, though the rule-based Catalyst is extensible, the logical plans it generates may sometimes be sub-optimal.
2. Though the syntax for DataFrames is similar to SQL, it is not as expressive as SQL. Much coding labor might still be required.
3. Spark SQL adds an extra layer above Spark. Though it provides a productive API, the code it generates might be less efficient than code written directly against Spark Core. Considering this, when performance is really important and the code can often be reused, skipping Spark SQL can become an option.


Review 9

Problem/Summary:

Spark improved upon the MapReduce model by allowing acyclic dataflows and cached intermediate results using Resilient Distributed Datasets (RDDs). RDDs could be operated on and assigned procedurally using different operators. Other systems like Shark have built upon Spark, to allow SQL statements to be executed within the Spark framework. However, SQL is a declarative/relational language instead of a procedural one, and so systems like Shark did not allow the mixing of these two types of languages. With SparkSQL, the programmer is able to manipulate data in a language that has both procedural and relational elements.

The main datatype in SparkSQL is the Dataframe, which like an RDD is a collection of data which is distributed across nodes. However, Dataframes keep track of their own schema, which allows them to be manipulated using relational-like operators. Since these expressions are embedded in a programming language, intermediate results can be saved and reused, and more complex control flows can be implemented. SparkSQL also lazily computes Dataframes just like RDDs, which allows for optimizations.

SparkSQL uses its own Catalyst query optimizer, which is easily extensible through rules. Rules contain a pattern matcher for certain tree structures in the query plan, and perform an action on any part of the tree that matches its required structure. Rules can be implemented using standard Scala language features, which makes writing rules concise and powerful. The Catalyst query optimizer also compiles code to speed up performance. Currently, it does not use much cost-based query optimization.
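
As a rough sketch of what such a rule looks like against Spark 1.x-era Catalyst internals (these are internal, not stable public, APIs, and the rule itself is a hypothetical simplification rather than one taken from the paper): a rule is a function over a plan tree whose body is a Scala pattern match.

    import org.apache.spark.sql.catalyst.expressions.Literal
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.types.BooleanType

    // Hypothetical rule: drop Filter nodes whose predicate is the constant true.
    object RemoveTrivialFilters extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case Filter(Literal(true, BooleanType), child) => child
      }
    }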

Strengths:
This paper clearly points out some limitations of current systems, and shows how SparkSQL exceeds previous systems in its expressiveness and flexibility. The authors are also aware of what use cases they are targeting with SparkSQL, and so have introduced smart extensions (JSON schema inference, integration with Spark’s machine learning library).

Weaknesses/open questions:
One big weakness is the lack of cost-based optimization, but I expect that this is being worked on.



Review 10

The paper describes Spark SQL, a new module in Apache Spark designed to integrate relational processing with Spark's functional programming API. It has become quite common for the big data community to add relational interfaces to their distributed data processing systems; for example, there is Apache Hive for Hadoop, and there was already Shark for Spark. Then why do the authors implement a new system to support a relational interface when they already have Shark? Shark is fundamentally limited because it is based on Apache Hive, and the authors have something more ambitious in mind with Spark SQL.

The authors identify that many of the workloads in Spark are expressed well with a combination of relational and procedural interfaces. For loading and storing data, a user is more likely to prefer a relational interface (e.g., SQL) and let the system handle the details. For processing and manipulating data, a user should be able to specify the complex steps required to process the data, which is done better with a procedural interface. Spark SQL aims to achieve this by providing an interface that mixes relational and procedural processing seamlessly.

There are two main components of Spark SQL: the DataFrame API and Catalyst. A DataFrame is a special data object used by Spark SQL, and Catalyst is an extensible optimizer inside Spark SQL. These two components together make it possible to program large-scale data processing jobs that mix SQL and a procedural language. The use of the Scala language enhances this feature and makes it much easier and more intuitive to program. This seamless integration of two different interfaces is the key contribution of this paper.

One thing that I found slightly disappointing in the paper is the lack of a detailed description of the actual usage of the Catalyst optimizer in terms of its extensibility and potential problems. For example, I have a question about the Catalyst optimizer: how does it prevent optimization rules that conflict with each other or do not converge to a fixed point? This is a natural problem that could occur because Catalyst is arbitrarily extensible by users, but the paper does not address this issue.

In conclusion, Spark SQL is a new module in Apache Spark that integrates relational and procedural interfaces, which makes it very easy to express large-scale data processing jobs. The seamless integration of the two interfaces is the key contribution of the paper. It could lead to a new unified interface for large-scale data processing in the future.


Review 11

Users are used to writing declarative queries, but the relational model performs poorly when handling big data. Furthermore, it is hard to perform complex analytics such as machine learning in a relational system. This paper describes Spark SQL, which combines traditional relational processing with Spark. Spark SQL allows users to use both the relational API of SQL and the procedural API of Spark. In order to provide this functionality, Spark SQL introduces the DataFrame API and the Catalyst optimizer.

A dataframe represents a logical plan to compute a dataset, which can be considered a table in SQL. A dataframe is only evaluated when an output operation such as count() is performed. Dataframes can have relational operations such as project and group by performed on them. Furthermore, dataframes can be computed from RDDs or semistructured data.

The Catalyst optimizer is an extensible optimizer that supports cost-based and rule-based optimizations. It is designed so that it is easy to implement new optimizations and easy to extend its functionality. To create plans, the optimizer looks at tree objects and applies rules to manipulate the tree. Rules are applied using pattern matching that identifies subtrees and replaces them with new subtrees. Each rule represents a transform pass over the tree and may need to be executed multiple times to fully transform the tree.

Spark SQL is a very interesting approach to big data. The paper recognizes that most users are used to and prefer the relational model due to its simple interface and declarative query language. However, there is also recognition of the need to support more complex functionality, such as machine learning over big data, and so Spark is used as well. I think the most useful aspect of the system is the fact that Spark SQL expresses the relational model in a procedural way. This transformation of the relational model is similar to past efforts when researchers were trying to create database models that were more akin to programming languages. I think Spark SQL is a successful attempt at this, and it has the added benefit of possessing the powerful distributed data processing capabilities of Spark.


Review 12

This paper proposes Spark SQL, which integrates relational processing with Spark. The authors state that most data pipelines are better expressed with a combination of both relational queries and procedural algorithms. Spark SQL achieves this by allowing SQL users to use the "procedural" analytics libraries in Spark, and allowing Spark programmers to use the "relational" (declarative) query processing features.

The relational/procedural integration is offered by DataFrame API and Catalyst, which are the two main contributions of Spark SQL:

1. DataFrame API:
In addition to SQL interfaces, Spark SQL integrates the DataFrame API into Spark's programming languages. A DataFrame is a distributed collection of rows with the same schema, which is essentially a (sub-)table in a relational database. Many Spark components and libraries take and produce DataFrames. Users can perform relational operations to manipulate DataFrames. The operations are integrated into programming languages such as Scala, Java, or Python, which is very convenient for users. In addition, DataFrame operations in Spark SQL go through Catalyst.

2. Catalyst:
Catalyst is an extensible relational optimizer, which supports both rule-based and cost-based optimization. It is based on Scala and utilizes the features of a functional programming language, which makes optimization rules easy to specify. The basic idea of Catalyst is to construct trees and apply rules to them. A Catalyst tree consists of node objects. The node types are defined in Scala, and the nodes are manipulated using functional operations. Users can apply rules (e.g., via pattern matching in Scala) to subtrees.

This paper also gives experimental results for Spark SQL by evaluating its performance on both SQL query processing and Spark programs. For SQL queries, Spark SQL outperforms Shark and shows comparable performance to Impala. For Spark programs, the DataFrame API performs much better than the native Python and Scala APIs, since most of the work is compiled into JVM bytecode.

The major contribution of this paper is that it presents the motivation for Spark SQL and its advantages, as well as its design and implementation. It is a combination of relational and procedural models. According to this paper, Spark SQL is one of the most actively developed components in Apache Spark, the most active open source project for big data processing.


If there is any drawback to this paper, I would say that the experiments could have been more thorough, with more details.



Review 13

Motivation for Spark SQL

Spark SQL is a module in Apache Spark that adds relational processing to Spark's functional programming. Spark programmers can thus benefit from relational features like declarative queries and optimized storage, and SQL users can use analytics in Spark like machine learning. Existing tools that handle big data, such as MapReduce, require low-level, manual optimization; declarative queries in SQL are intuitive to write, but do not provide easy ways of expressing advanced analytics like machine learning and graph processing. The motivation for Spark SQL is thus the need to handle the mix of processing techniques, data sources, and storage formats in big data applications.

Details of Spark SQL

Spark is a computing engine built on clusters with APIs for streaming, graph processing, and machine learning. Users manipulate distributed collections called Resilient Distributed Datasets (RDDs), which are collections of objects split across a cluster and can be processed (map, filter, reduce) by sending functions of the programming language to nodes on the cluster. Spark is fault-tolerant; the system can recover lost data through lineage graphs of RDDs. The RDDs in the API are evaluated lazily, each representing a "logical plan" to compute a dataset, with computation starting only on output operations.

The goals for Spark SQL are to support relational processing for Spark programs on RDDs and on external data sources using an easy-to-use API, provide high performance using known DBMS techniques, support new data sources easily (including semi-structured data and external databases amenable to query federation), and enable extension with advanced analytics algorithms such as graph processing and machine learning. Spark SQL runs on top of Spark and is accessed through JDBC/ODBC, a command-line console, or the DataFrame API, which allows users to mix procedural and relational code.

A DataFrame is a distributed collection of rows with the same schema, equivalent to a table in a relational database; it keeps track of its schema and supports various relational operations that lead to more optimized execution. DataFrames, constructed from tables in a system catalog or from existing RDDs, can be manipulated with various relational operators and can be viewed as an RDD of Row objects. Each DataFrame object is lazy, representing a logical plan to compute a dataset, with no execution until a special output operation from the user. The data model of Spark SQL uses a nested data model based on Hive for tables and DataFrames. Spark SQL supports the major SQL data types, non-atomic data types (structs, arrays, maps, unions), and user-defined types. DataFrame operations are relational operations expressed in a domain-specific language: common relational operators take expression objects in a limited DSL that lets Spark capture the structure of the expression. These expressions are assembled into an abstract syntax tree and passed to Catalyst for optimization, unlike the core Spark API, which takes functions containing arbitrary Scala/Java/Python code that is opaque to the runtime engine.
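
As a rough sketch of that expression-object DSL (essentially the paper's running example; table and column names are illustrative), every predicate below is a Column expression whose structure Catalyst can inspect, rather than an opaque Scala closure:

    import org.apache.spark.sql.functions.count

    val result = employees
      .join(dept, employees("deptId") === dept("id"))
      .where(employees("gender") === "female")
      .groupBy(dept("id"), dept("name"))
      .agg(count("name"))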

Strengths of the paper

I appreciated the many examples and easy to follow explanation of the implementation of Spark SQL. It was interesting to learn about a system that can combine relational databases and functional programming.

Limitations of the paper

I would’ve liked to have seen some more experiments comparing number of instructions executed rather than just run time. It would have also been appreciated to see a discussion of operations that SparkSQL cannot handle, and what plans are being made to handle those in the future.



Review 14

Spark SQL is a module for Spark that is introduced in this paper; it combines relational processing from SQL with procedural processing from Spark. Spark SQL implements an API called DataFrame that allows automatic optimization and the implementation of pipelines for complex analytics. Spark SQL supports semi-structured data and data types for machine learning through the Catalyst optimizer, which is easy to extend through Scala. The paper compares four types of queries against Shark and Impala and then describes two research applications that could benefit from this system.

This paper presents two significant contributions: the DataFrame API and the Catalyst optimizer. The DataFrame API makes it easy to support and use relational processing with RDDs as well as external data sources, and to use established DBMS techniques. Catalyst makes it easy to implement optimizations for various big data problems. The authors give the examples of generalized online aggregation and computational genomics. In computational genomics, people look for overlapping regions in a set of data, which is a general problem that could have wider applications in big data. They have a decent evaluation comparing against Shark and Impala. They show that Spark SQL is competitive with Impala, which is written in C++. The DataFrame API is shown to be faster than the Python and Scala APIs. These all constitute strengths of the Spark SQL paper.

There aren’t many drawbacks of this paper. It was written by members of the Databricks company who have been modifying Spark for the past five years based on their interactions with client companies. The most notable drawback of this paper is seen in the evaluation. The benchmark queries are not very extensive. We see that Impala is still better than Spark SQL in some situations. If the authors had elaborated on the differences between these systems it could have strengthened the sixth section of this paper.





Review 15

Part 1: Overview

This paper presents a new module in the Apache Spark framework that integrates relational queries with Spark's functional and procedural APIs. The module is called Spark SQL. Spark SQL is designed to let relational database users use complex analytics libraries in Spark. The DataFrame API, which integrates with Spark's procedural functions, is actually declarative. Building on lessons from the previous Shark work, Spark SQL better integrates relational queries and procedural function calls and can therefore provide users with consistent APIs. The goals for Spark SQL are to support relational processing both within Spark programs and on external data sources while providing programmers with user-friendly APIs, to provide high performance using existing database techniques, to enrich the supported data source types, and to enable extension with advanced analytics algorithms including graph processing and machine learning.

The DataFrame API takes input queries from applications through JDBC/ODBC and a console, as well as from user code in Java, Scala, or Python. Users can perform relational operations on DataFrames using a DSL; relational operations including projection, filter, join, and aggregation are supported. In-memory caching is used when materialization is needed, and it can save memory compared to storing JVM objects. User-defined functions are more user-friendly than in relational databases, since programmers can write them in languages like Java and Scala. Rules are used by the Catalyst module in Spark SQL to manipulate query plan trees via pattern matching.
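
A minimal sketch of the UDF and caching points (Spark 1.x-era API; table, column, and function names are illustrative): a Scala closure is registered once and then used from SQL like any built-in function.

    sqlContext.udf.register("strLen", (s: String) => s.length)
    users.registerTempTable("users")
    sqlContext.sql("SELECT name, strLen(name) AS len FROM users").show()
    users.cache()   // materialized in the compact in-memory columnar format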

Part 2: Contributions
Catalyst, the highly extensible optimizer, is embedded in the Spark SQL module, and thus Spark SQL can provide users with an easy Scala programming interface and APIs for data formats such as JSON. Although the MapReduce model and big data thrive in industry, people badly need an easy way to implement machine learning algorithms on top of key-value data stores.

To better support machine learning programmers, Spark SQL imitates some APIs from the R language. Catalyst is used so that different data source types can be supported. The DataFrame API is carefully designed for analytical users and applications. DataFrames can be easily passed between applications written in Java, Scala, or Python.

Spark SQL outperforms naive Spark code on computations expressible in SQL, by up to 10x. It also beats existing SQL-only systems on Hadoop.

Part 3: Drawbacks
Spark SQL is still young and is still finding its place alongside established SQL databases. There are many issues to be solved, and many standards need to be settled. Right now people are split into a Hive camp and a Spark SQL camp, and the two need to be interlinked in the near future.



Review 16

The paper describes Spark SQL, which integrates relational data processing with the Spark functional programming model. The problems addressed by Spark SQL are twofold. On the one hand, relational database systems, though preferred for writing declarative queries, are found to be insufficient for many big data applications: users want to process data to and from various sources that might be semi-structured or unstructured, which requires custom code. On the other hand, users want advanced analytics such as machine learning and graph analysis, which are challenging to express in relational systems and require complex procedural algorithms. The main solution provided by Spark SQL is combining relational models and procedural algorithms, giving users the benefits of both: relational systems leave room for query optimization, whereas procedural algorithms provide the complex programming required by advanced data analytics.

Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's distributed collections. DataFrames are collections of structured records that can be manipulated using Spark's procedural API or using relational APIs that allow optimizations. DataFrames make it easy to compute multiple aggregates in one pass using a SQL statement, which is difficult to express with traditional functional programming. In addition, DataFrame operations are optimized by a relational query optimizer called Catalyst. Catalyst has an extensible design: it can be extended by users with data-source-specific rules that push filtering or aggregation into external storage systems, or with support for new data types. Catalyst supports both rule-based and cost-based optimization.

Furthermore, Spark SQL provides several advanced features. It includes a schema inference algorithm for JSON, allowing users to query the data right away rather than using separate libraries to map JSON to Java objects. In addition, Spark SQL is integrated with Spark's machine learning library, allowing efficient large-scale processing. It also supports query federation, which allows a single program to efficiently query disparate sources.

One of the main strengths of the paper is that Spark SQL combines techniques from relational systems and procedural programming. It takes the declarative query language from relational systems and the ability to write complex algorithms from the procedural programming model. Most other similar systems avoid relational systems altogether and try to build from the ground up, leaving behind the long investment in RDBMSs. In addition, Spark SQL provides support for important classes of problems, including machine learning and graph analysis.

One of the main limitations is that mixing relational features with complex functional programming might make the overall system more complex than SQL alone. It would have been great if the authors had discussed the potential problems of mixing these different models together. In addition, one of Catalyst's main selling points is its extensibility by users. It would have been useful to identify the benefit of Catalyst without this feature, as user involvement in the query optimizer requires more expertise and may limit adoption by large customers.


Review 17

===overview===
This paper introduces a new module in Apache Spark, Spark SQL, which integrates relational processing with Spark's functional programming API. The two main additions are:
1. much tighter integration between relational and procedural processing.
2. a highly extensible optimizer, Catalyst, that makes it easy to add composable rules, control code generation, and define extension points.

The interface is trying to solve the dilemma where people have to choose between two classes of systems - relational and procedural. This paper explains the effort put to combine both models through two contributions:
1. a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections.
2. Catalyst, a novel extensible optimizer to support the wide range of data sources and algorithms in big data.

By providing this new interface to Spark, the following goals are reached:
1. support relational processing within Spark programs and on external data sources
2. provide a programmer-friendly API
3. be flexible in supporting new data sources
4. achieve high performance using established DBMS techniques
5. be extensible with advanced analytics algorithms

===strength===
The diagram describing the interfaces to Spark SQL and its interaction with Spark is helpful and clear. It demonstrates the key concepts of the project. The code samples between the lines are short but concise, which helps understanding a lot.

===weakness===
Some terminology and abbreviations are not explained at the beginning, which is sometimes distracting and confusing. It makes the project seem complex, but the point of a paper is not to show off but to successfully communicate ideas.


Review 18

This paper describes Spark SQL, a declarative language feature that extends Spark. The paper contains two main parts: Spark SQL itself and Catalyst, the query optimizer for Spark SQL.
Spark SQL:
The central concept in Spark SQL is the DataFrame. It can be viewed as a table in a DBMS: it contains data with a certain schema. Spark SQL operations such as 'select', 'join', and 'where' can be applied to DataFrames, just as SQL operations are applied to a table in a DBMS. But there are a few differences: 1) since Spark SQL can be mixed with other code, Spark SQL operations are evaluated lazily; 2) logical plan analysis is done eagerly, which means the programmer gets errors while writing the code. Spark SQL also supports other features like in-memory caching and UDFs.
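
A small sketch of that lazy-execution, eager-analysis split (names are illustrative): no Spark job runs until the action at the end, but a bad column reference fails immediately with an analysis error.

    val adults = people.where(people("age") > 21)   // analyzed eagerly, executed lazily
    // people.where(people("agee") > 21)            // error reported at this line
    val n = adults.count()                          // execution is triggered here
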
Catalyst:
Catalyst is an extensible optimizer designed for Spark SQL. The most interesting thing about this optimizer is that it allows programmers to extend it using a standard, simple-to-understand mechanism. The operations in a query plan are arranged in a tree structure, just as in other optimizers, and a programmer can write his or her own code to transform the Catalyst tree. After optimization, the query plan is carried out by operating on the corresponding RDDs.
There are also other important things about Spark SQL. With the goal of supporting big data and machine learning in mind, Spark SQL has schema inference for semi-structured data like JSON and integration with the machine learning library, making it easier to use.

Contribution:
The main contribution of this paper is the implementation of SQL on top of a procedural data manipulation system. The idea of users extending the optimizer is also inspiring.
What I like most about this paper is that a DataFrame can be manipulated like a normal RDD. This means we can mix declarative operations with Spark's procedural operations.
Weakness:
Since it implements SQL, I would have liked to see results comparing the performance of a traditional DBMS and Spark SQL.



Review 19

Spark SQL is a module that lets users combine complex machine learning algorithms and database-style operations with the functional style of the original Spark.

Spark SQL introduces Catalyst, the optimizer that helps users perform typical database operations in a straightforward manner. One of the key contributions seems to be the DataFrame, a schema-based data collection (unlike the RDD), which helps boost the performance of the system. DataFrame operations are compiled into abstract syntax trees that can be optimized and executed faster than opaque code written in one of the host languages. Spark SQL also stores DataFrames in a columnar format, which, as we learned from C-Store, can be very useful for analytical workloads such as the machine learning algorithms in Spark. A lot of the features seem to have been included to make Spark SQL more user-friendly, which I was really impressed with. For example, Spark SQL reports an error the moment the user writes invalid code or refers to an invalid table, thereby increasing productivity. They provide support for database queries inside Java/Scala/Python programs, thereby enabling DBMS functionality within programming languages.

One of the other things that was great was that Spark SQL enables the user to work with JSON data in an easier manner, because its schema inference algorithm is able to infer the types of the data. However, if there are ambiguous data types that the system is not able to resolve, Spark SQL might end up storing a lot of the data as strings, thereby not yielding the kind of efficiency you would get otherwise. One of the disadvantages I saw in their results is that, apart from the results produced by conversion to DataFrames, the results of using plain Python or Java code are not as impressive; so if you are trying to bring code from a different domain onto this platform and cannot leverage the DataFrame optimizations, you may not reap the full benefit, since it will not use Catalyst.
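
A minimal sketch of the JSON workflow the review describes (Spark 1.4+-era reader API; the file name and fields are illustrative):

    val tweets = sqlContext.read.json("tweets.json")
    tweets.printSchema()                 // schema inferred from the JSON records
    tweets.registerTempTable("tweets")
    sqlContext.sql("SELECT user.name, text FROM tweets WHERE user.followers > 1000")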


Review 20

Spark SQL: Relational Data Processing in Spark

In this paper, the authors introduce a new module in Apache Spark which provides relational support on top of Spark's low-level API. Although Hadoop and Google's MapReduce provide a powerful option for large-scale data processing, there is a fundamental shortcoming: low-level programming for declarative-style queries is complicated for programmers. And users may want to combine relational queries with complex procedural algorithms.

Hence Spark SQL was created to satisfy this need. Spark SQL is designed to provide a lazily evaluated DataFrame API, which abstracts a data source (an RDD, Hive table, CSV file, etc.) and provides relational-style operations that can be optimized. A major component of this solution is a highly extensible optimization engine called Catalyst. A DataFrame is equivalent to a table in a relational database and can also be manipulated in similar ways to RDDs. At the same time, DataFrames keep track of their schema and support various optimized relational operations.
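
A small sketch of the "one abstraction, many sources" point (Spark 1.x-era APIs; variable names, paths, and table names are illustrative, and a HiveContext is assumed for the Hive table):

    val fromRdd  = sqlContext.createDataFrame(userRdd)   // an RDD of case-class records
    val fromJson = sqlContext.read.json("events.json")   // a semi-structured file
    val fromHive = hiveContext.table("visits")           // an existing Hive table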

The major contributions of this paper are:
1. The influential idea put forward: mixing declarative/relational programming with procedural programming. Compared to the previous idea of adding UDFs to a relational DBMS, the new approach values both kinds of functionality in large-scale data processing.

2. Performance gains: Spark is reported to be up to 100 times faster than Hadoop. Since Spark SQL has in-memory support, the latency caused by disk reads is largely avoided.

But there are still some weaknesses in this paper:
1. The memory consumption is significant, and the paper mentions only a few things about garbage collection around this issue.

2. Just like MapReduce, it still follows a shared-nothing query model: CPUs running on different machines cannot directly share data with each other. Hence it is still difficult to run some sophisticated algorithms on top of this framework, compared to MPI.




Review 21

Big data applications require a mix of processing techniques, data sources, and storage formats. On one hand, people need the flexibility provided by a procedural programming interface so that they can run logic like ETL and machine learning; on the other hand, once the data is structured, people often want to use declarative ways to do their jobs.

This paper talks about Spark SQL, a new module in Apache Spark that integrates relational processing with Spark's functional programming API. The system gives users both a relational query API and a procedural API.

Spark SQL offers much tighter integration between relational and procedural processing through a declarative DataFrame API that allows relational processing. This lets users write declarative queries as well as complex pipelines that mix relational and procedural analysis. A DataFrame is a distributed collection of rows with the same schema. A DataFrame is equivalent to a table in a relational database; it keeps track of its schema and supports various relational operations that lead to more optimized execution. DataFrames can be constructed from tables in a system catalog.

It also includes a highly extensible optimizer, Catalyst, that makes it easy to add composable rules, control code generation, and define extension points. Thus it is easier to add optimization rules, data sources, and data types to Spark SQL.

Strength:
This paper describes Spark SQL and its two main components, DataFrame and Catalyst, in detail. It uses experimental results to show that Spark SQL provides a substantial speedup over previous SQL-on-Spark engines. The concept I like most is mixing declarative programming with procedural programming; I think it can also be useful in many other areas.

Weakness:
Spark SQL simplifies the work of database programmers by making the work more declarative, which saves programmer time. But one thing to notice is that performance may matter more than programmer time in a parallel computing model. And an optimizer cannot always optimize a program better than the programmer, because the programmer knows more about each specific job.


Review 22

This paper presents a high-level interface called SparkSQL for querying data on Spark. Spark provides a rich API for manipulating large datasets. However, to make even simple queries requires writing programs and understanding some of the details of Spark. SparkSQL addresses this by providing a SQL-like interface on top of Spark, allowing developers to freely mix Spark and SparkSQL queries.

SparkSQL provides a library that allows users to make SQL queries on top of RDDs using
the DataFrame API and Catalyst. The DataFrame API allows users to write queries in a SQL-like manner while splitting query building across multiple languages. Catalyst is an extensible query optimizer that uses code-generation to achieve high performance and supports user-defined types.
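
One reason splitting query building across a host language is useful: queries can be assembled with ordinary control flow instead of string concatenation, while Catalyst still sees a single optimizable plan. A minimal sketch (Spark 1.x-era API; names are illustrative):

    import org.apache.spark.sql.DataFrame

    def recentOrders(orders: DataFrame, minAmount: Option[Double]): DataFrame = {
      val recent = orders.filter(orders("year") >= 2014)
      // Apply the optional predicate with plain Scala logic.
      minAmount.map(m => recent.filter(orders("amount") >= m)).getOrElse(recent)
    }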

The authors provide numerous examples demonstrating the usefulness of SparkSQL and show that SparkSQL performs favorably against existing systems. However, SparkSQL has a number of limitations:
* As Spark is built for data analytics, SparkSQL is limited in its applications as it is unable to update data
* SparkSQL does not persist data across program instantiations. This means that SparkSQL is unable to create indexes or perform other optimizations to speed up queries
* For use cases where users want to export data into Spark for analysis separately from data generation, SparkSQL will be unable to process changes made after the data was last exported. In other words, when operating on data taken from other databases, SparkSQL will return stale results.


Review 23

This paper presents a new module in Apache Spark called Spark SQL, which allows users to use a relational API to query data in Spark. A few papers ago, we read that Spark can be used with Scala, a functional programming language, but the popularity of relational databases implies that users like to write declarative queries instead. Therefore, Spark SQL was created to allow users to mix both relational and procedural programming, bridging the gap between the two types of APIs.

At a high level, Spark SQL is a computing engine with interfaces to Java, Scala, Python, JDBC and the console. The user programs use the DataFrame API to interact with Spark SQL and then in turn, Spark SQL runs the queries through the Catalyst Optimizer to send to Spark. By doing this, Spark SQL tries to accomplish the following:

1. Support relational APIs for both programs and external data sources.
2. Still maintain good performance with existing DBMS techniques.
3. Easily support new data schemas like semi-structured data.
4. Create extensions that work with machine learning and graph processing.

After getting the queries from either the program or an external source, Spark SQL runs the SQL query or DataFrame through the Catalyst Optimizer. The optimizer goes through the analysis phase, the logical optimization phase and the physical planning phase. Finally, after looking at the cost model, the optimizer will generate code to create the RDDs for Spark.
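
A small way to observe these phases in practice (names are illustrative): calling explain(true) on a DataFrame prints the parsed, analyzed, and optimized logical plans along with the chosen physical plan.

    val q = sales.filter(sales("amount") > 100).groupBy("region").count()
    q.explain(true)   // prints each plan that Catalyst produced for this query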

Overall, the paper does a great job of introducing Spark SQL and its advantages over regular relational APIs and Spark. However, I still feel like there are weaknesses with the paper:

1. For some queries, wouldn't the programmer want to write them out directly in Scala and Spark to gain performance? If the query were big enough, the time difference between the hand-written and generated queries might be significant enough that the programmer would not want to use Spark SQL.



Review 24

This paper discusses Spark SQL, a module for Apache Spark that integrates relational processing with a functional programming API. Spark SQL was designed to provide users with a programming interface that makes it much easier to write and optimize code for complex applications such as machine learning tasks. It is used for large systems that contain thousands of nodes and up to 100 petabytes of data. The primary features of Spark SQL are a declarative DataFrame API that ties simple declarative syntax to procedural code, and Catalyst, an optimizer that makes the tasks of writing composable rules and generating code at runtime much simpler.

DataFrames are the Spark SQL equivalent of tables. They are distributed collections of rows that all share the same schema. DataFrames allow for highly efficient execution of code because they are lazy: each one represents a logical plan for computing a dataset, and the actual execution of this plan does not occur until a special output operation is called. Storing only the computation plan rather than immediately materializing a dataset provides much more opportunity for optimization and allowed the authors to achieve a 2x speedup over Shark, their previous system, for most queries. DataFrames also allow for a great deal of extensibility and ease of programming, as users can create DataFrames directly against RDDs and create operators that scan native objects in a seemingly relational manner.
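As a concrete illustration of this laziness, here is a minimal sketch, assuming a SparkSession named `spark` and an illustrative input path (and using the modern entry point rather than the SQLContext of the paper's era):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lazy-dataframes").getOrCreate()

val users   = spark.read.parquet("hdfs://.../users.parquet") // no job runs yet
val young   = users.where(users("age") < 21)                 // still just a logical plan
val byState = young.groupBy("state").count()                 // the plan keeps growing

byState.show() // only this output operation triggers optimization and execution
```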

The Catalyst optimizer represents query plans as trees composed of nodes. Users can write “rules,” which are functions that operate on and manipulate these tree structures. Catalyst uses these rules in combination with Scala’s pattern matching functionality to give users more fine-grained control over query optimization. When compared against Shark and Impala, Spark SQL outperformed Shark in every case and was competitive with Impala. The authors also ran experiments showing that integrating relational and procedural programming with DataFrames significantly outperformed pipelines that ran SQL and Spark code separately for different parts of the workload.
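To give a flavor of such tree-transforming rules, here is a small self-contained sketch that mimics how Catalyst matches patterns over expression trees; the class names echo the paper's constant-folding example but are simplified stand-ins, not the real Catalyst types.

```scala
// A toy expression tree with a bottom-up transform, in the spirit of Catalyst.
sealed trait Expr {
  def transform(rule: PartialFunction[Expr, Expr]): Expr = this match {
    case Add(l, r) =>
      val rewritten = Add(l.transform(rule), r.transform(rule))
      rule.applyOrElse(rewritten, identity[Expr])
    case other =>
      rule.applyOrElse(other, identity[Expr])
  }
}
case class Literal(value: Int)          extends Expr
case class Attribute(name: String)      extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// A "rule" is just a partial function over the tree: fold additions of constants.
val plan   = Add(Add(Literal(1), Literal(2)), Attribute("x"))
val folded = plan.transform {
  case Add(Literal(a), Literal(b)) => Literal(a + b)
}
// folded == Add(Literal(3), Attribute("x"))
```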

My chief complaint about this paper is that it does not offer any data about Spark SQL’s performance on very large datasets. At the beginning of the paper, the authors state that Spark SQL is used for databases that run on thousands of nodes and process petabytes of information, yet the dataset in the experiments they run is only a few gigabytes in size. As Impala tended to outperform Spark SQL slightly on the least selective query in each set, I would expect this performance gap might grow as the dataset grows to the order of terabytes or petabytes. Therefore, I would have liked to see the authors demonstrate the value of their system on the enormous datasets they claim it is being used for.




Review 25

This paper presents Spark SQL, a new module in Apache Spark providing rich integration with relational processing. Spark SQL lets Spark programmers take advantage of relational processing and lets SQL users call complex analytics libraries in Spark. To let users write pipelines that mix relational and complex analytics, Spark SQL extends Spark with a declarative DataFrame API. Thus, the paper introduces a novel approach to data analysis.

First, the paper describes the programming interface of Spark SQL. Spark SQL runs as a library on top of Spark. It exposes SQL interfaces, which can be accessed through JDBC/ODBC, and it provides the DataFrame API, which lets users intermix procedural and relational code. A DataFrame is equivalent to a table in a relational database. Unlike RDDs in Spark, DataFrames keep track of their schema, which supports relational operations and enables more optimized execution. For its data model, Spark SQL uses a nested data model based on Hive for tables and DataFrames. In addition, Spark SQL lets users register user-defined functions that can be called from their queries.
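As a small illustration of the user-defined function support, here is a hedged sketch; the SparkSession `spark`, the function name, and the `users` table are all assumptions for illustration, not taken from the paper.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("udf-demo").getOrCreate()

// Register an ordinary Scala closure as a UDF callable from SQL.
spark.udf.register("ageBucket", (age: Int) => if (age < 18) "minor" else "adult")

// Assumes a table or temporary view named "users" with an `age` column.
spark.sql("SELECT name, ageBucket(age) AS bucket FROM users").show()
```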

Second, the paper describes the Catalyst optimizer, a new extensible optimizer based on functional programming constructs in Scala. There are two reasons for designing Catalyst: first, the authors wanted to make it easy to add new optimization techniques and features to Spark SQL; second, they wanted to enable external developers to extend the optimizer.

The strength of the paper is that it provides a complete description of Spark SQL, including the background, a review of Spark, the programming interface, the optimizer, and advanced features. This complete flow makes the motivation and implementation of Spark SQL clear to readers.

The weakness of the paper is that it provides few examples when introducing the features of Spark SQL. I think it would be easier for readers to understand Spark SQL if some real-world examples were provided.



Review 26

This paper is an introduction to SparkSQL, a system built on top of Spark that supports relational operations for big data analytics. It is composed of two novel pieces of work, the DataFrame API and the Catalyst optimizer. Together these let you write a program for SparkSQL that can use the RDDs, which we have already discussed, that Spark provides.

The need for SparkSQL arose because previous relational interfaces on top of Spark, such as Shark, could only be used to query external data and not data inside of Spark, and because calling Shark from Spark required users to build SQL strings, which was inconvenient and error prone.

The DataFrame API is crucial because DataFrames are much easier to work with thanks to their integration into full programming languages. Users can pass them around in functions like ordinary objects, and they can now construct DataFrames directly against Spark RDDs instead of only against external data.
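To make the “DataFrames directly against RDDs” point concrete, here is a minimal sketch; the case class, values, and SparkSession `spark` are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Schema is inferred by reflection from the case class fields.
case class User(name: String, age: Int)

val spark = SparkSession.builder().appName("rdd-to-dataframe").getOrCreate()
import spark.implicits._ // brings toDF() and the $"col" syntax into scope

val usersRDD = spark.sparkContext.parallelize(Seq(User("ann", 30), User("bob", 17)))
val usersDF  = usersRDD.toDF() // an RDD of native Scala objects becomes a DataFrame

usersDF.where($"age" >= 18).show() // relational operators over native objects
```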

The Catalyst optimizer is crucial for the performance of SparkSQL and for allowing users to add optimizations. One of the main goals was to make it easy to add new optimizations and tackle problems unique to “big data.” The authors also wanted external developers to be able to extend the optimizer and support new data types. Users can create their own data types, which makes it much nicer to work with domain-specific data.

The last major advantage of SparkSQL, in my opinion, is its integration with Spark’s machine learning library. This is something we are using in our project, and I’m sure many people have used it, as machine learning is a very hot topic right now. This is one of the most crucial parts of SparkSQL in my opinion.

In terms of performance, I really liked the evaluations shown in this paper; they are trustworthy because SparkSQL doesn’t blow Shark and Impala out of the water in all graphs. They show that SparkSQL is clearly better at some things and slightly worse, but still comparable, at others. This made the metrics very believable and I really appreciated that.

Overall, I think this was a solid paper: it did a good job of going relatively in depth on the implementation of SparkSQL while providing sufficient examples and imagery to support its descriptions and make them much easier to comprehend. There is also a clear use for SparkSQL, and I think it is an important thing to know in the database community, especially in this era of machine learning; definitely worth the read.



Review 27

The old way to do analytics was to perform everything through traditional relational DB systems. Highly optimized query engines make this a powerful design decision, but only at small scale. In the age of “big data,” scalability is now just as important as availability or consistency, and the ability to use a programmatic language with the database is a major plus. Spark SQL is the answer to merging both worlds; instead of storing data in an RDBMS, data is stored on Hadoop file systems with many relational operations mimicked through Spark.

The first step to doing this is adding a schema to RDDs: declarative transformations on partitioned collections of tuples (in Scala, one way to do this is by mapping RDD data to “case classes”). The most interesting part of the system is the ability to interface data and code within the Hive ecosystem, as well as the older Spark architecture (RDDs plus SQL-query-like functionality). But even expression interpretation can be difficult to memory-optimize in the Java Virtual Machine, so they capitalize on Scala reflection and code generation to integrate expression evaluation efficiently into the code base.
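The code-generation point can be illustrated with a tiny, self-contained sketch of Scala quasiquotes (the reflection-based mechanism the paper builds on), separate from Spark itself; it assumes scala-reflect and scala-compiler are on the classpath, and the generated expression is purely illustrative.

```scala
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox

// Build a Scala AST at runtime with a quasiquote, then compile and run it.
val toolbox = scala.reflect.runtime.currentMirror.mkToolBox()
val ast     = q"(x: Int) => x + 1"                 // AST for a one-argument function
val addOne  = toolbox.compile(ast)().asInstanceOf[Int => Int]

assert(addOne(41) == 42) // the generated code runs as ordinary JVM bytecode
```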

However, in my experience working with both Hive and SparkSQL, I would say that Spark SQL is in one of the earlier stages of its development, and that Hive has more fully-fledged SQL support. SparkSQL would be extremely useful once it is more stable, as it would then be more likely to be adopted in a massively scalable context (i.e. the last time I worked with it, it wasn’t necessarily production-ready). The concept itself, however, is awesome and it feels really cool to use it.


Review 28

The paper presents Spark SQL: a new module in Apache Spark that integrates relational processing with Spark’s functional programming API. The motivation is that users are usually forced to choose between familiar declarative SQL queries with less sophisticated data processing, and more advanced processing techniques, data sources, and data formats exposed through a low-level programming interface. So why not integrate them?

The paper starts with the background and goals of Spark SQL, including an explanation of Spark, the engine Spark SQL builds on, which offers a functional programming API for data manipulation on Resilient Distributed Datasets (RDDs) and which, although useful, is also limited. It discusses the previous attempt at a relational system on Spark (Shark) and then explains Spark SQL’s goals. Next, the paper discusses Spark SQL’s programming interface. The main component is the DataFrame, a distributed collection of rows with the same schema; once constructed, a DataFrame can be manipulated with relational operators. The paper further explains the basic workings of DataFrames, including their operations, a comparison with relational query languages, querying native datasets, in-memory caching, and user-defined functions. The other main component is Catalyst, an extensible optimizer that makes it easier to add data sources, optimization rules, and data types for domains such as machine learning. The paper explains its tree structure, its rules, and how Catalyst is used in Spark SQL (the phases are analysis, logical optimization, physical planning, and code generation).

Extension points of Spark SQL include the ability to process various data sources (including CSV, Avro, Parquet, and JDBC) and the mapping of user-defined types (UDTs) to structures composed of Catalyst’s built-in types. Three additional features are added for advanced analytics: schema inference for semistructured data, integration with Spark’s machine learning library, and query federation to external databases. Next, the paper evaluates performance by comparing Spark SQL with Shark and Impala. A small section discusses the use of Spark SQL in research applications (generalized online aggregation and computational genomics). Last, the paper discusses the related work that inspired Spark SQL, how Spark SQL differs, and how it improves on those earlier systems.
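Among the features surveyed above, the pluggable data sources can be illustrated with a short sketch: all are reached through the same reader interface. The paths, connection string, and column name below are hypothetical, and `spark` is an assumed SparkSession.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("data-sources").getOrCreate()

val fromParquet = spark.read.parquet("hdfs://.../events.parquet")
val fromCsv     = spark.read.option("header", "true").csv("hdfs://.../events.csv")
val fromJdbc    = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost/sales") // hypothetical connection string
  .option("dbtable", "orders")
  .load()

// All three are ordinary DataFrames; e.g. join the Parquet events with the
// JDBC orders on a hypothetical shared column.
val combined = fromParquet.join(fromJdbc, "order_id")
```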

The main contribution of this paper is that it presents a system for big data applications that combines complex procedural algorithms with relational data queries; until it was written, the two classes of systems were still largely disjoint. Spark SQL takes advantage of both Spark’s powerful data processing techniques and the “familiarity” of relational SQL, making it simpler and more efficient to write data pipelines that mix relational and procedural processing, while providing substantial speedups.

However, I think it would be better if the paper also explained how data is obtained. While I realize that this is an OLAP-oriented system, it is highly likely that the data comes from a distributed system. From the description, it seems like Spark SQL would use a lot of memory; how would query processing compete for resources with network transfer and other data-fetching processes? Another point: I think it would be good to give a resource utilization comparison between Spark SQL, Shark, and Impala to see how each system uses memory and CPU.



Review 29

The purpose of this paper is to introduce SparkSQL, a new module of the Spark system that lets the user use not only the functional language built into Spark but also a relational, SQL-based language. SparkSQL is useful because it allows users to work in the original Spark language and a SQL-based language while intermixing the two (without having to do a context switch or anything).

This paper presents two main technical contributions in its effort to intermix the functional Spark language with the relational language of SQL. The first is the DataFrame API, which allows the user to perform relational operations, similar to those found in regular SQL, on Spark’s built-in collections as well as on external data sources whose results can then be fed into a Spark instance. These DataFrames consist of structured data records that can be manipulated either by Spark’s usual functions or by relational operations. The second main contribution is the query optimizer, Catalyst. This optimizer is extensible, meaning it is easy for users to add additional data sources and types, as well as corresponding optimization rules, so that the system can be optimized for specific domains such as machine learning, which Spark is commonly used for. With the additions of DataFrames and Catalyst, SparkSQL opens the door to further automated optimization of Spark queries, since it adds a lot of functionality on top of the original Spark API, including an intelligent query optimizer in Catalyst.
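To picture how relational results can be mixed with procedural Spark code, here is a minimal sketch; the column names, path, and SparkSession `spark` are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mixed-pipeline").getOrCreate()
import spark.implicits._

val events = spark.read.json("hdfs://.../events.json") // external data source
val errors = events.where($"level" === "ERROR")         // relational filter

// Drop down to plain Spark for a procedural word count over the error messages.
val tokenCounts = errors.rdd
  .flatMap(row => row.getAs[String]("message").split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
```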

I think a main strength of this paper is its methodical discussion of the motivation for the work and of how the contributions address a problem and improve a user’s interaction with the Spark system. The authors present thorough empirical results that demonstrate the improvement that both adding relational queries and using their optimizer contribute to the Spark system. I also appreciated the inclusion of the “research applications” section at the end of the paper, which shows how these tools might be used in research and not just practical settings.

As far as weaknesses go, I have a hard time coming up with one. I found that the confusing parts of the paper were made much more clear by the inclusion of helpful diagrams or bits of pseudocode. I am impressed by the content and empirical results in this paper.


Review 30

Review: Spark SQL: Relational Data Processing in Spark

Paper Summary:
This paper proposes Spark SQL, a new module that integrates relational processing with Spark’s functional programming API. The advantages of Spark SQL are that it offers tighter integration between relational and procedural processing and that it includes a highly extensible optimizer, Catalyst, which makes it easy to add composable rules, control code generation, and define extension points.
The motivation for this work comes from the need for a more flexible system that has the popular properties of relational systems and is also capable of handling big data applications. This requires a system that joins the relational and procedural approaches, which is the reason for proposing Spark SQL.

Paper Review:
As a paper proposing a new system, it provides very clear information about the system’s properties, implementation, and APIs. The combination of the relational and procedural systems is achieved through the Catalyst optimizer, which uses tree structures and applies rules to manipulate those structures as its framework. As a strength, the paper provides extensive experimental results to demonstrate the performance of the proposed system, and the comparison graphs in Figure 8 are particularly helpful in demonstrating its gains. As shown in those figures, Spark SQL outperforms the baseline systems in many cases, but not in all. It would be interesting to know in which general cases Spark SQL has a significant edge over the other systems.



Review 31

This paper introduces SparkSQL, a new relational data processing module on top of the cluster computing framework Spark. SparkSQL, a new component that replaces its predecessor Shark, has two main advantages: first, it provides the DataFrame API, which offers much better integration between relational and procedural processing; second, it embeds a highly extensible optimizer, Catalyst, which supports both rule-based optimization and cost-based estimation and makes it easy to add new rules.
The DataFrame API helps users easily apply relational set operations to Spark’s built-in RDDs. It lets users program against data in Python, Java, or Scala through the DataFrame API while benefiting from the optimizations provided by SparkSQL. Moreover, unlike Hive or Shark, users can register temporary tables directly from text or JSON files and query them seamlessly alongside tables in the HiveContext.
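A hedged sketch of that workflow, assuming a Hive-enabled SparkSession named `spark`; the file path, view name, and the Hive table `users` are all hypothetical, and in Spark 1.x the call was registerTempTable rather than createOrReplaceTempView:

```scala
val clicks = spark.read.json("hdfs://.../clicks.json") // schema inferred from the JSON
clicks.createOrReplaceTempView("clicks")                // temporary table over the file

// Join the temporary table against an existing Hive table in the same query.
spark.sql("""
  SELECT u.name, COUNT(*) AS n
  FROM clicks c JOIN users u ON c.user_id = u.id
  GROUP BY u.name
""").show()
```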
The Catalyst optimizer gains many advantages from features of the Scala programming language: for example, it uses pattern matching to find the nodes to which an optimization pass applies, and it uses Scala’s functional programming nature to easily build ASTs and generate runtime code. A single SQL query starts as an unresolved logical plan; with the help of the catalog, the analysis phase produces a resolved logical plan, and after logical optimization and physical planning, Catalyst selects the best physical plan based on the cost model and submits it to the Spark engine.
Moreover, this paper conducts an impressive set of experiments showing that SparkSQL performs better than Shark on every task and better than Impala in most cases. The paper also introduces several research projects that have been carried out using SparkSQL and its Catalyst optimization module.

Strengths:
1. This paper introduces SparkSQL, a new relational data processing module on top of the cluster computing framework Spark. It outperforms its predecessor Shark in every aspect.
2. SparkSQL has an extensible optimization module, Catalyst, which can be used to easily add new rules and new data formats.
3. SparkSQL has an expressive DataFrame API, which can be used seamlessly with Spark RDDs and also with the HiveContext.
Weakness:
1. SparkSQL only supports limited optimization rules and limited cost-based estimation. However, since SparkSQL is still under active development, I believe it will become more mature over time.
2. The paper presents a complete comparison with Impala and Shark, but it does not include any comparison with Hive or a parallel database. It would be more complete if such a comparison were provided.



Review 32

This paper describes SparkSQL, a relational data processing framework that leverages Spark’s functional programming API. SparkSQL provides a hybrid of relational and procedural processing via the DataFrame API and procedural Spark code, and it implements an extensible optimizer called Catalyst. By leveraging the functional programming language Scala, it is easy to implement rule-based optimization in Catalyst. The paper proposes an abstraction over Spark’s SQL API named the DataFrame: a distributed collection of rows with the same schema, equivalent to a table in a relational database, with an interface for manipulation similar to a Spark RDD. DataFrames support all common relational operators, and users can construct DataFrames directly against RDDs of objects native to the programming language; SparkSQL can automatically infer the schema of these objects using reflection.

SparkSQL can cache hot data in memory using columnar storage and applies columnar compression schemes to reduce the memory footprint; caching data in memory speeds up interactive queries and iterative algorithms. The Catalyst optimizer stores query plans as trees internally. After receiving a SQL query, it first parses it into a syntax tree, then analyzes the unresolved logical plan by applying metadata, aborting invalid queries early at this stage. For valid plans, Catalyst applies rule-based optimization; from the optimized logical plan, several physical plans are generated and one is selected based on a cost model. Finally, the selected physical plan is transformed into RDDs and executed by Spark. For better applicability, SparkSQL also integrates with other tools such as Spark’s machine learning library.
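As a small sketch of the in-memory columnar caching mentioned above (the SparkSession `spark`, path, and column names are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("columnar-cache").getOrCreate()
import spark.implicits._

val logs = spark.read.parquet("hdfs://.../logs.parquet")
logs.cache()                            // columnar, compressed in-memory cache

logs.where($"status" === 500).count()   // the first action materializes the cache
logs.groupBy("endpoint").count().show() // later queries read the cached columns
```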

The main advantage of SparkSQL is that it provides a declarative DataFrame API that allows relational processing. Another benefit comes from the functional programming language: the pattern matching feature provided by Scala makes implementing rule-based optimization much easier.

One weakness of SparkSQL is that the Catalyst query optimizer is quite simple and naive. Currently Catalyst only supports simple rules like filter push-down; a more sophisticated query optimizer would be desirable, as it is essential for further performance gains. Another drawback is that SparkSQL’s cache manager requires the user to provide cache hints. We hope a future version of SparkSQL can materialize tables automatically.