Big Science needs to store large amounts of data and perform complex analytics operations on it, such as regridding arrays, clustering, and regression. Users like particle physicists and genomics researchers are sometimes not well served by traditional RDBMSs, which lack native support for complex analytics queries and array data models. The developers of SciDB seek to deliver an array-oriented database that meets the needs of Big Science customers.
SciDB is based on a multidimensional array data model instead of the table (relational) data model of RDBMSs. Arrays can have many dimensions, with integer values for each dimension specifying a “cell,” which can contain a tuple of values or even a nested array. For example, a satellite imaging project could build a 3-D array, with dimensions for latitude, longitude, and time, and tuple values for red, green, and blue wavelength image values.
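To make the data model concrete, the satellite example can be sketched in a few lines of Python. This is only an illustration, not SciDB syntax: the names Pixel, image, and time_slice are hypothetical, and a dictionary keyed by integer coordinates stands in for the array.

```python
from typing import NamedTuple


class Pixel(NamedTuple):
    """Tuple of values stored in one array cell (hypothetical attribute names)."""
    red: int
    green: int
    blue: int


# A 3-D array addressed by integer (latitude, longitude, time) indices.
# Using a dict keyed by coordinate tuples is an assumption made for brevity.
image = {
    (10, 20, 0): Pixel(120, 80, 40),
    (10, 20, 1): Pixel(125, 82, 41),
    (11, 20, 1): Pixel(90, 60, 30),
}


def time_slice(array, t):
    """Return the 2-D (latitude, longitude) slice of the array at time index t."""
    return {(lat, lon): cell for (lat, lon, time), cell in array.items() if time == t}
```

Each cell is found by giving one integer per dimension, and fixing the time index yields a lower-dimensional array, which is the kind of access a table of rows makes awkward.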
SciDB presents many novel features that make it an efficient solution for the needs of Big Science. The proposed database will have native operations for making pseudo-coordinates for array cells, which will make projections and other operations easier to handle. Many other array operators, like filtering and reshaping, are also natively supported. The database will have tools to track the provenance or lineage of each tuple, so that if wrong data is found, it will be easy to find out where the data point came from and what other data points have been derived from the incorrect values. In a similar vein, SciDB will be “no overwrite,” meaning that if an array is declared as “updatable,” then updates to cell values will be written to a new index along the “history” dimension instead of clearing the previous values. This will allow scientists to update values in their data sets without losing their old data.
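The no-overwrite behavior can be illustrated with a small Python sketch. It is not SciDB code: the class UpdatableArray and its methods are hypothetical, and using one integer counter per array as the history index is an assumption made for the example.

```python
class UpdatableArray:
    """Toy no-overwrite array: every update lands at a new history index."""

    def __init__(self):
        self._history = 0   # latest index along the hypothetical history dimension
        self._cells = {}    # coords -> list of (history_index, value), oldest first

    def update(self, coords, value):
        """Record a new value without discarding earlier ones; return its history index."""
        self._history += 1
        self._cells.setdefault(coords, []).append((self._history, value))
        return self._history

    def read(self, coords, as_of=None):
        """Return the newest value at coords, or the newest one at or before as_of."""
        for index, value in reversed(self._cells.get(coords, [])):
            if as_of is None or index <= as_of:
                return value
        return None
```

After two updates to the same cell, read(coords) returns the newer value, while read(coords, as_of=first_index) still returns the original one, so a correction never destroys the data it replaces.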
One limitation of this work is that the plan for SciDB was not carried out fully in later years. The authors specified that SciDB would be developed by a non-profit organization and maintained as open source. While this did happen for a while, recently the company Paradigm4 was founded to continue developing SciDB, and certain features are available only in a paid version. Additionally, some features like data provenance have not been fully implemented so far.
The paper describes SciDB, a new scientific database system built around a set of requirements common to many scientific fields. Science communities such as astronomy, biology, and remote sensing need much more than current relational database systems provide. Using Postgres for science databases gave inefficient results: features that were needed most of the time were absent, and the available SQL operations did not have the right functionality. To mitigate this problem, the authors propose a new database system that meets the common requirements of multiple scientific domains.
SciDB shuns Hadoop as a starting point since it is slow and doesn’t have a hierarchical architecture. The data model has nested multidimensional arrays whose cell values are tuples of values and arrays. Arrays can be enhanced with a shape function, which supports irregular boundaries. The model also includes coordinate systems built from user-defined functions that map integers to other things. The query language has a parse-tree representation of array operations, with bindings to languages such as MATLAB, C++, and Python. The operations are user-extensible: they include the standard relational ones along with user-defined ones. Storage is on an extendable grid of Linux machines with built-in high availability and disaster recovery. The system also handles operations on data while loading it. Arrays are stored as chunks, which are partitioned across the grid.
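The idea of storing arrays as chunks spread over the grid can be sketched as follows. This is a minimal illustration under my own assumptions: fixed-size chunks, a row-major numbering of chunks, and round-robin placement on nodes. The paper does not prescribe this particular scheme, and the function names are hypothetical.

```python
def chunk_of(coords, chunk_shape):
    """Map a cell's integer coordinates to the index of the chunk that holds it."""
    return tuple(c // size for c, size in zip(coords, chunk_shape))


def node_for_chunk(chunk, chunks_per_dim, num_nodes):
    """Assign a chunk to a node: row-major chunk number, then round-robin (assumed policy)."""
    linear = 0
    for index, extent in zip(chunk, chunks_per_dim):
        linear = linear * extent + index
    return linear % num_nodes
```

With 100 x 100 chunks, the cell (250, 30) falls in chunk (2, 0); on a 4 x 4 grid of chunks and three nodes that chunk has row-major number 8 and lands on node 2.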
The paper provides a comprehensive study of SciDB. The requirements and the solutions are provided as the paper progresses. The operators section deals with both standard and aggregate operators.
There is not much experimental data in the paper since the system was still in the making when the paper was published. The paper also doesn’t mention how many users require such a database, even within scientific computing. The absence of performance tests and of an implementation makes the system less convincing.
What is the problem addressed?
Conventional databases fail to support several requirements of scientific computation. Following discussions among academic database researchers, this paper presents the resulting requirements and sketches some of the SciDB design. Intermingled is a collection of research topics that require attention.
There are several challenges that SciDB encountered:
- Scale. Science has always been intensely “data driven”. With the ever-increasing massive data-generating capabilities of scientific instruments, sensors, and computer simulations, the average scientist is overwhelmed with data and needs data management and analysis tools that can scale to meet his or her needs, now and in the future.
- Supporting complex analytic methods. Data warehouse tools have focused on providing easy-to-use interfaces for submitting SQL aggregates. Besides aggregates, scientists are more interested in “complex analytics,” which is defined on arrays in linear algebra and requires a new generation of client-side and server-side tools in DBMSs.
- Provenance. One of the central requirements that scientists have is repeatability of data derivation. They need to track down the possible causes of errors. As such, it is crucial to keep prior versions of data in the face of updates, error correction, and the like. Therefore it’s necessary to have no-overwrite storage, in contrast to most commercial systems today, which overwrite the old value with a new one.
- Interactivity. SciDB will also support POSTGRES-style user-defined functions (methods, UDFs), which must be coded in C++, which is the implementation language of SciDB. As in POSTGRES, UDFs can internally run queries and call other UDFs. We will also support user-defined aggregates, again POSTGRES-style.
1-2 weaknesses or open questions? Describe and discuss
“We expect to have a usable system for scientists within two years.” Strictly speaking, this is not a scientific paper but rather a summary of the discussion the authors had at the conference, so the paper lacks evaluation and validation. Moreover, it’s a little ambiguous to call SciDB a science database, since the authors didn’t specify what kinds of operations it supports.
This paper summarizes the requirements of science database users and presents a new science database system called SciDB. In short, SciDB is a new database system designed for scientific uses such as astronomy, particle physics, fusion, and biology. SciDB differs from a traditional DBMS in many ways. For example, SciDB uses an array data model instead of tables, and more specifically a multi-dimensional, nested array model with array cells containing records. The paper first gives an overview of the motivation and the solution. Then it moves to the details of the system design, including data model, operations, extendibility, language bindings, no overwrite, open source, grid orientation, storage, etc. Finally, it concludes with a summary of the state of the project.
The problem here is that scientific database users have different requirements than traditional database users. For example, for some scientific users tables are not a natural data model that matches the data, and SQL is not a satisfactory interface language. Another problem is that there is no general solution, and the existing prototypes are hard to maintain and update. Another problem is that science DBMSs are a “zero billion dollar” industry, so they won’t get attention from large commercial vendors. Therefore, this paper summarizes some requirements for science databases and provides a good solution, a new system called SciDB.
The major contribution of the paper is that it provides a detailed summary of the requirements for science databases. Scientific users often complain about the inadequacy of current commercial DBMS offerings, but no general solution to this problem has been realized. Here we will summarize some key elements of SciDB.
1. multi-dimensional, nested array model
2. new structural operators (create new arrays, reshape, structured-join)
3. array operations are user-extendable
4. parse-tree representation for commands
5. non-profit foundation to manage the code
6. streaming bulk loader and storage within a node
7. in situ data
8. named versions
One interesting observation: this paper is a good reference for science database requirements. It provides a detailed explanation of the different requirements with many examples. One possible weakness of this paper is that it does not provide a general structure of SciDB, only several fragments of design ideas for SciDB. It might be better to provide a big picture of the new system and a more detailed summary.
This paper is focused on the fact that traditional RDBMSs are built for standard business uses, and so are not built for scientific purposes. The standard table structure, as well as SQL, makes traditional RDBMSs ineffective for scientific computing. Since the market for scientific databases is not very large, and the profit margins are essentially non-existent, commercial vendors will not create a scientific database. Because of this, the way this product is being created is quite unique. The initial team spent a year surveying several fields of science to come up with their requirements. From there, they identified some corporate partners to fund the development, since very complex business analytics share many of the requirements of "big science" systems.
Currently, most scientific projects have to build their own custom software from the ground up. However, as the size of the data grows, and the software stack with it, this is not a sustainable solution. There needs to be some reusable, yet extensible, database for scientists.
This paper is the fulfillment of Stonebraker's expectation that we will need custom "big data" solutions tailored to unique subsets of problems. The scientific computing world seems to be one of those subsets, and by providing enough extensibility, SciDB should be able to cover most scientific use cases. The system seems like it could have extremely high performance: the array data type works well and scientific data is (typically) very easy to partition. SciDB is also able to act on "in situ" data, where it can query data from external sources. However, one of its main benefits is its ability to "cook" data within the database, instead of offloading to an external program to change the data.
One major letdown for this paper was that they hadn't finished the system at the time of publication! We don't get to see any results on performance, opinions from users, or even proof that it works. They don't even have a proper benchmark, since this is a new type of workload. At the time of publication, they were working on the benchmark, and expected to have it done in 2009. I would have liked to have seen a finished product!
Problem and solution:
The problem posed in the paper is that current commercial DBMSs are inadequate for “big science”. Since scientific database users and very complex business analytics share a common set of requirements for a new science database system, the proposed solution is to define the set of requirements across several science disciplines and try to satisfy them with a new database design named SciDB.
The requirements include:
1. An appropriate data model. The relational data model of tables is not suitable for the needs of scientists because of its rigidity. Tables do not appear natural for scientific needs, and from the scientists' perspective SQL is not a good way to retrieve data. A good alternative is an array model, which may have multiple dimensions; each combination of dimension values identifies a single array cell that holds the data value. A "free" advantage of the array model is that it generalizes relational tables, given that a table with a primary key can be conceptualized as a one-dimensional array. Other alternatives include graphs and meshes. These cover the situations where the array model is not ideal, but they are harder to build as well. As the array model works for most of the community, the authors chose to implement it instead. The array model comes with well-defined syntax for creating and enhancing arrays, along with POSTGRES-style user-defined functions.
2. The second requirement is a set of well-defined operators. The authors define structural operators, which are data-agnostic and perform only structural manipulations on arrays, possibly with various dimensions. The other type is the content-dependent operators, which take a logical predicate on data values to decide which operations to perform.
3. For analytical and computational needs, the SciDB should be extendible by customization. Users could define their own operation methods and data types in POSTGRES style.
4. SciDB should be able to support multiple language bindings, as there is no single language that is supposed to be used behind the scenes.
5. To guarantee that data are trackable and not discarded, the SciDB should prevent overwriting the old data by the users. This is accomplished by adding a new dimension called "history", which keeps track of the data changes while preserving the original data values.
6. To make sure that large-scale, long-running scientific projects remain maintainable and can be recompiled at will, SciDB should be open source. To give it strength comparable to that of the commercial DBMSs, a non-profit foundation is founded to support the user community.
7. Partitioning of data storage should suit the users' need of large datasets. To do this efficiently, the SciDB has a default fixed partitioning mechanism, while a user-defined dynamic partitioning scheme is available.
8. The SciDB needs to break data into disk blocks. An R-tree is used for tracking and background threads are used for optimization. Further optimization is expected to take place as well.
9. The SciDB is supposed to have the ability to support ad-hoc analysis without much time spent on loading the huge amount of data. This has been a big concern for many scientists and SciDB implements adaptors to speed up the usage of these "in-situ" data.
10. The SciDB should have good integration of "cooking process". Aiming at a powerful data manipulation ability internally, the SciDB should integrate the cooking process within the DBMS without external processing when the users need that.
11. SciDB should support different scientific data manipulation on the same set of data bundles. To achieve this the authors proposed a tree like structure using named versions.
12. Data derivation should be clear from the outset. SciDB should support backward tracking of data changes when necessary.
13. Uncertainty is common in scientific fields and SciDB should capture this. It will no longer store just single data values, but also normal distributions of those.
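The uncertainty requirement in the last item can be made concrete with a short Python sketch. It is my own illustration rather than SciDB's design: the class name Uncertain is hypothetical, and the addition rule assumes the two values are independent and normally distributed, in which case the means add and the standard deviations combine in quadrature.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Uncertain:
    """A value stored as a normal distribution: mean plus standard deviation."""
    mean: float
    stddev: float

    def __add__(self, other):
        # Assumes independence: variances add, so standard deviations
        # combine as sqrt(a^2 + b^2).
        return Uncertain(self.mean + other.mean,
                         math.hypot(self.stddev, other.stddev))
```

For example, adding a reading of 10.0 with standard deviation 3.0 to a reading of 5.0 with standard deviation 4.0 gives 15.0 with standard deviation 5.0. A database that stores only the bare numbers would lose this information.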
The authors proposed a well-defined database management system with an array data model and accompanying operations and features that capture the requirements of the scientist communities. It will contribute to at least the people whose research could be reflected naturally using the array model and those who desire the good features such as data tracking, uncertainty description, and extendible customization.
1. It still cannot satisfy the need for graph, sequence, or mesh models, such as those used in biology, genomics, or solid modelling.
2. There is not an existing implementation yet. No solid product and test result could prove that the SciDB design is effective, efficient, robust, and widely-applicable.
Traditional relational databases don’t meet all the requirements of the scientific community. Specifically, many scientists would like to manipulate their data as an array, instead of using SQL tables. In trying to replicate array functionality on traditional databases, the scientific community has been stuck with underperforming systems. The authors of this paper describe a design for SciDB, a database that will try to address these concerns.
SciDB uses multidimensional arrays as its core data model. An array can have as many dimensions as needed, and access occurs by specifying the coordinates in each dimension for the desired data, which can contain multiple attributes. SciDB provides operations to manipulate arrays by reshaping their dimensions, and filtering/aggregating/joining data. SciDB will also have the ability to be extended to provide specific operations as needed.
SciDB also tries to incorporate other requirements of the scientific community. Data history can be accomplished by adding a history dimension to arrays. Data can be queried “in situ” (without bulk loading) to avoid long load times for ad hoc queries. Data cooking (transforming and cleaning raw data) can be enabled by the user, and different cooking methods can be supported through data versioning. In addition, SciDB will also support uncertain data, by recording error bounds for values.
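The way versioning can support several cooking methods over the same raw data can be sketched as a tree of named versions. This is a hypothetical illustration: the class Version and the choice to store each version as a set of changes relative to its parent are my assumptions, not details given in this review.

```python
class Version:
    """A named version that stores only the cells it changes relative to its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.delta = {}  # coords -> value written in this version

    def write(self, coords, value):
        """Change a cell in this version only; the parent is left untouched."""
        self.delta[coords] = value

    def read(self, coords):
        """Look in this version first, then walk up the tree toward the root."""
        version = self
        while version is not None:
            if coords in version.delta:
                return version.delta[coords]
            version = version.parent
        raise KeyError(coords)
```

Two scientists can each derive their own cooked version from the same raw array, and cells they did not touch are still read from the shared parent.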
In addition to supporting the requirements of scientists, the authors comment that SciDB would likely be useful for many other types of data analytics, such as user engagement metrics on sites like ebay.com.
The data model of SciDB is based on a real need in the scientific community, and the authors do a good job of explaining how SciDB’s properties would suit this community.
The paper is impressive in its ambition and thoughtfulness in addressing the requirements, but of course until the project is completed it’s still possible to doubt whether their vision is feasible. The only section that touches on implementation concerns is the discussion of how data partitioning might work in SciDB.
The paper discusses SciDB, a specialized database for scientific data. Traditional databases focus on processing business data and are not well suited to scientific data. The paper proposes SciDB to provide a compatible environment for different science domains.
The paper argues that scientists do not like the relational tables of traditional DBMSs and that most scientific users demand a different data model. The paper concludes that an array data model can make most of these users happy. The paper then goes on to define data structures and operators for supporting the array data model.
The authors introduce the database features that are necessary for scientific users in this paper. There are many concerns regarding scientific data and the paper seems to address them quite well. However, things like the challenges in implementing such features and detailed techniques or algorithms are missing from the paper because SciDB had not been implemented at the time. The paper is basically a proposal of what the authors are going to do. Since SciDB is now available, it would be better if there were a newer paper with implementation details and experiments.
In short, SciDB is another example of extending traditional databases in order to capture users in different domains.
This paper outlines some of the requirements for science databases and how those requirements are being reflected in SciDB. The paper starts by stating that SciDB will use a multi-dimensional array data model because it is easier to program than mesh and graph models. SciDB will carry features similar to those of Postgres and allow for user-defined functions and operations. This is an important feature since it will allow different scientific communities to adapt SciDB to fit their needs. The authors of the paper are mainly designing a general-purpose database for science use because of the numerous differences in requirements. However, one basic requirement that all the scientists agreed on was the need for no overwrite of old data and for allowing uncertainty in data types, such as in the form of error bars or probability distributions. The paper concludes by stating its current fundraising progress and that it expects SciDB to be ready within two years of this paper.
It is interesting to see that the database community actually has meetings with other research communities to create specific databases outside of relational databases. It is well known that relational databases are not a good solution to most needs, but especially to scientific needs. It also seems there is a huge need for this, given the number of supporters the project already has. The cool thing was that eBay was included due to their unique business needs for this type of database. One question I have is: suppose the object-oriented programming model for databases had actually beaten out relational databases in the past; would these types of databases have been better suited to supporting these unique database needs? It seems that SciDB is going to be in C++, which is already an object-oriented language, and an OO database would be more customizable. Maybe if OO databases had taken off, they would have been better for a general use case due to the easy customizability inherited from similar programming languages.
This paper presents the requirements (big science) the authors found among scientific database users and very complex business analytics, and describes the design of a science database management system called "SciDB". A subset of the special features of SciDB is listed below.
1. Data model:
A multi-dimensional, nested array model is proposed because it meets the requirements of a large subset of the scientific community and because a mesh model is complex to build. In this model, array cells contain records, which in turn can contain components that are multi-dimensional arrays. In addition, SciDB is implemented in C++ and is also able to provide POSTGRES-like user-defined functions (UDFs).
2. SciDB operators:
The SciDB operators can be classified into two categories: structural operators and content-dependent operators. Structural operators only care about the structure of inputs and ignore the content or data values; they provide possibilities for optimization. Content-dependent operators consider the data that is stored in the input array, for example, filter and aggregate operations.
3. Programming languages:
Since the scientific database users might want to use different programming languages, SciDB seeks to support different languages. This paper states that SciDB will have a parse-tree representation for commands with multiple language bindings that map from the language-specific representation to this parse tree format.
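The parse-tree idea in the last item can be illustrated with a toy Python binding. Everything here is hypothetical: the Node class, the helper functions scan, subsample, and aggregate, and the use of JSON as a language-neutral wire format are my assumptions for the sketch, not SciDB's actual interface.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a command parse tree: an operator name and its arguments."""
    op: str
    args: list = field(default_factory=list)

    def to_dict(self):
        """Convert the tree to plain dicts and lists, recursing into child nodes."""
        return {
            "op": self.op,
            "args": [a.to_dict() if isinstance(a, Node) else a for a in self.args],
        }


# Language-specific helpers: a binding for another language would offer
# its own syntax but build the same tree.
def scan(array_name):
    return Node("scan", [array_name])


def subsample(child, predicate):
    return Node("subsample", [child, predicate])


def aggregate(child, function, dimension):
    return Node("aggregate", [child, function, dimension])


def to_wire(tree):
    """Serialize the tree in a form any language binding could produce."""
    return json.dumps(tree.to_dict(), sort_keys=True)
```

A call such as aggregate(subsample(scan("sky"), "x < 10"), "sum", "y") builds a three-level tree. The point of the design is that the engine only ever sees the tree, so adding a new language means writing a new front end and nothing else.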
The major contribution of this paper is that it discusses many potential database usage requirements from the science community and tries to address those challenges. Also, the authors are building SciDB as such a DBMS and making it an open-source project, which would be a great contribution to the science community.
Motivation for SciDB:
The purpose of this paper is to specify a common set of requirements for SciDB, a new science database system. Currently, most scientific users are forced to use relational tables because that is what current systems offer, but most scientific data does not naturally match that data model, and few users are satisfied with SQL as the interface language. SciDB uses an array data model, which satisfies the needs of a large subset of the science community and also meets the needs of those who want tables.
One of the main categories of SciDB operations is structural operators, which create new arrays from the structure of the inputs and are data-agnostic. Subsample is an example of a structural operator: it takes an array and a predicate as input and outputs a subset of the input array that has the same number of dimensions as the input. SciDB’s other main category is content-dependent operators. An example of this is Filter, which takes an array A and a predicate, and outputs an array of the same dimensions that contains the elements of A where the predicate is true, and NULL otherwise. SciDB’s fundamental array operations are user-extendable, and users can add their own data types. SciDB has to deal with various languages, so it has multiple language bindings that map from language-specific representations to the parse-tree representation for commands. SciDB does not overwrite data, as is required by most scientists. SciDB divides the load stream into site-specific substreams that appear in the main memory of the associated node. SciDB can work on data that is “in situ” without requiring a load process. Cooking, or processing raw information into finished information, can be done inside the engine if the user chooses to do so. SciDB also allows for named versions, which are vital to many science users. SciDB keeps a metadata repository that allows scientists to enter the programs that were run along with their run-time parameters, making the record of provenance available. Uncertainty is conveyed in SciDB by the location of an item in an array cell.
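The difference between the two operator categories can be shown with a small Python sketch. It is an illustration of the Subsample and Filter behavior as described above, not SciDB code: arrays are modelled as dictionaries from integer coordinates to values, and the function names are hypothetical.

```python
def subsample(array, dim_predicate):
    """Structural operator: keep cells whose coordinates satisfy the predicate.

    The predicate only looks at dimension values, never at cell contents.
    """
    return {coords: value for coords, value in array.items() if dim_predicate(*coords)}


def filter_cells(array, value_predicate):
    """Content-dependent operator: same dimensions as the input, with None
    (standing in for NULL) wherever the predicate on the value is false."""
    return {
        coords: (value if value_predicate(value) else None)
        for coords, value in array.items()
    }
```

Subsample shrinks the set of coordinates without reading any data values, which is why it leaves room for optimization. Filter keeps every coordinate and has to inspect each value.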
Strengths of the paper:
I felt that the paper did a good job of providing a concise overview of the features needed for a science database. I had not thought about how the format of science data would be different before, so it was interesting to read about all the different considerations to representing science in a DBMS, and the struggle that it was to fit science data in a relational DBMS before.
Limitations of the paper:
I would have liked to see some real world examples of where the features of SciDB, such as storage within a node and no overwrite, are used. I also felt that the paper was missing more detailed discussion on the implementation of the different SciDB features, such as extendibility, “in situ” data, and integration of the cooking process, and could’ve provided pseudocode or more explanation of how these features were implemented.
This paper titled "Requirements for Science Data Bases and SciDB" gives an overview of the research that has gone into creating databases specifically for scientific computing applications. The paper talks about the authors’ communications and collaborations with eBay, the Large Synoptic Survey Telescope, Microsoft, the Stanford Linear Accelerator Center, and Vertica. The authors describe the data model, the operators, and the requirements of users in these scientific communities. The paper then talks about other uses and benchmarks, and ends with concluding remarks about the future of this type of research.
The advantages of this paper include its insight into the real world and practical applications of database technologies. Many of the organizations involved in this paper use databases for scientific purposes. The realization that tables are not a natural data structure for these kinds of jobs is an important one. Additionally, the insight into structural operators and operations that are useful in these fields is a useful contribution toward constructing databases for scientific computing. The ability for users to add their own functions and data types and the fact that this database is open source allow others great flexibility in the use of SciDB.
The drawbacks of this paper are evident near the end, when the authors begin the discussion about non-scientific uses of this type of database. The authors state that it is surprising that the database system they have developed is more broadly applicable. This makes the reader wonder whether they have not yet pinned down the functionality specific to scientific computing. If they have, and this is not the case, then we are led to question whether the authors have been including the correct parties in this discussion. If SciDB is useful for a broader class of applications, then it should probably involve parties in the bigger picture. Nonetheless, the realization that the methods are more broadly applicable is a useful contribution.
Part 1: Overview
This paper presents a database design, SciDB, which is suitable for science analysis. Queries in astronomy, particle physics, fusion, remote sensing, oceanography, and biology share some requirements, which are called “big science” requirements in this paper. The XLDB-1 workshop brought together the “big science” community and defined the requirements they have in common. Stonebraker also counters complaints from earlier science database users, for whom concurrency was not well guaranteed, with examples indicating that building a database suited to particular requirements is not that hard.
The data model used for SciDB is the array data model because this data model would meet most science database users’ needs. A call for an array data model appeared in the Sequoia 2000 Project and was taken into consideration in the science database requirements. SciDB supports POSTGRES-style user-defined functions. Structural operators, including some advanced ones like reshaping an array, are supported in SciDB. Content-dependent operators are also supported; Filter, for example, is important in science analysis. In addition, SciDB is an open-source database, which suits science analysis: there are no hidden secrets or magic behind the source code.
Part 2: Contributions
User defined functions are supported in SciDB. UDFs can be used to enhance arrays. Any function that accepts integer arguments can be applied to the dimensions of an array and therefore enhance the array by transposition.
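A small Python sketch may help show what enhancing an array with a function over its integer dimensions means. This is my own illustration: the class EnhancedArray and the function scale10 are hypothetical names, scaling is used here in place of transposition as the coordinate transformation, and the sketch assumes the function maps distinct cells to distinct pseudo-coordinates.

```python
def scale10(i, j):
    """Hypothetical UDF: map integer indices to pseudo-coordinates scaled by 10."""
    return (10 * i, 10 * j)


class EnhancedArray:
    """Array that can be addressed by integer indices or by pseudo-coordinates."""

    def __init__(self, cells, to_pseudo):
        self._cells = cells
        # Precompute pseudo-coordinate -> integer-coordinate lookup.
        self._lookup = {to_pseudo(*coords): coords for coords in cells}

    def at(self, *coords):
        """Address a cell by its basic integer coordinates."""
        return self._cells[coords]

    def at_pseudo(self, *pseudo):
        """Address the same cell through the enhanced coordinate system."""
        return self._cells[self._lookup[pseudo]]
```

After enhancement the cell at integer position (2, 5) can also be reached as (20, 50), while the stored data stays the same.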
Advanced structural operators like reshape are included in SciDB. The reshape operator converts an array to a new one with a different shape, just like the reshape function in MATLAB. Content-dependent operators, including filtering, are also supported.
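To show what a reshape operator does, here is a minimal Python sketch on nested lists. It is only an illustration under my own assumptions: cells are taken in row-major order (MATLAB's reshape uses column-major order, so element placement differs), the target shape must be non-empty, and the function names are hypothetical.

```python
def flatten(array):
    """Return all cells of a nested list in row-major order."""
    if not isinstance(array, list):
        return [array]
    cells = []
    for item in array:
        cells.extend(flatten(item))
    return cells


def reshape(array, shape):
    """Rebuild the same cells into a new shape with the same total cell count."""
    cells = flatten(array)
    size = 1
    for extent in shape:
        size *= extent
    if size != len(cells):
        raise ValueError("new shape must hold the same number of cells")

    def build(values, dims):
        if len(dims) == 1:
            return list(values)
        step = len(values) // dims[0]
        return [build(values[i * step:(i + 1) * step], dims[1:]) for i in range(dims[0])]

    return build(cells, list(shape))
```

A 2 x 3 array can be turned into a 3 x 2 array or a one-dimensional array of length 6. The operator is structural because it never looks at the values it moves.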
Part 3: Drawbacks
This paper presents SciDB for science research, including particle physics, biology, remote sensing, etc., and also points out that the big vendors in the market are not really interested in this area because science databases are a “zero billion dollar” industry. SciDB therefore lacks the support that the great needs of science research call for.
The array data model would not meet some users’ needs, such as those of solid modelling users, who require a mesh data model. There is no “one size fits all” solution.
The paper discusses sets of requirements for a science database called SciDB. The problem addressed by the paper is that building custom software for each new project from the bottom up will not work, and there is a need to build a single DBMS that addresses all of these requirements. In addition, the paper identifies the fact that science DBMSs do not get as much attention from large commercial vendors as business-oriented DBMSs do.
The paper mainly discusses specific requirements for scientific databases. The main requirement is to have an array data model instead of SQL tables. Since a table with a primary key can be considered a one-dimensional array, an SQL table can be modelled by an array too, which satisfies those scientific data users who need tables. In addition, specific kinds of operations are required. These include the structural operators, which create new arrays based on the inputs without considering the content. For example, a subsample operator takes two inputs, an array A and a predicate over the dimensions of A, and outputs an array with the same number of dimensions but with a smaller number of dimension values. The paper also discusses content-dependent operators, like a filter, which outputs an array of the cells that satisfy a predicate.
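The observation above that a table with a primary key is a one-dimensional array can be sketched in Python. This is only an illustration: the function table_to_array is hypothetical, rows are modelled as dictionaries, and the primary key is assumed to be an integer so that it can act as the array dimension.

```python
def table_to_array(rows, key):
    """View a table as a 1-D array: the primary key is the single dimension
    and the remaining columns form the record stored in each cell."""
    array = {}
    for row in rows:
        index = row[key]
        if index in array:
            raise ValueError(f"duplicate primary key: {index}")
        array[index] = {name: value for name, value in row.items() if name != key}
    return array
```

Looking up a row by key then becomes ordinary array indexing, which is why users who want tables lose nothing under the array model.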
Other requirements mentioned in the paper include: user-extendable array operations instead of primitive join operations; multiple language bindings, which convert into a parse-tree representation for commands; no overwrite of data, because most scientists are adamant about not discarding any data; being open source; a shared-nothing architecture, due to the huge amount of data; being able to operate on “in situ” data without requiring a load process; being able to cook raw sensor data inside SciDB; being able to recreate an array by remembering how it was derived; and supporting uncertain data elements. Furthermore, the authors identified that these requirements apply more broadly than just to science applications.
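The requirement to remember how an array was derived can be illustrated with a toy provenance log in Python. This is a hypothetical sketch and does not describe how SciDB stores provenance: the class ProvenanceLog and its methods are invented for the example, and derivations are recorded per named array rather than per cell.

```python
class ProvenanceLog:
    """Records, for each derived array, the program, parameters, and inputs used."""

    def __init__(self):
        self._steps = {}  # output name -> {"program", "params", "inputs"}

    def record(self, output, program, params, inputs):
        """Remember one derivation step."""
        self._steps[output] = {
            "program": program,
            "params": dict(params),
            "inputs": list(inputs),
        }

    def trace_back(self, name):
        """Backward tracking: every array that name was derived from."""
        ancestors = set()
        stack = [name]
        while stack:
            step = self._steps.get(stack.pop())
            if step is None:
                continue  # a raw input with no recorded derivation
            for source in step["inputs"]:
                if source not in ancestors:
                    ancestors.add(source)
                    stack.append(source)
        return ancestors

    def affected_by(self, name):
        """Forward tracking: every array derived, directly or not, from name."""
        affected = set()
        changed = True
        while changed:
            changed = False
            for output, step in self._steps.items():
                if output in affected:
                    continue
                if any(src == name or src in affected for src in step["inputs"]):
                    affected.add(output)
                    changed = True
        return affected
```

If a raw array turns out to be wrong, affected_by lists everything that must be recomputed, and trace_back answers the opposite question of where a suspicious result came from.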
The main strength of the paper is that it identified applications area that are important for advancement of scientific research but is ignored by business organization because of not having direct impact on them. However, the advancement of science which is possible through DBMS like SciDB, will in turn leads to advancement of business organizations. I like the author's motive to proceed with the project even with this funding difficulty. In addition, they discusses general requirements which can be useful for other DBMS developer to take into consideration if they decided to develop DBMS in this area.
The main limitation of the paper is that it is not supported by an experimental evaluation. It would have been better to includes some research finding to support the different requirements set in the paper. In addition, I feel like the number of requirements mentioned are too much to fit in the paper. It would have been great if the authors could have prioritize requirements and discuss only some of them or have a better categorizing mechanism and discuss them in smaller number of categories.
This paper talks about the requirements the authors have been collecting and assembling from a collection of scientific data base users. The paper summarizes them and also briefly describes some of the SciDB design. Science DBMSs are a "zero billion dollar" industry, so this paper from Stonebraker helps attract attention from different vendors.
First, the paper notes that the data for scientific work is very different from traditional database data models. It usually has many more dimensions than a typical data table. The authors propose an array-indexing syntax as an SQL extension, which scientific database users want badly because it would be too tedious for them to name each column in a table or each data point in a set of measurements. Then, the paper discusses the structural operations required and ways of implementing those operators. It also argues that being open source is the answer to the lack of interest from large commercial vendors. The paper shows the authors' commitment to building a database system for scientific work; they say they will have a working prototype within two years of the paper's publication, which should be 2011. The use of such a database can go beyond scientific work. For example, much of the analysis of search queries is actually science analysis, too. Therefore, companies like eBay, Microsoft and many more might have a strong interest in such a design and invest in this effort.
This paper provides a great overview of the requirements specific to scientific data base users. Unlike most other computer science papers, it also says a little about the expected market success of such a scientific database. The design is required to be open source and extensible. Another good thing about this design is that such a scientific database can go far beyond science applications. For example, time-series analysis of search queries can take great advantage of the design to do multi-dimensional analysis, which attracts more attention from different vendors than ever.
It is not really a weakness if the paper only aims to provide an overview, a literature study or a guideline for designing a scientific database. But I would prefer to see a working prototype in this paper, not just some sketches. Well, I think this paper shows the privilege that a leader in the area can have. Overall, I think the paper conveys good motivation and confidence for designing a successful database system for scientific users.
SciDB is being developed because there is a need for a database that can process different data structures. For example, biology and genomics use graphs and sequences, but solid modeling uses meshes.|
There are two types of operators in SciDB. The structural operators are operators that do not depend on the data, such as sampling. And there are content-dependent operators, such as the Aggregate operator.
SciDB also has built-in support for language bindings.
Another feature of SciDB is that no entry is overwritten, because they are all important.
SciDB can be very important in the science world. It satisfies many science needs that are not supported in other databases.
This seems to be a paper that is written after the product has been on the market for a while. This paper is not really well structured. A set of features is described and there seems to be little connection between topics.
SciDB was created in response to requirements from the science community for a database that is better suited to the kind of experiments scientists conduct than typical transaction-oriented databases. It supports a multi-dimensional, nested array model with array cells that hold records, which can themselves contain multidimensional arrays.|
The authors begin with the thought that a typical table in a DBMS with a primary key is equivalent to a one-dimensional array, so an array model should be acceptable for most requirements. There are different kinds of arrays that can be created in SciDB. A physical array can be created by specifying what they define as a high watermark in each dimension. Arrays can also be unbounded if required. They also make provision for enhanced arrays, which need not have the same shape throughout and may have non-integer dimensions. Since researchers would like to know how a particular value was arrived at over multiple runs, they have provided a facility to store historical values for each of the cells in the array, which is quite similar to a slowly changing dimension in a data warehousing system.
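The historical-values facility can be pictured with a small Python sketch. This is only an illustration of the no-overwrite idea under my own assumptions: the class name `HistoryArray`, its methods, and the choice of a 1-based history index are hypothetical and are not SciDB's interface.

```python
from typing import Any, Dict, List, Optional, Tuple


class HistoryArray:
    """Hypothetical updatable array: a write never replaces a value,
    it appends a new entry along a per-cell history dimension."""

    def __init__(self) -> None:
        self._cells: Dict[Tuple[int, ...], List[Any]] = {}

    def write(self, idx: Tuple[int, ...], value: Any) -> int:
        """Store a new version of the cell and return its 1-based history index."""
        versions = self._cells.setdefault(idx, [])
        versions.append(value)
        return len(versions)

    def read(self, idx: Tuple[int, ...], history: Optional[int] = None) -> Any:
        """Return the latest value, or the value at a given history index."""
        versions = self._cells[idx]
        return versions[-1] if history is None else versions[history - 1]


arr = HistoryArray()
arr.write((3, 7), 20.5)           # first measurement
arr.write((3, 7), 21.0)           # corrected value; the old one is kept
latest = arr.read((3, 7))
original = arr.read((3, 7), history=1)
```

Reading without a history index gives the corrected value, while the first measurement stays retrievable, which is the property the slowly-changing-dimension comparison is pointing at.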
One of the advantages of SciDB comes from the fact that, besides providing typical SQL operators like aggregation, it also provides structural and content-dependent operators such as filter, which would really help. They also support multiple language bindings, which makes SciDB a good back end for scientific purposes. The functions and operators they have implemented reminded me of array manipulation functions in Python, so they are definitely trying to make it easier for researchers to use just the DB for their experiments. SciDB also provides for user-defined data types and functions, which gives the researcher more freedom to manipulate the database for experiments.
However, this paper seems to be looking forward and even though it claims a lot of things it has not implemented are simple, it is difficult to take it at face value. I would have really liked to see the kind of feedback they received from the science community after having tested the database and the results related to the same. This would have helped me understand the kind of impact this DB makes; is it actually better than just using a coding module on top of a typical database.
There were definitely parts of this paper where I liked the attention to detail: they tailor the system specifically for the scientific community and yet make it general enough to satisfy multiple groups. This paper definitely helps in understanding the specific and special needs of the science community, and feels like a step towards the database community being able to meet them.
Requirements for Science Data Bases and SciDB paper review|
In this paper, Michael Stonebraker argues that in extreme conditions like scientific research, current commercial DBMSs are inadequate.
And in most cases, what is needed is just a large array rather than the interface of a traditional relational database.
The major shortcoming of this paper is that the needs of scientists vary from one to another, so it is hard to form a market standard.
This paper details a list of requirements collected for a "scientific database" based on discussions between the scientific and database communities. While some users have needs that are more complex, the authors work towards providing a database that has arrays as the fundamental data model as this addresses the needs of a large portion of users. The scientific database should provide support for some features commonly found in relational database systems such as user-defined functions and limited forms of joins and operators. Scientific applications often require complex transformations to data, and as such, user-extensibility is a critical requirement. The authors also discuss design decisions made for SciDB, their attempt at a database that meets the described requirements and find that many of the requirements for scientific applications overlap with those for business analytics.|
The contribution of this work is in identifying a data model that fits the needs of the majority of the scientific community. The authors have identified that the array meets the needs of many scientific communities and provide examples of use-cases.
This work details reasonable solutions to many of the problems faced by the scientific community and challenges in solving those, but it has a number of shortcomings:
* The authors explain the need for user-provided data transformations, but do not provide integration with large-scale computation systems. Rather, they force users who attempt many classes of data transformations to re-invent such functionality or deal with the performance cost of not having it available.
* SciDB still requires that users write a mix of native and database-specific code. For instance, SciDB provides a number of aggregation functions and allows users to provide user-defined functions, but these are likely to be implemented in different languages using different data representations, which is not a desirable user experience.
This paper proposes a new way to store data called SciDB, which is optimized for scientific computation. In the past, the relational database has been a “one size fits all” database used for all industries and all workloads. However, we have found that by specializing databases, we can get significant performance increases. The problem with relational databases in a scientific context is that most relational databases have a primary key for each row. In this case, we can think of these rows as one-dimensional arrays. Scientific workloads, though, work best with multi-dimensional arrays, graphs, and sequences.|
SciDB allows users to create multi-dimensional arrays that can be named. For example, if we wanted to define a two-dimensional array, we would “define Array_Name(Array types)(dimensions)”. Instances of these arrays can then be changed through UDFs or user defined functions, written in C++. SciDB also has some predefined functions such as Sjoin, concatenate, cross product, and adding and removing dimensions. Finally, this database can also be used with non-scientific workloads such as eBay’s workloads.
I believe the following are the positives of this paper:
1. SciDB takes into account what the scientific community needs in a DBMS and implements the operations that the community needs.
2. The source code is open sourced so that the community can help improve and create new features that are needed for scientific calculations.
Overall, the paper does a good job of explaining how SciDB can help the scientific community and what advantages it has over relational databases. However, I have a few concerns about the paper:
1. The authors did not perform experiments about how queries on SciDB would compare to queries on traditional relational databases like MySQL. I would have liked to see if we would actually get a performance increase by using SciDB.
2. The authors did go over the data model, but it is still unclear how the internals of the database work and where SciDB saves computation time.
This paper discusses SciDB, an in-development DBMS that aims to meet the needs of the scientific community more closely than current relational databases. While the table-based structure of relational databases works well for the needs of many businesses, educational institutions, etc., it is not the most efficient structure for scientific data, which is often organized as arrays, graphs, or mesh models. The creators of SciDB chose to pursue an array model, as it was simpler to create than a graph or mesh model and also provided for the needs of scientific communities that were happy with tables. Because the scientific community does not draw nearly as much revenue as the huge corporations that major vendors develop for, SciDB was begun as an open source project.|
The basic data model of SciDB is the array, which may have any number of dimensions, each of which is addressed by contiguous integer values between 1 and the size of that dimension. Each cell in the array may contain one or more scalar values or arrays. SciDB supports user-defined functions and user-defined aggregates. This was a critical requirement, as scientific applications vary widely. Any DBMS that is intended to serve a wide range of scientific interests must be extensible to fit the very particular needs of each project. SciDB supports several unique operators, such as the ability to reshape the dimensions of an array, and the ability to join cells based on either structure or content. One of the most critical features of SciDB is a version dimension kept for all arrays. This provides the ability to keep track of transformations applied to data, which, in turn, allows for easy analysis of where errors occurred in a process, and easy reconstruction of data once the error has been corrected.
One of my chief concerns with this paper is that they never give a great example of why the table structure of relational databases is unsuitable for many scientific applications. The only statement offered is that simulating arrays on top of tables yielded poor performance, so I would have liked to see an example illustrating why this decrease in performance occurs. I also thought that the examples provided for the two join operations were very unclear, particularly the Sjoin example. They were drawing distinctions between joins based on index and joins based on content, but, for that example, the index and content of all the cells were identical, so it was difficult to grasp exactly what they were illustrating.
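For what it is worth, the distinction becomes visible as soon as index and content differ. The Python sketch below reflects my reading of the two joins, not the paper's definitions: `sjoin` matches cells whose indices are equal, `cjoin` matches cells whose values are equal and leaves the other positions empty. The function names, the dictionary representation, and the sample data are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Two hypothetical 1-D arrays with indices 1..3 whose contents differ from their indices.
A: Dict[int, int] = {1: 30, 2: 10, 3: 20}
B: Dict[int, int] = {1: 10, 2: 20, 3: 40}


def sjoin(a: Dict[int, int], b: Dict[int, int]) -> Dict[int, Tuple[int, int]]:
    """Join on structure: pair the cells that share the same index.
    The result is still 1-D, with concatenated cell values."""
    return {i: (a[i], b[i]) for i in a if i in b}


def cjoin(a: Dict[int, int], b: Dict[int, int]) -> Dict[Tuple[int, int], Optional[Tuple[int, int]]]:
    """Join on content: a 2-D result indexed by (i, j); a cell holds the
    concatenated values when they are equal and None otherwise."""
    return {
        (i, j): ((a[i], b[j]) if a[i] == b[j] else None)
        for i in a
        for j in b
    }


by_index = sjoin(A, B)
by_value = cjoin(A, B)
matches = {idx: cell for idx, cell in by_value.items() if cell is not None}
```

With this data the index join pairs up every position regardless of value, while the content join only fills the cells at (2, 1) and (3, 2), where the values 10 and 20 coincide.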
This paper talks about the requirements for Science Data Bases and SciDB.|
This paper is an introduction to SciDB, which was at the time a concept for an implementation of a database that Stonebraker wanted to make. Basically, Stonebraker didn’t like RDBMSs and thought they were bad for scientific data so he proposed a new structure that is entirely array based. This paper is entirely about the data model of SciDB that will be implemented. |
I won’t go too in depth really on the data model because that’s what reading the paper is for, but a high level overview is it’s no longer SQL syntax and they have created new operators. You can still do queries obviously, with things like “subsample” or “reshape” or “sjoin”. It also offers other joins and aggregation operators. One interesting thing is there are no overwrite capabilities because Stonebraker says scientists like to keep all their data for lineage purposes… In general it has all the capabilities you would need for a DB but definitely isn’t perfect, I’m sure it’ll get high performance though.
Amongst my complaints of this are first off I think they did a poor job of convincing me that arrays are better or preferred to current RDBMS setups. I’m not saying it’s not the case but they didn’t provide any real evidence against RDBMSs and why arrays would be better. Second off, I don’t feel like this is something that should be a paper. This is Stonebraker saying, “hey I have a new idea that I’m going to do”, and I don’t think the concept is nearly revolutionary enough to make it a paper. But I guess when you get big enough names on a paper you can somewhat write whatever you want, just an update on your ideas is an acceptable paper. Third, it seems like they just glossed over some potentially large problems SciDB could encounter and are saying “well this could be a problem, but forget about that and look at this shiny thing over here”. Lastly, I don’t like the formatting of the paper as just one big long section, but that’s just a personal thing.
On the plus side I think they have thoroughly thought out the operations and data model for SciDB and I’d imagine they have sufficient reason to believe the performance of this will be far greater than other systems. Also, the examples of the operations supported were clear and simple and I think they were very well thought out. Overall, it was another classic Stonebraker paper to read so it’s worth the read, as the guy is always thinking against the norm it seems like which can be great. I’m excited to hear the presentation and see what has become of SciDB.
According to the prophecy given by Stonebraker in a previous paper, many different niches in the database community are arising, each with a specific need that can’t be solved by a “one size fits all” paradigm. In particular, the motivation for this paper was developed through particular research in the scientific computing community, in order to (i) describe a general scheme of requirements for databases in the scientific community, and (ii) present a system that is tailored to such systems.|
The primary concern first addressed in the paper is the inadequacy of relational systems for scientific computing; people in the science research community don’t look at data simply as points. They prefer a holistic view of the data via functions/sequences/series/visualizations/graphs/etc. Their argument is that this can be done with an array-based data model. They do this by defining “enhanced” arrays: arrays that don’t necessarily have integer coordinates (e.g. cylindrical map geometry for topologies), for which they project pseudo-coordinates defined by mappings with user-defined functions. They then go on to describe the usefulness of shaping functions for data, which specify a geometric representation mapping coordinates to bounds in array data. All user-defined functions mentioned must also be defined in C++.
They then describe several operators on these enhanced arrays that are defined. For example, “subsampling” is relatively self-explanatory (they give an example of choosing slices along an X-Y dimensional array). Reshape is another interesting operator which, given an enhanced array along a set of dimensions, can change the number of dimensions as well as the shape of the data within the new dimensions. My interpretation is that this is basically a different implementation of the reshape() function in scientific computing languages like MATLAB or Octave. Aggregate and filter are other operators defined, which are also self-explanatory (e.g. aggregate can sum on a dimension, filter replaces values that don’t satisfy a predicate with NULL). They mention other rather non-intuitive operators, but neglect to give a description regarding functionality or implementation.
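The reshape and aggregate behaviour described above can be sketched in a few lines of Python. This is an illustration under my own assumptions, not SciDB's implementation: the functions `reshape` and `aggregate_sum` are hypothetical, arrays are plain nested lists, and cells are assumed to be laid out in row-major order, as in most scientific computing libraries.

```python
from typing import List

Grid = List[List[int]]


def reshape(grid: Grid, new_rows: int, new_cols: int) -> Grid:
    """Re-arrange the cells of a 2-D array into new dimensions,
    assuming row-major order; the cell count must be unchanged."""
    flat = [value for row in grid for value in row]
    if new_rows * new_cols != len(flat):
        raise ValueError("reshape must preserve the number of cells")
    return [flat[r * new_cols:(r + 1) * new_cols] for r in range(new_rows)]


def aggregate_sum(grid: Grid, axis: int) -> List[int]:
    """Sum along one dimension: axis 0 collapses rows, axis 1 collapses columns."""
    if axis == 0:
        return [sum(column) for column in zip(*grid)]
    if axis == 1:
        return [sum(row) for row in grid]
    raise ValueError("axis must be 0 or 1 for a 2-D array")


grid = [[1, 2, 3],
        [4, 5, 6]]

tall = reshape(grid, 3, 2)           # 2 x 3 becomes 3 x 2
column_sums = aggregate_sum(grid, 0)
row_sums = aggregate_sum(grid, 1)
```

The reshape only re-addresses the same six cells, which is why it counts as a structural operator, whereas the aggregate has to read every value.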
Their discussion of extensibility of the project is rather straightforward; they discuss needs of the scientific community such as (i) needing bindings in other languages (since many researchers use MATLAB, Python, IDL, etc), (ii) needing permanence in data (i.e. no overwrites), which requires metadata to track versioned parts of data, (iii) having an affinity for open source data, (iv) having an efficient storage system for non-trivial data formats (i.e. distributed systems for spatial data). They also mention having efficient disk partitioning, but I don’t know how deeply scientific researchers need to care about this other than “it needs to be fast.” This is similar to their other point about needing to operate on data without having to load everything with high overhead, which seems to be more relevant to me. Regarding versioning, they also found that being able to label versioned data with something other than time is useful (i.e. studying humidity for the time with least cloud overhead), and following computations that result from specific versions (provenance), which seems like an intuitively good idea that doesn’t appear to have been too difficult to implement. They then describe a method for collecting and summarizing nondeterministic data, and give a very brief overview of an application to eBay’s business model.
However, they do claim that SciDB will have its usefulness in non-scientific applications, but do not give a convincing description of exactly how (besides an extremely specific use case from one of their funders, eBay). I found the description of “shapes” and “raggedness” hard to understand in their explanation. If they could add some illustration or better description of an actual example they used with these terms, I think it would demonstrate their need more clearly (after all, scientists need visualizations). In addition, I can see the fact that user-defined functions must be defined in C++ as a problem for researchers, as many people in the research community should not (in my personal opinion) be forced to use a language as technical as C++. However, as this is a concept paper, I think it’s actually really nice that they came up with a list of grievances/desires from a significant portion of the DB computing community. Each point was well-founded, and they seem confident in their ability to implement some of these features. I also think having a graphical interface for browsing/summarizing the data will be important, and it wasn’t really addressed in the paper.
This paper is meant to be the result of requirement identification for science data bases (SciDB). The motivation behind the paper is the inadequacy of commercial database offerings for science data. Since the data model used for science leans toward multidimensional data (unlike the one-dimensional data that is often used in commercial databases), there needs to be a “general database” that can accommodate scientists from various branches of science.|
The whole paper is spent explaining, in detail, the structural and functional requirements of SciDB. The structural requirements concern the appropriate data model. Here the regular relational model would not suffice, since science data are often multidimensional. Therefore, the authors decide to use a multidimensional array with functions such as Define, Create, Enhance, and Shape. Arrays can also be unbounded. The required operators are structural operators (which do not necessarily read the data values of the array, e.g. Subsample, Reshape) and content-dependent operators (e.g. Aggregate, CJoin, Apply, Project). These array operations must also be extendable, so that users can add their own functions (UDFs) and data types. SciDB must accommodate various programming languages, so it should have a parse-tree representation for commands. There will also be no overwriting of data: old values stay, and each array has a history dimension. The system must be open source and grid-oriented. There must be storage within each node, and the system will use “in situ” data so there is no waiting for data to load. The system should integrate the cooking process (the goal is to enable cooking inside the engine) and support named versions. Data derivation must also be repeatable (data can be recreated). SciDB must also support an uncertainty model, for example an “uncertain x” for any data type x available in the engine. The paper briefly mentions the potential use of SciDB outside science, namely at eBay (to answer the question “how relevant is the keyword search engine?”). Lastly, the paper signals the creation of a science benchmark.
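As an illustration of what an “uncertain x” type could look like, here is a minimal Python sketch for x = float. It is my own assumption, not the paper's design: the class name `Uncertain`, the choice of a mean plus a standard deviation as the representation, and the rule that errors are independent (so variances add under addition) are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Uncertain:
    """Hypothetical "uncertain float": a value with a standard deviation as its error bar."""

    mean: float
    std: float

    def __add__(self, other: "Uncertain") -> "Uncertain":
        # Assuming the two errors are independent, variances add,
        # so the combined standard deviation is sqrt(std1^2 + std2^2).
        return Uncertain(self.mean + other.mean, math.hypot(self.std, other.std))


reading_a = Uncertain(10.0, 3.0)
reading_b = Uncertain(20.0, 4.0)
total = reading_a + reading_b
```

Here `total` carries a mean of 30.0 and a standard deviation of 5.0, so the error bar travels with the value through the computation instead of being tracked by hand in application code.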
The main contribution of this paper is a thorough explanation of the requirements for a science database (SciDB). Although users (in this case, scientists) know what kind of database system is needed, it is very helpful to have it written down and explained. In future, this requirements document can serve either as a basis for building a similar project or as a basis for further development of such a database.
However, since SciDB strongly supports UDFs without settling on a single programming language (meaning users can write them in various languages), I think the paper should also consider how to reconcile the UDFs. In this case, the system would also need to recognize whether a specific functionality has already been built in a specific programming language, so that it can at least notify the user. Users may be partial to a specific programming language, or the functionality may be unique, but at least the existing UDF can serve as a reference. Also, I think it would be better if the writers also explained the technical requirements in this paper (although I know that a later paper on SciDB provides a more complete view of this).
The purpose of this paper is to present standardized requirements for their science data base system called SciDB such that these requirements are applicable to a large collection of scientific database uses in a variety of fields. The goal of these standardized requirements is to avoid building similar databases from the ground up for every new science application as was the trend at the time.
A main technical contribution of this paper is the discussion of a database that is based on an array data model rather than the table-based data model on which most DBMSs are currently based. They support a multidimensional, nested array model where the values in the cells of the array are the records of the database. The nested part of this definition implies that the record contained in a cell of a multidimensional array could itself be a multidimensional array. They present a language for specifying these arrays as well as Enhanced Arrays, which are arrays that can be scaled, translated, or have non-integer dimensions. SciDB also supports UDFs (as in POSTGRES), but the arguments given to these functions can be arrays rather than simply scalars. These UDFs can be used to create the enhanced arrays referenced above. To accompany this new data model, SciDB also introduces new operators: structural operators (such as reshape and structure join) and content-dependent operators (such as Filter, which filters an array based on a provided predicate). There are other, more minor contributions that we have read about in Stonebraker's other works that will be utilized in SciDB as well.
As far as strengths go, I do think a lot of the things they are introducing in SciDB that are different from a traditional DBMS exist in some form or another in another less-frequently-used DBMS. These features are thus somewhat developed and proven to work, and the authors of this paper aren’t just grasping at air when discussing these features.
As far as weaknesses go, I think the structure of this paper is a little bizarre. The section headings do not have sensible titles and thus it was a little hard to follow the general flow of the information being presented. I think this might be a different kind of article (rather than the typical conference paper we might be reading), but that doesn’t give it the excuse to be as difficult to follow as it was.
Paper Review: Requirements for Science Data Bases and SciDB|
This paper proposes an upcoming database called SciDB. SciDB is built on requirements assembled from a collection of scientific data base users in multiple fields of science. The paper discusses the requirements it collects and identifies, and gives a brief description of the SciDB design. As described in the paper, the motivation for this work comes from the varying DBMS data representation needs of all kinds of scientific fields. The conclusion is that a mixed, specialized DBMS is needed, which is also what this project tries to achieve.
The rest of the paper focuses on introducing some design details of the system.
Strength: The work is clearly necessary, which means there is a great need for such an application. The paper's findings are valuable because they provide a generalization, although perhaps not a very broad one across different scientific fields, of what is needed to make a good DBMS for scientific research.
Weakness: This is a paper proposing a prototype application without much performance demonstration or comparison, which leaves readers to guess how well this model will perform. Some plan for testing, comparison and future development would help readers get a better sense of where the project is going.
This paper introduces SciDB, a DBMS specifically designed for solving scientific problems. It uses multidimensional arrays as the data model for representing data in scientific applications. The paper also provides a complete survey of scientific database requirements, both for common applications and for specific areas. It defines two kinds of operators to support database operations: 1. structural operators, such as subsample and reshape; 2. content-dependent operators, such as filter, aggregate and content-based join. Moreover, it shows that users can define their own database operators and data types. At the end, the paper says that it plans to provide some new benchmarks for scientific databases in the future.
1. This paper does a great survey on requirement of scientific database system, it shows
2. This paper defines two kinds of operators that incorporate two data sources to provide back-end solutions for a variety of applications.
1. One of the major weaknesses is that this paper does not provide a complete evaluation of the SciDB system; however, according to the paper, a benchmark is planned as part of future development.
2. A lot of the work had not been done when the paper was published. Although it is exciting to see that SciDB could solve so many problems in scientific computing, it is not clear at this stage that it can really achieve that.
This paper discusses the requirements for a science database and introduces the authors' work on designing SciDB. The scientific research community has a few special requirements that traditional RDBMSs do not meet. To address this issue, the authors summarize those requirements in this paper. In addition, they are designing a scientific DBMS called SciDB, and they share their experience. Most scientific users are not satisfied with the relational model, as it doesn’t naturally match their data. It is observed that a large subset of scientific users have data that fits an array data model. This paper proposes a multi-dimensional, nested array model with array cells containing records, which in turn can contain components that|
are multi-dimensional arrays. More specifically, arrays can have any number of dimensions, which may be named. Each dimension has contiguous integer values between 1 and N. Each combination of dimension values defines a cell. Every cell has the same data types for its values, which are one or more scalar values and/or one or more arrays. SciDB will also support POSTGRES-style UDFs to enhance the functionality of arrays. SciDB mainly supports two categories of operations, structural operators and content-dependent operators. Structural operators create new arrays based solely on the structure of the inputs. Content-dependent operators perform operations that depend on the data stored in the input array. Unlike RDBMS users, science users have a special need for regridding arrays and performing sophisticated computations and analytics, so the fundamental array operations in SciDB are user-extendable. SciDB uses language bindings to support disparate languages. In contrast to most commercial systems, which overwrite old data with new data, science users require a no-overwrite storage manager. In addition, SciDB also supports grid orientation, named versions, etc. It is promising that there are users outside the scientific community, such as eBay, who also want the aforementioned features.
The main contribution of this paper is a systematic investigation of the needs of scientific database users. Based on these observations, the paper proposes an array data model to address those needs.
However, this paper didn’t explain how to efficiently store the multidimensional array on disk.