This paper reports scientists' requirements for databases and sketches parts of the SciDB design. The requirements discussed include:
1. Data model: a majority of scientific users want an array data model, along with functions such as shape.
2. Operations: both structural operators, which depend only on the input's structure, and content-dependent operators, which depend on the data stored in the input array, are needed.
3. Extendibility: array operations should be user-extendable.
4. Language bindings: there is no consensus on a single programming language.
5. No overwrite: when updating values, old values should be retained for provenance purposes.
6. Open source: an open-source system is easier for people to adopt.
7. Grid orientation: data partitioning should be allowed to change over time.
8. Storage within a node: a partition should be decomposed into disk blocks.
9. "In situ" data: avoid the overhead of loading data.
10. Cooking: decide when to convert sensor information into standard data types.
11. Named versions: versioning of data makes it easier to track what went wrong.
12. Provenance: repeatability of data derivation is required.
13. Uncertainty: support for uncertain data elements is requested.
There are two reasons I do not like this paper: 1. The paper introduces user requirements across different sections and provides solutions alongside them, which is too scattered and hard to follow. I would prefer listing the requirements and their rationales at the beginning of the paper and then introducing solutions to them. 2. Some section titles are not easy to understand, for example "Integration of the Cooking Process". After careful reading I understand what cooking means, but the title alone is not intuitive at all; I think this is bad titling practice. Despite these complaints, the paper shows how the SciDB design was generated from actual requirements, and it is interesting to see how the group designed data structures and operations from scratch, starting from user requirements. |
Current commercial DBMSs cannot fulfill the demands of some scientific database users. Scientific users in fields including astronomy, particle physics, fusion, remote sensing, oceanography, and biology have proposed several requirements. This paper reports the requirements that were collected and identified, and presents the design of SciDB, a database designed to meet a common set of requirements from scientific users. Some contributions and strengths of this paper are: 1. SciDB supports customized data models for scientific users who are not satisfied with traditional table or array data models. 2. SciDB allows users to add their own data types, which makes it possible for scientific users to perform sophisticated computations and analytics. 3. SciDB supports multiple languages, including C++, Python, Ruby, and so on, by mapping each language-specific representation to a common parse-tree representation. Some drawbacks of this paper are: 1. The paper is not organized into well-structured sections, and some of the requirements are presented in isolation, unrelated to each other. 2. Most of the requirements remain at the concept stage without detailed implementations. 3. The paper lacks experimental results and performance comparisons with existing popular DBMSs. |
A collection of scientific database users kept complaining about the inadequacy of current commercial DBMS offerings, and although many researchers have worked on science databases for years, there was still no common set of requirements across science disciplines. This paper therefore defines those requirements, presents a detailed design exercise, and sketches parts of the SciDB design. First, the data model in most scientific scenarios is an array model, chosen primarily because it makes a considerable subset of the community happy and is easier to build than a mesh model. Specifically, it is a multi-dimensional, nested array model with array cells containing records, which in turn can contain components that are multi-dimensional arrays. The operators that accompany the proposed data model fall into structural operators and content-dependent operators. Structural operators create new arrays based purely on the structure of the inputs; in other words, they are data-agnostic. Content-dependent operators are those whose result depends on the data stored in the input array. To support the disparate languages its users prefer, SciDB will have a parse-tree representation for commands, plus multiple language bindings that map from each language-specific representation to this parse-tree format. The main contribution of this paper is that it is the first to define the basic requirements for scientific databases across several science disciplines, and it can thus serve as guidance for researchers. Besides, the authors also decided to build SciDB, recruited an initial programming team, and started a non-profit foundation to manage the project, which can definitely help scientific users, since it is not feasible for commercial vendors to invest in this non-profit corner of the database market. The main advantages of the proposed model are as follows: 1. It suits large-scale datasets. 2. It performs best when the workload mostly consists of write-once (or slowly accumulating) data that is read many times. 3. It offers the ability to store large volumes of multi-dimensional data and perform computation on that data in the form of database queries. 4. It provides support for data versioning and provenance, an oft-cited feature requirement from scientists. 5. It may be possible to extend SciDB's array-based data model to how images are digitally represented, and to apply SciDB to image and video processing tasks. One thing to note is that this paper targets scientific databases, so we cannot judge it from the perspective of the most common databases; in particular, these requirements do not cover OLTP demands. In other words, the proposed database does not support transactional workloads, only analysis, much like OLAP, which could be considered a "drawback" of the model. |
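The multi-dimensional, nested array model described in the review above can be sketched with NumPy structured dtypes: a grid of cells where each cell holds a record, and one record field is itself a small array. This is an illustrative toy, not the actual SciDB API; the field names are invented.

```python
import numpy as np

# Each cell holds a record: a scalar temperature plus a 3-sample window
# (a nested 1-D array inside the cell).
cell_dtype = np.dtype([
    ("temp", np.float64),          # scalar attribute
    ("window", np.float64, (3,)),  # nested array component
])

grid = np.zeros((4, 4), dtype=cell_dtype)   # 4x4 array of record cells
grid[1, 2]["temp"] = 21.5
grid[1, 2]["window"] = [21.4, 21.5, 21.6]

print(grid[1, 2]["temp"])   # 21.5
```

The point of the sketch is that "cells containing records, which in turn contain arrays" composes naturally, which is exactly what makes simulating this model on flat relational tables awkward.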
Scientists often have to use databases and other data management technologies to handle the immense amounts of data that their research produces. Often, however, they are forced to make do with systems that were developed with industry concerns in mind, rather than the requirements specific to scientists, due to various historical reasons. For example, scientific data management is a “zero billion dollar” industry, which makes it nearly impossible for large commercial vendors to justify the time and expense of making a scientific research-focused system. This paper attempts to bridge the gap by specifying the various requirements for scientific databases, to make it easier for vendors and other organizations to develop products specifically targeting the scientific community. In particular, the product that the authors hope to develop out of this is termed SciDB. Potential users of SciDB span various scientific disciplines, from particle physics to biology to remote sensing to astronomy to oceanography, and are mostly represented by universities and large research institutions. The first major difference between scientific users and industry is that most of their data does not fit particularly well into the table format that is the cornerstone of relational DBMSs. In fact, arrays tend to be the natural data model for a lot of these disciplines, rather than tables. Other domains like genomics would be happy with neither, and prefer graph and sequence representations instead. This means that a “one size fits all” approach does not work; instead, DBMSs will have to be specialized to individual needs. This paper focuses on the array data model, due to practical considerations, where each array can have an arbitrary number of dimensions and each combination of dimension values defines a cell. Data values are scalar values, and each cell has the same data type across all of its associated values. 
Arrays can be defined, and then multiple instances can be created. SciDB will also have user-defined functions (in C++). One concept that SciDB introduces is the idea of enhancing arrays, i.e. performing various transformations such as transposition, scaling, translation, etc., which are done by user-defined functions. SciDB includes a wide range of operators, which fall under two general categories: structural and content-dependent operators. Structural operators create new arrays purely based on the structure of the input, i.e. they are data-agnostic. This allows for a greater degree of optimization, since the values of the data do not have to be taken into account. Examples of structural operators include subsample and reshape. On the other hand, content-dependent operators depend on the data stored in the array. Examples of this class of operators include filter and aggregate. As before, the set of operators and data types is extendable, and users can add ones more suited to their specific disciplines. As for language, SciDB has a parse-tree representation for commands, followed by multiple language bindings. The reason for this is to support the disparate languages, from C++ to Python to MATLAB, that different scientists prefer. Other important features of SciDB include a no-overwrite design, where old data is usually not deleted (for lineage purposes), the open-source nature of the project, grid orientation, storage within a node, and “in situ” data, where the overhead of loading data is minimized. The main strength of this paper is that it is probably the first of its type to attempt to gather scientific professionals together in order to figure out their specific requirements for a database management system. The insights gathered are very interesting and shed light on how different scientific research is from more conventional business-oriented transaction or analytical processing. 
By taking this initial step, the hope is that SciDB or other follow-on systems will be developed to better serve the scientific community. Overall, the paper is very comprehensive, but since it is well written, it is straightforward to comprehend. The biggest weakness, in my opinion, is that the scientific community is so fragmented in terms of requirements and preferred programming languages, to name a few areas. As previously mentioned, science as a whole is a “zero billion dollar” industry, and further splitting that up decreases the business incentive to develop systems specific to it. Also, SciDB has to make a lot of compromises in order to be flexible enough to satisfy enough users, as was shown in how it was designed to handle multiple programming languages. While that is the practical reality, it is questionable whether the overall product is able to perform as well as initially envisioned, given that it is being pulled in so many directions at once. |
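The structural vs. content-dependent split that this review highlights can be made concrete with a small sketch (illustrative only; the function names are invented, and SciDB's real operators work on its own array type, not NumPy):

```python
import numpy as np

a = np.arange(16).reshape(4, 4)

# Structural operator: the result depends only on the array's shape and the
# requested region, never on the stored values (data-agnostic). An optimizer
# can plan this without reading any data.
def subsample(arr, row_slice, col_slice):
    return arr[row_slice, col_slice]

# Content-dependent operator: the result depends on the stored values, so
# the data must actually be read.
def filter_cells(arr, predicate):
    return arr[predicate(arr)]

corner = subsample(a, slice(0, 2), slice(0, 2))   # 2x2 upper-left block
big = filter_cells(a, lambda x: x > 10)           # the values 11..15
```

The optimization point follows directly: `subsample`'s output shape is known from the inputs alone, while `filter_cells`'s output size cannot be known without scanning the data.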
This paper is a requirements paper for science databases as well as for the authors' proposed science database architecture, called SciDB. The author starts with a history in which Michael Stonebraker and his colleagues at a conference defended the current DBMS systems, which were under attack by the advocates of scientific databases. The author also draws attention to the fact that the space of scientific databases does not have a lot of market potential, so it does not get the support it needs from the corporate community. The author then gets into the data model of a science database. This section introduces all of the possible models: an array data model, a table data model, and a mesh data model. These data models each serve different subsets of the scientific community, but the array data model serves the largest portion of the community and is easier to develop than a mesh model. The paper describes how arrays can be defined once and instances created multiple times thereafter, as in SQL. The paper also describes the Postgres-style UDFs that allow you to enhance your array by adding pseudo-coordinates to it. The author mentions that SciDB will also come with a few shape functions implemented that will allow the user to digitize shapes like circles using arrays. Next the paper looks into structural operators, which create arrays based on the structure of the inputs. The first one we look into is Subsample, which takes an array and a predicate over the array and returns the subset of elements that satisfy the predicate. The next example is Reshape, which converts the input array to a new array with a different shape; it can modify the dimensions of the array accordingly, but not the number of cells. Next the paper looks into content-dependent operators, whose results depend on the data stored in the input array. The author then proposes that SciDB will be user-extendible, like Postgres. 
The next requirement is that a science database should not overwrite old data but should instead keep a lineage of the data, so scientists can see what a value was at different points in time. The paper also suggests that the DBMS should be open source so that the community can work together to manage and support the codebase. They also require named versions so that scientists can access data from a large set at exactly the time or range that is useful to them. We also want provenance so we can recover data based on the lineage of steps used to get the data to its state. They also require that the DBMS be usable in non-scientific applications and that a benchmark be developed, which they say is in progress for SciDB. I thought that in the earlier stages, the paper told an interesting story to lead the reader to the main content of the paper and provide context. However, I felt that when the technical information was presented, it was a bit hard to distinguish the history from the author’s contribution, at least in the earlier sections of the Requirements part of the paper. This was understandable, since the paper is more high level as a requirements paper. |
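The pseudo-coordinates mentioned in the review above (UDFs that let users index an array by non-integer values) can be sketched as a simple mapping from a real-world coordinate to the underlying integer index. The grid geometry and function names here are invented for illustration; SciDB expresses this with user-defined functions rather than Python code.

```python
import numpy as np

readings = np.array([3.1, 2.7, 4.4, 5.0])   # one reading per grid row

LAT_ORIGIN, LAT_STEP = 40.0, 0.25           # assumed grid geometry

def lat_to_index(lat):
    """Translate a latitude (a pseudo-coordinate) into the integer
    index of its grid row."""
    return round((lat - LAT_ORIGIN) / LAT_STEP)

print(readings[lat_to_index(40.5)])          # row 2 -> 4.4
```

The user keeps working in the natural units of the domain (latitude, wavelength, time) while the storage layer still sees plain integer indexes.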
Requirements for Science Data Bases and SciDB: This paper presents several requirements gathered from scientific researchers and database users in scientific areas, and provides some design ideas for a science database. Overall, the requirements of scientific users are quite different from those of everyday database users, and the requirements are mixed, meaning a single typical design will not satisfy everyone; the design of SciDB therefore needs to consider all of these requirements and needs. More than ten design ideas are listed in this paper. For the data model, scientific users want different kinds of data storage and significant ability to manipulate arrays, so SciDB designs an array-of-cells mechanism and provides user-defined functions so that users can define their own data types. SciDB also provides multiple data/array manipulation operators, and its user-defined function mechanism makes it extendable. For language bindings, different users have different views: some prefer C++ while others prefer Python, so SciDB has a parse-tree representation for commands with multiple language bindings on top. Scientific users do not want to overwrite data for fear of losing history; to handle this, SciDB adds another dimension (history) to record the data from previous points in time. Like MySQL and Postgres, SciDB is open source for the community. Some users have very large datasets, such as LSST, whose data changes over time, so SciDB not only partitions data across a grid but also supports dynamic partitioning (changing over time). With regard to storage within a node, there are optimizations designed specifically for that. Another complaint from scientific users is that loading data is slow, which SciDB addresses by handling "in situ" data. For the cooking process, there are typically two approaches. 
You can process the data before it enters the database, or you can first record the raw data in the database and then process it. The second method is better because "cooking" is an information-losing process, and you want to keep the most accurate information about your data. At the same time, another problem arises: different people, at different times, may apply different algorithms to the same chunk of data; SciDB adopts named versions to handle these different uses of the same data. Next, if something goes wrong, users want to know what procedure led to the wrong answer and which data were affected. Furthermore, SciDB can be used not only in scientific areas but also in non-scientific ones, such as at eBay. The contribution of this paper comes from the group of people who designed and are developing SciDB; keeping in mind that this is a non-commercial project, what they did will benefit scientific users and others, and will help move the science field forward faster. The advantages of SciDB are: 1. it addresses the concerns and requirements of scientists; 2. it is non-profit and open source, which benefits many people; 3. the design is excellent, making computation faster. One drawback may be that, as an introductory paper, it lacks detail on some implementations, such as the algorithm for dynamic partitioning and how "in situ" data is handled in detail. |
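The no-overwrite idea this review describes (an extra history dimension so that an "update" keeps the old value) can be sketched as appending (time, value) pairs per cell instead of mutating in place. This is a toy under invented names, not SciDB's actual storage design.

```python
# (row, col) -> list of (time, value), oldest first
history = {}

def put(cell, time, value):
    """An 'update' appends along the history dimension; nothing is lost."""
    history.setdefault(cell, []).append((time, value))

def get(cell, as_of=None):
    """Return the latest value at or before `as_of` (default: newest)."""
    versions = history[cell]
    if as_of is None:
        return versions[-1][1]
    return max(v for v in versions if v[0] <= as_of)[1]

put((0, 0), time=1, value=10.0)
put((0, 0), time=5, value=12.5)   # replaces the visible value only
print(get((0, 0)))                # 12.5
print(get((0, 0), as_of=3))       # 10.0
```

The `as_of` query is what gives scientists the provenance property the review mentions: the value a computation saw at time 3 is still recoverable after later corrections.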
This paper specifies a common set of requirements for a new science database system and briefly sketches the design of SciDB. The requirements come from particle physics, remote sensing, astronomy, and oceanography. While most scientific users can use relational tables, tables are not a natural data model that closely matches their data. The Sequoia 2000 project realized in the mid 1990s that its users wanted an array data model, and that simulating arrays on top of tables was difficult and resulted in poor performance. Moreover, some users (for example in biology and genomics) are happy with neither a table nor an array data model. This project explores an array data model, primarily because it makes a considerable subset of the community happy and is easier to build than a mesh model. It will support a multi-dimensional, nested array model with array cells containing records. The paper then introduces some of the operators used with this data model. There are mainly two kinds: structural operators and content-dependent operators. The structural category creates new arrays based purely on the structure of the inputs. The content-dependent category contains operators whose result depends on the data stored in the input array. The paper then points out several essential requirements for SciDB. It should be extendible; for example, users can add their own data types. Secondly, SciDB should not be bound to a single language; this can be completely avoided by the language-embedding approach. What's more, it should be open source and grid oriented. To ease the burden of loading large-scale data into a database, SciDB defines its own data format and writes adapters for commonly used external number formats. As long as there is an adapter for the user's data, SciDB can be used directly without loading the data. SciDB loads raw data using custom functions (UDFs) and data manipulation processes. 
The user makes specific changes to a portion of the array while leaving the rest unchanged. Data in the scientific field is generally imprecise, and SciDB supports data together with its error bounds. The main contribution of this paper is that it introduces SciDB and describes its user requirements. One weak point is that the paper does not introduce SciDB in a systematic way and says nothing about its performance. |
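The "in situ" adapter idea from the review above (query external files in their native format rather than bulk-loading them first) can be sketched as a reader that seeks directly into a raw binary file. The adapter interface and file layout here are invented for illustration.

```python
import struct

def float32_adapter(path, index):
    """Read the index-th little-endian float32 from a raw binary file
    without loading the whole file into the database."""
    with open(path, "rb") as f:
        f.seek(4 * index)            # each float32 is 4 bytes
        return struct.unpack("<f", f.read(4))[0]

# Write a sample external file, then read one value in place:
with open("samples.bin", "wb") as f:
    f.write(struct.pack("<4f", 1.0, 2.5, 3.0, 4.5))

print(float32_adapter("samples.bin", 2))   # 3.0
```

With an adapter per external format, a query engine can treat such files as arrays without the load downtime the reviews repeatedly cite as a pain point.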
This paper describes the requirements for a database designed for users in the scientific community. There was a discussion about this at XLDB-1 in October 2007, and Stonebraker and DeWitt agreed to build it if a common set of requirements could be defined; these are those requirements. Before this database was proposed, most scientific users had built one-off databases for individual projects. There are many requirements that differ from traditional databases, but due to the length of this review I will focus on a select few. One that is quite interesting is the data model. The agreed-upon data model is arrays; these look somewhat like a NumPy array, which can have multiple dimensions. Additionally, UDFs can be defined on arrays. Because of the data model and other requirements of the project, the traditional SQL language is not desired. Therefore, new operations are described, which fall into two different subsets: operations that manipulate the structure of the data and operations that manipulate the data itself. This was an interesting separation that I had not thought of before in a database context. Some possible operations are reorganizing the dimensions of the data or performing a join. Additional important requirements include, but are not limited to, flexible language bindings, open source, and running on a shared-nothing cloud. I think that the authors did a fairly good job laying out all of the requirements. Additionally, I noted that they mention that scientific DBMSs are a “‘zero billion dollar’ industry.” However, they draw parallels with use cases at eBay near the end of the paper, for complex business analytics at web scale. Therefore, in my mind, since these requirements match the use cases at a place like eBay, this certainly *isn’t* a zero billion dollar problem. I’m curious about what this space looks like in 2018, with more NoSQL options and specific time-series databases. 
I felt as though some areas were explored too closely for a requirements paper, while others whose absence was noted should have been included. On page 2, there is a lot of description of syntax, which feels unnecessary for a paper on high-level requirements. Meanwhile, on page 6, issues like which compression schemes to use are pushed to a later point. In my opinion, this allocation of detail should have been flipped. I also noticed that the chosen data model basically ignores the desires of biology, genomics, and chemistry (which want graphs and sequences as their data models), fields that to me represent many scientific disciplines. Although the authors say this is to satisfy a large subset of the use cases, I wonder how these disciplines' use cases will be addressed in the future, because it seems unrealistic to call this a database for all of science without them. |
The contribution of this paper is a set of requirements that is crucial when designing a new science database system, or SciDB for short. Users doing scientific computing, who typically have extreme database requirements, complained about the inadequacy of current commercial DBMS offerings. Building custom software for each new project clearly will not work in the future, since the software stack will become too big and complex. To meet these users' needs, we need to know their requirements for the new database system. The first and most important requirement is a more flexible data model. Traditional DBMSs only provide tables with a primary key, which is merely a one-dimensional array, while biology and genomics users want graphs and sequences. To address the broader need, the authors propose an array data model in which an array can have any number of (named) dimensions. Each combination of dimension values defines a cell, and a cell can store one or more scalar values or arrays. Enhanced arrays, which allow basic arrays to be scaled, translated, and given irregular boundaries and non-integer dimensions, can be created via user-defined functions. Updates to the values of an array should not overwrite previous versions, so a history dimension must be added to every updatable array. Users should also be allowed to create named versions of their data, since a user may want to make only a few modifications to a dataset and obtain a new version. Operators applied to arrays fall into two categories: structural operators, which are based purely on the structure of the inputs, and content-dependent operators, whose results depend on the data stored in the input array. One important property of array operations is that they must be user-extendable, as users typically want to perform sophisticated computations and analytics. 
Other requirements include open source, multiple language bindings, a simple model of uncertainty, a history that remembers how an array is derived, etc. This paper does a good job of presenting these requirements for SciDB. However, I think it would be better to also provide background on existing solutions to this problem and show how those methods failed to meet customers' needs. |
In the paper "Requirements for Science Data Bases and SciDB", Michael Stonebraker and Co. identify a common set of requirements across disciplines in the field of "Big Science" in order to develop a new science-oriented database system. Around 2007, many users with extreme database requirements complained about the inadequacy of current commercial DBMSs, whose offerings did not support the workloads and grand scheme of "Big Science". Thus, Stonebraker and his companions decided to tackle this "zero-billion dollar" industry, since getting the attention of large commercial vendors didn't seem likely. Since academia is responsible for much of humanity's "progress", designing a system that can assist it is not only interesting but an important problem. The requirements are split into several sections: 1) Data Model: Even though most scientific users can use relational tables, tables (and SQL) don't actually suit their workloads. An array/graph/mesh data model is preferred (by different fields). In order to satisfy the largest subset of users in the scientific community, the array model is used; specifically, a multi-dimensional nested array model with array cells containing records. This array model can be used to model complex behavior in systems. 2) Operations: There are structural operators and content-dependent operators. Structural operators create new arrays based on the structure of the inputs. A content-dependent operator's result depends on the data that is stored in the input array. 3) Extendibility: Much like in Postgres, users like defining their own operations through user-defined functions in order to do complex aggregates and analysis. 4) Language Bindings: There is a desire for persistence of large arrays, as in C++. However, some want Python while others want MATLAB. 
In order to support all these interfaces, there needs to be a parse-tree representation for commands (a bridge between the different languages). 5) No Overwrite: Scientists want to keep data and fix it in the case that it is wrong. Thus, there are no overwrites, and arrays are marked as "updatable" instead. 6) Open Source: Unless the project is open source, no one will contribute and there will be no traction in the community. 7) Grid Orientation: A shared-nothing architecture should be used to store large volumes of data. This makes load balancing a bit harder, but it is still doable. 8) Storage Within a Node: Data is streamed into main memory and written to disk when memory is full, for performance gains. 9) "In Situ" Data: There must be a way to deal with loading large amounts of data. This is one of the major pain points for scientists: the downtime before actual analysis. Other operations should remain possible during load times. 10) Integration of the Cooking Process: "Cooking" data within the DBMS vs. doing it externally is a big area for debate. 11) Named Versions: Different named versions support different cooking processes over the same data, which supports different workloads. 12) Provenance: The derivation of an array should be tracked via a log in order to extract any lessons learned. 13) Uncertainty: All scientific data is plagued with imprecision, so there is strong demand for supporting distributions on data elements (error bars). Much like other Stonebraker papers, this one had some drawbacks. One criticism I have is the approach Stonebraker takes when organizing the structure of his paper. He tries really hard to generalize a problem when, in reality, it is usually a case-by-case basis. I believe that most of his views are valid, but some people may value one attribute over another; attribute A may not be the most compelling factor. 
Another drawback is the lack of an actual configuration/development of a system with these requirements. I would have liked to see the performance benefits that this type of system would have in comparison to mainstream database vendors. This would have strengthened the earlier claims made by users, as well as given rise to support for and contribution to scientific databases. |
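The language-neutral parse-tree representation described in the reviews above can be sketched as a small node type: each binding (Python, C++, MATLAB, ...) would lower its native syntax to such nodes, and the engine executes the tree. The node and operator names here are invented for illustration, not SciDB's actual command format.

```python
class Node:
    """A command parse-tree node: an operator, child subtrees, and arguments."""
    def __init__(self, op, *children, **args):
        self.op, self.children, self.args = op, list(children), args

    def __repr__(self):
        kids = ", ".join(map(repr, self.children))
        return f"{self.op}({kids}, {self.args})"

# A hypothetical Python binding might lower
#   A.filter(temp > 20).aggregate("avg")
# into this language-independent tree:
tree = Node("aggregate",
            Node("filter", Node("scan", array="A"), predicate="temp > 20"),
            func="avg")
```

Because every binding targets the same tree, the engine and optimizer are written once, and adding a new language only requires a new front end.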
This paper details the requirements and creation of SciDB, a DBMS built specifically for scientific usage. Standard DBMSs using SQL don’t work well for many scientific applications, but scientists still need a standardized system to work with. Different scientists need different data models, but an array data model worked well for most of them. In this data model, the equivalent of a table is an array of arbitrary dimensions, each of arbitrary size. An index into all of the dimensions of an array yields an array cell, which can be an object with arbitrary attributes. As such, a 1-dimensional array is equivalent to an ordinary table, where the array index is the primary key and the cell attributes are the other columns. Because of the many different uses for arrays, SciDB supports user-defined functions, which can also change how users index into an array. A user can always access an array with the standard array indexes, but they can use UDFs to define pseudo-coordinates, which can be of any type, not just integers. Then, the user can perform standard array operations such as aggregation, sampling, and filtering. One of the benefits of SciDB is that it can be used from multiple programming languages. The various scientists don’t all use the same language, so SciDB provides a common parse-tree representation that any language binding can target. Another benefit is that users can recreate the set of steps that built any data item, so they can debug data items that were created incorrectly. In addition, users can store uncertain data with error bars. All of this allows SciDB to be widely used. One of the largest downsides of the paper is the lack of experimental results. SciDB is a system to be used in practice by many people, and it should justify itself with its performance. Without this, it’s much harder to verify how effective SciDB really is. |
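The equivalence this review draws between a 1-D array and an ordinary table (array index as primary key, cell attributes as the remaining columns) can be shown in a few lines. Purely illustrative; the schema is invented.

```python
import numpy as np

# One "row" type: the cell's attributes play the role of table columns.
row = np.dtype([("name", "U10"), ("mass", np.float64)])
particles = np.empty(3, dtype=row)          # a 1-D array == a 3-row table
particles[0] = ("electron", 0.511)
particles[1] = ("muon", 105.66)
particles[2] = ("tau", 1776.9)

# The array index acts as the primary key:
# roughly "SELECT mass FROM particles WHERE id = 1"
print(particles[1]["mass"])                 # 105.66
```

Going the other direction is what the paper argues is painful: a genuinely multi-dimensional array simulated on tables needs one key column per dimension, and structural operations become expensive joins.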
The paper presents a common set of requirements for a new science database system, SciDB. The process began with a meeting at Asilomar in March 2008 between a collection of science users and a collection of DBMS researchers to define requirements, followed by a more detailed design exercise over the summer. Additional use cases were solicited, and fund raising was carried out in parallel. The requirements come from particle physics, biology, remote sensing, astronomy, oceanography, and eBay. They cover data models, SciDB operators, extendibility, language bindings, open source, grid orientation, storage within a node, “in situ” data, integration of the cooking process, named versions, provenance, uncertainty, non-science usage, and a science benchmark. |
This paper presents requirements for science data bases and SciDB. These requirements were assembled from a collection of scientific database users from astronomy, particle physics, fusion, remote sensing, oceanography, and biology. The science community realizes that the software stack is getting too big, too hard to build, and too hard to maintain. Hence, the community seems willing to get behind a single project in the DBMS area. The paper first presents the data model for scientific databases. The requirements here are: arrays can have any number of dimensions, which may be named; an array can be defined once and multiple instances created; an array can be created by specifying high-water marks in each dimension; an unbounded array can grow without restriction in its dimensions; and enhanced arrays should allow basic arrays to be scaled. SciDB should also support user-defined functions, coded in C++. UDFs are defined by specifying the function name, its input and output, and the code to execute. Additionally, a shape function must be able to return low-water and high-water marks when one dimension is left unspecified. There should also be an exists function to find out whether or not a given cell is present in an array. Following the data model, the paper presents the requirements for operations. The first operator category creates new arrays based purely on the structure of the inputs. These operators are data-agnostic and do not have to read the data values to produce results; therefore, there is opportunity to optimize them. Example operators are subsample, reshape, and structured-join. Structured-join restricts its join predicate to be over dimension values only. The second category is content-dependent operators, whose results depend on the data stored in the input array. Example operators are filter, aggregate, sum, and content-based join. 
Content-based join restricts its join predicate to be over data values only. The fundamental array operations in SciDB should be user-extendable. Most scientists are adamant about not discarding any data, so there should be no overwrite, which could cause data loss. If a data item is shown to be wrong, they want to add the replacement value and the time of the replacement, retaining the old value for provenance purposes. Furthermore, the DBMS should be open source to get traction in the science community. On the storage side, the storage manager must decompose a partition into disk blocks, and most data will come into SciDB through a streaming bulk loader. Additionally, a universal requirement from scientists was repeatability of data derivation; hence, they wish to be able to recreate any array A by remembering how it was derived. The basic requirements are: for a given data element D, find the collection of processing steps that created it from input data; and for a given data element D, find all the downstream data elements whose values are impacted by the value of D. The strength of this paper is that it covers a lot of requirements of a DBMS for the science community and explains why the requirements are needed. However, the paper as a whole is very disorganized, and each section should be named more clearly. Also, a lot of requirements are mentioned; it would be better if they were summarized and categorized.
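The two provenance queries above (which steps produced element D, and which downstream elements depend on D) can be sketched with a minimal lineage log. The Provenance class and its method names are invented for illustration and are not part of SciDB's design.

```python
# Hedged sketch of the two basic provenance queries from the review:
# (a) upstream processing steps that produced an element, and
# (b) downstream elements impacted by an element. Names are illustrative.

class Provenance:
    def __init__(self):
        self.steps = []                      # (step_name, inputs, outputs)

    def record(self, step_name, inputs, outputs):
        self.steps.append((step_name, set(inputs), set(outputs)))

    def upstream_steps(self, element):
        # All steps that (transitively) contributed to `element`
        found, frontier = [], {element}
        for name, ins, outs in reversed(self.steps):
            if outs & frontier:
                found.append(name)
                frontier |= ins
        return list(reversed(found))

    def downstream_elements(self, element):
        # All elements whose value is impacted by `element`
        impacted = {element}
        for name, ins, outs in self.steps:
            if ins & impacted:
                impacted |= outs
        return impacted - {element}

p = Provenance()
p.record("calibrate", ["raw1"], ["cal1"])
p.record("average", ["cal1", "cal2"], ["avg"])
print(p.upstream_steps("avg"))        # ['calibrate', 'average']
print(p.downstream_elements("raw1"))  # {'cal1', 'avg'}
```

The sketch assumes steps are recorded in execution order; a real system would also need this log to survive restarts and scale to billions of elements.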
“Requirements for Science Data Bases and SciDB” by Stonebraker et al. discusses scientists' database use cases and requirements, a new database system, SciDB, to support those requirements, and a history of the SciDB designers’ interactions with users to uncover them. The requirements include: 1) a data model and corresponding language that better fits scientific use cases, namely an array data model; 2) structural operators (based on the structure of the input) and content-dependent operators (based on the content of the input); 3) user-extendable operations on arrays; 4) bindings for a variety of programming languages, since different scientists use different languages; 5) no overwrite of data; 6) an open-source DBMS; 7) support for changing data partitionings over time; 8) control over how data is stored within a node; 9) operation on “in situ” data (no full load of data required); 10) support for “cooking,” or pre-processing of raw data, within the DBMS; 11) named versions of data; 12) provenance of how particular data was computed and what other computations down the line it affects; 13) support for indicating uncertainty (e.g., a normal distribution over a data value).
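Requirement 13 above, storing a value together with its uncertainty as a normal distribution, can be sketched with a tiny value-plus-error type. The UValue name and the propagation rule shown (variances add under addition of independent Gaussians) are illustrative assumptions, not SciDB's actual uncertainty mechanism.

```python
# Hedged sketch of uncertainty support: each datum carries a mean and a
# standard deviation, and arithmetic propagates the error. For the sum of
# independent normal variables, means add and variances add.

import math

class UValue:
    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def __add__(self, other):
        # var(X + Y) = var(X) + var(Y) for independent X, Y
        return UValue(self.mean + other.mean,
                      math.sqrt(self.std ** 2 + other.std ** 2))

a = UValue(10.0, 3.0)   # e.g. a sensor reading with +/- 3.0 error
b = UValue(20.0, 4.0)
c = a + b
print(c.mean, c.std)    # 30.0 5.0
```

Multiplication, correlated errors, and non-Gaussian distributions would each need their own propagation rules; the sketch only shows the simplest case.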
This paper discusses the requirements for science databases and SciDB. The paper first argues that the demand for a science database is large: with different groups working separately on the problem, the software has become too large and too hard to maintain, so there is demand for a single DBMS project for scientific usage. The main contribution of the paper is its discussion of many requirements for a science database and for SciDB. The first is the array data model, which is easier to build than a mesh model; SciDB also needs to support user-defined functions. The second set of requirements concerns operations. The paper proposes several data-agnostic operators, such as subsample, reshape, and structured-join, as well as several content-dependent operators, such as filter, aggregate, and content-based join. The third and fourth requirements are extendibility and language bindings, which I think are both requirements driven by user usage. Besides these, the paper considers some special requirements, including no overwriting, open source, storage within a node, "in situ" data, and provenance. The strong part of the paper is that it considers the design requirements from a scientific researcher's perspective; the data and workloads of researchers are different from those of traditional DBMS users. The paper also gives non-science uses of the proposed requirements, which shows that designs for the science area can be applied to other applications. The drawback of the paper is that I did not see support for treating the opinions in the paper as representative of a considerable subset of researchers. The authors did not conduct a survey or give data to support their ideas, and I think this kind of assertion can only be made on the basis of a well-designed survey.
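The contrast between the two operator families mentioned above can be made concrete: a data-agnostic operator like subsample is determined entirely by dimension indices, while a content-dependent operator like filter must read cell values. This is a minimal NumPy sketch; the function names are illustrative, not SciDB's operator syntax.

```python
# Hedged sketch contrasting the two operator families from the review:
# `subsample` looks only at dimension indices (data-agnostic, optimizable
# without reading values), while `filter_cells` must inspect every value.

import numpy as np

def subsample(arr, row_slice, col_slice):
    # Structural: the result is determined by the slices alone
    return arr[row_slice, col_slice]

def filter_cells(arr, predicate):
    # Content-dependent: must evaluate the predicate on stored values;
    # here, non-matching cells are zeroed out
    return np.where(predicate(arr), arr, 0)

a = np.arange(16).reshape(4, 4)
print(subsample(a, slice(0, 2), slice(0, 2)))   # top-left 2x2 block
print(filter_cells(a, lambda v: v % 2 == 0))    # odd cells replaced by 0
```

Because subsample never touches the values, an engine can answer it from metadata and chunk boundaries alone, which is exactly the optimization opportunity the paper points out.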
In this paper, the authors, from several different organizations, explore a DBMS built especially for scientific research, called SciDB. Designing and implementing a DBMS for scientific research is definitely a meaningful task because, before SciDB, there were no commercial DBMS systems aimed at science, and even for existing products like Sequoia and MonetDB, a common set of requirements across several science disciplines had not been defined. The demand for a DBMS with scientific usage is great, as the authors mention, spanning fields like astronomy, particle physics, fusion, remote sensing, and so on. The goal of this paper is to specify a common set of requirements for a new science DBMS called SciDB; these requirements also fit the needs of very complex business analytics. Next, I will summarize the crux of this paper as I understand it. The authors discuss several requirements for building SciDB. First of all, it needs an appropriate data model. The relational data model of tables is not suitable for scientists because of its rigidity: tables do not appear natural for scientific needs, and SQL is not a good way for scientists to retrieve data. A better model is an array-like data model, which may have multiple dimensions, where each combination of dimension values identifies a single array cell that holds the data. Secondly, a set of well-defined operators is required. The authors define structural operators, which are data-agnostic and perform only structural manipulations on arrays, possibly across various dimensions, and content-dependent operators, which take a logical predicate on data values to decide what operation to perform. Third, for analytical and computational needs, SciDB should be extendible with user-defined content: users can define their own operations and data types in a POSTGRES-style format.
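The POSTGRES-style extendibility described above (declaring a function by name, input/output signature, and code) can be sketched as a tiny registry. The registry API below is invented for illustration; SciDB's real UDF mechanism is specified in C++, per the paper.

```python
# Hedged sketch of user extendibility: a UDF is registered by name together
# with its declared input arity and the code to execute, then invoked by
# name. `register_udf` / `call_udf` are hypothetical names.

UDF_REGISTRY = {}

def register_udf(name, n_inputs, fn):
    # Declaration: function name, input count, and the code to run
    UDF_REGISTRY[name] = (n_inputs, fn)

def call_udf(name, *args):
    n_inputs, fn = UDF_REGISTRY[name]
    if len(args) != n_inputs:
        raise TypeError(f"{name} expects {n_inputs} inputs, got {len(args)}")
    return fn(*args)

# A scientist adds a domain-specific operation without touching the engine:
register_udf("scale", 2, lambda cell, factor: cell * factor)
print(call_udf("scale", 3.0, 2.0))   # 6.0
```

The key design point is that the engine validates the declared signature at call time, so user code plugs in without recompiling the system.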
Fourth, SciDB should support multiple language bindings, as there is no single language that everyone is expected to use. Fifth, to guarantee that data is trackable and never discarded, SciDB should prevent users from overwriting old data; this is done by keeping a history that tracks changes while preserving the original values. Next, to ensure scalability and keep long-running scientific projects maintainable, SciDB should be open source; to give it strength comparable to commercial DBMSs, a non-profit foundation is to be established to support the user community. Partitioning of data storage should fit users' demands for large datasets: SciDB has a default fixed partitioning mechanism, while a user-defined dynamic partitioning scheme is also available. SciDB also needs to break data into disk blocks; an R-tree is used for tracking them, background threads are used for optimization, and further optimization is expected over time. SciDB is supposed to support ad-hoc analysis without much time spent loading huge amounts of data, by operating on data "in situ". Next, focusing on powerful internal data manipulation, SciDB should integrate the cooking process within the DBMS, without external processing, when users need it. SciDB should also support different scientific manipulations over the same set of data; to achieve this, the authors propose a tree-like structure using named versions. Data derivation should be clear from the outset, and SciDB should support backward tracking of data changes when necessary. Besides this, uncertainty is common in scientific fields, and SciDB should capture it: it will no longer store just single data values, but also normal distributions over them.
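The tree of named versions mentioned above can be sketched with delta storage: each named version records only the cells that differ from its parent, and materializing a version replays deltas from the root. The VersionTree class and delta encoding are assumptions made for this sketch, not the paper's concrete design.

```python
# Hedged sketch of named versions as a tree of deltas over a base array:
# branching is cheap (store only changed cells), and reading a version
# replays deltas root-to-leaf. Names are invented for illustration.

class VersionTree:
    def __init__(self, base_cells):
        # versions maps name -> (parent_name, delta of changed cells)
        self.versions = {"base": (None, dict(base_cells))}

    def branch(self, name, parent, delta):
        # A new named version records its parent and the changed cells only
        self.versions[name] = (parent, dict(delta))

    def materialize(self, name):
        # Walk up to the root, then apply deltas from oldest to newest
        chain = []
        while name is not None:
            parent, delta = self.versions[name]
            chain.append(delta)
            name = parent
        cells = {}
        for delta in reversed(chain):
            cells.update(delta)
        return cells

vt = VersionTree({(0, 0): 1.0, (0, 1): 2.0})
vt.branch("recalibrated", "base", {(0, 1): 2.1})
print(vt.materialize("recalibrated"))   # {(0, 0): 1.0, (0, 1): 2.1}
print(vt.materialize("base"))           # original values, untouched
```

Because the base version is never modified, this structure also serves the no-overwrite requirement: every branch can be traced back and the original data re-read.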
I think this is an interesting paper, quite different from the previous papers we read; it looks like a blueprint for designing a scientific DBMS and provides several useful pieces of advice. Although it is a survey-like paper, I think it still makes a good technical contribution: SciDB is the first project that considers this many factors when designing a DBMS especially for science, and the result is promising. There are several advantages to this paper. First of all, the authors propose a well-defined database management system with an array data model and accompanying operations and features that capture the requirements of the scientific community. They make a contribution by providing good guidance that can help developers avoid potential pitfalls when building such a system. Indeed, I think NumPy later did a good job of providing an array representation that is widely used for scientific computation. There are some drawbacks to this paper. First, there is no existing implementation of SciDB; the authors just present their plan for building such a DBMS. For the same reason, there are also no experiments with SciDB, so whether SciDB's design decisions are correct is unknown; there is no solid result proving that SciDB is efficient for scientific workloads, or demonstrating its robustness and flexibility.
This paper is a collection of “requirements” for a good scientific DBMS, proposed by Michael Stonebraker & others after collecting information from a group of users. These requirements were gathered for the purpose of developing a new scientific database system called SciDB. Here is a summary of the proposed rules/requirements: 1. Data models are specific to the use case; for example, biology users like graphs & sequences, but other users like array models better. SciDB decided to pursue a multidimensional-array-based model, though it is impossible to make everyone happy with this. Several operators are defined for this model, including subsample, reshape, dimension adding/removing, concatenate, cross product, filter, aggregate, join, and project. 2. Language bindings are desired by users, and persistence (such as that offered in the object-oriented-era models) is a good thing for scientific DB users. Because of this, SciDB uses a parse-tree representation that lets it use language embedding for this purpose. 3. Don’t allow overwriting of data — instead, just append. This makes sense, as analysis over old data is important in scientific settings. 4. Make it open source — this is fairly intuitive and makes sense in the scientific community. 5. The “cooking” process is important to build around — this is when raw input data is converted into a better standard form (calibration, correction, etc.). SciDB makes the choice to load raw data into the DBMS and then cook it INSIDE the database. 6. Determinism when deriving data is important, as it gives scientists the ability to re-create any situation (any array, in SciDB). This is good for fixing errors. 7. Support for fuzzy/uncertain data. This isn’t as important for business databases — supporting uncertain data means that SciDB (or any good scientific database) should support uncertainty factors attached to data, which can relate to how the data was measured, etc.
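The append-instead-of-overwrite rule in point 3 can be sketched with a per-cell history: an update appends a (timestamp, value) pair rather than replacing the old value, so every prior value remains readable. HistoryCell and its methods are hypothetical names for this sketch only.

```python
# Hedged sketch of the no-overwrite rule: updates append (timestamp, value)
# pairs, so corrected data never destroys the old value, and any past state
# can be queried for provenance. Names are illustrative.

class HistoryCell:
    def __init__(self):
        self.history = []                 # list of (timestamp, value)

    def update(self, timestamp, value):
        # Append-only: old values are never overwritten
        self.history.append((timestamp, value))

    def value_at(self, timestamp):
        # Latest value recorded at or before `timestamp`
        candidates = [(t, v) for t, v in self.history if t <= timestamp]
        return max(candidates)[1] if candidates else None

cell = HistoryCell()
cell.update(1, 42.0)
cell.update(5, 43.5)                      # a correction; 42.0 is retained
print(cell.value_at(3))   # 42.0
print(cell.value_at(9))   # 43.5
```

A query "as of" any time in the past is then just a lookup, which is what makes re-running an old analysis against the data it originally saw possible.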
The contributions of the paper are mainly these observations, which were taken from real industry users of scientific databases and people who want better scientific databases. The paper also presents a high-level picture of what SciDB would look like. The main weakness is that no results are given because SciDB was still in the idea phase at the time of this paper, which seems strange to me — why not implement the project and then write the paper afterwards, rather than write it before any actual development has taken place?