Review for Paper: 30-The 8 Requirements of Real-Time Stream Processing

Review 1

This paper provides high-level guidance on the requirements for evaluating stream processing solutions. The 8 requirements mentioned are:
1. Perform message processing without a costly storage operation on the critical path
2. Enable processing of moving real-time data with a high-level language such as SQL
3. Deal with delayed, missing, or out-of-sequence data
4. Generate deterministic and repeatable, and thus predictable, outcomes
5. Integrate both historical and live data in the system
6. Ensure the application is always up and available, and that data is safe
7. Make it possible to split an application over multiple machines for scalability
8. Provide high performance: deliver real-time responses for high-volume applications.

The paper introduces and compares three software technologies for stream processing: DBMSs, rule engines, and Stream Processing Engines (SPEs). Among the three, the SPE solution is preferred, since it works best at integrating stored and streaming data; the other two systems lack the ability to integrate both data sources in applications.

I like this paper. First, it's high level and very easy to understand. Second, the analyses of the three systems are clear and serve as a good example of how to evaluate systems against the 8 requirements listed. This paper makes me interested in learning about specific stream processing systems and evaluating them against the mentioned requirements.


Review 2

Stream-based applications, which include market feed processing and electronic trading on Wall Street, network and infrastructure monitoring, and fraud detection, require real-time processing of high-volume data streams. Data volumes are growing exponentially, and latency requirements are strict. To provide guidance for evaluating stream processing solutions, this paper presents eight requirements for a good real-time stream processing system.

Some of the strengths of this paper are:
1. Keeping data moving in the stream gets rid of the unnecessary latency of storage operations and reduces the latency of the whole system.
2. By integrating stored state and streaming data, the system can support common comparisons across different time ranges.
3. Automatically partitioning and distributing processing across multiple processors and machines achieves higher scalability.

Some of the drawbacks of this paper are:
1. The paper only presents high-level ideas; how to satisfy the rules is not discussed in depth.
2. No specific commercial products are mentioned, in order to stay “vendor neutral”. This is not a good reason, because the evaluation needs to be supported by real-world applications.
3. There are no experiments or comparisons in the paper.



Review 3

Stream-based applications typically deal with high-volume data streams and require low-latency processing techniques, including off-the-shelf stream processing engines and other “repurposed” software technologies such as main-memory DBMSs and rule engines. Since traditional system software was not designed specifically for these stream-based systems and thus often fails, this paper summarizes eight requirements that a system should meet to excel at a variety of real-time stream processing applications.

Basically, this paper concludes that there are eight requirements for real-time processing:
Rule 1: Keep the Data Moving: Messages should be processed “in-stream” as they fly by.
Rule 2: Query using SQL on Streams (StreamSQL): It is very much desirable to process moving real-time data using a high-level language such as SQL.
Rule 3: Handle Stream Imperfections (Delayed, Missing and Out-of-Order Data): The ability to time out individual calculations or computations is needed.
Rule 4: Generate Predictable Outcomes: A stream processing system must process time-series messages in a predictable manner to ensure that the results of processing are deterministic and repeatable.
Rule 5: Integrate Stored and Streaming Data: A stream processing system must also provide for careful management of stored state.
Rule 6: Guarantee Data Safety and Availability: To preserve the integrity of mission-critical information and avoid disruptions in real-time processing, a stream processing system must use a high-availability (HA) solution.
Rule 7: Partition and Scale Applications Automatically: It should be possible to split an application over multiple machines for scalability without the developer having to write low-level code, and the system should also support multi-threaded operation to take advantage of modern multi-processor (or multicore) computer architectures.
Rule 8: Process and Respond Instantaneously: A stream processing system should be able to process tens to hundreds of thousands of messages per second with latency in the microsecond to millisecond range on top of COTS hardware.
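The windowing idea behind Rule 2 can be made concrete with a small sketch. This is my own illustration, not code from the paper; the function name and the toy market feed are invented:

```python
from collections import defaultdict

def windowed_counts(messages, window_secs):
    """Group (timestamp, symbol) messages into fixed time windows and
    count messages per symbol per window -- roughly what a StreamSQL
    query with a time-based window would express."""
    counts = defaultdict(int)
    for ts, symbol in messages:
        window = int(ts // window_secs)   # which window this message falls in
        counts[(window, symbol)] += 1
    return dict(counts)

# toy market feed: (timestamp in seconds, symbol)
feed = [(0.5, "IBM"), (1.2, "IBM"), (2.7, "MSFT"), (3.1, "IBM")]
result = windowed_counts(feed, window_secs=2)
```

With 2-second windows, the first window holds two IBM ticks and the second holds one tick each of MSFT and IBM; a real engine would emit each window's aggregate as soon as the window closes rather than batching the whole feed.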

The key contributions of the paper are:
1. It provided high-level guidance to help information technologists know what to look for when evaluating alternative stream processing solutions.
2. It also briefly reviewed alternative software technologies and summarized how they measure up for real-time stream processing.

The advantages of this paper are as follows:
1. It summarized eight requirements and compared DBMSs, rule engines, and SPEs against them, which helps readers better understand the constraints each requirement imposes.
2. In its explanations, the paper used many figures and examples to illustrate the meaning of each requirement, which makes the definitions easier to understand.

This paper did a great job of providing guidance for researchers. However, it could be better if it had proposed a possible design for a system satisfying these eight requirements and shown that it is practically useful in either research or industry.


Review 4

Problem & motivations:
Applications that require real-time processing of high-volume data streams are pushing the limits of traditional data processing infrastructures. As sensors get cheaper, more and more devices will be sensor-aware and produce more data. Therefore, a heavier burden can be expected on streaming data services.

Main Contribution:
It proposes 8 rules that should be considered when developing a real-time stream processing application. The first rule is to keep data moving. The basic idea is that you should never have a costly storage operation in the critical processing path; in this way, you reduce latency. The second rule is to query using SQL on streams. The key insight is to avoid record-at-a-time programming; therefore, we need to support a high-level SQL-style language with additional features like a merge operator and window definitions. The third is to handle stream imperfections, which means we need to take care of imperfect data, such as messages arriving out of order. The remaining rules are: generate predictable outcomes, integrate stored and streaming data, guarantee data safety and availability, partition and scale applications automatically, and process and respond instantaneously.

Drawbacks:
The examples they provide do not exercise all 8 rules. It would be better if the paper provided an example that uses all 8 rules and showed how to trade them off against one another.



Review 5

Data collection, management, and analysis has come a long way since its beginnings many decades ago. Recently, high-volume data streams have become more and more common in applications such as electronic trading on Wall Street and command and control tasks for the military. Often, they need to be performed in real time, necessitating a new set of design considerations in order to be able to process large amounts of data very quickly. This paper was written around the time that real-time stream processing was becoming more widely used, and attempts to define a set of 8 design considerations/requirements that anyone attempting to handle real-time stream processing should meet. In doing so, this paper attempts to serve a purpose similar to other papers providing high level guidance on the design of relational DBMSs and on-line analytical processing.

The eight rules that the authors propose are as follows:

Keep the data moving - avoid adding time-intensive storage operations to the critical processing step, or latency will suffer
Query using SQL on Streams (StreamSQL) - Use SQL to query stream data, since everyone knows SQL and it is implemented everywhere. Some features, such as the window of a query (i.e. its scope), need to be added, as do new stream-oriented operators like the “Merge” operator.
Handle Stream Imperfections (Delayed, Missing and Out-of-Order Data) - have some mechanisms to provide fault tolerance, especially since data is likely to contain “mistakes”
Generate Predictable Outcomes - outcomes should be deterministic so replaying and reprocessing old data can still yield the same result
Integrate Stored and Streaming Data - being able to seamlessly switch between “historical” data and real-time data is valuable for applications such as electronic traders.
Guarantee Data Safety and Availability - High availability is very desirable, and one way to do so is through a “Tandem-style” hot backup and real-time failover scheme
Partition and Scale Applications Automatically - handle the low level details of distributed and parallel computing automatically
Process and Respond Instantaneously - the application needs to be able to keep up with the pace of data, so very low latency is necessary, which requires highly optimized systems and low overhead.

After proposing these eight rules, the authors also bring up three general types of technologies that can potentially support real-time stream processing: DBMSs, Rule engines, and Stream Processing Engines (SPEs). By comparing their relative features and abilities to support some or all of the 8 requirements, they conclude that only the SPE is able to meet all eight, making it potentially worthwhile in some applications to develop a dedicated solution (i.e. the SPE) rather than adapt existing technologies.

The main strength of this paper is that it provides a comprehensive high level overview of the many challenges facing applications that need to support real-time stream processing, especially at a time when there were no similar papers out yet. The writing was quite straightforward and easy to follow, and each rule was followed by a short summary, which also aided in understanding. The 8 rules all generally made sense, and the comparison at the end between DBMSs, rule engines, and SPEs helped illustrate the relative limitations of each.

One potential weakness, however, is the authors’ reluctance to mention any specific systems being used by the industry. While they cite their desire to be vendor neutral, it also deprives readers of the ability to see to what extent the real-life systems follow the 8 requirements spelled out in this paper. For instance, if multiple SPEs do not follow a particular rule very closely, perhaps that rule is not as necessary as the others, and on the other hand, they could also suggest additional rules that were not mentioned in this paper.


Review 6

The purpose of this paper is to serve as a requirements paper for systems running stream processing applications. The paper begins by naming a few scenarios where stream processing is used, like global exchanges where market data feeds are processed, and computer networks where real-time fraud detection can be used to counter denial-of-service and other security attacks. The paper mentions a few other wireless sensor applications with high-volume, low-latency workloads. The paper accomplishes its goal by presenting eight requirements of a real-time stream processing system.

The first rule is to keep the data moving, or to process the data as they fly by, and not to perform storage or polling operations that might increase latency. The second rule is to use StreamSQL, a variant of SQL that allows you to operate on a window of continuously streaming data. The third rule is to handle stream imperfections like messages coming in late, out of order, or missing, by timing out certain computations so we can move forward with partial data. The fourth rule is to be able to generate repeatable, predictable results from computations on your streamed data. The fifth rule is that there should be a seamless switch between stored and streamed data, in the event that you need to compare some historical trends with current incoming data, for example. The sixth rule is safety and availability of data through a real-time failover scheme, where a secondary stream of data is synced with the primary one at certain checkpoints so that the system can switch to the secondary if the primary ever fails. The seventh rule is to automatically scale and partition your application, while also handling load balancing so none of the available machines is bogged down as you scale up. The eighth and final rule is that each component of the system is highly optimized and has minimal latency, such that you have real-time response for your application.
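The timeout mechanism behind the third rule can be sketched in a few lines. This is a hypothetical illustration of the idea, not the paper's code; the function and sensor names are invented:

```python
def average_with_timeout(readings, expected_sensors, deadline, now):
    """Average whichever sensor readings have arrived. If some sensors
    are missing but the deadline has passed, proceed with partial data
    instead of blocking forever (Rule 3)."""
    arrived = {s: v for s, v in readings.items() if s in expected_sensors}
    if len(arrived) < len(expected_sensors) and now < deadline:
        return None  # still waiting: data incomplete and deadline not reached
    if not arrived:
        return None  # nothing to compute even after the timeout
    return sum(arrived.values()) / len(arrived)

# two of three sensors reported; the deadline has passed, so emit a partial answer
partial = average_with_timeout({"s1": 10.0, "s2": 20.0}, {"s1", "s2", "s3"},
                               deadline=5.0, now=6.0)
```

Before the deadline the same call would return None (keep waiting); after it, the computation times out and answers from the two sensors that did report.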

The paper then explains the architectures which can run these applications. The first is a database management system, which stores large datasets efficiently. The next is a rule engine, which processes inputs based on an if-then condition being met. The final architecture mentioned are Stream Processing Engines which are specially designed for streaming applications.

I liked how concise this paper was while still presenting a lot of information in a clear way. However, as I think was mentioned with Stonebraker papers, I did not find the graphics helpful. This is not a major concern since the text did a good job of explaining the architectures to me.



Review 7

Requirements of Real-Time Stream Processing

This paper gives an overall introduction to the requirements of real-time stream processing. The background is the increasing demand for data processing in big-data scenarios: more data needs to be processed in a timely manner. Though batch processing can handle huge amounts of data, it has a large latency, which is unacceptable here. Therefore, a stream processing system must fulfill the requirements listed in this paper.

There are altogether eight requirements for a stream processing system:
1. If the system wants to achieve low latency, it must keep the data moving and process messages “in-stream”, without any requirement to store them in order to perform an operation or sequence of operations.
2. Database storage is no longer static; the stored data is tied to time. The data in storage is a stream as well, and it needs to flow in and out.
3. Data streams are not stable at all; there may be stream imperfections such as delayed messages and missing or out-of-order data, and the system must handle them.
4. A stream processing system must process time-series messages in a predictable manner to ensure that the results of processing are deterministic and repeatable.
5. The stream processing system should have the capability to efficiently store, access, and modify state information, and combine it with live streaming data.
6. Guarantee data safety and availability, which means the system needs to tolerate faults, ensuring that applications are up and available and that the integrity of the data is maintained at all times, despite failures.
7. The system needs extensibility and scalability, so it should automatically partition and scale as needed.
8. Process and respond instantaneously, which means a stream processing system must have a highly optimized, minimal-overhead execution engine to deliver real-time responses for high-volume applications. This is the basic requirement of a stream processing system.

The contribution of this paper is providing systematic requirements for stream processing systems. It also introduces the architecture and software stack of such a system and gives some results on stream processing properties for DBMSs, rule engines, and SPEs, though none of them perfectly satisfies all the requirements listed in the paper.

The advantages of this paper are that it is really easy to understand, the requirements are clearly illustrated, and it carefully summarizes the differences between stream processing systems and current database management systems.

As for drawbacks, I would have hoped this paper could introduce more concrete principles and tech stacks. But overall this is a very good paper.


Review 8

In this paper, the authors outline eight requirements that a system should meet to provide a good solution for real-time stream processing. The goal of the paper is to provide high-level guidance on how to evaluate a stream processing system.

Stream Processing Engines are meant to support high-volume, low-latency stream processing applications. There are eight rules described in this paper.
The first is that the system should be able to process messages in place, without saving or storing them somewhere. In short, the system should use an active processing model.
The second rule is that the system should support a high-level StreamSQL language with stream-oriented operations, such as a merge operator that multiplexes messages in the stream.
The third requirement is that the system should have mechanisms to handle stream imperfections, such as delayed or missing data. That is a basic requirement for handling the common outlier situations in stream processing.
The fourth requirement is that the stream processing engine must guarantee a predictable outcome. This is important from the perspective of recovery and fault tolerance.
The fifth requirement is that the system should be able to integrate stored state information and combine it with the current stream data: the ability to switch between current and historical data.
The sixth requirement is to ensure that the data is safe and available, which is quite similar to a requirement that is important in distributed systems. It calls for a backup or a shadow master.
The seventh requirement is to be able to partition the data in a distributed manner, and the service should be easy to scale. Ideally, distribution is automatic and transparent.
The eighth requirement is that the system should be highly optimized, with low overhead.
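The merge operator mentioned under the second rule can be approximated with Python's standard library. This is only an illustrative sketch of multiplexing two timestamp-ordered streams, not the paper's definition of merge:

```python
import heapq

def merge_streams(*streams):
    """Multiplex several timestamp-ordered message streams into one
    stream, ordered by timestamp (first tuple element)."""
    return list(heapq.merge(*streams, key=lambda msg: msg[0]))

# two feeds, each already ordered by timestamp
a = [(1, "tick A"), (4, "tick C")]
b = [(2, "tick B"), (5, "tick D")]
merged = merge_streams(a, b)
```

Because `heapq.merge` only assumes each input is individually sorted, it can interleave unbounded feeds lazily, which is closer to how a streaming merge would behave than sorting a materialized batch.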


The main contribution of this paper is that it discusses eight rules that characterize the requirements for real-time stream processing. Stream processing systems have several desirable characteristics, such as low latency, business agility, higher developer productivity, and flexibility. As more people become aware of the significant advantages of using stream processing engines compared to custom coding, this technology will become more widespread. These eight rules describe the necessary features of any product in this category, which will be used for high-volume, low-latency applications.


Review 9

This paper by Stonebraker, Çetintemel, and Zdonik describes the theoretical elements that should make up a successful stream processing system. Traditionally, these systems have been created by developers as needed, but they are growing in importance. Paraphrased, the rules are
1. Keep data moving through the system with no storage needed at the time of processing (for latency)
2. SQL querying
3. Handle problems like out of order and missing data
4. Outcomes should be predictable and repeatable
5. Integration of traditional stored DBMS data and streaming data
6. Guarantee data integrity and availability
7. Automatic partitioning and scaling
8. Fast!
The paper mostly focuses on these goals. Near the end, there is also a short discussion on existing architectures and their limitations.

This paper seems to be a good introduction for someone who is thinking about building a stream processing system, so that they can essentially check off all of the rules. I wonder exactly where this paper falls in terms of the development of stream processing systems in industry, which seem to have become pretty crucial in the past few years. If it came before all of those systems, it seems like it could have been quite influential. The whole paper is quite clear and well-written, but I especially liked the table on page 6 that laid out the capabilities of DBMSs, rule engines, and SPEs that are related to stream processing.

I did not think that the description of how to build a system with high availability was very clear. Perhaps this is because I am unfamiliar with the “Tandem-style” model that is briefly mentioned. However, I was not convinced that this model was sound in terms of not losing any data. Additionally, it seemed as though this may conflict with the goal of not interacting with storage (because of latency concerns) that is discussed in rule #1.



Review 10

This paper can be viewed as a guideline for designing a real-time stream processing system. It also provides the knowledge needed for evaluating alternative stream processing solutions. Stream-based applications include traditional market feed processing, electronic trading and more recent monitoring and control applications. All these applications require very high volume processing of feed data with very low latency.

In the first half of the paper, the author listed 8 requirements for a high-quality system:
1. The system should not require the data to be stored before performing an operation. An active processing model would be even better.
2. The system should support SQL like language with built-in extensible primitives and operators.
3. The system should be able to handle missing and out-of-order data.
4. The result produced by the processing system should be predictable and repeatable.
5. Users of the system should be able to store, access and modify system state information and seamlessly switch between historical data and new data.
6. The system should provide high availability and ensure data integrity.
7. The system should easily scale out and automatically balance load among all nodes.
8. A highly-optimized, minimal-overhead execution engine needs to ensure timely response for high-volume applications.
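Requirement 5 above (seamlessly combining stored state with live data) can be illustrated with a toy sketch; the baseline table, threshold, and function name are all invented for illustration:

```python
# stored state, e.g. yesterday's mean price per symbol (hypothetical data)
historical = {"IBM": 100.0, "MSFT": 50.0}

def flag_anomalies(live_ticks, baseline, threshold=0.05):
    """Combine stored state with live stream data: flag any live tick
    deviating from the stored baseline by more than `threshold`."""
    alerts = []
    for symbol, price in live_ticks:
        base = baseline.get(symbol)
        if base is not None and abs(price - base) / base > threshold:
            alerts.append((symbol, price))
    return alerts

alerts = flag_anomalies([("IBM", 101.0), ("MSFT", 60.0)], historical)
```

Here MSFT's live price deviates 20% from its stored baseline and is flagged, while IBM's 1% move is not; the point is that one computation reads both the stored table and the live feed.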

There are three basic architectures for a stream processing system: DBMSs, rule engines, and stream processing engines. In the second half of the paper, the author compares these architectures based on the proposed requirements. Since DBMSs and rule engines were originally designed for a different class of applications with different assumptions and requirements, they don't fit these new requirements very well. As an example, a DBMS can't keep the data moving, as it has to store the data before any operations can be applied to it. Rule engines need to be extended so that they can express conditions of interest over time and over potentially unbounded data. An SPE is specifically designed to deal with stream data, so it meets almost all the requirements.

I think it would be better if this paper also analyzed some real stream processing systems (Kafka, for example) to see which requirements are met and which are not. Currently, the discussion of the different architectures seems a little abstract.



Review 11

In the paper "The 8 Requirements of Real-Time Stream Processing", Michael Stonebraker and co-authors outline the challenges of processing high-volume, real-time data with low latency requirements. Since applications in the real world are increasingly becoming "sensor-tagged", the need for systems that constantly track the state of objects and react accordingly is growing. One notable example comes from global exchanges such as the stock market: with the sheer amount of data being collected every second and user queries attempting to aggregate newly observed data to help make investment decisions, a latency of even one second is unacceptable. Sensor-based technologies operate in the same way: if something wrong is detected, the proper individuals need to be notified as soon as possible. Traditionally, this has been solved through custom code, but we all know how that can turn out: inflexible, expensive, and non-adaptive to future requests. However, recent systems have been re-purposed and re-marketed to support stream processing. Thus, this paper attempts to give guidance on how to evaluate alternative stream processing solutions using eight characteristics.

The following are the 8 rules and their takeaways:
1) Keep the Data Moving: Message processing should occur independent of costly storage operations. These messages should be processed "in-stream" to decrease latency. Ideally, the system should also use an active processing model.
2) Query using SQL on Streams: Using low-level programming languages has a high cost and long development cycles. Thus, we should use SQL as a declarative interface since it is both in wide use and is understood independent from run-time. Furthermore, such a language should know its scope (i.e. when to stop) and support new stream-oriented operators such as merge.
3) Handle Stream Imperfections: We should time out operations in the case of data delays/problems. The system can then react accordingly by working with partial data.
4) Generate Predictable Outcomes: From the perspective of fault tolerance and recovery, having predictable outcomes regardless of the time of execution is important. Thus, even if messages arrive out of order, we should get the same answer as if they had arrived in order.
5) Integrate Stored and Streaming Data: Sometimes to detect anomalies in sensor data, we need to use pre-existing stored data. Thus, we need to efficiently store, access, and modify state information and combine it with live stream data. Ideally, we should use a uniform language when dealing with either type of data.
6) Guarantee Data Safety and Availability: Applications that deal with stream data prefer high availability over other aspects. Furthermore, we need to make sure the integrity of the data is maintained at all times despite failures.
7) Partition and Scale Applications Automatically: Since distributed systems are becoming much more popular, stream processing must have support for it. It should scale well and load balance with many machines so a single machine does not get overloaded.
8) Process and Respond Instantaneously: We need a minimal-overhead execution engine so that users have an interactive environment with their queries.
Architecture concerns: DBMS are good for storing data but not so good in this particular environment. Rule engines use conditional branches to determine when to act upon stream data but consequently have limitations. Stream processing engines (SPE) deal with data without the need to store them. Overall, according to Stonebraker, SPEs are much more preferable than other solutions.

Much like other papers, this one also has some drawbacks. The most interesting thing about the paper was the fact that it was written by Stonebraker - it is very easy to follow and comprehend, but is totally constrained to his point of view. As a result, the reader is left without understanding other non-traditional methods that adapt to a company's workload (E.g. Dynamo for Amazon when Stonebraker seems to despise eventually consistent systems). Another drawback was the lack of actual research - this was a retrospective paper detailing the dos and don'ts. I would have liked to see some performance analysis on methods that follow these guidelines vs. ones that do not.


Review 12

This paper describes the requirements for a system to process data coming as a stream. Usually, data is processed in fixed chunks at fixed times. However, some applications require that data be processed in real time as it comes in through a streaming connection. The paper describes eight requirements for such a system to run well:

1: The data must constantly move through the system, and there should be no long operations like stores on the data’s critical path.

2: SQL syntax should be usable for queries on streams. Since SQL is usually queried on static data, the system needs to find a way to simulate this, such as by defining a time window over which the query will run.

3: Streamed data can come out-of-order, or some data can be missing. The system should be able to account for irregular data patterns.

4: The processing of the data should be independent of any out-of-order reception. The system should process the data in a deterministic order, regardless of how it comes in.

5: The system should be able to store data as it comes in, and it should be able to run queries on the stored data. The user should be able to run queries on the stored data in exactly the same manner as the streamed data.

6: The system must be always available; incoming requests should always be served, so any servers should have backups that are easily accessed in case of failure.

7: The system should automatically scale up as new partitions are added. The user should not need to manually change the system to make use of new partitions.

8: Since the system should be close to real-time, it must have very low latency. Each part of the incoming stream must be processed in seconds, at most.
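The determinism in point 4 above can be illustrated by applying buffered messages in timestamp order rather than arrival order, so a replay of the same messages yields the same final state. This sketch is my own, with invented names and data:

```python
def replay(arrivals):
    """Apply (timestamp, key, value) updates in timestamp order,
    regardless of the order they arrived in, so the final state is
    deterministic and repeatable."""
    state = {}
    for ts, key, value in sorted(arrivals):   # timestamp order, not arrival order
        state[key] = value                    # the later timestamp wins
    return state

in_order     = [(1, "IBM", 100.0), (2, "IBM", 101.5)]
out_of_order = [(2, "IBM", 101.5), (1, "IBM", 100.0)]
same = replay(in_order) == replay(out_of_order)
```

Without the sort, the out-of-order arrival would leave the stale price 100.0 as the final value, and reprocessing old data would not reproduce the original result.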

Streaming can be implemented using a traditional DBMS, a rule engine, or a specialized Stream Processing Engine. Generally, SPEs follow the eight rules better than either of the standard data models, since they are specifically built for this purpose. In particular, SPEs are the only model that easily integrates streaming and stored data together.

The benefit of this paper is that it provides a standard set of behaviors for any potential streaming application. This would allow any developer to determine the best way to construct a data system to handle processing streaming data. It especially focuses on making a low-latency, durable system that a user can easily interact with without specialized knowledge.

The downside of this paper is that it doesn’t discuss much how restrictive these standards are. Many systems are built to optimize certain behaviors at the cost of others, and it may not be necessary, depending on the situation, to fully implement all eight behaviors. As such, it’s important to know which behaviors are most important for which results.



Review 13

The paper presents 8 requirements on real-time stream processing applications as guidance for people to evaluate stream processing solutions. The first requirement for a real-time stream processing system is to process messages “in-stream”, without any requirement to store them to perform any operation or sequence of operations. Ideally the system should also use an active (i.e., non-polling) processing model. The second requirement is to support a high-level “StreamSQL” language with built-in extensible stream-oriented primitives and operators. The third requirement is to have built-in mechanisms to provide resiliency against stream “imperfections”, including missing and out-of-order data, which are commonly present in real-world data streams. The fourth requirement is that a stream processing engine must guarantee predictable and repeatable outcomes. The fifth requirement is to have the capability to efficiently store, access, and modify state information, and combine it with live streaming data. For seamless integration, the system should use a uniform language when dealing with either type of data. The sixth requirement is to ensure that the applications are up and available, and the integrity of the data maintained at all times, despite failures. The seventh requirement is to have the capability to distribute processing across multiple processors and machines to achieve incremental scalability. Ideally, the distribution should be automatic and transparent. The eighth requirement is that a stream processing system must have a highly-optimized, minimal-overhead execution engine to deliver real-time response for high-volume applications.




Review 14

Nowadays, many applications require real-time processing of high-volume data streams; examples include electronic trading on Wall Street, network and infrastructure monitoring, fraud detection, and command and control in military environments. Real-time data processing imposes many requirements that traditional data processing infrastructures cannot satisfy. This paper therefore outlines eight requirements that a system should meet to excel at a variety of real-time stream processing applications.
The first rule is to keep the data moving, which means the system should have low latency. Storage operations usually add unnecessary latency, and it is not acceptable to require a time-intensive operation before message processing can occur. Another source of latency is passive systems, which wait to be told what to do before initiating processing and thus incur additional overhead; active systems avoid this overhead.
The second rule is to query using SQL on streams. Traditionally, C++ and Java are used as development and programming tools, but relying on low-level programming schemes results in long development cycles and high maintenance costs. It is far more desirable to process moving real-time data using a high-level language like SQL. SQL works well because it is explicit about how primitives interact, so queries can be understood independently of runtime conditions. StreamSQL is a variant of SQL specifically designed to express processing on continuous streams of data. Since streaming data never ends, windows are used to indicate when an operation should finish and output an answer. Windows should be definable over time, number of messages, or breakpoints in other attributes of a message.
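To make the window idea concrete, here is a minimal, hypothetical Python sketch of a tumbling (fixed-size, event-time) window count; the function name and data layout are illustrative, not taken from the paper:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """Count events per fixed-size time window (a 'tumbling' window).

    events: iterable of (timestamp, payload) pairs, timestamps in seconds.
    window_size: window length in seconds.
    Returns a dict mapping each window's start time to its event count.
    """
    counts = defaultdict(int)
    for ts, _payload in events:
        # Each event falls into exactly one non-overlapping window.
        window_start = (ts // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

stream = [(0, "a"), (2, "b"), (5, "c"), (6, "d"), (11, "e")]
print(tumbling_window_counts(stream, 5))  # {0: 2, 5: 2, 10: 1}
```

A StreamSQL window clause plays the same role declaratively: it tells the engine when an aggregate over a never-ending stream is complete enough to emit.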
The third rule is to handle stream imperfections, including delayed, missing, and out-of-order data. One requirement is the ability to time out individual calculations or computations. To deal with out-of-order data, a mechanism must be provided to allow windows to stay open for an additional period of time.
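One common mechanism for this (shown here as an illustrative sketch, not the paper's specific design) is a "slack" or grace interval: a window is only closed once event time has advanced past the window's end plus the slack, so modestly late or out-of-order messages still land in the correct window:

```python
def windowed_sums_with_slack(events, window_size, slack):
    """Sum values per fixed window, closing a window only after event time
    passes window_end + slack, so slightly late data is still included.

    events: iterable of (timestamp, value) pairs, roughly time-ordered.
    Returns a list of (window_start, total) pairs in closing order.
    """
    open_windows = {}   # window_start -> running sum
    results = []
    for ts, value in events:
        w = (ts // window_size) * window_size
        open_windows[w] = open_windows.get(w, 0) + value
        # Close every window whose grace period has fully elapsed.
        for start in sorted(open_windows):
            if start + window_size + slack <= ts:
                results.append((start, open_windows.pop(start)))
    # Flush whatever is still open at end of stream.
    for start in sorted(open_windows):
        results.append((start, open_windows.pop(start)))
    return results

# The event at t=3 arrives after the one at t=4, but window [0, 5) is
# still open because of the slack of 2, so it is counted correctly.
print(windowed_sums_with_slack([(0, 1), (4, 2), (3, 5), (9, 1), (12, 4)], 5, 2))
# [(0, 8), (5, 1), (10, 4)]
```

The trade-off the paper alludes to is visible here: a larger slack tolerates later data but delays results, which is why timeouts on individual computations are also needed.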
The fourth rule is to generate predictable outcomes: a stream processing system must process time-series messages in a predictable manner so that the results of processing are deterministic and repeatable.
Rule 5 is to integrate stored and streaming data, because for many applications comparing present data with past data is a common task. The system must be able to efficiently store, access, and modify state information and combine it with live streaming data. For seamless integration, the system should use a uniform language when dealing with either type of data.
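As an illustrative sketch (the class name and threshold are invented, not from the paper), combining stored and streaming data can mean keeping a per-key historical aggregate right next to the stream operator, so each new event is compared against the past without a round trip to a separate store:

```python
class StreamWithHistory:
    """Hypothetical operator that flags live values that deviate sharply
    from the stored historical mean for the same key."""

    def __init__(self, threshold):
        self.threshold = threshold   # flag values this many times the mean
        self.history = {}            # key -> (count, running total)

    def process(self, key, value):
        """Return True if the value looks anomalous versus stored history,
        then fold it into the stored state."""
        count, total = self.history.get(key, (0, 0.0))
        anomalous = count > 0 and value > self.threshold * (total / count)
        self.history[key] = (count + 1, total + value)
        return anomalous

detector = StreamWithHistory(3.0)
detector.process("acct-42", 10)    # no history yet -> False
detector.process("acct-42", 100)   # 100 > 3 * mean(10) -> True
```

Keeping `history` in-process (rather than in an external database) is exactly the low-latency integration of state and stream that the rule asks for.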
The sixth rule is to guarantee data safety and availability; high availability is a critical concern for most stream processing applications. The seventh rule is to partition and scale applications automatically: it should be possible to split an application over multiple machines for scalability, and stream processing systems should also support multi-threaded operation to take advantage of modern multi-processor computers. The last rule is to process and respond instantaneously, meaning a stream processing system must have a highly optimized, minimal-overhead execution engine to deliver real-time response for high-volume applications.
The advantage of this paper is that it organizes the requirements into 8 clear categories, which makes it very easy to read and understand. The disadvantage is that, in order to remain vendor-neutral, it does not mention commercial products related to the technologies discussed.



Review 15

“The 8 Requirements of Real-Time Stream Processing” by Stonebraker et al. discusses 8 characteristics that are important in the design or selection of a system for real-time stream processing applications. Real-time stream processing has become more popular due to applications like electronic trading, sensor monitoring, fraud detection, and more. Here is a summary of the 8 requirements:

Rule 1: Keep the Data Moving; process the data in-stream, often before writing to storage, so that the write to storage isn’t a barrier. Systems that actively process data rather than passively waiting (e.g., for applications to poll for data) reduce latency.

Rule 2: Query using SQL on Streams (StreamSQL); systems should use SQL, as it is well-known, commonly used, and performant. StreamSQL should extend SQL to support stream-relevant queries and processing, in particular indicating a window (of time, messages, or other attributes) that an operator should operate over.

Rule 3: Handle Stream Imperfections (Delayed, Missing, and Out-of-Order Data); careful timing and timeout of operations are needed to account for delayed, missing, and out-of-order data without increasing the latency of operations.

Rule 4: Generate Predictable Outcomes; delayed or out-of-order data can impact consistency of operations if those operations were not specifically designed to be deterministic for this kind of data. Operations should be made deterministic, to ensure deterministic results if recomputation is needed upon a system recovery.
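The recovery point can be illustrated with a tiny sketch (function and variable names are assumed, not from the paper): if the engine processes events in a canonical order, such as timestamp with a stable tiebreaker, replaying the same input after a failure reproduces the same output regardless of arrival order:

```python
def deterministic_replay(events):
    """events: list of (timestamp, seq, value) tuples.

    Sorting on (timestamp, seq) fixes a canonical processing order, so the
    running totals below depend only on the event set, not arrival order.
    """
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    totals, running = [], 0
    for _ts, _seq, value in ordered:
        running += value
        totals.append(running)
    return totals

a = [(1, 0, 10), (2, 0, 5), (3, 0, 1)]
b = [(3, 0, 1), (1, 0, 10), (2, 0, 5)]   # same events, shuffled arrival
assert deterministic_replay(a) == deterministic_replay(b)  # [10, 15, 16]
```

A real engine would achieve the same property incrementally (via ordering buffers or watermarks) rather than by sorting a finished list, but the invariant is the same.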

Rule 5: Integrate Stored and Streaming Data; some applications will want to run analytics on both streaming data and stored historical data. To avoid high latency, historical data needs to be stored in the same place as the application processing the streaming data, and accessed using a common language.

Rule 6: Guarantee Data Safety and Availability

Rule 7: Partition and Scale Applications Automatically; taking advantage of multi-processing and multi-threading is important for reducing latency and blocking while processing streaming data. Automatic partitioning and scaling is important for reducing latency and blocking when the workload changes or certain machines become backed-up.
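A minimal sketch of the partitioning idea (illustrative only, with invented names): hash each message's key to pick a worker, so the stream can be split across machines while all events for a given key stay on the same node:

```python
import zlib

def partition_for(key, num_partitions):
    """Route a message key to one of num_partitions workers.

    zlib.crc32 is stable across runs and machines, unlike Python's
    salted built-in hash(), so routing stays consistent cluster-wide.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

messages = [("user-1", "click"), ("user-2", "buy"), ("user-1", "click")]
routed = {}
for key, event in messages:
    routed.setdefault(partition_for(key, 4), []).append((key, event))
# Both "user-1" events land in the same partition, so per-key state
# (counts, windows) never has to be shared across machines.
```

Scaling up then means raising `num_partitions` and rebalancing keys, which is the kind of repartitioning the paper argues should be automatic and transparent.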

Rule 8: Process and Respond Instantaneously

The authors then discuss 3 kinds of software architectures (DBMSs, rule engines, and stream processing engines (SPEs)) that could be considered for stream processing applications, and the strengths and weaknesses of each architecture for the 8 ideal characteristics described earlier. Stream processing engines best satisfy the 8 requirements because they are designed specifically for stream-processing applications; in particular, of the 3 architecture types, SPEs are the only ones that support SQL on streams and stored and streamed data.

I like the organization of the paper, in that it first introduced 8 desirable characteristics, then 3 software architectures, then discussed which characteristics are supported by which architecture types. Table 1 is a nice summary of this information.

Something I’m still curious about is whether certain stream processing applications have higher tolerance for particular challenges (e.g., stream imperfections) or whether certain applications inherently do not have particular challenges/requirements. For example, are there slightly different challenges or requirements for bank fraud detection versus a hobbyist Internet of Things application? Another question I still have is the lower-level mechanics of how and when data is actually stored in SPEs.



Review 16

This paper discusses 8 requirements for real-time stream processing. It first observes that the era of streaming data has arrived: with more and more sensors integrated into information systems, and the growing number of exchanges, trades, and transactions every second, there is great demand for processing streaming data with low latency. However, traditional techniques such as DBMSs and rule engines cannot handle the problem directly. Thus, the paper proposes 8 requirements for stream processing and evaluates 3 basic system types against them.

The main contribution of the paper is putting the 8 rules together and giving detailed explanations of them. Of the eight rules, I think three are the most important. The first, "keep the data moving", captures the intrinsic property of stream data: the data keeps arriving. The third rule, "handle stream imperfections", reflects the reality that unprocessed stream data is naturally prone to being "delayed, missing and out-of-order"; I think this is one of the most important reasons why a DBMS is not suitable for stream processing, since data is first stored in the DBMS before it is processed. As for the fifth rule, "integrate stored and streaming data", I agree with the paper that applications commonly need to compare stream data with "past" data, so it is important to integrate stored data with streaming data. The paper also introduces 3 software technologies and measures them against the 8 rules.

The main drawback of the paper, in my mind, is that it doesn't present any stream processing products and analyze them. Although the paper says it avoided mentioning commercial products to stay vendor-neutral, it could have mentioned open-source software. With software examples, readers could understand the rules much more easily and learn more about real-time stream processing.


Review 17

In this paper, the authors outline eight requirements that a system should meet to excel at a variety of real-time stream processing applications. The problem they address is providing guidance for real-time processing of high-volume data streams with low-latency requirements. This problem is definitely important for today's workloads: as they note, stream-based applications are almost everywhere, such as market feed processing and electronic trading, network and infrastructure monitoring, and fraud detection, especially with the increasing number of sensors. Designing new technologies to handle these challenges and opportunities is very interesting and meaningful. The goal of this paper is to provide high-level guidance to developers so that they can evaluate alternative stream processing solutions. Next, I will summarize the 8 requirements in my own understanding.

1. Keep the data moving: To achieve low latency, a system must be able to perform message processing without a costly storage operation in the critical processing path. A good way to do this is to avoid storage on the critical path and use the straight-through processing paradigm.
2. Query using SQL on Streams: For streaming applications, low-level programming schemes are unwelcome due to long development cycles and high maintenance costs. It is very desirable to process moving real-time data using a high-level declarative language like SQL, and StreamSQL is proposed to address the unique requirements of stream processing.
3. Handle Stream Imperfections: This requirement means that we should have built-in mechanisms to provide resiliency against stream imperfections including missing and out-of-order data, which are commonly present in real-world data streams.
4. Generate Predictable Outcomes: Stream processing must deal with time-series messages in a predictable manner to ensure that results are deterministic and repeatable.
5. Integrate Stored and Streaming Data: We need the capability to efficiently store, access, and modify state information and combine it with live stream data. For seamless integration, a uniform language is required to deal with both types of data.
6. Guarantee Data Safety and Availability: We need to ensure that the applications are up and available and the integrity of the data maintained at all times.
7. Partition and Scale Applications Automatically: We need to have the capability to distribute processing across multiple processors and machines to achieve incremental scalability, and this distribution should be automatic and transparent.
8. Process and Respond Instantaneously: A stream processing system must have a highly-optimized, minimal-overhead execution engine to deliver real-time response for high-volume applications.
Besides, they also discussed three different software technologies that can be applied to high-volume, low-latency streaming problems: DBMSs, rule engines, and SPEs. They provide a trade-off analysis of these techniques against the eight requirements.

Although this is a review-like paper, I think it still makes a great contribution to the state of the art in stream processing techniques. It provides great guidance for researchers and engineers working in the stream processing field. First of all, I like the tabular results in section 3.3 a lot; this table gives a very nice description of the trade-offs among the requirements when using a DBMS, a rule engine, or an SPE, and it is straightforward and thus very easy to understand.

Generally speaking, this is a nice paper with great insight, and its downsides are minor. I think they could give more concrete examples and real products for handling large streams. Besides, for the table of results, I wish they had provided more explanation about why some techniques satisfy certain requirements while others do not.



Review 18

This paper is a survey-type paper that showcases a type of workload (real-time streaming) that was on the rise at this point in time, and it outlines several guidelines (“requirements”) that a system should follow in order to be a good system for streaming. These requirements are as follows:

1) No costly storage procedures should be used during message processing. This helps keep latency low.

2) SQL should be used as the query language. This is because low-level languages make maintenance/development too complicated. That being said, the SQL used should be extended to support streaming primitives.

3) The infrastructure should be able to handle streaming imperfections like missing/out-of-order data by using timeouts, etc. This is a fairly obvious point which doesn’t need much justification.

4) This is related to #3, but a system should be able to produce deterministic results. This is especially important for recovering from faults.

5) The system should provide a *seamless* way to access state (a.k.a. stored information). The paper justifies this by saying that many applications need access to state-related information (ex. comparing “current” streaming results to past results to see if there are irregularities). I personally think that it is overkill to call this a “requirement” as there are probably situations where storing state is not necessary, but I do think that when it is, making it seamless and not a cause of high latency is vital.

6) The system should be able to provide good safety and availability guarantees. This is also very obvious.

7) The system should scale automatically, including incremental scaling. This is another fairly obvious observation; it should be noted that observations 6-8 are simply separate ways of saying “the system should perform very well and be safe”.

8) The system should provide an execution engine that offers extremely low latency for general use, even when data volume is high.

The contributions of this paper are fairly self-evident; a type of workload is presented and the case is made for it being important to focus on and provide efficient systems for. The guidelines presented also seem reasonable and well-justified.

With respect to weaknesses, I wasn’t overly fond of the introduction for this paper. I thought the use of colloquialisms (“expression in quotes”) was extremely high and made the introduction overly difficult to understand without reading it twice. Some of the assumptions about the future prevalence of streaming applications also were too strong and have since been proven incorrect, which wouldn’t have been as big of an issue if they weren’t stated so aggressively (example: "Tagging **will** be applied to customers at amusement parks for ride management and prevention of lost children.”).