Apache Flink is an open-source system for processing streaming and batch data. The motivation for building static data analysis into a stream processing system is that data records are often batched into static data sets and then processed in a time-agnostic fashion. The main contributions of the Flink system are: 1. a unified architecture for stream and batch data processing; 2. fault tolerance, with checkpointing and lightweight recovery. One design discussion I find interesting is the balancing of latency and throughput. When discussing data exchange through intermediate data streams, the paper describes tuning the buffer timeout: increasing the timeout increases latency along with throughput, until full throughput is reached. The paper also includes experimental results to support this intuitive argument, and I consider this one of the key insights I took from the paper. Another design I like is that Flink's dataflows provide no ordering guarantees, choosing instead to leave order maintenance to the operator implementations. When I first read this idea, I thought it was generally bad, since the infrastructure itself seemed not "good" enough, but considering the efficiency it buys, this is actually a sound design choice. One thing I don't like about this paper is that it provides no experimental results on Flink itself, for example a runtime analysis of a task on Flink, or comparisons with other systems. As discussed in the paper, Flink is widely used, so it should not be hard to include some examples, which would make the paper more concrete. |
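The buffer-timeout trade-off described in the review above can be illustrated with a small simulation. Everything below is a hypothetical sketch, not Flink's actual API: records arrive at a fixed interval, and an operator ships its output buffer either when it fills or when the timeout expires. Larger timeouts produce fewer, fuller buffers (higher throughput per network call) at the cost of latency.

```python
def simulate_exchange(num_records, record_interval_ms, buffer_size, timeout_ms):
    """Return (buffers_sent, worst_case_latency_ms) for a steady record stream.
    Illustrative model only: one operator, one downstream channel."""
    buffers_sent = 0
    pending = 0          # records sitting in the current output buffer
    waited_ms = 0.0      # how long the oldest pending record has waited
    worst_latency = 0.0
    for _ in range(num_records):
        pending += 1
        waited_ms += record_interval_ms
        # ship the buffer when it is full or the timeout has expired
        if pending == buffer_size or waited_ms >= timeout_ms:
            worst_latency = max(worst_latency, waited_ms)
            buffers_sent += 1
            pending = 0
            waited_ms = 0.0
    if pending:          # flush the tail
        buffers_sent += 1
        worst_latency = max(worst_latency, waited_ms)
    return buffers_sent, worst_latency
```

With a zero timeout every record is shipped immediately (many tiny buffers, minimal latency); with a large timeout each buffer fills completely before shipping, matching the paper's observation that latency rises with throughput until full throughput is reached.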
Most of today’s large-scale data processing use cases handle data that is produced continuously over time, including web logs, application logs, and sensor readings. Data records are usually batched into static data sets and then processed in a time-agnostic fashion. Current data collection tools suffer from the high latency imposed by batching, from the high complexity of connecting and orchestrating several systems and implementing business logic twice, and from arbitrary inaccuracy, since the time dimension is not explicitly handled by the application code. Therefore, the paper presents Apache Flink, an open-source platform for distributed stream and batch data processing. Some of the contributions and strengths of this paper are: 1. The runtime environment of Apache Flink provides high throughput and very low latency, achievable with minimal configuration changes. 2. Stream processing systems must always maintain the state of their computation. Flink has a very efficient checkpointing mechanism to persist that state during computation. 3. Flink has an efficient fault tolerance mechanism based on distributed snapshots. This mechanism is very lightweight while offering strong consistency and high throughput. 4. Apache Flink has its own memory management system inside the JVM, so application scalability beyond main memory is handled easily and with little overhead. Some of the drawbacks of this paper are: 1. There is no experiment or evaluation in the paper presenting how Flink outperforms other systems/platforms. |
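The checkpointing idea praised in points 2 and 3 above can be shown in miniature: snapshot operator state every few records, and on failure restore the last snapshot and replay the input from that point, assuming a replayable source. This is an illustrative toy with invented names, not Flink's mechanism, but it shows why replay from a consistent snapshot yields an exactly-once result.

```python
import copy

class CountingOperator:
    """A stateful operator: counts occurrences per key."""
    def __init__(self):
        self.counts = {}
    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
    def snapshot(self):
        return copy.deepcopy(self.counts)
    def restore(self, snap):
        self.counts = copy.deepcopy(snap)

def run_with_recovery(records, checkpoint_every, crash_at=None):
    """Run the operator over `records`, checkpointing periodically; if a
    crash is injected, roll back to the last checkpoint and replay."""
    op = CountingOperator()
    last_snap, last_index = op.snapshot(), 0
    i = 0
    while i < len(records):
        if crash_at is not None and i == crash_at:
            op.restore(last_snap)      # failure: roll back state
            i, crash_at = last_index, None   # replay from the checkpoint
            continue
        op.process(records[i])
        i += 1
        if i % checkpoint_every == 0:
            last_snap, last_index = op.snapshot(), i
    return op.counts
```

A run with an injected crash produces the same counts as a clean run, which is the essence of the exactly-once guarantee the review mentions.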
Stream processing and batch processing were traditionally considered two very different types of applications: they were programmed separately using different programming models and APIs, and were executed by different systems. However, today’s large-scale data processing use cases handle data produced continuously, and the existing approaches tend to ignore the continuous and timely nature of data production, causing severely high latency, high complexity, and arbitrary inaccuracy. Therefore, this paper proposes Apache Flink, which combines real-time analysis, continuous streams, and batch processing both in the programming model and in the execution engine. In order to support batch use cases with competitive ease and performance, Flink has a specialized API for processing static data sets, uses specialized data structures and algorithms for the batch versions of operators like join or grouping, and uses dedicated scheduling strategies. The core of Apache Flink is a streaming dataflow engine, which provides communication, distribution, and fault tolerance for distributed stream data processing. Flink includes components for building real-life applications as well as libraries for machine learning and graph processing. The contributions of this paper are as follows: 1. Flink is the first unified architecture for stream and batch data processing, including specific optimizations that are only relevant for static data sets. 2. Flink showed how streaming, batch, iterative, and interactive analytics can be represented as fault-tolerant streaming dataflows. 3. The paper also discusses how to build a full-fledged stream analytics system with a flexible windowing mechanism, as well as a full-fledged batch processor, on top of these dataflows. The main advantages of Flink are as follows: 1. 
The runtime environment of Apache Flink provides high throughput and very low latency, achievable with minimal configuration changes. 2. Flink has a very efficient checkpointing mechanism to persist state during computation. 3. Flink has a natural flow control system built in, which enables efficient flow control with long-running operators. 4. Flink has an efficient fault tolerance mechanism based on distributed snapshots. This mechanism is very lightweight while offering strong consistency and high throughput. 5. Apache Flink provides a single runtime environment for both stream and batch processing, so the same runtime implementation can cover all types of applications. Although Flink is faster than Spark due to its underlying architecture, Spark has very strong community support and a good number of contributors, and Spark has already been deployed in production, which for now may be a “drawback” of Flink. But as far as streaming capability is concerned, Flink is far better than Spark (as Spark handles streams in the form of micro-batches) and has native support for streaming. 
|
Data processing often comes in two different forms, stream processing and static batch processing, and for a long time they were handled by different types of applications. For example, stream processing engines like Storm handled data streams, while relational databases dealt with the batch cases. With today's changing data landscape, however, the lines are becoming blurred, with data being produced in ever larger amounts as a stream that is then batched together and processed. Architectures have been developed to combine batch and streaming systems, but they suffer from high latency and complexity, as well as arbitrary inaccuracy. Recognizing this issue, the authors developed a unified system called Apache Flink that handles a wide array of use cases for both stream and batch processing workloads, along with other features such as early/approximate and delayed/accurate data computation. In other words, it supports stream processing while still including a batch processor that can use libraries for graph analysis and machine learning. Flink is composed of a variety of components, the core of which is the distributed dataflow engine, responsible for executing dataflow programs. These programs are directed acyclic graphs of stateful operators connected by data streams. Flink has two core APIs, the DataSet API for bounded data sets (batch processing) and the DataStream API for stream processing. Both the DataSet and DataStream APIs create runtime programs executable by the core runtime engine. On top of this, Flink has domain-specific libraries and APIs that generate DataSet and DataStream API programs, for areas like machine learning, graph processing, and SQL-like operations. All Flink programs compile down to a dataflow graph, which is executed in a data-parallel fashion. Intermediate data streams are the core abstraction for exchanging data between operators. 
Flink also uses pipelining and blocking streams to manage data flow, as well as to balance latency and throughput. Additionally, Flink offers reliable execution with “exactly once” processing, using checkpointing and partial re-execution to deal with failures, provided that data sources are persistent and replayable. Beyond these features, Flink’s DataStream API includes a full stream-analytics framework on top of the runtime, which manages concerns such as out-of-order event processing, defining windows, and maintaining and updating user-defined state. As for batch processing, some simplifications can be made, since a bounded data set is a special case of an unbounded data stream. Query optimizations are included with Flink and used for batch processing, and other considerations like memory management and batch iterations are taken into account. The main strength of this paper is that it showcases a working system able to handle both the stream and batch processing paradigms using the same runtime. Treating batch processing as a special case of a data stream simplifies the design and makes it easier to handle both use cases. Additionally, features such as fault tolerance and information-rich stream analytics enhance the usability of the Flink system; having a unified architecture that represents everything in terms of streaming dataflows undoubtedly helped in the process. With over 180 open-source contributors as of the writing of this paper, and its use in production at various companies, Flink has also managed to gain some traction within industry, which distinguishes it from other, more theoretical systems or architectures. One weakness of this paper is the lack of experimental results or other discussion on some of its key claims. 
For example, it initially mentions how previous systems suffered from high latency and arbitrary inaccuracy, yet it does not present results showing how Flink improves on the latency of patterns like the “lambda architecture.” Perhaps this is due to the inability to get a copy of those systems directly, or perhaps the unified architecture that comprises Flink makes a direct comparison unnecessary in the first place. |
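The review above describes Flink programs as directed acyclic graphs of stateful operators connected by data streams. The following is a toy rendering of that idea, with invented names and none of Flink's actual parallelization: a DAG of operators evaluated by pulling data from sources toward a sink.

```python
class Node:
    """One operator in a dataflow DAG."""
    def __init__(self, name, fn, inputs=()):
        self.name = name            # operator name, for readability only
        self.fn = fn                # the operator's transformation
        self.inputs = list(inputs)  # upstream operators

def execute(node):
    """Evaluate the DAG rooted at `node` by recursing into its inputs."""
    upstream = [execute(n) for n in node.inputs]
    return node.fn(*upstream)

# source -> map -> filter, mirroring a simple pipeline
source = Node("source", lambda: [1, 2, 3, 4])
mapped = Node("map", lambda xs: [x * 10 for x in xs], [source])
kept = Node("filter", lambda xs: [x for x in xs if x > 15], [mapped])
```

In the real system each such operator is parallelized into subtasks, and the streams between them are partitioned, but the graph shape is the same.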
The purpose of this paper is to describe the architecture of the Flink system for streaming and batch data processing. The paper describes how continuously produced data is handled by today's solutions: either by computing approximate, timely results on the streaming data, or by computing late, accurate results on batched chunks of data, i.e., static sets of data collected during some time interval. The current approaches suffer from high complexity, high latency, and arbitrary inaccuracy. Flink lets users deliver early and approximate, or late and accurate, results as well, but there is no distinction between the different types of streamed data processing; you are always just starting your computation at some point in the durable stream and maintaining state during the life of your computation. Flink also handles batch processing, a special kind of stream processing where the stream is finite and the order of the returned results does not matter; special APIs are implemented to handle this case. Flink's architecture can be thought of as both a software stack and a distributed system. There are two core APIs, one for batch data processing and one for continuous data stream processing. The Flink process model comprises a client, a Job Manager (which keeps a dataflow graph and can recover from faults), and a Task Manager. Flink programs, whether batch (DataSet) or streaming (DataStream), are represented as dataflow graphs that are run by Flink's runtime engine. The paper then moves to data exchange through intermediate data streams, which can be done through blocking exchange, where an operator's data is not made available until all of it is buffered, or pipelined exchange, where streamed output is made available immediately to concurrently running producers and consumers. Flink also balances latency and throughput by allowing a timeout on buffers. 
Flink also uses checkpointing and partial re-execution to recover from faults, using distributed consistent snapshots. The paper next covers stream analytics: Flink uses watermarking, similar to DBMSs with snapshots, to create a notion of time. It then introduces stream windows, stateful operators that keep continuously updated data in memory for processing. Next comes batch analytics: batch computations are executed by the same runtime as streaming ones, except that blocking operators are used to isolate stages of larger computations. I liked how this paper iteratively deepened the description of its contribution; this helped give context and reveal more about each idea. I did not like that the authors did not spend as much time on the stream and batch analytics as they did on the dataflows and background. |
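The watermark-driven windows mentioned above can be sketched concretely. In this hedged toy (illustrative names, not Flink's API), records carry event timestamps and may arrive out of order; a watermark asserts that no earlier event will arrive, so any tumbling window ending at or before the watermark can fire.

```python
from collections import defaultdict

def tumbling_windows(events, size, watermarks):
    """events: (timestamp, value) pairs, possibly out of order.
    watermarks: timestamps promising no earlier event will arrive.
    Returns [(window_start, values)] in firing order."""
    windows = defaultdict(list)
    for ts, value in events:
        windows[(ts // size) * size].append(value)   # window assignment
    fired = []
    for wm in sorted(watermarks):
        for start in sorted(windows):
            if start + size <= wm:                   # watermark passed the end
                fired.append((start, windows.pop(start)))
    return fired
```

Note how the out-of-order record with timestamp 3 still lands in the first window because that window only fires once the watermark passes its end.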
Apache Flink is a groundbreaking product that combines stream processing and batch processing in a single streaming engine. It is designed to handle many classes of data processing applications, such as real-time analytics, continuous data pipelines, batch processing, and iterative algorithms. Compared with previous big data architectures, Flink has a true streaming engine and performs better than Hadoop and Apache Spark. Before Flink came out, stream processing systems and batch processing systems were totally different things, and only the "lambda architecture" combined them, at the cost of a very complex overall system. Compared with that, Flink is neat and high-performance: the core of the system is the streaming engine, bundled with the DataSet and DataStream APIs. Flink can be deployed locally, on a cluster, or in the cloud. In the Flink process model, the Flink client takes the program code, transforms it into a dataflow graph, and submits that to the job manager; DataSet programs also go through a cost-based query optimization phase. The most important concept in Flink is the dataflow: a directed acyclic graph consisting of stateful operators and the data streams that connect them. As a streaming system, Flink handles data exchange through intermediate data streams. It uses pipelined data exchange, so that by tuning the timeout and size of buffers one can trade off the latency and throughput of the system. A data stream does not carry only data records; it also carries lightweight control events: checkpoint barriers, watermarks, and iteration barriers. For fault tolerance, Flink offers exactly-once-processing guarantees, achieved by its checkpointing mechanism (asynchronous barrier snapshotting) and the corresponding alignment phase. Asynchronous barrier snapshotting provides several advantages. First, it guarantees exactly-once state updates without ever pausing the computation. 
Second, it is completely decoupled from other forms of control messages. Third, it is completely decoupled from the mechanism used for reliable storage. Flink puts the notion of time into the streams, and most operators are stateful, meaning the state is made explicit and incorporated in the APIs. For stream windows there are the window assigner, trigger, and evictor: the assigner is responsible for assigning each record to logical views (windows), the trigger defines when the operation associated with the window definition is performed, and the evictor determines which records to retain within each window. In Flink, batch processing is a special case of stream processing; it uses traditional query optimization with some additional optimizations on top. For memory management, Flink bypasses the JVM heap and serializes data into memory segments. For batch iterations, bulk iteration and delta iteration are provided. The contribution of Flink is the combination of batch processing and stream processing on a streaming engine, and experiments (not in this paper, though) show that Flink has much better throughput and latency than Spark. The advantage of this paper is that it gives an overall, top-down introduction to Flink and provides a concrete picture of it. The drawbacks of this paper: 1. it does not include any experimental information or results; 2. the iteration part is not clear and should include more details. |
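The alignment phase of asynchronous barrier snapshotting mentioned above can be caricatured in a few lines. In this heavily simplified, single-threaded sketch (invented names, nothing from Flink's implementation), an operator with two input channels takes its snapshot only after the checkpoint barrier has arrived on both channels, holding back any records that follow a barrier.

```python
BARRIER = object()   # sentinel marking a checkpoint barrier in a stream

def align_and_snapshot(channel_a, channel_b):
    """Consume two input channels until the barrier has arrived on both.
    Returns (records processed before the snapshot point, records held
    back for processing after the snapshot)."""
    processed = []
    blocked = set()
    chans = {"a": iter(channel_a), "b": iter(channel_b)}
    while len(blocked) < len(chans):
        for name, it in chans.items():
            if name in blocked:
                continue              # channel already delivered its barrier
            item = next(it, BARRIER)  # an exhausted channel counts as done
            if item is BARRIER:
                blocked.add(name)
            else:
                processed.append(item)
    # the snapshot would be taken here; post-barrier records were held back
    buffered = [x for it in chans.values() for x in it]
    return processed, buffered
```

The key property is that no record from after a barrier leaks into the pre-snapshot state, which is what makes the snapshot consistent.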
This paper mainly introduces the Apache Flink architecture. Why do we need Apache Flink? Because Flink is a pure streaming engine; a micro-batch engine like Spark is just a special case of the Flink streaming engine. Flink originated from a research project called Stratosphere, which aimed to build the next-generation big data analytics engine. Apache Flink is a big data processing system that supports both distributed data stream processing and batch processing. Flink can express and execute many types of data processing applications, including real-time data analysis, continuous data pipelines, historical data processing, and iterative algorithms, over fault-tolerant data streams. A Flink distributed program consists of two main processes: the JobManager and the TaskManager. The JobManager is responsible for job management and resource coordination, including task scheduling, monitoring task execution status, coordinating task execution, checkpoint management, and failure recovery. The TaskManager is a worker node that executes tasks; each task runs in one or more threads in a JVM. The TaskManager is a JVM process running on a node with a certain amount of resources. A TaskSlot is where the distributed program actually executes a task; by adjusting the number of TaskSlots, users can define how subtasks are isolated from each other. The JobClient is the entry point for program execution; it receives the user-submitted program and converts it into a dataflow graph through the optimizer and GraphBuilder. When a program runs, different processes participate: the JobManager, the TaskManager, and the JobClient. First, the Flink program is submitted to the JobClient, and the JobClient submits it to the JobManager. The JobManager is responsible for resource coordination and job execution. 
Once resource allocation is completed, tasks are assigned to different TaskManagers. Each TaskManager initializes threads to execute its tasks and reports the execution status back to the JobManager; the execution states include starting, in progress, finished, canceled, and failing. When the job completes, the result is returned to the client. Flink takes the record as its unit of processing. Each record is generated by an event, and records in a real-time stream processing system are generally bound to time. A record generally has the following three times: EventTime, IngestionTime, and ProcessingTime. The main contribution of this paper is that it proposed Flink. Also, Flink has had its own memory control from the first day, which is one of the reasons that inspired Spark to follow this path. In addition to storing data in its own managed memory, Flink also operates on binary data directly. 
|
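The three time notions listed in the review above can be made concrete with a toy record type. Names and structure here are invented for illustration, not Flink's actual classes: EventTime is stamped at the source, IngestionTime when the system first sees the record, and ProcessingTime when an operator actually handles it.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Record:
    value: str
    event_time: float                              # stamped at the source
    ingestion_time: float = field(default_factory=time.time)

def handle(record):
    """An operator observes processing time only when it runs. For an
    in-order record the three timestamps are naturally ordered."""
    processing_time = time.time()
    return record.event_time <= record.ingestion_time <= processing_time
```

Out-of-order processing arises exactly when this ordering is violated across records: a record with an earlier event time can reach an operator later than one with a newer event time, which is why event-time systems need watermarks.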
Flink is a system that supports both streaming and batch processing. A key observation in the introduction is that “batch programs are special cases of streaming programs,” an idea that I will touch on more later. From there, much of the discussion is about stream processing. The programs written on Flink are compiled to dataflow graphs, which are DAGs. Flink uses pipelined streams to increase parallelism. There is a significant amount of reliability offered, with exactly-once-processing consistency and failure-handling with checkpointing (using a system called ABS). An important aspect of Flink is that iterative dataflows are supported, making it suitable for machine learning applications. Windows are used to group tuples together for processing. It seemed like a core focus that differentiated Flink from other systems was the ability to do both stream and batch processing, as is emphasized throughout the paper and in the title. Given that, I was somewhat surprised that in the body of the paper, batch processing wasn’t really mentioned until very close to the end. However, I thought that the section was actually extremely well done. It is made clear earlier that batch processing is basically a special case of stream processing, and the bullet points on page 35-36 do a really great job showing what optimizations can be made on the stream processing workflow described in order to do well with batch processing. I found it a bit odd that this paper included no results section. Typically in an industry-focused paper, I would expect to see some results, comparing to other similar systems. 
The paper states that “To the best of our knowledge, Flink is the only open-source project that: i) supports event time and out-of-order event processing ii) provides consistent managed state with exactly-once guarantees iii) achieves high throughput and low latency, serving both batch and streaming.” Given that, I wonder why it could not have been compared to closed-source solutions (assuming some cloud offering). Additionally, this makes me wonder what is truly considered novel - is being the first open-source implementation of something that has already been done sufficient novelty to be presented in a conference? |
This paper introduces Apache Flink, a system for processing streaming and batch data. Many existing large-scale data processing systems ignore the continuous and timely nature of data production. They simply batch data records into static sets and process them in a time-agnostic fashion. This approach suffers from high latency, high complexity, and arbitrary inaccuracy. This paper proposes a unified architecture for stream and batch data processing, including specific optimizations for static data sets. At the heart of the Flink system is a distributed dataflow engine, which executes dataflow programs. A program is simply a DAG of stateful operators connected by data streams. On top of this engine, separate APIs are built for batch processing (DataSet API) and stream processing (DataStream API). Based on these APIs, different libraries such as FlinkML, Gelly, and Table are built. Although there are different APIs, the Flink runtime treats them the same: they are all converted into a dataflow graph. The only difference is that DataSet programs additionally go through a cost-based query optimization phase. A dataflow graph is just a DAG that consists of stateful operators and data streams, which represent data produced by one operator and made available for consumption by another. To enable distributed processing, operators are parallelized into one or more parallel subtasks and, similarly, data streams are split into multiple stream partitions. Both data records and control events are transferred through intermediate data streams. There are two kinds of intermediate streams: pipelined and blocking. Pipelined streams enable the propagation of back pressure from consumers to producers, while blocking streams buffer all of the producing operator’s data before making it available for consumption. Control events, for example checkpoint barriers and watermarks, are also transferred via intermediate streams. 
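The pipelined vs. blocking distinction can be sketched in a few lines of Python. The class names are hypothetical and the model is deliberately simplistic (single-threaded, no network); Flink's actual buffer pools and credit-based flow control are far more involved.

```python
from collections import deque

class PipelinedStream:
    """Bounded buffer: a full buffer rejects the producer, modeling back pressure."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()

    def offer(self, record):
        if len(self.buffer) >= self.capacity:
            return False              # producer must wait -> back pressure propagates
        self.buffer.append(record)
        return True

    def poll(self):
        # consumer can read while the producer is still running
        return self.buffer.popleft() if self.buffer else None

class BlockingStream:
    """Buffers ALL producer output before any record becomes consumable."""
    def __init__(self):
        self.buffer = []
        self.sealed = False

    def offer(self, record):
        self.buffer.append(record)    # never rejects: materializes everything
        return True

    def seal(self):
        self.sealed = True            # producer has finished

    def poll(self):
        return self.buffer.pop(0) if self.sealed and self.buffer else None
```

The key behavioral difference: a pipelined stream lets the consumer run concurrently with the producer but pushes back when the consumer is slow, while a blocking stream decouples the two entirely at the cost of materializing the full intermediate result.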
Since a bounded dataset is just a special case of an unbounded data stream, Flink is able to convert a DataSet program into a streaming program and use the same architecture for both DataSet and DataStream. One question I have about this paper is why not build a dedicated system for batch processing. Is there any particular reason for using the same runtime for different workload types? |
In the paper "Apache Flink: Stream and Batch Processing in a Single Engine", Paris Carbone and co-authors discuss Apache Flink, an open-source system for processing streaming and batch data. When observing the current trends for modern large-scale data processing, we notice that most of the data is produced continuously and is highly relevant to aggregations. However, another type of interpretation can give rise to a static approach: we can gather data within a set period of time and then query the data afterwards. This combination of batching and stream processing comes in two flavors: a fast and inaccurate answer that arrives on the fly, or a slow but accurate answer that arrives after careful computation. Carbone gives an overview of Flink's architecture and how this diverse set of use cases is unified under a single model. Flink follows a paradigm that highlights data-stream processing in both the programming model and the execution engine, allowing for high flexibility in terms of what a programmer might desire. As a result, Flink presents itself as a state-of-the-art method for approaching stream and batch processing. Such a tool is both interesting and important to explore in the context of real-world applications. The paper is divided into multiple sections: 1) Architecture: Flink's core comprises a distributed dataflow engine that executes dataflow programs on either finite or infinite data sets. For each process, a client takes program code and turns it into a dataflow graph. This is submitted to the job manager. The job manager then coordinates the distributed execution of the dataflow. 2) Streaming Dataflows: a) Graphs: Dataflow graphs are DAGs and consist of stateful operators and data streams that represent data produced by an operator. b) Data exchange: Data is handled at a logical level by operators. c) Fault tolerance: There is reliable execution with strict consistency guarantees that deals with failures via checkpointing and partial re-execution. 
Recovery from failures rolls back all operators to their last successful snapshot. d) Iterative dataflows / asynchronous stream iterations / batch iterations: a subset of the work that is beneficial to machine learning and graph processing (these workloads require an immense number of iterations). 3) Stream Analytics: a) Time: event time (the event's origin time) vs. processing time (the machine's wall-clock time during processing). b) Stream windows: Computations over unbounded streams are often evaluated over continuously updated logical views called windows. A pool of common predefined implementations is available for the user to choose from, improving the user experience. 4) Batch Analytics: a) Query optimization: The optimizer uses techniques from parallel database systems such as plan equivalence, cost modeling, and interesting-property propagation. However, the UDF-heavy DAGs of Flink’s dataflow programs leave little room for traditional optimizers to employ database techniques. Flink instead enumerates many physical plans and picks the one best suited in terms of cost. b) Memory management: Sorting and joining are done as much as possible on binary data directly, keeping serialization and deserialization overhead at a minimum. To handle arbitrary objects, Flink uses type inference and custom serialization techniques. Like many other papers, this paper had some drawbacks. The first drawback I noticed was the lack of an experimental section. It was hard to gauge the performance of Apache Flink between the different modes: stream analytics vs. batch analytics. I was curious to see what the accuracy and time guarantees were for these modes, respectively. Another drawback I noticed was that the related work was mentioned far too late in the paper. Since Apache Flink aims to combine both approaches, much like databases that attempt hybrid OLTP and OLAP workloads, it would be interesting to put Flink on a spectrum and weigh its closeness to each side. 
Lastly, I personally thought this paper violated Stonebraker's philosophy on what an ideal stream processing engine should look like. Even though Apache Flink is capable of doing many things, I still think it is domain-specific in the sense that it cannot be fully generalized to solve any problem. |
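The stream windows described in the review above can be sketched as a small tumbling-window aggregation. This is a toy model assuming numeric event times and a sum aggregate; Flink's windowing supports many window types, triggers, and user-defined functions.

```python
from collections import defaultdict

def tumbling_windows(records, size):
    """Assign each (event_time, value) record to the tumbling window
    containing its timestamp, then aggregate (here: sum) per window."""
    windows = defaultdict(int)
    for event_time, value in records:
        window_start = (event_time // size) * size  # window the record falls into
        windows[window_start] += value
    return dict(sorted(windows.items()))

# records at event times 0..11, value 1 each, grouped into windows of size 5
print(tumbling_windows([(t, 1) for t in range(12)], 5))
# {0: 5, 5: 5, 10: 2}
```

Because the assignment is driven by event time rather than arrival order, out-of-order records still land in the correct window, which is the property that distinguishes event-time windowing from simple arrival batching.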
This paper describes Apache Flink, which processes both streaming and static data. It uses data stream processing to unify stream processing and standard batch processing. While stream processing in other systems is usually done by waiting for a batch of appropriate size to materialize, waiting for a batch introduces latency, which isn’t always acceptable. However, certain job types require standard batch processing. As such, Flink uses data sets to process static data, and data streams to process streaming data. Flink divides its jobs into three parts. The client translates the user code into a dataflow graph, which represents the plan to apply to the data. The JobManager coordinates execution of the different operations on the graph. The TaskManager actually runs the various operations on the data. The dataflow graph can be executed in parallel in order to speed up execution. The dataflow graph acts as the common interface for both batch and stream processing; both batches and streams can be processed as dataflows. It is a directed acyclic graph of operators and the intermediate data streams between them. Flink can adjust the buffer sizes or buffer timeouts in order to optimize for throughput or latency. Flink can also use local variables in the dataflows to allow them to remember state, for state-based operations. The most important part of this paper is how it unifies streaming and batch processing by combining them as dataflows. Flink is also notably able to make these dataflows fault-tolerant, which it achieves by archiving streams and making them easily replayable. In addition, Flink can generate movable windows on streaming data to make it similar to batch data without losing streaming benefits. The major downside of this paper is that it doesn’t display any experimental results. The major reason for Flink to exist is as a superior alternative to other systems, but without any major comparisons, it’s hard to prove its point. |
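The buffer-timeout trade-off mentioned in the review above can be simulated in a few lines of Python. This is a toy model with discrete arrival times, not Flink's actual network stack: a buffer is flushed when it fills or when its oldest record has waited longer than the timeout, so a small timeout gives many small sends (low latency, lower throughput) while a large timeout amortizes sends over full buffers (higher latency, higher throughput).

```python
def flush_schedule(arrivals, buffer_size, timeout):
    """Simulate buffered sends over records arriving at the given times.
    Flush when the buffer is full OR the oldest buffered record has
    waited at least `timeout` time units.
    Returns (number_of_flushes, worst_case_latency)."""
    batches, buf_times = [], []
    for t in arrivals:
        buf_times.append(t)
        if len(buf_times) >= buffer_size or t - buf_times[0] >= timeout:
            batches.append((buf_times[0], t))  # (first arrival, flush time)
            buf_times = []
    if buf_times:                              # flush any leftover records
        batches.append((buf_times[0], arrivals[-1]))
    worst_latency = max(flush - first for first, flush in batches)
    return len(batches), worst_latency

# 100 records, one per time unit, buffers of 10 records
print(flush_schedule(list(range(100)), 10, 1))   # (50, 1): many sends, low latency
print(flush_schedule(list(range(100)), 10, 20))  # (10, 9): few sends, higher latency
```

Fewer flushes means less per-send overhead (higher effective throughput), which matches the paper's observation that raising the timeout trades latency for throughput until full throughput is reached.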
Apache Flink is a new-generation general big data processing engine that aims to unify different data workloads. Apache Flink’s dataflow programming model provides event-at-a-time processing on both finite and infinite datasets. At a basic level, Flink programs consist of streams and transformations. “Conceptually, a stream is a (potentially never-ending) flow of data records, and a transformation is an operation that takes one or more streams as input, and produces one or more output streams as a result.” Compared to open-source streaming engines like Apex, Storm, or Heron, Flink does more than streaming. It is more like the mirror image of Apache Spark in that both put real-time and batch processing on the same engine, doing away with the need for a Lambda architecture. Both have APIs for querying data in tables, and both have APIs or libraries for batch and real-time processing, along with graph processing and machine learning. Pros: (1) Apache Flink provides true event-at-a-time stream processing, enabling 24/7, continuous applications for immediate insights and actions on your data. (2) Apache Flink offers a streaming SQL API, making it accessible for business and non-technical users to harness the power of stream processing. (3) Besides stream processing, Flink integrates real-time and batch processing in the same engine. Cons: (1) Less support than Spark as a big data platform (but promising). |
This paper presents Apache Flink, an open-source system for processing streaming and batch data. Flink is built on the philosophy that many classes of data processing applications, including real-time analytics, continuous data pipelines, historic data processing, and iterative algorithms, can be expressed and executed as pipelined fault-tolerant dataflows. The paper presents Flink's architecture and expands on how a set of use cases can be unified under a single execution model. This paper first presents the architecture of Flink. A Flink cluster comprises three types of processes: the client, the JobManager, and at least one TaskManager. The client takes the program code, transforms it to a dataflow graph, and submits that to the JobManager. The JobManager coordinates the distributed execution of the dataflow. It tracks the state and progress of each operator and stream, schedules new operators, and coordinates checkpoints and recovery. A TaskManager executes one or more operators that produce streams and reports on their status to the JobManager. The TaskManagers maintain the buffer pools to buffer or materialize the streams, and the network connections to exchange the data streams between operators. The dataflow graph is a directed acyclic graph that consists of stateful operators and data streams that represent data produced by an operator and made available for consumption by other operators. Streams distribute data between producing and consuming operators in various patterns. Flink’s intermediate data streams are the core abstraction for data exchange between operators. An intermediate data stream represents a logical handle to the data that is produced by an operator and can be consumed by one or more operators. Pipelined intermediate streams exchange data between concurrently running producers and consumers, resulting in pipelined execution. 
When a data record is ready on the producer side, it is serialized and split into one or more buffers that can be forwarded to consumers. Apart from exchanging data, streams in Flink communicate different types of control events, including checkpoint barriers, watermarks, and iteration barriers. Flink offers reliable execution with strict exactly-once-processing consistency guarantees and deals with failures via checkpointing and partial re-execution. The checkpointing mechanism of Apache Flink builds on the notion of distributed consistent snapshots to achieve exactly-once-processing guarantees. Recovery from failures reverts all operator states to their respective states taken from the last successful snapshot and restarts the input streams starting from the latest barrier for which there is a snapshot. ABS provides several benefits: it guarantees exactly-once state updates without ever pausing the computation, it is completely decoupled from other forms of control messages, and it is completely decoupled from the mechanism used for reliable storage, allowing state to be backed up to file systems, databases, etc., depending on the larger environment in which Flink is used. Since Flink’s runtime supports pipelined data transfers, continuous stateful operators, and a fault-tolerance mechanism for consistent state updates, overlaying a stream processor on top of it amounts to implementing a windowing system and a state interface. Batch computations are executed by the same runtime as streaming computations. Flink optimizes batch processing by optimizing execution using a query optimizer and by implementing blocking operators that gracefully spill to disk in the absence of memory. I like this paper because it explains technical details clearly and explains the main contributions of this work. One thing that could improve this paper is to provide performance numbers for Flink in production. |
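The barrier-based snapshotting described in the review above can be sketched with a single toy operator. The `Barrier` and `CountingOperator` names are hypothetical, and real ABS also aligns barriers across multiple input channels; this sketch only shows the core idea that a barrier separates checkpoint epochs so state can be saved without pausing the stream.

```python
from dataclasses import dataclass

@dataclass
class Barrier:
    """A checkpoint barrier injected into the stream at the sources."""
    checkpoint_id: int

class CountingOperator:
    """A stateful operator that counts records and snapshots its state
    whenever a checkpoint barrier flows past (ABS-style: no global pause)."""
    def __init__(self):
        self.count = 0
        self.snapshots = {}  # checkpoint_id -> saved state

    def process(self, item):
        if isinstance(item, Barrier):
            # records after the barrier belong to the next checkpoint epoch
            self.snapshots[item.checkpoint_id] = self.count
        else:
            self.count += 1

    def recover(self, checkpoint_id):
        """On failure, roll state back to the last successful snapshot;
        the input is then replayed from that barrier."""
        self.count = self.snapshots[checkpoint_id]

op = CountingOperator()
for item in ["a", "b", Barrier(1), "c", "d", "e"]:
    op.process(item)
assert op.snapshots == {1: 2}  # state captured at barrier 1
op.recover(1)                  # simulate a failure after barrier 1
assert op.count == 2           # replay resumes from the snapshot
```

Replaying records "c", "d", "e" after recovery reproduces the pre-failure state exactly once, which is the exactly-once state-update guarantee in miniature.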
“Apache Flink: Stream and Batch Processing in a Single Engine” by Carbone et al. discusses the Apache Flink system, which supports both stream and batch processing workloads through a common dataflow runtime environment. The runtime environment consists of a directed acyclic graph of stateful operators (that perform operations) and data streams (which are produced and consumed by operators). Data streams between operators can be either pipelined or blocking; pipelined data streams are necessary for streaming programs and can also be used for batch programs, while blocking data streams are used for batch programs. Data is exchanged between operators via buffers. Control events (e.g., checkpoint barriers, watermarks, iteration barriers) are injected into the data stream by operators in order to provide special information to the consuming operator, which will potentially impact what actions it takes. For example, checkpoint barriers are important in helping Flink overcome failures by taking snapshots of the state of operators. Building on top of Flink’s runtime, supporting the Flink streaming API essentially requires implementing just a windowing system and a state interface. The Flink runtime already entirely supports batch processing, but the Flink DataSet API includes simplified syntax and some additional optimizations (e.g., a query optimizer; turning off periodic snapshotting if overhead is high; staged scheduling). I like the idea of a “common fabric”, a common runtime environment that supports extensions/add-ons specific to both stream and batch processing. It seems to be a well-thought-out common layer, as it supports both use cases and also supports optimizations for batch processing. There wasn’t an evaluation of the Apache Flink system. 
I would have liked to have seen how performance of Apache Flink compares to performance for stream-only systems and batch-only systems; for example, does Apache Flink perform worse (e.g., throughput, latency) than a stream-only system would have for streaming workloads, or do they perform comparably? |
In this paper, the authors propose a novel open-source system for processing stream and batch data called Apache Flink. Designing high-performance data processing systems is definitely an important problem nowadays. It is important because we now face huge amounts of data, and beyond sheer volume, different kinds of data (batch, stream, etc.) arrive from different data sources, so we need a pipeline that can unify different kinds of workloads. The high-level idea of Flink is that many data processing applications, like real-time analysis, continuous data pipelines, batch processing, and iterative algorithms, can be expressed and executed as pipelined fault-tolerant dataflows. In this paper, they present the architecture of Flink and also discuss how several use cases can be unified under a single execution model. Next, I will summarize the crux of this paper with my understanding. Apache Flink is a platform that implements a universal dataflow engine designed to perform both stream and batch analytics. Before Flink, data processing platforms suffered from several problems, like high latency for batch processing, high complexity, and arbitrary inaccuracy. The key idea of Flink is to embrace data-stream processing as the unifying model for real-time analysis, continuous streams, and batch processing, both in the programming model and in the execution engine. Flink utilizes a highly flexible window mechanism that can compute results that are both early and approximate, or delayed and accurate. It also provides different notions of time (event time, ingestion time, processing time) to give programmers high flexibility. Flink can be regarded as a full-fledged and efficient batch processor on top of a streaming runtime, supporting advanced analytics like graph processing and machine learning applications. 
Flink is a top-level project of the Apache Software Foundation that is developed and supported by a large and lively community, and it has already been adopted by many companies. Its dataflow engine treats operator state and logical intermediate results as first-class citizens and is used by both the batch and data stream APIs with different parameters. The streaming API that is built on top of Flink’s streaming dataflow engine provides the means to keep recoverable state and to partition, transform, and aggregate data stream windows. While batch computations are, in theory, a special case of streaming computations, Flink treats them specially, optimizing their execution using a query optimizer and implementing blocking operators that gracefully spill to disk in the absence of memory. This is a great industry-track paper that gives a clear description of Flink; the main technical contribution of this paper is the introduction of Flink and the idea of a universal dataflow processing platform. There are several advantages to this paper. First of all, this paper is very easy to read and understand; they give a lot of figures and examples in their paper, which I really like. Second, the performance of Flink is very good. I have seen a comparison between Apache Storm and Apache Flink before, and the results showed Flink greatly outperforming Storm. Flink can achieve high throughput, low latency, high availability, and high accuracy, which makes it applicable to various real-time data processing platforms. Third, Flink is not stateless: it provides good support for windows, and for message delivery it can provide exactly-once guarantees. Next, Flink uses a checkpoint mechanism for fault tolerance, which is much more robust than traditional ACK-based fault tolerance. Besides, Flink is an open-source product, which makes it easy to obtain and learn from. Generally speaking, it is a nice paper, and the downsides of this paper are minor. 
My first concern is why they do not provide any experiments in their paper. As a practical product that has already been adopted, they should provide some performance evaluation results and compare Flink to some baselines; this would make their paper more convincing and attract people to use their product. |
This paper starts by recognizing that batch data processing is a solution for data streaming that involves very high latencies, and even systems that offer “hybrid” solutions involving streaming & batch modes suffer from high latency due to their batch-driven philosophy. This paper introduces Flink, which is a system that is able to achieve better performance by pursuing a **unifying** architecture, rather than a **hybrid** architecture. This unified architecture is possible by observing that everything can be represented by a stream (batches are just streams with finite length). Here are the core features of Flink: 1) Flink’s most basic data model is called a dataflow graph, which is a directed acyclic graph that contains stateful operators which produce data streams that can be consumed by other operators. These can be parallelized and partitioned easily. Data exchange is done by pipelining intermediate streams that can be concurrently running. 2) Flink’s consistency scheme is “exactly-once-processing”, the meaning of which is fairly obvious, and it uses checkpointing + snapshots to deal with fault tolerance. Barriers are used in order to construct consistent snapshots. 3) Stream windows are logical views of a stream that make batch processing intuitive in a stream environment. They are constructed based on timestamps (or, optionally, other attributes). The paper’s main contributions are a system that is (or was) used in production by several companies and that (at least at a philosophical level) solves a performance issue that hybrid streaming/batch processing systems have had in the past. The main weaknesses are with the paper itself. There is no thorough experiment section, which makes it hard to evaluate what the benefits are from an empirical standpoint. The introduction was also very dense, making it difficult to understand what Flink’s main strategy was to achieve its goal of being a “unifying approach, rather than a hybrid approach”. 
Also, the introduction bashed the idea of having a hybrid approach, but Flink still uses a hybrid approach as it has a specialized batch-mode API. |
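The "batches are just streams with finite length" observation that runs through these reviews can be illustrated with a minimal Python generator. This is a conceptual sketch, not Flink code: the same operator logic consumes a bounded data set and an unbounded stream without modification.

```python
import itertools

def running_sum(stream):
    """One operator implementation serves both cases: a finite list (batch)
    and an unbounded generator (stream) are consumed identically."""
    total = 0
    for value in stream:
        total += value
        yield total

# batch: a bounded data set is just a stream that happens to end
assert list(running_sum([1, 2, 3])) == [1, 3, 6]

# streaming: consume the first results of an infinite source
first_three = list(itertools.islice(running_sum(itertools.count(1)), 3))
assert first_three == [1, 3, 6]
```

What a batch-aware runtime can add on top, per the paper, is knowledge that the input ends: it can then pick blocking exchanges, run a query optimizer, and skip periodic snapshotting, which is exactly the DataSet-specific optimization path the reviews describe.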