Reviews for Paper 31: Twitter Heron: Stream Processing at Scale

Review 1

This paper introduces the design of Heron, Twitter's stream processing engine.

The motivations for building Heron instead of sticking with Storm include:
1. Twitter needs a cleaner mapping from logical units of computation to physical processes, which is crucial for finding the failing parts of a topology.
2. Storm has inefficient resource allocation schemes.
3. Provisioning new production topologies with Storm is cumbersome for users.
4. Despite the drawbacks of Storm, Heron should remain compatible with the Storm API so that existing applications can be preserved.

Heron improves on Storm in several respects: better performance, lower resource consumption, better debug-ability, scalability, and manageability. The improvements come from the following architectural changes:
1. With the Topology Master (TM), the provisioning of resources is abstracted from the duties of the cluster manager.
2. Each Heron Instance (HI) executes a single task, which improves debug-ability.
3. Fine-grained run-time metrics make it easy to track which components of the topology are failing or slowing down.
4. Component-level resource allocation avoids over-provisioning and makes resource usage efficient.
5. A TM per topology isolates topology management, so the failure of one topology does not affect another.
6. Backpressure yields a consistent rate of delivering results and removes run-time uncertainty; Heron uses TCP backpressure together with spout backpressure.

The test results show that Heron achieves better throughput and latency than Storm, and that it also scales well.

Heron uses many techniques to improve on Storm. Two of them especially interest me:
1. The two-threaded approach of the HI. Each HI runs a Gateway thread and a Task Execution thread. This is a design we have seen in many systems, and its value is that user-code execution cannot block communication (e.g., during long waits on system calls or I/O); see the sketch after this list.
2. The backpressure mechanism to dynamically adjust the data flow rate. This is a well-known idea from computer networks, and it is interesting to see its application in a stream-processing system.
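A minimal sketch of the two-threaded pattern, assuming bounded in/out queues between the threads; the class and queue names are illustrative, not Heron's actual implementation:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of a two-threaded instance: a Gateway thread handles all communication
// while a Task Execution thread runs the user code, decoupled by bounded queues
// so slow user code (or slow I/O) never blocks the other side.
public class TwoThreadedInstanceSketch {
    private final BlockingQueue<String> dataIn = new ArrayBlockingQueue<>(1024);
    private final BlockingQueue<String> dataOut = new ArrayBlockingQueue<>(1024);

    // Gateway thread: receives tuples (simulated here) and forwards results.
    private final Thread gateway = new Thread(() -> {
        try {
            for (int i = 0; ; i++) {
                dataIn.put("tuple-" + i);            // "received" from the Stream Manager
                String result = dataOut.poll();      // forward any finished result
                if (result != null) {
                    System.out.println("emit -> " + result);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }, "gateway");

    // Task Execution thread: runs only the user's spout/bolt logic.
    private final Thread taskExecution = new Thread(() -> {
        try {
            while (true) {
                String tuple = dataIn.take();        // blocks when there is no work
                dataOut.put(tuple.toUpperCase());    // stand-in for user code
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }, "task-execution");

    public void start() {
        gateway.start();
        taskExecution.start();
    }

    public static void main(String[] args) {
        new TwoThreadedInstanceSketch().start();
    }
}
```

Because both queues are bounded, a slow Task Execution thread naturally slows the Gateway thread as well, which is also how per-instance buffers feed into Heron's backpressure story.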


Review 2

Twitter and many other organizations use real-time stream processing for tasks such as computing real-time active user counts (RTAC) and measuring users' real-time engagement with content and advertisements. Previously, Storm was the main platform for real-time analytics at Twitter. However, Storm's limitations have become apparent as the scale of the data being processed has grown. The paper therefore proposes a new system, Heron, that scales better, performs better, and is easier to manage.

The paper first reviews Storm and the motivation for developing Heron. It then presents the design of Heron and its supplementary tools. Finally, it shows an empirical evaluation comparing Storm and Heron and discusses future work.

Some of the strengths and contributions of Heron are:
1. Heron is API-compatible with Storm, which makes it easy for Storm users to migrate to the new system.
2. Heron significantly reduces the hardware resources needed to dedicate to the topologies.
3. Heron increases throughput and reduces processing latency.
4. Heron has high stability.

Some of the drawbacks of Heron are:
1. Heron depends on Mesos; a Mesos infrastructure has to be established before the advantages of Heron can be leveraged.
2. Exactly-once semantics have not been implemented in the current version of Heron, even though they are crucial for some real-time analytics.



Review 3

Twitter relies heavily on real-time stream processing, which was mainly served by the Storm platform. However, Storm has debug-ability issues, and it needs dedicated cluster resources and special hardware allocation, which leads to inefficiencies in using precious cluster resources and limits the ability to scale on demand. Therefore, this paper proposes a new real-time stream data processing system called Heron, which scales better, has better debug-ability, has better performance, and is easier to manage, all while working in a shared cluster infrastructure.

The Heron architecture consists of the Aurora scheduler and several topologies. Aurora is a generic service scheduler that runs as a framework on top of Mesos, and each topology runs as an Aurora job consisting of several containers. The containers host four kinds of components:
1. Topology Master: It is responsible for managing the topology throughout its existence and provides a single point of contact for discovering the status of the topology.
2. Stream Manager: The key function of the Stream Manager (SM) is to manage the routing of tuples efficiently. Each Heron Instance (HI) connects to its local SM to send and receive tuples.
3. Heron Instance: The main work for a spout or a bolt is carried out in the Heron Instances (HIs). Unlike the Storm worker, each HI is a JVM process, which runs only a single task of the spout or the bolt.
4. Metrics Manager: The Metrics Manager (MM) collects and exports metrics from all the components in the system. These metrics include system metrics and user metrics for the topologies (see the sketch after this list).
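Because Heron keeps the Storm API, a user metric can be registered through the Storm-compatible metrics API and is then gathered by the container's Metrics Manager. A minimal sketch, assuming the paper-era package names (backtype.storm.*); the class name and reporting interval are illustrative:

```java
import java.util.Map;

import backtype.storm.metric.api.CountMetric;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

// A bolt that registers a user-defined counter; in Heron, such metrics flow to
// the Metrics Manager in the bolt's container and on to the monitoring system.
public class CountingBolt extends BaseRichBolt {
    private transient OutputCollector collector;
    private transient CountMetric processedCount;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Report the counter in 60-second buckets via the metrics subsystem.
        this.processedCount = context.registerMetric("processed_tuples", new CountMetric(), 60);
    }

    @Override
    public void execute(Tuple tuple) {
        processedCount.incr();   // user metric, exported by the Metrics Manager
        collector.ack(tuple);    // acknowledge so upstream knows the tuple is done
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // This bolt emits nothing; it only counts.
    }
}
```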

The main contribution of this system is that it addresses several shortcomings of existing open-source streaming systems. First, Heron topologies are very easy to troubleshoot and debug. When a streaming job (topology) misbehaves for any number of reasons, such as misbehaving user code, failing hardware, or even changes in load, it is important to determine the root cause quickly, since downtime could lead to loss of revenue and a variety of other headaches. Heron provides UI and tracking tools for easily locating the logs of a particular instance of a spout or bolt and for identifying the sequence of events that led to the error. Furthermore, Heron topologies continuously monitor themselves and automatically notify the developer when something is not right. This helps developers quickly identify where the underlying issue lies and take timely action.

The main advantages of Heron are:
1. It has simplified and separate components, which makes it easier to reason about system behavior and encourages community collaboration.
2. It significantly reduces the hardware resources needed for the topologies, while at the same time increasing throughput and reducing processing latency.

The main disadvantage of Heron is its dependency on Mesos. Unless the company already has a Mesos infrastructure in place, that infrastructure needs to be established before it can leverage the advantages of Heron, which oftentimes is not such an easy task.


Review 4

Problem & motivations:
Twitter relies heavily on real-time streaming and requires a system with good debug-ability, as well as an easier way to provision and manage production topologies. The new infrastructure should also be based on the existing streaming API and remain highly compatible with it.

Main Contribution:
Heron, a new stream data processing system. The key change behind Heron is to re-architect Storm's worker model: each Heron Instance is a separate JVM process that runs a single task, and a Stream Manager in each container handles the connections and the routing of tuples.




Review 5

Real-time stream processing has become an increasingly important task at data-driven companies like Twitter, but over time the increase in the amount and diversity of data, along with new use cases, necessitates new technologies to solve new challenges. For some time, Twitter had been using a system called Storm to handle its real-time analytics needs, but various limitations in debugability, scalability, and resource management have led it to develop a new system called Heron. As will be detailed later, Heron manages to achieve significant performance gains while lowering resource consumption and improving debugability, scalability, and manageability. This is significant because the lessons learned in developing such a system for a major technology company like Twitter can very well be applied elsewhere.

As previously mentioned, Storm was beginning to reach some limits due to its initial implementation. For instance, each worker is a JVM process that schedules many tasks inside it, and the multiple levels of scheduling and the complex interactions between tasks often make it difficult to determine where each task is running. Additionally, disparate tasks can run in the same JVM, making it difficult to determine the root causes of bugs or failures. If that were not enough, Storm treats every worker the same, regardless of its resource requirements, which often results in overprovisioning resources such as memory. These reasons, as well as other weaknesses such as inefficiency and a single point of failure in the Storm Nimbus component, caused the authors to consider changes to Storm. For various reasons, they concluded that it would be easier to create a new system, called Heron.

Heron also runs topologies like Storm, and is designed to be compatible with Storm code to facilitate adoption. It is made up of containers that hold processes, including the Topology Master, Stream Manager, Metrics Manager, and Heron Instances, which are scheduled using Aurora. The Topology Master manages the topology by serving as the point of contact for getting the status of the topology. The Stream Manager manages the routing of tuples efficiently, employing techniques such as backpressure to control the rate at which tuples come in. Heron Instances are essentially the tasks/code that the user wants to run; they are implemented using a two-threaded approach, one thread being a Gateway thread and the other a Task Execution thread. Finally, the Metrics Manager collects and exports metrics for all system components. Overall, Heron manages to do fine-grained resource provisioning, abstract resource provisioning from the duties of the cluster manager, and improve on debugability by confining a given Heron Instance to a single task. Additionally, metrics make it easier to see which components are faster or slower, and having a Topology Master for each topology allows topologies to be managed independently of each other.

The main strength of this paper is that it introduces a working, large-scale stream processing system that solves many of the issues faced by its predecessor. As the authors point out in their architectural description, Storm had various limitations that were corrected by Heron. What is more, Heron also manages to achieve impressive throughput and latency improvements, which become even more pronounced as the parallelism (the number of spout/bolt instances in the paper's graphs) increases. Another impressive feat the authors managed was complete compatibility of Heron with Storm code, which makes it much easier for former Storm users at Twitter to move over to the new system.

One weakness with the Twitter Heron system is possibly the lack of an “exactly once” tuple processing semantic, where a tuple is processed exactly once. As a result, tuples may either be missed (“at most once”) or they may be processed more than once (“at least once”). While Twitter’s requirements may make this unnecessary, it does bring into question how generalizable the Heron system is to non-Twitter workloads, especially since it, like Storm, has been given to Apache.


Review 6

Heron is the efficient and scalable stream processing engine used at Twitter, and this paper's purpose is to describe that system. With the growing size of Twitter and the increasing diversity of use cases, the shortcomings of the Storm data analytics tool were exposed, such as difficulty managing machine provisioning, debugging across multiple components in a single system process, and sharing resources. The new system, Heron, is also API-compatible with its predecessor Storm, so the switch from Storm to Heron was easy for Twitter. There is a limited section on related work because Twitter needed an open-source, high-performance solution that worked on shared infrastructure and was compatible with the previous Storm API, which left them to build Heron themselves.

First the paper breaks down the Storm architecture for background and describes its limitations. The limitations arise because computational tasks are grouped into the same process: a worker (which runs as a single process) runs multiple executors, which in turn run multiple tasks. This leads to uncertainty in task scheduling and resource usage, which makes it hard to debug since a worker can run very different tasks. Logs from multiple tasks are also written to a single file, which further adds to the confusion. The authors considered the idea of modifying the workers to run single tasks, but they argued that this would make workers very resource-inefficient. Next they describe the shortcomings of Storm's Nimbus, which schedules, monitors, and distributes JARs. Nimbus does not isolate workers of different topologies, so those placed on the same machine can interfere with one another. Nimbus also uses Zookeeper to monitor heartbeats, which limits the number of workers. In addition, Nimbus is not backed up, so it is a single point of failure. There were also cases of unpredictable performance caused by replaying failed tuples, high-latency garbage collection cycles, and contention for resources.

Next the paper describes the Heron system. The authors say that since the issues mentioned are fundamental to Storm, they needed to rewrite the system from the ground up in order to fix them. To remain compatible with the Storm API for easy migration, Heron uses the same data model and API as Storm. Heron runs topologies, which are DAGs of spouts (sources of input data) and bolts (abstractions of computation on streamed data); a topology is equivalent to a logical query plan in a DBMS. Heron deploys topologies using the Aurora scheduler, Twitter's generic service scheduler that runs on top of Mesos. Each Aurora job consists of multiple containers, which are described in the paper. One container runs the Topology Master process, while the rest each run a Stream Manager, a Metrics Manager, and a number of Heron Instance processes. The Topology Master manages the topology throughout its existence, ensuring there is only one master and that the master is discoverable by all processes in its topology. The Stream Manager routes tuples to the proper Heron Instance. The Metrics Manager collects and exports system and user metrics for the topologies. A Heron Instance is a single process that runs the computation of one spout or bolt.
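Since Heron keeps Storm's data model and API, the DAG of spouts and bolts is declared exactly as in Storm. A minimal sketch, assuming the paper-era Storm-compatible API (backtype.storm.*) and hypothetical TweetSpout, SplitBolt, and CountBolt user classes that are not shown here:

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

// Wiring the DAG of spouts and bolts (the logical plan). The parallelism hints
// determine how many instances each component gets in the physical plan; under
// Heron each such instance becomes its own Heron Instance process.
public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("tweets", new TweetSpout(), 4);          // source of tuples
        builder.setBolt("split", new SplitBolt(), 8)              // computation
               .shuffleGrouping("tweets");                        // edge: random routing
        builder.setBolt("count", new CountBolt(), 8)
               .fieldsGrouping("split", new Fields("word"));      // edge: hash on "word"

        Config conf = new Config();
        conf.setNumWorkers(4);  // Storm-level knob; Heron maps components onto containers itself
        StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
    }
}
```

In Heron, the same code is submitted through the Heron command-line tooling, and the scheduler (Aurora at Twitter) decides which containers the resulting Heron Instances land in.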

I liked how much detail and history the paper gave about the Storm system that predated Heron. I did feel the paper paid a bit too much attention to the faults of Storm, in detail, just to say that Heron was built from the ground up because of those faults. I think they could have introduced Heron earlier instead.



Review 7

Twitter Heron: Stream Processing at Scale

This paper introduces Heron, which was developed by Twitter. The paper starts by listing some drawbacks of the Storm system: Storm is considered cumbersome to operate, the Nimbus master node and the Zookeeper cluster carry a heavy load, and the system is not easy to modify. Therefore, they decided to redesign Storm and named the result Heron.

The existing problems in Storm caused Twitter to develop Heron. First, in Storm each worker process consists of multiple executor threads, and each executor runs multiple tasks. This complex relationship makes scheduling and load balancing hard. Second, Nimbus is overloaded: the Nimbus node handles many duties such as task scheduling, monitoring, and distribution of JAR files, and it becomes a bottleneck when there are many topologies. Third, there are efficiency issues: the tuple replay mechanism in Storm is sub-optimal, and Storm spends a long time on garbage collection.

Overall, Heron keeps the same stream processing model as Storm. The biggest change is in the framework, which hands the resource management job to an external scheduler (such as Aurora on Mesos, and potentially YARN or Docker-based schedulers). Furthermore, to improve the predictability of the system, each task corresponds to a single process. The Stream Manager is responsible for routing tuples and behaves like a hub.

Heron has the following properties. First, Heron uses containers and a scheduler for resource management. Second, each task corresponds to a process, which makes debugging easy. Third, it is easy to locate failures by recognizing the faulty component in a topology. Fourth, it improves resource allocation and minimizes resource waste. Fifth, topologies are independent of one another. The experiments show that, compared with Storm, Heron uses far fewer resources and greatly improves throughput and latency. We can conclude that Heron performs much better than Storm.

The contribution of this paper is that it introduces Heron, designed by Twitter, and analyzes why Twitter wanted something to replace Storm. Heron brings a new stream processing framework with better performance than Storm.

The advantages of Heron are: (1) compared with Storm, each Heron Instance is an independent JVM process, which is clearer and easier to debug; (2) a topology in Twitter Heron only uses the resources originally allocated to it, which means it never goes beyond its resource limit; (3) Heron's better system design gives it higher throughput and lower latency.

The disadvantage of Heron, from my searches, is its dependency on Mesos. An in-place Mesos infrastructure is needed before running Heron, which makes adoption harder.


Review 8


This paper mainly describes the limitations of the old generation of Twitter's stream engine, Storm, and proposes the new engine, Heron. Twitter Heron is a real-time, fault-tolerant, distributed stream data processing system open-sourced by Twitter. Heron is the direct successor to Apache Storm and inherits its real-time, fault-tolerant, low-latency features.
The limitations of Storm mainly lie in the following aspects: scalability, debug-ability, manageability, and efficient sharing of cluster resources with other data services.

A Storm worker is a JVM process, and each worker can run multiple executors. Under Storm's existing scheduling mechanism, we cannot determine which task is assigned to which worker, or to which physical machine. Because the assignment of tasks to workers is unknown, very different tasks may end up in the same worker: for example, a join task and a task that writes output to a DB store (or other storage) may be assigned to the same worker, so performance cannot be guaranteed.

Storm's Nimbus carries out many arduous duties, including scheduling, monitoring, and distributing JARs. Nimbus becomes a bottleneck as the number of topologies grows. The Nimbus scheduler does not support fine-grained resource reservation and isolation for workers, so workers of different topologies assigned to the same physical node are likely to interfere with each other.

The paper then introduces the Heron architecture. The user submits a topology to the Aurora scheduler, and every topology runs as an Aurora job. Each job consists of several containers, which are allocated and dispatched by Aurora. The first container runs the Topology Master, and each of the remaining containers runs a Stream Manager (along with a Metrics Manager and several Heron Instances). All metadata, including who submitted the job, the job information, the launch time, and so on, is saved in Zookeeper.
Each Heron Instance is written in Java and is a JVM process. Heron processes communicate using protocol buffers.

It is worth mentioning that each HI runs only a single task, that is, either a spout or a bolt; this is good for debugging, and the design also leaves room for handling more complex data in the future. There are two HI designs: single-threaded and two-threaded. The TM (Topology Master) is mainly responsible for managing the topology. At startup, the TM stores its information as an ephemeral node on Zookeeper so that other processes can discover it. The ephemeral node thus serves two purposes: it prevents multiple TMs from becoming master for the same topology, and it allows any other process belonging to the topology to discover the TM. Heron also provides a backpressure mechanism to dynamically adjust the rate at which data flows. This mechanism allows each component in the topology to run at a different speed and to change its speed dynamically.
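The ephemeral-node trick can be sketched with the plain ZooKeeper client API. This is only an illustration of the idea, not Heron's actual coordination code; the connection string, znode path, and endpoint format are assumptions:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Only one process can create the ephemeral node, and the node disappears
// automatically if that process dies, so it both blocks duplicate masters and
// lets other processes in the topology discover the live TM.
public class TopologyMasterRegistration {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> { });
        String tmZnode = "/heron/topologies/word-count/tm";   // parent znodes assumed to exist
        byte[] tmEndpoint = "tm-host:9000".getBytes();        // where the TM can be reached

        try {
            // EPHEMERAL: the node lives only as long as this ZooKeeper session.
            zk.create(tmZnode, tmEndpoint, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("Registered as the Topology Master");
        } catch (KeeperException.NodeExistsException e) {
            // Another TM is already registered for this topology; refuse to take over.
            System.err.println("A Topology Master already exists for this topology");
        }

        // Any other process in the topology discovers the TM by reading the node.
        byte[] discovered = zk.getData(tmZnode, false, null);
        System.out.println("TM endpoint: " + new String(discovered));
    }
}
```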

The main contribution of this paper is the Heron system. First, unlike Storm, a Heron Instance is a standalone JVM process that runs either a spout or a bolt, which is clearer and easier to debug; there are no longer many spouts and bolts running inside a single JVM process as in Storm. Second, topologies in Twitter Heron only use the resources they were initially allocated and never exceed their resource limits; resource scheduling is delegated to a cluster scheduler (Aurora on top of Mesos at Twitter). Third, Heron has a built-in backpressure mechanism to ensure that a topology can adapt to slow components. Finally, Heron's better system design gives it higher throughput and lower latency than Apache Storm.




Review 9

Heron is the system that Twitter built to replace Storm, an Apache project it had previously been using. The key difference between Heron and Storm is that Heron does a better job of separating functionality. In Storm, different functional processes might run on the same machine, making it hard to debug and find the true cause of problems. In Heron, the functional components, including a Topology Master, Stream Managers, and Heron Instances (at the core), are well defined. The Stream Manager implements backpressure, which is not provided by Storm. Essentially, this allows more control over the rate of data flow for the overall efficiency of the system.

Twitter decided to build Heron after concluding that Storm was unsuitable, and their process for choosing to build a new system over adopting an existing one was interesting. Because there were already so many topologies reliant on Storm, Twitter chose to expose the same Storm API in Heron. This was not available in existing systems, and therefore Twitter built their own. I think that this proved to be a good decision for them, and it is probably something that many companies want to do once they have a large codebase written for a specific system that isn't working for them anymore. It also makes me wonder if there is any talk of a common API for these kinds of systems, especially as they are being developed. This could make adoption much easier.

I really liked the way that results were shared in this paper. The graphs are clear and show true motivation for a system like Heron. This is a great example of how tackling the constants in big-O can be incredibly important with the amount of data that a place like Twitter processes. Although these constant factors are often ignored in theory, they can save millions of dollars for a company. Figure 9 is a really great example of two systems where throughput appears to increase linearly, but one is clearly significantly better.

I thought that the piece of the paper on “Motivation for Heron” seemed odd for something published in an academic venue. Essentially, it covered the operational overhead of Storm in detail. Topics discussed included pagers and the fact that typically the service is shut down because debugging is too hard. I suppose this is interesting, but it seems more appropriate for a company email than an academic conference. These are important topics, but they are rarely discussed in the academic community, as far as I’m aware. To me, the results are more convincing that Heron is a useful system.



Review 10

This paper introduces Heron, a stream processing system developed at Twitter. Originally, Twitter used Storm as its real-time streaming engine; however, as the scale of data being processed in real time at Twitter has increased, the limitations of the Storm system have become apparent. More specifically, Storm has three main drawbacks. First, it is very hard to debug, as work from multiple components of a topology is bundled into one process. Second, Storm needs dedicated cluster resources, which leads to inefficiencies in using precious cluster resources and limits the ability to scale. Finally, Storm has no back-pressure mechanism, which makes the system's behavior less predictable.

The goal of Heron is to solve all these problems while maintaining the same API as Storm. Like Storm, Heron runs topologies, which can be viewed as logical query plans. To run a topology, it is deployed to the Aurora scheduler, a generic service scheduler, and Aurora creates several containers for the topology job. The first container runs the Topology Master process, which is responsible for managing the topology throughout its existence and allows each topology to be managed independently of the others. The other containers each run a Stream Manager, a Metrics Manager, and several Heron Instances. The functionality of the Metrics Manager and the Heron Instances is largely self-evident; note that each Heron Instance runs only one task, which makes the system a lot easier to debug. The Stream Manager manages the routing of tuples efficiently, and all the Stream Managers form a fully connected graph for coordination. It also applies the spout backpressure mechanism to dynamically adjust the rate of data flow: whenever a Stream Manager notices that one or more of its Heron Instances are slowing down, it identifies its local spouts and stops reading data from them, and it sends a special message to the other Stream Managers so that they also stop reading data from the spouts in their containers. This mechanism allows Heron to achieve a consistent rate of delivering results.

I think one downside of the paper is that it spends too much time on the limitations of the Storm system. It would have been enough to outline some of the key downsides of that system and explain why Twitter needed to solve those problems.




Review 11

In the paper "Twitter Heron: Stream Processing at Scale", Sanjeev Kulkarni and Co. discuss Heron, a new data processing engine at Twitter that solves the previous problems faced by their old engine, Storm. Storm has become increasingly useless as the scale of data being processed has increased. Furthermore, the increase in the diversity and number of use cases reveals many limitations about Storm. These limitations include scalability, debug-ability, manageability, and efficient sharing of cluster resources with other data services. Rather than rewriting a large number of applications that are already compatible with Storm, researchers at Twitter felt that developing a new stream processing engine would solve their problems in the most cheapest manner. Thus, this paper serves to give an empirical evaluation of Heron based on their Twitters workload requirements.

The paper is divided into two sections - A description of Storm's architecture and limitations, and Heron's solution to these limitations:

1) Storm:
a) Background: Spouts = sources of data input. Bolts = computation on a stream. Spouts pull data from queues (e.g., Kafka or Kestrel) and bolts carry out the computation. They are run as tasks, and tasks are grouped into executors.
b) Architecture Limitations: Complex code makes scheduling a nightmare. Furthermore, debugging techniques cannot pinpoint the source of problems. There are significant overhead and queue contention issues in this design.
c) Storm Nimbus Issues: Nimbus acts as the bottleneck of the system because it is functionally overloaded (scheduling, monitoring, distributing JARs, and reporting metrics). Also, Nimbus is a single point of failure: if it fails, topologies cannot be managed properly.
d) Lack of Backpressure: If it can't handle incoming tuples, it just drops them (not good!).
e) Efficiency: Unpredictable performance caused by tuple replays, long garbage collection cycles, and queue contention. Thus, high CPU usage (overprovisioning) was required to keep the system afloat (also not good!).
2) Heron:
a) Data Model: Same spouts and bolts as Storm. Heron provides "at most once" and "at least once" tuple processing semantics (see the acking sketch after this outline).
b) Architecture: Users employ the Heron API to create and deploy topologies to the Aurora scheduler, using a Heron command line tool.
c) Stream Manager (SM): Manages the routing of tuples efficiently. All SMs in a topology connect between themselves to form a network.
d) Heron Instances (HI): Each HI connects to its local SM to send and receive tuples. The main work for a spout or a bolt is carried out in the HI. It is easy to debug/profile a spout or bolt since the developer can easily see the sequence of events of an HI.
e) Startup + Failures: When any instance dies, it is immediately restarted and recovers its state (such as the latest physical plan) so it can apply any changes it missed.
f) Heron HCI (UI and tracking tools): These are included for user experience: i) interacting with topologies, ii) viewing metrics, iii) viewing exceptions raised in HIs, and iv) viewing topology logs.
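A minimal sketch of what "at least once" looks like through the Storm-compatible API (the spout class and its data source are hypothetical): the spout emits each tuple with a message id, keeps it until acked, and re-emits it when fail() is called, so a tuple may be processed more than once but is never silently dropped.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Hypothetical spout illustrating at-least-once semantics via message ids.
public class AtLeastOnceSpout extends BaseRichSpout {
    private transient SpoutOutputCollector collector;
    private final Map<Object, String> pending = new ConcurrentHashMap<>();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String word = "hello";                       // stand-in for reading from a queue
        Object msgId = UUID.randomUUID();
        pending.put(msgId, word);                    // remember until acknowledged
        collector.emit(new Values(word), msgId);     // emitting WITH a msgId enables tracking
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);                       // fully processed downstream
    }

    @Override
    public void fail(Object msgId) {
        String word = pending.get(msgId);
        if (word != null) {
            collector.emit(new Values(word), msgId); // replay: at-least-once
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```

Emitting without a message id (collector.emit(new Values(word))) gives the "at most once" behavior instead, since nothing is tracked or replayed.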

Like several other papers written by companies, this paper has some drawbacks. The first has to do with the evaluation, and it is a criticism directed at every company that builds its own infrastructure. I understand that the intent of this paper was to highlight Heron solving Storm's previous issues, and accordingly the experimental results pit these two approaches against each other. However, I would also like to see performance comparisons against other companies' solutions to stream processing. Even though the workloads are probably different, there is a great deal to learn and improve on if one expands the field of view. Another drawback of this paper is the lack of the SQL interface that Stonebraker believed to be a vital part of stream processing; there was no mention of such a design in current or future work. Instead, they use a low-level programming language (Java) to express user logic, which gives rise to more complex code, more lines of code, and more bugs.


Review 12

This paper describes Twitter’s program for processing streaming data, called Heron. Heron is a successor to Twitter’s previous system, called Storm. Twitter requires large amounts of real-time streaming processing, but Storm had several issues that Heron was built to fix.

Storm is built off of spouts, or input sources, and bolts, or operations on input. The directed graph of all of the spouts and bolts is referred to as a topology, which functions like a query plan in a standard DBMS. Each spout and bolt is a separate task, but multiple tasks are handled by a single worker. This causes issues like:

Multiple levels of scheduling are required, which makes tracking execution difficult.

All of the logs for spouts and bolts are in one file, which makes debugging difficult.

If one task fails, it crashes the worker, which fails all of the other tasks as well.

Heron fixes this for the most part by splitting its tasks more effectively, dividing work into the Topology Master, the Stream Managers, and the Heron Instances. The Topology Master is the single point of access to the overall topology, so there should always be exactly one. The Stream Managers decide the flow of data between nodes in the topology; there are several of them, so each Heron Instance can contact its local one for routing. The Stream Managers can apply backpressure to certain nodes, slowing down the flow of data so that later parts of the pipeline aren't overwhelmed.

The Heron Instances are the processes that run spouts and bolts; there is one Instance for each spout or bolt task. The user writes Java code for the Instance to run. This way, failures in one Instance don't affect the others, and Instances can be debugged individually. Instances can run as a single thread, or as two separate threads for I/O and for processing, whichever the system needs.

The benefits of this system are that it creates useful logical boundaries between all of the separate parts of the system. Any part can fail independently of other parts, so failures can easily be tracked, isolated, and debugged. As well, the new implementation allows Heron to run with higher throughput and lower latency than Storm.

The downside, such as it is, is that Heron isn’t compared to anything other than Storm. Replacing Storm is obviously good for Twitter, but there isn’t any demonstration of how it compares to other similar systems, if there are any.




Review 13

Heron is Twitter's stream processing engine. It replaced Apache Storm at Twitter, and all production topologies inside Twitter now run on Heron. Heron is API-compatible with Storm, which made it easy for Storm users to migrate to Heron. The reason Heron was developed was to improve on the debugability, scalability, and manageability of Storm. While a lot of importance is attributed to performance when comparing systems, these properties (debugability, scalability, and manageability) are often more important in real-world use.
When a topology is submitted to Heron, the Resource Manager first determines how many containers should be allocated for the topology. The first container runs the Topology Master which is the process responsible for managing the topology throughout its existence. The remaining containers each run a Stream Manager, a Metrics Manager and a set of Heron Instances which are essentially spouts or bolts that run on their own JVM. The Stream Manager is the process responsible for routing tuples among Heron Instances. The Metrics Manager collects several metrics about the status of the processes in a container.

Benefits:
(1) Heron increases performance predictability, improves developer productivity, and eases manageability.
(2) Easy debugging: Every task runs in process-level isolation, which makes it easy to understand its behavior, performance and profile.


Cons:
(1) Heron lacks mechanisms for exactly-once semantics.




Review 14

This paper presents the design and implementation of a new stream processing system called Heron, which is a replacement for Storm. The paper also provides empirical evidence to demonstrate the efficiency and scalability of Heron.
One of Storm's limitations is that work from multiple components of a topology is bundled into one operating system process, which makes debugging very challenging. Also, Storm needs dedicated cluster resources, which requires special hardware allocation to run Storm topologies. This approach leads to inefficiencies in using precious cluster resources and also limits the ability to scale on demand. With Storm, provisioning a new production topology requires manual isolation of machines, and managing machine provisioning manually is cumbersome. Heron is designed to address all of these issues.
The paper presents the design of Heron in detail. One of the key design goals for Heron is to maintain compatibility with the Storm API; therefore, the data model for Heron is identical to Storm's. Heron runs topologies, which are directed acyclic graphs of spouts and bolts. Spouts generate the input tuples that are fed into the topology, and bolts do the actual computation. A Heron topology is a logical plan that is translated into a physical plan before execution. Heron's tuple processing semantics include at most once and at least once: the first means no tuple is processed more than once, and the second means each tuple is guaranteed to be processed at least once.

The Topology Master is in charge of managing the topology and provides a single point of contact for discovering the status of the topology. Its ephemeral node serves to prevent multiple TMs from becoming the master for the same topology and allows any other process that belongs to the topology to discover the TM. The Stream Manager is used to manage the routing of tuples efficiently; each Heron Instance connects to its local SM to send and receive tuples.

The backpressure mechanism is used to dynamically adjust the rate at which data flows through the topology, which is important in topologies where different components can execute at different speeds. There are three main backpressure strategies: TCP backpressure, spout backpressure, and stage-by-stage backpressure. Heron uses the spout backpressure approach, as it is simpler to implement. Backpressure is triggered when the buffer size reaches a high water mark, and remains in effect until the buffer size goes below a low water mark.

The main work for a spout or a bolt is carried out in the Heron Instances. Each HI is a JVM process that runs only a single task of the spout or the bolt, which makes debugging easy; HIs can use a single-threaded or a two-threaded design. Finally, the Metrics Manager collects and exports metrics from all the components in the system, including system metrics and user metrics for the topologies. Following the technical details of Heron, the paper goes on to assess Heron in production.
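The high/low water mark behavior described above can be sketched in a few lines; the thresholds, names, and notification mechanism here are illustrative rather than Heron's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative spout-backpressure trigger with high/low water marks.
// Once the outgoing buffer crosses HIGH_WATER_MARK, reading from local spouts
// stops (and peer Stream Managers would be told to do the same); reading only
// resumes after the buffer drains below LOW_WATER_MARK, which keeps the system
// from rapidly oscillating between the two states.
public class SpoutBackpressureSketch {
    private static final int HIGH_WATER_MARK = 800;   // assumed thresholds
    private static final int LOW_WATER_MARK = 200;

    private final Queue<String> sendBuffer = new ArrayDeque<>();
    private boolean backpressureActive = false;

    void enqueue(String tuple) {          // called when a tuple is buffered for sending
        sendBuffer.offer(tuple);
        updateBackpressure();
    }

    void onTupleSent() {                  // called when a buffered tuple is drained
        sendBuffer.poll();
        updateBackpressure();
    }

    private void updateBackpressure() {
        if (!backpressureActive && sendBuffer.size() >= HIGH_WATER_MARK) {
            backpressureActive = true;
            notifySpouts(false);          // stop consuming from local spouts
        } else if (backpressureActive && sendBuffer.size() <= LOW_WATER_MARK) {
            backpressureActive = false;
            notifySpouts(true);           // resume consuming
        }
    }

    // Stand-in for the start/stop message sent to local spouts and peer SMs.
    private void notifySpouts(boolean resume) {
        System.out.println(resume ? "resume spouts" : "stop spouts");
    }
}
```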
I like this paper because it not only discusses the technical details of Heron but also assesses it in production. Also, the background of Storm and why it needs to be replaced is clearly presented in the paper.


Review 15

“Twitter Heron: Stream Processing at Scale” by Kulkarni et al. presents Heron, a new internal system at Twitter for stream processing. Heron was created as a result of limitations and challenges with Twitter’s prior stream processing system, Storm. Challenges with Storm included:
- Debug-ability: different logical components of the processing topologies were intertwined in the same physical process, so it was difficult to tell the source of a bug.
- Scalability: it was difficult to flexibly share and scale cluster resources, as Storm had particular hardware requirements.
- Manageability: Machines had to be manually provisioned/decommissioned

The authors therefore created Heron, which addresses the above challenges by having intentionally separate processes, Heron Instances, for different topology components (i.e., spouts and bolts). The Topology Master manages the topology, in particular communicating with its Stream Managers and Metrics Managers. Stream Managers communicate with Heron Instances and manage where data tuples are sent and at what rate. In particular, Stream Managers apply backpressure, noticing when data buffers are getting full and telling their Heron Instance and Stream Manager neighbors to slow their send rate; these messages propagate recursively through the directed acyclic graph of processes as necessary. Backpressure is important because it reduces the amount of data lost when a receiving buffer is full. Another great thing about Heron is that allocation of resources is done dynamically: the scheduler finds machines that can be used and creates containers, and the Topology Master assigns spouts and bolts to containers. For end users at Twitter, the Heron system also has a Tracker, UI, and Viz for interacting with topologies and viewing metrics, errors, and logs.

The authors ran experiments to compare Storm and Heron. They used two different workloads: 1) the Word Count topology, which does minimal work, to evaluate system overheads, and 2) RTAC (real-time active user counts). They also ran each workload a) with acknowledgements enabled and b) with acknowledgements disabled. For the Word Count experiment, they found that Heron achieves a much higher throughput than Storm (10-14x), and that Heron's latency grows less steeply than Storm's. For the RTAC experiment, they found that Heron requires fewer CPU resources than Storm to keep up with the ~6M tuples/min rate.

I think the approaches the authors used in designing Heron make sense and are smart: systematic backpressure to reduce lost data, and separate processes for different topology components and management components to make debugging easier and to more flexibly scale and manage machine resources.

In section 7.1, the authors say that the topologies evaluated "were constructed primarily for this empirical evaluation, and should not be construed as being the representative topology for Heron/Storm workloads at Twitter". This makes me wonder how well Heron works for Twitter's actual workloads. If Twitter did not want to include actual workloads (for privacy or other reasons), it would have been nice to see a characterization of what Twitter's workloads are like and how they compare to the Word Count and RTAC ones. Also, as a minor point, I think topologies could have been explained better, in particular with a more concrete example of a topology earlier in the paper; I only really understood what a topology was when I was reading about the Word Count topology in the evaluation (section 7.3).



Review 16

This paper proposes Heron, a platform for real-time analytics at Twitter. It is a superior successor to the older platform, Storm: Heron scales better, has better debug-ability, has better performance, and is easier to manage. The main part of the paper presents the design and implementation of the system, and it also gives empirical results on the efficiency and scalability of Heron.

The data model of Heron is identical to that of Storm: topologies. A topology is a directed acyclic graph whose nodes represent data-computing elements and whose edges represent the streams of data flowing between those elements. There are two types of nodes: spouts, which connect to a data source and inject the data into a stream, and bolts, which process incoming data and emit data. The paper also introduces the Heron Instance, which is a JVM process that runs only a single task of a spout or a bolt.

I think the strongest part of the paper is its consideration of the implementation details of a large-scale system in a production environment. For example, in the Stream Manager subsection, the paper not only gives the design principles but also discusses TCP backpressure under different circumstances. Similarly, in the Heron Instance section, the paper considers both the single-threaded and the two-threaded approach, and explains the differences between them in detail.

The drawback of the paper, to me, is that the data model offers no novelty: it simply inherits Storm's data model. Although the paper justifies this by the need for compatibility between Heron and Storm, I still count it as a drawback.


Review 17

In this paper, engineers from Twitter propose a novel real-time stream data processing platform called Heron. Before adopting Heron, they were using Storm as the main platform for real-time analytics at Twitter. However, Storm is subject to several drawbacks: it cannot handle large-scale workloads well, it has poor debug-ability, and its performance falls short. Designing and implementing a new real-time stream data processing system is clearly an important issue for Twitter. First of all, Twitter has many customers and its main services require strong real-time processing ability at scale; building a high-performance system improves the user experience and expands the business. Second, building the system internally means it can be utilized by multiple products inside Twitter, reducing the cost of relying on third-party software. Based on these demands, they designed and implemented Heron, which scales better, has better debug-ability, has better performance, and is easier to manage, all while working in a shared cluster infrastructure. Next, I will summarize the crux of this paper as I understand it.

First of all, they introduce several drawbacks of Storm, including poor scalability, debug-ability, manageability, and resource sharing. One big challenge is debug-ability: in operation, there can be many causes of performance degradation, and in Storm multiple components of a topology are bundled into a single system process, which makes debugging very hard. In response, Heron provides a cleaner mapping from the logical units of computation to physical processes, which makes debugging much easier. In addition, Storm needs dedicated cluster resources, which requires special hardware allocation to run Storm topologies and causes inefficiency in using cluster resources. In response, Heron can work with popular cluster scheduling software, which allows the cluster resources to be shared across different types of data processing systems. Third, with Storm, provisioning a new production topology requires manual isolation of machines, and this job is cumbersome. Heron aims to solve these problems and achieve all of these goals.

Based on the demands above, they propose Heron, which is API-compatible with Storm, making it easier for users to switch. Besides providing significant performance improvements and lower resource consumption over Storm, Heron also has big advantages in terms of debug-ability, scalability, and manageability. The paper first discusses the motivation for Heron and some design alternatives, then the data model and API. Like Storm, Heron runs topologies, which are directed acyclic graphs of spouts and bolts; a Heron topology is equivalent to a logical query plan in a database system. Heron's tuple processing semantics are similar to Storm's, including at most once and at least once.

As for the architecture, Heron consists of one Aurora scheduler and several topologies. Aurora is a generic service scheduler that runs as a framework on top of Mesos, and each topology runs as an Aurora job consisting of several containers. The first container runs a process called the Topology Master, which is responsible for managing the topology throughout its existence. The key function of the Stream Manager is to manage the routing of tuples efficiently, and each Heron Instance (HI) connects to its local SM to send and receive tuples. The main work for a spout or a bolt is carried out in the Heron Instances; unlike the Storm worker, each HI is a JVM process that runs only a single task of the spout or the bolt. The Metrics Manager collects and exports metrics from all the components in the system. The paper also covers Heron in production and gives empirical evaluations. The experimental results for Heron are promising: it delivers 6-14x improvements in throughput and 5-10x reductions in tuple latencies.

There are several advantages to Heron. First, Heron saves resources in small clusters and also supports large-cluster scenarios; this greatly outperforms Storm, given Storm's scalability issues. Also, from an engineering perspective, it is easy to migrate services from Apache Storm to Heron because Heron is backward compatible with Storm's APIs. Besides, Heron is very flexible and can be configured to handle different use cases at Twitter. In addition, the paper gives rich examples and diagrams that clearly describe the concepts, which makes it easy to read and understand. Last but not least, the authors not only provide the details of Heron's design and implementation but also share their experience operating Heron in production, which is a good guide for people who are building or maintaining such systems.

However, there are also some drawbacks to Heron. First of all, I think Heron is more suitable for very large clusters and large applications, because there will be some wasted resources when running small applications on Heron. Second, as stated in the paper, Heron only offers at-most-once and at-least-once semantics; there is no exactly-once guarantee, which can be a limitation because some applications do want exactly-once delivery.



Review 18

This paper introduces Heron, a system for stream processing and real-time analytics developed for use at Twitter. It was specifically developed as a soft replacement for the previous system, Storm (meaning it stayed compatible with the Storm API to make switching over as painless as possible). Twitter's main issues with Storm were that it did not scale very well and was not easily manageable; in particular, debugging and pinpointing resource usage were major pain points.

Heron is a system built on similar ideas as Storm in order to preserve compatibility: both, for example, use topologies as their main data model. A topology is a directed acyclic graph that uses spouts to get data as input and bolts to process this data for real-time analytics. One difference between Heron and Storm is that Heron uses the Twitter-built Aurora scheduler rather than Nimbus, which the introductory sections specifically complained about. Heron does several things to improve scalability and manageability over Storm:

1) It provides a “Topology Master” that is responsible for discovering the status of the topology at any given time.

2) It uses backpressure to make sure that congestion and bottlenecking don't happen due to overly optimistic data emission in the early (spout) stages of a topology.

3) It provides a “Metrics Manager” that provides metrics for each of the components in the system — this helps solve the debugging issues that plagued Storm as things can be found/discovered much faster.

4) The bulk of the work done for any spout or bolt is encapsulated in a “Heron Instance” (single process) which makes it easy to debug/profile the execution of something, unlike Storm where it was difficult to discern which node was executing what at times due to threading.

The strengths of this paper are fairly self-evident. A problem was identified with an industry-grade system, and a new system was built to replace it. The new system performs well and solves the issues it was meant to address, while preserving compatibility (which is no small feat).

The main weakness of this paper was the logical outline. The problems with Storm were well-introduced, but the “here’s how we fixed it” didn’t come until very late in the paper, relatively speaking. I could infer that Heron solved the various issues that were introduced but rather than stating “Here’s the problem, and here’s how Heron fixed it”, it went more like “Here are all the problems. Heron fixed them, and here’s Heron’s architecture…and finally yes this is how those problems were fixed”. It wasn’t impossible to understand but it didn’t flow nearly as nicely as some of the other papers that we have read, in my opinion.