Review for Paper: 35-Scaling Distributed Machine Learning with the Parameter Server

Review 1

This paper introduces Parameter Server, a framework for distributed machine learning problems. Its features include ease of use (globally shared parameters), efficiency (asynchronous communication), support for flexible consistency models ranging from eventual to sequential consistency, elastic scalability, and fault tolerance for stable long-term deployment.

One thing I like about this paper is that, before introducing the architecture of the parameter server, it first provides detailed background on the machine learning tasks and algorithms involved, which gives infrastructure engineers with less machine learning experience more insight into the reasons behind the design choices.

In the architecture section, the paper introduces how key-value vectors are used, push/pull operations for efficient communication, the consistency models supported (sequential, eventual, bounded delay), and user-defined filters. The implementation section considers cluster-related techniques and requirements, including vector clocks, replication, and server/worker management.
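To make the push/pull pattern concrete, here is a minimal, self-contained sketch of a worker iteration against a toy in-memory server. This is my own illustration (the class and function names are made up and a simple least-squares loss is assumed), not the paper's actual API.

```python
import numpy as np

class ToyParameterServer:
    """Minimal in-memory stand-in for the server group (illustration only)."""
    def __init__(self, num_keys):
        self.w = np.zeros(num_keys)

    def pull(self, keys):
        return self.w[keys]

    def push(self, keys, grad, lr=0.1):
        # A real server group aggregates updates from many workers;
        # here we simply apply a gradient step.
        self.w[keys] -= lr * grad

def worker_iteration(server, X, y, keys):
    """One worker iteration: pull weights, compute a local gradient, push it."""
    w = server.pull(keys)                       # pull current parameter values
    pred = X[:, keys] @ w                       # predictions on this worker's shard
    grad = X[:, keys].T @ (pred - y) / len(y)   # least-squares gradient (example loss)
    server.push(keys, grad)                     # push the update back to the servers

# Example: two "workers", each holding half of a random dataset.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 10)), rng.normal(size=100)
ps = ToyParameterServer(num_keys=10)
for shard in (slice(0, 50), slice(50, 100)):
    worker_iteration(ps, X[shard], y[shard], keys=np.arange(10))
print(ps.w)
```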

Among the discussions of the architecture design, the topic I like most is how servers and workers are managed, and why different management methods are used for each.

The system performance seems very promising, with fast training convergence, good CPU efficiency, and reasonably good scalability.

Overall, I enjoyed reading this paper; besides the reasons stated above, it is also very easy to read, like most other industry-implementation papers.


Review 2

The machine learning field has been growing rapidly in scale in recent years. A single machine cannot solve these problems sufficiently fast because of the growth of data and model complexity, so distributed optimization is necessary for large-scale machine learning. The paper proposes a parameter server framework for distributed machine learning problems. It distributes both data and workloads over worker nodes and maintains the globally shared parameters as vectors and matrices on the server nodes. The framework also handles asynchronous data communication between nodes and supports flexible consistency models, elastic scalability, and continuous fault tolerance.

Some of the strengths and contributions of this paper are:
1. The asynchronous communication model does not block computation, which makes communication more efficient for distributed machine learning tasks.
2. Consistency constraints are relaxed and made flexible, so the user can balance algorithmic convergence rate and system efficiency through per-algorithm settings.
3. Globally shared parameters are represented as vectors and matrices to facilitate the development of machine learning applications, which makes the framework easier to use.

Some of the drawbacks of this paper are:
1. Flexible consistency may lead to data inconsistency between nodes.
2. The evaluation only focuses on different use cases and does not compare against other frameworks.



Review 3

Large-scale machine learning problems require distributed optimization and inference, because no single machine can solve them sufficiently rapidly, given the growth of data and the resulting model complexity. This paper therefore proposes a parameter server framework for distributed machine learning, where both data and workloads are distributed over worker nodes, while server nodes maintain the globally shared parameters, represented as dense or sparse vectors and matrices.

In this framework, a parameter server instance can run more than one algorithm simultaneously. Parameter server nodes are grouped into a server group and several worker groups. Server nodes communicate with each other to replicate and/or to migrate parameters for reliability and scaling. Each worker group runs an application. A worker typically stores locally a portion of the training data to compute local statistics such as gradients. Worker nodes only communicate with server nodes, pushing and pulling parameters. A scheduler node is used for each worker group. It assigns tasks to workers and monitors their progress.

The main contributions of this proposed framework are as follows:
1. All communication is asynchronous unless requested otherwise, and it is optimized for machine learning tasks to reduce network traffic and overhead.
2. It provides a flexible consistency model that can allow some level of relaxation to better balance convergence rate and system efficiency.
3. It provides elastic scalability in which new nodes can be added without restarting the running framework.
4. It provides fault tolerance and durability.
5. Globally shared parameters are represented as vectors and matrices to facilitate machine learning applications. By treating the parameters as sparse linear algebra objects, the parameter server provides the same functionality as a (key, value) abstraction but also admits important optimized operations such as vector addition, multiplication, 2-norm, and other more sophisticated operations (a small sketch of this view follows the list).
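To make the sparse linear algebra view in item 5 concrete, here is a small sketch of my own (not the paper's code) treating sorted (key, value) pairs as a sparse vector, where absent keys are implicitly zero:

```python
import math

# Toy "sparse vector" view of (key, value) pairs: keys not present are treated
# as zero, so linear algebra operations work directly on the pairs.

def vec_add(a, b):
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0.0) + v
    return out

def vec_dot(a, b):
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    return sum(v * large.get(k, 0.0) for k, v in small.items())

def vec_norm2(a):
    return math.sqrt(sum(v * v for v in a.values()))

w = {3: 0.5, 17: -1.2, 42: 3.0}   # (key, value) pairs; every other key is 0
g = {17: 0.1, 99: -0.3}
print(vec_add(w, g), vec_dot(w, g), vec_norm2(w))
```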


The main advantages of this paper are as follows:
1. By factoring out commonly required components of machine learning systems, it enables application-specific code to remain concise.
2. As a shared platform to target for systems-level optimizations, it provides a robust, versatile, and high-performance implementation capable of handling a diverse array of algorithms from sparse logistic regression to topic models and distributed sketching.

However, the proposed framework has a limitation: while the paper provides a way to scale machine learning computations, it does not provide a programming model flexible enough for building arbitrary machine learning applications.


Review 4

Problem & Motivation
Machine learning applications require a lot of data to train, ranging between 1 TB and 1 PB, and need to train at least 10^9 parameters. Clearly, it is impossible to carry out such a task on a single machine, so a distributed storage and computing system is needed. In particular, it should satisfy the following properties: efficient communication, flexible consistency models, elastic scalability, fault tolerance and durability, and ease of use. Given the specific patterns of machine learning workloads, many optimizations can be made.

Contributions:
The paper contributes many ideas for optimizations based on the specific patterns of machine learning workloads:
- The first pattern is that each worker does not need to know every parameter or all of the data to carry out a round of iteration.
- Many tasks are independent, because in each round only a fraction of the parameters may be updated.
- Machines can go down and new machines can join; such failures are common for large ML tasks.

The whole system builds on distributed versions of the original algorithms: basically, a task scheduler assigns tasks to workers, and the updates are aggregated round by round. The system has three types of nodes: server nodes, worker nodes, and a scheduler node. The server nodes store the shared parameters, the worker nodes perform the computation on the data, and the scheduler node assigns tasks to the workers. All key-value pairs are partitioned across the servers, and within a server the keys are sorted, which enables further optimizations. The scheduler can decide whether a task runs asynchronously depending on whether it is independent of the previous one. New server nodes can be added automatically and failures tolerated using the same consistent-hashing technique that Amazon's Dynamo uses, and vector clocks ensure that duplicated messages can be detected and rejected.

Drawbacks:
1. The paper mentions two types of node failures, server node and worker node failures, yet it does not mention fault tolerance for the scheduler node, which I think is the most interesting case.
2. Which optimizations can we use if we have a sparse matrix with all of its keys sorted?



Review 5

Machine learning tasks are steadily becoming more important, yet as the amount of data being collected continues to grow, the requirements of this type of workload continue to increase. In general, it is now the rule that training has to be done in a distributed manner, since that is the only way to feasibly handle data on the order of petabytes. Distributed computing at scale has its own challenges, such as huge network bandwidth requirements, performance penalties associated with synchronization and machine latency, and ensuring fault tolerance. Even so, parameter servers designed to handle this type of task have been a hot topic, and this paper proposes a parameter server architecture that is potentially capable of handling these challenges of performing machine learning at scale. The authors identified five features key to their design:

1. Efficient communication – asynchronous communication to reduce network usage and avoid blocking operations.
2. Flexible consistency models – reduce synchronization costs and latency by relaxing consistency requirements.
3. Elastic scalability – new nodes can be added without restarting the framework.
4. Fault tolerance and durability – recovery from non-catastrophic machine failures within 1 sec, without interrupting calculations.
5. Ease of use – parameters are represented as vectors and matrices to facilitate machine learning development.

What distinguishes this paper from its contemporaries is that it claims to be the first general-purpose machine learning system capable of scaling to industrial-level sizes. The paper identifies two major classes of machine learning algorithms that it aims to support. The first is termed risk minimization, where labeled training data is given and there is an objective function that measures how far the system's predictions are from the desired values; the goal is to iterate until the loss is as small as possible. To handle this, each worker holds a portion of the overall training data and calculates a subgradient, and the servers combine these contributions and push the result back out to the workers. The second class of algorithms is unsupervised, in which labels for the training data are unknown. One example is Latent Dirichlet Allocation (LDA); in general it is similar to supervised algorithms in its training steps, the main difference being that an estimate of how well the data can be explained by the model is computed, rather than a gradient.

In this parameter server framework, the nodes are grouped into a server group and several worker groups, and the latter can share common namespaces in order to increase parallelization. Between nodes, the model is a set of key-value pairs, and data is transferred using push and pull operators. Server nodes also support user-defined functions, and the system uses flexible consistency to realize performance gains through better parallelization.

Each key-value pair is associated with a vector clock that records a timestamp for each node on that pair, which is used for recovery purposes. Messages passed between nodes can be compressed to save network bandwidth. Keys are partitioned similarly to a distributed hash table, and each server node stores a replica of the k counterclockwise neighbor key ranges relative to the one it is in charge of. Finally, both server and worker nodes can be added and removed as needed to achieve fault tolerance and dynamic scaling.
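The key partitioning and neighbor replication described above can be sketched with a toy consistent-hashing ring. This is my own illustration (the hash function, class name, and one-point-per-server simplification are assumptions, not the paper's implementation):

```python
import bisect
import hashlib

def h(x):
    # Map a server name or key onto a 32-bit ring position.
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, servers):
        # Each server owns the key range ending at its hash position.
        self.points = sorted((h(s), s) for s in servers)

    def owner(self, key):
        # The owner is the first server clockwise from the key's hash position.
        i = bisect.bisect(self.points, (h(key), "")) % len(self.points)
        return self.points[i][1]

    def backed_up_by(self, server, k=1):
        # A server keeps replicas of the key ranges owned by its k
        # counterclockwise neighbours on the ring.
        i = next(j for j, (_, s) in enumerate(self.points) if s == server)
        return [self.points[(i - j) % len(self.points)][1] for j in range(1, k + 1)]

ring = Ring(["server_a", "server_b", "server_c", "server_d"])
key = "feature_12345"
print("owner:", ring.owner(key))
print("server_a backs up the ranges of:", ring.backed_up_by("server_a", k=2))
```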

The main strength of this paper is that it introduces a new general framework for handling arbitrary machine learning workloads in a distributed fashion that is capable of easily scaling to industry-level requirements. As the performance results show, the parameter server is able to eliminate most of the time spent waiting that other comparable systems have, such that the vast majority of the time is spent actually performing computations. Additionally, for the Latent Dirichlet Allocation task, the various optimizations greatly decreased the amount of network traffic, on the order of 2 to 40x improvements.

One of the weaknesses of this paper is that it is yet another system proposed to handle distributed machine learning problems, which raises the difficulty of achieving widespread adoption in the technical community; history is littered with examples of good ideas that failed to gain traction and ultimately fell into obscurity. Additionally, it would be interesting to examine whether machine learning algorithms are truly generic enough for this system to handle nearly all types without major changes, as it was originally designed to do.


Review 6

The purpose of this paper was to present a parameter server for distributed machine learning systems. The challenge that the paper addresses is building a system that can distribute machine learning data and computations over multiple worker nodes using a shared parameter server. While the computational resources of a distributed system are an advantage, the network bandwidth of sharing data as well as writing an algorithm to manage resources are challenges that come with it. The paper proposes an open source implementation of a parameter server that allows for concise application specific code but still supports a wide range of applications. The author’s contribution provides the following five features:
1) Asynchronous communication across the distributed system does not block computation unless it is supposed to.
2) The developer can tune the efficiency of the system which may come at the cost of consistency, or more specifically in the author’s words, they may “balance algorithmic convergence rate and system efficiency”.
3) New nodes can be added without restarting the system.
4) Able to recover from faults within 1 second without interrupting computation, using vector clocks.
5) The parameters on the servers are linear-algebra-friendly, as they come in the form of vectors and matrices and have multi-threaded libraries for efficient computation.

Two engineering challenges that the authors faced were communication and fault tolerance. Communication in this context mainly deals with keeping the parameters on the servers consistent as different tasks update only a part of the data, such as a single row of a matrix or a small part of a vector. Fault tolerance, as mentioned above, means that a full system restart is not needed in the face of a minor failure. The paper then breaks down the anatomy of a machine learning problem into feature extraction, the objective function, and learning. It mentions how training datasets can get quite large; specifically, a large internet company had a trillion data points in its ad impression log. Next, the authors discuss risk minimization, which is the optimization of a loss term and a regularizer term, indicating the prediction error and a penalty on model complexity, respectively. They then discuss generative models, where a system must attempt to find the underlying data model for a problem, and how the data would be distributed among nodes in an example problem called topic modeling.
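For reference, the regularized risk minimization objective described here can be written as in the paper, with loss \ell, regularizer \Omega, training examples (x_i, y_i), and model parameters w:

```latex
\min_{w} \; F(w) = \sum_{i=1}^{n} \ell(x_i, y_i, w) + \Omega(w)
```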

The architecture of the authors' system is a group of parameter server nodes called the server group and multiple other groups called worker groups. The linear algebra optimizations include parallelized computation on (key, value) vectors as well as distributed computation in the case of range push and pull. The system supports user-defined functions, communicates across nodes asynchronously, and manages dependencies. As for the implementation, consistency is maintained using vector clocks, which keep track of when updates were made to ranges of (key, value) vector indices. This data can be compressed, since whole ranges of entries keep the same timestamp. Messages can also be compressed, for example when they are sent to a node group such as the server group. Data is replicated synchronously, so a push is not complete until it has been copied to the necessary nodes.
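To make the range-based timestamp idea above concrete, here is a toy sketch of my own (tracking one node's timestamps only, not the paper's data structure): a single timestamp covers a whole key range, and ranges are split lazily only when an update touches part of one.

```python
# Toy range-based clock entry for a single node: instead of one timestamp per
# key, keep one timestamp per contiguous key range and split ranges only when
# an update covers part of a range (illustration only).

class RangeClock:
    def __init__(self, key_begin, key_end):
        # List of (begin, end, timestamp), covering [key_begin, key_end).
        self.ranges = [(key_begin, key_end, 0)]

    def update(self, begin, end, t):
        new_ranges = []
        for b, e, ts in self.ranges:
            if e <= begin or b >= end:           # no overlap: keep as-is
                new_ranges.append((b, e, ts))
                continue
            if b < begin:                         # left remainder keeps old timestamp
                new_ranges.append((b, begin, ts))
            new_ranges.append((max(b, begin), min(e, end), t))  # overlapped part
            if e > end:                           # right remainder keeps old timestamp
                new_ranges.append((end, e, ts))
        self.ranges = sorted(new_ranges)

clock = RangeClock(0, 1000)
clock.update(100, 200, t=7)   # only one range is split; the rest share timestamps
print(clock.ranges)           # [(0, 100, 0), (100, 200, 7), (200, 1000, 0)]
```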
I liked how this paper did not assume the reader knew too much about machine learning. It gave enough context to appreciate what part of a distributed machine learning system the authors were improving, without giving too many unnecessary details on machine learning topics. I did not like, however, how the authors placed Figure 3 seemingly out of place so early in the paper rather than toward the end with the rest of the analysis.



Review 7


In this paper, the authors mainly discuss their work on the parameter server. In the fields of machine learning and deep learning, distributed optimization has become a prerequisite, because a single machine can no longer keep up with the fast-growing volume of data and number of parameters.

In reality, the amount of training data may range between 1 TB and 1 PB, and the number of parameters may reach 10^9 to 10^{12}. These parameters often need to be accessed frequently by all the worker nodes, which brings many problems and challenges.

In the parameter server, each server node is responsible for only a portion of the parameters (together, the servers maintain the globally shared parameters), and each worker is assigned only a portion of the data and processing tasks. Server nodes can communicate with each other: each server is responsible for its own parameters, while the server group as a whole maintains the updates to all parameters.

The server manager node is responsible for maintaining the consistency of some metadata, such as the status of each node, the allocation of parameters, etc. There is no communication between the worker nodes, only the server corresponding to them communicates. Each worker group has a task scheduler that assigns tasks to workers and monitors the operation of workers. When a new worker joins or quits, the task scheduler is responsible for reassigning the task.

What are the new features of the parameter server? Efficient communication: there is no need to stop and wait for slow machines at every iteration, which greatly reduces delay, and optimizations for machine learning tasks greatly reduce network traffic and overhead. Flexible consistency models further reduce the cost and latency of synchronization; the parameter server allows the algorithm designer to trade off algorithm convergence speed against system performance based on their own situation.
A distributed hash table is used so that new server nodes can be dynamically inserted into the collection at any time; therefore, adding a new node does not require restarting the system.
Globally shared parameters can be represented in various forms (vectors, matrices, or the corresponding sparse types), which greatly facilitates the development of machine learning algorithms, and the linear algebra data types provided come with high-performance multi-threaded libraries.

The contribution of this paper is mainly the proposal of a third-generation parameter server with the many features mentioned above. The platform can actually solve real industry-scale problems, which is good.

One thing I like about this paper is that it uses a lot of figures to illustrate its ideas.



Review 8

The parameter server solves an important piece of the distributed machine learning puzzle. The framework includes worker nodes and server nodes; the worker nodes do the work on the data, while the server nodes maintain globally shared parameters. These are things like weights, which are updated by machine learning algorithms. This is the third generation of such parameter server systems.

Previous approaches had used a key-value model as an abstraction, where entities could be read or written using a key. This newer version takes advantage of the fact that machine learning algorithms typically use linear algebra, and treats the parameters as a vector, assuming that keys are ordered. Workers can push data (for instance, a local gradient update) to servers and then request data back using a pull operation. The parameters are partitioned using consistent hashing.

A key idea in this paper that I think is explained well is to consider how machine learning algorithms can be optimized for distributed systems. Many machine learning algorithms can tolerate a certain amount of inconsistency - they might just take a little bit longer to converge. The paper describes three consistency mechanisms (sequential, eventual, bounded delay), which the user can choose between depending on the requirements of their algorithm. Similarly, the user has the opportunity to control recovery, depending on how they want to balance processing all of the data vs. time. I think that it is very useful that these optimizations are left up to the users.

One concern I had was in the experimental setup, which mentioned that the cluster used was in use by unrelated tasks while the experiments were running. I always find this problematic, as it makes the experiments seem difficult to reproduce. If this is being done to replicate what real-life usage would look like, it seems like some explanation of the CPU usage by the other tasks, etc., could be mentioned.



Review 9

This paper introduces Parameter Server, a framework for distributed machine learning problems. The motivation is that no single machine can solve large-scale machine learning problems sufficiently rapidly: the quantity of training data can range from 1 TB to 1 PB, and a complex model may contain 10^9 to 10^12 parameters. To solve this kind of problem, we need a framework that distributes the job across multiple servers and handles issues like parameter access, fault tolerance, etc.

The proposed parameter server provides a robust, versatile, and high-performance solution. Its architecture is as follows. A parameter server instance contains two kinds of nodes: parameter servers and workers. All the parameter servers form a server group, while the workers form multiple worker groups. Each server inside the server group is responsible for maintaining a partition of the globally shared parameters. There is also one special server manager inside the server group; it maintains a consistent view of the metadata, for example node liveness and parameter assignment. In each worker group, there is a special task scheduler node that is responsible for assigning tasks to workers and monitoring their progress. Each worker node executes the given tasks and communicates with the server nodes to push or pull parameters (updates). Of course, it also needs access to the training data to perform the computation. Notice that there is no communication among the workers.

Based on this architecture, there are several refinements which are crucial for achieving high performance. The first one is range push and pull of parameters. Since the model is expressed as a set of key-value pairs, it will be wasteful to only push or pull one parameter at a time. Range push and pull allows multiple parameters to be sent over the network at a time and greatly reduces the overhead. Another important refinement is the fact that tasks can be executed asynchronously. In this way, multiple iterations of an algorithm can be executed concurrently, and therefore reduces execution time.
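To illustrate the range push/pull refinement, here is a rough sketch with a hypothetical API (the dict-based "server" and function names below are my own; the real interface batches a contiguous range of sorted keys into a single message):

```python
# Hypothetical illustration of range push/pull: instead of one request per key,
# a worker sends a single message covering a contiguous range of (sorted) keys.

def range_pull(server, key_begin, key_end):
    """Return all (key, value) pairs with key in [key_begin, key_end)."""
    return {k: v for k, v in server.items() if key_begin <= k < key_end}

def range_push(server, updates, key_begin, key_end):
    """Apply a batch of updates restricted to [key_begin, key_end)."""
    for k, delta in updates.items():
        if key_begin <= k < key_end:
            server[k] = server.get(k, 0.0) + delta

# Example: the "server" is just a dict here.
server = {3: 0.5, 17: -1.2, 42: 3.0}
weights = range_pull(server, 0, 50)              # one message for the whole range
range_push(server, {17: 0.1, 42: -0.2}, 0, 50)   # one message for the batched update
print(weights, server)
```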

In terms of the implementation, there are many familiar ideas. For example, consistent hashing is used to partition parameters among servers, and vector clocks are used for fast recovery in case of system failure. There are also some newer ideas, such as chain replication, which is used to tolerate server node failures.
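As a toy sketch of chain replication combined with aggregation (my own simplification, not the paper's implementation), a master server buffers pushes from workers, aggregates them, applies the combined update, and forwards it once to its replica before acknowledging:

```python
# Toy chain replication with aggregation: the master aggregates pushes from
# several workers, applies the combined update once, and forwards it to its
# replica before acknowledging (illustration only).

class ReplicaServer:
    def __init__(self):
        self.params = {}

    def apply(self, update):
        for k, v in update.items():
            self.params[k] = self.params.get(k, 0.0) + v
        return "ack"

class MasterServer:
    def __init__(self, replica):
        self.params = {}
        self.replica = replica
        self.pending = []                     # updates waiting to be aggregated

    def push(self, update):
        self.pending.append(update)
        return "buffered"

    def aggregate_and_replicate(self):
        combined = {}
        for update in self.pending:           # aggregate first ...
            for k, v in update.items():
                combined[k] = combined.get(k, 0.0) + v
        self.pending.clear()
        for k, v in combined.items():
            self.params[k] = self.params.get(k, 0.0) + v
        return self.replica.apply(combined)   # ... then replicate once

master = MasterServer(ReplicaServer())
master.push({1: 0.3, 2: -0.1})                # worker A's gradient
master.push({1: 0.2})                         # worker B's gradient
print(master.aggregate_and_replicate(), master.params)
```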

I think it would be better to include a neural network experiment in the evaluation, so that the performance of this framework could be compared with other machine learning frameworks; currently, I cannot find such a comparison.


Review 10

In the paper "Scaling Distributed Machine Learning with the Parameter Server", Mu Li and Co. discuss and propose a new parameter server framework for distributed machine learning problems. As time passes and the size of data grows, machine learning algorithms are becoming increasingly complex - as a consequence, the number of parameters are increasing rapidly. Thus, it order to solve these problems in a reasonable amount of time, distributed optimization and inference is becoming a prerequisite for these large scale machine learning algorithms. To get a better understanding of industry level workloads, realistic quantities can range from TBs to PBs of data and trillions to quadrillions of parameters. Furthermore, most ML algorithms frequently access shared parameters to fine tune them (a clear bottleneck). As a result, this paper attempts to tackle three challenges:
1) Parameter access requires a lot of network bandwidth
2) Many ML algorithms are sequential, so synchronization and machine latency impose high overhead.
3) Fault tolerance is very critical - cloud environments aren't the most trustworthy place to run ML algorithms.
Keeping these points in mind, Mu Li and others discuss this new framework in high-level detail to target the pain points of current ML developers. Using worker nodes to distribute data and workloads and server nodes to maintain globally shared parameters, this framework allows for asynchronous data communication between nodes, flexible consistency models, elastic scalability, fault tolerance, and ease of use. Thus, it seems clear that this can serve as a great asset to the machine learning community, and it is an interesting and important exploration in this field.

This paper is split into several sections that methodically explain the core contributions:
1) Machine Learning
a) Goals: Objective function captures properties of the problem and we want to minimize it. Multiple iterations are made to refine the model; it stops when it has reached a near-optimal solution.
b) Risk Minimization: Risk, or prediction error, is something we want to keep to a minimum. We want to avoid "overfitting" (memorizing the training answers) and "underfitting" (a model too simple to capture the data). In general we want low error and low complexity.
c) Generative Models: Models that fall under unsupervised learning end up doing a lot of parameter sharing. Since the scale of these models is so huge, it is imperative to parallelize the algorithms across large clusters.

2) Architecture
a) Key-Value Vectors: Key-value pairs provide an abstraction that fits many different workloads. Specifically, these pairs are treated as sparse linear algebra objects, which allows optimized, more sophisticated operations.
b) Push/Pulls: Each worker pushes its entire local gradient into the servers, and then pulls the updated weight back. In the case of a more advanced algorithm, a range of keys is communicated each time instead.
c) Asynchronous Tasks: Tasks are executed asynchronously. This means that the caller can perform other computations immediately after issuing a task.
d) Flexible Consistency: Three different models can be implemented via task dependencies: sequential consistency, eventual consistency, and bounded delay. These models cover a variety of use cases and do not force the user to adopt a particular type of dependency that might impose consequences for a particular problem.

3) Implementation
a) Vector Clocks: They use a specialized version of vector clocks that reduces space overhead. This is because many parameters share the same timestamps so fewer things need to be stored.
b) Messages: Messages are sent between nodes as lists of key-value pairs. Since ML algorithms require high bandwidth, there must be caching and compression.
c) Replication: Replication only occurs after aggregation to reduce bandwidth. Thus, backup copies of the state exist after the key steps and can be used for recovery in the case of failure.
d) Server/Worker Management: Much like other Google architectures we have seen in the past, master and worker nodes communicate with each other to distribute the workload and to identify and handle failures.

Much like other papers, this paper also has some drawbacks. Before discussing them, I would like to give some appreciation for the section that describes current machine learning goals and methods. It made the rest of the paper easier to understand, unlike the TensorFlow paper, which gave no background and was hard to follow. The first drawback I noticed was near the end of the paper, where the authors describe this design as a third-generation parameter server but never do a performance analysis against the previous two generations. Furthermore, they were quite vague when they said that the second generation fails to factor out common difficulties between different types of problems. Another drawback concerns the discussion of consistent hashing: unlike Amazon, it seems they did not use virtual nodes to help with load balancing between machines of different speeds.


Review 11

This paper proposes a method of implementing machine learning algorithms on a highly distributed system. Distributed machine learning systems have a few issues they need to contend with:

They need fault tolerance.
Accessing shared parameters can be a bottleneck.
Many machine learning algorithms are sequential, and it’s hard to synchronize a distributed system.

The system that the paper proposes has these properties:

Efficient asynchronous communication.
Relaxed consistency requirements.
Scaling up without restarting the system.
Fault tolerant, and recovering within 1 second.
Easy to use.

The parameter bottleneck is solved by dividing the parameters among several parameter servers, each of which manages only a subset of all of the parameters. The parameters are divided among servers through consistent hashing. Since each worker usually only needs a few parameters at any given time, it does not have to access all of the servers.

The parameters that the servers store are key-value pairs. However, the system often wants to perform linear algebra operations on these, so they can be stored as linear algebra objects, and then treated as key-value pairs when necessary. Any missing keys in the parameter list can be treated as 0s. If the keys of parameters are sorted, this makes storage easier.

Worker nodes that actually do computation can request parameters from a server. Workers don’t request single parameters, but rather a range of keys, for which all the values are returned. Key-value pairs are pushed and pulled between workers and parameter servers.

The worker tasks are asynchronous by default, but the user can define one of several consistency levels. Sequential consistency is slower, but stronger, if necessary. Eventual consistency can be used if immediate consistency isn’t necessary, which may make the algorithm converge slower, but it will make individual steps faster. In the middle of these two, a user can use a bounded delay model, where a new task can’t start until tasks started a time r ago have finished. Vector clocks can be used to resolve inconsistencies if they exist.
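A minimal sketch of the bounded-delay condition (my own illustration; tau plays the role of the delay the review calls r): iteration t may only start once every iteration up to t - tau has finished. With tau = 0 this degenerates to sequential execution, and with tau unbounded it becomes eventual consistency.

```python
# Toy illustration of bounded-delay consistency.

def can_start(t, finished, tau):
    """finished: set of completed iteration numbers."""
    return all(i in finished for i in range(0, t - tau))

finished = {0, 1, 2}                   # iterations completed so far
print(can_start(4, finished, tau=1))   # True: only needs 0..2 to be finished
print(can_start(6, finished, tau=1))   # False: iterations 3 and 4 are not finished yet
```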

New parameter servers and workers can be added at any time. If a parameter server is added, it gets a key range of parameters that it will manage, taking them from other servers if necessary. If a worker is added, it just needs to be assigned the parameters it needs for its tasks, and then it can ask the servers for them.
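Here is a rough, hypothetical sketch of the key-range handoff when a new server is added, simplified to a single contiguous split of one server's range (the real system uses consistent hashing and also replicates the moved range):

```python
# Simplified sketch of adding a new server node: the server manager splits an
# existing server's key range and hands the corresponding parameters over.

def split_range(assignments, params, old_server, new_server, split_key):
    """assignments: {server: (begin, end)}; params: {server: {key: value}}."""
    begin, end = assignments[old_server]
    assert begin < split_key < end
    assignments[old_server] = (begin, split_key)
    assignments[new_server] = (split_key, end)
    moved = {k: v for k, v in params[old_server].items() if k >= split_key}
    for k in moved:
        del params[old_server][k]
    params[new_server] = moved
    return assignments, params

assignments = {"s1": (0, 1000)}
params = {"s1": {10: 0.1, 600: 0.6, 900: 0.9}}
print(split_range(assignments, params, "s1", "s2", split_key=500))
```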

The advantage of this system is that it allows machine learning algorithms to be implemented efficiently at massive scale, running effectively on a distributed system. The adjustable consistency levels allow the user to tune the algorithm's performance to their needs.

However, there is not much comparison between this system and others. It is presented as an improvement over standard machine learning systems, so the paper should show exactly in which ways it improves, and by how much. Most of the experiments only show the performance of this one system in different environments.



Review 12

The paper presents Parameter Server, a distributed machine learning framework. In this framework, globally shared parameters can be used as local sparse vectors or matrices to perform linear algebra operations with local training data. All communication is asynchronous, and flexible consistency models are supported to balance the trade-off between system efficiency and fast algorithm convergence. Furthermore, it provides elastic scalability and fault tolerance, aiming for stable long-term deployment.

Contributions:
(1) The parameter-server approach is one of the most promising approaches to scaling machine learning to industrial-scale data, and I believe this paper provides a significant contribution by addressing practical concerns that only arise at such a scale and are not yet widely known in the machine learning community.
(2) Relaxed consistency further hides synchronization cost and latency. The paper gives freedom to the algorithm designer to balance algorithmic convergence rate and system efficiency. The best trade-off depends on data, algorithm, and hardware.

Weak points:
(1) An unfortunate drawback of the paper's writing style is that the generality of the proposed framework gets a bit lost. It would be more desirable if the authors could describe the class of machine learning algorithms that the third-generation parameter server supports well; an abstract mathematical model for such algorithms, if possible, would be even better.




Review 13

This paper proposes a parameter server framework for distributed machine learning problems. Both data and workloads are distributed over worker nodes, while the server nodes maintain globally shared parameters, represented as dense or sparse vectors and matrices. The framework manages asynchronous data communication between nodes, and supports flexible consistency models, elastic scalability, and continuous fault tolerance. The main contributions of this system are: efficient communication, flexible consistency models, elastic scalability, fault tolerance and durability, and ease of use. The globally shared parameters are represented as vectors and matrices to facilitate the development of machine learning applications.

The goal of the system is to minimize the objective function in order to obtain the model. The objective function captures the properties of the learned model, such as low error in the case of classifying e-mails into ham and spam. Risk minimization learns a model that can predict the value y of a future example x; regularized risk minimization is a method to find a model that balances model complexity and training error. The framework should also support generative models, since for a major class of machine learning algorithms the labels of the training examples are unknown.

The model shared among nodes can be represented as a set of (key, value) pairs. Data is sent between nodes using push and pull operations: each worker pushes its entire local gradient to the servers and then pulls the updated weights back. The parameter server optimizes these updates for programmer convenience as well as computational and network-bandwidth efficiency by supporting range-based push and pull. Beyond aggregating data from workers, server nodes can execute user-defined functions. Tasks are executed asynchronously: the caller can perform further computation immediately after issuing a task, and callees execute tasks in parallel for best performance. A caller that wishes to serialize task execution can place an execute-after-finished dependency between tasks. There are three modes of consistency: sequential, eventual, and bounded delay. In addition to scheduler-based flow control, the parameter server supports user-defined filters to selectively synchronize (key, value) pairs, allowing fine-grained control of data consistency within a task.
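As an illustration of the user-defined filter idea, here is a toy version of a "significantly modified" filter of the kind the paper describes (my own sketch, not the paper's code): only entries whose value has changed by more than a threshold since the last synchronization are sent, trading a little consistency for much less network traffic.

```python
# Toy "significantly modified" filter: send only the (key, value) entries whose
# change since the last synchronization exceeds a threshold (sketch only).

def significantly_modified(current, last_synced, threshold):
    """Return the subset of entries worth sending, and record them as synced."""
    to_send = {}
    for k, v in current.items():
        if abs(v - last_synced.get(k, 0.0)) > threshold:
            to_send[k] = v
            last_synced[k] = v
    return to_send

current = {1: 0.501, 2: -3.0, 3: 0.0}
last_synced = {1: 0.5, 2: -1.0}
print(significantly_modified(current, last_synced, threshold=0.1))  # {2: -3.0}
```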
The paper then presents the implementation details of the parameter server framework and an evaluation of the system using three different machine learning problems: sparse logistic regression, latent Dirichlet allocation, and sketches.
The advantage of this paper is that the evaluation is based on three different machine learning problems, which is very thorough. However, the background knowledge, engineering challenges, and related work parts are too long and should not be the focus of this paper.


Review 14

This paper presents a new parameter server framework for distributed machine learning. In a parameter server framework, the iteratively optimized machine learning parameters are shared and stored on server nodes, while worker nodes do the computation. The new features and improvements of this particular framework are efficient communication, flexible consistency models, elastic scalability, fault tolerance and durability, and ease of use. The architecture consists of a server group (with server nodes that communicate among themselves and each hold a partition of the parameters) and worker groups (each group running an application, with worker nodes doing computations on training data). Note that worker nodes only communicate directly with the server nodes and not with each other. Model data (e.g., a feature ID and its weight) are represented as (key, value) vectors and stored as sparse linear algebra objects for improved performance. The authors evaluated the system on a couple of machine learning problems, sparse logistic regression and latent Dirichlet allocation, and also on sketching algorithms, a very different kind of workload.

For a non-expert in machine learning, I think it is very nice that the paper has a section giving an overview of machine learning. This makes sense, since the paper is in the proceedings of a systems (and not AI/machine-learning) conference.


Review 15

This paper proposes a parameter server framework for distributed machine learning problems. The need for a distributed machine learning system comes from the fact that the quantity of training data can be very large, from 1 TB up to 1 PB, and a complex model can contain 10^9 to 10^12 parameters.

The paper gives the design of the system and shows empirical results on real-data experiments. The main contribution of the paper is to factor out commonly used components of machine learning and enable application-specific code to remain concise. Since production environments over clusters and distributed systems are complex, the parameter server considers five key features: efficient communication, flexible consistency models, elastic scalability, fault tolerance and durability, and ease of use.

The strong part of the paper, to me, is that it considers what a distributed machine learning system requires. Server nodes maintain the globally shared parameters, which are replicated across nodes to improve reliability, and worker nodes are each assigned a portion of the training data to compute local statistics such as gradients. The novel part is that the shared parameters can be represented as a set of (key, value) pairs, and range push and pull is introduced to make more efficient use of network bandwidth.

The weak part of the paper, to me, is that the system is somewhat similar to previous distributed systems such as the Google File System and Amazon Dynamo. I think the real novelty of this paper is relatively small; the ideas of vector clocks and replication are common in distributed systems. I think this is a common shortcoming of systems papers.



Review 16

In this paper, the authors propose a novel parameter server framework for distributed machine learning. The problem they are trying to solve is to design and implement a framework for very large-scale machine learning workloads that features high flexibility, consistency, scalability, and fault tolerance. This is definitely an important problem: machine learning is very popular and widely applied to different areas, but at scale no single machine can solve these problems sufficiently rapidly, due to the growth of data and the resulting model complexity, so a good approach is to use a distributed system to break the problem down. However, distributing machine learning workloads is not easy, given the intensive computation and the volume of data communication required. It is worthwhile to explore this field and propose a new scalable framework, which is why the authors propose this one. Next, I will summarize the paper with my understanding.

Distributed large-scale machine learning faces many challenges, including large network bandwidth requirements, the high cost of synchronization and machine latency, and fault tolerance requirements. The goal of this paper is to solve these problems by introducing a distributed system supporting the efficient execution of machine learning algorithms used daily on "trillions of trillion-length feature vectors", possibly even in real time. The authors outline three key components of machine learning workloads: feature extraction, the objective function, and learning. The system architecture has two main kinds of roles: server nodes and worker nodes. Parameter server nodes are grouped into a server group, and worker nodes are grouped into worker groups. Each parameter server is responsible for only part of the parameters, and each worker is assigned only part of the data and tasks. Server nodes can communicate with each other; each server is responsible for its own parameters, and the server group maintains the parameter updates together. There is no communication between workers: each worker communicates only with the servers, and a task scheduler in each worker group assigns its tasks.

The model shared among the nodes is represented as a set of (key, value) pairs, which makes things easier. Communication between workers and servers uses push and pull, and the parameter server also allows range push and range pull. Asynchronous tasks are used to improve the performance of the system; however, asynchrony may degrade the convergence speed. To address this, the server provides flexible consistency models, including sequential, eventual, and bounded delay. The servers store the parameters using consistent hashing, and for fault tolerance entries are replicated using chain replication. Unlike prior key-value systems, the parameter server is optimized for range-based communication, with compression on both the data and the range-based vector clocks. For the evaluation, they test the system on sparse logistic regression and latent Dirichlet allocation, and they also test a stream processing workload (sketching) to illustrate generality; the results show that the framework is promising.

I think the main contribution of this paper is the introduction of the parameter server framework, which is a pioneering effort in solving large-scale machine learning problems in a distributed fashion. The system design decisions are guided by real-world workloads, which makes them more practical. The framework factors out commonly required components of machine learning systems, which enables application-specific code to remain concise, and it serves as a shared platform to target for systems-level optimizations. There are several advantages to the design. First, the communication model is optimized for machine learning tasks to reduce network traffic and overhead, and it never blocks computation, which makes the system more efficient. To reduce synchronization cost and latency, the framework introduces a flexible consistency model that is configurable to trade off convergence rate against efficiency. The framework is also scalable: one can easily scale out the system by adding more nodes. Moreover, it is fault tolerant, durable, and easy to extend to different algorithms. Last but not least, it can handle different types of learning problems, including batch and streaming.

Generally speaking, it is a nice paper with great insights, and its downsides are minor. I think this work looks like a milestone for distributed machine learning frameworks, and its impact is huge. Personally, I think the framework is a little complicated and requires a great amount of hardware to achieve its high performance, so it can only be used by large companies or institutions; a better approach would be to provide some higher-level abstractions to make programming easier. Besides, I would like to see more examples of how the parameter server can support different kinds of algorithms. Finally, they do not discuss any limitations of their model in this paper; I wonder whether there are scenarios in which the parameter server does not work well. MXNet grew out of the parameter server work, yet it is not as popular as TensorFlow.




Review 17

This paper introduces an open-source parameter server framework for machine learning workloads, developed by researchers at CMU, Baidu, and Google. It is a distributed system consisting of a server group (made up of server nodes) and worker groups (made up of worker nodes). The server nodes contain the global state (parameters, etc.), and the worker nodes do the actual computation on training data for a given application. The framework aims to capitalize on three observations about distributed machine learning workflows:

1) Accessing the parameters of a given model or algorithm requires large bandwidth. This is partially due to the massive number of parameters in some models and partially due to the distributed nature of the system. The parameter server framework alleviates this by only allowing worker nodes to communicate with server nodes, and not with each other; this is why only server nodes store global state. Server nodes can also keep parameters in separate namespaces, for cases where worker groups' state should be isolated.

2) The sequential nature of ML algorithms makes synchronization painful in terms of performance. Google’s PS framework addresses this by making data communication between nodes asynchronous, and its flexible consistency model makes this even less painful.

3) Fault tolerance is crucial when the scale is high, and failures should be assumed to be frequent and tolerable without major performance/safety downfalls. Google’s PS framework uses vector clocks in its recovery scheme which help make recovery accurate even in the presence of network partition/failure.

The weaknesses of this paper / framework are mostly addressed in the TensorFlow paper. For example, the “ease” of development for the developer, which could be considered this paper’s greatest contribution, also hinders flexibility as there are several optimizations / algorithms which cannot be efficiently accomplished in this parameter server framework—most notably, when an algorithm needs to change parameters mid-execution.