Distributed Machine Learning for Big Data and Streaming

by Kuldeep Jiwani, Distinguished Architect — Data Science, Guavus, Inc.

Demand for Machine Learning (ML) and Artificial Intelligence (AI) based analytical solutions has grown rapidly as industries move towards informed decision-making. As data processing technologies mature and new technologies emerge in the Big Data, streaming and IoT space, data is now available in abundance and at high speed. The way analytical systems process data therefore calls for a transformation. Distributed ML, a newer form of ML, can be a good choice for handling large, high-speed data while coping with noisy and dynamic data sources.

Key application areas of Distributed ML:

  • Distributed inference on the edge
    • Lightweight inference over Internet of Things (IoT) devices
    • High speed inference over real-time streams
  • Distributed training over a compute cluster
    • High communication bandwidth GPU farms
    • Low communication bandwidth inter-connected machines

A distributed design strategy uses the best of both worlds: Distributed Computing and ML algorithms. It combines the advanced analytical and mathematical reasoning capabilities of ML with efficient distributed processing across a group of machines, whether in a Big Data cluster, a GPU farm, streaming edge devices or IoT devices. Enterprises should follow a deliberate design thought process when planning to leverage a Distributed ML architecture for their business problems.

Distributed ML design strategy aims to solve a multitude of challenges:

Engineering challenges

  • ML at high speeds
  • ML at high volumes
  • ML at low footprint

Mathematical challenges

  • Synchronicity of distributed update equations
  • Convergence to global/local optimum
  • Distribution of ML models

Although many tools and platforms have been developed for distributed processing and for ML, the technical challenges of an overall solution mean that most of them address only a silo of the problem. The most efficient distributed platforms focus mainly on distributed data processing, while the most advanced ML libraries rarely focus on distribution.

Most ML algorithms iterate multiple times over the training data to find the model that best describes it. If the data is distributed, this requires a large amount of data communication in addition to high compute power. A further challenge is that some ML algorithms cannot be executed in parts. For example, if we randomly partition a data set into two halves and apply a clustering algorithm to each partition, the clusters found on the two partitions can differ substantially from each other. They may even contradict the clusters obtained by processing the entire data set.

Distributed ML is still in its research phase. ML models degrade noticeably when the data is large and not all of it is used for training. Moreover, some businesses require ML models to respond to high-speed data, and accuracy is usually compromised to keep up with such speeds. It is therefore a critical decision to settle on an ML modelling solution that meets the business requirement and can also be trained and deployed effectively in a distributed manner, without compromising the quality and timeliness of the resulting insights.

As the data is large and fast paced, the first step is to move ML closer to the source of the data, instead of bringing data from different sources to a central location and then applying ML. The second step is to find the right mathematical modelling approach, combined with the right architectural design, for effective Distributed ML processing.

As the challenges in Distributed ML are manifold, a single strategy for all cases is unlikely to succeed. We need to balance the capabilities of distributed technology with the needs of the ML algorithm. At times we may need to focus on one aspect more than the other, depending on the business requirement. To find that middle ground, let us first look at the various aspects of designing a Distributed ML system.

Machine Learning at Low Footprint

ML models encapsulate equations or relations that capture the distinguishing characteristics of their training data. A model tries to build decision boundaries that separate the various classes of data. Depending on the nature of the data, the decision boundary can be a simple linear boundary in N-dimensional space, represented by just the N weights of its equation, or it can be highly non-linear. In extreme cases, a model such as a decision tree has to make multiple splits at various depths to capture all the non-linear decision boundaries. This increases the size of the model and makes it unusable on low-memory hardware such as IoT devices.

To handle such scenarios, various mathematical techniques are used to approximate the non-linear decision boundary with multiple linear boundaries, in the same way that we can approximate a circle with an N-sided polygon. As N increases (hexagon, octagon, decagon, and so on), the polygon resembles the circle more closely. The same analogy applies to the granularity with which linear models approximate a non-linear model.
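The polygon analogy can be made concrete with a few lines of arithmetic. The sketch below is only an illustration of the analogy, not of any particular model-compression technique: it computes the area of a regular N-sided polygon inscribed in a unit circle and shows how the gap to the circle's area shrinks as N grows. The function name `inscribed_polygon_area` is a hypothetical helper introduced here.

```python
import math

def inscribed_polygon_area(n_sides: int, radius: float = 1.0) -> float:
    """Area of a regular polygon whose n_sides vertices lie on a circle."""
    # The polygon splits into n_sides isosceles triangles,
    # each with apex angle 2*pi/n_sides at the circle's centre.
    return 0.5 * n_sides * radius ** 2 * math.sin(2 * math.pi / n_sides)

circle_area = math.pi  # area of the unit circle
for n in (6, 8, 10, 100):
    area = inscribed_polygon_area(n)
    gap = (circle_area - area) / circle_area
    print(f"N={n:>3}: polygon area={area:.4f}, relative gap={gap:.2%}")
```

More linear pieces give a closer approximation, at the cost of storing more pieces. The same trade-off governs how finely a non-linear model is approximated on a low-footprint device.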

Machine Learning at High Speeds

There have been many advances in this area. The High-Performance Computing (HPC) community has been researching it actively for decades and has developed basic building blocks for vector and matrix operations in the form of BLAS (Basic Linear Algebra Subprograms), which has existed for more than 40 years. BLAS has been wrapped into higher abstractions such as LAPACK, Breeze (Scala), and so on. Intel has also released MKL (Math Kernel Library), which accelerates linear algebra operations in conjunction with the underlying hardware. The HPC community has worked closely with hardware manufacturers and even developed special hardware instructions that perform complex operations in one go. With the introduction of the FMA (Fused Multiply Add) instruction set for floating point scalars, along with SIMD (Single Instruction Multiple Data) operations, an operation like the one below can be done in a single instruction:

y = a × x + b

VFMADD213SD xmm, xmm, xmm/m64    $0 = $1 × $0 + $2

Over the years, all such specialised computing support has been successfully adopted and integrated into systems by the ML community.

Machine Learning at High Volumes

From an architectural perspective, any viable solution to a large-scale computational challenge falls into one of the following categories:

  • Scaling-Up: Adding more resources to a single machine. For example, adding GPUs has been a common approach.
  • Scaling-Out: Adding more nodes in parallel to a system (horizontal scaling). For example, using a cluster of machines.

Depending on the primary challenge for ML, either of these strategies can be useful. If the challenge is a high volume of compute (where the data fits in memory), Scaling-Up through GPUs is a good strategy and is often used for Deep Learning based models.

On the other hand, if the challenge is processing a high volume of data with reasonable compute needs (a common case in large enterprises), data ingestion can become a serious performance bottleneck, and Scaling-Out becomes the practical option. As each node has a dedicated I/O subsystem, Scaling-Out reduces the impact of I/O on performance by parallelising reads and writes over multiple machines.

The most effective strategy for Distributed ML is to Scale-Out by distributing both the model and data using the underlying I/O and networking subsystems.

The Architecture

Designing a generic Distributed ML system is challenging, as ML algorithms differ fundamentally from each other in multiple ways and each has a distinct communication pattern. This means making the right choices among the architectural options available for designing a Distributed ML system. This article discusses only those ML algorithms that can work in a distributed fashion, especially the gradient based methods, which cover a large share of ML algorithms.

For Distributed ML, let us look at the four fundamental aspects of architecture jointly:

  • Machine Learning
  • Parallelism
  • Topology
  • Communication

Machine Learning

At the heart of every ML algorithm there are two basic phases: training and prediction. During the training phase, a large amount of training data is provided to train an ML model. In the prediction phase, the trained model is deployed to accept new data and generate insights as output. The two phases are not always mutually exclusive: in cases such as online continuous learning, both run simultaneously.

ML-based algorithms learn patterns and features of data to make decisions or predictions. The current ML algorithms can be categorised based on the following three characteristics:

  • Feedback: The feedback mechanism of the underlying ML algorithm
  • Purpose: The final objective of the applied algorithm
  • Method: The algorithm used for self-improvement based on evidence from data

Training an ML algorithm requires Feedback to improve the quality of the model. This can be provided in the following ways:

  • Supervised Learning: Where the labelled data acts as ground truth for improving
  • Unsupervised Learning: Where a similarity metric guides the grouping of data
  • Semi-Supervised Learning: Where a similarity metric lets similar data share the same label for improving
  • Reinforcement Learning: Where feedback relies on a reward based cost function

The Purpose of an ML algorithm can be one of the following, as per the need of the business use case: {Anomaly Detection, Classification, Regression, Clustering, Dimensionality reduction}.

Every effective ML algorithm needs a Method that drives it to improve its accuracy based on new input data. Optimisation is the core layer behind ML algorithms: it optimises a cost function J(θ). A few such methods are as follows:

Gradient based: Minimise a loss function J(θ) by adapting model parameters θ

  • Batch gradient descent: θ = θ − η · ∇_θ J(θ)
  • Mini-batch gradient descent: θ = θ − η · ∇_θ J(θ; x^(i:i+n); y^(i:i+n))
  • Stochastic gradient descent (SGD): θ = θ − η · ∇_θ J(θ; x^(i); y^(i))

EM (Expectation Maximisation) algorithm: Iteratively find maximum likelihood

Search based, such as Genetic algorithms: Learn iteratively based on evolution

Rule based ML: Association rule mining, Decision Trees

Gradient descent based approaches fit well within the Distributed ML paradigm, and they also underlie many popular ML algorithms such as: {Artificial Neural Networks ANNs – [DNNs, CNNs, RNNs, SOMs, Auto-Encoders, GANs], Support Vector Machines (SVMs), Perceptron, Logistic Regression, …}. Therefore, we will focus more on variants of gradient based approaches for Distributed ML.
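The three gradient based update rules listed earlier differ only in how many samples feed each update. The following is a minimal sketch, assuming a one-feature linear model y = w·x + b with a mean squared error loss; the helper names `gradient` and `train`, the learning rate and the toy data are all illustrative choices, not part of any specific library.

```python
import random

def gradient(theta, xs, ys):
    """Gradient of the mean squared error J(theta) for the model y = w*x + b."""
    w, b = theta
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b

def train(xs, ys, batch_size, eta=0.1, epochs=500, seed=0):
    """batch_size == len(xs): batch GD; 1: SGD; anything between: mini-batch."""
    rng = random.Random(seed)
    theta = (0.0, 0.0)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            g = gradient(theta, [xs[i] for i in batch], [ys[i] for i in batch])
            # theta = theta - eta * grad_theta J(theta; x_batch; y_batch)
            theta = (theta[0] - eta * g[0], theta[1] - eta * g[1])
    return theta

xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]  # noiseless toy data with w = 3, b = 1

batch_theta = train(xs, ys, batch_size=len(xs))
mini_theta = train(xs, ys, batch_size=5)
sgd_theta = train(xs, ys, batch_size=1)
print(batch_theta, mini_theta, sgd_theta)
```

On this noiseless toy problem all three variants approach the same parameters. They differ in how often the parameters are updated, which is exactly what matters once each update has to be communicated between machines.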


Parallelism

When it comes to distribution, there are two fundamentally different ways of partitioning the problem across machines:

Data Parallel

The data is partitioned according to the number of worker nodes in the system. All workers apply the same algorithm to different partitions of the data. The same model is available to all worker nodes (either through centralization or through replication), so that a single coherent output emerges naturally. This assumes that the data samples are i.i.d. (independent and identically distributed), which is valid for most ML algorithms.
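As a rough sketch of the data-parallel idea, the snippet below simulates three workers in a single process. Each worker holds the same parameter `w`, computes a gradient on its own shard, and the shard gradients are combined into one global update. The model (y = w·x), the function names and the data are assumptions made for illustration only.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w*x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, eta):
    # Every worker sees the same w but a different partition of the data.
    grads = [local_gradient(w, shard) for shard in shards]
    # Weight each gradient by its shard size so that the combined
    # gradient equals the gradient over the full data set.
    total = sum(len(shard) for shard in shards)
    global_grad = sum(g * len(s) for g, s in zip(grads, shards)) / total
    return w - eta * global_grad

data = [(i / 10, 2.0 * i / 10) for i in range(1, 13)]  # toy data, y = 2x
shards = [data[0:4], data[4:8], data[8:12]]            # one shard per worker

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, eta=0.5)
print(w)
```

Because the combined gradient matches the full-data gradient, the distributed update behaves like a single-machine batch update. Only the per-shard gradients, not the data, need to travel between nodes.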


Model Parallel

Exact copies of the entire data are processed by the worker nodes, which operate on different parts of the model. The model is the aggregate of all model parts. This cannot be applied to every ML algorithm, as model parameters often cannot be split up.
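A linear model is one of the simpler cases where the parameters can be split. In the sketch below, which is an illustrative assumption rather than a general recipe, the weight vector is divided into three parts. Each simulated worker receives the full sample, uses only its own part of the model, and the partial scores are summed into the final prediction.

```python
def partial_score(weights_part, features_part):
    """Contribution of one model part to the overall linear score."""
    return sum(w * x for w, x in zip(weights_part, features_part))

def model_parallel_predict(weight_parts, slices, sample):
    # Every worker gets the whole sample but holds only one part of the model,
    # so it reads just the features that its weights apply to.
    partials = [partial_score(part, sample[sl])
                for part, sl in zip(weight_parts, slices)]
    # The full model is the aggregate of all model parts.
    return sum(partials)

weights = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75]      # hypothetical full model
slices = [slice(0, 2), slice(2, 4), slice(4, 6)]  # one slice per worker
weight_parts = [weights[sl] for sl in slices]

sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
prediction = model_parallel_predict(weight_parts, slices, sample)
print(prediction)
```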

Data & Model Parallel – Ensemble Model

The ensemble approach applies a combination of the two approaches mentioned above. Training happens in two stages: first at the local sites where the data is stored, and second at a global site that aggregates the individual results of the first stage. This global aggregation can be achieved by applying ensemble methods such as Bagging, Boosting, Random Forests and Stacking.
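The two-stage idea can be sketched as follows. This is a deliberately small illustration under stated assumptions: each local site fits a one-feature threshold classifier on its own data, and the global stage combines the local models with a plain majority vote, which stands in here for a bagging-style aggregation. The function names and the data are hypothetical.

```python
from statistics import mean

def train_local_stump(samples):
    """Stage 1: fit a threshold classifier on the data held at one site."""
    zeros = [x for x, label in samples if label == 0]
    ones = [x for x, label in samples if label == 1]
    # Place the threshold halfway between the two class means.
    return (mean(zeros) + mean(ones)) / 2

def ensemble_predict(thresholds, x):
    """Stage 2: global majority vote over the local models."""
    votes = sum(1 for t in thresholds if x > t)
    return 1 if votes * 2 > len(thresholds) else 0

sites = [
    [(1.0, 0), (2.0, 0), (6.0, 1), (7.0, 1)],
    [(0.5, 0), (3.0, 0), (5.5, 1), (8.0, 1)],
    [(1.5, 0), (2.5, 0), (6.5, 1), (9.0, 1)],
]
thresholds = [train_local_stump(site) for site in sites]
print(thresholds, ensemble_predict(thresholds, 4.5))
```

Only the fitted thresholds leave each site, so the raw data stays where it is stored.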


Topology

The topology of the Distributed ML system is an important part of its architectural design. The nodes of the distributed system need to be connected through a specific architectural pattern to fulfil a common task. The choice of pattern has implications for the role that a node can play, for the degree of communication between nodes, and for the failure resilience of the system.

The deciding factor for the topology is the degree of distribution:

Centralized systems employ a strictly hierarchical approach to aggregation, which happens in a single central location.

Decentralized systems allow for intermediate aggregation, in one of two forms:

  • Intermediate aggregation: A replicated model that is consistently updated when the aggregate is broadcast to all nodes, such as in tree topologies
  • Partitioned model: A model sharded over multiple parameter servers

Fully distributed systems represent a network of independent nodes that ensemble the solution together and where no specific roles are assigned to certain nodes.

Based on the above strategy, we can have the following designs for distributed topologies:

Trees: In a Tree topology, each node communicates only with its parent and child nodes. For example, in Map-Reduce, the nodes of a tree accumulate their local gradients and pass the sum to their parent node to calculate a global gradient. This can be done in both centralized and decentralized fashion, as described above.

Rings: In situations where the communication system does not provide efficient support for broadcast, or where communication overhead needs to be kept to a minimum, a ring topology requires each node to synchronise only with its neighbours through messages.

Parameter Server: Uses a decentralized set of workers with a centrally maintained shared state. The model parameters are sharded across the Parameter Servers, each of which stores one shard. An advantage is that the parameters can be read and written like a key-value store in globally shared memory. A disadvantage is that the parameter server can become a bottleneck.

Peer to Peer: In the fully distributed model, each node has its own copy of the parameters, and the workers communicate directly with each other. This has the advantage of higher scalability and the elimination of single points of failure in the system. One disadvantage of this approach is that the broadcasting can lead to high communication overhead.
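To make the Tree and Ring patterns described above more tangible, the sketch below simulates both in a single process and checks that they arrive at the same global gradient sum. It is a simplified illustration: `tree_reduce` and `ring_sum` are hypothetical names, and the ring version is a naive neighbour-to-neighbour pass, not the bandwidth-optimised chunked all-reduce used in production systems.

```python
def tree_reduce(values):
    """Sum values by combining neighbouring pairs level by level (binary tree)."""
    level = list(values)
    while len(level) > 1:
        parents = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            parents.append(level[-1])  # an unpaired node is passed up unchanged
        level = parents
    return level[0]

def ring_sum(values):
    """Each node forwards what it last received to its right-hand neighbour.

    After n - 1 rounds every node has seen every other node's value once,
    so every node holds the global sum without any broadcast.
    """
    n = len(values)
    totals = list(values)
    in_flight = list(values)  # the message each node sends in the next round
    for _ in range(n - 1):
        received = [in_flight[(i - 1) % n] for i in range(n)]
        totals = [t + r for t, r in zip(totals, received)]
        in_flight = received
    return totals

local_gradients = [1.0, 2.0, 3.0, 4.0, 5.0]  # one local gradient per node
print(tree_reduce(local_gradients))
print(ring_sum(local_gradients))
```

In the tree, the result ends up at the root and still has to be sent back down. In the ring, every node finishes with its own copy, at the cost of a number of rounds that grows with the number of nodes.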

The following images illustrate (a) a centralized ensemble system, (b) a decentralized system based on tree topology, (c) a decentralized system based on a parameter server, and (d) a fully distributed peer-to-peer system.

Image Source: arXiv:1912.09789 [cs.LG]


Communication

In Distributed ML, one of the major problems is deciding on the trade-off among Communication, Compute-Time, and Accuracy. The distribution strategy has direct implications for the amount of communication required to train the model. Parallelising the learning can reduce computation time, as long as the communication costs do not become dominant. The right amount of communication is needed to achieve the desired accuracy within an acceptable computation time. To schedule and balance the workload, three concerns have to be taken into account: first, identifying which tasks can be executed in parallel; second, deciding the task execution order; and third, ensuring a balanced load distribution across the available machines.

There are several techniques to enable effective communication between the nodes:

Bulk Synchronous Parallel (BSP)

This is the simplest model in which programs ensure consistency by synchronising between each computation and communication phase. The popular Map-Reduce is an example of BSP.

Stale Synchronous Parallel (SSP)

Relaxes the synchronisation overhead by allowing faster workers to move ahead by a certain number of iterations. If this number is exceeded, all workers are paused. Workers operate on cached data and only commit changes at the end of a task cycle, which can cause other workers to operate on stale data.

Approximate Synchronous Parallel (ASP)

Limits how inaccurate a parameter can be, in contrast to SSP, which limits how stale a parameter can be.

Barrierless Asynchronous Parallel (BAP)

Lets worker machines communicate in parallel without waiting for each other.

One needs to weigh the advantages and disadvantages of the above strategies and choose the one that best suits their Distributed ML design.
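The difference between BSP and SSP can be seen in a small simulation. The sketch below is an assumption-laden toy, not a real scheduler: time advances in ticks, each worker needs a fixed number of ticks per iteration, and a worker may only keep working while its count of completed iterations is at most `staleness` ahead of the slowest worker. Setting `staleness=0` gives BSP-like barriers, and larger values give SSP-like behaviour. For simplicity it blocks only the worker that is too far ahead.

```python
def simulate(ticks_per_iteration, staleness, total_ticks):
    """Return (completed iterations, ticks spent waiting) for each worker."""
    n = len(ticks_per_iteration)
    clocks = [0] * n    # completed iterations per worker
    progress = [0] * n  # ticks already spent on the current iteration
    waiting = [0] * n   # ticks spent blocked by the staleness bound
    for _ in range(total_ticks):
        slowest = min(clocks)  # snapshot taken at the start of the tick
        for w in range(n):
            if clocks[w] - slowest > staleness:
                waiting[w] += 1  # too far ahead: wait for slower workers
                continue
            progress[w] += 1
            if progress[w] == ticks_per_iteration[w]:
                clocks[w] += 1
                progress[w] = 0
    return clocks, waiting

speeds = [1, 3]  # worker 0 is three times faster than worker 1
print("BSP (staleness 0):", simulate(speeds, staleness=0, total_ticks=12))
print("SSP (staleness 2):", simulate(speeds, staleness=2, total_ticks=12))
```

With a larger staleness bound the fast worker idles less and completes more iterations, but it computes on parameters that can be several iterations out of date. That is the accuracy side of the trade-off.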

Guavus Distributed ML Strategy

Guavus has been a pioneer in Big Data processing for major telecom operators for over a decade, ingesting and analysing terabytes of data. With ML capabilities added to the Guavus Reflex engine, Guavus was able to detect outages through network anomalies, detect service degradation of popular applications by modelling network service behaviour, identify malicious threats in Cyber Security and fraud in bank transactions, categorize content in network traffic, and perform root cause analysis of complex chains of network events for operators.

Moreover, after the addition of SQLstream real-time streaming capabilities to the Guavus Reflex platform, the ML processing capability could easily be taken to edge devices. This makes closed-loop, real-time decisions possible and makes the platform more effective for the businesses of Guavus’ customers. The ML strategy of the Reflex platform was transformed on the principles of Distributed ML to match the existing massive-scale operations of Guavus’ customers.

The Guavus Distributed ML architecture covers:

Feedback: Supervised, Unsupervised, Semi-Supervised, Reinforcement.

Algorithms: A wide variety of algorithms, from standard ones to self-developed research algorithms.

Optimisation: Batch, Mini-batch, Stochastic (real-time continuous).

Parallelism: Data-parallel approaches along with Ensemble model parallel.

Topology: Decentralized systems with both intermediate aggregation and partitioned models.

  • Covers the Tree topology in both centralized and decentralized fashion.
  • Decentralized parameter server implemented on top of Redis technology.

Communication: Mainly Bulk Synchronous Parallel (BSP) through Spark and Map-Reduce, along with Barrierless Asynchronous Parallel (BAP) through Hogwild.

The Guavus strategy on Distributed ML focuses on two key aspects: Distributed Big Data modelling over data lakes and Distributed Edge analytics on streaming data.

The edge here implies going as close to the source of the data as possible: it could be individual routers, IoT devices or terminal nodes. The role of Distributed ML is to perform ML prediction and analysis directly on the edge device, rather than after bringing the data to a central location.

When it comes to technological design, Guavus’ architecture uses a hybrid strategy that extends the capabilities of existing data distribution frameworks to work in tandem with the advanced ML capabilities of research languages. A Big Data framework such as Apache Spark and the SQLstream real-time streaming framework are therefore combined with the advanced research capabilities of Python. Guavus created a new framework to extend its pre-existing Big Data/Streaming stack while still being able to apply advanced ML intermittently within the same larger framework.


Reference

“A Survey on Distributed Machine Learning”, arXiv:1912.09789 [cs.LG]
