Architecture of a modern stream processing platform for real-time data analytics

Published on Feb 05, 2022

Why a new architecture for stream processing is needed for real-time data analytics: we are seeing an explosion of fast-moving device data that holds high value for only a short period of time.


Humans to machines – the shift in data sources

There is a rapid shift happening at the data level as we speak. More and more data is being created by devices, and it is fast-moving and carries high-value but perishable insights. This shift demands a new architecture of stream processing platform for real-time data analytics. Data has been growing exponentially: we now have more data streaming through the wire than we can keep on disk, from both a value and a volume perspective. This data is being created by everything we deal with on a daily basis. When humans were the dominant creators of data, we naturally had less data to deal with, and its value would persist for a longer period. That in fact still holds true wherever humans are the sole creators of the data.

However, humans are no longer the dominant creators of data. Machines, sensors, and devices took over a long time ago. Data is now predominantly created by machines at humongous speed, so much so that 90% of all the data created since the dawn of civilization was created in the last two years. This data tends to have a limited shelf life as far as value is concerned: the value of data decreases rapidly with time. If the data is not processed as soon as possible, it may not be very useful for ongoing business and operations. Naturally, we need a different thought process and approach to deal with this data.

Why stream analytics is the key to future analytics

More and more of this data is streaming in from all kinds of sources, and if it were combined and analyzed, huge value could be created for users and businesses. At the same time, given the perishable nature of the data, it is imperative that it be analyzed and used as soon as it is created.

The value of data is at its maximum when it is created. Streaming data is perishable in nature, so insights must be extracted from it immediately.

More and more use cases are emerging that need to be tackled to push the boundaries and achieve newer goals. These use cases demand collecting data from different sources, joining it across different layers, and correlating and processing it across different domains, all in real time. The future of analysis is less about understanding “what happened” and more about “what’s happening, or what may happen”.

A few example use cases in this context

E-commerce

Let’s analyze some of these use cases. Consider an e-commerce platform that is integrated with a real-time stream processing platform. Using this integrated streaming analysis of data, it could combine and process different data in real time to figure out the intent or behavior of a user and present a personalized offer or content. This could increase the conversion rate significantly and reduce the erosion of customer engagement. It could also enable better campaign management, yielding better results for the same spend.

Data Center

Think of a small or mid-size data center (DC), which typically has many different kinds of devices and machines, each generating volumes of data every moment. DCs typically use many different static tools for different kinds of data, in different silos. These tools not only prevent the DC from having a single view of the entire facility, they also act as no more than after-the-fact BI tools. Because of this, issues are not identified in a predictive or real-time manner, and firefighting becomes the norm of the day. With a converged, integrated stream processing platform, a DC could have a single view of the entire data center along with real-time monitoring of events and data, ensuring issues are caught before they create bigger problems. A security breach could be seen or predicted much earlier, before the damage is done. Better resource planning and provisioning could be achieved by analyzing bandwidth usage and forecasting in near real time.

IoT

The entire IoT is based on the premise that everything can generate data and interact with other things in real time to achieve larger goals. This requires a real-time streaming analytics framework to be in place to ingest all sorts of unstructured data from disparate sources, monitor it in real time, and take action as required after identifying either known patterns or anomalies.
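As a toy illustration of the anomaly-detection half of that requirement (a generic technique with invented numbers, nothing vendor-specific), a rolling z-score can flag a sensor reading that deviates sharply from the recent stream:

```python
# Toy streaming anomaly detector: flag a reading whose z-score against
# the recent window exceeds a threshold. Illustrative only.
from collections import deque
import math

WINDOW, THRESHOLD = 20, 3.0
recent = deque(maxlen=WINDOW)   # keeps only the most recent readings

def on_reading(x):
    if len(recent) >= 5:                       # wait for a small baseline
        mean = sum(recent) / len(recent)
        var = sum((v - mean) ** 2 for v in recent) / len(recent)
        std = math.sqrt(var)
        if std > 0 and abs(x - mean) / std > THRESHOLD:
            print(f"anomaly: {x} (mean={mean:.2f}, std={std:.2f})")
    recent.append(x)

for v in [10, 11, 9, 10, 10, 11, 9, 10, 50, 10]:
    on_reading(v)                              # flags the 50
```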

AI and Predictive Analysis

AI and predictive analytics assume that data is being collected and processed in real time; otherwise the impact of AI is limited to understanding what happened. With the growth in data volumes and types, it would be imprudent to rely solely on what has been learned in hindsight; the demand will be for reacting to new things as they are seen or felt. We have also learned from experience that a model trained on older data often struggles to deal with newer data with acceptable accuracy. Here too, therefore, a real-time stream processing platform becomes a required part rather than a good-to-have piece.

Limitations of existing tools and platforms

There are two broad categories into which we can slot the options available in the market. One is the appliance model, and the other is a series of open-source tools that need to be assembled to create a platform. While the former costs several millions of dollars upfront, the latter requires dozens of consultants for several months to stitch together a platform. Time to market, cost, ease of use, and the lack of unified options are a few major drawbacks. However, there are bigger issues that neither option addresses when it comes to stream processing, and here we require a new approach. We can’t apply older tools to newer, future-looking problems; otherwise the result remains patchwork and will not scale to the needs of the hour.

Challenges with Stream Processing

Here are the basic high-level challenges when it comes to dealing with streams of data and processing them in real time.

  • Dealing with a high volume of unstructured data with extremely low latency
  • Avoiding multiple copies of the data across different layers and over the network
  • Ensuring an optimal flow of data through the system
  • Partitioning the application across resources
  • Processing streaming data in real time
  • Storing data in the most suitable manner
  • Processing data and taking actions before it is persisted
  • Remaining predictive rather than only a forensic or BI tool
  • Ease of use – we should not have to code for months before seeing results
  • Time to market – off the shelf, so that an app can go to market in a short time
  • Deployment model – hybrid, from within the device to LAN to cloud, all interconnected

Most of the options in the market suffer from these bottlenecks. Let’s take a few examples.

Spark

Spark philosophically follows the map-reduce model, although in a much more efficient manner. However, it still deals with batches: Spark processes micro-batches of a given size at a given batch interval. This creates several problems when it comes to stream processing; in fact, the approach is the antithesis of stream processing, as the sketch after the following list illustrates.

  • Micro or macro doesn’t matter as long as it’s a batch. A macro-batch and a micro-batch may even contain an equal number of events, simply because the data arrives at different speeds. The concept of a batch doesn’t align with stream processing, where processing every single event is important
  • Processing starts only when a batch is full, which defeats the premise of processing events as they arrive
  • Stream processing happens within a moving or sliding window; windowing over batches is not possible
  • When the batch processing time exceeds the batch interval, the backlog only grows, and this, coupled with persisting data sets, only aggravates the situation
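To make the batch-interval coupling concrete, here is a minimal PySpark sketch using the classic DStream API. The host, port, and 5-second interval are illustrative placeholders: nothing in the pipeline runs until a batch boundary is reached, no matter when each event actually arrived.

```python
# Minimal PySpark sketch of Spark's micro-batch (DStream) model.
# The batch interval is fixed up front; events are buffered until a
# batch boundary is hit, regardless of when each event arrives.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MicroBatchDemo")
ssc = StreamingContext(sc, batchDuration=5)   # a new batch every 5 seconds

# Events on this socket are grouped into 5-second batches; an event
# arriving just after a boundary waits ~5s before it is processed.
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda l: l.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

With true event-at-a-time processing, by contrast, the counts would update the moment each line arrived.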

Kafka + Spark + Cassandra

This model typically uses five or more distributed verticals, each containing many different nodes. This increases network hops and data copies considerably, which in turn increases latency. Scaling such a system is not trivial, as there are different dynamic requirements at different levels. Further, the cost of adding new processing logic is significantly higher than in a simple BI tool, where things can be handled from a dashboard. Finally, it requires a large team and substantial resources, which increases the cost. Such a stack can hardly be deployed in a scenario where sub-second latency is desirable.

Kinesis

AWS Kinesis is, at best, equivalent to Kafka: a distributed, partitioned messaging layer. Users still must assemble the processing, storage, visualization, and other layers themselves.

Why BangDB

We need a platform that is designed and implemented in a homogeneous manner, towards the single goal of processing stream data in real time; one that avoids all the above pitfalls, stays ready for future requirements, and scales well for higher loads and volumes of data.

BangDB tries to address most of the above-mentioned problems by designing and building the entire stack from the ground up. Here is a brief introduction to the BangDB platform in light of the issues identified above.

Deal with a high volume of unstructured data

BangDB includes a high-performance NoSQL database that scales well for large amounts of data. It also follows the convergence model to scale linearly: scaling a “single thing” rather than “many things” addresses the problem to a large extent. BangDB further follows the FSM model and implements SEDA (staged event-driven architecture) to provide a cushion against sudden surges in data.

BangDB follows a true convergence model for higher performance, ease of management, and massive linear scale
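BangDB’s internal implementation is not shown in the post, so the following is only a rough Python sketch of the SEDA idea referenced above: each stage owns a bounded queue and its own worker, so a sudden burst of events piles up in a queue instead of overwhelming the slower downstream stages.

```python
# Illustrative SEDA-style pipeline (not BangDB's actual implementation):
# each stage is fed by a bounded queue, so a burst of incoming events
# queues up rather than overwhelming the slower stages.
import queue, threading, time

ingest_q = queue.Queue(maxsize=10_000)    # cushion against sudden surges
process_q = queue.Queue(maxsize=10_000)

def ingest_stage():
    for i in range(100):                  # stand-in for a network source
        ingest_q.put({"id": i, "ts": time.time()})

def parse_stage():
    while True:
        event = ingest_q.get()
        event["parsed"] = True            # stand-in for real parsing work
        process_q.put(event)
        ingest_q.task_done()

def sink_stage():
    while True:
        event = process_q.get()
        print("processed", event["id"])
        process_q.task_done()

threading.Thread(target=parse_stage, daemon=True).start()
threading.Thread(target=sink_stage, daemon=True).start()
ingest_stage()
ingest_q.join()                           # wait until every event drains
process_q.join()
```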

Avoiding multiple copies of the data across different layers

BangDB removes all silos. Silos not only add latency, they also force data to be copied across different verticals.

Optimal flow of the data through the system

BangDB processes data before it reaches the disk, which is the opposite of most systems in the market. Further, BangDB avoids post-processing as far as possible, to the point where it is almost negligible. All of this happens on the node where the data arrives, so there are no extra network hops for the data.

Partitioning the application across resources

Convergence allows BangDB to partition the application, the data, and all other resources in a single-dimensional manner. This enables the partitioning of one space rather than the partitioning of several different sets of spaces. It therefore naturally enforces optimal use of added capacity and resources, which is otherwise difficult to predict and provision.

Processing streaming data in real-time

BangDB processes every single event rather than micro or macro batches. Hence, the data is updated in real time, patterns are identified in real time, and insights are served to the application in real time. Most real-time streaming use cases emphasize the need to process data in a sliding window, and BangDB provides a configurable sliding window within which most of the processing happens.

True stream processing with a continuous sliding window; most of the operations happen within the sliding window.
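The post does not show BangDB’s window API, so here is a generic, illustrative Python sketch of event-at-a-time processing over a configurable sliding window: each event is handled immediately on arrival, and aggregates are computed over whatever currently falls inside the window.

```python
# Generic sketch of event-at-a-time processing over a sliding window
# (illustrative only; BangDB's configurable window is not shown here).
import time
from collections import deque

WINDOW_SECONDS = 60   # assumed configurable window size

window = deque()      # (timestamp, value) pairs currently in the window

def on_event(value, ts=None):
    """Process each event as it arrives, not as part of a batch."""
    ts = ts if ts is not None else time.time()
    window.append((ts, value))
    # Evict events that have slid out of the window.
    while window and window[0][0] < ts - WINDOW_SECONDS:
        window.popleft()
    # Any aggregate over the window is available immediately.
    values = [v for _, v in window]
    return sum(values) / len(values)      # e.g. a running windowed average

print(on_event(10.0))   # -> 10.0
print(on_event(20.0))   # -> 15.0
```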

BangDB follows the reverse of map-reduce to achieve very high read performance. It does this by avoiding all post-processing of data and keeping the data in the format needed by the user.

Data storage

BangDB stores both the raw data and the extracted insights or aggregated data within the system. It is a persistent platform, so it can store as much data as required. Most of the time, however, it is critical to process data in real time and then push it to an offline system for deeper analysis; therefore, BangDB connects with Hadoop and other long-term storage frameworks as well.

BangDB has an IO layer that uses the SSD as an extension of RAM rather than as a replacement for the file system, thereby allowing out-of-memory computation and data handling without severe degradation in performance.

BangDB also uses SSDs in a totally different way to achieve cost-effectiveness and elasticity. SSDs are typically used as a replacement for file systems (or HDDs), where the gain is limited and, if they are not used properly, their lifespan and performance suffer as well. BangDB has written software that treats the SSD as an extension of memory, through which performance can be increased multifold and cost-effectiveness achieved to a great extent.
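BangDB’s IO layer is proprietary, but the general mechanism of treating an SSD-resident file as an extension of memory can be sketched with a memory-mapped file: the OS keeps hot pages in RAM and evicts cold pages to the SSD, so the working set can exceed physical memory. The path and size below are arbitrary placeholders.

```python
# Rough sketch of "SSD as an extension of RAM" via a memory-mapped file
# (the OS pages hot regions into RAM and evicts cold ones back to SSD).
# Only gestures at the idea; BangDB's actual IO layer is custom.
import mmap, os

PATH = "/tmp/ssd_backed.buf"      # assume this path lives on an SSD
SIZE = 1 << 30                    # 1 GiB buffer, may exceed free RAM

with open(PATH, "wb") as f:
    f.truncate(SIZE)              # sparse file; blocks allocated on write

f = open(PATH, "r+b")
buf = mmap.mmap(f.fileno(), SIZE)

buf[0:5] = b"hello"               # write lands in page cache, flushed to SSD
buf.seek(0)
print(buf.read(5))                # read hits RAM if the page is resident

buf.close()
f.close()
os.remove(PATH)
```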

Remain predictive rather than only forensic or BI tool

BangDB aims to be predictive. It processes and analyzes data in both an absolute and a predictive manner. It uses complex event processing (CEP) for absolute pattern recognition, and supervised and unsupervised machine learning for pattern or anomaly detection. The BangDB platform provides simple ways to upload and train models, and it also integrates with “R” for data science requirements.
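The post does not include BangDB’s CEP syntax, so the following is a hypothetical, minimal illustration of the kind of absolute pattern CEP matches: flagging a user who generates three failed logins within a 30-second window, evaluated event by event as the stream arrives. The event names and thresholds are invented for the example.

```python
# Toy complex-event-processing (CEP) rule, illustrative only:
# flag a user who produces 3 "login_failed" events within 30 seconds.
import time
from collections import defaultdict, deque

WINDOW = 30            # seconds
THRESHOLD = 3
failures = defaultdict(deque)    # user -> timestamps of recent failures

def on_event(user, kind, ts=None):
    ts = ts if ts is not None else time.time()
    if kind != "login_failed":
        return None
    q = failures[user]
    q.append(ts)
    while q and q[0] < ts - WINDOW:   # drop failures outside the window
        q.popleft()
    if len(q) >= THRESHOLD:
        return f"ALERT: {user} had {len(q)} failed logins in {WINDOW}s"
    return None

print(on_event("alice", "login_failed", ts=100))  # None
print(on_event("alice", "login_failed", ts=110))  # None
print(on_event("alice", "login_failed", ts=120))  # ALERT
```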

Ease of use

Both the appliance and open-source models provide a technology platform on which new analytic applications or processing code must be developed and deployed into the production system. This requires the usual test-to-production DevOps and release management. BangDB instead provides an integrated dashboard that makes the platform fully extensible: users can perform all actions from the dashboard without ever developing code or an application. Further, BangDB has developed pre-baked apps in different domains and uploaded them to its AppStore, so users can simply take those apps, configure them, and start working with real-time insights.

Time to market

The BangDB platform is hosted in the cloud as a SaaS model, along with an AppStore offering several solutions. This allows users to start within an hour or even less. There is no stitching time, deployment time, or even development time; everything is ready to go with a few clicks.

Deployment model

BangDB can be deployed within a device for state-based computations, including CEP and ML processing. Further, BangDB can run in a LAN and in the cloud too, and all of these can be interconnected for supercharged orchestration. BangDB has a subscription model: users can start within a minute using BangDB SaaS and then grow as needed. Get started with BangDB by simply downloading it.
