Why AI needs Graph and Streaming databases for higher efficiency

Introduction

AI is now central to data processing and analysis. More and more systems are trying to improve the efficiency of their AI models, because it translates directly into business outcomes: a model that performs well produces good outcomes, while one that performs poorly is often counter-productive. It is therefore extremely important to get the AI model right. What makes the biggest difference to a model’s efficiency? It is the ability to leverage context alongside the set of features used for training. A graph structure is one of the most powerful mechanisms for representing the context within data, and a streaming layer lets us capture and enrich that context as the data arrives. This is why AI needs a Graph and Streaming database for higher efficiency.

Example use cases

Decision-making today is largely driven by AI at its core. We can see this in the way e-commerce companies target users, the way fintech improves the customer experience, the way vehicles are becoming more autonomous, the way cyber security hardens firewalls to thwart threats, and in many more use cases across sectors. But training AI models on a set of features alone is not sufficient; we must also capture the context in the most natural way to improve the quality of predictions. For example, it’s hard to tell two similar-looking actions apart without taking their context into account. It is therefore imperative to represent the context along with the data and to leverage both during model training.

Let’s consider an example where an e-commerce or fintech company wants to recommend a set of products to users. If we consider only two random people with similar basic profiles, we might end up recommending in a homogeneous manner, giving both the same suggestions, and risk losing both potential opportunities. However, if we consider their context, we can suddenly serve each of them far more relevant recommendations. This concept is widely understood and accepted. The central question is: how do we capture and use context for the AI model?

Definition of context

To answer this, we must understand what context means. In a simple sense, context can be defined as the time, environment, and background in which certain events occur. Time, environment, and background can in turn be loosely defined as the participating entities and their various inter-relationships. While the identity of an entity (a node in the graph) may be invariant, numerous dynamic relationships can be defined between nodes, and the properties of the nodes can change continuously. Entities and relationships, taken together, are what bring the context into the processing.
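To make the definition concrete, here is a minimal sketch in plain Python (not BangDB code) of two users whose basic profiles are identical but whose contexts differ. The user names, relationship labels, and the `context_of` helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical basic profiles: the kind of flat features a model usually sees.
profiles = {
    "alice": {"age_band": "30-39", "city": "London"},
    "bob": {"age_band": "30-39", "city": "London"},
}

# Hypothetical relationships: (entity, relationship, other entity).
relationships = [
    ("alice", "BOUGHT", "running_shoes"),
    ("alice", "FOLLOWS", "marathon_club"),
    ("bob", "BOUGHT", "office_chair"),
    ("bob", "FOLLOWS", "home_office_blog"),
]


def context_of(entity, triples):
    """Return the set of (relationship, target) pairs attached to an entity."""
    return {(rel, target) for source, rel, target in triples if source == entity}


# The flat profiles cannot tell the two users apart ...
same_profile = profiles["alice"] == profiles["bob"]
# ... but their relationships, i.e. their context, can.
same_context = context_of("alice", relationships) == context_of("bob", relationships)

print(same_profile, same_context)
```

A model trained only on `profiles` would treat both users alike; the relationships are what separate them.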

Once we understand that context can be defined as entities and their relationships, the next question is: how do we store these contexts efficiently in the database and query them in a high-performance manner? How do we enrich the data as it flows into the system so that, when it comes to model training, we can leverage not only the contexts but also some extra computed values as features for the models? For example, can we find the natural clusters within the data? Can we find similarity scores among different entities? Can we find recurring patterns? If yes, then these could become part of the feature set for model training.

Steps to ingest, process, store in Graph, and train AI models

First, we must ingest the data in a way that lets it enter the system without any impedance. ETL is one of the most difficult and heavy tasks in data processing; it is widely recognized that we sometimes spend the majority of our time trying to get it right, because even a minor mistake at the beginning carries a heavy penalty later. BangDB avoids this to a great extent by implementing a continuous data ingestion mechanism, with processing that extracts and transforms what we need at any given time. Users can ingest data continuously and transform it while it is being ingested, enriching it for further computation. Running statistics, joins, refers, complex event processing, filters, and computed attributes are some of the tools available to enrich, add to, and expand the data in real time, one event at a time. BangDB can also continuously update the underlying Graph Store as data arrives. Since there are plenty of ways to add to and transform the data within the event processing framework, the graph receives enriched data along with the raw events, which makes the structure far more valuable.
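As an illustration of enriching events one at a time, the sketch below keeps a running mean per user and attaches two computed attributes to every event as it passes through. It is plain Python and not the BangDB API; the event fields (`user`, `amount`), the `enrich` function, and the "twice the running mean" rule are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class RunningMean:
    """Incrementally updated mean, so no past events need to be stored."""
    count: int = 0
    mean: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.mean += (value - self.mean) / self.count


def enrich(events):
    """Yield each event with computed attributes added while it streams by."""
    stats = {}  # one RunningMean per user
    for event in events:
        user_stats = stats.setdefault(event["user"], RunningMean())
        user_stats.update(event["amount"])
        yield {
            **event,
            # Running statistic, updated to include the current event.
            "avg_amount": user_stats.mean,
            # Computed attribute (hypothetical rule): unusually large purchase.
            "is_large": event["amount"] > 2 * user_stats.mean,
        }


# Hypothetical stream of purchase events.
events = [
    {"user": "u1", "amount": 10.0},
    {"user": "u1", "amount": 20.0},
    {"user": "u1", "amount": 90.0},
]
enriched = list(enrich(events))
for row in enriched:
    print(row)
```

Each enriched event could then be forwarded to the graph store together with the raw fields, so the extra attributes are available later as features.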

Next, we pass the data from the stream layer to the graph store, where the different entities and their relationships can be stored dynamically. We can simply tell the stream layer to pass the data to the graph store. The BangDB Graph store is powerful and efficient, and triples (subject, predicate, and object) can be stored explicitly or implicitly. While the stream layer pushes data explicitly, we can also use IE (information extraction) to perform NER (named entity recognition) and derive relationships among the entities. BangDB Graph is a feature-rich and efficient store for triples, and it allows Cypher and SQL-like queries to be executed for data retrieval.
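To show what storing and retrieving triples means in practice, here is a toy in-memory triple store in Python. It is only a sketch of the idea and says nothing about how BangDB implements its Graph store or its query syntax; the `TripleStore` class, its methods, and the sample data are hypothetical.

```python
class TripleStore:
    """Toy store of (subject, predicate, object) triples with pattern matching."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples in sorted order; None acts as a wildcard."""
        return sorted(
            triple
            for triple in self._triples
            if (subject is None or triple[0] == subject)
            and (predicate is None or triple[1] == predicate)
            and (obj is None or triple[2] == obj)
        )


store = TripleStore()
store.add("alice", "BOUGHT", "running_shoes")
store.add("carol", "BOUGHT", "running_shoes")
store.add("bob", "BOUGHT", "office_chair")
store.add("alice", "LIVES_IN", "london")

# "Who bought running shoes?" expressed as a triple pattern.
buyers = [s for s, _, _ in store.match(predicate="BOUGHT", obj="running_shoes")]
print(buyers)
```

A real graph store would index the triples rather than scan them, but the query shape, fixing some positions of the triple and leaving the others open, is the same idea that graph query languages such as Cypher build on.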

Now, how does Graph help in building AI models? First, it allows us to build feature sets quite efficiently: when we fetch entities, we can fetch them based on their relationships as well. Further, we can exploit the way data is stored within the Graph store by extracting the various natural clusters and groups. Next, we can use the inbuilt Graph processing methods to compute similarity scores between different entities and use these scores while building the models. These clusters, groups, and similarities can be computed along many different dimensions. For example, groups can be based on location, age, purchasing habits, common products, spending behavior, etc. Similarity scores can be based on past patterns, anomalies, personal data, life journeys, etc.
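The sketch below shows, under simple assumptions, how clusters and similarity scores derived from a graph can become model features. It uses Jaccard similarity over shared products and connected components as the "natural clusters"; these two techniques are our choice for illustration, not necessarily the methods BangDB uses, and the purchase data is invented.

```python
from collections import defaultdict

# Hypothetical user-to-product edges taken from a purchase graph.
purchases = [
    ("alice", "shoes"), ("alice", "socks"),
    ("bob", "shoes"), ("bob", "socks"), ("bob", "watch"),
    ("carol", "laptop"),
]


def products_by_user(edges):
    adjacency = defaultdict(set)
    for user, product in edges:
        adjacency[user].add(product)
    return adjacency


def jaccard(a, b):
    """Shared items divided by all items; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_labels(adjacency):
    """Label users so that users linked by shared products get the same label."""
    users = sorted(adjacency)
    labels, next_label = {}, 0
    for start in users:
        if start in labels:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            current = stack.pop()
            for other in users:
                if other not in labels and adjacency[current] & adjacency[other]:
                    labels[other] = next_label
                    stack.append(other)
        next_label += 1
    return labels


adjacency = products_by_user(purchases)
labels = cluster_labels(adjacency)

# One feature row per user: cluster id plus similarity to the closest other user.
features = {
    user: {
        "cluster": labels[user],
        "max_similarity": max(
            jaccard(adjacency[user], adjacency[other])
            for other in adjacency if other != user
        ),
    }
    for user in adjacency
}
print(features)
```

The `cluster` and `max_similarity` columns can be appended to the regular feature set before training, which is the sense in which the graph contributes context to the model.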

    Questions we want to answer at run time

    Some of the interesting questions that we can answer using Cypher within BangDB are listed below. These examples are only meant to give a sense of the power of Graph processing within BangDB and to help you extend them into many more such questions and commands relevant to your business.

    1. Run cluster analysis on the entities within the Graph to compute and return similarity scores. This is the template for similarity based on feature set X
    2. Run association rule mining using natural Graph properties for recommendations
    3. Segment customers based on cluster analysis and return similar users
    4. Use collaborative filtering on a set of features that have a fixed and limited set of values to identify similar users
    5. Classify different groups/clusters
    6. Compute popularity-based or trend-based similarity scores and clusters
    7. Build seasonality-based ontologies and triple sets
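As an example of the second question, the following Python sketch mines simple "bought A, so recommend B" rules from purchase triples using support and confidence. It is a generic illustration of association rule mining, not a BangDB command; the triples, thresholds, and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical (user, BOUGHT, product) triples read from the graph.
triples = [
    ("alice", "BOUGHT", "shoes"), ("alice", "BOUGHT", "socks"),
    ("bob", "BOUGHT", "shoes"), ("bob", "BOUGHT", "socks"),
    ("carol", "BOUGHT", "shoes"), ("carol", "BOUGHT", "watch"),
    ("dave", "BOUGHT", "laptop"),
]


def baskets_from(triples):
    """Collect the set of products bought by each user."""
    baskets = defaultdict(set)
    for user, predicate, product in triples:
        if predicate == "BOUGHT":
            baskets[user].add(product)
    return baskets


def mine_rules(baskets, min_support, min_confidence):
    """Return sorted (antecedent, consequent, support, confidence) tuples."""
    total = len(baskets)
    item_count, pair_count = Counter(), Counter()
    for basket in baskets.values():
        item_count.update(basket)
        # Ordered pairs, so A -> B and B -> A are counted as separate rules.
        pair_count.update(permutations(sorted(basket), 2))
    rules = []
    for (antecedent, consequent), together in pair_count.items():
        support = together / total  # share of users who bought both
        confidence = together / item_count[antecedent]  # share of A-buyers who bought B
        if support >= min_support and confidence >= min_confidence:
            rules.append((antecedent, consequent, support, confidence))
    return sorted(rules)


rules = mine_rules(baskets_from(triples), min_support=0.5, min_confidence=0.6)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

With this data only the shoes/socks pair clears both thresholds, so a user who bought one of them would be recommended the other.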

    BangDB Graph Features

    BangDB is a converged database platform that natively implements and provides stream processing, AI, Graph, and multi-model data persistence and query. It provides the following high-level features for Graph processing.

    • Node, entity, triple creation
    • Running query and selecting data (Cypher and SQL*)
    • Statistics (Running and continuous)
    • Graph Functional properties
    • Graph algorithms
    • Set operations
    • Data Science (entire AI within Graph)

    To see more details on Graph, please check out the Graph introduction

    You can also check out this paper, which covers Representation Learning on Graphs in good detail