Analytics with BangDB

One of the main goals of BangDB is to let users handle high volumes of data efficiently across a variety of use cases. Features such as different table types, key types, multiple indexes, and JSON document support let users model data according to their requirements and give them flexibility in storing and retrieving it. This means that users can build their own custom applications on BangDB for data analysis.

The other approach is to provide fully baked native constructs: abstractions that can be used off the shelf to enable data analysis in different ways. These abstractions hide the complexity and expose simple APIs for storing and retrieving data for analysis. The built-in constructs thus free developers from worrying about data modeling, configuring db objects, processing the input, querying, post-processing, and so on, and let them enable the analysis simply by using an object of the appropriate type.

Advantages of abstraction

In many scenarios where a NoSQL db is used for analysis, several concepts come into play. Sometimes pre-processing is required, and at other times post-processing becomes critical, in addition to storing and retrieving data efficiently. The generic approach remains important and will always be needed, but it is often easier to use pre-assembled constructs for a given use case. The 80/20 rule applies here as well: in our experience, 80 percent of the time we need to answer just 20 percent of the possible queries. Identifying those 20 percent of queries and making pre-assembled constructs available for them is therefore attractive. BangDB strives to do exactly that, and it has started with a few constructs, described in the next section.

The advantages of pre-assembled constructs are manifold. In each of the available abstractions, BangDB stores data in a form that requires no post-processing at query time. This is a huge gain, since the response can be returned immediately. Each abstraction is also implemented to consume minimal resources and to work with different configurations suited to different scenarios. For example, using the sliding window concept with a 30 min window, one can handle ~200GB of data per day while keeping all of the window's data in memory at all times on an 8GB RAM commodity machine, or ~400GB of data per day if the window is 15 min, or over 3TB of data per day with 64GB RAM and a 15 min window. Importantly, BangDB is not restricted to operating in memory: data can overflow to disk as well, so a much larger amount of data can be handled with a single machine.
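The figures above can be sanity-checked with simple arithmetic: if data arrives at a uniform rate, the amount resident in memory at any moment is roughly the daily volume divided by the number of windows per day. The sketch below assumes a uniform arrival rate and decimal units (3TB = 3000GB), and it ignores index and bookkeeping overhead. The function name is illustrative and not part of BangDB.

```python
MINUTES_PER_DAY = 24 * 60


def window_resident_gb(daily_gb: float, window_minutes: int) -> float:
    """Approximate data held in one window, assuming a uniform arrival rate."""
    windows_per_day = MINUTES_PER_DAY / window_minutes
    return daily_gb / windows_per_day


# (daily volume in GB, window size in minutes, machine RAM in GB)
scenarios = [(200, 30, 8), (400, 15, 8), (3000, 15, 64)]

for daily_gb, window_min, ram_gb in scenarios:
    resident = window_resident_gb(daily_gb, window_min)
    print(f"{daily_gb} GB/day, {window_min} min window: "
          f"~{resident:.2f} GB resident on a {ram_gb} GB machine")
```

Under these assumptions, each scenario keeps roughly half of the machine's RAM occupied by window data (about 4.2GB on the 8GB machine and about 31GB on the 64GB machine).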

Finally, doing analysis with pre-built constructs saves time and effort and lets users quickly start working with data and analytics.

Native Constructs or Abstractions

BangDB therefore provides the following high-level constructs in version 1.5 for specific kinds of data analysis:

  1. Sliding Window
  2. Counting
  3. TopK

The approach is to work backward, from the analysis or query to the data model and abstraction. The above three items can be used separately or in combination to build data analysis for a set of use cases. More constructs will be added in the coming days to address other kinds of analysis.

Sliding Window

In real-time analysis, we are interested in the most recent data and wish to analyse it accordingly. This is different from the typical hot or cold data concept, where older data could be hotter than recent data. Here we strictly want to work within a defined recent window.

BangDB provides Sliding Window as a type: the user defines the term 'recent' by providing a time range and then always works within that range as the window slides continuously.

To further ease development, BangDB also provides a sliding table concept: the user can simply create a table that always operates on the recent data window as it slides continuously. Similar abstractions exist for counting and TopK.
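To make the idea of a continuously sliding window concrete, here is a minimal in-memory sketch. It is not BangDB's API or implementation: the class name, the `put` and `items` methods, and the explicit `now` parameter are all hypothetical, and the sketch assumes timestamps arrive in non-decreasing order.

```python
import time
from collections import deque
from typing import Optional


class SlidingWindow:
    """Keeps only the items whose timestamp lies within the last `window_sec` seconds."""

    def __init__(self, window_sec: float):
        self.window_sec = window_sec
        self._items = deque()  # (timestamp, key, value), oldest first

    def _evict(self, now: float) -> None:
        # Drop everything that has slid out of the window.
        cutoff = now - self.window_sec
        while self._items and self._items[0][0] <= cutoff:
            self._items.popleft()

    def put(self, key, value, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._evict(now)
        self._items.append((now, key, value))

    def items(self, now: Optional[float] = None):
        now = time.time() if now is None else now
        self._evict(now)
        return [(k, v) for _, k, v in self._items]


# 'recent' is defined as the last 30 minutes (1800 seconds).
w = SlidingWindow(window_sec=1800)
w.put("user:1", "login", now=0)
w.put("user:2", "search", now=1000)
w.put("user:3", "checkout", now=2000)
recent = w.items(now=2000)  # the entry written at t=0 has already slid out
```

The caller never deletes anything explicitly; old entries disappear as a side effect of reading or writing, which is the behaviour the sliding table concept describes.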

Counting

Counting is inevitable in almost all analytics. Often we need an exact total count; sometimes an approximate count within an acceptable error margin is sufficient; in other cases we need a unique count, and in yet others a non-unique one. The counting may also run since the beginning or over a specified time window that keeps sliding. For such use cases, BangDB provides native constructs for counting.

Counting can be done in various ways using BangDB. For example, we can simply create an object of Counting type and let it count uniquely forever. In some cases this is good enough, but imagine a scenario where the user would like to count uniquely for each entity: if the number of entities is large, the overhead of counting becomes very high. Let's say we have 100 M entities and we would like to count for each of them. Even if we dedicate just 16 bytes to each entity for counting, we would need 1.6GB of space, and since we need to respond quickly we would like to keep as much of it in memory as possible. In such a scenario, if we do not need an exact count and are ready to tolerate an error margin of, say, 0.05%, BangDB provides a construct with which we can count in the required fashion with only a few MB of overhead. This is probabilistic counting using the HyperLogLog concept.
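The following is a textbook-style sketch of HyperLogLog, included only to show why the memory footprint stays small regardless of how many distinct items are seen. It is not BangDB's implementation; the class, its method names, the SHA-1 based hash, and the choice of 2^14 registers are assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter with 2**p one-byte registers."""

    def __init__(self, p: int = 14):
        self.p = p
        self.m = 1 << p                       # number of registers
        self.registers = bytearray(self.m)    # fixed memory: m bytes
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128

    def add(self, item: str) -> None:
        # Take 64 bits of a well-mixed hash of the item.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in `rest` (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> float:
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog(p=14)           # 16384 registers = 16KB of state
for i in range(100_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")          # duplicates do not change the estimate
estimate = hll.count()
```

The standard error of HyperLogLog is about 1.04/√m for m registers, so this 16KB sketch is expected to be off by roughly 0.8%. By the same rule, a 0.05% error needs about (1.04/0.0005)² ≈ 4.3 million registers, which at one byte each is a little over 4MB for a single counter, consistent with the "few MB" figure above and far below the 1.6GB needed for exact per-entity state.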

All of these kinds of counting can also be done within a sliding window, and there are many configuration options for different use cases.
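One common way to combine counting with a sliding window is to keep per-bucket counts and discard buckets as they age out. The sketch below illustrates that general technique only; it does not describe how BangDB does it. The class name, the bucket granularity parameter, and the method names are hypothetical, and timestamps are assumed to be non-decreasing.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events in the last `window_sec` seconds using fixed-size time buckets."""

    def __init__(self, window_sec: int, bucket_sec: int):
        self.window_sec = window_sec
        self.bucket_sec = bucket_sec
        self._buckets = deque()  # [bucket_start, count], oldest first

    def _evict(self, now: float) -> None:
        # A bucket is dropped once it lies entirely outside the window.
        cutoff = now - self.window_sec
        while self._buckets and self._buckets[0][0] + self.bucket_sec <= cutoff:
            self._buckets.popleft()

    def add(self, now: float, n: int = 1) -> None:
        start = int(now // self.bucket_sec) * self.bucket_sec
        self._evict(now)
        if self._buckets and self._buckets[-1][0] == start:
            self._buckets[-1][1] += n       # same bucket: just bump the count
        else:
            self._buckets.append([start, n])

    def count(self, now: float) -> int:
        self._evict(now)
        return sum(c for _, c in self._buckets)


# 15 minute window tracked at 1 minute granularity.
counter = SlidingWindowCounter(window_sec=900, bucket_sec=60)
counter.add(now=0)
counter.add(now=30)
counter.add(now=600)
in_window = counter.count(now=600)   # all three events are still recent
later = counter.count(now=960)       # the first bucket has slid out
```

Memory here depends on the number of buckets rather than the number of events, at the cost of the count being accurate only to the bucket granularity.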

TopK

This is another important feature from an analytics perspective. TopK has been a topic of interest for many researchers and analysts and is therefore used in many places. BangDB provides a native construct for TopK.

TopK means keeping track of the top k items. These items could be anything, for example the top 30 users with the most items in their cart, the top 20 products searched every 15 min, the top 10 queries run every hour, etc. Using the BangDB TopK abstraction, the user can do TopK analysis using just the get and put API.

TopK can again be computed in an absolute manner or within a sliding window, with different settings.
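As a rough illustration of a put/get style TopK, here is a small exact, absolute (non-windowed) sketch. The class and the signatures of `put` and `get` are hypothetical and only mirror the shape described above; BangDB's actual API and internals may differ.

```python
from collections import Counter


class TopK:
    """Tracks scores per item and returns the k highest on demand."""

    def __init__(self, k: int):
        self.k = k
        self._scores = Counter()

    def put(self, item: str, increment: int = 1) -> None:
        # Each put adds to the item's running score.
        self._scores[item] += increment

    def get(self):
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(self._scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[: self.k]


# Top 2 products searched.
top = TopK(k=2)
for product in ["shoes", "bags", "shoes", "hats", "bags", "shoes"]:
    top.put(product)
leaders = top.get()
```

A windowed variant would additionally expire old contributions, for example by keeping one such counter per time bucket and merging only the buckets that are still inside the window.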

These are available in BangDB as fully baked constructs and hence may be used directly. Users can, however, also enable other analytical capabilities using BangDB's different features. More such abstractions will be added in the coming days for different analysis needs.