bangdb config – BangDB = NoSQL + AI + Stream


Configuration for BangDB

bangdb.config

BangDB exposes several parameters that let users configure it for optimal, efficient running and performance.

Let's first categorise these config params for better understanding and then go through each of them. We will also provide recommendations for each param.

A. Configuration that affects execution or running of BangDB
B. Configuration specific to when BangDB runs as a server
C. AI/ML related server and db configuration
D. Advanced configuration to tune the core of BangDB
Before we cover these in detail, it is good to see the command line arguments that the server takes if you run bangdb-server-2.0 directly.
Here are these configurations; they are self explanatory:
-----------------------------------------------------------------------------------------------------------
Usage: -i [master | slave] -r [yes | no] -t [yes | no] -d [dbname] -s [IP:PORT] -m [IP:PORT] -b [yes | no] -v
-----------------------------------------------------------------------------------------------------------
Options
--------
 -i: defines the server's identity [master | slave], default is SERVER_TYPE as defined in bangdb.config
 -r: defines replication state [yes | no], default is ENABLE_REPLICATION as defined in bangdb.config 
 -t: defines if transaction is enabled(yes) or disabled(no) [yes | no], default is no 
 -d: defines the dbname, default is BANGDB_DATABASE_NAME as defined in bangdb.config 
 -s: defines IP:Port of this server, default is SERVER_ID:SERV_PORT as defined in bangdb.config
 -m: defines IP:Port of the master (required only for slave as it declares the master with this option)
     default is MASTER_SERVER_ID:MASTER_SERV_PORT as defined in the bangdb.config
 -b: defines if the server is to be run in the background as a daemon
 -v: prints the alpha-numeric version of the executable

 Hence, to run a master with the other values as defined in bangdb.config, issue the following command
 ./bangdb-server -s 192.168.1.5:7887

 To run a slave for this master with the other default values
 ./bangdb-server -i slave -s 192.168.1.6:7887 -m 192.168.1.5:7887
-----------------------------------------------------------------------------------------------------------
The command line args can be provided only when you run the server directly from the executable, bangdb-server-2.0. If you run the BangDB server using the script bangdb-server, then it's not possible to provide these command line args. However, you may set all of these in the bangdb.config file and then run using either method.
Let's see how these params can be set using the bangdb.config;

Set master or slave
SERVER_TYPE is the config param and we can use this to set whether this server is master or slave
0 for master, 1 for slave

Set whether replication is ON or OFF
ENABLE_REPLICATION is the config param to set it.
1 for ON and 0 for OFF

Set db name
BANGDB_DATABASE_NAME is the param. By default it's always mydb

Set the (this) server ip and port
SERVER_ID for IP address, SERV_PORT for port. We can use the IP address or the name of the server.
SERVER_ID = 127.0.0.1
SERV_PORT = 10101

Set the master's ip and port
This is mainly for a slave, as it has to know where the master is
MASTER_SERVER_ID for ip address of master, MASTER_SERV_PORT for port of master

Run the server in the background
Need to use the -b command line argument; this can't be set using bangdb.config as of now
-b yes

Run the server with transaction
Need to use the -t command line argument; this can't be set using bangdb.config as of now
-t yes
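
Putting the above together, a minimal bangdb.config fragment for a master server might look like this (the values are illustrative, not recommendations):

```
SERVER_TYPE = 0
ENABLE_REPLICATION = 1
BANGDB_DATABASE_NAME = mydb
SERVER_ID = 192.168.1.5
SERV_PORT = 10101
```

With this in place, either launch method picks these values up; the -b and -t switches still have to go on the command line when running the executable directly.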

A. Configuration that affects BangDB execution

The following config params are for the DB, whether it is run in embedded or server mode.

Param values are set without quotes, for both numerical and string values.

SERVER_DIR
The dir where the db files will be created. Please edit it with a suitable dir location; default is the local dir. Note: this can be provided as an input param while creating a database using DBParam.

BANGDB_LOG_DIR
Log dir. This is where database write ahead log files will be kept. Default is the local dir. Note: this can be provided as an input param while creating a database using DBParam.

BUFF_POOL_SIZE_HINT
Memory budget for the DB, defined in MB. Once set, BangDB will not use more memory than this. If it's handling more data than the size of the buffer pool, then it will flush dirty pages as required. BangDB holds a patent on managing the buffer pool in a manner that is very efficient, keeps performance within an acceptable range even in adverse conditions, and degrades gracefully.

We should select this carefully as it has a direct implication on performance. The max limit for buffer pool size on a machine is ~13TB and the min limit is 500MB.
The ideal value is of course dependent on the use case, but if it's a dedicated BangDB server then the buffer pool should be RAM size minus 3-4 GB. Therefore, on a 16GB machine, 11-12 GB would be a good number.

BangDB buffer pool is very efficient, performant and implements several novel techniques for high performance. BangDB has Patent for Adaptive prefetching in Buffer Pool
And also Patent for Buffer Pool and Page Cache
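
For instance, on a dedicated server with 16GB RAM, following the RAM-minus-3/4GB guideline above, one might set (illustrative value, in MB):

```
BUFF_POOL_SIZE_HINT = 12288
```

i.e. 12GB for the pool, leaving roughly 4GB for the OS and other processes.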

BANGDB_APP_LOG
Controls where BangDB writes its application logs:
      0: flush logs to standard output (terminal)
      1: log using syslog (/var/log/syslog)
      2: flush to the log file maintained by BangDB
The preferred value is 2, as BangDB implements a high performance logging mechanism.
With 2, the DB keeps log files in data/dbname.applog; you can tail this file to see what's going on. BangDB log analysis can be used to further analyse the logs, and it will continuously monitor and notify people as needed/configured.
DB_APP_LOG_SIZE_MB
This sets the size of the applog (when BANGDB_APP_LOG = 2). When the applog file gets full, the DB creates another one and keeps rolling.
After some time, when there are too many such files, users should delete the older ones. The DB doesn't clean or reclaim these log files on its own.
BANGDB_APP_LOG_LEVEL
This sets the log level, following options;
      Possible values
      0: Critical
      1: Error
      2: Warning
      3: Info
      4: Debug
Production should run with 2 or 3 (and not with 4). However, for development and debugging, please set it to 4. Changing any param in the config file requires a DB restart, and that applies here too.
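A typical production logging setup per the recommendations above might be (the size value is illustrative):

```
BANGDB_APP_LOG = 2
DB_APP_LOG_SIZE_MB = 64
BANGDB_APP_LOG_LEVEL = 3
```

With this, logs go to data/dbname.applog, roll over at roughly 64MB per file, and record messages up to Info level.
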
BANGDB_DATABASE_NAME
You may leave this at the default value. Please note, you can always pass the dbname through the command line or using the API.
BangDB deals with only a single db at a time; the db may have many tables, streams etc. in it.
CEP_BUFFER_SW_SIZE
BangDB provides complex event processing (CEP) support, so we can look for a given complex pattern in the streaming data. These pattern analyses are state based queries which run in a sliding window. Most CEP products in the market use an in-memory model, which is a bit inefficient: if we run a few queries over a period of time and the event ingestion rate is moderately high, memory is not sufficient and the system starts dropping events. To remove this bottleneck, the BangDB CEP buffer is backed by a table, and this table runs in a sliding window.
Therefore, this param sets the sliding window size for the CEP buffer table. The size is defined in seconds; default is 86400 (1 day).
Since most CEP queries are temporal in nature, all queries are limited by this size as the high watermark, which means no individual CEP query can exceed it. 1 day is long enough for most queries, so we hardly ever need to change this, but it can be changed as needed.
BANGDB_PERSIST_TYPE
This is a table config param; it basically tells whether the table should be backed by a file on the disk or be in-memory.
This should be set using the TableEnv type.

BANGDB_INDEX_TYPE
This is a table config param; it defines the index type (primary key arrangement type) for the table.
This should be set using the TableEnv type.

BANGDB_LOG
This sets the database log; it is different from the app log, which is for db debug and error logging.
BangDB supports write ahead logging (WAL) for every write operation. WAL also ensures atomicity, transactions and durability. It further allows BangDB to recover from a crash in an automated manner.
BangDB has a patent for its efficient write ahead log.

LOG_BUF_SIZE
If BANGDB_LOG is set to 1 (ON), then we can set the size of the log file. This is an mmap area, and the WAL keeps rotating as it gets filled.
The default value (128MB) is good in most cases; however, if the buffer pool size is large (for larger servers), e.g. 64GB or more, then 256MB is a better choice for the WAL size.
Note that the WAL is an append only log, which ensures the durability of data once written, even if the data has not actually been flushed to the file system. Further, BangDB keeps checkpointing (if ON) and keeps reclaiming the logs so that they don't fill the disk on the server.
DAT_SIZE
This denotes the maximum size of data in KB. This applies only to normal key value or document data (for NORMAL and WIDE tables, see here for details). If the size is less than MAX_RESULTSET_SIZE (see below), then BangDB sets it to MAX_RESULTSET_SIZE. It can't be more than MAX_RESULTSET_SIZE.
However, for larger data (LARGE TABLE), we can deal with large data sizes, for example hundreds of MBs or GBs, up to a 20GB file/data.

KEY_SIZE
This is again a config param for a table and not for the db. It sets the default value for key size when not specified using TableEnv.
This should be set using the TableEnv type.

MAX_RESULTSET_SIZE
BangDB supports scan methods for running range queries. These scan methods return a ResultSet which holds the list of keys/vals/docs required by the query. MAX_RESULTSET_SIZE defines the max size of such resultsets.
Let's say there is 1.2GB of data in the db for a particular query. Returning all of it in a single shot would be very inefficient and also unnecessary. Therefore BangDB supports recursive scans and keeps returning data until everything is served or the user decides not to fetch more. MAX_RESULTSET_SIZE defines the size of the data returned in each call. However, the user can still configure each scan using ScanFilter. Check out more on query and ScanFilter.
KEY_COMP_FUNCTION_ID
Since BangDB arranges keys in order, it uses two key comparison methods: lexicographical (1) and quasi lexicographical (2). Default value is 2.

BANGDB_AUTOCOMMIT
When BangDB is run in transaction mode, if auto commit is off (0) then an explicit transaction is required (begin, commit/abort); otherwise an implicit non-transactional single op can be run in the usual manner. This can be set/unset whenever required.
This applies only to single operations. It is important because a single operation can have multiple sub operations which, if not committed atomically, may leave the DB in an inconsistent state. Therefore we should keep it ON (1). Note that, for multi-op atomicity, we must use an explicit transaction.
BANGDB_TRANSACTION_CACHE_SIZE
BangDB supports transactions using Optimistic Concurrency Control (OCC). OCC requires memory to be set aside for transaction related operations. BANGDB_TRANSACTION_CACHE_SIZE defines that size in memory.
Most of the time the default size is good enough, but if you are going to club many operations into a single transaction then the size should be increased. Note that BangDB supports many concurrent transactions, and that has little effect on this size; it mainly matters for a large number of operations in a single transaction.

TEXT_WORD_SIZE
BangDB supports reverse indexing, hence we need a maximum size for a token/word. TEXT_WORD_SIZE defines the same. The default is good from a logical perspective.

MAXTABLE
BangDB supports several thousands of tables; in fact, the number is only limited by the number of open file descriptors on the system (around 1M). But to optimise the running of BangDB, it is good to define this reasonably. The default value of 16384 is good; however, you may increase it as needed.

PAGE_SIZE_BANGDB
BangDB's page size can be configured. The default is 16KB, which is a good fit for most scenarios; however, you may increase or decrease it as needed.
Page size is defined per DB; once defined, it can't be changed subsequently unless a new db is created. Also, the page size is always a multiple of 8KB.
MASTER_LOG_BUF_SIZE
To maintain the WAL, the DB needs a masterlog for various housekeeping tasks. MASTER_LOG_BUF_SIZE is the size of the master log. The default 4MB is good for many cases; however, if you intend to have a DB of large size (a few TBs) then increase it. Typically, for a few TB of DB size, 4-16MB is good enough.

B. Configuration specific when BangDB runs as a server

The following are the configurations when BangDB runs as a server; hence these are server specific config params.

SERVER_TYPE
When BangDB runs as a server, it may run as master or slave. SERVER_TYPE defines whether it's master (0) or slave (1).
We can pass this as a command line arg as well when we run the server directly.

./bangdb-server-2.0 -i master
or 
./bangdb-server-2.0 -i slave

Or we can set the SERVER_TYPE param in the config file for the db; this is needed when we run bangdb using the script (bangdb-server).
ENABLE_REPLICATION
We can run BangDB Server with replication ON (1) or OFF (0). If OFF then slaves can't be attached.
We can do this with command line arg as well;
./bangdb-server-2.0 -r yes
or
./bangdb-server-2.0 -r no
SERVER_ID
This sets the ip address or name of the server.
We can do this with command line arg as well;
./bangdb-server-2.0 -s 127.0.0.1:10101
SERV_PORT
This sets the port of the server.
We can do this with command line arg as well;
./bangdb-server-2.0 -s 127.0.0.1:10101
MASTER_SERVER_ID
When a server is slave of another server, then we need to tell this server about the master.
This tells the server about the ip address of the master.
We can do this using command line arg as well;
./bangdb-server -m 127.0.0.1:10101
MASTER_SERV_PORT
When a server is slave of another server, then we need to tell this server about the master.
This tells the server about the port of the master.
We can do this using command line arg as well;
./bangdb-server -m 127.0.0.1:10101
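For example, a slave's bangdb.config pointing at the master from the earlier example would contain (addresses are illustrative):

```
SERVER_TYPE = 1
ENABLE_REPLICATION = 1
SERVER_ID = 192.168.1.6
SERV_PORT = 10101
MASTER_SERVER_ID = 192.168.1.5
MASTER_SERV_PORT = 10101
```

This is equivalent to running ./bangdb-server-2.0 -i slave -s 192.168.1.6:10101 -m 192.168.1.5:10101 directly.
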
MAX_SLAVES
This is for master, to set the limit for number of slaves

OPS_REC_BUF_SIZE
BangDB allows read/write operations to continue even when a slave is syncing with the master.
This happens using the ops record buffer while syncing with a slave is in progress.
OPS_REC_BUF_SIZE sets the size in MB for the ops record. The default is good for most cases.

PING_FREQ
Master and slaves check each other's liveness using UDP based ping pong. PING_FREQ sets the frequency of the ping pong.
Default value 10 sec is good enough, however you may increase or decrease the frequency as needed

PING_THRESHOLD
How many pings or pongs should fail before concluding that the other server is unreachable or down?
PING_THRESHOLD defines that. The default of 5 times in a row is good enough.
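With the defaults above, failure detection takes roughly PING_FREQ x PING_THRESHOLD = 10 x 5 = ~50 seconds. To detect failures faster, one might set (illustrative values):

```
PING_FREQ = 5
PING_THRESHOLD = 3
```

i.e. a peer is declared unreachable after about 15 seconds of silence.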

CLIENT_TIME_OUT
All clients connect to the server using TCP. The BangDB server handles tens of thousands of such concurrent connections. However, the user may define whether the server can time out connections on which no requests have been received for some period of time.
CLIENT_TIME_OUT defines the same in number of seconds. Default is 720 seconds

NUM_CONNECTIONS_IN_POOL
This is for clients only. It sets the number of connections with the server to be in the pool for performance and efficiency purposes.
Default is 48; however, you may increase it as needed, with no performance impact due to this.

SLAB_ALLOC_MEM_SIZE
BangDB Server uses pre allocated slabs for run time memory requirements. SLAB_ALLOC_MEM_SIZE defines the same in MB. default value of 256MB is good enough

TLS_IDENTITY
BangDB can run in secure mode as well and clients have to connect using the secure channel.
TLS_IDENTITY can be set (reset) by the user for security purpose

TLS_PSK_KEY
BangDB can run in secure mode as well and clients have to connect using the secure channel.
TLS_PSK_KEY can be set (reset) by the user for security purpose
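A sketch of the secure-mode settings; both values below are placeholders that must be replaced with your own identity and key:

```
TLS_IDENTITY = my-identity
TLS_PSK_KEY = 0123456789abcdef
```

Clients then have to present the same identity/key pair to connect over the secure channel.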

BANGDB_SYNC_TRAN
If set, BangDB will forcefully sync with the filesystem after a flush. Ideally it should be OFF (0), but in case of hard need you may set it ON (1).

BANGDB_SIGNAL_HANDLER_STATE
There are various signal handlers set already, but for a few extra ones the user may add handlers. Ideally not required, but the user may still switch them ON.

LISTENQ
Queue size for the listen() call, default 10000 is quite a good number

MAX_CLIENT_EVENTS
Maximum number of concurrent connections to the server.
The server can handle the default of 10000, but you may lower it as suitable.

SERVER_STAGE_OPTION
BangDB server implements SEDA (Staged Event Driven Architecture). Therefore, we can organise the whole processing in different ways. There are two options available, selectable by the user.
This tells the server how many stages to create to handle the clients and their requests. Two types of staging are supported as of now:
      1. two stages: one for handling clients and the other for handling the requests
      2. four stages: one for handling clients, one for read, one for ops and finally one for write
Note: the default is option 1, and it works well in most scenarios.
SERVER_OPS_WORKERS
If SERVER_STAGE_OPTION = 2, then this defines how many workers to allocate for db operations. The default 0 is fine.
Default 0 allows db to select the number of workers best suited for the given server configuration.

SERVER_READ_WORKERS
If SERVER_STAGE_OPTION = 2, then this can define how many workers to allocate for read (network). Default 0 is fine
Default 0 allows db to select the number of workers best suited for the given server configuration.

SERVER_WRITE_WORKERS
If SERVER_STAGE_OPTION = 2, then this can define how many workers to allocate for write (network). Default 0 is fine
Default 0 allows db to select the number of workers best suited for the given server configuration.
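
To switch to the four-stage arrangement while letting the DB size the worker pools itself, one might set:

```
SERVER_STAGE_OPTION = 2
SERVER_OPS_WORKERS = 0
SERVER_READ_WORKERS = 0
SERVER_WRITE_WORKERS = 0
```

Leaving the worker counts at 0 keeps the auto-selection described above.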

EXT_PROG_RUN_CHLD_PROCESS
For IE (information extraction) or ML/DL related activities, BangDB may run external code such as Python or C. This flag tells whether such external libs or code run in the same process or in a separate process, for safety purposes. The default is to run in a separate process.
0 runs it in a separate process; 1 allows the db to run it in the same process in case running in a separate process fails.

C. AI/ML related server and db configuration

Checkout this discussion on ML to know more on this

BRS_ACCESS_KEY
BangDB supports large data as well. This large data could be binary object data or files. While large object data is written into a LARGE_TABLE, the files are stored in BRS.
BRS stands for BangDB Resource Server. BRS is like S3 and supports similar concepts and APIs. BangDB can run as BRS or as DB + BRS, depending on configuration (as described below).
Users may create buckets and store files in these buckets. To access these buckets, the user may define the access key using this param.
The access key can also be defined using the request json when creating such buckets.

BRS_SECRET_KEY
As with BRS_ACCESS_KEY above, users may create buckets and store files in them. To access these buckets, the user may define the secret key using this param.
The secret key can also be defined using the request json when creating such buckets.

BRS_DATABASE_NAME
When BangDB runs as a separate BRS instance, it can have a different DB name; whereas if BRS runs as part of the DB, it shares the same database name as the DB.

BRS_SERVER_ID
When BangDB runs as a separate BRS instance, it has a different IP; whereas if BRS runs as part of the DB, it shares the DB's IP.
Using this param, you may set the server IP address accordingly.

BRS_SERVER_PORT
When BangDB runs as a separate BRS instance, it has a different port; whereas if BRS runs as part of the DB, it shares the DB's port.
Using this param, you may set the server port accordingly.

BRS_ML_BUCKET_NAME
This sets the default bucket that's created by the DB at the start, you may use this (along with the default access key and secret key) to store files in this bucket
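
A sketch of a combined DB + BRS setup where BRS shares the DB's identity; the keys and bucket name are placeholders to be replaced with your own:

```
BRS_ACCESS_KEY = my_access_key
BRS_SECRET_KEY = my_secret_key
BRS_DATABASE_NAME = mydb
BRS_SERVER_ID = 127.0.0.1
BRS_SERVER_PORT = 10101
BRS_ML_BUCKET_NAME = my_ml_bucket
```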

ML_TRAINING_SERVER_IP
BangDB can run as a separate ML training server or as part of the DB. When it runs as part of the DB, it shares the DB's IP; otherwise it has its own IP.
Using this param, you may set the IP of the training server accordingly.

ML_TRAINING_SERVER_PORT
BangDB can run as a separate ML training server or as part of the DB. When it runs as part of the DB, it shares the DB's port; otherwise it has its own port.
Using this param, you may set the Port of the training server accordingly.

ML_PRED_SERVER_IP
BangDB can run as a separate ML prediction server or as part of the DB. When it runs as part of the DB, it shares the DB's IP; otherwise it has its own IP.
Using this param, you may set the IP of the prediction server accordingly.

ML_PRED_SERVER_PORT
BangDB can run as a separate ML prediction server or as part of the DB. When it runs as part of the DB, it shares the DB's port; otherwise it has its own port.
Using this param, you may set the Port of the prediction server accordingly.

BANGDB_ML_SERVER_TYPE
This is to set up the ML cluster, including the BRS.
For any server, this param defines what type the server is as far as ML is concerned:
        0 - invalid [ default will be used - default is prediction server ]
        1 - Training Server [ no prediction will happen, it's a standalone training server ]
        2 - Prediction Server [ no training will happen, only for prediction ]
        3 - Hybrid - both train and predict at a single place
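
For a single-node setup that both trains and predicts locally, the ML params might look like this (illustrative; IP/port shared with the DB):

```
BANGDB_ML_SERVER_TYPE = 3
ML_TRAINING_SERVER_IP = 127.0.0.1
ML_TRAINING_SERVER_PORT = 10101
ML_PRED_SERVER_IP = 127.0.0.1
ML_PRED_SERVER_PORT = 10101
```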

TRAINING_PREDICT_FILES_LOC
During training or prediction, the DB keeps some files locally. This defines where the DB will keep those files.
The default is /tmp/BRS; however, you may change it as you like, but ensure that the DB has read/write access to the folder.

TRAIN_PRED_MEM_BUDGET
Since BangDB trains and predicts concurrently, it could hog memory as we do more of these operations, especially training.
Also, for performance reasons it keeps models loaded in memory. Therefore it is important that we put a limit on the memory it can use.
TRAIN_PRED_MEM_BUDGET sets the amount of memory (in MB) that ML can use. The loaded models are kept in an LRU list, and the DB auto loads or unloads them depending upon the usage pattern.
How the budget is used depends on the kind of server: for a training server the budget is used for training only, and similarly for a prediction server it is solely for prediction. However, if the server is running in hybrid mode, then it is shared by both, 50% each.

MAX_CONCURRENT_PRED_MODEL
How many models can be trained or kept in the LRU list; this param sets that number. The default of 32 is good for most scenarios; however, edit it as required.
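
For example, on a hybrid server, a budget set as below (illustrative, in MB) gives training and prediction roughly 2GB each, with up to 32 models resident in the LRU list:

```
TRAIN_PRED_MEM_BUDGET = 4096
MAX_CONCURRENT_PRED_MODEL = 32
```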

D. Advanced configuration to tune core of BangDB

The following are config params to tune the internal workings of core BangDB. Therefore we need to be really sure before editing these. Let's go through and understand these params as well.

PAGE_SPILT_FACTOR
BangDB uses a B+Tree, which keeps keys in sorted order. When a page splits, we need to transfer keys from one page to the other. This variable decides the split factor.
The simple rule is: if the ingestion of data is going to be mostly sequential (and not random) or semi sequential, then a higher value is better. Else keep the default.
As of now this applies to the entire db; however, it should be per table. It will be made table specific in an upcoming release.

LOG_FLUSH_FREQ
This is the frequency of log flush initiation. It's tuned for higher performance in general cases; however, you may play with the number and set what works best for you.

CHKPNT_ENABLED
This enables checkpointing of the WAL. 0 means no checkpointing, else yes. It's recommended to keep it ON, but for higher performance in certain cases you may turn it off.

CHKPNT_FREQ
If checkpointing is ON, then what's the frequency? Again, this is set for good performance in general; however, you may choose to edit it, experiment, and select the right value.

LOG_SPLIT_CHECK_FREQ
The WAL maintains an append only rolling log file. The DB keeps checking at a certain frequency whether the log file needs a split. The value is selected for higher performance for general use cases; however, you may experiment and pick the right value.
This should be lower if the ingestion rate is high.

LOG_RECLAIM_FREQ
BangDB generates WAL log files for durability and crash recovery, along with atomicity and transactions. However, it writes close to 2.2X-4X more data in the WAL log than the ingested data. This may result in a large amount of logs on the filesystem, which may cause disk-full scenarios, and the db could go down. To avoid this, BangDB keeps checking for and reclaiming log files not needed by the db, even in the case of DB crash and recovery. It's a very complex process but very important. Therefore, we should set this value properly to ensure the DB runs without filling the disk with log files.

LOG_RECLAIM_ACTION
This tells the DB what steps to take when it finds that WAL logs can be reclaimed.
      0 means don't do anything,
      1 means archive in reclaim folder
      2 means delete the log files

usually, 2 is good
LOG_RECLAIM_DIR
If LOG_RECLAIM_ACTION = 1, then this tells which directory the logs should be archived into. Ideally, when we wish to keep the log files and not delete them, the reclaim folder should be on a network or other disk with large capacity.
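
A common WAL housekeeping setup per the sections above: checkpointing ON and reclaimed logs deleted outright (illustrative; set LOG_RECLAIM_ACTION = 1 plus a LOG_RECLAIM_DIR if you want to archive instead):

```
CHKPNT_ENABLED = 1
LOG_RECLAIM_ACTION = 2
```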

BUF_FLUSH_RECLAIM_FREQ
This is for the buffer pool; it defines the frequency, in microseconds, of the buffer cache dirty page flusher and the buffer cache memory reclaimer.
Note that this is just a hint, and the db changes it as per need.

SCATTER_GATHER_MAX
Maximum number of pages to look at for scatter gather. Put 0 to select the system supported number (suggested); else put any number, but if it's more than the system supports, it will be corrected to the system supported one.
Ideally there is no need to change this.

MAX_NUM_TABLE_HEADER_SLOT
This has implications on the length of the chain of pages per slot. If there are too many tables (more than 10,000), then reduce this number a bit; else leave it at the default.
A higher number with a large number of tables would increase the memory overhead for the DB.

MIN_DIRTY_SCAN
How many pages to scan to find dirty pages. This is tuned for higher performance; however, change it as per your need after experimenting. Be sure before changing.

MIN_UPDATED_SCAN
How many pages to scan to find an updated page? Be sure before changing.

IDX_FLUSH_CONSTRAINT
This impacts the flushing of pages, it affects the core and hence only change when you are confident after experimentation

DAT_FLUSH_CONSTRAINT
This impacts the flushing of pages, it affects the core and hence only change when you are confident after experimentation

IDX_RECLAIM_CONSTRAINT
This impacts the flushing of pages, it affects the core and hence only change when you are confident after experimentation

DAT_RECLAIM_CONSTRAINT
This impacts the flushing of pages, it affects the core and hence only change when you are confident after experimentation

PAGE_WRITE_FACTOR
This, in a way, denotes how fast data is written; we should not change it unless confident after experimenting.

PAGE_READ_FACTOR
This, in a way, denotes how fast data is read; we should not change it unless confident after experimenting.

IDX_DAT_NORMALIZE
This normalises the idx vs dat pages, helpful when we favour one over other

PREFETCH_BUF_SIZE
The pre-fetch buffer max size, defined in MB. The DB treats this as the max limit for pre-fetching pages into the pool.

PREFETCH_SCAN_WINDOW_NUM
Size of window for prefetch scan

PREFETCH_EXTENT_NUM
The extent to which pages will be prefetched.