Problematic states
Even though replication is optimized to be fast and to minimize the amount of data transmitted, a couple of problematic states can arise when an application or database node runs low on resources (CPU, RAM, disk space, or, very rarely, extreme network latency or other abnormal network conditions). These states cause delays and failures, and can leave the database nodes increasingly out of sync (desynchronized) over time.
Causes for data replication delay
A replication delay can occur if the local database or one of the remote databases is busy, or if the tables required by one sproc are locked by another sproc or by a database engine (maintenance) operation.
In these cases, the data that users create, update or remove may not be available in the web-based interface on the nodes where the change has not yet been applied.
The queue delay increases when a sproc's execution plan deadlocks and that sproc locks tables needed by other sprocs:
Usually, a sproc deadlock is detected by the database engine and reported back to the client, which reports it to iddb; iddb logs it and then retries. When that does not happen, or when something slows down processing on the database side, the sproc queues on all application nodes using that database as a backend start growing. Eventually the limit of running threads on the database engine is reached and the database client can no longer send the server other sprocs.
At that point, iddb queues them. In Manage the system > Maintenance > Database replication > <node>, the "Queue empty" value changes from the good value of "Yes" to "No". A good indication of how long this has been happening is the "Time since last queue item was processed (seconds)" value.
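As one illustration, an administrator who has scripted a way to collect the two values shown on that page could flag stalled queues automatically. The following Python sketch is hypothetical: the ReplicationStatus structure, the way readings are obtained, and the 300-second threshold are assumptions, not part of the product.

```python
# Hypothetical monitoring sketch: the readings must come from whatever means
# the administrator has to capture the per-node values shown under
# Manage the system > Maintenance > Database replication.
from dataclasses import dataclass

@dataclass
class ReplicationStatus:
    node: str
    queue_empty: bool             # the "Queue empty" value
    seconds_since_last_item: int  # "Time since last queue item was processed (seconds)"

def check_replication(statuses: list[ReplicationStatus],
                      max_queue_age_seconds: int = 300) -> list[str]:
    """Return warning messages for nodes whose iddb queue looks stalled."""
    warnings = []
    for status in statuses:
        if not status.queue_empty and status.seconds_since_last_item > max_queue_age_seconds:
            warnings.append(
                f"{status.node}: queue not empty and no item processed "
                f"for {status.seconds_since_last_item}s"
            )
    return warnings

if __name__ == "__main__":
    # Example with made-up readings.
    readings = [
        ReplicationStatus("node-a", queue_empty=True, seconds_since_last_item=2),
        ReplicationStatus("node-b", queue_empty=False, seconds_since_last_item=1800),
    ]
    for message in check_replication(readings):
        print("WARNING:", message)
```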
Replication failure
Replication failure occurs when the data on one node already lacks required dependencies from other tables. Executing those changes can then fail independently on any node, including the source node (see "desynchronized" below).
When the iddb service processes data as intended, sprocs that fail on a specific node (and are supposed to be sent for replication) are recorded in the instance's db\iddb-failed-procs*.log files:
These files are not rotated like the normal logs; if the files are not empty, they show up as critical errors in the healthcheck administrative interface dashboards on each node.
When the failed-procs files are not empty, administrators should review their contents as part of normal maintenance and then empty the files.
Those contents must be preserved outside the instance file structure for later analysis.
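A minimal sketch of that maintenance step, assuming it is scripted in Python; the instance path and archive location are placeholders that must be adapted to the actual deployment.

```python
# Illustrative maintenance sketch: copy any non-empty iddb-failed-procs*.log
# file out of the instance directory before emptying it. The paths below are
# placeholders, not the product's actual layout.
import shutil
import time
from pathlib import Path

INSTANCE_DB_DIR = Path(r"C:\path\to\instance\db")   # placeholder: the instance's db directory
ARCHIVE_DIR = Path(r"D:\replication-archive")        # must be outside the instance file structure

def archive_failed_procs() -> None:
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    for log_file in INSTANCE_DB_DIR.glob("iddb-failed-procs*.log"):
        if log_file.stat().st_size == 0:
            continue                                  # nothing to preserve
        # Preserve the contents for later analysis...
        shutil.copy2(log_file, ARCHIVE_DIR / f"{stamp}-{log_file.name}")
        # ...then empty the file so the healthcheck dashboard clears.
        log_file.write_text("")

if __name__ == "__main__":
    archive_failed_procs()
```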
If the failed sprocs affect the product's functionality, they may have to be re-played on the node where they failed, after correcting the situation that made them fail in the first place.
Each row in those text files contains: the timestamp of the occurrence, the sproc name, the module that called the sproc, and the (short, not helpful) summary of the error as returned to iddb from the database engine. Many failed sprocs during psupdate are irrelevant if they do not re-occur on subsequent runs, because Bravura Security Fabric is engineered to self-heal to some extent.
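Because only failures that re-occur are usually worth investigating, it can help to summarize a failed-procs file by sproc name. The sketch below assumes a tab-delimited row layout (timestamp, sproc name, calling module, error summary); verify the actual separator against a real file before relying on it.

```python
# Sketch for summarizing a failed-procs file; the field separator is an assumption.
from collections import Counter
from pathlib import Path

def count_failed_sprocs(log_path: Path) -> Counter:
    """Count failures per sproc name so recurring failures stand out."""
    counts: Counter = Counter()
    for line in log_path.read_text(encoding="utf-8", errors="replace").splitlines():
        if not line.strip():
            continue
        # Assumed layout: timestamp, sproc name, calling module, error summary.
        fields = line.split("\t")
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts

if __name__ == "__main__":
    for sproc, n in count_failed_sprocs(Path("iddb-failed-procs.log")).most_common():
        print(f"{n:5d}  {sproc}")
```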
These failures have to be addressed within seven days of occurrence, which is the default number of days the idmsuite.log is preserved. The Bravura Security Fabric log contains a more detailed description of the error as returned from the database engine. The administrator can search for "Native Client" (case-sensitively) to find any such records in the log. These entries detail the error, including what caused it, as best as the database engine can determine. Interpreting them requires experience with the product and knowledge of the database's schema and constraints.
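A minimal sketch of that search, assuming the idmsuite.log file is readable as text from the given path; the function name and output format are illustrative only.

```python
# Case-sensitive search for the "Native Client" marker described above.
from pathlib import Path

def find_native_client_errors(log_path: Path) -> list[str]:
    """Return idmsuite.log lines containing the case-sensitive marker 'Native Client'."""
    matches = []
    with log_path.open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if "Native Client" in line:   # case-sensitive match, as recommended
                matches.append(line.rstrip())
    return matches

if __name__ == "__main__":
    for entry in find_native_client_errors(Path("idmsuite.log")):
        print(entry)
```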
How to address failures
If the failed-procs files contain only a few rows, send them to support@bravurasecurity.com to create a ticket and have the developers decide on a solution.
The dbcmd utility can pass a sproc and its arguments to iddb; however, this is not recommended as a way to "replay" sprocs that failed previously. Sending the sprocs back to iddb could cause them to replicate again, which in some cases guarantees their failure on the nodes where the sproc already succeeded.
If too many of these sprocs have failed, either during a disastrous event or over time while no administrator addressed them, the database containing most of the common data will have to be propagated to the other nodes.
See the Migration use cases for various propagation methods.