Over-large replication queue disk usage

When multiple Bravura Security Fabric nodes are connected in replication and one or more nodes suffer a prolonged outage, a non-outage node may end up with a much larger replication queue than normal, even after the outage itself is resolved. This creates a risk that the outage spreads to otherwise unaffected nodes if the queues grow too large and free disk space becomes too scarce.

After recovering from a prolonged outage involving other nodes in replication, non-outage nodes may show higher than normal disk space usage. If left unchecked, this may eventually lead to a message in idmsuite.log such as: “Commits to the database have been suspended”, indicating that the node refuses to allow any more operations until more disk space is made available.
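To confirm whether a node has already reached this state, you can search idmsuite.log for the suspension message. The Python sketch below is illustrative only; the log location is an assumption and should be replaced with the actual path for your instance.

```python
# Minimal sketch: search idmsuite.log for the commit-suspension message.
# The log path is an assumption -- substitute your instance's actual path.
from pathlib import Path

LOG_PATH = Path(r"C:\Program Files\Bravura Security\Bravura Security Fabric\Logs\idmsuite.log")
SUSPEND_MESSAGE = "Commits to the database have been suspended"

with LOG_PATH.open("r", encoding="utf-8", errors="replace") as log:
    for line in log:
        if SUSPEND_MESSAGE in line:
            print(line.rstrip())
```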

To resolve this problem:

  1. Check for large queues and available space on all replicated nodes.

  2. Increase the space available.

    Optionally remove unused replication queue data.

  3. If increasing the space is not possible, adjust the replication space settings.

After completing these steps, replication should proceed with free space on disk above the configured threshold.
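Throughout these steps, a quick check of free space on the installation drive helps confirm whether the configured threshold is at risk. The sketch below is a minimal illustration using Python's standard library; the drive letter and warning floor are assumptions, so use the drive hosting your instance and a value that matches your configured threshold.

```python
# Minimal sketch: warn when free space on the instance's installation drive
# falls below a chosen floor. Both values are assumptions -- use the drive
# hosting your instance and a floor that matches your configured threshold.
import shutil

INSTALL_DRIVE = "C:\\"
MIN_FREE_BYTES = 10 * 1024**3  # 10 GiB, illustrative only

usage = shutil.disk_usage(INSTALL_DRIVE)
print(f"Free: {usage.free / 1024**3:.1f} GiB of {usage.total / 1024**3:.1f} GiB")

if usage.free < MIN_FREE_BYTES:
    print("WARNING: free space is low; check replication queue sizes on this node.")
```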

Verify queues and their settings

  1. Navigate to Manage the system > Maintenance > Database replication.

  2. Select a node showing high disk space usage.

  3. Under the node’s Status tab, inspect each queue for the size of its Allocated space in the queue (bytes) entry relative to its Queue usage warning threshold (%) entry.

  4. If any queue’s allocated size exceeds its threshold size, that queue has become over-large.

  5. Verify the number of files in the instance's db\replication\ directory, and their timestamps.

    If there are too many files, Windows File Explorer may crash or fail to update correctly; in that case, use another tool such as robocopy, or a small script like the one sketched below.
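As a lightweight alternative to File Explorer, a short script can report the file count, total size, and oldest/newest timestamps in the replication directory. The sketch below is illustrative; the directory path is an assumption and should point at your instance's db\replication directory.

```python
# Minimal sketch: report the number, total size, and oldest/newest timestamps
# of files in the instance's db\replication directory. The path below is an
# assumption -- point it at your own instance.
from datetime import datetime
from pathlib import Path

REPLICATION_DIR = Path(r"C:\Program Files\Bravura Security\Bravura Security Fabric\db\replication")

files = [f for f in REPLICATION_DIR.iterdir() if f.is_file()]
if not files:
    print("No queue files found.")
else:
    stats = [f.stat() for f in files]
    total_mib = sum(s.st_size for s in stats) / 1024**2
    mtimes = sorted(s.st_mtime for s in stats)
    print(f"{len(files)} files, {total_mib:.1f} MiB total")
    print(f"Oldest: {datetime.fromtimestamp(mtimes[0])}")
    print(f"Newest: {datetime.fromtimestamp(mtimes[-1])}")
```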

Increasing space

If possible, Bravura Security Fabric should be installed onto its own, dedicated drive.

To maximize available free disk space, limit or remove unnecessary files on the Bravura Security Fabric instance’s installation drive.

  1. Provide more space or remove unneeded files.

    If an outage is prolonged, Bravura Security Fabric administrators should monitor, more closely than usual, both the disk space usage on the partitions where the Bravura Security Fabric application is installed and the replication queue sizes on all non-outage nodes.

    If it appears that disk space will become an issue on those other nodes due to queue growth, either add disk space to the active nodes where the queues keep growing, or remove the unavailable (outage) nodes from replication and restore them manually once they have recovered.

  2. Remove unneeded/leftover replication queue files.

    If a node’s queue files are still large after an outage affecting other nodes is resolved, perform a manual cleanup of the queue files as follows (a scripted version of this check is sketched after this list):

    • Check the timestamps of the files in the instance's db\replication directory

      1. If they are old (dated from before the start time of the iddb service on that node), they can probably be deleted; first check whether they are in use by iddb itself or some other binary.

      2. Use Sysinternals Process Explorer or any other tool that allows you to check file handles per process.

      3. Check if the running iddb process for the current instance has file handles on all the files in db\replication.

      4. If it doesn't, delete the files not being used by iddb.

    • (Optional) Re-check that the queue file allocation is now below the configured threshold.
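The manual check above can also be scripted. The sketch below uses the third-party psutil package to find the running iddb process, note its start time and open file handles, and report queue files that are both older than the process and not held open by it. The process name, directory path, and use of psutil are assumptions; the script only reports candidates and deletes nothing, so review its output before removing any files.

```python
# Minimal sketch of the manual cleanup check (hypothetical paths and names;
# verify against your instance before deleting anything).
# Requires the third-party psutil package.
from datetime import datetime
from pathlib import Path

import psutil

# Assumed locations -- substitute your instance's db\replication directory
# and the actual name of the iddb service binary.
REPLICATION_DIR = Path(r"C:\Program Files\Bravura Security\Bravura Security Fabric\db\replication")
IDDB_PROCESS_NAME = "iddb.exe"

# Find the running iddb process, then note its start time and open file handles.
iddb = next((p for p in psutil.process_iter(["name"])
             if (p.info["name"] or "").lower() == IDDB_PROCESS_NAME), None)
if iddb is None:
    raise SystemExit(f"{IDDB_PROCESS_NAME} is not running on this node")

iddb_start = iddb.create_time()
open_paths = {Path(f.path) for f in iddb.open_files()}

# Report queue files that predate the iddb process and are not held open by it.
for queue_file in sorted(REPLICATION_DIR.iterdir()):
    if not queue_file.is_file():
        continue
    stat = queue_file.stat()
    if stat.st_mtime < iddb_start and queue_file not in open_paths:
        print(f"Candidate for deletion: {queue_file.name} "
              f"(modified {datetime.fromtimestamp(stat.st_mtime)})")
```

Note that reading another process's open file handles typically requires administrative privileges; if the handle list comes back empty, re-run the script elevated or fall back to Process Explorer as described above.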

To prevent the instance from running out of disk space in the first place, ensure that e-mail notifications are working and that a suitable e-mail address (or addresses) is configured in the RECIPIENT EMAIL variable, under Manage the system > Workflow > E-mail configuration.

Root cause

During an outage limited to one node, non-outage nodes queue database updates to disk until contact with the outage node is re-established. If the outage is prolonged, those queues will eventually start to grow. With enough time, free disk space will keep shrinking until the free-space threshold is reached. At that point, the node enters DB COMMIT SUSPEND mode (to safeguard against data loss), making it unusable until the issue is addressed. This is normal, documented behavior.

When the iddb service stops unexpectedly (for example, it crashes or the server it runs on loses power), some of the old queue files may remain in place and continue to consume space even though they are no longer used.