Data Replication Queue
Whenever an event (such as a password change, login, or request submission) occurs on a Bravura Security Fabric server, information about the event is stored in the local database instance dedicated to that server. The information is then replicated to all other Bravura Security Fabric servers, each of which has its own physically distinct database instance.
This replication happens in real time as long as the Bravura Security Fabric database services are connected to the respective Bravura Security Fabric servers.
If a given Bravura Security Fabric server (henceforth called originator) cannot contact the database service on another Bravura Security Fabric server (henceforth called replica), the update is queued on the originator’s file system in queue files.
If real-time replication stops and the queue starts to fill, the originator Bravura Security Fabric server will stop functioning when the queue is full. This is because the alternative would be to write updates to the originator’s database without replicating them to other servers, creating a single point of failure in the system.
Queue settings
When database replication is unavailable, replication data is queued until the queue file has reached its size limit, or the node comes back online. You can modify:
Receive queue configuration for this node
Send queue configuration for all nodes replicating to this node
You can set the queue limit to a fixed size, using the following settings:
Minimum queue size - default is 100MB
Maximum queue size - for fixed size limit
Queue usage warning threshold (%): - default is 60%
You can set the queue limit to a percentage of disk, using the following parameters:
Minimum queue size - default is 100MB
Maximum usage before queue stops growing (%): - default is 90%
Disk usage warning threshold (%): - default is 60%
It is best practice to set these variables the same on all Bravura Security Fabric servers. After making changes, you must restart the Database Service or click Propagate and reload replication configuration on all servers on the replication configuration page.
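As a rough illustration of how the two limit modes relate to their warning thresholds, the following Python sketch classifies a queue or disk as ok, warning, or full. It is not product code. The function names, the decimal definition of a megabyte, and the reading of each threshold (a percentage of the maximum queue size in fixed mode, a percentage of total disk capacity in percentage mode) are assumptions made for illustration. Only the 60% and 90% defaults come from the settings listed above.

```python
MB = 1_000_000  # assumption: decimal megabytes


def classify(used: float, total: float, warn_pct: float, stop_pct: float) -> str:
    """Return "ok", "warning", or "full" for a usage level.

    used / total is compared against the two percentage thresholds.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    used_pct = 100.0 * used / total
    if used_pct >= stop_pct:
        return "full"
    if used_pct >= warn_pct:
        return "warning"
    return "ok"


def fixed_mode_status(queue_bytes: int, max_queue_bytes: int,
                      warn_pct: float = 60.0) -> str:
    # Assumption: in fixed mode the queue is full at 100% of the maximum size.
    return classify(queue_bytes, max_queue_bytes, warn_pct, 100.0)


def disk_mode_status(disk_used_bytes: int, disk_total_bytes: int,
                     warn_pct: float = 60.0, stop_pct: float = 90.0) -> str:
    # Assumption: in percentage mode the thresholds apply to overall disk usage.
    return classify(disk_used_bytes, disk_total_bytes, warn_pct, stop_pct)


if __name__ == "__main__":
    print(fixed_mode_status(75 * MB, 100 * MB))
    print(disk_mode_status(450_000 * MB, 500_000 * MB))
```

Running the same check with identical thresholds on every server mirrors the recommendation to keep these settings the same across all nodes.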
During the replication process, the replicating servers attempt to send data to another server. If they fail to connect, they retry after 30 seconds until they succeed, or the queue reaches its limit. Once the limit is reached, the sending server suspends database commits, and stops accepting requests.
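The retry behavior described above can be modeled with a short Python sketch. This is a simplified model for reasoning about the process, not the product's implementation: the function name, the callback parameters, and the return values are all hypothetical. Only the 30-second retry interval and the stop-at-limit rule come from the text.

```python
import time
from typing import Callable


def retry_until_sent_or_full(
    try_connect: Callable[[], bool],      # hypothetical: True when the replica is reachable
    queued_bytes: Callable[[], int],      # hypothetical: current size of the queue files
    limit_bytes: int,
    retry_seconds: float = 30.0,          # retry interval stated in the text
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "resumed" once a connection succeeds, or "suspended" when the queue is full."""
    while True:
        if try_connect():
            return "resumed"
        if queued_bytes() >= limit_bytes:
            # Corresponds to the server suspending database commits.
            return "suspended"
        sleep(retry_seconds)


if __name__ == "__main__":
    # Simulated outage: the replica is unreachable and the queue keeps growing.
    sizes = iter([40, 80, 120])
    outcome = retry_until_sent_or_full(
        try_connect=lambda: False,
        queued_bytes=lambda: next(sizes),
        limit_bytes=100,
        sleep=lambda seconds: None,  # skip real waiting in the demo
    )
    print(outcome)
```

Passing `sleep` as a parameter keeps the model testable without waiting for real 30-second intervals.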
DB COMMIT SUSPEND mode
The server state when the queue is full is called DB COMMIT SUSPEND mode. One symptom of this mode is that every Bravura Security Fabric UI page shows the following message:
Changes to this instance are temporarily disallowed. Please contact the Bravura Security Fabric administrator. Due to a problem in the replication environment all pages except Database replication and System logs are temporarily disabled.
When the server is in DB COMMIT SUSPEND mode, only superusers can log into Bravura Security Fabric. They have access only to the System logs and Database replication pages.
This state triggers the DB COMMIT SUSPEND exit trap. To allow the server to return to normal replication mode, adjust the queue limit, or free some disk space if the limit is set as a percentage of disk. The return to normal replication mode triggers the DB COMMIT RESUME exit trap.
Caution
When the replication queue files have exceeded their limit, the superuser’s actions to fix the problem may be logged only on the local system, which can lead to security problems and servers that are out of sync. You may have to manually update other systems with the session logs created at the time.
Event actions (exit traps)
You can configure Bravura Security Fabric to send email warnings, or some other notification, to administrators when replication failures occur.
Replication event action options can be accessed by navigating to Manage the system > Maintenance > System variables or Manage the system > Maintenance > Options.
By default, some events are configured to run the pxnull interface program with the pxnull-replication script, located in the \<instance>\script\ directory, to notify users when the server has attempted to reconnect to the downed server a certain number of times. The script uses the settings from the Email configuration page and contacts the account in the RECIPIENT EMAIL field. You must install the pxnull program with the Connector Pack.
You can also send a warning email to administrators in the event of a short-lived replication problem, by configuring the DB REPLICATION CONN FAILURE event action (Manage the system > Maintenance > Options). The advantage of detecting short-lived replication problems is that administrators can react more quickly to problems. The disadvantage is that there may be many spurious replication problems, due to a busy server rather than any underlying problem in the environment, and these may generate too many emails. A carefully configured exit trap can check for a threshold number of problems before sending an email. See the sample script below.
Sample Replication Warning Event Configuration
function REPLICATION_FAILURE
{
  var $retryKey;
  var $retryVal;
  var $retryThreshold = 50; # Default limit is 256 tries every minute before database is suspended.
  for( var $i = 0; $i < size( $sessdat ); $i = $i + 1 )
  {
    $retryKey = keyAt($sessdat, $i);
    $retryVal = valAt($sessdat, $i);
    if (strcmp($retryKey, "retry")==0)
    {
      if ($retryVal > $retryThreshold)
      {
        # Add code to generate and send email notifications, and open
        # monitoring tickets here
        log ("replication-failure-pxnull.cfg: Send emails about failure: " + $retryVal);
      }
      else
      {
        # Threshold has not been reached. Do nothing.
      }
    }
  }
  return 0;
}
Time available to fix problems
The time interval between when replication stops (for example, due to a network outage or hardware problem) and when the originator server becomes non-functional depends on the rate of data updates:
| Event | Size in queue (bytes) | Approximate capacity of a 100MB queue (events) |
|---|---|---|
| Interactive login to Bravura Security Fabric | 5850 | 17000 |
| Checkout a single password | 3228 | 31000 |
| A single password update to a target system | 6400 | 15600 |
The time available to resolve a problem before Bravura Security Fabric functionality fails depends on the frequency of logins, password changes on target systems, and passwords being checked out.
Bravura Security recommends that you periodically estimate the time available to resolve problems based on current metrics and the current queue size.
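One way to make such an estimate is sketched below in Python. The per-event sizes are taken from the table above. The function names, the event keys, and the hourly event rates in the example are hypothetical placeholders, and the sketch assumes a 100MB queue means 100,000,000 bytes, which is consistent with the rounded capacities in the table. Substitute rates measured on your own servers.

```python
# Per-event queue sizes in bytes, from the table above.
EVENT_BYTES = {
    "login": 5850,
    "checkout": 3228,
    "password_update": 6400,
}

QUEUE_BYTES = 100_000_000  # assumption: 100MB counted as decimal megabytes


def events_that_fit(queue_bytes: int, event_bytes: int) -> int:
    """Number of events of one type that fit in the queue."""
    return queue_bytes // event_bytes


def hours_until_full(queue_bytes: int, events_per_hour: dict) -> float:
    """Estimated hours before the queue fills at the given event rates."""
    bytes_per_hour = sum(
        EVENT_BYTES[name] * rate for name, rate in events_per_hour.items()
    )
    if bytes_per_hour <= 0:
        return float("inf")  # no updates, so the queue never fills
    return queue_bytes / bytes_per_hour


if __name__ == "__main__":
    # Hypothetical load: replace with rates observed in your environment.
    rates = {"login": 1000, "checkout": 500, "password_update": 200}
    print(f"{hours_until_full(QUEUE_BYTES, rates):.1f} hours until the queue is full")
```

The estimate covers only the three event types in the table, so other replicated activity shortens the real window. Treat the result as an upper bound.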