Data Replication Queue

Whenever an event, such as a password change, login, or request submission, occurs on a single Bravura Security Fabric server, information about the event is stored on the local database instance dedicated to that Bravura Security Fabric server. It is replicated to all other Bravura Security Fabric servers, each with their own physically distinct database instance.

This replication happens in real time as long as the Bravura Security Fabric database services are connected to the respective Bravura Security Fabric servers.

If a given Bravura Security Fabric server (henceforth called originator) cannot contact the database service on another Bravura Security Fabric server (henceforth called replica), the update is queued on the originator’s file system in queue files.

If real-time replication stops and the queue starts to fill, the originator Bravura Security Fabric server will stop functioning when the queue is full. This is because the alternative would be to write updates to the originator’s database without replicating them to other servers, creating a single point of failure in the system.

Queue settings

When database replication is unavailable, replication data is queued until the queue file has reached its size limit, or the node comes back online. You can modify:

  • Receive queue configuration for this node

  • Send queue configuration for all nodes replicating to this node

You can set the queue limit to a fixed size, using the following settings:

  • Minimum queue size - default is 100MB

  • Maximum queue size - for fixed size limit

  • Queue usage warning threshold (%): - default is 60%

You can set the queue limit to a percentage of disk, using the following parameters:

  • Minimum queue size - default is 100MB

  • Maximum usage before queue stops growing (%): - default is 90%

  • Disk usage warning threshold (%): - default is 60%

It is best practice to set these variables the same on all Bravura Security Fabric servers. After making changes, you must restart the Database Service or click Propagate and reload replication configuration on all servers on the replication configuration page.

During the replication process, the replicating servers attempt to send data to another server. If they fail to connect, they retry after 30 seconds until they succeed, or the queue reaches its limit. Once the limit is reached, the sending server suspends database commits, and stops accepting requests.
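The relationship between queue usage, the warning threshold, and the queue limit can be pictured with a small model. This is only an illustrative sketch in Python, not product code: the function name `queue_state`, its parameters, and the three state labels are hypothetical, and the 60% default simply mirrors the warning threshold default listed above.

```python
def queue_state(used_bytes: int, limit_bytes: int, warn_pct: float = 60.0) -> str:
    """Classify a replication queue by how full it is (illustrative model only)."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    if used_bytes >= limit_bytes:
        # Queue is full: per the text above, the sending server suspends
        # database commits and stops accepting requests.
        return "suspend"
    if used_bytes * 100 >= warn_pct * limit_bytes:
        # Usage has crossed the warning threshold but the queue still has room.
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical usage figures against a 100 MB (10**8 byte) limit.
    for used in (50_000_000, 60_000_000, 100_000_000):
        print(used, queue_state(used, 100_000_000))
```

A monitoring script built on the same idea could raise an alert in the "warning" state, well before the server stops accepting requests.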

DB COMMIT SUSPEND mode

The server state when the queue is full is called DB COMMIT SUSPEND mode. One symptom of this mode is that every Bravura Security Fabric UI page shows the following message:

Changes to this instance are temporarily disallowed. Please contact the Bravura Security Fabric administrator. Due to a problem in the replication environment all pages except Database replication and System logs are temporarily disabled.

When the server is in DB COMMIT SUSPEND mode, only superusers can log into Bravura Security Fabric. They have access only to the System logs and Database replication pages.

This state triggers the DB COMMIT SUSPEND exit trap. To return the server to normal replication mode, adjust the queue limit or, when the limit is set as a percentage of disk, free some disk space. The return to normal replication mode triggers the DB COMMIT RESUME exit trap.

Caution

When the replication queue files have exceeded their limit, the superuser’s actions to fix the problem may be logged only on the local system, which can lead to security problems and to servers that are out of sync. You may have to manually update other systems with the session logs created at the time.

Event actions (exit traps)

You can configure Bravura Security Fabric to send email warnings, or some other notification, to administrators when replication failures occur.

Replication event action options can be accessed by navigating to Manage the system > Maintenance > System variables or Manage the system > Maintenance > Options.

By default, some events are configured to run the pxnull interface program with the pxnull-replication script, located in the \<instance>\script\ directory, to notify users when the server has attempted to reconnect to the downed server a certain number of times. The script uses the settings from the Email configuration page and contacts the account in the RECIPIENT EMAIL field. You must install the pxnull program with the Connector Pack.

You can also send a warning email to administrators in the event of a short-lived replication problem, by configuring the DB REPLICATION CONN FAILURE event action (Manage the system > Maintenance > Options). The advantage of detecting short-lived replication problems is that administrators can react more quickly to problems. The disadvantage is that there may be many spurious replication problems, due to a busy server rather than any underlying problem in the environment, and these may generate too many emails. A carefully configured exit trap can check for a threshold number of problems before sending an email. See the sample script below.

Sample Replication Warning Event Configuration

function REPLICATION_FAILURE
{
  var $retryKey;
  var $retryVal;
  var $retryThreshold = 50;  # Default limit is 256 tries every minute before database is suspended.
  for( var $i = 0; $i < size( $sessdat ); $i = $i + 1 )
  {
    $retryKey = keyAt($sessdat, $i);
    $retryVal = valAt($sessdat, $i);
    if (strcmp($retryKey, "retry")==0)
    {
      if ($retryVal > $retryThreshold)
      {
        # Add code to generate and send email notifications, and open
        # monitoring tickets here
        log ("replication-failure-pxnull.cfg: Send emails about failure: " + $retryVal);
      }
      else
      {
        # Threshold has not been reached. Do nothing.
      }
    }
  }
  return 0;
}

Time available to fix problems

The time interval between when replication stops – for example, due to network outage or hardware problem – and when the originator server becomes non-functional, depends on the rate of data updates:

Event                                          Size in queue   Capacity of 100MB queue (events)
Interactive login to Bravura Security Fabric   5850 bytes      17000
Checkout a single password                     3228 bytes      31000
A single password update to a target system    6400 bytes      15600
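The capacity figures above follow from dividing the queue size by the per-event size. The short Python sketch below reproduces that arithmetic. It assumes 100MB means 10^8 bytes, which is the reading that matches the rounded figures; the names `QUEUE_BYTES`, `EVENT_SIZES`, and `capacity` are illustrative and not part of the product.

```python
# Assumption: "100MB" is taken as 10**8 bytes; this matches the rounded
# capacities quoted in the table.
QUEUE_BYTES = 100_000_000

# Per-event sizes in bytes, copied from the table above.
EVENT_SIZES = {
    "interactive login": 5850,
    "password checkout": 3228,
    "password update on target system": 6400,
}


def capacity(queue_bytes: int, event_bytes: int) -> int:
    """Number of whole events of a given size that fit in the queue."""
    return queue_bytes // event_bytes


if __name__ == "__main__":
    for name, size in EVENT_SIZES.items():
        print(f"{name}: about {capacity(QUEUE_BYTES, size)} events")
```

Real workloads mix all three event types, so these numbers are upper bounds for a queue that receives only one kind of event.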

The time available to resolve a problem before Bravura Security Fabric functionality fails depends on the frequency of logins, password changes on target systems, and passwords being checked out.

Bravura Security recommends that you periodically estimate the time available to resolve problems based on current metrics and the current queue size.
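One way to make this estimate is to convert the observed event rates into bytes queued per hour and divide the remaining queue space by that figure. The Python sketch below assumes the per-event sizes listed earlier in this section; the function name `hours_until_full` and the hourly rates in the example are hypothetical placeholders and should be replaced with measurements from your own environment.

```python
def hours_until_full(free_queue_bytes: int,
                     events_per_hour: dict[str, float],
                     event_bytes: dict[str, int]) -> float:
    """Estimate hours until the queue fills, given hourly event rates."""
    # Total bytes added to the queue per hour across all event types.
    bytes_per_hour = sum(rate * event_bytes[name]
                         for name, rate in events_per_hour.items())
    if bytes_per_hour <= 0:
        return float("inf")  # nothing is being queued
    return free_queue_bytes / bytes_per_hour


if __name__ == "__main__":
    # Per-event sizes in bytes, from the table in this section.
    sizes = {"login": 5850, "checkout": 3228, "update": 6400}
    # Hypothetical hourly rates; substitute your own metrics.
    rates = {"login": 500, "checkout": 200, "update": 100}
    # Assumes an empty 100MB queue, taken as 10**8 bytes.
    estimate = hours_until_full(100_000_000, rates, sizes)
    print(f"Estimated time available: {estimate:.1f} hours")
```

If the queue is already partly full, pass only the remaining free space as the first argument, since that is what determines the time left.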