
Single Bravura Security Fabric server fails

Users may report that they cannot log in to the failed server at all (no login page appears) and that other servers display the following error message:

Changes to this instance are temporarily disallowed. Please contact the Bravura Security Fabric administrator. Due to a problem in the replication environment all pages except Database replication and System logs are temporarily disabled.

The DB COMMIT SUSPEND event (Manage the system > Maintenance > Options) is configured by default to send an email as soon as the Bravura Security Fabric server has entered this state.

What stops working

What continues to work

Possible Causes

Data loss

Resolution

What stops working

  • Users can no longer log into this server.

  • Users can no longer retrieve passwords from this server.

  • This server can no longer push password updates to target systems for which it is responsible.

  • Other servers detect that they cannot replicate to this server, so they start queuing updates for it and display alarm messages warning that they will stop functioning normally when the queue fills.

  • If the queue is allowed to fill – which could take several hours to several days, depending on activity level and queue size – other servers will suspend services; users will be unable to log in (since logins are logged in a replicated fashion) and will be unable to check out passwords.

    Effectively, the entire system will go into an alert state until the dead server is repaired or removed from replication. The entire system will eventually switch to a DB COMMIT SUSPEND state if a repair is not made before replication queues on the other servers fill.
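As a rough illustration of the "several hours to several days" window, the time until a queue fills can be estimated by dividing the remaining queue capacity by the rate at which queued updates accumulate. The sketch below is a back-of-the-envelope calculation only. The function name, the units, and the example figures are assumptions made for illustration, not values or interfaces taken from Bravura Security Fabric.

```python
def hours_until_queue_full(capacity_mb: float, used_mb: float,
                           growth_mb_per_hour: float) -> float:
    """Estimate hours until a replication queue fills at a constant growth rate.

    All parameters are hypothetical inputs that an administrator would
    measure or look up; none of them are read from the product.
    """
    if capacity_mb <= 0:
        raise ValueError("capacity_mb must be positive")
    if not 0 <= used_mb <= capacity_mb:
        raise ValueError("used_mb must be between 0 and capacity_mb")
    if growth_mb_per_hour <= 0:
        # A queue that is not growing never fills.
        return float("inf")
    return (capacity_mb - used_mb) / growth_mb_per_hour


# Hypothetical figures: 2048 MB queue, 512 MB already used, growing 64 MB/hour.
print(hours_until_queue_full(2048, 512, 64))  # (2048 - 512) / 64 = 24.0
```

Real activity is rarely constant, so an estimate like this is best treated as an upper bound on the time available and recalculated as the queue grows.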

What continues to work

Other servers continue to function normally, unless their replication queues reach their limit.

If the replication queue fills on the other servers, they switch to DB COMMIT SUSPEND mode. At that point, the only possible action is to remove the non-functional server from replication.

Possible Causes

A single Bravura Security Fabric server can fail for a variety of reasons, including:

  • Hardware problem, such as a disk crash.

  • Operating system problem, such as a bug or a full disk.

  • Application problem, such as a bug or a misconfiguration.

Data loss

Typically none. Because of an unavoidable race condition, minimal data loss is possible if updates on target systems had not yet been committed to the database when the damaged server went offline.

Resolution

Fix the failed server if it can be done in time. See Time available to fix problems. Other servers will continue to function in the meantime. See Troubleshooting Bravura Security Fabric server failures for fixes to some possible failures.

If the server cannot be fixed quickly or is permanently damaged, remove it from the replication configuration on other servers promptly, as described in Removing a node from replication.

If the failed server can be recovered (for example, by installing new hardware), synchronize the node with the already-running replicated nodes, using the process described in Synchronizing a new node with an existing set of Bravura Security Fabric replicas.

If the failed server was acting as the primary, it may be necessary to promote one of the secondary nodes so that it can initiate resynchronization. Update the list of scheduled jobs so that the most up-to-date replica acts as the primary, then resynchronize the replacement node. Once the replacement has been confirmed as functional, it can be promoted to primary in the same way.
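The resolution steps above amount to a short decision flow, which can be sketched as follows. This is a planning aid only: the function name, its three yes/no inputs, and the step wording are hypothetical and do not correspond to any Bravura Security Fabric command or API.

```python
from typing import List


def plan_recovery(fixable_in_time: bool, recoverable: bool,
                  was_primary: bool) -> List[str]:
    """Return the ordered recovery steps for a single failed server.

    Hypothetical helper that mirrors the prose above; it performs no
    actions and only lists what an administrator would do, in order.
    """
    if fixable_in_time:
        # Other servers keep working while the repair is made.
        return ["fix the failed server"]

    steps = ["remove the failed server from replication on other servers"]
    if was_primary:
        # A secondary must act as primary before it can drive a resync.
        steps.append("promote the most up-to-date replica and update scheduled jobs")
    if recoverable:
        steps.append("synchronize the recovered node with the running replicas")
        if was_primary:
            steps.append("once confirmed functional, promote the replacement to primary")
    return steps


# Example: the primary failed, cannot be fixed quickly, but can be rebuilt.
for number, step in enumerate(plan_recovery(False, True, True), start=1):
    print(f"{number}. {step}")
```

Promoting the replacement back to primary is listed last because the text treats it as an optional final step, taken only after the replacement has been confirmed as functional.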