This page last changed on Jan 21, 2009 by ivan@atlassian.com.

Introduction

Confluence 2.3 introduced a mechanism to ensure database consistency when running multiple cluster nodes against the same database. This is called the cluster safety mechanism, and it is designed to ensure that your wiki cannot become inconsistent because updates made by one user are not visible to another. A failure of this mechanism is a fatal error in Confluence and is called a cluster panic.

Because the cluster safety mechanism helps prevent data inconsistency whenever two copies of Confluence are running against the same database, it is enabled in all instances of Confluence, not just clusters.

How cluster safety works

A scheduled task, ClusterSafetyJob, runs every 30 seconds in Confluence. In a cluster, this job is run only on one of the nodes. The scheduled task operates on a safety number – a randomly generated number that is stored both in the database and in the distributed cache used across a cluster. It does the following:

  1. Generate a new random number
  2. If a safety number already exists in both the database and the cache, compare them.
  3. If the numbers differ, publish a ClusterPanicEvent. Currently in Confluence, this causes the following to happen:
    • disable all access to the application
    • disable all scheduled tasks
    • update the database safety number to a new value, which will cause all nodes accessing the database to fail.
  4. If the numbers are the same or aren't set yet, update the safety numbers:
    • set the safety number in the database to the new random number
    • set the safety number in the cache to the new random number.
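The steps above can be sketched in plain Java. This is a hypothetical illustration of the logic, not Confluence's actual code; the `dbSafetyNumber` and `cacheSafetyNumber` fields stand in for the real database row and distributed cache entry:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the ClusterSafetyJob check described above.
public class ClusterSafetySketch {
    // Stand-ins for the safety number in the database and in the cluster cache.
    static Long dbSafetyNumber = null;
    static Long cacheSafetyNumber = null;
    static boolean panicked = false;

    static void runSafetyCheck() {
        // 1. Generate a new random number.
        long fresh = ThreadLocalRandom.current().nextLong();

        // 2. Compare the existing numbers, if both are already set.
        if (dbSafetyNumber != null && cacheSafetyNumber != null
                && !dbSafetyNumber.equals(cacheSafetyNumber)) {
            // 3. Numbers differ: cluster panic. Access and scheduled tasks
            //    would be disabled here; the database number is also rolled
            //    forward so every node reading it fails its next check.
            panicked = true;
            dbSafetyNumber = fresh;
            return;
        }

        // 4. Numbers match (or are unset): roll both forward together.
        dbSafetyNumber = fresh;
        cacheSafetyNumber = fresh;
    }
}
```

Running the check repeatedly on one node never panics, because the two numbers move in lockstep; a panic only occurs once some other writer has moved the database number independently of the cache.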

How to fix it

Cluster Panic

A cluster panic usually presents itself with the following error message:

FATAL [DefaultQuartzScheduler_Worker-4] [confluence.cluster.safety.ClusterPanicListener] handleEvent Fatal error in Confluence cluster: 
Database is being updated by an instance which is not part of the current cluster. You should check network connections between cluster nodes, especially multicast traffic.

In almost all cases, cluster panic events are caused by two or more instances of Confluence (in separate clusters) updating the same database. Such events are typically caused by one of the following issues:

A JVM pause (e.g. while swapping memory) can break communication between nodes

Always watch the swapping activity of your server and avoid swapping due to lack of RAM. If there is not enough RAM available, your server may start swapping out some of Confluence's heap data to your hard disk. This will slow down the JVM's garbage collection (GC) considerably and affect Confluence's performance.

  • In clustered installations, swapping can lead to cluster panic. This is because swapping causes the JVM to pause during garbage collection, which in turn can break the inter-node communication required to keep the clustered nodes in sync.
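One common way to observe such stalls is a watchdog thread that sleeps for a short, fixed interval and measures how late it wakes up; a large overshoot usually means the whole JVM was paused by GC or swapping. A minimal sketch (a hypothetical helper, not part of Confluence):

```java
// Hypothetical pause watchdog: measures how far the JVM overshoots a
// requested sleep. Large overshoots suggest the whole process was stalled
// by garbage collection or by the OS swapping heap pages to disk.
public class PauseWatchdog {
    public static long measureOvershootMs(long intervalMs) throws InterruptedException {
        long start = System.nanoTime();          // monotonic clock, unaffected by wall-time changes
        Thread.sleep(intervalMs);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return Math.max(0, elapsedMs - intervalMs);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("overshoot ms: " + measureOvershootMs(100));
    }
}
```

On a healthy, unswapped host the overshoot stays in the low milliseconds; overshoots comparable to the 30-second safety-check interval are exactly the kind of pause that can trigger a panic.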
Two instances of Confluence have been started in your application server

This is one of the most commonly encountered issues. The strangest case of this that we have seen so far involved a cloned image of a PC running Confluence that was later used in a remote office in a different city. The people using Confluence on the cloned instance were not aware that the original Confluence instance was also running and that both these Confluence instances were using the same production database server.

  • Solution: Check your application server's configuration to make sure that multiple copies of the application server are not running concurrently. Database transaction logs can help identify the location of other application servers, if client IP addresses are recorded along with each transaction.
Two copies of your application server are running

Sometimes starting an application server twice will result in two processes running, even though only one can be accessed over the network.

  • Solution: Check a list of running processes (for example, with the 'ps' command in Posix-based operating systems like Linux, Unix and Mac OS X) and make sure your application server is only running once.
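On Java 9 and later you can perform a similar check from code with the `ProcessHandle` API. This sketch counts visible processes whose command path mentions `java` — a rough heuristic, shown only as an illustration alongside `ps`:

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: list pids of processes whose command path contains "java".
// Matching on the string "java" is a rough heuristic, like grepping ps output.
public class JavaProcessLister {
    public static List<Long> javaPids() {
        return ProcessHandle.allProcesses()
                .filter(p -> p.info().command()
                        .map(cmd -> cmd.contains("java"))
                        .orElse(false))
                .map(ProcessHandle::pid)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        javaPids().forEach(pid -> System.out.println("java process: " + pid));
    }
}
```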
Networking failure between nodes in the cluster
  • Solution: Check that multi-cast traffic is being transmitted successfully, and that the network between your nodes is low-latency (<100 ms).
Database server stops responding

If Coherence fails to retrieve the SafetyNumber from the database, the comparison will fail. If it fails to update it, the next comparison will fail, 30 seconds later.
Many things can cause this, including a scheduled shutdown for backups, network failure, a filled-up transaction-log partition and a changed password on the account used by Confluence to connect to the database.

  • Solution: resolve the problem with the database (or network), then restart Confluence.
In all cases, when starting Confluence after a cluster panic, you must ensure all cluster nodes have been shut down completely. If necessary, use commands like ps and kill to get a list of Java processes and terminate them manually.
Please visit this document for troubleshooting advice if you encounter any of the above situations.

Technical details

The cluster safety number in the database is stored in the CLUSTERSAFETY table. This table contains just one row, which holds the current safety number.

Document generated by Confluence on Nov 05, 2009 23:34