Confluence Docs 3.0 : Cluster safety mechanism
This page last changed on Jan 21, 2009 by ivan@atlassian.com.
Introduction
Confluence 2.3 and later include a mechanism to ensure database consistency when multiple cluster nodes run against the same database. This is called the cluster safety mechanism, and it is designed to ensure that your wiki cannot become inconsistent because updates by one user are not visible to another. A failure of this mechanism is a fatal error in Confluence and is called cluster panic. Because the cluster safety mechanism helps prevent data inconsistency whenever any two copies of Confluence are running against the same database, it is enabled in all instances of Confluence, not just clusters.

How cluster safety works
A scheduled task, ClusterSafetyJob, runs every 30 seconds in Confluence. In a cluster, this job runs on only one of the nodes. The scheduled task operates on a safety number, a randomly generated number that is stored both in the database and in the distributed cache used across a cluster. It does the following:

How to fix it

Cluster Panic
Cluster panic usually presents itself with the following error message:

    FATAL [DefaultQuartzScheduler_Worker-4] [confluence.cluster.safety.ClusterPanicListener] handleEvent Fatal error in Confluence cluster: Database is being updated by an instance which is not part of the current cluster. You should check network connections between cluster nodes, especially multicast traffic.

In almost all cases, cluster panic events are caused by two or more instances of Confluence (in separate clusters) updating the same database. Such events are typically caused by one of the following issues:

A JVM pause (e.g. while swapping memory) can break communication between two nodes
Always watch the swapping activity of your server and avoid swapping due to lack of RAM. If there is not enough RAM available, your server may start swapping out some of Confluence's heap data to your hard disk. This will slow down the JVM's garbage collection (GC) considerably and affect Confluence's performance.

Two instances of Confluence have been started in your application server
This is one of the most commonly encountered issues. The strangest case of this that we have seen so far involved a cloned image of a PC running Confluence that was later used in a remote office in a different city. The people using Confluence on the cloned instance were not aware that the original Confluence instance was also running and that both Confluence instances were using the same production database server.

Two copies of your application server are running
Sometimes starting an application server twice will result in two processes running, even though only one can be accessed over the network.

Networking failure between nodes in the cluster

Database server stops responding
If Coherence fails to retrieve the SafetyNumber from the database, the comparison will fail. If it fails to update the number, the next comparison will fail 30 seconds later.
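
The first cause in the list above recommends watching the swapping activity of your server. As a rough aid, the sketch below reads how much swap space is currently in use on a Linux host. It assumes the standard /proc/meminfo fields SwapTotal and SwapFree; the function name swap_used_kb is illustrative and is not part of Confluence.

```python
def swap_used_kb(meminfo_text):
    """Return swap currently in use, in the kB units reported by /proc/meminfo.

    meminfo_text is the full text of /proc/meminfo (Linux only).
    """
    values = {}
    for line in meminfo_text.splitlines():
        # Each line looks like "SwapTotal:       2097148 kB".
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values["SwapTotal"] - values["SwapFree"]


# Hypothetical usage on a Linux server:
#     with open("/proc/meminfo") as f:
#         print("swap in use (kB):", swap_used_kb(f.read()))
```

Swap space in use is only a coarse indicator. A value that keeps growing while Confluence is running suggests the server is short of RAM, whereas the swap-in and swap-out rates (for example the si and so columns of vmstat) show whether the machine is actively swapping right now.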

Technical details
The cluster safety number in the database is stored in the CLUSTERSAFETY table. This table has just one row: the current safety number.
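
To make the mechanism concrete, here is a minimal simulation of one run of the safety check. It is a sketch, not Confluence's implementation: it assumes, based on the description on this page, that each run compares the database copy of the safety number with the cache copy and then writes a fresh random number to both. Plain dictionaries stand in for the CLUSTERSAFETY table and the distributed cache, and the names run_safety_check and ClusterPanic are hypothetical.

```python
import random


class ClusterPanic(Exception):
    """Raised when the database and cache safety numbers disagree."""


def run_safety_check(database, cache, rng=random):
    """Simulate one run of the safety job (assumed behaviour, see lead-in)."""
    db_number = database.get("safety_number")
    cache_number = cache.get("safety_number")

    # Assumption: the comparison only happens once both copies hold a number.
    if db_number is not None and cache_number is not None:
        if db_number != cache_number:
            raise ClusterPanic(
                "Database is being updated by an instance which is not "
                "part of the current cluster."
            )

    # Pick a new number that differs from the current one, then store it
    # in both places so the next run can compare them.
    new_number = rng.randrange(1, 2**31)
    while new_number == db_number:
        new_number = rng.randrange(1, 2**31)
    database["safety_number"] = new_number
    cache["safety_number"] = new_number
    return new_number


if __name__ == "__main__":
    shared_db = {}             # one database ...
    cache_a, cache_b = {}, {}  # ... used by two instances with separate caches

    run_safety_check(shared_db, cache_a)  # instance A writes a number
    run_safety_check(shared_db, cache_b)  # instance B overwrites it
    try:
        run_safety_check(shared_db, cache_a)  # A finds a number it did not write
    except ClusterPanic as panic:
        print("cluster panic:", panic)
```

In this model, two instances that share a database but not a cache trigger the panic on the first check after the other instance has written its number, which matches the most common cause listed earlier on this page.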

Document generated by Confluence on Nov 05, 2009 23:34