This page last changed on Jan 29, 2009 by ivan@atlassian.com.

Cluster Panic

A cluster panic usually presents itself with the following error message:

FATAL [DefaultQuartzScheduler_Worker-4] [confluence.cluster.safety.ClusterPanicListener] handleEvent Fatal error in Confluence cluster: 
Database is being updated by an instance which is not part of the current cluster. You should check network connections between cluster nodes, especially multicast traffic.

In almost all cases, cluster panic events are caused by two or more instances of Confluence (in separate clusters) updating the same database. Such events are typically caused by one of the following issues:

A JVM pause (e.g. while swapping memory) can break communication between two nodes

Always watch the swapping activity of your server and avoid swapping due to lack of RAM. If there is not enough RAM available, your server may start swapping out some of Confluence's heap data to your hard disk. This will slow down the JVM's garbage collection (GC) considerably and affect Confluence's performance.

  • In clustered installations, swapping can lead to cluster panic. This is because swapping causes the JVM to pause during garbage collection, which in turn can break the inter-node communication required to keep the clustered nodes in sync.
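
As a rough way to keep an eye on swap usage, the sketch below reads the SwapTotal and SwapFree fields of /proc/meminfo. It assumes a Linux server; the function names are illustrative and not part of Confluence. Note that a non-zero value only shows that swap space is occupied. To see whether the server is actively swapping, a tool such as vmstat (the si and so columns) is a better indicator.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Field: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_kib(meminfo_text: str) -> int:
    """Return the amount of swap currently occupied (SwapTotal - SwapFree)."""
    info = parse_meminfo(meminfo_text)
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")  # Linux only (assumption)
    if meminfo.exists():
        print(f"Swap in use: {swap_used_kib(meminfo.read_text())} kB")
```

If the reported value keeps growing while Confluence is under load, the server probably needs more RAM or a smaller heap.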

Two instances of Confluence have been started in your application server

This is one of the most commonly encountered issues. The strangest case of this that we have seen so far involved a cloned image of a PC running Confluence that was later used in a remote office in a different city. The people using Confluence on the cloned instance were not aware that the original Confluence instance was also running and that both these Confluence instances were using the same production database server.

  • Solution: Check your application server's configuration to make sure that multiple copies of the application server are not running concurrently. Database transaction logs can help identify the location of other application servers, if client IP addresses are recorded along with each transaction.

Two copies of your application server are running

Sometimes starting an application server twice will result in two processes running, even though only one can be accessed over the network.

  • Solution: Check the list of running processes (for example, with the 'ps' command on POSIX-based operating systems such as Linux, Unix and Mac OS X) and make sure your application server is running only once.
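
The process check can be scripted. The following is a minimal sketch that runs 'ps -eo pid,args' and lists every process whose command line contains a keyword. The keyword "catalina" is an assumption that fits a Tomcat-based installation; substitute a string that identifies your own application server.

```python
import subprocess


def find_processes(ps_output: str, keyword: str) -> list:
    """Return (pid, command line) pairs from `ps -eo pid,args` output
    whose command line contains the keyword."""
    matches = []
    for line in ps_output.splitlines():
        fields = line.strip().split(None, 1)
        # The header line is skipped because its first field is not a number.
        if len(fields) == 2 and fields[0].isdigit() and keyword in fields[1]:
            matches.append((int(fields[0]), fields[1]))
    return matches


def list_running(keyword: str) -> list:
    """Run ps and return the matching processes on this machine."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    return find_processes(result.stdout, keyword)


if __name__ == "__main__":
    # "catalina" is a hypothetical keyword for a Tomcat installation.
    for pid, command in list_running("catalina"):
        print(pid, command)
```

More than one line of output suggests that the application server has been started more than once on this machine.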

Networking failure between nodes in the cluster
  • Solution: Check that multicast traffic is being transmitted successfully, and that the network between your nodes is low-latency (<100 ms).
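
One way to check whether multicast packets travel between two nodes is a small probe script such as the sketch below: run listen_once on one node and send_probe on another. The group address and port shown are placeholders, not the values Confluence uses; replace them with the multicast address and port of your own cluster. The sketch only tests delivery and does not measure latency.

```python
import socket

GROUP = "239.192.0.1"  # hypothetical multicast group, replace with your own
PORT = 45678           # hypothetical port, replace with your own


def build_membership_request(group: str) -> bytes:
    """Build the 8-byte ip_mreq value: group address + any local interface."""
    return socket.inet_aton(group) + socket.inet_aton("0.0.0.0")


def listen_once(group: str, port: int, timeout_s: float = 10.0):
    """Join the group and wait for one packet; return None on timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", port))
        sock.setsockopt(
            socket.IPPROTO_IP,
            socket.IP_ADD_MEMBERSHIP,
            build_membership_request(group),
        )
        sock.settimeout(timeout_s)
        try:
            return sock.recvfrom(1024)  # (payload, sender address)
        except socket.timeout:
            return None
    finally:
        sock.close()


def send_probe(group: str, port: int, payload: bytes = b"cluster-probe", ttl: int = 1):
    """Send one multicast packet; ttl=1 keeps it on the local subnet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(payload, (group, port))
    finally:
        sock.close()
```

If listen_once returns None on one node while send_probe runs on another, multicast traffic is probably being dropped somewhere between the two (for example by a switch, router or firewall). Increase ttl if the nodes are on different subnets.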

Database server stops responding

If Coherence fails to retrieve the SafetyNumber from the database, the comparison will fail. If it fails to update the SafetyNumber, the next comparison, 30 seconds later, will fail.
Many things can cause this, including a scheduled shutdown for backups, network failure, a filled-up transaction-log partition and a changed password on the account used by Confluence to connect to the database.
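
To illustrate why a database outage looks the same as a foreign instance writing to the database, here is a simplified model of the check described above. It is a sketch only and is not Confluence's actual code: the two dictionaries stand in for the database and the cluster's shared copy of the number, and the assumption that the database value is compared against a copy held by the cluster is ours.

```python
import random

CHECK_INTERVAL_SECONDS = 30  # interval between comparisons, as stated above


class ClusterPanic(Exception):
    """Raised when the database value does not match the cluster's copy."""


def safety_check(database: dict, cluster_copy: dict) -> None:
    """Compare the stored safety number with the cluster's copy, then renew it.

    Both arguments are plain dicts used as stand-ins (assumption).
    """
    expected = cluster_copy.get("safety_number")
    stored = database.get("safety_number")  # None models a failed retrieval

    # Skip the comparison on the very first run, when nothing was written yet.
    if expected is not None and stored != expected:
        raise ClusterPanic(f"expected {expected}, found {stored}")

    # Write a fresh number to both places for the next comparison.
    new_number = random.getrandbits(31)
    database["safety_number"] = new_number
    cluster_copy["safety_number"] = new_number
```

In this model a missing value (retrieval failed), a stale value (the last update never reached the database) and a value written by another instance all produce the same mismatch, which is why each of them ends in a cluster panic.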

  • Solution: Resolve the problem with the database (or network), then restart Confluence.

In all cases, when starting Confluence after a cluster panic, you must ensure all cluster nodes have been shut down completely. If necessary, use commands like ps and kill to get a list of Java processes and terminate them manually.
Please visit this document for troubleshooting advice if you encounter any of the above situations.
Document generated by Confluence on Nov 05, 2009 23:26