2 Replies Latest reply on Nov 29, 2007 2:55 AM by Guus der Kinderen

    Cluster member node doesn't join the cluster senior node

    Guus der Kinderen

      One particularly nasty problem has been keeping us busy all day. I'm hoping that someone can help us out, because we're running out of ideas on how to proceed.

       

      Our current Openfire setup (two cluster nodes running Openfire 3.4.1 with a couple of patches) runs fine in my test environment. Moving the exact same code to two other machines (which form another XMPP domain) gives us a strange problem: the first node starts up fine, but the second node fails to join the cluster and gives us this exception:

       

      2007-11-28 12:30:08.651 Oracle Coherence 3.3/387 <Info> (thread=pool-12-thread-1, member=n/a): Loaded operational configuration from resource "jar:file:/srv/openfire/plugins/enterprise/lib/coherence.jar!/tangosol-coherence.xml"
      2007-11-28 12:30:08.662 Oracle Coherence 3.3/387 <Info> (thread=pool-12-thread-1, member=n/a): Loaded operational overrides from resource "jar:file:/srv/openfire/plugins/enterprise/lib/coherence.jar!/tangosol-coherence-override-dev.xml"
      2007-11-28 12:30:08.665 Oracle Coherence 3.3/387 <Info> (thread=pool-12-thread-1, member=n/a): Loaded operational overrides from resource "file:/srv/openfire/enterprise/tangosol-coherence-override.xml"
       
      Oracle Coherence Version 3.3/387
       Grid Edition: Development mode
      Copyright (c) 2000-2007 Oracle. All rights reserved.
       
      0.1627200 secs]
      2007-11-28 12:30:09.537 Oracle Coherence GE 3.3/387 <Warning> (thread=pool-12-thread-1, member=n/a): UnicastUdpSocket failed to set receive buffer size to 1428 packets (2096304 bytes); actual size is 89 packets (131071 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.
      2007-11-28 12:30:09.876 Oracle Coherence GE 3.3/387 <D5> (thread=Cluster, member=n/a): Service Cluster joined the cluster with senior service member n/a
      2007-11-28 12:30:09.892 Oracle Coherence GE 3.3/387 <Error> (thread=Cluster, member=n/a): Assertion failed:
              at com.tangosol.coherence.component.net.Member.configure(Member.CDB:6)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.ClusterService$NewMemberAnnounceReply.onReceived(ClusterService.CDB:66)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.onMessage(Service.CDB:9)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.onNotify(Service.CDB:123)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.ClusterService.onNotify(ClusterService.CDB:3)
              at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:35)
              at java.lang.Thread.run(Thread.java:619)
       
      2007-11-28 12:30:09.892 Oracle Coherence GE 3.3/387 <Error> (thread=Cluster, member=n/a): Terminating ClusterService due to unhandled exception: com.tangosol.util.AssertionException
      2007-11-28 12:30:09.892 Oracle Coherence GE 3.3/387 <Error> (thread=Cluster, member=n/a):
      com.tangosol.util.AssertionException:
              at com.tangosol.coherence.Component._assertFailed(Component.CDB:12)
              at com.tangosol.coherence.Component._assert(Component.CDB:3)
              at com.tangosol.coherence.component.net.Member.configure(Member.CDB:6)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.ClusterService$NewMemberAnnounceReply.onReceived(ClusterService.CDB:66)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.onMessage(Service.CDB:9)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.onNotify(Service.CDB:123)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.ClusterService.onNotify(ClusterService.CDB:3)
              at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:35)
              at java.lang.Thread.run(Thread.java:619)
      2007-11-28 12:30:09.894 Oracle Coherence GE 3.3/387 <D5> (thread=Cluster, member=n/a): Service Cluster left the cluster
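
      As an aside, the receive-buffer warning in the log above can be reproduced outside Coherence. It is only a performance warning and may well be unrelated to the failed join. The following is a minimal sketch, not anything taken from Coherence itself: the class name BufferCheck is made up, and the 1468-byte packet size is an assumption derived from the log (2096304 bytes / 1428 packets).

```java
import java.net.DatagramSocket;

public class BufferCheck {
    /** Number of whole packets of packetBytes that fit in bufferBytes. */
    public static int packetsFor(int bufferBytes, int packetBytes) {
        return bufferBytes / packetBytes;
    }

    public static void main(String[] args) throws Exception {
        int requested = 2096304;  // size Coherence asked for, per the log
        int packetBytes = 1468;   // assumption: 2096304 bytes / 1428 packets

        // Bind to any free UDP port and ask the OS for the large buffer.
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setReceiveBufferSize(requested);
            // The OS treats the request as a hint; read back what it granted.
            int actual = socket.getReceiveBufferSize();
            System.out.println("requested " + requested + " bytes ("
                    + packetsFor(requested, packetBytes) + " packets), got "
                    + actual + " bytes ("
                    + packetsFor(actual, packetBytes) + " packets)");
        }
    }
}
```

      If the hosts run Linux, the ceiling is usually governed by the net.core.rmem_max kernel setting, so comparing that value on the test machines and the new machines is a cheap way to spot one environmental difference.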
      

       

       

      We've checked just about everything we could think of, including:

      • Firewalls are disabled;

      • We used udpcast to make sure that multicasting between the hosts works;

      • We've verified that (one) UDP packet arrives at the 'master' node every time we try to join the second node to the cluster;

      • JVMs are identical.
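
      For anyone who wants to repeat the multicast check from inside a JVM instead of with udpcast, here is a minimal sketch. The class name MulticastProbe, the group address 239.1.2.3 and port 45000 are placeholders, not values from our Coherence configuration; substitute whatever your cluster is actually configured to use.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

public class MulticastProbe {
    /** True if the given literal address is in the multicast range. */
    public static boolean isMulticast(String address) {
        try {
            return InetAddress.getByName(address).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    /** Joins the group and waits for one datagram or the timeout. */
    public static void listen(String group, int port, int timeoutMillis) throws Exception {
        try (MulticastSocket socket = new MulticastSocket(port)) {
            // null lets the OS pick the default network interface.
            socket.joinGroup(new InetSocketAddress(group, port), null);
            socket.setSoTimeout(timeoutMillis);
            DatagramPacket packet = new DatagramPacket(new byte[256], 256);
            try {
                socket.receive(packet);
                String text = new String(packet.getData(), 0, packet.getLength(),
                        StandardCharsets.UTF_8);
                System.out.println("Received '" + text + "' from " + packet.getAddress());
            } catch (SocketTimeoutException e) {
                System.out.println("Nothing received within " + timeoutMillis + " ms");
            }
        }
    }

    /** Sends one datagram to the group. */
    public static void send(String group, int port, String message) throws Exception {
        try (MulticastSocket socket = new MulticastSocket()) {
            socket.setTimeToLive(4); // allow a few router hops; placeholder value
            byte[] data = message.getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(data, data.length,
                    InetAddress.getByName(group), port));
        }
    }

    public static void main(String[] args) throws Exception {
        String group = "239.1.2.3"; // placeholder group address
        int port = 45000;           // placeholder port
        if (!isMulticast(group)) {
            throw new IllegalArgumentException(group + " is not a multicast address");
        }
        if (args.length > 0 && args[0].equals("send")) {
            send(group, port, "hello from " + InetAddress.getLocalHost().getHostName());
        } else {
            listen(group, port, 30000);
        }
    }
}
```

      Run it without arguments on one host to listen, then with the argument "send" on the other host, and repeat in the opposite direction. Note that this only exercises multicast; the log shows Coherence also opening a unicast UDP socket, which this sketch does not test.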

       

      Does anyone have suggestions?