1 Reply Latest reply on May 13, 2016 10:55 AM by Nhan Pham T.

    Many problems with Cluster / Hazelcast plugin

    CSH

      Hello,

       

      we have an Openfire 3.10.3 cluster (with HazelCast 2.1.2 plugin) with 4 nodes running on Linux behind a load balancer (HAProxy). Unfortunately, it happens quite often, that our servers crash unexpectedly about once a week :-(. Users can't login anymore, Openfire does not respond, messages get lost, etc...

      Our current solution is to restart all Openfire nodes, if we or a customer detect the errorneous behavior.

       

      Crawling through the log files, I've found many HazelCast errors. Usually it starts with an OperationTimeoutException:

       

      (here on server 172.25.3.64, which seems to get no response from 172.25.3.63)

       

       

      2016.01.26 08:12:43 org.jivesoftware.openfire.handler.IQHandler - Interner Serverfehler
      com.hazelcast.core.OperationTimeoutException: No response for 60000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:mapService', op=GetOperation{Routing Users Cache}, partitionId=95, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=30000, target=Address[172.25.3.63]:5701, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0
          at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
          at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:277)
          at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
          at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
          at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:435)
          at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:254)
          at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:82)
          at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache.get(ClusteredCache.java:92)
          at org.jivesoftware.util.cache.CacheWrapper.get(CacheWrapper.java:121)
          at org.jivesoftware.openfire.spi.RoutingTableImpl.getRoutes(RoutingTableImpl.java:805)
          at org.jivesoftware.openfire.SessionManager.getSessions(SessionManager.java:893)
          at org.jivesoftware.openfire.handler.IQPrivacyHandler.updateOrCreateList(IQPrivacyHandler.java:356)
          at org.jivesoftware.openfire.handler.IQPrivacyHandler.handleIQ(IQPrivacyHandler.java:124)
          at org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:65)
          at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:380)
          at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:123)
          at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:76)
          at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:108)
          at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:69)
          at org.jivesoftware.openfire.http.HttpSession.sendPendingPackets(HttpSession.java:655)
          at org.jivesoftware.openfire.http.HttpSession$HttpPacketSender.run(HttpSession.java:1275)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)
          at ------ End remote and begin local stack-trace ------.(Unknown Source)
          at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:384)
          at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
          at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
          at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
          at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:435)
          at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:254)
          at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:82)
          at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache.get(ClusteredCache.java:92)
          at org.jivesoftware.util.cache.CacheWrapper.get(CacheWrapper.java:121)
          at org.jivesoftware.openfire.spi.RoutingTableImpl.getRoutes(RoutingTableImpl.java:805)
          at org.jivesoftware.openfire.SessionManager.getSessions(SessionManager.java:893)
          at org.jivesoftware.openfire.handler.IQPrivacyHandler.updateOrCreateList(IQPrivacyHandler.java:356)
          at org.jivesoftware.openfire.handler.IQPrivacyHandler.handleIQ(IQPrivacyHandler.java:124)
          at org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:65)
          at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:380)
          at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:123)
          at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:76)
          at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:108)
          at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:69)
          at org.jivesoftware.openfire.http.HttpSession.sendPendingPackets(HttpSession.java:655)
          at org.jivesoftware.openfire.http.HttpSession$HttpPacketSender.run(HttpSession.java:1275)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)
      

       

      The followed by:

       

      2016.01.26 08:13:48 com.hazelcast.concurrent.lock.operations.UnlockOperation - [172.25.3.64]:5701 [openfire] [3.5.1] Current thread is not owner of the lock! -> Owner: 10bfd686-5e66-4d0f-bb77-d1c28f4f143b, thread-id: 2251
      java.lang.IllegalMonitorStateException: Current thread is not owner of the lock! -> Owner: 10bfd686-5e66-4d0f-bb77-d1c28f4f143b, thread-id: 2251
          at com.hazelcast.concurrent.lock.operations.UnlockOperation.ensureUnlocked(UnlockOperation.java:71)
          at com.hazelcast.concurrent.lock.operations.UnlockOperation.unlock(UnlockOperation.java:65)
          at com.hazelcast.concurrent.lock.operations.UnlockOperation.run(UnlockOperation.java:57)
          at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:137)
          at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:309)
          at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.processPacket(OperationThread.java:142)
          at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.process(OperationThread.java:115)
          at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.doRun(OperationThread.java:101)
          at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.run(OperationThread.java:76)
      

       

      Later:

       

      2016.01.26 08:26:17 org.jivesoftware.openfire.handler.PresenceUpdateHandler - Interner Serverfehler. Triggered by packet: 
      <presence from="..." to="..." type="unavailable"/>
      com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:mapService', op=GetOperation{Roster}, partitionId=59, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=6, callTimeout=30000, target=Address[172.25.3.63]:5701, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0
      

       

      Followed by:

       

      2016.01.26 08:26:26 org.jivesoftware.openfire.spi.RoutingTableImpl - Primary packet routing failed
      java.lang.IllegalArgumentException: Requested node 10bfd686-5e66-4d0f-bb77-d1c28f4f143b not found in cluster
          at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:307)
          at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:574)
          at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:46)
          at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:363)
          at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:239)
          at org.jivesoftware.openfire.roster.Roster.broadcastPresence(Roster.java:609)
          at org.jivesoftware.openfire.handler.PresenceUpdateHandler.broadcastUpdate(PresenceUpdateHandler.java:308)
          at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:162)
          at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:137)
          at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:201)
          at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:148)
          at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:84)
          at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84)
          at org.jivesoftware.openfire.SessionManager$ClientSessionListener.onConnectionClose(SessionManager.java:1242)
          at org.jivesoftware.openfire.nio.NIOConnection.notifyCloseListeners(NIOConnection.java:261)
          at org.jivesoftware.openfire.nio.NIOConnection.close(NIOConnection.java:243)
          at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:182)
      

       

      And finally resulting in:

       

      2016.01.26 08:27:53 org.jivesoftware.openfire.handler.PresenceUpdateHandler - Interner Serverfehler. Triggered by packet: 
      <presence from="..." to="..." type="unavailable"/>
      com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
          at com.hazelcast.spi.AbstractDistributedObject.throwNotActiveException(AbstractDistributedObject.java:81)
          at com.hazelcast.spi.AbstractDistributedObject.lifecycleCheck(AbstractDistributedObject.java:76)
          at com.hazelcast.spi.AbstractDistributedObject.getNodeEngine(AbstractDistributedObject.java:70)
          at com.hazelcast.map.impl.proxy.MapProxyImpl.unlock(MapProxyImpl.java:239)
          at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache.unlock(ClusteredCache.java:189)
          at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$ClusterLock.unlock(ClusteredCacheFactory.java:465)
          at org.jivesoftware.openfire.handler.PresenceUpdateHandler.broadcastUnavailableForDirectedPresences(PresenceUpdateHandler.java:480)
          at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:163)
          at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:137)
          at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:201)
          at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:148)
          at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:84)
          at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84)
          at org.jivesoftware.openfire.SessionManager.removeSession(SessionManager.java:1135)
          at org.jivesoftware.openfire.plugin.util.cache.ClusterListener.cleanupNode(ClusterListener.java:277)
          at org.jivesoftware.openfire.plugin.util.cache.ClusterListener.memberRemoved(ClusterListener.java:684)
          at com.hazelcast.cluster.impl.ClusterServiceImpl.dispatchEvent(ClusterServiceImpl.java:1422)
          at com.hazelcast.cluster.impl.ClusterServiceImpl.dispatchEvent(ClusterServiceImpl.java:116)
          at com.hazelcast.spi.impl.eventservice.impl.LocalEventDispatcher.run(LocalEventDispatcher.java:63)
          at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:190)
          at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:174)
      

       

       

      At the same time server 172.25.3.63 gets similar errors, when targeting 172.25.3.64:

       

      2016.01.26 08:12:42 org.jivesoftware.openfire.spi.RoutingTableImpl - Primary packet routing failed
      com.hazelcast.core.OperationTimeoutException: No response for 60000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:mapService', op=GetOperation{Routing Users Cache}, partitionId=168, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=30000, target=Address[172.25.3.64]:5701, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0
          at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
      

       

      Even a weird one like (how can the domain be null suddenly?):

       

      2016.01.26 08:26:43 org.jivesoftware.openfire.handler.IQHandler - Interner Serverfehler
      java.lang.NullPointerException: Domain cannot be null
          at org.xmpp.packet.JID.<init>(JID.java:512)
          at org.xmpp.packet.JID.<init>(JID.java:496)
          at org.jivesoftware.openfire.SessionManager.userBroadcast(SessionManager.java:1066)
          at org.jivesoftware.openfire.handler.IQPrivacyHandler.updateOrCreateList(IQPrivacyHandler.java:369)
          at org.jivesoftware.openfire.handler.IQPrivacyHandler.handleIQ(IQPrivacyHandler.java:124)
          at org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:65)
          at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:380)
          at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:123)
          at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:76)
          at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:108)
          at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:69)
          at org.jivesoftware.openfire.http.HttpSession.sendPendingPackets(HttpSession.java:655)
          at org.jivesoftware.openfire.http.HttpSession$HttpPacketSender.run(HttpSession.java:1275)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)
      

       

      After these errors, the whole cluster is unusable and we have to restart erverything.

       

       

      I've also found these from another crash:

       

      2015.12.18 15:48:25 com.hazelcast.spi.impl.operationservice.impl.Invocation - [172.25.3.63]:5701 [openfire] [3.5.1] While asking 'is-executing': Invocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.impl.operation.QueryPartitionOperation{serviceName='null', partitionId=238, callId=0, invocationTime=1450445461399, waitTimeout=-1, callTimeout=30000}, partitionId=238, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=30000, target=Address[172.25.3.62]:5701, backupsExpected=0, backupsCompleted=0}
      com.hazelcast.core.MemberLeftException: Member [172.25.3.62]:5701 has left cluster!
          at com.hazelcast.spi.impl.operationservice.impl.InvocationRegistry$OnMemberLeftTask.run(InvocationRegistry.java:398)
          at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:209)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)
          at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
          at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
          at ------ End remote and begin local stack-trace ------.(Unknown Source)
          at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:384)
          at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.access$200(InvocationFuture.java:51)
          at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture$1.run(InvocationFuture.java:132)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)
          at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
          at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
      


       

      2016.01.21 15:52:12 com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation - [172.25.3.63]:5701 [openfire] [3.5.1] Op: com.hazelcast.concurrent.lock.operations.LockOperation{serviceName='null', partitionId=186, callId=3887632, invocationTime=1453387902620, waitTimeout=-2310001, callTimeout=30000}, Error: Cannot send response: CallTimeoutResponse{callId=3887632, urgent=false} to Address[172.25.3.65]:5701
      com.hazelcast.core.HazelcastException: Cannot send response: CallTimeoutResponse{callId=3887632, urgent=false} to Address[172.25.3.65]:5701
          at com.hazelcast.spi.impl.ResponseHandlerFactory$RemoteInvocationResponseHandler.sendResponse(ResponseHandlerFactory.java:131)
          at com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation.run(WaitingOperation.java:139)
          at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:137)
          at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.processOperation(OperationThread.java:154)
          at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.process(OperationThread.java:110)
          at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.doRun(OperationThread.java:101)
          at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.run(OperationThread.java:76)
      

       

      as well as this one in the info.log

       

      2016.01.21 15:52:11 com.hazelcast.core.LifecycleService - [172.25.3.65]:5701 [openfire] [3.5.1] Address[172.25.3.65]:5701 is SHUTTING_DOWN
      2016.01.21 15:52:11 com.hazelcast.instance.Node - [172.25.3.65]:5701 [openfire] [3.5.1] Shutting down connection manager...
      

       

      Before such a crash, the CPU also goes high as far as I can tell (couldn't verify, that it's hight for every crash).

       

      The above scenario happens too often, about every week...

       

      Please help! Do you have any ideas, what might be the cause or what we can try to prevent the crashes? Thanks!