0 Replies Latest reply on Jul 6, 2017 5:30 AM by Daniel Hams

    Issues with 2 node cluster + S2S to other domain (git HEAD)

    Daniel Hams

      Dear Devs,

       

      We are attempting to set up a test cluster of two nodes with a third host talking to the cluster via S2S.

       

      When running the two nodes as a standalone cluster, XMPP clients connected to the cluster behave as expected when the node they are connected to goes down: they receive a disconnect, and after they rejoin everything works normally.

       

      We've encountered issues when a separate XMPP host talks to the cluster via S2S: when the cluster node that terminates the S2S connection goes down, communication is left dangling or lost. When we then attempt to send further "groupchat" messages (causing the creation of new S2S connections), we are in a bad state.

       

      Example scenario:

       

      • Client 1 Spark connects as dan@xmpp.domain to "testroom@conference.xmpp.domain" -> directed to cluster node lh01.xmpp.domain
      • Client 2 Spark connects as test@dh01.standalone.domain to "testroom@conference.xmpp.domain" -> direct connection to host, S2S connection created to lh01.xmpp.domain via the load balancer.

       

      At this point, both clients see each other in the room and can exchange group chat messages.

       

      • Halt of lh01.xmpp.domain node

       

      The server shuts down, the cluster promotes the junior to senior (lh02) and Client 1 Spark is forced to reconnect - and reconnects successfully to the room. No other participants are visible in the room.

       

      Client 2 Spark does not receive any notice or visible indication that an error has occurred. The logs of "dh01.standalone.domain" show the disconnection of the S2S connection.

       

      When typing further messages in Client 2 Spark, the following error stanza is received:

       

      <message id="62YYc-88" to="test@dh01.standalone.domain/Spark" from="testroom@conference.xmpp.domain" type="error">
        <body>qut</body>
        <thread>7Df031</thread>
        <error code="406" type="MODIFY">
          <not-acceptable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/>
        </error>
        <x xmlns="jabber:x:event">
          <offline/>
          <delivered/>
          <displayed/>
          <composing/>
        </x>
      </message>
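For anyone triaging similar reports from captured traffic, here is a minimal Python sketch that pulls the key fields out of an error stanza like the one above. It assumes the stanza is pasted without a default namespace on the `<message/>` element, exactly as shown here; the function name `summarize_error` is hypothetical and not part of Openfire or Spark.

```python
import xml.etree.ElementTree as ET

# Namespace used for defined stanza error conditions.
STANZAS_NS = "urn:ietf:params:xml:ns:xmpp-stanzas"

# The stanza from this report, copied verbatim.
STANZA = """\
<message id="62YYc-88" to="test@dh01.standalone.domain/Spark" from="testroom@conference.xmpp.domain" type="error">
  <body>qut</body>
  <thread>7Df031</thread>
  <error code="406" type="MODIFY">
    <not-acceptable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/>
  </error>
  <x xmlns="jabber:x:event">
    <offline/>
    <delivered/>
    <displayed/>
    <composing/>
  </x>
</message>
"""


def summarize_error(stanza_xml):
    """Return a dict describing an error stanza, or None if it is not one."""
    msg = ET.fromstring(stanza_xml)
    if msg.get("type") != "error":
        return None
    # Assumption: the pasted stanza has no default namespace, so the
    # <error/> child can be found by its bare tag name.
    err = msg.find("error")
    if err is None:
        return None
    condition = None
    prefix = "{" + STANZAS_NS + "}"
    for child in err:
        # The first child in the stanzas namespace names the condition.
        if child.tag.startswith(prefix):
            condition = child.tag[len(prefix):]
            break
    return {
        "from": msg.get("from"),
        "to": msg.get("to"),
        "error_type": err.get("type"),
        "legacy_code": err.get("code"),
        "condition": condition,
    }


if __name__ == "__main__":
    print(summarize_error(STANZA))
```

For the stanza in this report, the sketch reports the condition `not-acceptable` with legacy code `406`, returned by the room `testroom@conference.xmpp.domain`.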

       

      Version / Setup information:

       

      Openfire version: Git checkout of https://github.com/igniterealtime/Openfire/commit/34971f9562fbe07cb7befebb120f883f66493850

      Platform: Linux CentOS 6.8

      Database: Oracle 12.1

      Load balancer: HAProxy for ports 5222 and 5269

       

      Cluster Node1 host: lh01.xmpp.domain

      Cluster Node1 XMPP domain: xmpp.domain

       

      Cluster Node2 host: lh02.xmpp.domain

      Cluster Node2 XMPP domain: xmpp.domain

       

      Node3 host: dh01.standalone.domain

      Node3 XMPP domain: dh01.standalone.domain

       

      Relevant DNS entries (others like the oracle host are not shown):

       

      lh01.xmpp.domain.                           IN  A      10.0.0.11
      lh02.xmpp.domain.                           IN  A      10.0.0.21

      xmpp.domain.                                IN  A      10.0.0.50
      conference.xmpp.domain.                     IN  CNAME  xmpp.domain

      dh01.standalone.domain.                     IN  A      10.0.0.60
      conference.dh01.standalone.domain.          IN  CNAME  dh01.standalone.domain.

      _xmpp-client._tcp.xmpp.domain.              IN  SRV    0  0  5222  xmpp.domain.
      _xmpp-server._tcp.xmpp.domain.              IN  SRV    0  0  5222  xmpp.domain.
      _xmpp-server._tcp.conference.xmpp.domain.   IN  SRV    0  0  5222  conference.xmpp.domain.
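To make it easier to see where a remote server would be directed for S2S, here is a minimal Python sketch that works out the address and port advertised by the records listed above. It only parses the text as pasted and performs no real DNS lookup. The helper names are hypothetical, and it assumes the CNAME target written without a trailing dot (`xmpp.domain`) is meant as an absolute name.

```python
# DNS records from this report, copied as listed.
RECORDS = """\
lh01.xmpp.domain.   IN  A   10.0.0.11
lh02.xmpp.domain.   IN  A   10.0.0.21

xmpp.domain.        IN  A   10.0.0.50
conference.xmpp.domain. IN  CNAME   xmpp.domain

dh01.standalone.domain. IN  A   10.0.0.60
conference.dh01.standalone.domain. IN  CNAME    dh01.standalone.domain.

_xmpp-client._tcp.xmpp.domain.      IN  SRV 0   0   5222    xmpp.domain.
_xmpp-server._tcp.xmpp.domain.      IN  SRV 0   0   5222    xmpp.domain.
_xmpp-server._tcp.conference.xmpp.domain.      IN  SRV 0   0   5222    conference.xmpp.domain.
"""


def parse_records(text):
    """Split the listing into A, CNAME and SRV lookup tables."""
    a, cname, srv = {}, {}, {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != "IN":
            continue  # skip blank or unrecognised lines
        # Assumption: every name is absolute, with or without a trailing dot.
        name, rtype = fields[0].rstrip("."), fields[2]
        if rtype == "A":
            a[name] = fields[3]
        elif rtype == "CNAME":
            cname[name] = fields[3].rstrip(".")
        elif rtype == "SRV" and len(fields) >= 7:
            _priority, _weight, port, target = fields[3:7]
            srv[name] = (int(port), target.rstrip("."))
    return a, cname, srv


def resolve(name, a, cname):
    """Follow CNAMEs (guarding against loops) and return the A record."""
    seen = set()
    while name in cname and name not in seen:
        seen.add(name)
        name = cname[name]
    return a.get(name)


def s2s_endpoint(domain, records):
    """Return (ip, port) advertised by the _xmpp-server SRV record, if any."""
    a, cname, srv = parse_records(records)
    entry = srv.get("_xmpp-server._tcp." + domain)
    if entry is None:
        return None  # no SRV record listed for this domain
    port, target = entry
    return resolve(target, a, cname), port


if __name__ == "__main__":
    for domain in ("xmpp.domain", "conference.xmpp.domain"):
        print(domain, s2s_endpoint(domain, RECORDS))
```

With the records as listed, both `_xmpp-server` entries point at 10.0.0.50 on port 5222. Port 5269 is the IANA-registered port for XMPP server-to-server traffic, so it may be worth confirming that advertising 5222 for S2S is intentional and matches the load balancer configuration.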

       

      NOTE: The dh01 IP address as listed above is the HAProxy IP address, so that incoming connections to dh01 look like they are coming from the "xmpp.domain" IP address rather than from the individual cluster nodes.

       

      I have generated trusted certs that have all appropriate alternate names and imported them into the necessary nodes.

       

      I realise I'm potentially asking for a world of pain using HEAD from git - if there's a specific version I should be trying this with, please let me know.

       

      I have the lab cluster still up and running for further investigations / testing.

       

      Thanks for any pointers that can be given - even if it's "add more debugging to the server connections _here_ and show us the logs".

       

      Kind regards,

       

      Dan