We are attempting to set up a test cluster of two nodes with a third host talking to the cluster via S2S.
When running the two nodes as a standalone cluster, XMPP clients talking to the cluster behave as expected when the node they are connected to goes down: they receive a disconnect, and on rejoin everything works.
We've encountered issues when using a separate XMPP host via S2S: communication dangles or is lost when the cluster node that terminates the S2S connection goes down. When we then attempt to send further "groupchat" messages (causing the creation of new S2S connections), we are in a bad state.
- Client 1 Spark connects as email@example.com to "firstname.lastname@example.org" -> directed to cluster node lh01.xmpp.domain
- Client 2 Spark connects as email@example.com to "firstname.lastname@example.org" -> direct connection to host, S2S connection created to lh01.xmpp.domain via the load balancer.
At this point, both clients see each other in the room and can exchange group chat messages.
- Halt of lh01.xmpp.domain node
The server shuts down, the cluster promotes the junior to senior (lh02) and Client 1 Spark is forced to reconnect - and reconnects successfully to the room. No other participants are visible in the room.
Client 2 Spark does not receive any notice or visible indication that an error has occurred. The logs of "dh01.standalone.domain" show the disconnection of the S2S connection.
When typing further messages in Client 2 Spark, the following is received:
<message id="62YYc-88" to="email@example.com/Spark" from="firstname.lastname@example.org" type="error">
<error code="406" type="MODIFY">
Version / Setup information:
Openfire version: Git checkout of https://github.com/igniterealtime/Openfire/commit/34971f9562fbe07cb7befebb120f883f66493850
Platform: Linux CentOS 6.8
Database: Oracle 12.1
Load balancer: HAProxy for 5222, 5269
Cluster Node1 host: lh01.xmpp.domain
Cluster Node1 XMPP domain: xmpp.domain
Cluster Node2 host: lh02.xmpp.domain
Cluster Node2 XMPP domain: xmpp.domain
Node3 host: dh01.standalone.domain
Node3 XMPP domain: dh01.standalone.domain
Relevant DNS entries (others like the oracle host are not shown):
lh01.xmpp.domain. IN A 10.0.0.11
lh02.xmpp.domain. IN A 10.0.0.21
xmpp.domain. IN A 10.0.0.50
conference.xmpp.domain. IN CNAME xmpp.domain
dh01.standalone.domain. IN A 10.0.0.60
conference.dh01.standalone.domain. IN CNAME dh01.standalone.domain.
_xmpp-client._tcp.xmpp.domain. IN SRV 0 0 5222 xmpp.domain.
_xmpp-server._tcp.xmpp.domain. IN SRV 0 0 5222 xmpp.domain.
_xmpp-server._tcp.conference.xmpp.domain. IN SRV 0 0 5222 conference.xmpp.domain.
NOTE: The dh01 IP address as listed above is the HAProxy IP address - so that incoming connections to dh01 look like they are coming from the "xmpp.domain" IP address rather than individual cluster nodes.
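A quick sanity check of the SRV records above, as a minimal sketch only: it compares each advertised port against the conventional XMPP ports (5222 for client-to-server, 5269 for server-to-server). The conventional ports are an assumption brought in from the usual XMPP defaults, not from this server's configuration, and the function names are made up for illustration. A mismatch may simply be a transcription slip in the records as posted, or an intentional choice, so treat any finding as something to double-check rather than a diagnosis.

```python
# Conventional XMPP ports (assumed defaults, NOT read from the Openfire config).
CONVENTIONAL_PORTS = {"_xmpp-client": 5222, "_xmpp-server": 5269}

# SRV records copied from the DNS entries listed above.
SRV_RECORDS = """\
_xmpp-client._tcp.xmpp.domain. IN SRV 0 0 5222 xmpp.domain.
_xmpp-server._tcp.xmpp.domain. IN SRV 0 0 5222 xmpp.domain.
_xmpp-server._tcp.conference.xmpp.domain. IN SRV 0 0 5222 conference.xmpp.domain.
"""


def parse_srv(line):
    """Split one zone-file style SRV line into (service, owner, port, target)."""
    owner, _klass, rtype, _priority, _weight, port, target = line.split()
    if rtype != "SRV":
        raise ValueError(f"not an SRV record: {line!r}")
    service = owner.split(".", 1)[0]  # e.g. "_xmpp-server"
    return service, owner, int(port), target


def unconventional_ports(zone_text):
    """Return (owner, advertised_port, conventional_port) for each mismatch."""
    findings = []
    for line in zone_text.splitlines():
        if not line.strip():
            continue
        service, owner, port, _target = parse_srv(line)
        expected = CONVENTIONAL_PORTS.get(service)
        if expected is not None and port != expected:
            findings.append((owner, port, expected))
    return findings


for owner, port, expected in unconventional_ports(SRV_RECORDS):
    print(f"{owner} advertises port {port}; conventional port is {expected}")
```

With the records as posted, both `_xmpp-server` entries are reported, since they advertise 5222 while the load balancer is described as also listening on 5269.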
I have generated trusted certs that have all appropriate alternate names and imported them into the necessary nodes.
I realise I'm potentially asking for a world of pain using HEAD from git - if there's a specific version I should be trying this with, please let me know.
I have the lab cluster still up and running for further investigations / testing.
Thanks for any pointers that can be given - even if it's "add more debugging to the server connections _here_ and show us the logs".