Adding to what the sibling comment says, be careful about buying into RabbitMQ's clustering; having run it for years, I found it to be extremely brittle.

We often lost entire queues because a small network blip caused RabbitMQ to think there was a network partition, and when the other nodes became visible, RabbitMQ has no reliable way to restore its state to what it was. It has a bunch of hacks to mitigate this [1], but they don't solve the core problem; the only way to run mirrored queues ("classic mirrored queues", as they're now called) reliably is to disable automatic recovery, and then you have to manually repair RabbitMQ every time this happens. If you care about integrity, you can use the new quorum queues [2] instead, which use a Raft-based consensus system, but they lack a lot of the features of the "classic" queues. No message priorities, for example.
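
To make the trade-off concrete, here's a rough Go sketch using the github.com/rabbitmq/amqp091-go client (queue names and broker URL are placeholders, not anything from the thread). The x-queue-type and x-max-priority arguments are the interesting part: the same declare call gets you either a Raft-replicated quorum queue or a classic queue with priorities, but not both.

    package main

    import amqp "github.com/rabbitmq/amqp091-go"

    func main() {
        conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/") // placeholder URL
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            panic(err)
        }
        defer ch.Close()

        // Quorum queue: Raft-replicated, recovers cleanly from partitions,
        // but must be durable and does not support message priorities.
        if _, err := ch.QueueDeclare(
            "orders-quorum", // name (placeholder)
            true,            // durable: required for quorum queues
            false,           // autoDelete
            false,           // exclusive
            false,           // noWait
            amqp.Table{"x-queue-type": "quorum"},
        ); err != nil {
            panic(err)
        }

        // Classic queue with priorities: supports x-max-priority, but its
        // HA mirroring is the brittle machinery described above.
        if _, err := ch.QueueDeclare(
            "orders-classic",
            true, false, false, false,
            amqp.Table{"x-max-priority": int32(10)},
        ); err != nil {
            panic(err)
        }
    }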

I've never used federation or Shovel, which are separate features with their own pros and cons.

If you're willing to lose the occasional message under very high load, NATS [3] is absolutely fantastic, and extremely fast and easy to cluster. Alternatively, NATS Streaming [4] and Liftbridge [5] are two message brokers built on top of NATS that implement reliable delivery. I've not used them, but heard good things.
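
To show how little ceremony core NATS needs, here's a minimal publish/subscribe sketch in Go using the official github.com/nats-io/nats.go client (the subject name is a placeholder). Note this is the fire-and-forget core NATS delivery mentioned above: no persistence and no acks, so a message published while no subscriber is connected is simply gone.

    package main

    import (
        "fmt"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL) // nats://127.0.0.1:4222
        if err != nil {
            panic(err)
        }
        defer nc.Drain() // flush pending messages and close on exit

        // At-most-once delivery: no acks, no persistence.
        if _, err := nc.Subscribe("events.orders", func(m *nats.Msg) {
            fmt.Printf("received: %s\n", m.Data)
        }); err != nil {
            panic(err)
        }

        if err := nc.Publish("events.orders", []byte("hello")); err != nil {
            panic(err)
        }
        nc.Flush() // make sure the publish reached the server before exiting
    }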

[1] https://www.rabbitmq.com/partitions.html

[2] https://www.rabbitmq.com/quorum-queues.html

[3] https://nats.io/

[4] https://docs.nats.io/nats-streaming-concepts/intro

[5] https://github.com/liftbridge-io/liftbridge



> lost entire queues because a small network blip caused RabbitMQ to think there was a network partition, and when the other nodes became visible, RabbitMQ has no reliable way to restore its state to what it was

I can offer a similar anecdote: we started seeing RabbitMQ report alleged cluster partitions in production after we enabled TLS between RabbitMQ nodes, and manual recovery was needed each time.

After a bit of investigation we noticed that the cluster partitions seemed to correlate with sending an unusually large message (think something dumb like 30 MB) through RabbitMQ while TLS between nodes was enabled. What I believe was happening was that RabbitMQ was so busy encrypting/decrypting the large message that it delayed sending or receiving heartbeats, and the cluster then falsely assumed there had been a network partition.

We mitigated the issue by rewriting the system to not send 30 MB messages: there was only one producer that sent messages anywhere near that large, and after a bit of thought we realised it wasn't necessary to send any message at all in that case. (The large message was a hack around a performance problem in another old system that had been fixed properly a year back, but the hack that generated the huge message was still in place.)
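
Dropping the message entirely was the right call there. When a large payload genuinely does have to move, a common alternative (not what we did, just a sketch of the general pattern) is a "claim check": park the blob in external storage and publish only a small reference, so the broker's inter-node links never carry the bulk. Hypothetical Go sketch; putBlob, the queue name, and the key format are all made up:

    package example

    import (
        "encoding/json"

        amqp "github.com/rabbitmq/amqp091-go"
    )

    // putBlob stands in for whatever external store is available (object
    // storage, shared filesystem, a database); it returns a key the
    // consumer can use to fetch the real payload.
    func putBlob(data []byte) (string, error) {
        // ... upload data somewhere, return its key ...
        return "blobs/abc123", nil
    }

    func publishReference(ch *amqp.Channel, payload []byte) error {
        key, err := putBlob(payload)
        if err != nil {
            return err
        }
        // The broker message is now a few hundred bytes, so heartbeats
        // never get starved behind a 30 MB encrypt/decrypt.
        body, err := json.Marshal(map[string]string{"blob_key": key})
        if err != nil {
            return err
        }
        return ch.Publish(
            "",       // default exchange
            "orders", // routing key / queue name (placeholder)
            false,    // mandatory
            false,    // immediate
            amqp.Publishing{ContentType: "application/json", Body: body},
        )
    }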


Erlang/OTP-22 (released last year) introduced TLS distribution optimizations and message fragmentation which sound very related to the problem you saw:

http://blog.erlang.org/OTP-22-Highlights/

The fragmentation in particular addresses the problem where a large message would block all other messages, including heartbeats, and cause nodes to look “down” when they’re not.


fantastic, thank you for sharing that. my anecdote about this problem is slightly dated: we were seeing the issue in late 2017/early 2018, which indeed predates the OTP 22 release.


The old network partition problems people remember about RabbitMQ are solved by quorum queues.


Yes, but quorum queues don't have many of the features of classic mirrored queues.


it used to be really bad, that's super true.

nowadays? it's actually quite simple to set up and works pretty well (source: i know two different companies that set up clustering recently, and both had good experiences with no downtime).



