Adding to what the sibling comment says, be careful about buying into RabbitMQ's clustering; having run it for years, I found it to be extremely brittle.

We often lost entire queues because a small network blip caused RabbitMQ to think there was a network partition, and when the other nodes became visible, RabbitMQ has no reliable way to restore its state to what it was. It has a bunch of hacks to mitigate this [1], but they don't solve the core problem; the only way to run mirrored queues ("classic mirrored queues", as they're now called) reliably is to disable automatic recovery, and then you have to manually repair RabbitMQ every time this happens. If you care about integrity, you can use the new quorum queues [2] instead, which use a Raft-based consensus system, but they lack a lot of the features of the "classic" queues. No message priorities, for example.
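
To make the trade-off concrete, here's a rough Go sketch using the github.com/rabbitmq/amqp091-go client (queue names and broker URL are placeholders, not anything from the thread). The x-queue-type and x-max-priority arguments are the interesting part: the same declare call gets you either a Raft-replicated quorum queue or a classic queue with priorities, but not both.

    package main

    import amqp "github.com/rabbitmq/amqp091-go"

    func main() {
        conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/") // placeholder URL
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            panic(err)
        }
        defer ch.Close()

        // Quorum queue: Raft-replicated, recovers cleanly from partitions,
        // but must be durable and does not support message priorities.
        if _, err := ch.QueueDeclare(
            "orders-quorum", // name (placeholder)
            true,            // durable: required for quorum queues
            false,           // autoDelete
            false,           // exclusive
            false,           // noWait
            amqp.Table{"x-queue-type": "quorum"},
        ); err != nil {
            panic(err)
        }

        // Classic queue with priorities: supports x-max-priority, but its
        // HA mirroring is the brittle machinery described above.
        if _, err := ch.QueueDeclare(
            "orders-classic",
            true, false, false, false,
            amqp.Table{"x-max-priority": int32(10)},
        ); err != nil {
            panic(err)
        }
    }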

I've never used federation or Shovel, which are separate features with their own pros and cons.

If you're willing to lose the occasional message under very high load, NATS [3] is absolutely fantastic, and extremely fast and easy to cluster. Alternatively, NATS Streaming [4] and Liftbridge [5] are two message brokers built on top of NATS that implement reliable delivery. I've not used them, but heard good things.
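
To show how little ceremony core NATS needs, here's a minimal publish/subscribe sketch in Go using the official github.com/nats-io/nats.go client (the subject name is a placeholder). Note this is the fire-and-forget core NATS delivery mentioned above: no persistence and no acks, so a message published while no subscriber is connected is simply gone.

    package main

    import (
        "fmt"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL) // nats://127.0.0.1:4222
        if err != nil {
            panic(err)
        }
        defer nc.Drain() // flush pending messages and close on exit

        // At-most-once delivery: no acks, no persistence.
        if _, err := nc.Subscribe("events.orders", func(m *nats.Msg) {
            fmt.Printf("received: %s\n", m.Data)
        }); err != nil {
            panic(err)
        }

        if err := nc.Publish("events.orders", []byte("hello")); err != nil {
            panic(err)
        }
        nc.Flush() // make sure the publish reached the server before exiting
    }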

[1] https://www.rabbitmq.com/partitions.html

[2] https://www.rabbitmq.com/quorum-queues.html

[3] https://nats.io/

[4] https://docs.nats.io/nats-streaming-concepts/intro

[5] https://github.com/liftbridge-io/liftbridge



> lost entire queues because a small network blip caused RabbitMQ to think there was a network partition, and when the other nodes became visible, RabbitMQ has no reliable way to restore its state to what it was

I can offer a similar anecdote: we started seeing RabbitMQ report alleged cluster partitions in production after we enabled TLS between RabbitMQ nodes, and manual recovery was needed each time.

After a bit of investigation we noticed that the cluster partitions seemed to correlate with sending an unusually large message (think something dumb like 30 MB) through RabbitMQ while TLS between nodes was enabled. What I believe was happening was that RabbitMQ was so busy encrypting/decrypting the large message that it delayed sending or receiving heartbeats, and the cluster then falsely assumed there had been a network partition.

We mitigated the issue by rewriting the system to not send 30 MB messages: there was only one producer that sent messages anywhere near that large, and after a bit of thought we realised it wasn't necessary to send any message at all in that case. (The large message was a hack around a performance problem in another old system that had been fixed properly a year back, but the hack that generated the huge message was still in place.)
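
Dropping the message entirely was the right call there. When a large payload genuinely does have to move, a common alternative (not what we did, just a sketch of the general pattern) is a "claim check": park the blob in external storage and publish only a small reference, so the broker's inter-node links never carry the bulk. Hypothetical Go sketch; putBlob, the queue name, and the key format are all made up:

    package example

    import (
        "encoding/json"

        amqp "github.com/rabbitmq/amqp091-go"
    )

    // putBlob stands in for whatever external store is available (object
    // storage, shared filesystem, a database); it returns a key the
    // consumer can use to fetch the real payload.
    func putBlob(data []byte) (string, error) {
        // ... upload data somewhere, return its key ...
        return "blobs/abc123", nil
    }

    func publishReference(ch *amqp.Channel, payload []byte) error {
        key, err := putBlob(payload)
        if err != nil {
            return err
        }
        // The broker message is now a few hundred bytes, so heartbeats
        // never get starved behind a 30 MB encrypt/decrypt.
        body, err := json.Marshal(map[string]string{"blob_key": key})
        if err != nil {
            return err
        }
        return ch.Publish(
            "",       // default exchange
            "orders", // routing key / queue name (placeholder)
            false,    // mandatory
            false,    // immediate
            amqp.Publishing{ContentType: "application/json", Body: body},
        )
    }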


Erlang/OTP-22 (released last year) introduced TLS distribution optimizations and message fragmentation which sound very related to the problem you saw:

http://blog.erlang.org/OTP-22-Highlights/

The fragmentation in particular addresses the problem where a large message would block all other messages, including heartbeats, and cause nodes to look “down” when they’re not.


fantastic, thank you for sharing that. my anecdote about this problem is slightly dated: we were seeing the issue in late 2017/early 2018, which indeed predates the OTP 22 release.


The old network partition problems people remember about RabbitMQ are solved by quorum queues.


Yes, but quorum queues don't have many of the features of classic mirrored queues.


it used to be really bad, that's super true.

nowadays? it's actually quite simple to set up and works pretty well (source: i know two different companies that set up clustering recently, and both had good experiences with no downtime).



