Wednesday, June 25, 2014

How we recovered from a RabbitMQ network partition


We have a three-node RabbitMQ cluster hosted with Storm. Everything had run smoothly for years without a problem until Storm upgraded their switches recently, twice actually. Unfortunately, both times the upgrade caused a network partition in our cluster, even though Storm had said it wouldn't affect their customers, just a little extra latency between servers :).


Anyway, our cluster's partition-handling mode was pause_minority, which turned out to be based on an incorrect assumption. It worked great when a single node crashed for whatever reason, but it's an absolutely bad choice for this kind of network partition. Here, each of the 3 servers couldn't communicate with either of the other 2, so all of them concluded they were in the minority and paused. As a result, the whole cluster stopped working, and it stayed down even after the servers could ping each other again.
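For reference, this mode lives in the RabbitMQ config file. A minimal sketch, assuming the stock /etc/rabbitmq/rabbitmq.config path (adjust for your install):

    %% /etc/rabbitmq/rabbitmq.config
    %% pause_minority: a node pauses itself when it cannot see a majority
    %% of the cluster, which is exactly what bit us here.
    [
      {rabbit, [
        {cluster_partition_handling, pause_minority}
      ]}
    ].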


So what did I do? I panicked a little, actually, because for every second the queues were down we were losing that moment's real-time data. I tried rabbitmqctl stop_app; none of the servers responded. So I hit Ctrl-C and tried rabbitmqctl start_app again; same thing, no response. I figured the backlog sitting on the queues/disk was why the app needed so long to start. Fortunately, I had a backup JSON file of the whole cluster's definitions, so I decided to reset everything and rebuild the cluster from scratch, accepting that we would lose all the backlog in those queues.
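In case you're wondering where that backup came from: the management plugin can export all definitions (users, vhosts, exchanges, queues, bindings, policies) as JSON. A sketch, assuming the plugin is enabled, with placeholder credentials and host:

    # Export cluster definitions via the management HTTP API
    # (guest:guest and localhost:15672 are placeholders, use your own)
    curl -u guest:guest http://localhost:15672/api/definitions > cluster-definitions.json

    # Or with the rabbitmqadmin tool that ships with the plugin
    rabbitmqadmin export cluster-definitions.json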


With this approach, I had to kill the beam.smp process, delete all files/folders in the mnesia directory, reboot the nodes and restore the backup file. That worked: the whole cluster configuration loaded successfully and we had the cluster back. It took me a few hours to make this happen, and we lost both the real-time data and the backlog, but we had a way to recover that data from other sources, so it was just a few more hours to get the system back to normal.
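Roughly, the steps looked like the sketch below. The paths and the rabbit@node1 node name are assumptions for a stock install, and note that wiping mnesia destroys all queued messages:

    # On each node: kill the stuck Erlang VM (it ignored rabbitmqctl)
    pkill -9 beam.smp

    # Wipe the node's state -- THIS DESTROYS ALL QUEUED MESSAGES
    rm -rf /var/lib/rabbitmq/mnesia/*

    # Start fresh, then re-join the cluster from the non-primary nodes
    service rabbitmq-server start
    rabbitmqctl stop_app
    rabbitmqctl join_cluster rabbit@node1
    rabbitmqctl start_app

    # Finally, restore exchanges/queues/users from the definitions backup
    rabbitmqadmin import cluster-definitions.json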


One thing to note about the second network partition: those servers recovered quickly after the switch upgrade, but they seemed to have lost all the backlog on their queues. They were all in pause_minority mode, so I would have expected them not to lose data on the queues. I was pretty sure that data had been published with the persistent delivery mode (delivery_mode = 2), which makes the broker write messages to disk. I guess RabbitMQ was right when saying that "Other undefined and weird behaviour may occur" during a network partition.
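For the record, "persistent" here means publishing with delivery_mode 2, and the queue itself also has to be durable for messages to survive a broker restart. A sketch using rabbitmqadmin, with placeholder exchange and routing key names:

    # delivery_mode 2 marks the message persistent so the broker
    # writes it to disk (the target queue must also be durable)
    rabbitmqadmin publish exchange=my.exchange routing_key=my.key \
        payload='{"event":"example"}' properties='{"delivery_mode":2}'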


After this disaster, I've changed our cluster mode to autoheal and will wait to see how it copes with the next network partition. With this setting we can keep our 2-node cluster; we might end up with the 2 nodes running separately after the next partition, but at least we wouldn't lose the backlog. I've also been thinking about another approach using federation, which is possibly a better fit for such an unreliable network: the second node would receive everything published to the first node via an upstream. Anyway, we'll wait and see.
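The mode change itself is a one-liner in the config:

    %% /etc/rabbitmq/rabbitmq.config
    [{rabbit, [{cluster_partition_handling, autoheal}]}].

And the federation idea would look roughly like the sketch below; the upstream name, URI and the "fed." naming pattern are all placeholders, not our actual setup:

    # On the second node: pull everything published to the first node
    rabbitmq-plugins enable rabbitmq_federation
    rabbitmqctl set_parameter federation-upstream node1-upstream \
        '{"uri":"amqp://user:pass@node1"}'
    # Federate exchanges whose names start with "fed." from that upstream
    rabbitmqctl set_policy federate-me "^fed\." \
        '{"federation-upstream-set":"all"}' --apply-to exchanges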



Hope this information helps someone ;)

7 comments:

shinobu said...

This information is really, really helpful to me. I know that many people have been starting to use Rabbit, since it's very powerful if they understand how it works.

But not so many people, in other words just a few guys, fully understand it, not like me ;-)

Would you mind telling me how often you back it up: daily, weekly, or some other schedule?

Unknown said...

Back up the definitions in case you want to quickly restore the exchanges/queues like I mentioned. So you should probably do it whenever you change any settings, such as adding a new queue, a new exchange, etc.

shinobu said...

Thanks!!

tanee.mdz said...

Hey, can you please tell me how you can re-create the network partition scenario?

Unknown said...

This happened because nodes in the cluster couldn't communicate with each other but were still accessible from the internet. Therefore, I think you can try to ssh into one of the boxes (maybe test first with a 2-node cluster) and block all traffic to/from the other node.
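Something like this, for example; the peer address is a placeholder, you need root, and keep a way to undo it handy:

    # On node A: drop all traffic to/from node B (10.0.0.2 is a placeholder)
    iptables -A INPUT  -s 10.0.0.2 -j DROP
    iptables -A OUTPUT -d 10.0.0.2 -j DROP

    # Keep it up longer than net_ticktime (60s by default) so the cluster
    # really declares a partition, then delete the rules to heal it
    iptables -D INPUT  -s 10.0.0.2 -j DROP
    iptables -D OUTPUT -d 10.0.0.2 -j DROP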

Nagendra said...

How is autoheal working for you?

Unknown said...

Worked pretty well :)
