Hi HN! I'm trying to learn more about Erlang and how it achieves fault-tolerance. I am pretty up-to-date on the talks that Joe Armstrong gave, and I've read his thesis.
These have already been quite informative, but I wonder whether there are any codebases out there that are good examples of fault tolerance.
I'm particularly curious about when to split off a new process, and what "if things fail, do something simpler" means in practice.
Any suggestions?
There are two issues you might run into quickly.
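1) If a supervised process restarts too many times within an interval (the supervisor's restart intensity and period), the supervisor gives up: it terminates its children and then itself. Its own parent supervisor will presumably restart it, but that restart counts against the parent's limit too, so the failure can cascade upward and potentially stop your node. This is by design, and for good reasons, but it might not be what you expect, and it might not be a good fit for larger nodes running many things.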
2) If your process crashes, its message queue (mailbox) is discarded, and if you were sending to it via a registered name or a process group (pg), that registration is gone until a replacement process registers again. This means a crashing service process throws away several requests: the one in progress, which is probably fine (it crashed, after all), but also queued ones that could have been serviced. In my experience, you end up wanting to catch errors in service processes, log them, and move on to the next request, so you don't lose unrelated requests. Depending on your application, a restart might be better, or you might run each request in a fresh process for isolation. There are lots of ways to manage this.
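To make point 1 concrete, here is a minimal sketch of a supervisor that sets its restart limit explicitly instead of relying on the defaults. The module names demo_sup and demo_worker are hypothetical, and the numbers are only illustrative; the right values depend on how often you expect the child to crash legitimately.

```erlang
-module(demo_sup).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% OTP's defaults are intensity => 1 and period => 5, so a second
    %% restart within 5 seconds already makes the supervisor give up.
    %% The values below are illustrative, not a recommendation.
    SupFlags = #{strategy => one_for_one,
                 intensity => 10,   % tolerate up to 10 restarts ...
                 period => 60},     % ... within any 60-second window
    %% demo_worker is a hypothetical module exporting start_link/0.
    Worker = #{id => demo_worker,
               start => {demo_worker, start_link, []},
               restart => permanent,
               type => worker},
    {ok, {SupFlags, [Worker]}}.
```

Raising the limit only delays the cascade. If the child keeps crashing faster than the limit allows, the supervisor still terminates and the problem moves one level up.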
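For point 2, the "catch, log, and move on" approach might look roughly like the gen_server below. This is a sketch under assumptions: safe_server and its run/1 API are made up for illustration, and a real service would dispatch on its own request types rather than evaluate a fun passed in by the caller.

```erlang
-module(safe_server).
-behaviour(gen_server).

-export([start_link/0, run/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Hypothetical API: ask the server to evaluate Fun on our behalf.
run(Fun) when is_function(Fun, 0) ->
    gen_server:call(?MODULE, {run, Fun}).

init([]) ->
    {ok, #{}}.

handle_call({run, Fun}, _From, State) ->
    %% Catch errors, exits, and throws from this one request so the
    %% process survives and its mailbox and registered name are kept.
    Reply = try
                {ok, Fun()}
            catch
                Class:Reason:Stacktrace ->
                    logger:error("request failed: ~p:~p~n~p",
                                 [Class, Reason, Stacktrace]),
                    {error, {Class, Reason}}
            end,
    %% State is returned unchanged, on the assumption that a failed
    %% request has not corrupted it.
    {reply, Reply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

The trade-off is that the process keeps whatever state it had. If a failure can leave that state inconsistent, letting the process crash and restart with a clean state is the safer choice, at the cost of the queued requests.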