Ask HN: Good examples of fault-tolerant Erlang code?
147 points by roeles on Dec 29, 2023 | 45 comments
Hi HN! I'm trying to learn more about Erlang and how it achieves fault-tolerance. I am pretty up-to-date on the talks that Joe Armstrong gave, and I've read his thesis. This is already quite informative, but I wonder if there are any codebases out there which are good examples of fault-tolerance.

I'm particularly curious about when to split off a new process, and what "if things fail, do something simpler" means in practice.

Any suggestions?



If you follow OTP design principles, you end up with a supervision tree, and a lot of code like...

   ok = do_something_that_might_fail()

If it returns ok: great, it worked and you move on. If it doesn't return ok, the process crashes, you get a crash report, and the supervisor restarts it, if that's how the supervisor is configured. Presumably it starts properly and deals with future requests.
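To make that concrete, here is a minimal sketch of a supervised worker that crashes on a failed `ok = ...` match and gets restarted. The module name `crash_demo`, the registered names, and the message shape are made up for illustration; only the `supervisor` and `proc_lib` calls are real OTP.

```erlang
-module(crash_demo).
-behaviour(supervisor).
-export([start_link/0, init/1, start_worker/0, worker_loop/0, demo/0]).

start_link() ->
    supervisor:start_link({local, crash_demo_sup}, ?MODULE, []).

%% One permanent child; allow up to 5 restarts within 10 seconds.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Child = #{id => worker,
              start => {?MODULE, start_worker, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

%% Called by the supervisor, so the worker is linked to it.
start_worker() ->
    Pid = proc_lib:spawn_link(?MODULE, worker_loop, []),
    register(crash_demo_worker, Pid),
    {ok, Pid}.

worker_loop() ->
    receive
        {From, Ref, Result} ->
            ok = Result,            % anything but ok raises badmatch
            From ! {Ref, done},
            worker_loop()
    end.

%% Crash the worker once, then check that its replacement answers.
demo() ->
    {ok, _Sup} = start_link(),
    Old = whereis(crash_demo_worker),
    Old ! {self(), make_ref(), {error, boom}},
    New = wait_for_restart(Old),
    Ref = make_ref(),
    New ! {self(), Ref, ok},
    receive {Ref, done} -> ok after 1000 -> timeout end.

%% Restart is asynchronous, so poll until a new pid holds the name.
wait_for_restart(Old) ->
    case whereis(crash_demo_worker) of
        Pid when is_pid(Pid), Pid =/= Old -> Pid;
        _ -> timer:sleep(10), wait_for_restart(Old)
    end.
```

Calling `crash_demo:demo()` should log one crash report for the badmatch and then return `ok` once the restarted worker replies.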

There's two issues you might rapidly encounter.

1) if a supervised process restarts too many times in an interval, the supervisor will stop (and presumably restart), and that cascades up to potentially your node stopping. This is by design, and has good reasons, but might not be expected and might not be a good fit for larger nodes running many things.

2) if your process crashes, its message queue (mailbox) is discarded, and if you were sending to a process registered by name or process group (pg), the name is now unregistered. This means a service process crashing will discard several requests; the one in progress which is probably fine (it crashed after all), but also others that could have been serviced. In my experience, you end up wanting to catch errors in service processes, log them, and move on to the next request, so you don't lose unrelated requests. Depending on your application, a restart might be better, or you might run each request in a fresh process for isolation... Lots of ways to manage this.
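A rough sketch of the "catch, log, move on" approach from point 2, written as a gen_server. The module name `safe_server` and the `divide` request are hypothetical; the point is only that a bad request produces an error reply instead of taking the mailbox down with it.

```erlang
-module(safe_server).
-behaviour(gen_server).
-export([start_link/0, divide/2]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

divide(A, B) ->
    gen_server:call(?MODULE, {divide, A, B}).

init([]) ->
    {ok, #{errors => 0}}.

%% Wrap the per-request work so one bad request does not discard
%% the other requests already queued in the mailbox.
handle_call({divide, A, B}, _From, State = #{errors := N}) ->
    try
        {reply, {ok, A div B}, State}
    catch
        Class:Reason:Stack ->
            logger:error("request failed: ~p:~p~n~p", [Class, Reason, Stack]),
            {reply, {error, Reason}, State#{errors := N + 1}}
    end.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

After `safe_server:start_link()`, `safe_server:divide(1, 0)` returns `{error, badarith}` and the server keeps serving later calls. Whether this beats a restart depends on whether the server's state can still be trusted after the failure.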


What we did in one project was store message data in a JSONB field of a database table, with an id of the recipient process in another field. We used the module name for that. Then we sent a normal message with the id of the database record. If the process succeeded, it marked the record as done. If it crashed, any of those processes would read all of its unprocessed records upon restart and start working on them. We took care never to restart a failed process too fast. Sometimes they kept failing for hours or days until we had a fix, then they processed their backlog. If some jobs were not important anymore or must not be performed because something else took care of them (even people manually), the fix had to skip them. Manual UPDATEs, code, etc.


@toast0, would you still use Erlang/OTP today?

If so, for what kind of apps.

If not anymore, what would you use instead.


Yes. Certainly.

In my mind, there's two really good fits (and some cross over between them).

a) binary matching syntax is really nice for dealing with bit-packed things; although it's not pretty if dealing with little endian values where there's a couple bits in one byte and a couple more in the neighboring byte --- you've got to get the pieces and put them together if it's not just whole bytes. Big endian bit packed structures are easy, but little endian is dominant these days. I don't know if performance is good, but it's easy to read and write for developers.

b) anything with a large number of connections and significant state per connection. A chat server, video conference, etc.

This is why you see 'everyone' build chat from ejabberd. Erlang makes code for this kind of service fairly simple, and hotloading means you can fix bugs without kicking everyone off to restart. Observability features make it reasonable to see what's going on in your system and where bottlenecks are.


(a) has been copied in quite a few languages now. Here's my OCaml copy: https://bitstring.software/documentation/ (sorry about the bad website cert, it's not under my control) https://github.com/xguerin/bitstring We handle endianness very naturally. Performance of the OCaml version depends on whether everything is byte-aligned. It will compile-time optimize byte-aligned stuff to be fast, and do slow bit shifting for everything else. I believe this is an advantage over the Erlang original.


This is pretty interesting. Thanks! I didn't have time to dig into it [1], but something that's a bit ugly is something like a PCI BAR, defined as

   Bits 31-4: 16-Byte Aligned Base Address
   Bit 3:     Prefetchable (flag)
   Bits 2-1:  Type
   Bit 0:     Always zero
In Erlang, this gets parsed with something like:

   <<BaseLS:4, Prefetch:1, Type:2, 0:1, BaseMS:24/little>> = BAR
and then you have to put the Most Significant and Least Significant parts back together somehow, and there's the zero filling too, like maybe

   <<Base:32>> = <<BaseMS:24, BaseLS:4, 0:4>>
If the structure were big endian, you'd probably have bits 0-27 be the address and bits 28-31 be the flags, and you could parse it as

   <<ShiftedBase:28, Prefetch:1, Type:2, 0:1>> = BAR,
   Base = ShiftedBase bsl 4

You could also just do

   <<BaseAndFlags:32/little>> = BAR
and then mask out the flags and stuff, but then you lose the joy of pattern matching, or you defer the pattern matching which means you'd lose out on matching in function heads or have to have an intermediate parsing function that calls the real function with matching in its heads.
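For comparison, here are the two little-endian approaches side by side as a small sketch. The module and function names are made up; the bit layout is the BAR layout listed above.

```erlang
-module(pci_bar).
-export([parse_match/1, parse_mask/1]).

%% Pattern-matching version: pull the pieces out of byte 0 and the
%% upper three bytes, then glue the base address back together.
parse_match(<<BaseLS:4, Prefetch:1, Type:2, 0:1, BaseMS:24/little>>) ->
    <<Base:32>> = <<BaseMS:24, BaseLS:4, 0:4>>,
    #{base => Base, prefetch => Prefetch, type => Type}.

%% Masking version: read one little-endian 32-bit integer and use
%% band/bsr, the way a C developer would.
parse_mask(<<BaseAndFlags:32/little>>) ->
    0 = BaseAndFlags band 1,                 % bit 0 must be zero
    #{base => BaseAndFlags band 16#FFFFFFF0,
      prefetch => (BaseAndFlags bsr 3) band 1,
      type => (BaseAndFlags bsr 1) band 3}.
```

For the register value 16#FEBC000C, stored as `<<16#0C, 16#00, 16#BC, 16#FE>>`, both functions give base 16#FEBC0000, prefetch 1 and type 2.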

[1] I looked a bit more and the example seems in line with Erlang syntax https://bitstring.software/documentation/#constructing-bitst... shows something similar to what I would expect with

  Version = 1,
  Data = 10,
  <<Version:4, Data:12>>.
You could say Data:12/little, and that does flip bits around, but a little endian C developer isn't going to be happy with what results. It's not natural for them to combine the lower bits of byte 0 with the bits of byte 1.


> This is why you see 'everyone' build chat from ejabberd. Erlang makes code for this kind of service fairly simple, and hotloading means you can fix bugs without kicking everyone off to restart.

Sadly, in practice I saw people avoid hotloading. I saw this behavior with the Riak data store, in my company's product, and also with ejabberd in FreeBSD ports (this one is especially annoying, as the port demands uninstalling the package before it even starts building it[1], which is an even worse experience than with alternatives). Why is almost everyone (in my experience) doing this?

[1] https://github.com/freebsd/freebsd-ports/blob/9d74858de3b5e4...


Well, hotloading is very powerful, and can really mess things up in a hurry. You've got to plan for it. I wouldn't expect ejabberd to be built to hotload between releases (unless the docs say you can)

But for hotfixes, it's great. And for code you write, it's often possible to structure even large changes for hotloading.

A lot of people have built deployment around the limitations of other technology, and it's a hard thing to get them to hotload. But if you can, it really reduces the cycle time between when you detect an issue and when it's fixed and deployed.

Short cycles are great because for many issues you can pick the issue up, fix it, deploy it, verify the fix, and be done without having to switch tasks. Additionally, a short cycle reduces the duration of impact for simple bugs, which means you need to spend less effort on finding them before deploy (probably not no effort though). Complex bugs that persist broken state are still difficult to remedy, so you should still spend significant effort on finding those before deployment.

(Also, I'm really not sure why having the port installed would interfere with building a new one, but I've always built Erlang with make and erlc which is maybe a bit unusual)


> (Also, I'm really not sure why having the port installed would interfere with building a new one, but I've always built Erlang with make and erlc which is maybe a bit unusual)

It looks like this happens because ejabberd is installed into the system's Erlang lib directory, and that impacts the build process. I'm wondering why the author of the port opted to force uninstallation instead of just installing it in another directory. Is there a benefit to doing it this way?

I was thinking to suggest that, but I don't know enough about erlang if that even makes sense.


Thanks for your insights as always.

Related to (b), any thoughts on Rustler?

My understanding is that it’s used extensively at Discord to eliminate Erlang/OTP performance bottlenecks.

https://github.com/rusterlium/rustler


I like it a lot.

The big advantage of Rustler vs. say, writing C or using Zig's equivalent, is that you can just use any Rust libraries. For example, I wrote Rust<->Elixir bindings to xxhash and CANBus for our industrial control/embedded applications, and it was quick to do. Very little ceremony.

Rustler provides these Encoder and Decoder traits (and implementations for most stdlib data structures) that make it easy to pass data back and forth across the FFI boundary without you having to manually do it, and they've done all of the work of writing/generating the bindings to Erlang's C API. It also includes a generator to easily create a new NIF that integrates into your Elixir project.

Of course, you need to know both Elixir and Rust to a high enough level, but if you do, it's really pretty straightforward to just either wrap an existing Rust library, or to rewrite some hot path in Rust and schedule it on a dirty CPU or IO thread.


I haven't used it. I've written and used NIFs, but mostly they're simple enough that C is fine, and Rust isn't really needed. I think discord ends up with complex data structures in their native layer; Rust seems reasonable for that, but I never experienced that need.

That said, I did have a concurrency issue with a NIF I used, but I don't know if Rust would have been helpful or just made it really hard to express.


We are running Erlang/OTP for telemetry backend in the industrial automation field.

Basically, we have hundreds of thousands of sensors connected through gateways that keep open TCP/IP connections to the Erlang/OTP distributed backend. We do bidirectional communication as we have many control functions, OTA firmware updates, etc.

There are frequent failures which we handle with supervision trees and “let it fail” design principles. Failures are due to:

* Gateways which are connected through cellular networks with varying signal strength conditions.

* Sensors failing and providing incorrect data (eg. invalid float binaries).

* Buggy firmware in some 3rd party gateways / sensors.

* Buggy firmware in our own gateways and sensors ;-)

I highly recommend Erlang/OTP due to:

* Fault tolerance - processes fail, nodes carry on.

* Concurrency model (mailboxes, linking processes via supervision trees, monitors, trapping, etc)

* Built-in distribution and related modules in standard library

* Pattern matching which makes processing binary data super convenient

* Mnesia database (if used for right things)


Why wouldn't you be able to implement this in Rust or Python?

Seems like typical exception handling. Erlang isn't even type checked


1. You can implement supervisor trees in Rust or Python, but neither of them has a runtime that supports that out of the box and an exception handling system that works harmoniously with that runtime, so you'd have to implement all of that yourself, and it turns out to be a non-trivial task.

2. Erlang has a variety of type checks at a number of conceptual levels, so I strongly recommend you plow through something like https://learnyousomeerlang.com/ in order to improve your understanding here.


That's the basics. The next level is: what if you want one process dying to interrupt and kill another process that is associated with it (perhaps it's the parent in an RPC, which is the easiest case), and to trigger the correct exception message in the logs of both nodes?

Oh. And it's on the other side of a cluster.

Sure, you could do it in Python or rust. It won't be zero lines of code. You're probably gonna get it wrong.


Certainly, you can do most things in most languages.

Rust and Python don't come with supervision trees, so you'd have to build or find that. They also don't come with async messaging, especially not cross node async messaging, so you'd have to build or find that.

Building process linking and monitoring, where the death of one process notifies or kills other processes, is tricky, and I suspect you won't find that; so you'll have to build it yourself.

But even if you have all of that, now you're writing Erlang style in another language, and it's not idiomatic and people won't understand or like your code.

You can even build in hotloading. I did it (poorly) in Perl in the early 2000s without knowing it was available elsewhere, and I did it more recently in C with dlopen and friends. It makes your code look real funny though.

I also think you missed the ease of raising exceptions... ok = ... is very powerful and concise. You don't write a throw, you don't worry about what failure looks like, you just pattern match success, and if/when failure happens, you usually have what you need to figure out what went wrong. And if it's better to do something else, it's easy to update your code (and you can hotload the update to the running system).


The Erlang way doesn’t require exception handling code because processes are isolated, so when one dies the runtime can clean up after it (reclaiming memory, file handles, etc.) as it knows what process owns what resources. Links and monitors make it possible to expand this to custom types of resources e.g. DB connections, locks, queues…

The idea is to implement error handling in the core (VM, supervisors, DB connection pool) while the vast majority of the code can just crash at any time and not worry about closing its files or whatever.
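As an illustration of extending this to a custom resource, here is a small sketch of a lock process that monitors whoever holds the lock and frees it when the holder dies. Everything here (module name, message shapes) is invented for the example; only `erlang:monitor/2`, `spawn_monitor/1` and the `'DOWN'` message are the real mechanism.

```erlang
-module(lock_demo).
-export([start/0, acquire/1, demo/0]).

start() ->
    spawn(fun() -> loop(free) end).

acquire(Lock) ->
    Lock ! {acquire, self()},
    receive
        {Lock, granted} -> ok;
        {Lock, busy} -> busy
    end.

%% The lock process owns the cleanup logic, so holders need none.
loop(free) ->
    receive
        {acquire, Pid} ->
            MRef = erlang:monitor(process, Pid),
            Pid ! {self(), granted},
            loop({held, Pid, MRef})
    end;
loop({held, Owner, MRef} = State) ->
    receive
        {'DOWN', MRef, process, Owner, _Reason} ->
            loop(free);                      % holder died: release
        {acquire, Pid} ->
            Pid ! {self(), busy},
            loop(State)
    end.

%% A holder takes the lock and crashes without releasing it.
demo() ->
    Lock = start(),
    {Owner, MRef} = spawn_monitor(fun() ->
                                      ok = acquire(Lock),
                                      exit(crashed)
                                  end),
    receive {'DOWN', MRef, process, Owner, crashed} -> ok end,
    acquire_eventually(Lock).

%% The lock may see our request before its own 'DOWN', so retry.
acquire_eventually(Lock) ->
    case acquire(Lock) of
        ok -> ok;
        busy -> timer:sleep(10), acquire_eventually(Lock)
    end.
```

`lock_demo:demo()` returns `ok` even though the first holder never released anything, because the release logic lives in the lock process rather than in the code that crashed.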


I think the "let it fail" slogan is taken too literally, especially by people who don't use OTP.

You can handle errors and exceptions if you want; the supervisors etc. are more there for the unexpected failures that happen for all sorts of reasons in any software.

Sometimes it also hides bugs because your solution appears to work correctly;)


> I think the "let it fail" slogan is taken too literally, especially by people who don't use OTP

I like the philosophy - when taken literally, it saves writing error-handling code at unit-level for an entire class of errors (equivalent to the JVM's unchecked exceptions). A well-architected supervision tree will handle that for you, when paired with the excellent out-of-the-box instrumentation for observability.

> You can handle errors and exceptions if you want

Yes, which is why it's standard to pattern-match on a function's result to check whether it returned a known error or not - after all, (checked) errors are just return values in Erlang/BEAM.
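A short sketch of both styles against the same function. The module and its `parse_port` helper are hypothetical; `string:to_integer/1` is the only library call.

```erlang
-module(result_demo).
-export([parse_port/1, describe/1, must_parse_port/1]).

%% Errors are ordinary return values: {ok, Value} | {error, Reason}.
parse_port(Str) ->
    case string:to_integer(Str) of
        {Port, ""} when is_integer(Port), Port > 0, Port < 65536 ->
            {ok, Port};
        _Other ->
            {error, {invalid_port, Str}}
    end.

%% Expected failure: match on the error and do something simpler.
describe(Str) ->
    case parse_port(Str) of
        {ok, Port} -> {listening_on, Port};
        {error, _Reason} -> {listening_on, 8080}
    end.

%% Unexpected failure: only match success; anything else is a
%% badmatch that crashes the calling process.
must_parse_port(Str) ->
    {ok, Port} = parse_port(Str),
    Port.
```

`describe("99999")` falls back to `{listening_on, 8080}`, while `must_parse_port("x")` crashes with a badmatch that carries the `{error, {invalid_port, "x"}}` term in the crash report.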


Thanks for your insights!


RabbitMQ is IMHO probably the best open source example tackling a large, complicated real world problem with graceful degradation (e.g. if a queue keeps crashing).

Elixir has a lot of smaller but very high quality libraries to learn from. You may be interested in how Ecto & Postgrex manage DB connections, in particular how connection sockets are “borrowed” so data doesn’t get repeatedly messaged (read: copied) between processes. Bandit / Thousand Island also make interesting decisions for process structure in HTTP1.1 vs HTTP2.

I think a common mistake is to create processes mimicking classic OOP structure, like an OrderProcessor, ShippingManager, etc. Processes in Erlang are a unit of fault tolerance, not code organization. This means more usually you’ll have one process per request, potentially calling code from many different modules; since requests are the things you want to fail separately from each other.

In RabbitMQ’s case for instance connections and queues are processes, but exchanges are not. It would feel natural to model the problem as three processes with messages going Connection -> Exchange -> Queue, but in reality an exchange is a set of routing rules that can be applied by a connection directly, which avoids a lot of complexity and overhead.

Last thing I’d note is supervision trees etc. are really about handling _unexpected_ errors (Joe uses the terms faults and errors with different meanings iirc). If you want a web request to be retried a few times with a delay, don’t use a supervisor for that, just loop with a sleep. Same for things like validating inputs from a form, usually you’d want to give the user a hint and not just crash.
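The "just loop with a sleep" suggestion could look something like this sketch. The module name and the `{ok, _} | {error, _}` convention for `Fun` are assumptions for the example.

```erlang
-module(retry_demo).
-export([retry/3]).

%% Call Fun up to Attempts times, sleeping DelayMs between tries.
%% Fun is assumed to return {ok, Value} or {error, Reason}.
retry(Fun, Attempts, DelayMs) when Attempts > 0 ->
    case Fun() of
        {ok, _} = Ok ->
            Ok;
        {error, _} when Attempts > 1 ->
            timer:sleep(DelayMs),
            retry(Fun, Attempts - 1, DelayMs);
        {error, _} = Error ->
            Error                            % out of attempts
    end.
```

For example, `retry_demo:retry(fun() -> {error, nope} end, 2, 100)` calls the fun twice and then returns `{error, nope}`. The expected failure stays inside the request's own process and never reaches a supervisor.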

Some other useful links:

- https://aosabook.org/en/v1/riak.html (bit old, but another large codebase)

- https://ferd.ca/the-zen-of-erlang.html

- https://www.theerlangelist.com/article/spawn_or_not


The simple answer: supervision trees. And fault-tolerance usually means that the failing process would be restarted. It would not handle stuff like netsplits or node going down though.

Check the code of cowboy, ejabberd, MongooseIM, and RabbitMQ for examples. Many factors go into deciding when to create a new process: data locality, the pattern of interaction with other processes, performance considerations. A good idea is to have one process per TCP connection, but not one process per routed message. And be careful with blocking gen_server calls - these can block or fail.


> Check code of cowboy, ejabberd, MongooseIM, RabbitMQ for examples.

+ CouchDB?


Just to add to this, there are some implementations of things like consensus algorithms in Erlang such as Ra: https://github.com/rabbitmq/ra


Ra is pretty cool. Making a cluster fault-tolerant is an art of its own in Erlang ;)

Riak Core is extremely cool, but Riak is dead by now. It was a child of the times when NoSQL was cool. Still, basho code is interesting to read. (https://github.com/basho/riak_core)

Self-ads: we've tried to remove Mnesia from our project, HN post incoming, once the library is prettified and tested hard (https://github.com/esl/cets).


Thanks! Will check those projects out.


It's been a while since I worked actively with Erlang, but I remember that reading the cowboy source code was educational.


Step zero is definitely the OTP Design Principles doc (part of the OTP distribution):

https://www.erlang.org/doc/design_principles/users_guide

There are some good texts that have more examples:

Erlang & OTP in Action - https://www.manning.com/books/erlang-and-otp-in-action

Designing for Scalability with Erlang/OTP - https://www.oreilly.com/library/view/designing-for-scalabili...

One big example of distributed Erlang is Riak:

https://github.com/basho/riak


And the classic "Learn You Some Erlang for Great Good!":

https://learnyousomeerlang.com/introduction


Armstrong's Erlang: Software for a concurrent world is my recommendation.


Yes, Erlang & OTP in Action is one I would recommend in particular. It's a few years old now and I don't know if they've updated it but the basics of OTP, supervision, etc. have not really changed.

Armstrong's Programming Erlang is another one to look at.


Thanks for the book recommendations!


I have written a lot of Erlang code over the years, including an Erlang Redis clone which had some interesting performance characteristics[1] ... though not recently, I went too far down the engineering management track at Snap and elsewhere... but I worked closely with Fernando "El Brujo"[2] when he was CTO of my consultancy. If you want to see beautiful, canonical Erlang code, he's still slinging it out. Dig through his repos on Github, or better yet, ask him to provide his suggestions.

[1] https://github.com/cbd/edis [2] https://github.com/elbrujohalcon


If you were to consolidate all the info (including links published by Joe) into an informative blog post, it could become the “2024 reference bookmark” for a lot of people.

I have thought of writing this! It would be quite useful to a lot of people.


That would be great!


You don't achieve fault tolerance solely by using Erlang. Erlang does not inherently 'achieve fault tolerance.' Instead, you make your system fault-tolerant through deliberate engineering. While Erlang provides tools and design guidelines, the responsibility for achieving fault tolerance ultimately lies with you. Source: I implemented and operated a large Erlang system for approximately 3 years.


This is exactly why they're asking for example projects...


that's always true. i think the author is interested in code examples of such. and unlike many other frameworks/tools, erlang provides a great pit-of-success for implementing fault tolerance - e.g if you follow common/best practices - you'll achieve a fairly good fault tolerance.


The big benefit in my experience was that I could have a program with real users that did have errors (from me being new to Elixir and not knowing better) and still not experience downtime.

Instead, CPU or Memory would increase over time, hit the VM limit, kill and restart.

So later when I noticed this, I could debug and fix it without simultaneously fighting a prod incident.


> I'm particularly curious about when to split off a new process, and what "if things fail, do something simpler" means in practice.

Processes are failure and concurrency barriers.

Failure: one process crashing does not crash another process, unless you explicitly want it to (e.g., via Erlang's `link` functionality). So, if you have multiple operations that must not interfere with each other in the case of one of them misbehaving (e.g., your application makes multiple HTTP requests in parallel), you want them in separate processes.

Concurrency: processes are independently and preemptively scheduled by the VM. If you have multiple operations that are not necessarily sequentially ordered, and you want to run them at the same time, you put each of them in a process. One example problem where this applies would be the handling of incoming TCP messages, where each message is not related to the previous or subsequent messages, and you want to be able to process multiple messages at the same time.

If you handle each new message in its own process, the VM will schedule the processing of those messages such that the processing of one message will not interfere with the processing of another. It accomplishes this by tracking a rough proxy of CPU time each process uses (called "reductions" in Erlang) and descheduling processes that consume too many resources and giving other processes a chance to run for a bit. (Note that this is just one example and ignores any performance considerations. There are other approaches but I am omitting them for simplicity's sake)
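A simplified sketch of the process-per-message idea, ignoring performance just as the text does. The module name and the `Handler` fun are placeholders; each message gets its own monitored process, so a crash in one handler shows up as an error for that message only.

```erlang
-module(per_message).
-export([handle_all/2]).

%% Spawn one monitored process per message, then collect in order.
handle_all(Handler, Messages) ->
    Parent = self(),
    Workers = [spawn_monitor(fun() -> Parent ! {self(), Handler(M)} end)
               || M <- Messages],
    [collect(W) || W <- Workers].

collect({Pid, MRef}) ->
    receive
        {Pid, Result} ->
            %% Result arrived; drop the monitor and any pending 'DOWN'.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The handler crashed before sending a result.
            {error, Reason}
    end.
```

With a handler that doubles its input but raises on 0, `handle_all(Handler, [1, 0, 3])` gives an `{ok, _}` for the first and third messages and an `{error, _}` for the second.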

There are a number of good libraries to look at for these in practice. I'd personally go look at Cowboy and/or Ranch as they deal with lots of IO. Oban is an Elixir job queue library that is fantastic and has very high code quality. Another good one would be Poolboy, which is a worker pool library.


> when to split off a new process

Always?

Times some factor so you have several instances of the same thing in case one fails.

Good luck.


I don't know Erlang so take it with a grain of salt, but maybe take a look at the CouchDB code base?

It's a NoSQL DB written in Erlang. I looked at it a few years ago, its master to master replication seemed cool.


Not code but an excellent presentation of design/architecture of a Real-World fault-tolerant and distributed System in Erlang/Elixir - https://www.youtube.com/watch?v=pQ0CvjAJXz4


My advice is don't take Erlang's fault-tolerance promises too seriously. It's just a little framework that helps in some cases and gets in the way in other cases.

I've seen many Erlang systems fail in funny ways, including some of the big examples given here. Supervision trees are cool but it's clearly nonsense to hardcode restart strategy and timing numbers for workers as if all failure modes are the same and deployed in the same network/capacity/resource/conditions with any number of workers. The strategy and schedule for recovering 10 crashed resource workers will clearly be different when you have 1M workers. The strategy will be different if you are timing out on network or if you are getting a resource error and have better things to do than restarting workers.
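For readers who haven't seen them, the restart strategy and timing numbers in question are the supervisor flags returned from `init/1`. A minimal sketch, with made-up module and child names, of a child that keeps crashing until the supervisor gives up:

```erlang
-module(intensity_demo).
-behaviour(supervisor).
-export([init/1, start_flaky/0, demo/0]).

%% Allow at most 2 restarts in any 5-second window.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 2, period => 5},
    Child = #{id => flaky,
              start => {?MODULE, start_flaky, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

%% A child that always dies 50 ms after starting.
start_flaky() ->
    {ok, spawn_link(fun() ->
                        timer:sleep(50),
                        exit(simulated_failure)
                    end)}.

%% Trap exits so the caller observes the supervisor stopping
%% instead of being taken down with it.
demo() ->
    Old = process_flag(trap_exit, true),
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    Result = receive
                 {'EXIT', Sup, Reason} -> {supervisor_stopped, Reason}
             after 5000 ->
                 still_running
             end,
    process_flag(trap_exit, Old),
    Result.
```

The third crash inside the window exceeds the intensity, so the supervisor itself exits with reason `shutdown` and `demo()` returns `{supervisor_stopped, shutdown}`. The flags are static per supervisor, which is the limitation being criticized here.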

Focus on fault-tolerance outside erlang - have standby capacity in isolation and load-balance properly, shard the system in isolated pieces as much as you can.



