As the paper notes in the introduction, "volatile" is heavily used in embedded software, where synchronization primitives like a kernel's spinlocks aren't readily available.
The "buffer_ready" example in the paper is a very good one that I have seen many times in the real world. If anyone can share the "better solution" that avoids "volatile" (and works on a bare-metal ARM microcontroller, for example), I would love to see it.
The solution to this problem is a memory barrier. The short version of how such a barrier works: define a class of transactions (e.g. memory writes). Then all transactions of that class issued before the barrier must complete before any transaction of that class issued after the barrier starts. The barrier gives you no guarantee about transactions of other classes.
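To make that concrete, here is a minimal sketch using the C11 fences from <stdatomic.h>. The names (payload, payload_ready, publish, try_consume) and the size are made up for illustration, and it assumes a toolchain with C11 atomics. The release fence keeps the payload writes from being reordered past the flag write; the acquire fence does the mirror image on the reading side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define PAYLOAD_LEN 4  /* hypothetical size, for illustration only */

static int payload[PAYLOAD_LEN];   /* plain, non-atomic data */
static atomic_int payload_ready;   /* flag; zero-initialised (static storage) */

/* Writer: the payload stores are ordered before the flag store. */
void publish(const int *src, size_t n)
{
    assert(n <= PAYLOAD_LEN);
    for (size_t i = 0; i < n; i++)
        payload[i] = src[i];

    /* Release fence: earlier loads and stores cannot be reordered
       past the stores that follow the fence. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&payload_ready, 1, memory_order_relaxed);
}

/* Reader: returns 1 and copies the payload if it has been published. */
int try_consume(int *dst, size_t n)
{
    assert(n <= PAYLOAD_LEN);
    if (atomic_load_explicit(&payload_ready, memory_order_relaxed) == 0)
        return 0;

    /* Acquire fence: later loads and stores cannot be reordered
       before the loads that precede the fence. */
    atomic_thread_fence(memory_order_acquire);
    for (size_t i = 0; i < n; i++)
        dst[i] = payload[i];
    return 1;
}
```

Which instruction the fence turns into (if any) is up to the compiler and target, so this shows where the barrier goes rather than what a particular core needs.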
I haven't programmed ARM myself, and there are different ARMs, but as far as I understand (inspired by the post from cnvogel), on most CPUs, even non-ARM ones, you need to use intrinsics like these, which exist on ARM:
As far as I understand, volatile doesn't give you the needed "barrier" semantics (what is not allowed to happen before or after, at the deep hardware level). If the code with "volatile" works, it may be just an accident.
I don't know the details for ARM, but generally, microcontroller type CPUs don't do any reordering, so no need to prevent it with barriers. And if you do use volatile correctly (and the compiler is not broken), you don't need compiler barriers either. You also might not need any CPU barriers when the cache controller is configured to treat certain regions of the address space as uncacheable (which might be used for MMIO regions).
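What I mean by using volatile correctly, as a rough sketch: make everything that is shared volatile, not just the flag. The names here (sample, sample_valid, sample_isr, sample_poll) are invented, and the whole thing assumes a single in-order core where the hardware itself does not reorder or buffer these accesses. That assumption is exactly what you would have to check for your part.

```c
#include <stdint.h>

/* Both the data and the flag are volatile, so the compiler must emit
   these accesses in program order relative to each other. */
static volatile uint32_t sample;
static volatile int sample_valid;

/* Hypothetical ISR body; on hardware this would hang off an interrupt vector. */
void sample_isr(uint32_t raw)
{
    sample = raw;        /* volatile store #1 */
    sample_valid = 1;    /* volatile store #2, emitted after #1 */
}

/* Main-loop side: returns 1 and writes *out if a sample is pending.
   Overrun handling (a new sample arriving mid-read) is left out. */
int sample_poll(uint32_t *out)
{
    if (!sample_valid)
        return 0;
    *out = sample;
    sample_valid = 0;
    return 1;
}
```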
Modern high-performance ARM cores such as the Cortex-A15 do plenty of reordering. Even when using strongly ordered memory mappings, you still need barriers to prevent reordering between normal and strongly ordered accesses. No amount of volatile will help with this.
The parent's parent was talking about microcontroller-class ARMs, though. Also, you don't necessarily have to order "normal" accesses with regard to strongly ordered accesses. If it's only about ordering accesses to MMIO, it doesn't matter when some arithmetic result gets stored to RAM relative to those accesses; all that matters is that the hardware sees the register reads and writes in the right order. Ordering only matters for stuff that is shared in some way. For private data, the illusion of naive serial execution is guaranteed anyhow.
A common scenario is filling a data buffer in normal memory (cached or non-cached doesn't matter on ARM) before initiating a DMA operation by writing to a device register. In this situation, a barrier is required to prevent any of the normal writes from being reordered past the DMA initiation; otherwise the DMA could see stale data in the buffer.
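Roughly, the shape of that scenario in C looks like the sketch below. Everything named here (dma_buf, fake_dma_start_reg, dma_send) is a stand-in so the code runs on a host; on real hardware the register would be a memory-mapped address. The C11 fence only marks where the barrier belongs. On an actual ARM target you would use whatever the core's documentation calls for at that point, e.g. the CMSIS `__DSB()` intrinsic, and which barrier is sufficient for a given device is an assumption you have to verify.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_BUF_LEN 8u  /* hypothetical buffer size */

/* Host-side stand-ins for a DMA buffer and a device "start" register. */
static uint8_t dma_buf[DMA_BUF_LEN];
static volatile uint32_t fake_dma_start_reg;

void dma_send(const uint8_t *src, size_t n)
{
    assert(n <= DMA_BUF_LEN);
    for (size_t i = 0; i < n; i++)
        dma_buf[i] = src[i];            /* normal-memory writes */

    /* Barrier goes here: the buffer writes must not be reordered past
       the register write below. Portable placeholder; a real target
       would use its own barrier instruction or intrinsic. */
    atomic_thread_fence(memory_order_seq_cst);

    fake_dma_start_reg = (uint32_t)n;   /* "device register" write starts the DMA */
}
```

Without the barrier line, making only the register volatile does nothing to stop the buffer writes from moving.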
Discussions about barriers (or volatile) are only meaningful in the context of related accesses, so explicitly mentioning this isn't really necessary.
I don't really get what you are trying to say. Sure, there are cases where you need barriers, both the CPU and the compiler kind; nobody denied that. Still, in low-end stuff, PIO and cache-free in-order execution are alive and well, and volatile can be perfectly sufficient under such circumstances.
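For the compiler kind specifically, C11 has atomic_signal_fence, which constrains the compiler's reordering but normally emits no barrier instruction. A small sketch, with invented names (reading, reading_valid, reading_isr, reading_poll), assuming an interrupt handler on a single core behaves like a signal handler in the same thread:

```c
#include <stdatomic.h>
#include <stdint.h>

static uint32_t reading;          /* plain, non-volatile data */
static atomic_int reading_valid;  /* flag, zero-initialised */

/* Hypothetical ISR body. */
void reading_isr(uint32_t raw)
{
    reading = raw;
    /* Compiler-only barrier: the store to `reading` may not be moved
       below the flag store by the compiler. No CPU barrier is implied. */
    atomic_signal_fence(memory_order_release);
    atomic_store_explicit(&reading_valid, 1, memory_order_relaxed);
}

/* Main-loop side: returns 1 and writes *out if a reading is pending. */
int reading_poll(uint32_t *out)
{
    if (!atomic_load_explicit(&reading_valid, memory_order_relaxed))
        return 0;
    /* Compiler-only barrier: the load of `reading` may not be hoisted
       above the flag load by the compiler. */
    atomic_signal_fence(memory_order_acquire);
    *out = reading;
    atomic_store_explicit(&reading_valid, 0, memory_order_relaxed);
    return 1;
}
```

If the other side is a second core or a DMA engine rather than an ISR on the same core, this is not enough and you are back to the CPU kind.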
volatile is an antipattern. Don't use it. Don't encourage other people to use it. It is not a thing which should be used, by you. To use it would be wrong, because not using it is correct.
The ARMv7-M architecture (which the popular Cortex-M series microcontrollers implement) allows reordering as I described. Even if a Cortex-M3 probably doesn't ever do it, nobody is making any promises. If a particular behaviour is not documented, it is wrong to rely on it.
Been writing embedded software for decades. Volatile is certainly NOT the feature of choice. Atomic operations, including memory barriers, are far more commonly used.
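A tiny example of the kind of thing I mean, sketched with C11 atomics. The names (pending_events, event_post, event_take_all) are made up, and whether your toolchain inlines these or calls a helper library depends on the core.

```c
#include <stdatomic.h>

static atomic_uint pending_events;  /* zero-initialised */

/* Called from an interrupt handler or another thread. */
void event_post(void)
{
    /* Atomic read-modify-write; release ordering publishes any data
       written before the post. */
    atomic_fetch_add_explicit(&pending_events, 1u, memory_order_release);
}

/* Atomically takes all pending events and resets the count to zero. */
unsigned event_take_all(void)
{
    return atomic_exchange_explicit(&pending_events, 0u, memory_order_acquire);
}
```

With a volatile counter the increment is a separate load and store, so an interrupt landing in between loses an event. The atomic version has no such window.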
Embedded programmers are paranoid. They check such things in the debugger (examine assembly code). Not the big issue the OP seems to think.
The "buffer_ready" example was given as an incorrect use, though, which won't necessarily work even in a standards-conforming compiler, because the semantics of volatile don't forbid reordering the loop (which has no volatile accesses) to go after the buffer_ready write.
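Here is a sketch of that shape, reconstructed from memory rather than copied from the paper, so treat the names and BUF_SIZE as assumptions. The first function is the problematic pattern; the second is one possible fix using a C11 release store instead of volatile.

```c
#include <stdatomic.h>
#include <stddef.h>

#define BUF_SIZE 16  /* hypothetical size */

static char buffer[BUF_SIZE];

/* Problematic shape: only the flag is volatile, so the compiler may
   legally sink the loop's plain stores below the flag write. */
static volatile int buffer_ready_volatile;

void buffer_init_volatile(void)
{
    for (size_t i = 0; i < BUF_SIZE; i++)
        buffer[i] = 0;             /* non-volatile stores */
    buffer_ready_volatile = 1;     /* volatile store, not ordered against the loop */
}

/* One possible fix: a release store orders the loop before the flag. */
static atomic_int buffer_ready;

void buffer_init(void)
{
    for (size_t i = 0; i < BUF_SIZE; i++)
        buffer[i] = 0;
    atomic_store_explicit(&buffer_ready, 1, memory_order_release);
}

/* Reader side: an acquire load pairs with the release store above. */
int buffer_is_ready(void)
{
    return atomic_load_explicit(&buffer_ready, memory_order_acquire);
}
```

Both versions will pass a simple test on most compilers today, which is the trap: the volatile one works by luck, not by guarantee.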