This is interesting, but not nearly as bad as it sounds:
No, the problems start when you disable the kernel's OOM killer and try to act on your own.
Surprise, surprise: when you disable the time-tested built-in mechanisms that have been put to use in a multitude of cases that boggles the mind and try to implement your own, you're gonna have a bad time.
1) something like systemd starts a process in a cgroup, to limit memory usage
2) it disables the kernel OOM killer, expecting to receive the notification itself and kill the offending process (see the sketch after this list)
3) systemd crashes? Then it can no longer process the notifications, so you have a child process which is at the memory limit but never killed.
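For concreteness, here's roughly what step 2 looks like with the cgroup v1 memory controller. It's a minimal sketch, not anybody's real manager; the cgroup path is a placeholder and error handling is abbreviated:

    /* Minimal sketch of a userspace OOM handler for a cgroup v1 memcg.
     * Paths are placeholders; error handling is abbreviated. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    #define CG "/sys/fs/cgroup/memory/mygroup"   /* hypothetical cgroup */

    int main(void) {
        int efd = eventfd(0, 0);
        int ofd = open(CG "/memory.oom_control", O_RDWR);
        int cfd = open(CG "/cgroup.event_control", O_WRONLY);
        if (efd < 0 || ofd < 0 || cfd < 0) { perror("open"); return 1; }

        /* Register the eventfd for OOM notifications on this memcg. */
        char buf[64];
        snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
        write(cfd, buf, strlen(buf));

        /* Disable the kernel OOM killer for this memcg: tasks that hit the
         * limit are paused instead of killed, and we are expected to act. */
        write(ofd, "1", 1);

        for (;;) {
            uint64_t count;
            /* Blocks until the memcg hits its limit.  If *this* process dies,
             * nothing reads the notification and the paused tasks stay stuck:
             * exactly the failure mode described above. */
            if (read(efd, &count, sizeof(count)) != sizeof(count)) break;
            fprintf(stderr, "OOM event in %s, killing offender...\n", CG);
            /* ...pick a PID from CG "/cgroup.procs" and kill() it here... */
        }
        return 0;
    }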
This is a great example of why an init system should be as simple as possible and not die under any circumstance :) Why would your init system die? If you use something like daemontools, it's designed to not even allocate memory after startup, because allocation could fail.
Was she talking about systemd? Which process in systemd does the killing? Is it systemd-nspawn or something? (I'm not using systemd now.)
systemd is the most notable process manager that has cgroups support, and the article is talking about process managers that "fail".
It doesn't matter if it's not the PID 1 that fails. The article is saying: if the process that is supposed to receive OOM notification and kill the process fails, you will observe symptoms that are very hard to diagnose.
PID 1 dying is of course really bad, but if other processes in the init system die, your system can still be hosed, as pointed out here.
I don't think the standard Linux kernel supports user-space OOM handling. That's a patch set that Google has been pushing for a few years, but that hasn't been accepted yet (as far as I know). So I'm guessing she's talking about Google's server infrastructure.
My understanding is that the program running in the cgroup is disabling the OOM killer and doing it itself, so only whatever is inside of the cgroup is frozen (presumably not systemd). Most programs have no need to disable the OOM killer, so it's unlikely they'd run into this problem anyway (they'd just get killed off instead of hanging things).
I don't think that's how it works. A process inside a cgroup has consumed all of the cgroup's RAM. Outside that cgroup you run `ps aux`, which will list all processes, including those inside the container, and this will hang forever. Unless I am misunderstanding it and each cgroup gets its own /proc filesystem.
This is only an issue when you disable OOM killer. If your OOM killer replacement dies, then you have a problem, and you shouldn't have disabled it in the first place.
The kernel isn't going to protect you from turning off critical bits.
Yes, I read the article. My point is that a container that ran out of RAM and is running a buggy process manager affects the entire system, not just the container.
'ps' can hang for all sorts of reasons. If you have a process on an NFS mount that's hung, 'ps' will stat() the 'exe' target instead of the link itself and wait forever for stat to return. The solution? Read /proc/<pid>/status for the program name and lstat() the exe link (there's really no reason to stat it in the first place).
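A minimal sketch of that workaround (the PID and the field parsing are illustrative, not ps's actual code):

    /* Sketch: get a process name without stat()ing through a possibly
     * dead NFS mount.  PID 1234 is a placeholder. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    int main(void) {
        const char *pid = "1234";          /* placeholder PID */
        char path[64], line[256];

        /* The Name: field in /proc/<pid>/status is filled in by the kernel,
         * so reading it touches nothing beyond procfs. */
        snprintf(path, sizeof(path), "/proc/%s/status", pid);
        FILE *f = fopen(path, "r");
        if (!f) { perror("fopen"); return 1; }
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "Name:", 5) == 0) {
                printf("%s", line);
                break;
            }
        }
        fclose(f);

        /* lstat() looks at the symlink itself, so it never touches the
         * (possibly hung) filesystem the link points into. */
        struct stat st;
        snprintf(path, sizeof(path), "/proc/%s/exe", pid);
        if (lstat(path, &st) == 0)
            printf("exe link mode: %o\n", st.st_mode & 07777);
        return 0;
    }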
Yes, I was a little surprised that this article is "news". I was also amused that the author references a previous article that apparently "made a big splash" about fork() failing, when... duh. Any competent C programmer or sysadmin should know both of these things.
Why is that? Doesn't it make more sense to have a deadline of, say, 10 seconds after which the stat/read/etc will simply fail? I remember trying to fix this for broken NFS mounts and not succeeding.
Basically, in order to maintain the integrity of the filesystem state, it is assumed that all NFS operations are only temporarily unavailable, and the system generally waits forever for the server to respond. If the kernel interrupted the operation, the client might decide to act in a way that negatively affects the state of the filesystem.
Of course there's no reason for 'ps' not to build in its own timeout for i/o. It could cause premature failure on loaded boxes, but it wouldn't hurt anything.
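One way you could bolt a deadline onto it, purely as a sketch and not what ps actually does, is to push the risky stat() into a child process and abandon it after a timeout (a hard NFS hang leaves the child stuck in D state, so the parent just gives up on it):

    /* Sketch: stat() with a deadline by delegating to a child process.
     * The path and the 10-second timeout are illustrative. */
    #include <stdio.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int stat_with_timeout(const char *path, int seconds) {
        pid_t pid = fork();
        if (pid < 0) return -1;
        if (pid == 0) {                    /* child: do the possibly-hanging stat */
            struct stat st;
            _exit(stat(path, &st) == 0 ? 0 : 1);
        }
        for (int i = 0; i < seconds * 10; i++) {
            int status;
            if (waitpid(pid, &status, WNOHANG) == pid)
                return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
            usleep(100 * 1000);            /* poll every 100ms */
        }
        /* Deadline passed: try to kill the child; if it is stuck in
         * uninterruptible sleep this will not reap it, we just move on. */
        kill(pid, SIGKILL);
        waitpid(pid, NULL, WNOHANG);
        return -1;
    }

    int main(void) {
        if (stat_with_timeout("/mnt/possibly-dead-nfs/file", 10) != 0)
            fprintf(stderr, "stat failed or timed out\n");
        return 0;
    }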
So in the short term this can be bad. In the long term, I don't think it matters. Basically, there are four cases:
1. oomkiller is on, and has no bugs in it. Great. We are all set.
2. oomkiller is off, process manager is on, and is bug free. Ditto.
3. oomkiller is on, but has bugs in it. Bad shit happens.
4. oomkiller is off, process manager is on, but has bugs. Bad shit happens.
Basically, the question is this: is it possible to develop a process manager that does the job of the oomkiller with code quality as high as the Linux kernel's? I am guessing the answer is yes, so we will always be hitting cases 1 and 2, or at least we'd be hitting cases 3 and 4 with roughly the same frequency.
Now, it's possible that a process manager can fail for reasons other than bugs. Not knowing enough about how cgroups, oom, etc., I can't say that you can write a 100% reliable (assuming it is 100% bug free) process manager without having kernel-level access. Perhaps there is something in the architecture of the whole thing that would prevent that.
>Basically, the question is this: is it possible to develop a process manager that does the job of the oomkiller with code quality as high as the Linux kernel's? I am guessing the answer is yes, so we will always be hitting cases 1 and 2, or at least we'd be hitting cases 3 and 4 with roughly the same frequency.
You can't assume the people who would replace oomkiller with their own process manager are the same ones who could write it well.
>Basically, the question is this: is it possible to develop a process manager that does the job of the oomkiller with code quality as high as the Linux kernel's? I am guessing the answer is yes
I'm not trying to put the kernel devs on a pedestal or anything, but how can you compare a tool that's been tested in a myriad of use cases in literally everything from microcontrollers to clusters and big iron to something that's much less mature and probably won't see a tenth of the use cases? My bet is that you'll see more and more problems as people try to reinvent the wheel with their own "not-oom-killers", and blame it on the Linux kernel (ie, 99% of problems will be case 4, not nearly the same frequency). In that respect this article is incredibly informative as it will hopefully point people to check out their own code first. And the beauty is, if they do come up with a better oom-killer, they can always contribute a patch to the kernel.
Well, that's why I said that in the long term this might be solved. There is nothing as good right now, but in principle it could exist.
Secondly, I doubt that a process manager with support for cgroups would need to run on microcontrollers. At least up to this point, I have not seen many microcontrollers running Linux containers.
Lastly, a hybrid solution could be good: process manager identifies what processes to kill and in what order, oomkiller does the killing.
My point about running on microcontrollers is that the Linux kernel (and oom-killer in particular) has been tested on more use cases than most developers usually consider. This testing/usage has revealed mistaken design decisions and bugs. Very few process managers created from scratch will have been tested on nearly as many use cases and will naturally not be as robust.
I do agree with you about the hybrid solution, with a twist: since the oom-killer has been so widely tested, I would think that if someone needed different performance parameters, it might behoove them to start with the oom-killer and tune it, modifying the source if necessary, instead of re-inventing the wheel from scratch.
Would it be hard to fix this specifically in the kernel? I don't know the details of the implementation, but I suppose those regions of memory holding the required information for /proc to work could be made always accessible, even if a process has hit its memory limit.
If something running within a cgroup can exceed what the cgroup allows then the cgroup is completely worthless as a concept.
Plus you have to assume some cgroups might be TRYING to DOS the machine they're running on (e.g. shared hosting). If you give them a way to bypass the protections of the cgroup and use up additional memory then they WILL take out the machine.
That the process is being memory limited shouldn't make it impossible for another process to read its cmdline. The cmdline data of the process should already be in memory.
EDIT: I suspect this problem is a consequence of the silly design whereby the process's own argv holds the cmdline. Apparently you can even change the process name by changing argv, see:
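A toy illustration of that trick (plain Linux procfs, nothing else assumed): overwrite argv[0] in place and /proc/self/cmdline reports the new name, because the kernel reads cmdline straight out of the process's own argv memory.

    /* Toy example: overwrite argv[0] in place and observe the change in
     * /proc/self/cmdline, which the kernel reads from this same memory. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        size_t len = strlen(argv[0]);
        memset(argv[0], 0, len);
        /* The new name must fit within the original argv[0]'s length,
         * otherwise it gets truncated. */
        strncpy(argv[0], "renamed", len);

        char buf[256];
        int fd = open("/proc/self/cmdline", O_RDONLY);
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        close(fd);
        if (n > 0) {
            buf[n] = '\0';
            /* cmdline is NUL-separated; printf stops at the first NUL,
             * i.e. at the rewritten argv[0]. */
            printf("cmdline now starts with: %s\n", buf);
        }
        return 0;
    }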
I've yet to find the solution. It's pretty much a CentOS 6 install, so no cgroups. The box is I/O resource starved, and for the life of me I can't figure out why it only happens when going into /proc/pid# of a Java process.
This makes no sense. With the OOM killer disabled, the kernel shouldn't hang when memory allocation fails; it should return -ENOMEM. If something isn't handling that correctly, it needs fixing. In particular, if something in the kernel doesn't handle that correctly and propagate -ENOMEM back to userspace, it needs fixing.
This is not about memory allocation. OOM occurs when too much of the allocated memory is being used. For example, forking happens with copy-on-write memory, so if I have a 3G process on a 4G machine, I can fork it 10 times without problems. It's only when those ten processes each start to write to their memory that the physical memory usage will quickly become too much to handle. In the scenario in the blog post, the OOM killer is disabled, and instead Linux can only prevent the memory from being accessed to avoid things getting worse. I would agree that this setup does sound like it needs fixing, but it's what we got in exchange for having cheap process forks.
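A rough way to see this for yourself, assuming default overcommit settings and a buffer size you scale to your machine: the fork itself is cheap, and memory pressure only appears once the child starts dirtying pages.

    /* Sketch: copy-on-write forking.  The fork succeeds without doubling
     * physical memory use; RSS only grows as pages are written to. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define BUF_SIZE (512UL * 1024 * 1024)   /* illustrative; shrink on small boxes */

    int main(void) {
        char *buf = malloc(BUF_SIZE);
        if (!buf) { perror("malloc"); return 1; }
        memset(buf, 'x', BUF_SIZE);           /* parent actually uses the memory */

        pid_t pid = fork();                   /* cheap: pages are shared COW */
        if (pid == 0) {
            sleep(1);                         /* child RSS is still tiny here */
            memset(buf, 'y', BUF_SIZE);       /* now every page gets copied */
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        free(buf);
        return 0;
    }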
Ah, thanks for the clarification; you're right, there's no way around that, short of blocking forks and speculative allocations unless there's enough memory to back them. That's theoretically possible, but extremely harsh, given the common case of fork/exec for instance.
I think it would be worthwhile to find a way around it, but it would require all applications to be conservative about their allocations. For example, instead of forking, sharing memory between processes could become something that has to be explicitly requested. It's probably never going to happen, given that these primitives such as fork and malloc are so ingrained, but when you consider what Linux is being used for nowadays, from mission critical servers to satellites, it might make sense to make things more robust at the expense of compatibility.
>For example, instead of forking, sharing memory between processes could become something that has to be explicitly requested
A fine-grained version of fork, clone(2), already exists.
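Roughly, clone(2) lets the caller decide what gets shared. A hedged sketch (flags and stack size chosen purely for illustration):

    /* Sketch: clone(2) with explicit sharing flags.  With CLONE_VM the child
     * shares the parent's address space; without it, it gets a COW copy as
     * with fork().  Flags and stack size here are purely illustrative. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int counter = 0;

    static int child_fn(void *arg) {
        counter++;              /* visible to the parent only if CLONE_VM is set */
        return 0;
    }

    int main(void) {
        const size_t stack_size = 64 * 1024;
        char *stack = malloc(stack_size);
        if (!stack) { perror("malloc"); return 1; }

        /* Share the address space explicitly; drop CLONE_VM to get
         * fork-like copy-on-write semantics instead. */
        pid_t pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
        if (pid < 0) { perror("clone"); return 1; }

        waitpid(pid, NULL, 0);
        printf("counter = %d\n", counter);   /* 1 with CLONE_VM, 0 without */
        free(stack);
        return 0;
    }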
But it's not just fork, call stacks are also part of this problem.
They shouldn't, but they do tend to have a large quantity of memory reserved for them, which gets physically backed as it is grown into. The kernel uses 4k, 8k, or 16k stacks; a quick test on a Linux x86-64 system suggests that userspace gets 8M stacks by default.
Hackish test code to recurse infinitely and print the stack pointer until a segfault:
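Something along these lines (a sketch, not the original snippet; build with -O0 so the compiler doesn't turn the recursion into a loop):

    /* Hackish: recurse forever, printing the address of a local variable
     * (a stand-in for the stack pointer) until the guard page is hit.
     * Build: gcc -O0 stack_test.c */
    #include <stdio.h>

    static void recurse(int depth) {
        char local;
        printf("depth %d, stack ~%p\n", depth, (void *)&local);
        recurse(depth + 1);
    }

    int main(void) {
        recurse(0);      /* never returns; segfaults when the stack runs out */
        return 0;
    }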
I didn't see this until now. The problem is that when you spawn a new thread, space for its stack needs to be allocated. Since the stack can't easily be reallocated in C/C++, it also needs to be "big enough".
The current practice is to allocate several megabytes of stack for each thread; since memory accounting isn't strict, this doesn't create problems, because unused stack doesn't really take up space.
If you start accounting memory strictly, you will also have to be stricter about allocating stacks for new threads, limiting yourself to just a page or two wherever possible. That is not an easy task.
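For what it's worth, shrinking per-thread stacks is already possible today; a small sketch using pthread_attr_setstacksize (the 64 KiB figure is arbitrary, and PTHREAD_STACK_MIN puts a floor on it):

    /* Sketch: start a thread with a deliberately small stack instead of the
     * default ~8M.  Build with -pthread. */
    #include <limits.h>
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        char scratch[1024];                 /* must fit in the small stack */
        scratch[0] = 0;
        printf("worker running with a small stack\n");
        return NULL;
    }

    int main(void) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);

        size_t sz = 64 * 1024;              /* well below the 8M default */
        if (sz < PTHREAD_STACK_MIN)
            sz = PTHREAD_STACK_MIN;
        pthread_attr_setstacksize(&attr, sz);

        pthread_t t;
        if (pthread_create(&t, &attr, worker, NULL) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            return 1;
        }
        pthread_join(t, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }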
I do not understand. Where is the memory backing /proc/pid/cmdline and who in the container has a COW copy of it? If nobody, how can reading it cause more allocations in the container?
COW due to forking was just an example of how use, not allocation, leads to problems. I got the impression that if the container does not solve its own OOM condition, the kernel simply resorts to blocking all memory access. Perhaps blocking reads is not strictly necessary, as they do not have to increase memory use, but on the other hand it's possible that some of the memory already had to be discarded due to the OOM condition, so it could have become inaccessible.
It may make sense for other things in the OOMed memcg to wait for the OOM killer. Otherwise the OOM condition is likely to crash everything (since nothing handles malloc failures correctly) instead of just crashing the OOMed process.
ISTM the bug is that things outside the memcg are waiting instead of getting ENOMEM.
Is this a hard problem for kernel devs to solve? Or could it be as simple as using a timeout when the kernel's OOM killer is disabled? That is, when a cgroup is at the limit, the kernel waits for some finite amount of time and then starts killing things itself. Or would this cause other problems?
The oom-killer is the kernel devs' solution; if you're disabling it, you should know better, and more importantly, you're on your own. Also, the kind of people who would disable the oom-killer are probably the same kind who wouldn't want the kernel messing around with their cgroup; in many senses, that just looks like another oom-killer, only tuned to wait a bit longer.