Debugging riddle of the day

One of our services failed to start on a test system (Ubuntu 12.04 on amd64). The stdout/stderr log streams contained only the string “Permission denied” – less than helpful. strace showed that the service tried to create a file under /run, which it doesn't have write permissions to. This caused the it to bail out:

open("/run/some_service", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = -1
    EACCES (Permission denied)

Grepping the source code and configuration files for /run didn't turn up anything that could explain this open() call. Debugging with gdb gave further hints:

Breakpoint 2, 0x00007ffff73e3ea0 in open64 () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ffff73e3ea0 in open64 () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7bd69bf in shm_open () from /lib/x86_64-linux-gnu/librt.so.1
#2  0x0000000000400948 in daemonize () at service.cpp:93
#3  0x00000000004009ac in main () at main.cpp:24
(gdb) p (char*)$rdi
$1 = 0x7fffffffe550 "/run/some_service"
(gdb) frame 2
#2  0x0000000000400948 in daemonize () at service.cpp:93
9           int fd = shm_open(fname.c_str(), O_RDWR | O_CREAT, 0644);
(gdb) p fname
$2 = {...., _M_p = 0x602028 "/some_service"}}

The open("/run/some_service", ...) was caused by an shm_open("/some_service", ...).

This code is working on other machines, why does it fail on this particular one? Can you figure it out? Bonus points if you can explain why it is trying to access /run and not some other directory. You might find the shm_open() man page and source code helpful.

I'll be waiting for you.

The solution is pretty evident after examining the Linux version of shm_open(). By default, it tries to create shared memory files under /dev/shm. If that doesn't exist, it will pick the first tmpfs mount point from /proc/mounts.

In Ubuntu 12.04, /dev/shm is a symlink to /run/shm. On this machine the symlink was missing, which caused shm_open() to go hunting for a tmpfs filesystem, and /run happened to be the first one in /proc/mounts.

Re-creating the symlink solved the problem. Why it was missing in the first place is still unclear. In the aftermath, we're also improving the error messages in this part of the code to make such issues easier to diagnose.


Comments