Boost Range for Humans

Boost Range encapsulates the common C++ pattern of passing begin/end iterator pairs around by combining them into a single range object. It makes code that operates on containers much more readable. One wonders why such functionality was not included in the C++ standard library in the first place; indeed, similar ideas could be added to C++17 (see N4128 and Ivan Cukic's Meeting C++ presentation). In my opinion, Boost Range is something that every C++ programmer should know about.
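
As a quick taste (a minimal sketch of my own; any reasonably recent Boost should compile it), compare the iterator-pair style with the range style:

#include <boost/range/algorithm/sort.hpp>
#include <boost/range/algorithm/count.hpp>
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{3, 1, 4, 1, 5};

    std::sort(v.begin(), v.end());  // classic STL: pass an iterator pair
    boost::sort(v);                 // Boost Range: pass the container as a range

    // Range-based algorithms also read naturally inside expressions.
    std::cout << boost::count(v, 1) << '\n';  // prints 2
}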

The library is reasonably well documented, but I was often missing concrete code examples and an explicit mention of which headers are required for which function. Since this presumably happens to other people as well, I invested the time to change that situation.

Thus, I present Boost Range for Humans. It contains working example code for every function in Boost Range, along with the required headers and links to the official documentation and the latest source code. I hope it will make Boost Range more accessible and further its adoption.

Next week, we'll look into some of the highlights of what Boost Range can offer.

Debugging riddle of the day

One of our services failed to start on a test system (Ubuntu 12.04 on amd64). The stdout/stderr log streams contained only the string “Permission denied” – less than helpful. strace showed that the service tried to create a file under /run, to which it has no write permission. This caused it to bail out:

open("/run/some_service", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = -1
    EACCES (Permission denied)

Grepping the source code and configuration files for /run didn't turn up anything that could explain this open() call. Debugging with gdb gave further hints:

Breakpoint 2, 0x00007ffff73e3ea0 in open64 () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007ffff73e3ea0 in open64 () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7bd69bf in shm_open () from /lib/x86_64-linux-gnu/librt.so.1
#2  0x0000000000400948 in daemonize () at service.cpp:93
#3  0x00000000004009ac in main () at main.cpp:24
(gdb) p (char*)$rdi
$1 = 0x7fffffffe550 "/run/some_service"
(gdb) frame 2
#2  0x0000000000400948 in daemonize () at service.cpp:93
93          int fd = shm_open(fname.c_str(), O_RDWR | O_CREAT, 0644);
(gdb) p fname
$2 = {...., _M_p = 0x602028 "/some_service"}}

The open("/run/some_service", ...) was caused by a shm_open("/some_service", ...) call.

This code works on other machines, so why does it fail on this particular one? Can you figure it out? Bonus points if you can explain why it is trying to access /run and not some other directory. You might find the shm_open() man page and source code helpful.

I'll be waiting for you.

The solution is pretty evident after examining the Linux version of shm_open(). By default, it tries to create shared memory files under /dev/shm. If that doesn't exist, it will pick the first tmpfs mount point from /proc/mounts.
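
Here is a simplified sketch of my own illustrating that lookup (not glibc's actual code, which differs in detail and between versions):

#include <cstdio>
#include <cstring>
#include <string>
#include <sys/stat.h>

// Where would shm_open() put its files? A simplified illustration of the
// two-step lookup described above.
static std::string find_shm_dir() {
    struct stat st;

    // Step 1: prefer /dev/shm if it exists (note that stat() follows
    // symlinks, so a dangling /dev/shm symlink fails this check).
    if (stat("/dev/shm", &st) == 0 && S_ISDIR(st.st_mode))
        return "/dev/shm/";

    // Step 2: otherwise take the first tmpfs mount point from /proc/mounts.
    if (std::FILE* f = std::fopen("/proc/mounts", "r")) {
        char dev[256], dir[256], type[256];
        while (std::fscanf(f, "%255s %255s %255s%*[^\n]", dev, dir, type) == 3) {
            if (std::strcmp(type, "tmpfs") == 0) {
                std::fclose(f);
                return std::string(dir) + "/";
            }
        }
        std::fclose(f);
    }
    return "";  // no usable shared-memory directory found
}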

In Ubuntu 12.04, /dev/shm is a symlink to /run/shm. On this machine the symlink was missing, which caused shm_open() to go hunting for a tmpfs filesystem, and /run happened to be the first one in /proc/mounts.

Re-creating the symlink solved the problem. Why it was missing in the first place is still unclear. In the aftermath, we're also improving the error messages in this part of the code to make such issues easier to diagnose.

Mixing C runtime library versions

When compiling code with Microsoft's Visual C/C++ compiler, a dependency on Microsoft's C runtime library (CRT) is introduced. The CRT can be linked either statically or dynamically, and it comes in several versions (different version numbers as well as debug and release variants).

Complications arise when libraries linked into the same program use different CRTs. This happens if they were compiled with different compiler versions, or with different compiler flags (static/dynamic CRT linkage or release/debug switches). In theory, this could be made to work, but in practice it is asking for trouble:

  • If the CRT versions differ (either in version number or in the debug/release flag), you can't reliably share objects created by CRT A with any code that uses CRT B. The reason is that the two CRTs may use a different memory layout (structure layout) for that object. The memory location where CRT A stored the object's size might be interpreted by CRT B as a pointer, leading to a crash when CRT B tries to access that memory.
  • Even if the same CRT version is included twice (once statically, once dynamically linked), the two copies won't share a heap. Each CRT tracks the memory it allocates itself, but knows nothing about objects allocated by the other CRT. If CRT A tries to free memory allocated by CRT B, that causes heap corruption as the two CRTs trample on each other's feet. While you can freely share objects between the CRTs in this case, you have to be careful whenever memory is allocated or freed (see the sketch after this list). This can sometimes be managed when writing C code, but is very hard to do correctly in C++ (where e.g. pushing to a vector can cause a reallocation of its internal buffer).
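
To make the second point concrete, here is a hypothetical sketch (the function name make_buffer is made up for illustration) of how a DLL and an executable built against different CRTs can corrupt the heap:

// mylib.dll, linked against CRT A
#include <cstddef>

extern "C" __declspec(dllexport) char* make_buffer(std::size_t n) {
    return new char[n];  // allocated on CRT A's heap
}

// program.exe, linked against CRT B
#include <cstddef>

extern "C" __declspec(dllimport) char* make_buffer(std::size_t n);

int main() {
    char* p = make_buffer(64);
    delete[] p;  // freed on CRT B's heap: heap corruption, likely a crash later
}

The usual mitigation, short of unifying the CRTs, is to have the DLL also export a matching deallocation function, so that every allocation is freed by the module (and thus the CRT) that created it.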

Accordingly, having multiple CRTs in the same process is fragile at best. When mixing CRTs, there are no tools to check whether objects are shared in a way that's problematic, and manual tracking is subtle, easy to get wrong, and unreliable. Mistakes will lead to difficult-to-diagnose bugs and intermittent crashes, generally at the most inconvenient times.

To keep your sanity, ensure that all code going into your program uses the same CRT.1 Consequently, all program code, as well as all libraries, needs to be compiled from scratch with the same runtime library options (/MD or /MT). Pre-compiled libraries are a major headache, because they force you to use a compiler version whose CRT matches theirs. If multiple pre-compiled libraries use different CRT versions, there may be no viable solution at all.

This situation will improve with the runtime library refactoring in VS2015, which promises CRT compatibility across subsequent compiler versions. Thus, this inconvenience should mostly be solved in the future.


1 Dependency Walker can be used to list all dynamically linked CRT versions. I'm not sure whether the same can be done for statically linked CRTs; I generally avoid those.

Profiling

Profiling is hard. Measuring the right metric and correctly interpreting the obtained data can be difficult even for relatively simple programs.

For performance optimization, I'm a big fan of the poor man's profiler: run the binary to be analyzed under a debugger, periodically stop execution, grab a backtrace, and continue. After doing this a few times, the hotspots become apparent. This works amazingly well in practice and gives a reliable picture of where time is spent, without the danger of skewed results from instrumentation overhead.
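
In its simplest form this is just gdb attached to the running process a handful of times in a row (<pid> being a placeholder for the process ID of the program under test):

gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p <pid>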

Sometimes it's nice to get a more fine-grained view. That is, not only find the hotspot, but get an overview of how much time is spent where. That's where 'real' profilers come in handy.

Under Windows, I like the built-in "Event Tracing for Windows" (ETW), which produces files that can be analyzed with Xperf/Windows Performance Analyzer. It is a really well-thought-out system, and the Xperf UI is amazing in the analysis capabilities it offers. Probably the best place to start reading up on this is ETW Central.

Under Linux, I haven't yet found a profiler I can really recommend. gprof and sprof are both ancient and have severe limitations. OProfile may be nice, but I haven't had a chance to use it yet, as it wasn't available for my Ubuntu LTS release.

I have used Callgrind from the Valgrind toolkit in combination with the KCachegrind GUI analyzer. I typically invoke it like this:

valgrind --tool=callgrind --callgrind-out-file=callgrind-cpu.out ./program-to-profile
kcachegrind callgrind-cpu.out

Callgrind works by instrumenting the binary under test. It slows down program execution, often by a factor of 10. Further, it only measures CPU time, so sleeping times are not included. This makes it unsuitable for programs that wait a significant amount of time for network or disk operations to complete. Despite these drawbacks, it's pretty handy if CPU time is all that you're interested in.

If blocking times are important (as they are for so many modern applications - we generally spend less time computing and more time communicating), gperftools is a decent choice. It includes a CPU profiler that can be run in real-time (wall-clock) sampling mode, and the results can be viewed in KCachegrind. It is recommended to link libprofiler.so into the binary to be analyzed, but using LD_PRELOAD works decently well:

CPUPROFILE_REALTIME=1 CPUPROFILE=prof.out LD_PRELOAD=/usr/lib/libprofiler.so ./program-to-profile
google-pprof --callgrind ./program-to-profile prof.out > callgrind-wallclock.out
kcachegrind callgrind-wallclock.out

If it works, this gives a good overall profile of the application. Unfortunately, it sometimes fails: on amd64, there are sporadic crashes from within libunwind. It's possible to just ignore those and rerun the profile; usable data is obtained about half the time.

The more serious problem is that CPUPROFILE_REALTIME=1 causes gperftools to use SIGALRM internally, conflicting with any application that wants to use that signal for itself. Looking at the profiler source code, it should be possible to work around this limitation with the undocumented CPUPROFILE_PER_THREAD_TIMERS and CPUPROFILE_TIMER_SIGNAL environment variables, but I haven't been able to get that to work yet.

You'd think that perf has something to offer in this area as well. Indeed, it has a CPU profiling mode (with nice flamegraph visualizations) and a sleeping time profiling mode, but I couldn't find a way to combine the two to get a real-time profile.
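
For reference, the CPU-time side of that is the familiar sampling workflow (shown here with call-graph recording enabled):

perf record -g ./program-to-profile
perf report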

Overall, there still seems to be room for a good, reliable real-time sampling profiler under Linux. If I'm missing something, please let me know!