Running strace for multiple processes

Just a quick note about strace, the ancient Linux system-call tracing tool.

It can trace multiple processes at once: simply start it with multiple -p arguments (the numbers give the processes' PIDs):

sudo strace -p 2916 -p 2929 -p 2930 -p 2931 -p 2932 -o /tmp/strace.log

This is great for tracing daemons which use separate worker processes, for example Apache with mpm-prefork.

◆ ◆ ◆

Plotting maps with Folium

Data visualization in Python is a well solved problem by now. Matplotlib and it's prettier cousin Seaborn are widely used to generate static graphs. Bokeh generates HTML files with interactive, JavaScript-based graphs. It's a great way of sharing data with other people who don't have a Python development environment ready. Several other libraries exist for more specialized purposes.

What has been missing for a long time was good map libraries. Plotting capabilities were fine, but basemap support of the existing libraries was very limited. For example, the popular Matplotlib-basemap has great plot types (contour maps, heatmaps, ...) but can't show any high-resolution maps: it only has country/state shapes or whole-world images. Consequently, it's useless for drawing city or street level maps, unless you want to set up your own tile server (you don't).

Along comes Folium, a library that generates interactive maps in HTML format based on Leaflet.js. It supports, among others, OpenStreetMap and MapBox base layers which look great and provide enough details for large-scale maps.

Here is an example that shows some GPS data I cleaned up with a Kalman filter:

def plot(points, center):
    map_osm = folium.Map(location=center, zoom_start=16, max_zoom=23)
    map_osm.line(locations=points)
    map_osm.create_map(path='folium-example.html')

Here's what it looks like. I find it pretty neat, especially given that it took only 3 lines of code to create:

◆ ◆ ◆

Mixing C runtime library versions

When compiling code with Microsoft's Visual C/C++ compiler, a dependency on Microsoft's C runtime library (CRT) is introduced. The CRT can be linked either statically or dynamically, and comes in several versions (different version numbers as well as in debug and release variants).

Complications arise when libraries linked into the same program use different CRTs. This happens if they were compiled with different compiler versions, or with different compiler flags (static/dynamic CRT linkage or release/debug switches). In theory, this could be made to work, but in practice it is asking for trouble:

If the CRT versions differ (either version number or debug/release flag), you can't reliably share objects generated by CRT A with any code that uses CRT B. The reason is that the two CRTs may have a different memory layout (structure layout) for the that object. The memory location that CRT A wrote the object size to might be interpreted by CRT B as a pointer, leading to a crash when CRT B tries to access that memory.
Even if the same CRT version is included twice (once statically, once dynamically linked), they won't share a heap. Both CRTs track the memory they allocate individually, but they don't know anything about objects allocated by the other CRT. If CRT A tries to free memory allocated by CRT B, that causes heap corruption as the two CRTs trample on each others feet. While you can freely share objects between CRTs, you have to be careful whenever memory is allocated or freed. This can sometimes be managed when writing C code, but is very hard to do correctly in C++ (where e.g. pushing to a vector can cause a reallocation of its internal buffer).

Accordingly, having multiple CRTs in the same process is fragile at best. When mixing CRTs, there are no tools to check whether objects are shared in a way that's problematic, and manual tracking is subtle, easy to get wrong, and unreliable. Mistakes will lead to difficult-to-diagnose bugs and intermittent crashes, generally at the most inconvenient times.

To keep your sanity, ensure that all code going into your program uses the same CRT.¹ Consequently, all program code, as well as all libraries, need to be compiled from scratch using the same runtime library options (/MD or /MT). Pre-compiled libraries are a major headache, because they force the use of a specific compiler version to get matching CRT version requirements. If multiple pre-compiled libraries use different CRT versions, there may not be any viable solution at all.

This situation will improve with the runtime library refactoring in VS2015, which promises CRT compatibility to subsequent compiler versions. Thus, this inconvenience should mostly be solved in the future.

¹ Dependency Walker can be used to list all dynamically linked CRT versions. I'm not sure whether that can be done for statically linked CRTs, I generally avoid those.

◆ ◆ ◆

Debian's most annoying warning message: "Setting locale failed"

You ssh into a server or you enter a chroot, and the console overflows with these messages:

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = "en_US:en",
        LC_ALL = (unset),
        LC_TIME = "en_US.utf8",
        LC_CTYPE = "de_AT.UTF-8",
        LC_COLLATE = "C",
        LC_MESSAGES = "en_US.utf8",
        LANG = "de_AT.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_MESSAGES to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory

Gaaaaarrrrh!

Fortunately the fix is easy (adjust the locale names for your situation):

locale-gen en_US.UTF-8 de_AT.UTF-8

Finally, peace on the console.

◆ ◆ ◆

Profiling

Profiling is hard. Measuring the right metric and correctly interpreting the obtained data can be difficult even for relatively simple programs.

For performance optimization, I'm a big fan of the poor man's profiler: run the binary to analyze under a debugger, periodically stop the execution, get a backtrace and continue. After doing this a few times, the hotspots will become apparent. This works amazingly well in practice and gives a reliable picture of where time is spent, without the danger of skewed results from instrumentation overhead.

Sometimes it's nice to get a more fine-grained view. That is, not only find the hotspot, but get an overview how much time is spent where. That's where 'real' profilers come in handy.

Under Windows, I like the built-in "Event Tracing for Windows" (ETW), which produces files that can be analyzed with Xperf/Windows Performance Analyzer. It is a really well thought out system, and the Xperf UI is amazing in the analyzing abilities that it offers. Probably the best place to start reading up on this is ETW Central.

Under Linux, I haven't found a profiler I can really recommend, yet. gprof and sprof are both ancient and have severe limitations. OProfile may be nice, but I haven't had a chance to use it yet, as it wasn't available for my Ubuntu LTS release.

I have used Callgrind from the Valgrind toolkit in combination with the KCachegrind GUI analyzer. I typically invoke it like this:

valgrind --tool=callgrind --callgrind-out-file=callgrind-cpu.out ./program-to-profile
kcachegrind callgrind-cpu.out

Callgrind works by instrumenting the binary under test. It slows down program execution, often by a factor of 10. Further, it only measures CPU time, so sleeping times are not included. This makes it unsuitable for programs that wait a significant amount of time for network or disk operations to complete. Despite these drawbacks, it's pretty handy if CPU time is all that you're interested in.

If blocking times are important (as they are for so many modern applications - we generally spend less time computing and more time communicating), gperftools is a decent choice. It includes a CPU profiler that can be run in real-time sampling mode, and the results can viewed in KCachegrind. It is recommended to compile libprofiler.so into the binary to analyze, but using LD_PRELOAD works decently well:

CPUPROFILE_REALTIME=1 CPUPROFILE=prof.out LD_PRELOAD=/usr/lib/libprofiler.so ./program-to-profile
google-pprof --callgrind ./program_to_profile prof.out > callgrind-wallclock.out
kcachegrind callgrind-wallclock.out

If it works, this gives a good overall profile of the application. Unfortunately, it sometimes fails: on amd64, there are sporadic crashes from within libunwind. It's possible to just ignore those and rerun the profile, at least interesting data is obtained 50% of the time.

The more serious problem is that CPUPROFILE_REALTIME=1 causes gperftools to use SIGALARM internally, conflicting with any applications that want to use that signal for themselves. Looking at the profiler source code, it should be possible to work around this limitation with the undocumented CPUPROFILE_PER_THREAD_TIMERS and CPUPROFILE_TIMER_SIGNAL environment variables, but I couldn't get that to work yet.

You'd think that perf has something to offer in this area as well. Indeed, it has a CPU profiling mode (with nice flamegraph visualizations) and a sleeping time profiling mode, but I couldn't find a way to combine the two to get a real-time profile.

Overall, there still seems to be room for a good, reliable real-time sampling profiler under Linux. If I'm missing something, please let me know!

Running strace for multiple processes

Plotting maps with Folium

Mixing C runtime library versions

Debian's most annoying warning message: "Setting locale failed"

Profiling

Writings

Recent

Archive

Search