dh_virtualenv and long package names (FileNotFound error)

Interesting tidbit about Linux:

A maximum line length of 127 characters is allowed for the first line in a #! executable shell script.

Should be enough, right? Wrong.

Well, not if you are using dh_virtualenv with long package names, anyway:

Installing pip...
  Error [Errno 2] No such file or directory while executing command /tmp/lo...me/bin/easy_install /usr/share/python-virtualenv/pip-1.1.tar.gz
...Installing pip...done.
Traceback (most recent call last):
  File "/usr/bin/virtualenv", line 3, in <module>
    virtualenv.main()
  File "/usr/lib/python2.7/dist-packages/virtualenv.py", line 938, in main
    never_download=options.never_download)
  File "/usr/lib/python2.7/dist-packages/virtualenv.py", line 1054, in create_environment
    install_pip(py_executable, search_dirs=search_dirs, never_download=never_download)
  File "/usr/lib/python2.7/dist-packages/virtualenv.py", line 643, in install_pip
    filter_stdout=_filter_setup)
  File "/usr/lib/python2.7/dist-packages/virtualenv.py", line 976, in call_subprocess
    cwd=cwd, env=env)
  File "/usr/lib/python2.7/subprocess.py", line 679, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/usr/bin/dh_virtualenv", line 106, in <module>
    sys.exit(main() or 0)
  File "/usr/bin/dh_virtualenv", line 83, in main
    deploy.create_virtualenv()
  File "/usr/lib/python2.7/dist-packages/dh_virtualenv/deployment.py", line 112, in create_virtualenv
    subprocess.check_call(virtualenv)
  File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['virtualenv', '--system-site-packages', '--setuptools', 'debian/long-package-name/usr/share/python/long-package-name']' returned non-zero exit status 1
make: *** [binary-arch] Error 1
dpkg-buildpackage: error: fakeroot debian/rules binary gave error exit status 2

dh_virtualenv is used to create Debian packages that include Python virtualenvs. It is one of the better ways of packaging Python software, especially if there are Python dependencies that are not available in Debian or Ubuntu. When building a .deb package, it creates a virtualenv in a location such as:

/<build-directory>/debian/<packagename>/usr/share/python/<packagename>

This virtualenv has several tools under its bin/ directory, and they all have the absolute path of the virtualenv's Python interpreter hard-coded in their #! shebang line:

#!/<build-directory>/debian/<packagename>/usr/share/python/<packagename>/bin/python

Given that <build-directory> often contains the package name as well, it's easy to overflow the 128-byte limit of the #! shebang line. In my case, with a ~30 character package name, the path length grew to 160 characters!

Consequently, the kernel truncated the interpreter path and could no longer find the Python executable, so running any of the tools from the bin/ directory gave an ENOENT (file not found) error. This is what happened when virtualenv tried to install pip during the initial setup. The root cause of this error is not immediately obvious, to say the least.

To find out whether this affects you, measure the length of the first line of any script with wc:

head -n 1 /path/to/virtualenv/bin/easy_install | wc -c

If the result is larger than 128 (wc -c also counts the trailing newline), it's probably the cause of the problem.
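For a whole virtualenv, the same check can be scripted. Here is a minimal sketch in Python, assuming the 127-character limit quoted at the top of this post; find_long_shebangs and SHEBANG_LIMIT are names made up for this example, not part of dh_virtualenv or virtualenv:

```python
import os

# Assumption: 127 usable characters, as quoted at the top of this post.
# Other kernel versions may use a different limit.
SHEBANG_LIMIT = 127


def find_long_shebangs(bin_dir, limit=SHEBANG_LIMIT):
    """Return (path, length) pairs for scripts whose #! line exceeds limit."""
    offenders = []
    for name in sorted(os.listdir(bin_dir)):
        path = os.path.join(bin_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            # Length is measured without the trailing newline.
            first_line = handle.readline().rstrip(b"\r\n")
        if first_line.startswith(b"#!") and len(first_line) > limit:
            offenders.append((path, len(first_line)))
    return offenders
```

Calling find_long_shebangs('/path/to/virtualenv/bin') should return an empty list if all scripts are fine, and the offending scripts with their shebang lengths otherwise.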

The fix is to change the package name and/or the build location to something shorter. The alternative would be to patch the Linux kernel, which – depending on your preferences – sounds either fun or really unpleasant. Suit yourself!

Plotting maps with Folium

Data visualization in Python is a well-solved problem by now. Matplotlib and its prettier cousin Seaborn are widely used to generate static graphs. Bokeh generates HTML files with interactive, JavaScript-based graphs. It's a great way of sharing data with other people who don't have a Python development environment ready. Several other libraries exist for more specialized purposes.

What was missing for a long time were good map libraries. Plotting capabilities were fine, but basemap support in the existing libraries was very limited. For example, the popular Matplotlib-basemap has great plot types (contour maps, heatmaps, ...) but can't show any high-resolution maps: it only has country/state shapes or whole-world images. Consequently, it's useless for drawing city- or street-level maps, unless you want to set up your own tile server (you don't).

Along comes Folium, a library that generates interactive maps in HTML format based on Leaflet.js. It supports, among others, OpenStreetMap and MapBox base layers, which look great and provide enough detail for large-scale maps.

Here is an example that shows some GPS data I cleaned up with a Kalman filter:

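import folium

def plot(points, center):
    # points: list of (lat, lon) pairs; center: (lat, lon) of the initial view
    map_osm = folium.Map(location=center, zoom_start=16, max_zoom=23)
    map_osm.line(locations=points)  # early Folium API; newer releases use folium.PolyLine
    map_osm.create_map(path='folium-example.html')  # newer releases use map_osm.save()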

Here's what it looks like. I find it pretty neat, especially given that it took only 3 lines of code to create:


Returning generators from with statements

Recently, an interesting issue came up at work that involved a subtle interaction between context managers and generator functions. Here is some example code demonstrating the problem:

import contextlib


@contextlib.contextmanager
def resource():
    """Context manager for some resource"""

    print("Resource setup")
    yield
    print("Resource teardown")


def _load_values():
    """Load a list of values (requires resource to be held)"""

    for i in range(3):
        print("Generating value %d" % i)
        yield i


def load_values():
    """Load values while holding the required resource"""

    with resource():
        return _load_values()

This is the output when run:

>>> for val in load_values(): pass
Resource setup
Resource teardown
Generating value 0
Generating value 1
Generating value 2

Whoops. The resource is destroyed before the values are actually generated. This is obviously a problem if the generator depends on the existence of the resource.

When you think about it, it's pretty clear what's going on. Calling _load_values() produces a generator object, whose code is only executed when values are requested. load_values() returns that generator, exiting the with statement and leading to the destruction of the resource. When the outer for loop (for val) comes around to iterating over the generator, the resource is long gone.

How do you solve this problem? In Python 3.3 and newer, you can use the yield from syntax to turn load_values() into a generator as well. The execution of load_values() is halted at the yield from point until the child generator is exhausted, at which point it is safe to dispose of the resource:

def load_values():
    """Load values while holding the required resource"""

    with resource():
        yield from _load_values()
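To check that the teardown now really happens last, here is a small self-contained variant of the example. It records what happens in a list instead of printing; the events list is an addition for illustration only:

```python
import contextlib

events = []  # illustrative log, not part of the original example


@contextlib.contextmanager
def resource():
    events.append("setup")
    yield
    events.append("teardown")


def _load_values():
    for i in range(3):
        events.append("value %d" % i)
        yield i


def load_values():
    # The with block stays open until the child generator is exhausted.
    with resource():
        yield from _load_values()


values = list(load_values())
print(events)
# → ['setup', 'value 0', 'value 1', 'value 2', 'teardown']
```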

In older Python versions, an explicit for loop over the child generator is required:

def load_values():
    """Load values while holding the required resource"""

    with resource():
        for val in _load_values():
            yield val

Yet another method is to turn the result of _load_values() into a list inside the with block and return that instead. This uses more memory, since all values have to be held in memory at the same time, so it's only appropriate for relatively short lists.
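A sketch of the list-based variant, again self-contained and with an illustrative events list standing in for the print calls. The important detail is that list() is called while the resource is still held:

```python
import contextlib

events = []  # illustrative log, not part of the original example


@contextlib.contextmanager
def resource():
    events.append("setup")
    yield
    events.append("teardown")


def _load_values():
    for i in range(3):
        events.append("value %d" % i)
        yield i


def load_values():
    with resource():
        # list() exhausts the generator here, before the with block exits.
        return list(_load_values())


values = load_values()
print(values)
# → [0, 1, 2]
print(events)
# → ['setup', 'value 0', 'value 1', 'value 2', 'teardown']
```

Here load_values() is an ordinary function again, so callers get a plain list and cannot accidentally iterate after the resource is gone.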

To sum up, it's a bad idea to return generators from inside with statements. While what's going on isn't terribly confusing, it's a wee bit subtle, and not many people think about it until they run into the issue. Hope this heads-up helps.

published November 12, 2015
tags python

A better way for deleting Docker images and containers

In one of my last posts, I described the current (sad) state of managing Docker container and image expiration. Briefly, Docker creates new containers and images for many tasks, but there is no good way to automatically remove them. The best practice seems to be a rather hack-ish bash one-liner.

Since this wasn't particularly satisfying, I decided to do something about it. Here, I present docker-cleanup, a Python application for removing containers and images based on a configurable set of rules.

This is a rules file example:

# Keep currently running containers, delete others if they last finished
# more than a week ago.
KEEP CONTAINER IF Container.State.Running;
DELETE CONTAINER IF Container.State.FinishedAt.before('1 week ago');

# Delete dangling (unnamed and not used by containers) images.
DELETE IMAGE IF Image.Dangling;

Clear, expressive, straightforward. The rule language can do a whole lot more and provides a readable and intuitive way to define removal policies for images and containers.

Head over to GitHub, give it a try, and let me know what you think!

Using Python slice objects for fun and profit

Just a quick tip about the little-known slice objects in Python. They are used to implement the slicing syntax for sequence types (lists, strings):

s = "The quick brown fox jumps over the lazy dog"

# s[4:9] is internally converted (and equivalent) to s[slice(4, 9)].
assert s[4:9] == s[slice(4, 9)]

# An omitted bound is encoded as None
assert s[20:] == s[slice(20, None)]

Slice objects can be used in normal code too, for example for tracking regions in strings: instead of having separate start_idx and end_idx variables (or writing a custom class/namedtuple), simply roll the indices into a slice.

# A column-aligned table:
table = ('REPOSITORY   TAG      IMAGE ID       CREATED       VIRTUAL SIZE',
         '<none>       <none>   0987654321AB   2 hours ago   385.8 MB',
         'chris/web    latest   0123456789AB   2 hours ago   385.8 MB',
        )
header, *entries = table

# Compute the column slices by parsing the header. Gives a list of slices.
slices = find_column_slices(header)

for entry in entries:
    repo, tag, image_id, created, size = [entry[sl].strip() for sl in slices]
    ...

This is mostly useful when the indices are computed at runtime and applied to more than one string.
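The find_column_slices() helper is not shown above. One possible implementation, as a rough sketch: it assumes that columns are separated by at least two spaces and that column names contain at most single spaces (as in "IMAGE ID"). This is a guess at a suitable helper, not the original code:

```python
import re


def find_column_slices(header):
    """Return one slice per column of a column-aligned header line.

    Assumes columns are separated by two or more spaces.
    """
    # A column name is a run of words separated by single spaces.
    starts = [m.start() for m in re.finditer(r"\S+(?: \S+)*", header)]
    # Each column ends where the next one starts; the last one is open-ended.
    ends = starts[1:] + [None]
    return [slice(start, end) for start, end in zip(starts, ends)]


header = "REPOSITORY   TAG      IMAGE ID"
row = "chris/web    latest   0123456789AB"
# Each row is cut at the column boundaries found in the header.
print([row[sl].strip() for sl in find_column_slices(header)])
```

Because the last slice ends at None, the final column extends to the end of each row, however long it is.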

More generally, slice objects encapsulate regions of strings/lists/tuples, and are an appropriate tool for simplifying code that operates on start/end indices. They provide a clean abstraction, make the code more straightforward and save a bit of typing.