TL;DR: If you’re running apt-get in a script and need to wait for another apt-get process to finish (i.e. you’re running into DPkg lock errors), set the DPkg::Lock::Timeout option (e.g. apt-get -o DPkg::Lock::Timeout=3 dist-upgrade).
If you want to know more about locking in apt and how I got to the solution, then carry on!
A flaky user-data script
This all started when I was looking into why instances in an auto scaling group were sometimes failing to bootstrap correctly.
The machines were using a user data script which ultimately ends up being run by the scripts-user module of cloud-init, which is a more enterprise-ready way of saying that it runs when the machine first boots.
Looking at /var/log/cloud-init-output.log, it was clear that we were hitting a race condition partway through the script. The apt-get command we were running was contending with another startup process trying to run a similar command:
    E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 2384 (apt-get)
    N: Be aware that removing the lock file is not a solution and may break your system.
    E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
To make the script run reliably, we needed to make it play nicely with anything else that runs at machine boot, which means handling this race condition by retrying until we successfully acquire the lock.
Working around the race
If you search the internet for something like “bash retry apt lock” (minus the quotes), you won’t have to go far to find a workaround. Those workarounds come in two flavours - using a second program (e.g. fuser or lsof) to detect programs holding the DPkg lock:
    while fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1 ; do
      echo "Waiting for other apt-get instances to exit"
      # Sleep to avoid pegging a CPU core while polling this lock
      sleep 1
    done

    # -y to assume "yes" at prompts, because we're in a script
    apt-get dist-upgrade -y
or using grep to look for the error in the output:
    # This approach has the unfortunate side effect of eating all output from
    # apt-get, making our logs less useful later
    #
    # |& is a shorthand that pipes both stdout and stderr to the next process
    # in the chain
    while apt-get dist-upgrade -y |& grep -q "Could not get lock /var/lib/dpkg/lock-frontend" ; do
      echo "Waiting for other apt-get instances to exit"
      sleep 1
    done
Neither of these is especially satisfying. The first is susceptible to time-of-check to time-of-use (or TOCTTOU) bugs[1], while the second relies on the text output of apt-get never changing and eats your log output[2].
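For what it’s worth, here is a minimal sketch of how the grep-based flavour could keep its output and stop retrying eventually. It swaps the pipe for a temporary file, so the command’s exit status survives without any bash-specific features. The function name retry_on_lock_error and the RETRY_DELAY variable are hypothetical, and the whole thing still depends on apt-get’s error text staying the same.

```shell
# Hypothetical helper: run a command, retrying only while its output
# contains the dpkg lock error, and replay that output so logs stay useful.
# Usage: retry_on_lock_error MAX_ATTEMPTS COMMAND [ARGS...]
retry_on_lock_error() (
    # The body runs in a subshell, so these variables don't leak to the caller
    max_attempts=$1
    shift
    log=$(mktemp) || exit 1
    attempt=1
    while :; do
        # Capture stdout and stderr together, remembering the exit status
        if "$@" >"$log" 2>&1; then status=0; else status=$?; fi
        # Replay the captured output so nothing is lost from the logs
        cat "$log"
        # Success, or any failure other than the lock error, ends the loop
        if [ "$status" -eq 0 ] || ! grep -q "Could not get lock" "$log"; then
            rm -f "$log"
            exit "$status"
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after $attempt attempts" >&2
            rm -f "$log"
            exit "$status"
        fi
        echo "Waiting for other apt-get instances to exit" >&2
        attempt=$((attempt + 1))
        # RETRY_DELAY is an assumed knob; defaults to one second
        sleep "${RETRY_DELAY:-1}"
    done
)
```

Something like retry_on_lock_error 30 apt-get dist-upgrade -y would then try up to 30 times before giving up with apt-get’s own exit status.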
An extra challenge: minimal Ubuntu images
Working around the race condition was made harder in this case by the fact that this was all happening on a minimal Ubuntu image, where neither fuser nor lsof is installed by default.
Under any other circumstances I would happily install them (neither would particularly add to the base install size), but debugging an issue in your package installation process by installing more packages leads to a wonderful Catch-22 situation[3].
    # - Walk through all file descriptors in `/proc`
    # - Run `ls` against them using the `-ls` flag to output files referred to
    # - `grep` for the file we care about
    # - Extract the process ID with `cut`, relying on there being a certain
    #   pattern of forward slashes in the output
    find /proc/*/fd -ls | grep /var/lib/dpkg/lock-frontend | cut -d"/" -f3
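The pipeline above prints process IDs once; to stand in for fuser it still needs a polling loop around it. Below is a rough sketch of that, assuming GNU find (its -lname test matches the symlink target exactly, instead of grepping the -ls output) and a script running as root so that other processes’ /proc entries are readable. lock_holders and wait_for_lock_holders are hypothetical names. As with the fuser loop, this only tells you that a process has the file open, and it is open to the same time-of-check to time-of-use race.

```shell
# Print the PID of every process that has the given file open.
# Assumes GNU find; entries under /proc/<pid>/fd are symlinks to open files.
lock_holders() {
    find /proc/[0-9]*/fd -lname "$1" 2>/dev/null | cut -d/ -f3 | sort -u
}

# Poll until no process has the file open, or give up after $2 seconds.
wait_for_lock_holders() {
    waited=0
    while [ -n "$(lock_holders "$1")" ]; do
        if [ "$waited" -ge "$2" ]; then
            echo "Timed out waiting for $1" >&2
            return 1
        fi
        echo "Waiting for other apt-get instances to exit"
        # Sleep to avoid pegging a CPU core while polling
        sleep 1
        waited=$((waited + 1))
    done
}
```

A call such as wait_for_lock_holders /var/lib/dpkg/lock-frontend 300 would then sit in front of the apt-get command, bounded at five minutes.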
It was in apt all along
The last message in that thread mentions a setting you can pass to apt-get that I hadn’t come across before - DPkg::Lock::Timeout - and which does exactly what we want:
    # apt-get -o DPkg::Lock::Timeout=3 dist-upgrade
    Waiting for cache lock: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 8873 (apt-get)
    Waiting for cache lock: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 8873 (apt-get)
    Waiting for cache lock: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 8873 (apt-get)... Error!
    E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
I’ve set it to 3 seconds in the example above, but in production you’d want to set it to something higher - probably measured in minutes[4].
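In a boot script, that might look something like the sketch below. The 300-second value is an assumption on my part rather than a recommendation from apt, and both the apt_get_locked wrapper and the APT_GET override are hypothetical conveniences, not part of apt.

```shell
# Assumed value: five minutes, expressed in seconds. Tune it to however
# long the other boot-time apt runs on your machines usually take.
LOCK_TIMEOUT_SECONDS=300

# Hypothetical wrapper so every call site gets the same lock timeout.
# APT_GET is an assumed override, handy for dry runs and tests.
apt_get_locked() {
    "${APT_GET:-apt-get}" -o "DPkg::Lock::Timeout=${LOCK_TIMEOUT_SECONDS}" "$@"
}

# Example calls (commented out so this snippet has no side effects):
#   apt_get_locked update
#   apt_get_locked -y dist-upgrade
```

If the timeout expires, apt-get still fails with the same "E: Unable to acquire the dpkg frontend lock" error as before, so the script should treat a non-zero exit status as fatal. Since -o just sets a configuration item, the same setting should also work from a file under /etc/apt/apt.conf.d/ if you’d rather not touch every call site.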
It’s worth noting, as mentioned in the mailing list thread, that this option is automatically set by apt (as opposed to apt-get), though that tool is aimed at interactive use and warns against usage in scripts:
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Closing aside: documentation
If you’ve run into this blog post because you had the exact same issue, hopefully you’ve found it useful (and thank you for continuing to read beyond the point where your problem is presumably fixed).
There was something that really stood out to me when I eventually found the solution: someone took time out of their day to implement retry behaviour inside apt and make it configurable via a flag. Unfortunately, they stopped short of writing docs for the feature.
This isn’t a criticism of the person doing that work (quite likely for free, as getting paid for open source isn’t exactly easy), but I found it quite sad that a missing paragraph in some docs was the difference between their work being used - having the impact they’d hoped for - and hundreds of janky workarounds being written on StackExchange sites.
In contrast with the screenshot above, at the time of writing, searching for the apt option yields no results[5] - surprising given the mailing list thread, but I guess those mailing lists aren’t fully indexed by Google.
I’ve got my fingers crossed that those search results will look different by the time you read this - at the very least because of this blog post, but also because I’m aiming to get it into the apt.conf man page itself[6].
[1] Although in practice these should be exceedingly rare unless you have several startup scripts all contending over the lock at the same time.
[2] It’s probably possible to get the log output back with tee, but I’m glad I didn’t have to.
[3] If I had a problem installing packages, I would simply install some packages to fix it.
[4] You can set it to the special value of -1 to retry lock acquisition forever, though I’m not a huge fan of that and prefer to eventually time out - unbounded retries are generally not a good idea.
[5] The lone result is only there because Google has started including results that ignore your search syntax - in this case the use of double quotes for exact matches.
[6] I’m planning to send them a patch once I figure out if they want that as a pull request on their GitHub mirror, Salsa (their self-hosted GitLab instance which it’s unclear if they want people outside of Debian to use) or on a mailing list thread via