Waiting for apt locks without the hacky bash scripts
Tl;dr: If you’re running apt-get in a script and need to wait for other apt-get processes to finish (i.e. you’re running into DPkg lock errors), set the DPkg::Lock::Timeout option (e.g. apt-get -o DPkg::Lock::Timeout=3 dist-upgrade).
If you want to know more about locking in apt and how I got to the solution, then carry on!
A flaky user-data script
This all started when I was looking into why instances in an auto scaling group were sometimes failing to bootstrap correctly.
The machines were using a user data script, which is ultimately run by the scripts-user module of cloud-init. That is a more enterprise-ready way of saying that it runs when the machine first boots.
Looking in /var/log/cloud-init-output.log, it was clear that we were hitting a race condition part way through the script. The apt-get command we were running was contending with another startup process trying to run a similar command:
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 2384 (apt-get)
N: Be aware that removing the lock file is not a solution and may break your system.
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
To make the script run reliably, we needed to make it play nicely with anything else that runs at machine boot, which means handling this race condition by retrying until we successfully acquire the lock.
Working around the race
If you search the internet for something like “bash retry apt lock” (minus the quotes), you won’t have to go far to find a workaround. Those workarounds come in two flavours - using a second program (e.g. fuser or lsof) to detect programs holding the DPkg lock:
while fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1 ; do
    echo "Waiting for other apt-get instances to exit"
    # Sleep to avoid pegging a CPU core while polling this lock
    sleep 1
done
# -y to assume "yes" at prompts, because we're in a script
apt-get dist-upgrade -y
or using grep to look for the error in the output:
# This approach has the unfortunate side effect of eating all output from
# apt-get, making our logs less useful later
#
# |& is a shorthand that pipes both stdout and stderr to the next process
# in the chain
while apt-get dist-upgrade -y |& grep -q "Could not get lock /var/lib/dpkg/lock-frontend" ; do
    echo "Waiting for other apt-get instances to exit"
    sleep 1
done
Neither of these is especially satisfying. The first is susceptible to time-of-check to time-of-use (or TOCTTOU) bugs[1], while the second relies on the text output of apt-get never changing and eats your log output[2].
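For illustration, here is a rough sketch of how the second flavour could keep its output by copying it through tee into a temporary file and grepping that file instead. The function name retry_on_lock and the RETRY_SLEEP_SECONDS variable are made up for this example, and it still depends on the exact wording of the error message, so treat it as a demonstration of the idea rather than a recommendation:

```shell
# retry_on_lock CMD [ARGS...]
# Hypothetical helper: reruns CMD for as long as its output mentions the
# dpkg lock error, while still passing all of that output through to stdout.
# Assumption: matching on the English error text, which may change.
retry_on_lock() {
    log_file=$(mktemp)
    while :; do
        # tee copies the combined stdout/stderr to the log file AND to our
        # own stdout, so nothing is swallowed
        "$@" 2>&1 | tee "$log_file"
        # Stop as soon as a run finishes without the lock error
        grep -q "Could not get lock" "$log_file" || break
        echo "Waiting for other apt-get instances to exit"
        sleep "${RETRY_SLEEP_SECONDS:-1}"
    done
    rm -f "$log_file"
}

# Example (not run here): retry_on_lock apt-get dist-upgrade -y
```

Note that this sketch discards the exit status of the wrapped command, which is one more reason to prefer a solution that doesn’t involve parsing output at all.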
An extra challenge: minimal Ubuntu images
Working around the race condition was made harder in this case by the fact that this was all happening on a minimal Ubuntu image, where neither fuser nor lsof is installed by default.
Under any other circumstances I would happily install them (neither would particularly add to the base install size), but debugging an issue in your package installation process by installing more packages leads to a wonderful Catch-22 situation[3].
It turns out it is possible to emulate enough of the functionality of fuser or lsof by walking /proc yourself, though it wasn’t something I was thrilled to do in production:
# - Walk through all file descriptors in `/proc`
# - List them with find's `-ls` action, which prints each descriptor
#   along with the file it refers to
# - `grep` for the file we care about
# - Extract the process ID with `cut`, relying on there being a certain
#   pattern of forward slashes in the output
find /proc/*/fd -ls | grep /var/lib/dpkg/lock-frontend | cut -d"/" -f3
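A slightly less fragile variant of the same idea, as a sketch only: resolve each descriptor with readlink rather than parsing the listing. The function name find_fd_holders is hypothetical, and the sketch assumes a Linux-style /proc and enough privileges (typically root) to read other processes’ fd directories. Like fuser, it reports processes that have the file open, which is only a proxy for actually holding the lock:

```shell
# find_fd_holders PATH
# Hypothetical helper: prints the PID of every process that currently has
# PATH open, one per line.
find_fd_holders() {
    target_path=$1
    for fd in /proc/[0-9]*/fd/*; do
        # Descriptors can vanish mid-walk as processes exit, so skip failures
        link=$(readlink "$fd" 2>/dev/null) || continue
        if [ "$link" = "$target_path" ]; then
            # $fd looks like /proc/<pid>/fd/<n>; strip both ends to get <pid>
            pid=${fd#/proc/}
            echo "${pid%%/*}"
        fi
    done | sort -un
}

# Example (not run here): find_fd_holders /var/lib/dpkg/lock-frontend
```

Polling this in a loop has the same time-of-check to time-of-use weakness as the fuser version, so it only removes the dependency on extra packages, not the race itself.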
It was in apt all along
I was starting to be seriously tempted by one of the approaches above, when I got lucky and ran into a thread on the Debian mailing lists and an associated Ubuntu bug.
The last message in that thread mentions a setting you can pass to apt-get that I hadn’t come across before, DPkg::Lock::Timeout, which does exactly what we want:
# apt-get -o DPkg::Lock::Timeout=3 dist-upgrade
Waiting for cache lock: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 8873 (apt-get)
Waiting for cache lock: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 8873 (apt-get)
Waiting for cache lock: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 8873 (apt-get)... Error!
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
I’ve set it to 3 seconds in the example above, but in production you’d want to set it to something higher - probably measured in minutes[4].
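As a sketch of how that might look in a boot script, the wrapper below passes the option on every call. The function name apt_get_wait, the APT_LOCK_TIMEOUT variable and the 300-second default are illustrative choices of mine, not something apt provides:

```shell
# apt_get_wait [APT-GET ARGS...]
# Hypothetical wrapper: runs apt-get, waiting up to APT_LOCK_TIMEOUT seconds
# (default 300, an arbitrary "measured in minutes" value) for the dpkg lock.
apt_get_wait() {
    apt-get -o "DPkg::Lock::Timeout=${APT_LOCK_TIMEOUT:-300}" "$@"
}

# Example (not run here):
#   apt_get_wait update
#   apt_get_wait -y dist-upgrade
```

Since -o simply sets an apt configuration option, I’d expect the same setting to work from a file under /etc/apt/apt.conf.d/ as well, though I haven’t relied on that here.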
It’s worth noting, as mentioned in the mailing list thread, that this option is automatically set by apt (as opposed to apt-get), though that tool is aimed at interactive use and warns against usage in scripts:
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Closing aside: documentation
If you’ve run into this blog post because you had the exact same issue, hopefully you’ve found it useful (and thank you for continuing to read beyond the point where your problem is presumably fixed).
There was something that really stood out to me when I eventually found the solution: someone took time out of their day to implement retry behaviour inside apt and make it configurable via a flag. Unfortunately, they stopped short of writing docs for the feature.
This isn’t a criticism of the person doing that work (quite likely for free, as getting paid for open source isn’t exactly easy), but I found it quite sad that a missing paragraph in some docs was the difference between their work being used - having the impact they’d hoped for - and hundreds of janky workarounds being written on StackExchange sites.
In contrast with the screenshot above, at the time of writing, searching for the apt option yields no results[5] - surprising given the mailing list thread, but I guess those mailing lists aren’t fully indexed by Google.
I’ve got my fingers crossed that those search results will look different by the time you read this - at the very least because of this blog post, but also because I’m aiming to get it into the apt.conf man page itself[6].
Thanks to John Blundell and Murali Suriar for reviewing an earlier draft of this post.
1. Although in practice these should be exceedingly rare unless you have several startup scripts all contending over the lock at the same time.
2. It’s probably possible to get the log output back with tee, but I’m glad I didn’t have to.
3. If I had a problem installing packages, I would simply install some packages to fix it.
4. You can set it to the special value of -1 to retry lock acquisition forever, though I’m not a huge fan of that and prefer to eventually time out - unbounded retries are generally not a good idea.
5. The lone result is only there because Google has started including results that ignore your search syntax - in this case the use of double quotes for exact matches.
6. I’m planning to send them a patch once I figure out if they want that as a pull request on their GitHub mirror, Salsa (their self-hosted GitLab instance which it’s unclear if they want people outside of Debian to use) or on a mailing list thread via reportbug.