2017-05-07

Stable Django deployments without downtime

[Image: Battleship engine room]

This post describes a deployment and maintenance method for Django projects that was designed with scalability in mind. The goal is to push new releases into production at any moment with minimal or no downtime. Upgrades can be performed with unprivileged access to the production server, and rollbacks are possible. I use Gunicorn, Fabric and Supervisord in the examples.

Dependency management

One important task when automating processes is to make them deterministic. This means that the outcome is always the same, no matter when the process is started. Deploying a commit to staging should have exactly the same outcome as deploying it to production, even if new versions of dependencies were released in between. If deployments are not reproducible like this, you can never be sure that what you tested on staging is what actually runs in production.

Most Django projects include a requirements.txt file and use pip with virtualenv to manage dependencies. This works, but keeping the pinned versions up to date by hand takes too much time. Pip-tools is a great tool to automate that part. Everything starts with a requirements.in file:

Django<1.12
django-mptt
django-taggit
easy-thumbnails
gunicorn
Pillow==3.4.2

This is everything the project needs to run. Django itself stays on the 1.11 branch, the latest LTS release. Pillow is pinned to the version that's distributed with the OS, to avoid unnecessary builds during deployments. Running pip-compile against this file produces the following output:

django-mptt==0.8.7
django-taggit==0.22.1
django==1.11
easy-thumbnails==2.4.1
gunicorn==19.7.1
pillow==3.4.2             # via easy-thumbnails
pytz==2017.2              # via django

Great, now all packages are pinned to their latest compatible releases. Pip-tools makes it easier to have a deterministic deployment process, and pip-compile makes it very easy to upgrade all requirements at once.
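
For reference, the pip-tools workflow behind this boils down to a couple of commands (pip-sync is optional if you prefer a plain pip install -r requirements.txt):

pip-compile requirements.in            # pin everything from requirements.in into requirements.txt
pip-compile --upgrade requirements.in  # re-resolve all pins to the latest compatible releases
pip-sync requirements.txt              # make the active virtualenv match the pinned file exactly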

Choosing a deployment method

The deployment method I describe makes a few assumptions:

  • The project's application is run by a dedicated user inside a date-based directory (e.g. /srv/www/project/20170420/)
  • A symlink called current points to this directory (/srv/www/project/current/)
  • A system service automatically restarts the application when it exits
  • It is possible to access the user account remotely

I use Supervisord and SSH for the latter, but other configurations are possible. You can also name your directories however you like; I append the git tag to the date, for example.
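
To make these assumptions concrete, the layout on the server ends up looking roughly like this (the tag suffix and the repository/ and virtualenv/ subdirectories follow the Fabric script further down):

/srv/www/project/
    20170420/
        repository/
        virtualenv/
    20170507-v1.2.0/
        repository/
        virtualenv/
    current -> 20170507-v1.2.0/
    previous -> 20170420/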

Next is an example of a Supervisord config I use. Notice that the project is always accessed through the current symlink, and that the pid file is in a known location:

[program:demo_wsgi]
command=/srv/www/demo/current/repository/virtualenv/bin/gunicorn demo.wsgi:application
    --bind 127.0.0.1:8001 --log-file demo-wsgi.log --pid demo.pid
directory=/srv/www/demo/current/repository/
user=demo
group=demo
autostart=true
autorestart=true
redirect_stderr=true

With this out of the way, let's have a look at the deployment process itself:

  1. A new date-based directory is created in the user's home directory
  2. The code repository is cloned into it
  3. A virtualenv is created, and all the pinned requirements are installed
  4. Static files are collected, database migrations and a few more management commands are run
  5. The current symlink is renamed to previous, and a new current symlink pointing to the new date-based directory is created
  6. The old app server process is killed; Supervisord notices this and starts the newly deployed code

From the moment the migrations run or the current symlink is updated until the new worker is serving requests, the application can break in various ways. The old version of the website might use static files that the webserver can't find any more, or the old code might not be compatible with the migrated database. Solutions for these short-lived problems are described below.

Picking an automation tool

The process I described above could be performed manually, and it's probably a good idea to try it like that a few times. Once you are familiar with the procedure, it's time to automate it.

My primary tool for app-level automation in Django projects is Fabric. Any task runner, scripting language or config management tool would do, but Fabric has the advantage of being written in Python and of integrating nicely with virtualenv through fabric-virtualenv. It also doesn't need any special privileges: it can do anything your user can do. If you aren't using any task runner or automation tool yet, I'd recommend you look into Fabric. Fabric is not Python 3 ready yet, but since it is only used to push code and is not a dependency of your Django project, that is tolerable; as Raffaele pointed out in the comments, there is also a Python 3 fork. Another possible tool is Ansible, but it is more complex than Fabric.

Some basic tasks that can be automated as an exercise are:

  • compiling new requirements files
  • building documentation and reports
  • pulling data snapshots and files from production into dev

Below is a Fabric script that performs the described deployment method.

import datetime, os
from fabric.api import run, cd, settings
from fabvenv import make_virtualenv, virtualenv


GIT_REPO = 'user@example.com:path/to/project.git'
GIT_BRANCH = 'production'
HOME = '/srv/www/djangoproject/'


def deploy():
    """
    A basic deployment script for Django projects that minimizes downtime.

    The warn_only setting is used for steps that can fail the first time the script runs.
    """
    version = datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
    deploy_path = os.path.join(HOME, version)
    venv_path = os.path.join(deploy_path, 'virtualenv')
    repository_path = os.path.join(deploy_path, 'repository')
    # I have a src directory inside the git repository that contains the actual Django project
    src_path = os.path.join(repository_path, 'src')
    # The running worker's pid file lives in the old deployment directory,
    # which is reachable through the previous symlink after step 5
    pid_file = os.path.join(HOME, 'previous', 'repository', 'demo.pid')
    # Create home directory if necessary
    with settings(warn_only=True):
        run('mkdir {}'.format(HOME))
    # Step 1: Create a new deployment directory
    run('mkdir {}'.format(deploy_path))
    # Step 2: Check out the source code
    run('git clone --branch {} {} {}'.format(GIT_BRANCH, GIT_REPO, repository_path))
    # Step 3: Create the virtualenv and install dependencies
    make_virtualenv(
        venv_path,
        system_site_packages=True
    )
    with cd(repository_path):
        with virtualenv(venv_path):
            run('pip install --upgrade pip')
            run('pip install -r requirements.txt')
    # Step 4: Run management commands
    with cd(src_path):
        with virtualenv(venv_path):
            run('python manage.py check')
            run('python manage.py collectstatic --noinput')
            run('python manage.py compilemessages')
            run('python manage.py migrate')
    # Step 5: Update the links to current and previous deployments
    with cd(HOME):
        with settings(warn_only=True):
            run('rm -f previous')
            run('mv current previous')
        run('ln -s {} current'.format(deploy_path))
    # Step 6: Force a restart
    # Kill the old worker so that supervisord starts the new one
    with settings(warn_only=True):
        run('kill -TERM `cat {}`'.format(pid_file))

This is a complete example of a fabfile.py; you can start the deployment process with fab -H example.com deploy.
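
Rollbacks use the same building blocks. The following task is only a sketch that could be added to the same fabfile; it assumes the previous symlink still exists and that the pid file location matches the one above:

def rollback():
    """
    Switch back to the previous release; a sketch, not battle-tested.

    Only safe if the migrations that shipped with the new release were
    non-destructive (see "Migrations and rollback" below).
    """
    # The worker that must be killed was started through the old current
    # symlink, so its pid file is found under previous/ after the swap below
    pid_file = os.path.join(HOME, 'previous', 'repository', 'demo.pid')
    with cd(HOME):
        # Swap current and previous ('broken' is just a temporary name)
        run('mv current broken')
        run('mv previous current')
        run('mv broken previous')
    # Kill the worker of the bad release; Supervisord restarts the program,
    # which now resolves current to the previous release again
    with settings(warn_only=True):
        run('kill -TERM `cat {}`'.format(pid_file))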

Update 2019: Bash deploy script example

OK, I'm not really proud of this, as going from Fabric to bash feels like a downgrade. But I have a few older projects that still need changes deployed, and I simply don't have the time to replace the now obsolete Fabric with a proper successor. It turns out the method I described in this article is simple enough that a bash script can do it. Here is one version I use at the moment.

#!/bin/bash

# Fixed settings
commit=$(git rev-parse HEAD)
date=$(date +%Y%m%d_%H%M%S)
name="${date}_${commit}"
git="~/${name}/git"
src="${git}/src"
# settings="${src}/src/conf/settings/"
venv="~/${name}/virtualenv"
manage="${venv}/bin/python ${git}/src/manage.py"
manage_latest="~/latest/virtualenv/bin/python latest/git/src/manage.py"
archive="${name}.tar.gz"
previous="previous"
latest="latest"

# Dynamic settings
python=/usr/bin/python3.7
pidfile="${previous}/git/src/user.pid"
remote_suggestion="user@example.com"
compilemessages=1

# Arg "parsing"
cmd=$1
remote=${2:-${remote_suggestion}}

if [[ ! "${remote}" ]]; then
	echo "No remote given, aborting, try ${remote_suggestion}"
	exit 1
fi
if [[ ! "${cmd}" ]]; then
	echo No command given, aborting, try deploy remoteclean getdata getdatafull
	exit 1
fi

_getdata () {
	exclude="$*"
	set -e
	echo "Dumping prod data"
	echo "exclude ${exclude}"
	ssh "${remote}" "${manage_latest} dumpdata --format json --indent 2 --natural-foreign --natural-primary ${exclude} -o data.json"
	echo "Fetching prod data"
	if [ ! -d data ]; then
		mkdir data/
	fi
	rsync -avz --progress "${remote}:data.json" data/
	cat data/data.json > src/data.json
	rsync -avz "${remote}:media" docker
}

loaddata () {
	# Thorough database reset before loading data
	flush="._sqlflush.sql"
	./manage.py sqlflush --no-color > "$flush"
	./manage.py dbshell < "$flush"
	./manage.py migrate
	./manage.py loaddata src/data.json
	./manage.py update_index
	rm -f "$flush"
	grep '^  "model":' src/data.json  | sort | uniq --count | sort --numeric
}

getdata () {
	# Exclude models we don't usually want
	_getdata -e admin.logentry -e sessions.session
}

getdatafull () {
	# Exclude models we never want
	_getdata -e sessions.session
}

if [[ "${cmd}" == "deploy" ]]; then
	set -e
	echo "Transfer archive..."
	git archive --format tar.gz -o "${archive}" "${commit}"
	scp "${archive}" "${remote}:"
	rm -f "${archive}"

	echo "Install files"
	ssh "${remote}" mkdir -p "${git}"
	ssh "${remote}" tar xzf "${archive}" -C "${git}"

	echo "Updating tor exit node list..."
	ssh "${remote}" "cp ${git}/download_tor.sh ."
	ssh "${remote}" "cp ${git}/download_maxmind.sh ."
	ssh "${remote}" "bash download_tor.sh ${src}/conf/settings/"

	echo "Install virtualenv"
	ssh "${remote}" virtualenv --quiet "${venv}" -p ${python}
	ssh "${remote}" "${venv}/bin/pip" install --quiet --upgrade pip setuptools
	ssh "${remote}" "${venv}/bin/pip" install --quiet -r "${git}/requirements.txt"

	echo "Set up django..."
	ssh "${remote}" "${manage} check"
	ssh "${remote}" "${manage} check --deploy"
	ssh "${remote}" "${manage} migrate --noinput"
	if [[ ${compilemessages} -gt 0 ]]; then
		ssh "${remote}" "cd ${git} && ${manage} compilemessages"
	fi
	ssh "${remote}" "${manage} collectstatic --noinput"

	echo "Switching to new install..."
	ssh "${remote}" rm -fv "${previous}"
	set +e  # first deploy
	ssh "${remote}" mv -v "${latest}" "${previous}"
	set -e
	ssh "${remote}" ln -s "${name}" "${latest}"
	echo "Killing old worker, pidfile ${pidfile}"
	ssh "${remote}" "test -f ${pidfile} && kill -15 \$(cat ${pidfile}) || echo pidfile not found"


	echo "Cleaning up..."
	ssh "${remote}" rm -f "${archive}"
	rm -f "${archive}"
	set +e
elif [[ "${cmd}" == "getdata" ]]; then
	getdata
elif [[ "${cmd}" == "getdatafull" ]]; then
	getdatafull
elif [[ "${cmd}" == "loaddata" ]]; then
	loaddata
elif [[ "${cmd}" == "go" ]]; then
	getdata
	loaddata
elif [[ "${cmd}" == "gofull" ]]; then
	getdatafull
	loaddata
fi

if [[ "${cmd}" == "deploy" || "${cmd}" == "remoteclean" ]]; then
	echo "Deleting obsolete deploys"
	ssh "${remote}" '/usr/bin/find . -maxdepth 1 -type d -name "2*" | ' \
		'grep -v "$(basename "$(readlink latest)")" | ' \
		'grep -v "$(basename "$(readlink previous)")" | ' \
		'/usr/bin/xargs /bin/rm -rf'
	ssh "${remote}" rm -fv 2*tar.gz
fi

This code was originally published on Simple bash deployment script for Django.

Things to keep in mind

I have used this method for a while now, and it does what it was designed to do. There are a few important things I haven't mentioned yet:

Static files and localization

I compile static files and translations on production, during deployment, inside the deployment directory. All those assets are in the source code repository, so this makes sense. However, it's perfectly fine to perform this step on a different machine, and to transfer the compiled assets to production.

Media files

Media files should obviously not live inside the deployment directory, or they would effectively be lost after an upgrade because the webserver would no longer find them under the new current symlink. I keep them in the user's home directory or on external storage.
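
In settings terms this just means pointing MEDIA_ROOT outside the date-based directories; the path below is an example matching the layout above, not taken from the original setup:

# settings.py: uploads live next to, not inside, the date-based deployment
# directories, so they survive upgrades and rollbacks (example path)
MEDIA_ROOT = '/srv/www/project/media/'
MEDIA_URL = '/media/'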

Migrations and rollback

One potential source of conflicts during deployments is database migrations. If your new database schema is incompatible with the production code, your application will generate errors sooner or later: during the migrations, after rollbacks, etc. One way to avoid this problem is to only deploy non-destructive migrations when you roll out new features. Such a migration doesn't delete any data or rename existing fields and models, it just adds new fields and data.
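
For example, a migration in the following style only adds a nullable field, so the previous release keeps working against the migrated database (the app, model and field names are made up for illustration):

# catalog/migrations/0005_add_teaser.py (hypothetical app and field names)
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [
        ('catalog', '0004_previous_migration'),
    ]

    operations = [
        # Adding a nullable field is non-destructive: the old code simply
        # ignores the new column.
        migrations.AddField(
            model_name='product',
            name='teaser',
            field=models.CharField(max_length=200, blank=True, null=True),
        ),
        # RemoveField or RenameField, on the other hand, would break the
        # previous release and should wait for a later deployment.
    ]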

Doing this also has the benefit that rolling back your production code can be as simple as updating a symlink and restarting the application server. Once your new code has proven to be stable in production you can create additional migrations to get rid of legacy data.

Caches

If you use caching, you should think about potential cache conflicts between releases. You can avoid them, for example, by running a clear_cache management command or by adding a KEY_PREFIX to your cache config. Clearing the entire cache for every deployment seems a little aggressive though.
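
Here is a sketch of the KEY_PREFIX variant, assuming the deployment script exports the release name (for example the date-based directory name) as an environment variable; the variable name and cache backend are just examples:

# settings.py: keys from different releases never collide because each
# deployment uses its own prefix (APP_RELEASE is a hypothetical variable
# set by the deploy script or Supervisord's environment= option)
import os

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': '127.0.0.1:11211',
        'KEY_PREFIX': os.environ.get('APP_RELEASE', 'dev'),
    }
}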

Keeping your code portable

You probably want your deployment scripts to be reusable in multiple projects, so think about ways to avoid hardcoding paths etc. inside your Fabric scripts (if that's what you use). I use a custom Fabric package.

Cleaning up

So far we have kept old deployment directories around, which makes rollback possible, but it's not necessary to keep all old deployments. Which cleanup process you choose depends on your requirements. Using the date in the directory name makes managing them easier.

Deploying secrets

Storing secrets like the SECRET_KEY, mail configuration and other sensitive information in the source repository should usually be avoided. Distributing them is another potential task for your script.
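
A common pattern is to read them from the environment (or from a file outside the repository) in the settings module; the variable names below are examples, and the deployment script or Supervisord's environment= option has to provide them:

# settings.py: secrets come from the environment instead of the repository
# (the variable names are examples, not from the original post)
import os

SECRET_KEY = os.environ['DJANGO_SECRET_KEY']
EMAIL_HOST_PASSWORD = os.environ.get('DJANGO_EMAIL_HOST_PASSWORD', '')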

Feature switches

Being able to roll back releases is nice, but it's also nice to be able to enable and disable features with a simple configuration switch, or to perform A/B testing. Feature switches can also help to merge code more frequently. 
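
There are dedicated packages for this, but even a plain settings flag goes a long way; the flag and view below are made up for illustration:

# settings.py
FEATURES = {
    'new_checkout': True,  # flip to False to fall back to the old flow
}

# views.py
from django.conf import settings
from django.http import HttpResponse


def checkout(request):
    # Dispatch between two code paths based on the flag; a real project would
    # call different view functions or render different templates here
    if settings.FEATURES.get('new_checkout'):
        return HttpResponse('new checkout flow')
    return HttpResponse('old checkout flow')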

Reducing downtime: More app servers

Usually, when people ask how to upgrade Django projects without downtime, they haven't established a reproducible deployment process yet, so that's what this post was mostly about. Now that you have such a process, you can work on actually eliminating downtime. The method described above can lead to a few seconds of downtime between killing the old worker and the new one becoming responsive.

There are different approaches to fix this; one is to put Nginx in front of two or more Gunicorn servers and load balance between them. This post is only about the Django project/app layer, so I won't describe it in detail. Please refer to the Nginx load balancing documentation if you use Nginx, or ask your sysadmin to set it up for you.
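
For orientation, the heart of such a setup is a single upstream block; the sketch below assumes two Gunicorn instances on local ports and is nowhere near a complete Nginx configuration:

# nginx: distribute requests over two Gunicorn instances (ports are examples)
upstream demo_app {
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}

server {
    listen 80;
    server_name example.com;

    location / {
        proxy_pass http://demo_app;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}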

Compared to the deployment process described above with just one Gunicorn instance, a few details change:

Static files

If you use more than one application server, you obviously don't need to compile a separate set of static files for each of them, so your deployment script should know when to skip a few steps.

Restarting app servers

You will need a strategy for restarting the Gunicorn instances. What you do depends on how your project behaves when different versions run simultaneously: you could restart them as quickly as possible, or you could keep different versions running at the same time for a while. This can be useful for A/B testing for example.

Automating the OS level

Everything I described so far works fine on one or a fixed number of servers. But it obviously doesn't scale well or offer a lot of redundancy. If those are features you need you'll want to automate the OS level, so that you can provision entire servers as quickly as you deploy new features.

7 comments

  1. You can use the python3 port of fabric: $ python3 -m pip install fabric3 https://github.com/mathiasertl/fabric/
  2. Yeah, but fabric-virtualenv is not compatible with it. I should probably just get rid of it, it doesn't do too much anyway.
  3. Just ran into the same issue, and found that the package fabric3-virtualenv (https://github.com/nutztherookie/fabric3-virtualenv) provides the python3 equivalent of fabric-virtualenv. Nice article BTW, I learnt some new tips I didn't know until now. The pip-compile command is amazing for handling complex sets of dependencies!
  4. Oh, cool, I hadn't found the py3 version of that! And yes, I love pip-tools, so much nicer than doing everything with pip, and pip-sync is great as well.
  5. Would killing gunicorn with HUP instead of TERM work? That should do a graceful reload, which should drop fewer connections.
  6. In some scenarios a HUP can work, but not if you upgrade your requirements, for example. Pip will remove packages before upgrading them, leading to random errors. Restarting has a predictable outcome, and it's easy to work around problems. The method I describe installs a new gunicorn every time anyway, so reloading is not an option.
  7. By the way, fabric finally released a python 3 version a few days ago: http://www.fabfile.org/
