Configure Celery for reliable delivery

  • Backend development
  • Data engineering
· August 24, 2022

Celery is probably the most used task runner in the Python ecosystem. Yet, I found today that configuring it to support a basic scenario is surprisingly difficult and poorly documented, for such a critical piece of a software architecture.

So, what's the problem? Let's say we have a task that takes several minutes to run. Imagine now that we have a server failure, completely independent of our source code: a server restart, a power failure, etc. The task was stopped in the middle of its execution and didn't finish. Fine! Things like that happen often with computers 😏 All we want now is for Celery to execute the task again after the server restarts!

Well, actually, by default: it won't 🥹 This is an intended default of Celery. By the way, this is one of the main pain points raised by the author of Dramatiq, a competing library.

Fortunately, there is a way to do it! And this blog post will save you hours of research 🙃

TL;DR

Here is the configuration you need to apply for this to work:

from celery import Celery

app = Celery(
    "tasks",
    broker="YOUR_BROKER",
    backend="YOUR_BACKEND",
)
app.conf.update(
    task_acks_late=True,
    task_reject_on_worker_lost=True,
    worker_state_db="./celery-state.db",
)

Don't go just yet! I highly recommend you read the rationale behind it 👇

Rationale

So, what are those options doing? Basically, we need to tell Celery:

  • To acknowledge tasks after their completion, not right before (task_acks_late).
  • To reject a task if the worker is lost or shut down (task_reject_on_worker_lost).
  • By default, task rejections are persisted only in memory, and are thus lost if the server is stopped. So we need to set up a state file (worker_state_db).

The visibility timeout quirk

Still, if you try to manually kill the worker while a long task is running, then restart it, you'll see that the task looks stuck and is never re-executed.

Actually, this is normal behavior. Why? Because tasks are not re-delivered until they reach the visibility timeout. And by default, this timeout is 1 hour. It means that your pending task will be executed again in 1 hour.

If you set this value to 30 seconds, you'll see that your task will get executed again within 30 seconds.

It might be tempting to decrease this visibility timeout and keep a very low value, but bear in mind that it applies to every task, not only the crashed ones. With the default of 1 hour, a task that takes more than 1 hour to execute will be automatically rescheduled on another worker, even though it's still running. That's why it's important to keep this value quite high, especially if we have long tasks to run.

References