Celery is a great tool for background task processing in Django. We use it in a lot of the custom web apps we build at Caktus, and it's quickly becoming the standard for a wide variety of task scheduling workloads, from simple to highly complex.
Rarely, a Celery worker may stop processing tasks and appear completely hung. In other words, issuing a restart command (through Supervisor) or a kill (from the command line) doesn't immediately restart or shut down the process. This can be a particular problem when you have a queue set up with only one worker (e.g., to avoid processing any of that queue's tasks simultaneously), because none of the new tasks in the queue will get processed until the worker comes back. In these cases you may find yourself resorting to manually calling kill -9 <pid> on the process to get the queue moving again.
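For context, that kind of dedicated single-worker queue might be configured roughly as follows; this is only a sketch, and the project, queue, and task names are hypothetical rather than our actual setup:

# Minimal sketch of a dedicated queue consumed by exactly one worker.
# All names here (myproject, reports, generate_report) are hypothetical.
from celery import Celery

app = Celery("myproject", broker="amqp://guest@localhost//")

# Route this one task to its own "reports" queue.
app.conf.task_routes = {
    "myproject.tasks.generate_report": {"queue": "reports"},
}

@app.task(name="myproject.tasks.generate_report")
def generate_report(report_id):
    ...  # the slow work that must never run concurrently

# The queue would then be consumed by a single worker process, e.g.:
#   celery -A myproject worker -Q reports --concurrency=1

With only one process consuming the queue, a hang in that process stalls the entire queue, which is exactly the situation described above.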
We recently ran into this issue at Caktus, and in our case the stuck worker process wasn't processing any new tasks and wasn't showing any CPU activity in top. That seemed a bit odd, so I decided to try to figure out what the process was actually doing when it became unresponsive. Enter strace.
strace is a powerful command-line tool for inspecting running processes to determine what "system calls" they're making. System calls are low-level calls to the operating system kernel that might involve accessing the hard disk, using the network, creating new processes, or other such operations. First, let's find the PID of the Celery worker process we're interested in:
ps auxww|grep celery
You'll find the PID in the second column. For the purposes of this post, let's assume it's 1234. You can inspect the full command in the process list to make sure you've identified the right Celery worker.
Next, run strace on that PID as follows:
sudo strace -p 1234 -s 100000
The -p flag specifies the PID to attach to, and the -s flag specifies the maximum length of the strings strace will print. By default strings are truncated to 32 characters, which we found isn't very helpful when the system call being made includes a long string as an argument. You might see something like this:
Process 1234 attached - interrupt to quit
write(5, "ion id='89273' responses='12'>\n ...", 37628
In our case, the task was writing what looked like some XML to file descriptor "5". The XML was much longer and at the end included what looked like a few attributes of a pickled Python object, but I've shortened it here for clarity's sake. You can see what "5" corresponds to by looking at the output of lsof:
sudo lsof|grep 1234
The file descriptor shows up in the "FD" column; in our version of lsof, that happens to be the fourth column from the left. You'll see a bunch of files that you don't care about, and then, down near the bottom, the list of open file descriptors:
python  1234  myuser   0r  FIFO  0,8      0t0  6593806  pipe
python  1234  myuser   1w  FIFO  0,8      0t0  6593807  pipe
python  1234  myuser   2w  FIFO  0,8      0t0  6593807  pipe
python  1234  myuser   3u  0000  0,9      0    4738     anon_inode
python  1234  myuser   4r  FIFO  0,8      0t0  6593847  pipe
python  1234  myuser   5w  FIFO  0,8      0t0  6593847  pipe
python  1234  myuser   6r  CHR   1,9      0t0  4768     /dev/urandom
python  1234  myuser   7r  FIFO  0,8      0t0  6593850  pipe
python  1234  myuser   8w  FIFO  0,8      0t0  6593850  pipe
python  1234  myuser   9r  FIFO  0,8      0t0  6593916  pipe
python  1234  myuser  10u  IPv4  6593855  0t0  TCP ip-10-142-126-212.ec2.internal:33589->ip-10-112-43-181.ec2.internal:amqp (ESTABLISHED)
You can see "5" corresponds to a pipe, which at least in theory ends up with a TCP connection to the amqp port on another EC2 server (host names are fictional).
RabbitMQ was operating properly and not reporting any errors, so our attention turned to the Celery task in question. Upon further examination, an object we were passing to the task included a long XML string as an attribute, and the whole object was being pickled and sent through RabbitMQ. Issues with very long task arguments have been reported in Celery before, and while it appears they should be supported, an easy workaround for us (and Celery's recommended approach) was to pass an ID for this object rather than the object itself. That greatly reduced the size of the task's arguments and avoided the risk of overwriting object attributes with stale data.
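To make that change concrete, here's a rough before-and-after sketch; the Response model, its xml_payload attribute, and the task and helper names are hypothetical stand-ins rather than our actual code:

from celery import shared_task

from myproject.models import Response  # hypothetical Django model


def handle_xml(xml):
    ...  # stands in for whatever work the task actually does


# Before (sketch): the whole object, including its large XML attribute,
# was pickled into every message sent to RabbitMQ.
@shared_task
def process_response_old(response):
    handle_xml(response.xml_payload)


# After (sketch): only the primary key travels through the broker; the task
# re-fetches the object from the database when it actually runs.
@shared_task
def process_response(response_id):
    response = Response.objects.get(pk=response_id)
    handle_xml(response.xml_payload)

Calling process_response.delay(response.pk) then puts little more than an integer on the queue instead of a multi-kilobyte pickle.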
While there may have been other ways to fix the underlying issue, strace and lsof were crucial in helping us figure out the problem. One might be able to accomplish the same thing with a lot of logging, but if your code is stuck in a system call and isn't showing any noticeable CPU usage in top, strace can take you straight to the root of the problem.