Django + Postgres: The Hunt for Long Running Queries

Database queries sometimes take a lot longer to complete than the average developer would expect. Seemingly innocent actions such as searching can produce very complex SELECT statements which take what feels like an eternity to finish. If you get a few users performing these heavy queries in parallel, your app’s performance may begin to suffer.

In this article we reflect on how we at Intempus implemented some basic tooling to help prevent getting into the scenario described above. In particular, we describe a way to detect long running active queries and also a way in which we can set a timeout on a query-by-query basis. We speak in terms of implementing these ideas in Django and Postgres, but most concepts should be transferable to other frameworks and databases.

The Set-Up

Our website offers users the ability to filter data on an almost limitless number of combinations of different columns. It also allows users to perform full-text searches which match on multiple “simple” columns together (e.g. “name” and “number”).

The Incident

SELECT * FROM "table_1" INNER JOIN "table_2" ON ("table_1"."foreign_key_id" = "table_2"."id") WHERE (UPPER("table_2"."name"::text) LIKE UPPER(%Paul%) AND "table_1"."another_foreign_key_id" = 12345

In reality, the SQL produced on our server was much more complex since the search must also pass through an involved authorisation layer. Still, it was surprising to us that this query would take in the order of minutes to complete, especially when similar looking queries were so fast.

What next?

  • Perform deep analysis on why this specific query is so slow and attempt to optimise it.
  • Restructure our authorisation framework to play more nicely with these kind of searches.
  • Limit the kind of searches a user can perform — for example, can we move away from expensive LIKE type queries?

All of these approaches are valid and likely to be fruitful. However, none of them guarantee that another slow search query won’t slip through the cracks and hurt our app’s performance. We want a way to detect these queries before they cause an issue. Even better, we want to find a way to ensure these queries automatically cancel themselves if they’re running over a certain time limit.

The Hunt

To hunt for long running active queries, we made use of the pg_stats view. The Python function below will print data on all active queries running for the specified duration:

from django.db import connection


def get_queries_active_for_duration(duration_in_minutes):
cursor = connection.cursor()
select_statement = """
SELECT
pid,query,
pg_stat_activity.query_start AS query_start,
now() - pg_stat_activity.query_start AS duration,
state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) >
interval '{duration_in_minutes} minutes'
AND state='active' ORDER BY duration;
""".format(duration_in_minutes=duration_in_minutes)
cursor.execute(select_statement)
row = cursor.fetchall()
return row

Next, we set up a cron-job which uses this function to periodically scan the database for slow queries. When one is found, we raise an alert on rollbar which allows us to quickly decide if action should be taken before things get worse. If we wanted to kill the active query, we could use tooling such as pg_kill.

The Kill

  1. The timeout applies only to queries coming from the search. We do not want a global timeout to be applied across all database queries (since some actions are expected to take longer than others).
  2. We want to be alerted when the query does time out and we want that alert to contain information on what the query was for.

To achieve the above, we made use of postgres statement_timeout setting. Before executing the query in our search code, we set a LOCAL statement_timeout value in the same transaction as the one in which the search query is executed. If the query takes longer the specified timeout value, the transaction is cancelled and an OperationalError is raised, at which point we raise an alert on rollbar. The code for this looked something like the following:

try:
with transaction.atomic():
connection.cursor().execute(
"SET LOCAL statement_timeout = '60s';".format(timeout))
result = list(queryset) # evaluates and speaks to postgres
except OperationalError:
erorr_message = 'Cancelled search query after 60s.'
message += 'Query SQL:\n {}'.format(queryset.query)
# Logic for reporting this message to e.g. rollbar goes here.

Note that it the use of SET LOCAL that allows us to satisfy condition (1) and the logic in the except block which allows us to satisfy condition (2). With this in place, we can be sure that search queries will only run for at most 60s (or whatever other value we choose).
The beauty of this is not only does it ensure we don’t get into the tricky situation we described in the introduction, but it actually very quickly alerts us to which search queries are slow and hence allows us to devise ways to improve the performance of the search as well.

For some of my other work, see https://github.com/PaulGilmartin/ and www.paulgilmartin.co.uk.

Python Software Developer based in Copenhagen, Denmark