Django + Postgres: The Hunt for Long Running Queries
Database queries sometimes take a lot longer to complete than the average developer would expect. Seemingly innocent actions such as searching can produce very complex SELECT statements which take what feels like an eternity to finish. If you get a few users performing these heavy queries in parallel, your app’s performance may begin to suffer.
In this article we reflect on how we at Intempus implemented some basic tooling to help prevent getting into the scenario described above. In particular, we describe a way to detect long running active queries and also a way in which we can set a timeout on a query-by-query basis. We speak in terms of implementing these ideas in Django and Postgres, but most concepts should be transferable to other frameworks and databases.
The Set-Up
I work as a Software Engineer at Intempus, a time-registration software company based in Denmark. I help develop the backend of the app and website, which is built with Django (1.11) on top of a Postgres (12) database.
Our website offers users the ability to filter data on an almost limitless number of combinations of different columns. It also allows users to perform full-text searches which match on multiple “simple” columns together (e.g. “name” and “number”).
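As a rough illustration only (the function, model and field names here are hypothetical, not our real schema), a search helper along these lines could combine arbitrary column filters with a text search across a couple of "simple" columns:

from django.db.models import Q

def search(queryset, search_term, **column_filters):
    # Match the search term against a couple of "simple" columns...
    text_match = (
        Q(name__icontains=search_term) | Q(number__icontains=search_term)
    )
    # ...and combine it with whatever column filters the user selected.
    return queryset.filter(text_match, **column_filters)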
The Incident
Recently we noticed some transient but significant performance degradation of the website as a whole. We were eventually able to trace this back to one of our bigger customers performing the same innocent-looking, but apparently very expensive, search repeatedly. Without going into too much detail, the user was essentially attempting to filter by ID on a specific foreign key column combined with a search on a "name" column. The below is a simplified version of how this would be mapped into SQL:
SELECT * FROM "table_1" INNER JOIN "table_2" ON ("table_1"."foreign_key_id" = "table_2"."id") WHERE (UPPER("table_2"."name"::text) LIKE UPPER(%Paul%) AND "table_1"."another_foreign_key_id" = 12345
In reality, the SQL produced on our server was much more complex since the search must also pass through an involved authorisation layer. Still, it was surprising to us that this query would take on the order of minutes to complete, especially when similar-looking queries were so fast.
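For context, an ORM call of roughly the following shape could produce SQL like the simplified statement above (the model and field names here are hypothetical stand-ins for "table_1" and "table_2"):

# Hypothetical models standing in for "table_1" and "table_2".
Table1.objects.filter(
    # Produces UPPER("table_2"."name"::text) LIKE UPPER('%Paul%') on Postgres.
    foreign_key__name__icontains='Paul',
    another_foreign_key_id=12345,
)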
What next?
There are a lot of different ways to start tackling this issue. Some possibilities are:
- Perform deep analysis on why this specific query is so slow and attempt to optimise it.
- Restructure our authorisation framework to play more nicely with these kinds of searches.
- Limit the kinds of searches a user can perform; for example, can we move away from expensive LIKE-type queries?
All of these approaches are valid and likely to be fruitful. However, none of them guarantee that another slow search query won’t slip through the cracks and hurt our app’s performance. We want a way to detect these queries before they cause an issue. Even better, we want to find a way to ensure these queries automatically cancel themselves if they’re running over a certain time limit.
The Hunt
To hunt for long running active queries, we made use of the pg_stat_activity view. The Python function below returns data on all active queries that have been running for longer than the specified duration:
from django.db import connection

def get_queries_active_for_duration(duration_in_minutes):
    # Return (pid, query, query_start, duration, state) for every active
    # query that has been running longer than the given number of minutes.
    select_statement = """
        SELECT
            pid,
            query,
            pg_stat_activity.query_start AS query_start,
            now() - pg_stat_activity.query_start AS duration,
            state
        FROM pg_stat_activity
        WHERE (now() - pg_stat_activity.query_start) >
              interval '{duration_in_minutes} minutes'
          AND state = 'active'
        ORDER BY duration;
    """.format(duration_in_minutes=duration_in_minutes)
    with connection.cursor() as cursor:
        cursor.execute(select_statement)
        rows = cursor.fetchall()
    return rows
Next, we set up a cron job which uses this function to periodically scan the database for slow queries. When one is found, we raise an alert on Rollbar, which allows us to quickly decide whether action should be taken before things get worse. If we wanted to kill the active query, we could use tooling such as pg_kill.
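As a sketch of how that periodic check might be wired up (the command name and the reporting call are placeholders, not our actual code), it could live in a Django management command invoked from cron:

from django.core.management.base import BaseCommand

# Assumes get_queries_active_for_duration (defined above) is importable here.

class Command(BaseCommand):
    # Run from cron, e.g. every 5 minutes: ./manage.py find_slow_queries

    def handle(self, *args, **options):
        for pid, query, query_start, duration, state in \
                get_queries_active_for_duration(duration_in_minutes=5):
            # Swap this write-out for a call to your alerting tool (e.g. Rollbar).
            self.stderr.write(
                'Query (pid={}) active for {}: {}'.format(pid, duration, query))
            # To cancel an offending query from here, Postgres also offers
            # pg_cancel_backend, e.g.:
            #   connection.cursor().execute('SELECT pg_cancel_backend(%s);', [pid])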
The Kill
Although we were feeling more secure with our new alerting in place, we felt more could be done. In particular, we felt there should be a Postgres-level timeout on queries coming from searches, one which also meets the following criteria:
- The timeout applies only to queries coming from the search. We do not want a global timeout to be applied across all database queries (since some actions are expected to take longer than others).
- We want to be alerted when the query does time out and we want that alert to contain information on what the query was for.
To achieve the above, we made use of Postgres' statement_timeout setting. Before executing the query in our search code, we set a LOCAL statement_timeout value in the same transaction as the one in which the search query is executed. If the query takes longer than the specified timeout value, the transaction is cancelled and an OperationalError is raised, at which point we raise an alert on Rollbar. The code for this looked something like the following:
from django.db import OperationalError, connection, transaction

timeout = '60s'
try:
    with transaction.atomic():
        # SET LOCAL scopes the timeout to this transaction only.
        connection.cursor().execute(
            "SET LOCAL statement_timeout = '{}';".format(timeout))
        result = list(queryset)  # evaluates and speaks to postgres
except OperationalError:
    error_message = 'Cancelled search query after {}.\n'.format(timeout)
    error_message += 'Query SQL:\n{}'.format(queryset.query)
    # Logic for reporting this message to e.g. Rollbar goes here.
Note that it is the use of SET LOCAL that allows us to satisfy the first condition, and the logic in the except block that allows us to satisfy the second. With this in place, we can be sure that search queries will run for at most 60s (or whatever other value we choose).
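To make the scoping concrete: a plain SET statement_timeout would change the timeout for the rest of the database session, whereas SET LOCAL only lasts until the end of the enclosing transaction. A quick sketch of the behaviour one would expect:

from django.db import connection, transaction

with transaction.atomic():
    with connection.cursor() as cursor:
        cursor.execute("SET LOCAL statement_timeout = '60s';")
        cursor.execute('SHOW statement_timeout;')
        print(cursor.fetchone())  # ('60s',) -- in effect inside this transaction

with connection.cursor() as cursor:
    cursor.execute('SHOW statement_timeout;')
    print(cursor.fetchone())  # back to the server default, typically ('0',)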
The beauty of this is that it not only ensures we avoid the tricky situation described in the introduction, but it also very quickly alerts us to which search queries are slow, which in turn helps us devise ways to improve the performance of the search itself.
For some of my other work, see https://github.com/PaulGilmartin/ and www.paulgilmartin.co.uk.