Scalable filters and bulk-operations in campaignion

Roman Zimmermann

Over the last few weeks I've been optimizing queries for the "Manage Supporter" interface in campaignion (an online campaigning / online fundraising distribution). The goal is to filter Redhen contacts and then apply bulk operations to them. Our larger databases hold up to 500,000 contacts and millions of activities - and the queries are not that simple either. How do you get “All supporters that signed at least two petitions last year but never made a donation”? In this blog post I share some of the techniques that can be used to make such queries reasonably fast.

Optimizing the query

NOTE: The timings are meant to give a feel for the scalability of each query. They are not benchmarks. Your results may vary - though I'm pretty confident that the general considerations hold for a wide range of systems.

Before the optimizations, the query generated for the example (“All supporters that signed at least two petitions last year but never made a donation”) looked something like this (slightly simplified):


SELECT r.contact_id, r.first_name, r.last_name, e.redhen_contact_email_value AS email
FROM redhen_contact r
  INNER JOIN field_data_redhen_contact_email e ON e.entity_type='redhen_contact' AND e.entity_id=r.contact_id
WHERE r.contact_id IN (
  SELECT act.contact_id
  FROM campaignion_activity act
    INNER JOIN campaignion_activity_webform USING(activity_id)
    INNER JOIN node n USING(nid)
  WHERE act.type='webform_submission'
    AND n.type='petition'
    AND act.created BETWEEN UNIX_TIMESTAMP('2014-01-01 00:00:00') AND UNIX_TIMESTAMP('2015-01-01 00:00:00')
  GROUP BY act.contact_id
  HAVING count(*) > 1
)
AND r.contact_id NOT IN (
  SELECT act.contact_id
  FROM campaignion_activity act
  WHERE act.type='webform_payment'
)
ORDER BY r.updated DESC

Since we use a pager in the interface, this is not the actual query that was run on our database. The pager created two derivatives of that query:

  1. The count-query: The original query with the fields replaced with COUNT(*) and ORDER BY removed.
  2. The limit-query: The original query with an additional LIMIT 20.
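Sketched in SQL, the two derived queries look roughly like this (an illustration of the pattern, not the exact SQL that Drupal's pager generates; the subqueries are elided):

```sql
-- Count-query: the select list replaced with COUNT(*), ORDER BY dropped.
SELECT COUNT(*)
FROM redhen_contact r
  INNER JOIN field_data_redhen_contact_email e
    ON e.entity_type='redhen_contact' AND e.entity_id=r.contact_id
WHERE r.contact_id IN (...) AND r.contact_id NOT IN (...);

-- Limit-query: the original query with a LIMIT appended.
SELECT r.contact_id, r.first_name, r.last_name, e.redhen_contact_email_value AS email
FROM redhen_contact r
  INNER JOIN field_data_redhen_contact_email e
    ON e.entity_type='redhen_contact' AND e.entity_id=r.contact_id
WHERE r.contact_id IN (...) AND r.contact_id NOT IN (...)
ORDER BY r.updated DESC
LIMIT 20;
```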

Each of those queries took more than 5 minutes to execute. That was pretty embarrassing.

Joins

One strategy to speed up queries is to “avoid nested queries at all cost”.

The example query can indeed be transformed to use joins. In general things look different though: imagine multiple filters each needing their own HAVING clause. Anyway, let's take a look at the example transformed to use joins:


SELECT r.contact_id, r.first_name, r.last_name, e.redhen_contact_email_value AS email
FROM redhen_contact r
  INNER JOIN field_data_redhen_contact_email e ON e.entity_type='redhen_contact' AND e.entity_id=r.contact_id
-- First filter
  INNER JOIN campaignion_activity act1 ON act1.contact_id=r.contact_id
    AND act1.type='webform_submission'
    AND act1.created BETWEEN UNIX_TIMESTAMP('2014-01-01 00:00:00') AND UNIX_TIMESTAMP('2015-01-01 00:00:00')
  INNER JOIN campaignion_activity_webform actw1 ON actw1.activity_id=act1.activity_id 
  INNER JOIN node n1 ON n1.nid=actw1.nid AND n1.type='petition'
-- Second filter
  LEFT OUTER JOIN campaignion_activity act2 ON act2.contact_id=r.contact_id
    AND act2.type='webform_payment'
WHERE act2.contact_id IS NULL
GROUP BY r.contact_id
HAVING count(*) > 1
ORDER BY r.updated DESC

This query takes about 5.1s (count-query: 1.9s, limit-query 3.2s) which is a lot faster than the >5mins from the original query.

Temporary tables to the rescue!

At that point I wanted to know whether the GROUP BY/HAVING was the dealbreaker. It turns out it wasn't. The two inner selects each finished in far less than a second (0.75s and 0.04s respectively), so they can't explain the slow query. The next idea was to separate the activity filters from the rest of the query and check whether that makes any difference.

The idea was to explicitly list all contacts matching the first activity filter, then get the list of all contacts additionally matching the second filter. In Drupal this can be achieved using db_query_temporary(). Here is a translation of that into MySQL:


CREATE TEMPORARY TABLE a1
SELECT act.contact_id
FROM campaignion_activity act
  INNER JOIN campaignion_activity_webform USING(activity_id)
  INNER JOIN node n USING(nid)
WHERE act.type='webform_submission'
  AND n.type='petition'
  AND act.created BETWEEN UNIX_TIMESTAMP('2014-01-01 00:00:00') AND UNIX_TIMESTAMP('2015-01-01 00:00:00')
GROUP BY act.contact_id
HAVING count(*) > 1;

CREATE TEMPORARY TABLE a2
SELECT r.contact_id
FROM a1 r
  LEFT OUTER JOIN campaignion_activity act ON act.contact_id=r.contact_id AND act.type='webform_payment'
WHERE act.contact_id IS NULL;

SELECT r.contact_id, r.first_name, r.last_name, e.redhen_contact_email_value AS email
FROM redhen_contact r
  INNER JOIN field_data_redhen_contact_email e ON e.entity_type='redhen_contact' AND e.entity_id=r.contact_id
  INNER JOIN a2 USING(contact_id)
ORDER BY r.updated DESC

Using that I got the following timings:

  • Creating a1: 1.1s.
  • Creating a2: 0.3s.
  • The count-query: 0.05s.
  • The limit-query: 0.1s.

So from more than 5 minutes we are down to under 2s. That's even better than the query using joins!

Side note: avoid joins between two temporary tables - they are incredibly slow!

We got lucky with this query, though: the first filter applied in a1 is a lot more restrictive than the second one (7,000 matches vs. 400,000). If I swap them, we get the following timings:

  • Creating a2: 7.5s.
  • Creating a1: 4.8s.
  • The count-query: 0.05s.
  • The limit-query: 0.1s.

So one further optimization would be to estimate the number of matches for each filter and then always apply the most restrictive filter first.
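The idea can be sketched in a few lines. Here is a minimal, self-contained demonstration using Python and sqlite3 - the table is a toy stand-in for campaignion_activity with made-up data, and the filter definitions are hypothetical, not campaignion's actual code. Each filter is first counted, then the filters are materialized as chained temporary tables, most restrictive first:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Toy stand-in for campaignion_activity with invented rows.
con.executescript("""
CREATE TABLE campaignion_activity (activity_id INTEGER, contact_id INTEGER, type TEXT);
INSERT INTO campaignion_activity VALUES
  (1, 1, 'webform_submission'), (2, 1, 'webform_submission'),
  (3, 2, 'webform_submission'), (4, 2, 'webform_payment'),
  (5, 3, 'webform_submission'), (6, 3, 'webform_submission'),
  (7, 4, 'webform_submission');
""")

# Each filter yields the contact_ids it matches (hypothetical definitions).
filters = {
    "two_submissions":
        "SELECT contact_id FROM campaignion_activity "
        "WHERE type='webform_submission' "
        "GROUP BY contact_id HAVING COUNT(*) >= 2",
    "no_payment":
        "SELECT contact_id FROM campaignion_activity "
        "GROUP BY contact_id "
        "HAVING SUM(type='webform_payment') = 0",
}

# Estimate each filter's match count, then order most restrictive first.
estimates = {name: con.execute(f"SELECT COUNT(*) FROM ({sql})").fetchone()[0]
             for name, sql in filters.items()}
order = sorted(filters, key=estimates.get)

# Chain temporary tables: each one intersects with the previous result.
prev = None
for i, name in enumerate(order):
    table = f"f{i}"
    if prev is None:
        con.execute(f"CREATE TEMPORARY TABLE {table} AS {filters[name]}")
    else:
        con.execute(
            f"CREATE TEMPORARY TABLE {table} AS "
            f"SELECT contact_id FROM ({filters[name]}) "
            f"WHERE contact_id IN (SELECT contact_id FROM {prev})")
    prev = table

result = [row[0] for row in
          con.execute(f"SELECT contact_id FROM {prev} ORDER BY contact_id")]
print(result)  # contacts matching all filters
```

With the toy data, "two_submissions" matches 2 contacts and "no_payment" matches 3, so the former is materialized first; the counting step costs one extra query per filter, which is cheap compared to materializing a 400,000-row result that the next filter shrinks anyway.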

Bulk-Operations

The tricky bit about the bulk operations is how to pass (potentially hundreds of thousands of) contact ids to the bulk operation.

Getting them all from the database at once will not work: sending them back and forth between Drupal and the database simply takes too long. Sometimes a bulk operation would time out before it had even started.

The next idea is to pass the query / filter itself to the bulk operation. While this works, re-running the filter query (which takes several seconds) for each batch isn't exactly optimal either. That leaves only one other option: store the result in a dedicated table in the database. (Credit goes to CiviCRM, where I first saw that approach.)

In campaignion we now put the results of the filter into their own table using an INSERT INTO … SELECT … query. Each batch then only needs to fetch a number of contact ids from that table. The trade-off is that we have to manage an extra set of tables (i.e. clean up old results from time to time).
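The mechanism can be sketched like this, again with Python and sqlite3 and an invented schema (filtered_contacts and op_id are illustrative names, not campaignion's actual tables): the filter result is materialized once with INSERT INTO … SELECT, and each batch then picks up where the previous one stopped using keyset pagination on the primary key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Invented schema: a results table keyed by a bulk-operation id,
# and a toy contact table to filter.
con.executescript("""
CREATE TABLE filtered_contacts (op_id INTEGER, contact_id INTEGER,
                                PRIMARY KEY (op_id, contact_id));
CREATE TABLE redhen_contact (contact_id INTEGER PRIMARY KEY);
INSERT INTO redhen_contact VALUES (1), (2), (3), (4), (5);
""")

# Materialize the filter result once (the real filter query would go here).
op_id = 42
con.execute(
    "INSERT INTO filtered_contacts (op_id, contact_id) "
    "SELECT ?, contact_id FROM redhen_contact",
    (op_id,))

def next_batch(last_id, size=2):
    """Fetch the next slice after last_id - no OFFSET, so each batch is cheap."""
    rows = con.execute(
        "SELECT contact_id FROM filtered_contacts "
        "WHERE op_id=? AND contact_id>? "
        "ORDER BY contact_id LIMIT ?",
        (op_id, last_id, size)).fetchall()
    return [r[0] for r in rows]

batches, last = [], 0
while True:
    batch = next_batch(last)
    if not batch:
        break
    batches.append(batch)  # the bulk operation would process this batch here
    last = batch[-1]

# Cleanup once the bulk operation is finished.
con.execute("DELETE FROM filtered_contacts WHERE op_id=?", (op_id,))
print(batches)
```

Keying the rows by an operation id is what makes the cleanup manageable: stale results can be deleted per operation (or by age) without touching other running bulk operations.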
