Found the problem making that SQL query from yesterday take so long. As it turns out, there was a quirk in our ad hoc “configuration management” system.

About a week before competition, we started keeping a database pre-loaded with teams and services updated, so that it didn’t strike 9:50 on 4/20 and leave us with a big gob of work to be done in incredibly short order.

Compounded with this, we had been working on the scoreboard pretty heavily over the same time span, trying to speed it up. It was during this period in which the big query (from yesterday) was written, as well as a migration to add indices to a table that needed them:

class IndexUp < ActiveRecord::Migration
  def self.up
    add_index(:uptimes, [:team_id, :service_id, :flag_id, :value])
    add_index(:uptimes, :service_id)
    add_index(:uptimes, :team_id)
    add_index(:uptimes, :flag_id)
    add_index(:uptimes, :value)
  end

  def self.down
    remove_index(:uptimes, [:team_id, :service_id, :flag_id, :value])
    remove_index(:uptimes, :team_id)
    remove_index(:uptimes, :service_id)
    remove_index(:uptimes, :flag_id)
    remove_index(:uptimes, :value)
  end
end

We kept testing the whole week, satisfied with the speed of the scoreboard on the test data that had been accumulating all week just by running the uptime checker/flag placer against a handful of VMs.

Morning of, we’ve got the whole setup from the previous day in a satisfactory condition, so we copy the directory for the database right on top of the one filled with test data.

RAILS_ENV=production rake db:migrate

That’s what we forgot. With the indices above, the big query ran in 0.6 seconds instead of the 6 it was taking on the machine we tested with today.

So even if we do nothing else, the scoreboard next game won’t be as bad, because instead of a cache miss taking 20 seconds (and queueing up 20 seconds worth of cascading cache misses), it’ll now only queue up 1 second of cascading cache misses.

And I think we have a solution for that too.

Prefetching in Rails

The plan right now is to use caching on both the SQL output and page caching. Since we can’t completely negate the penalty of a page cache miss (or the cascading cache misses), we’re planning to also cache SQL output. Since you can easily populate the SQL cache from outside an HTTP request (using script/runner for example), we can set up the two sets of caches to complement each other, operated by a script parallel to the HTTP service:

  1. Expire (either by force or a cache that knows expiration) the SQL cache.
  2. Perform SQL requests that will get cached.
  3. Expire the page cache.

The next request might still cause a cascading cache miss, but since it’ll all be in-process misses (the data will be cooling its heels in the cache), it shouldn’t bring anything to its knees.

I’m still hoping that this won’t be necessary for our scoreboard, but I’m throwing it out here for criticism and in case somebody else needs it.