How to … Minimise running Sidekiq jobs

I’m working for a customer project, and stand for the task to build a view of many models¹ — yeap, the data model was/is a poor one —, so I came up, inspired by this post, with Scenic to use a materialized view as solution.

All was fine, cause the data once imported will never change, only one thing to have in mind is, refreshing the view after data import. Naming the model in this example Result ends up in this worker:

class ResultRefreshJob
  include Sidekiq::Worker
  sidekiq_options queue: :views, retry: false

  def perform
    Result.refresh
  end
end

it calls the refresh method of our materialized view as it is documented here.

All fine, but some months later, a feature request comes in, to add another column. Sounds after a low-hanging-fruit, but these data can occur 0..n times and it ends up in a flatten pivot table — presenting it as pivot table was not wanted —, means, the data grows exponential with count of entries. And a second problem was, the data can be changed, very quickly.

Its naïve to think the above solution can handle the boost of jobs, which blocking itself.

Goal

So the count of running jobs must be minimised
and it have to be ensured the view is refreshed.

I came up with a second job, which checks if refresh jobs are running, if not it enqueues a new one, and if one exists, an error will be raised and it will be enqueued again.

It looks like following:

class ResultEnqueueJob
  include Sidekiq::Worker
  sidekiq_options queue: :views, retry: 3

  # TODO: this should be observed and maybe adapted
  def perform
    running_worker = Sidekiq::Workers.new.map(&:last).select { |x| x['queue'] == 'views' }.last
    can_run = if running_worker
                run_at = DateTime.strptime("#{running_worker['run_at']}", '%s')
                run_at < DateTime.current - 1.minutes
              else
                true
              end

    if can_run
      ResultRefreshJob.perform_async
    else
      raise StandardError, "---> another ResultRefreshJob is running"
    end
  end
end

ok, looks quite simple, but what’s the trick here?

We have here 3 parameters, number one the retry count (rc), number two the timespan we go back (ts) and number three, the unknown one, we can only estimate and observe, the time to rebuild the view (tr).

The actual values are tc = 3, ts = 1min and we estimate tr to round 2min, now we have to optimize these values, to met the above goal and to adapt it on changes.

Have in mind, that the time to retry will exponential growing, see: automatic job retry.

Example:

t_x : job₁ checks if can start → no job running, starts
t_x+ts : job₂ checks if can start → job₁ is running, can not
t_x+ts+30s : job₂, checks again → job₁ is running, can not
t_x+ts+30s+46s : job₂, checks again → job₁ finished, starts

joining 6 tables, resulting in some 100m rows, this takes around 1-2min to perform ↩