Again working on a frontend for visualizing a large amount of data; you probably know the situation: a kind of index view, or a GET all. So you start to implement something clever to avoid fetching all the data at once, and you end up with pagination or something similar that uses raw SQL LIMIT and OFFSET to speed things up. This usually requires knowing the total count of items, so the OFFSET can be calculated correctly.
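As a quick illustration of why the total is needed, here is a minimal sketch of the usual page arithmetic (all names and numbers are illustrative):
page = 3                                           # requested page, 1-based
per_page = 100
total_count = 4_500_000                            # result of SELECT COUNT(*)
total_pages = (total_count.to_f / per_page).ceil   # needed to render the pager
offset = (page - 1) * per_page                     # value passed to SQL OFFSET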
The following log output shows the count and select statements; the results table contains ~4.5M rows …
(129.8ms) SELECT COUNT(*) FROM "results"
…
Result Load (0.4ms) SELECT "results".* FROM "results" LIMIT $1 OFFSET $2 [["LIMIT", 100], ["OFFSET", 0]]
one can see that the COUNT statement takes many times longer than the SELECT statement.
But do we need a fresh COUNT every time we fetch the next slice of data? No.
Separate endpoints for count and data fetching
Add another endpoint that returns the count only when it's needed, for example on page load.
get 'count' do
results = ::Result.filter(params)
{ count: results.count }
end
get 'filter' do
limit = params[:limit] || 500
offset = params[:offset] || 0
results = ::Result.filter(params).limit(limit).offset(offset)
{ items: results }
end
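The client-side flow then looks roughly like this; the URL prefix depends on where the API is mounted, and the status parameter is a made-up example:
# Illustrative request flow: fetch the count once, then page through the data.
# GET /results/count?status=done                     => { "count": ... }
# GET /results/filter?status=done&limit=100&offset=0
# GET /results/filter?status=done&limit=100&offset=100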
OK, this reduces how often the count statement is executed, since the count endpoint is only hit when the parameters change, but it doesn't make the count itself any faster.
Estimate
While thinking and researching about what could be improved, I found this inspiring post: count-performance.
OK, why not … the exact count isn't really needed, so I decided to give it a try.
The Function
First we have to add an SQL function that does the estimation. For that, the function from the post above is packed into a migration.
class CreateEstimateFunction < ActiveRecord::Migration[6.0]
  def up
    function = "CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
      DECLARE
        rec record;
        rows integer;
      BEGIN
        FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
          rows := substring(rec.\"QUERY PLAN\" FROM ' rows=([[:digit:]]+)');
          EXIT WHEN rows IS NOT NULL;
        END LOOP;
        RETURN rows;
      END;
      $$ LANGUAGE plpgsql VOLATILE STRICT;"
    ActiveRecord::Base.connection.execute(function)
  end

  def down
    ActiveRecord::Base.connection.execute("DROP FUNCTION count_estimate(text);")
  end
end
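A quick sanity check, e.g. from the Rails console; the inner query is just an example:
# Ask the planner for its row estimate; this returns an Integer, not an exact count.
ActiveRecord::Base.connection
  .exec_query("SELECT count_estimate('SELECT 1 FROM results')")
  .rows.flatten.first.to_i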
Remember the name of the function (here: count_estimate) for later usage.
Now that we have the function in place, we should make use of it. So we change the count endpoint …
get 'count' do
count = ::Result.counting(params)
{ count: count }
end
to call the new model method. Note that it takes the same parameters as the filter endpoint.
def self.filter(params)
# filter implementation
# returns an ActiveRecord::Relation object
end
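For reference, a minimal sketch of what such a filter could look like; the status and from attributes are invented for illustration:
def self.filter(params)
  scope = all  # start from the full relation
  scope = scope.where(status: params[:status]) if params[:status].present?
  scope = scope.where("created_at >= ?", params[:from]) if params[:from].present?
  scope  # still an ActiveRecord::Relation, so limit/offset/to_sql can be chained
end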
def self.counting(params)
  query = if search_params_given?(params) # hypothetical predicate: true when search params are present
            insert = ActiveRecord::Base.connection.quote(filter(params).to_sql)
            search_total_query(insert)
          else
            total_query
          end
  result = ActiveRecord::Base.connection.exec_query(query)
  result.rows.flatten.first.to_i
end
def self.search_total_query(query)
  "SELECT count_estimate(" + query + ");"
end

# helper methods

def self.total_query
  "SELECT
    (reltuples/relpages) * (pg_relation_size('results') /
    (current_setting('block_size')::integer)) AS count
  FROM pg_class WHERE relname = 'results';"
end
To be consistent, we use both estimations and decide via search_params_given? which one to choose.
The search estimate is useful when search queries narrow down the result space, so the estimate won't differ too much from the exact count.
The total estimate doesn't even need an EXPLAIN: it derives the rows per page from the pg_class statistics (reltuples/relpages) and scales that by the current number of pages of the relation.
For the search case, the return value of the filter(params) method has to be an ActiveRecord::Relation object, so that we can reuse it here to generate the SQL string.
First, some comparisons:
timings | count | estimate
---|---|---
no filter | 116.8ms | 0.7ms
filter | 9130.1ms | 7.8ms
The difference is evident.