The problem:
This scenario is a more practical case of a concept similar to Rails / ActiveRecord count, size, and length. Here, you need to run Ruby code against either every record returned by a where call or a subset determined by Ruby code. For example, assume that a model Entry has a method after_sunrise (you would probably want a real sunrise/sunset gem, but this is a simplistic example) as follows:
# Determine whether a time included in the string description (the what
# attribute) of the entry is after sunrise; nil (falsey) if no time is found.
def after_sunrise
  notated_time = what.match(/\d+:?\d* ?[AP]M/)
  return nil if notated_time.nil?

  Time.parse([date.to_s, notated_time.to_s].join(' ')) >
    Time.parse([date.to_s, "6:30 AM"].join(' '))
end
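To see the logic in isolation, here is a standalone sketch of the same check outside of Rails. Entry here is a plain Struct standing in for the ActiveRecord model, and the what and date attributes plus the hard-coded 6:30 AM sunrise are taken straight from the example above; everything else is illustrative scaffolding, not the real model.

```ruby
require 'time' # Time.parse; also pulls in the date library for Date

# Stand-in for the ActiveRecord model: `what` is the text description,
# `date` is the entry's date.
Entry = Struct.new(:what, :date) do
  # True if a time notated in the description is after 6:30 AM,
  # false if not, nil if no time is found in the description.
  def after_sunrise
    notated_time = what.match(/\d+:?\d* ?[AP]M/)
    return nil if notated_time.nil?

    Time.parse([date.to_s, notated_time.to_s].join(' ')) >
      Time.parse([date.to_s, '6:30 AM'].join(' '))
  end
end

morning = Entry.new('Walked the dog at 7:15 AM', Date.new(2021, 8, 7))
early   = Entry.new('Woke up at 5:00 AM',        Date.new(2021, 8, 7))
no_time = Entry.new('Walked the dog',            Date.new(2021, 8, 7))

morning.after_sunrise # => true
early.after_sunrise   # => false
no_time.after_sunrise # => nil
```

Because the filter depends on parsing free text, it can't be pushed into SQL, which is exactly why the rest of this post matters: the filtering has to happen in Ruby, one record at a time.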
Using select for filtering causes the entire result of the where clause to be loaded at once:
irb(main):178:0> entries = Entry.where(private: false).select { |entry| entry.after_sunrise } #.each { do something with each record here }
D, [2021-08-07T17:53:52.492261 #4] DEBUG -- : Entry Load (10.4ms) SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 [["private", false]]
=>
[#<Entry:0x000055ddafc1b230
The fix: find_each
If you don’t have very many records, this isn’t a big deal. But if your model instances carry a lot of data and you have millions of rows, you risk running out of memory trying to load them all at once. Any time you take the result of a query or association and start operating on it like an array of Ruby objects, you force the lazily loaded ActiveRecord::Relation or ActiveRecord::Associations::CollectionProxy to load all of those records.
If you switch to find_each, you can load those records in batches:
irb(main):179:1* entries = Entry.where(private: false).find_each do |entry|
irb(main):180:1* next unless entry.after_sunrise
irb(main):181:1* # do something with each record here
irb(main):182:0> end
D, [2021-08-07T17:57:07.895360 #4] DEBUG -- : Entry Load (5.1ms) SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 ORDER BY "entries"."id" ASC LIMIT $2 [["private", false], ["LIMIT", 1000]]
=> nil
I only have 214 records in this example database, so the default batch size of 1000 makes this look almost identical to the original. However, you can use the batch_size: keyword argument to tune the size of the batches pulled: larger if your records are small and more than 1000 can comfortably be loaded at a time, smaller if they’re large and 1000 is too many, or, in a contrived example like this one, to show the batching actually happening:
irb(main):184:1* entries = Entry.where(private: false).find_each(batch_size: 2) do |entry|
irb(main):185:1* next unless entry.after_sunrise
irb(main):186:1* # do something with each record here
irb(main):187:0> end
D, [2021-08-07T17:59:22.339190 #4] DEBUG -- : Entry Load (1.9ms) SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 ORDER BY "entries"."id" ASC LIMIT $2 [["private", false], ["LIMIT", 2]]
D, [2021-08-07T17:59:22.344225 #4] DEBUG -- : Entry Load (1.8ms) SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 AND "entries"."id" > $2 ORDER BY "entries"."id" ASC LIMIT $3 [["private", false], ["id", 8], ["LIMIT", 2]]
D, [2021-08-07T17:59:22.347152 #4] DEBUG -- : Entry Load (1.9ms) SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 AND "entries"."id" > $2 ORDER BY "entries"."id" ASC LIMIT $3 [["private", false], ["id", 10], ["LIMIT", 2]]
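Note what the logs above show: rather than using OFFSET, find_each orders by the primary key and fetches each batch with an "id greater than the last id seen" condition (keyset pagination), so each query stays cheap no matter how deep into the table it is. Here is a plain-Ruby sketch of that pattern; FakeTable and find_each_sketch are illustrative stand-ins, not ActiveRecord internals.

```ruby
# A fake entries table backed by an array of hashes.
FakeTable = Struct.new(:rows) do
  # Mimics: SELECT * FROM t WHERE id > after_id ORDER BY id LIMIT batch_size
  def batch_after(after_id, batch_size)
    rows.select { |r| r[:id] > after_id }
        .sort_by { |r| r[:id] }
        .first(batch_size)
  end
end

# Sketch of find_each: repeatedly fetch the next batch keyed off the
# last id seen, yielding one record at a time, until a batch comes back empty.
def find_each_sketch(table, batch_size: 2)
  last_id = 0
  loop do
    batch = table.batch_after(last_id, batch_size)
    break if batch.empty?

    batch.each { |row| yield row }
    last_id = batch.last[:id]
  end
end

table = FakeTable.new((1..5).map { |i| { id: i, private: false } })
seen = []
find_each_sketch(table) { |row| seen << row[:id] }
seen # => [1, 2, 3, 4, 5]
```

With five rows and a batch size of 2, this issues three non-empty "fetches" (2, 2, then 1 rows) plus one final empty one, which is the same shape as the three SQL queries in the log output above.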
Conclusion
The difference between chaining .each or .select off of a .where clause vs. chaining .find_each is something you won’t necessarily see the benefits of if you’re building something from the ground up and don’t have much data flowing through your application. You may even have a successful launch, and only hit trouble once you grow by an order of magnitude or so. That’s part of the challenge of recognizing the need for it.