Use find_each instead of select or each when you need to process each record in Ruby


The problem:

This scenario is a more practical case of the same idea behind Rails / ActiveRecord count, size, and length. Here, you need to run Ruby code against either every record returned by a where call or a subset of those records that only Ruby code can pick out. For example, assume the model Entry has a method after_sunrise (you'd probably want a real sunrise/sunset gem, but this is a simplistic example) as follows:

  # Determine whether a time mentioned in the entry's description (what) is after sunrise.
  # Returns nil (falsey) if no time is found.
  def after_sunrise
    notated_time = what[/\d+:?\d* ?[AP]M/]
    return nil if notated_time.nil?

    # Sunrise is hard-coded to 6:30 AM for this simplistic example.
    Time.parse("#{date} #{notated_time}") > Time.parse("#{date} 6:30 AM")
  end
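
If you want to poke at this method outside of a Rails console, here is a minimal standalone sketch. It assumes what is a String and date is a Date. The SketchEntry Struct and the sample descriptions are made up for illustration and are not the real Entry model.

```ruby
require 'date'
require 'time'

# Matches notations such as "7:15 AM" or "6 AM" (same pattern as the method above).
NOTATED_TIME = /\d+:?\d* ?[AP]M/

# Hypothetical stand-in for the Entry model: only the two attributes the method reads.
SketchEntry = Struct.new(:what, :date) do
  # true or false when a time is found in the description, nil (falsey) otherwise
  def after_sunrise
    notated_time = what[NOTATED_TIME]
    return nil if notated_time.nil?

    # Sunrise is hard-coded to 6:30 AM, as in the simplistic example.
    Time.parse("#{date} #{notated_time}") > Time.parse("#{date} 6:30 AM")
  end
end

day = Date.new(2021, 8, 7)
morning_walk = SketchEntry.new('Walked the dog at 7:15 AM', day)
early_run    = SketchEntry.new('Ran at 5:45 AM', day)
no_time      = SketchEntry.new('Stayed home', day)

puts morning_walk.after_sunrise.inspect
puts early_run.after_sunrise.inspect
puts no_time.after_sunrise.inspect
```

The three sample entries cover the three possible outcomes: a time after 6:30 AM, a time before it, and a description with no time at all.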

Using select to filter forces the entire result of the where clause to be loaded at once:

irb(main):178:0> entries = Entry.where(private: false).select { |entry| entry.after_sunrise } #.each { do something with each record here }
D, [2021-08-07T17:53:52.492261 #4] DEBUG -- :   Entry Load (10.4ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1  [["private", false]]
=>
[#<Entry:0x000055ddafc1b230

find_each for the batch

If you don’t have very many records, this isn’t a big deal. But if each model instance carries a lot of data and you have millions of rows, you risk running out of memory trying to load them all at once. Any time you take the result of a query or association and operate on it like an array of Ruby objects, you force the lazily loaded ActiveRecord::Relation or ActiveRecord::Associations::CollectionProxy to load all of its records.

If you switch to find_each you can load those records in batches:

irb(main):179:1* entries = Entry.where(private: false).find_each do |entry|
irb(main):180:1*   next unless entry.after_sunrise
irb(main):181:1*   # do something with each record here
irb(main):182:0> end
D, [2021-08-07T17:57:07.895360 #4] DEBUG -- :   Entry Load (5.1ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 ORDER BY "entries"."id" ASC LIMIT $2  [["private", false], ["LIMIT", 1000]]
=> nil

I only have 214 records in this example database, so with the default batch size of 1000 this looks almost identical to the original. However, you can use the batch_size: keyword argument to tune the size of each batch: raise it if your records are small enough that more than 1000 fit comfortably in memory, lower it if they’re large and 1000 is too many, or shrink it if you have a contrived example and want to show it actually batching:

irb(main):184:1* entries = Entry.where(private: false).find_each(batch_size: 2) do |entry|
irb(main):185:1*   next unless entry.after_sunrise
irb(main):186:1*   # do something with each record here
irb(main):187:0> end
D, [2021-08-07T17:59:22.339190 #4] DEBUG -- :   Entry Load (1.9ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 ORDER BY "entries"."id" ASC LIMIT $2  [["private", false], ["LIMIT", 2]]
D, [2021-08-07T17:59:22.344225 #4] DEBUG -- :   Entry Load (1.8ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 AND "entries"."id" > $2 ORDER BY "entries"."id" ASC LIMIT $3  [["private", false], ["id", 8], ["LIMIT", 2]]
D, [2021-08-07T17:59:22.347152 #4] DEBUG -- :   Entry Load (1.9ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 AND "entries"."id" > $2 ORDER BY "entries"."id" ASC LIMIT $3  [["private", false], ["id", 10], ["LIMIT", 2]]
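
The queries in that log follow a simple pattern: each batch asks for rows whose id is greater than the last id seen, ordered by id, limited to the batch size. Here is a rough pure-Ruby sketch of that loop over an in-memory array, just to make the mechanics concrete. The each_in_batches and fetch_batch helpers and the ROWS data are hypothetical names invented for this illustration. This is not ActiveRecord's actual implementation of find_each, only the same idea.

```ruby
# Hypothetical in-memory "table": hashes standing in for rows, not real ActiveRecord objects.
ROWS = [
  { id: 5,  private: true  },
  { id: 7,  private: false },
  { id: 8,  private: false },
  { id: 9,  private: false },
  { id: 10, private: false },
  { id: 11, private: false }
].freeze

# Stand-in for one SQL query:
#   WHERE private = false AND id > last_id ORDER BY id ASC LIMIT batch_size
def fetch_batch(rows, last_id, batch_size)
  rows.select { |row| !row[:private] && (last_id.nil? || row[:id] > last_id) }
      .sort_by { |row| row[:id] }
      .first(batch_size)
end

# Yields one row at a time while holding only a single batch at once.
def each_in_batches(rows, batch_size: 1000)
  last_id = nil
  loop do
    batch = fetch_batch(rows, last_id, batch_size)
    batch.each { |row| yield row }
    break if batch.size < batch_size # a short batch means nothing is left

    last_id = batch.last[:id] # the next "query" starts after this id
  end
end

seen_ids = []
each_in_batches(ROWS, batch_size: 2) { |row| seen_ids << row[:id] }
puts seen_ids.inspect
```

Because each round only remembers the last id, the loop never needs the whole result set in memory, which is the property that makes find_each safe on large tables.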

Conclusion

The difference between chaining .each or .select off of a .where clause and chaining .find_each is one whose benefits you won’t necessarily see if you’re building something from the ground up and don’t have much data flowing through your application. You may even have a successful launch and not hit the problem until you grow by an order of magnitude or so. That’s part of the challenge of recognizing the need for it.

