Use find_each vs. select or each when needing to process each record via Ruby

The problem:

This scenario is a much more practical case of a similar concept with Rails / ActiveRecord count, size, and length. In this case, you’re needing to run ruby code off of either every value from a where method result or a subset that is determined by ruby code. For example, assume that a model Entry has a method after_sunrise (probably want a real sunrise/sunset gem but this is a simplistic example) as follows:

  # determine if a time included in the string description of the entry is after sunrise
  # nil (falsey) if not found
  def after_sunrise
    notated_time = what.match('\d+:?\d* ?[AP]M')
    return nil if notated_time.nil?

    (Time.parse([date.to_s, notated_time].join(' ')) > 
      Time.parse([date.to_s, "6:30 AM"].join(' ')))
  end

Using select for filtering will cause the result from where clause to be loaded all at once:

irb(main):178:0> entries = Entry.where(private: false).select { |entry| entry.after_sunrise } #.each { do something with each record here }
D, [2021-08-07T17:53:52.492261 #4] DEBUG -- :   Entry Load (10.4ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1  [["private", false]]
=>
[#<Entry:0x000055ddafc1b230

find_each for the batch

If you don’t have very many records, this isn’t a big deal, but if you have models with a lot of data per model instance and you have millions of rows, you risk running out of memory trying to load it all at once. Anytime you take the result of a query or association and try to switch to operating on it like an array of ruby objects, you force the lazy loading of ActiveRecord::Relation and ActiveRecord::Association::CollectionProxy to load those records.

If you switch to find_each you can load those records in batches:

irb(main):179:1* entries = Entry.where(private: false).find_each do |entry|
irb(main):180:1*   next unless entry.after_sunrise
irb(main):181:1*   # do something with each record here
irb(main):182:0> end
D, [2021-08-07T17:57:07.895360 #4] DEBUG -- :   Entry Load (5.1ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 ORDER BY "entries"."id" ASC LIMIT $2  [["private", false], ["LIMIT", 1000]]
=> nil

I only have 214 records in this example database, so the default batch size of 1000 makes this look almost identical to the original. However… you can use a keyword argument of batch_size: to tune the size of the batches pulled, in case your records are small and so more than 1000 records can be loaded at a time or they’re large and 1000 records is too much, or you have contrived example and want to show it actually batching:

irb(main):184:1* entries = Entry.where(private: false).find_each(batch_size: 2) do |entry|
irb(main):185:1*   next unless entry.after_sunrise
irb(main):186:1*   # do something with each record here
irb(main):187:0> end
D, [2021-08-07T17:59:22.339190 #4] DEBUG -- :   Entry Load (1.9ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 ORDER BY "entries"."id" ASC LIMIT $2  [["private", false], ["LIMIT", 2]]
D, [2021-08-07T17:59:22.344225 #4] DEBUG -- :   Entry Load (1.8ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 AND "entries"."id" > $2 ORDER BY "entries"."id" ASC LIMIT $3  [["private", false], ["id", 8], ["LIMIT", 2]]
D, [2021-08-07T17:59:22.347152 #4] DEBUG -- :   Entry Load (1.9ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 AND "entries"."id" > $2 ORDER BY "entries"."id" ASC LIMIT $3  [["private", false], ["id", 10], ["LIMIT", 2]]

Conclusion

The difference between chaining .each or .select off of a .where clause vs. chaining .find_each is something you won’t necessarily see the benefits of if you’re building something from the ground up and don’t have much data flowing through your application. You may even even have a successful launch until you grow an order of magnitude or so. That’s part of the challenge of recognizing the need for it.


Rails / ActiveRecord count, size, and length

When trying to be sensitive to n+1 queries and memory usage, knowing the differences between count, size, and length in ActiveRecord is important. It had been a while since I reviewed the usage, and I wanted to ensure that I hadn’t made some bad assumptions along the way that somehow stuck. The reality is that each method is pretty close to indicating what it will do, with size being the method that will load the data on (or for) you.

count

Back in the old days count was a more sophisticated member of ActiveRecord::Calculations::ClassMethods module. You could pass conditions to the method, or column names… basically a combination where and includes and joins.

The column/distinct counting moved to ActiveRecord::Calculations without all the extra conditionals, joins, and including. Note that you do not need a query to “count”:

irb(main):011:0> Model.count(:special_data) # count Model records with non-nil special_data
   (191.9ms)  SELECT COUNT(`models`.`special_data`) FROM `models`
=> 41828
irb(main):012:0> Model.distinct.count(:special_data) # count Model records with DISTINCT non-nil special_data
   (17.6ms)  SELECT COUNT(DISTINCT `models`.`special_data`) FROM `models`
=> 1909
irb(main):013:0> Model.count # count all records
   (3790.8ms)  SELECT COUNT(*) FROM `models`
=> 594383

If you’re just looking for a count of records for a query that has not been loaded, that’s now a member of ActiveRecord::Associations::CollectionProxy.

irb(main):015:0> Model.all.count
   (744.2ms)  SELECT COUNT(*) FROM `models`
=> 594383
irb(main):017:0> Model.where('special_data is not null').count
   (24.0ms)  SELECT COUNT(*) FROM `models` WHERE (special_data is not null)
=> 41828

length

length will load all of the records indicated by a collection, which might be useful if calling length on an association that you’re going to use the data from anyway, but not if you are throwing that data away. You’ll be wasting time (and memory) on the operation.

irb(main):018:0> Model.where('special_data is not null').length
  Model Load (647.9ms)  SELECT ...
.
.
.
=> 41828

You also can’t call length on a model’s class name, as it is not a collection itself:

irb(main):020:0> Model.length
Traceback (most recent call last):
        1: from (irb):20
NoMethodError (undefined method `length' for #<Class:0x00007f810ed6ec28>)

size

size also requires a collection, but does not attempt to load that collection, instead wrapping a COUNT around its query:

irb(main):022:0> Model.where('special_data is not null').count
   (22.8ms)  SELECT COUNT(*) FROM `models` WHERE (special_data is not null)
=> 41828

Like with length, this doesn’t work:

irb(main):023:0> Model.size
Traceback (most recent call last):
        1: from (irb):20
NoMethodError (undefined method `size' for #<Class:0x00007f810ed6ec28>)

Conclusion

The behavior of these methods isn’t all that surprising, but sometimes we can let our guard down in Ruby and think of methods as synonyms when they actually have distinct behaviors. This is especially risky if you are working in more than one language or framework and might otherwise gravitate toward a method such as length because it’s second nature elsewhere.


Getting a count of the rows in all of your ActiveRecord models

Warning: Going through ActiveRecord is likely not something you want to do in production unless you have a replica database that your script or your model can point at. See the second part for MySQL and PostgreSQL examples in SQL.

How do you enumerate your models?

For apps created under Rails 5 app or later, all of your ActiveRecord models should be descendants ApplicationRecord as well. If this is an option, you’ll be able to filter out some noise from descendants such as SchemaMigration, etc…

Using the descendants of ApplicationRecord you can map the statistics… You’ll need to catch errors on tables that might only be in production-like environments if you’re running locally:

counts = ApplicationRecord.descendants.map do |klass|
  [klass.name, klass.table_name, klass.count]
rescue Mysql2::Error, ActiveRecord::StatementInvalid # if certain tables aren't present in certain environments
  nil
end


=> []

Oops. I’m in development so… no eager loading of the models.


Rails.application.eager_load! # if in development

counts = ApplicationRecord.descendants.map do |klass|
  [klass.name, klass.table_name, klass.count]
rescue Mysql2::Error, ActiveRecord::StatementInvalid # if certain tables aren't present in certain environments
  nil
end

# lots of SELECT COUNT(*) logs if you're on :debug level logging
# followed by an array of name/table name/counts

You might also want to compact that result due to nils:


Rails.application.eager_load!
counts = ApplicationRecord.descendants.map do |klass|
  [klass.name, klass.table_name, klass.count]
rescue Mysql2::Error, ActiveRecord::StatementInvalid # if certain tables aren't present in certain environments
  nil
end

From there, you can export to CSV:

Rails.application.eager_load!

counts = ApplicationRecord.descendants.map do |klass|
  [klass.name, klass.table_name, klass.count]
rescue Mysql2::Error, ActiveRecord::StatementInvalid # if certain tables aren't present in certain environments
  nil
end

output_filename = "/tmp/counts.csv"

CSV.open(output_filename, "wt") do |csv|
  csv << ['class name', 'table name', 'row count']
  counts.compact.each do |row|
    csv << row
  end
end

What about from the database itself?

The below options are *significantly* faster than going through the Rails stack, so unless you need additional information from Rails, you should refine the below options instead (especially for production)

  • MySQL:
select table_name, sum(table_rows) from information_schema.tables where table_schema = 'app_dbname_development' group by table_name;
select table_schema, 
       table_name, 
       (xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count
from (
  select table_name, table_schema, 
         query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '') as xml_count
  from information_schema.tables
  where table_schema = 'public' --<< change here for the schema you want
) t order by 3 DESC


Rails 4, phantom ArgumentError: wrong number of arguments (0 for 1) on UserSession.find for AuthLogic

This was a total pain to locate, as the exceptions being raised were pointed to UserSession.find in Authlogic::Session::Persistence.

gems in question:

  activerecord-session_store 0.1.1
  authlogic 3.4.6
  activerecord 4.2.3
  rails 4.2.3

I noticed that the only meaningful difference between the two environments was the following instance variable in Rails.application.config:

irb> pp Rails.application.config
.
.
.
 @logger=
  #,
   @level=1>
.
.
.

Commenting out this initialization in the environment allowed the application to work again.

   config.logger = ActiveSupport::TaggedLogging.new(Syslog::Logger.new('name of log'))

Searching on “usersession syslog rails 4” results in the following issue as the third search result: Version 0.1.1 breaks Syslog::Logger setups There is a fix in master, but it doesn’t seem to have been published. I had promoted activerecord-session_store to 0.1.1 because DEPRECATION WARNING: `#quietly` is deprecated in rails-4.2.0.beta4, but it looks like locking the gem at 0.1 will work otherwise except for noisy tests.


has_many :through, scoped to a value on the other side of the :through relationship

Here goes another round of wrapping my head around what rails associations are looking for.

This time, I’m trying to scope a collection on the other side of a has_many :through to an attribute on that model.

Contrived exhibit A:

class Gym < ActiveRecord::Base
  # gym has a boolean "open"
end

class TrainerGym < ActiveRecord::Base
  has_one :trainer
  has_one :gym
end

class Trainer < ActiveRecord::Base 
  has_many :trainer_gyms
  has_many :gyms, through: :trainer_gyms
end

I want to create an association of `open_gyms` that my `Trainer` is connected to.

The end result looks like this for the `Trainer` model:

class Trainer < ActiveRecord::Base 
  has_many :trainer_gyms
  has_many :gyms, through: :trainer_gyms

  has_many :open_trainer_gyms ->{ includes(:gym).where(gyms: {open: true}) }, class_name: TrainerGym
  has_many :open_gyms, through: :open_trainer_gyms, source: :gym
end

The scope:

->{ includes(:gym).where(gyms:{open: true}) }

`includes` the association `gym` from `TrainerGym`. The `where` clause specifies the table `gyms` and the nested hash is the condition on `gyms` to match (`{open: true}`).

The `class_name: TrainerGym` tells the association `open_trainer_gyms` that it is a collection of the model `TrainerGym`, since that cannot be determined by reflection/inflection magic.

The `open_gyms` association piggy-backs the `open_trainer_gyms` association, but the `:through` relationship needs to be told that the `gym` association on `TrainerGym` is how it gets to the other side, since `open_gyms` can't be converted to `gym` automatically by the reflection and inflection magic.


Conditions on a has_many :through and ActiveRecord 4.1 enum

Say you have a set of models (model names sanitized for generic posting) where the join table in the middle of a has_many :through has an enum that you want to add a filtered association for.

class Student < ActiveRecord::Base
    has_many :student_grades
    has_many :grades, through: :student_grades
  end

  class StudentGrade < ActiveRecord::Base
    enum approval_status: [:not_reviewed, :rejected, :approved]
    belongs_to :student
    belongs_to :grade
  end

  class Grade
  end

I ended up creating two additional has_many associations on Student to accomplish this. I have to specify a condition on :approved_student_grades, and tell it what class name (StudentGrade) to point to because reflection through the association name won't work. I really dislike the verbosity of StudentGrade.approval_statuses[:approved] as a way to get the integer value of the enum (in this case, 2). For the :approved_grades association, I also have to repeat a similar process, except I have to specify both the class_name: and the source: due to reflection not working for :approved_grades.

class Student < ActiveRecord::Base
  has_many :student_grades
  has_many :grades, through: :student_grades
  has_many :approved_student_grades, -> {where approval_status: StudentGrade.approval_statuses[:approved] }, class_name: "StudentGrade"
  has_many :approved_grades, through: :approved_student_grades, class_name: "Grade", source: :grade
end

I'd be surprised if there wasn't a cleaner way to do this, but enum is new in Rails 4.1, so that may be part of my problem.


has_many :through, self referential models

Hat tip to [has_many :through self-referential example] that sent me down the correct path on this. The challenge I had went a little deeper than the aforementioned post, however. I’m going to reuse the friend example, but with a twist.

class Person < ActiveRecord::Base
  # id :integer
end
class Friendship < ActiveRecord::Base
  # source_friend_id :integer
  # target_friend_id :integer
end

In the above contrived example, Person is a friend who initiated the friendship (source_friend), and a friend that was sought for friendship (target_friend). For the friendship, reflection won’t work on either connection, so we need to specify the class_name:.

class Friendship < ActiveRecord::Base
  belongs_to :source_friend, class_name: "Person"
  belongs_to :target_friend, class_name: "Person"
end

Including the friendship would normally be easy. That’s has_many :friendships. However, there’s no explicit mapping back to Person for the Friendship. ActiveRecord will attempt to use friendships.person_id to no avail, so foreign_key must be specified to tell ActiveRecord how to map back the has_many relationship.

  class Person < ActiveRecord::Base
    has_many :friendships, foreign_key: :source_friend_id
  end

Now, the friends you’ve “made” can be mapped by putting the target_friend as the source for the has_many :through.

  class Person < ActiveRecord::Base
    has_many :friendships, foreign_key: :source_friend_id
    has_many :made_friends, through: :friendships, source: :target_friend
  end

Principle of Least Surprise (Astonishment), foreign keys, and Rails

In yesterday’s post, I sorted through the foreigner gem to figure out how to change the reference column (primary key) that a foreign_key maps to.

The problem here is that, unless your Rails project(s) has grown up referencing “natural” primary keys instead of the autoincremented id implicit in an ActiveRecord::Migration, avoiding creating surprise by not referring to an artificial primary key actually adds surprise: The instinct of someone reading your code will be to assume that the foreign key maps to the auto-id of the foreign model.


includes and has_many :through associations

I’ve been trying to mentally absorb how includes and associations work when trying to query a has_many :through association.

My understanding at this point:

The symbol in the includes() is the same as the association name on whatever model you have. For example:

class Account < ActiveRecord::Base
  has_many :account_destinations
  has_many :destinations, through: :account_destinations
end

class AccountDestination < ActiveRecord::Base
  has_one :destination
  # has an active_destination attribute
end

class Destination < ActiveRecord::Base
  # has several attributes that we might want to use to filter account_destinations on
end

The association used in includes for account_destinations.includes is :destination, because AccountDestination has_one :destination. Meanwhile, in the where clause, the hash key for specifying what values to match in Destination refers to the table name, or destinations.

class Account
.
.
  def some_method
    account_destinations.includes(:destination)
      .where
      .not(destinations: { name: %w(some list of names) } )
  end
end