Getting a Count of Occurrences of Items in a Ruby Array (and a Caveat for Rails)

I feel like I’m often wanting to count occurrences of items in an array (Rails has its own special case as well), and I’m always trying to do it the “long way.”

I finally stumbled upon this answer on StackOverflow that details the version-by-version options:

  • Ruby 2.7+ use .tally directly on the array:
irb(main):006:0> %i{a b c c d c e b a a a b d}.tally<br>=> {:a=>4, :b=>3, :c=>3, :d=>2, :e=>1}
irb(main):011:0> %i{a b c c d c e b a a a b d}.group_by(&:itself).transform_values(&:count)
=> {:a=>4, :b=>3, :c=>3, :d=>2, :e=>1}
irb(main):012:0> %i{a b c c d c e b a a a b d}.group_by(&:itself).map { |k,v| [k, v.length] }.to_h<br>=> {:a=>4, :b=>3, :c=>3, :d=>2, :e=>1}

The Rails Exception

It’s a pretty common temptation, especially once you start thinking in terms of the list of items you want to count, to try to use a pure Ruby solution for things. But what if your source is from the your database?

The key here is the database. You probably don’t want to load all of the records from the database just to count them using the above methods, and SQL has a GROUP BY clause which is just called .group.

irb(main):013:0> Entry.group(:user_id).count
D, [2021-08-26T02:49:43.996743 #4] DEBUG -- :    (1.2ms)  SELECT COUNT(*) AS count_all, "entries"."user_id" AS entries_user_id FROM "entries" GROUP BY "entries"."user_id"
=> {1=>231, 4=>15, 2=>2}

This output is tallying entries by what User (via user_id) entered them. More importantly, the SQL used did the counts within the database without retrieving any data contained into the application except what was counted. (This used to be a pun on the :what column in the entries table, but apparently we’re not there with proper rendering and cutting and pasting of emojis between apps and OSes and well, I enter emoji as part of my entries in this app.

This original example in extreme wide screen glory

Use find_each vs. select or each when needing to process each record via Ruby

The problem:

This scenario is a much more practical case of a similar concept with Rails / ActiveRecord count, size, and length. In this case, you’re needing to run ruby code off of either every value from a where method result or a subset that is determined by ruby code. For example, assume that a model Entry has a method after_sunrise (probably want a real sunrise/sunset gem but this is a simplistic example) as follows:

  # determine if a time included in the string description of the entry is after sunrise
  # nil (falsey) if not found
  def after_sunrise
    notated_time = what.match('\d+:?\d* ?[AP]M')
    return nil if notated_time.nil?

    (Time.parse([date.to_s, notated_time].join(' ')) > 
      Time.parse([date.to_s, "6:30 AM"].join(' ')))
  end

Using select for filtering will cause the result from where clause to be loaded all at once:

irb(main):178:0> entries = Entry.where(private: false).select { |entry| entry.after_sunrise } #.each { do something with each record here }
D, [2021-08-07T17:53:52.492261 #4] DEBUG -- :   Entry Load (10.4ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1  [["private", false]]
=>
[#<Entry:0x000055ddafc1b230

find_each for the batch

If you don’t have very many records, this isn’t a big deal, but if you have models with a lot of data per model instance and you have millions of rows, you risk running out of memory trying to load it all at once. Anytime you take the result of a query or association and try to switch to operating on it like an array of ruby objects, you force the lazy loading of ActiveRecord::Relation and ActiveRecord::Association::CollectionProxy to load those records.

If you switch to find_each you can load those records in batches:

irb(main):179:1* entries = Entry.where(private: false).find_each do |entry|
irb(main):180:1*   next unless entry.after_sunrise
irb(main):181:1*   # do something with each record here
irb(main):182:0> end
D, [2021-08-07T17:57:07.895360 #4] DEBUG -- :   Entry Load (5.1ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 ORDER BY "entries"."id" ASC LIMIT $2  [["private", false], ["LIMIT", 1000]]
=> nil

I only have 214 records in this example database, so the default batch size of 1000 makes this look almost identical to the original. However… you can use a keyword argument of batch_size: to tune the size of the batches pulled, in case your records are small and so more than 1000 records can be loaded at a time or they’re large and 1000 records is too much, or you have contrived example and want to show it actually batching:

irb(main):184:1* entries = Entry.where(private: false).find_each(batch_size: 2) do |entry|
irb(main):185:1*   next unless entry.after_sunrise
irb(main):186:1*   # do something with each record here
irb(main):187:0> end
D, [2021-08-07T17:59:22.339190 #4] DEBUG -- :   Entry Load (1.9ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 ORDER BY "entries"."id" ASC LIMIT $2  [["private", false], ["LIMIT", 2]]
D, [2021-08-07T17:59:22.344225 #4] DEBUG -- :   Entry Load (1.8ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 AND "entries"."id" > $2 ORDER BY "entries"."id" ASC LIMIT $3  [["private", false], ["id", 8], ["LIMIT", 2]]
D, [2021-08-07T17:59:22.347152 #4] DEBUG -- :   Entry Load (1.9ms)  SELECT "entries".* FROM "entries" WHERE "entries"."private" = $1 AND "entries"."id" > $2 ORDER BY "entries"."id" ASC LIMIT $3  [["private", false], ["id", 10], ["LIMIT", 2]]

Conclusion

The difference between chaining .each or .select off of a .where clause vs. chaining .find_each is something you won’t necessarily see the benefits of if you’re building something from the ground up and don’t have much data flowing through your application. You may even even have a successful launch until you grow an order of magnitude or so. That’s part of the challenge of recognizing the need for it.


Rails / ActiveRecord count, size, and length

When trying to be sensitive to n+1 queries and memory usage, knowing the differences between count, size, and length in ActiveRecord is important. It had been a while since I reviewed the usage, and I wanted to ensure that I hadn’t made some bad assumptions along the way that somehow stuck. The reality is that each method is pretty close to indicating what it will do, with size being the method that will load the data on (or for) you.

count

Back in the old days count was a more sophisticated member of ActiveRecord::Calculations::ClassMethods module. You could pass conditions to the method, or column names… basically a combination where and includes and joins.

The column/distinct counting moved to ActiveRecord::Calculations without all the extra conditionals, joins, and including. Note that you do not need a query to “count”:

irb(main):011:0> Model.count(:special_data) # count Model records with non-nil special_data
   (191.9ms)  SELECT COUNT(`models`.`special_data`) FROM `models`
=> 41828
irb(main):012:0> Model.distinct.count(:special_data) # count Model records with DISTINCT non-nil special_data
   (17.6ms)  SELECT COUNT(DISTINCT `models`.`special_data`) FROM `models`
=> 1909
irb(main):013:0> Model.count # count all records
   (3790.8ms)  SELECT COUNT(*) FROM `models`
=> 594383

If you’re just looking for a count of records for a query that has not been loaded, that’s now a member of ActiveRecord::Associations::CollectionProxy.

irb(main):015:0> Model.all.count
   (744.2ms)  SELECT COUNT(*) FROM `models`
=> 594383
irb(main):017:0> Model.where('special_data is not null').count
   (24.0ms)  SELECT COUNT(*) FROM `models` WHERE (special_data is not null)
=> 41828

length

length will load all of the records indicated by a collection, which might be useful if calling length on an association that you’re going to use the data from anyway, but not if you are throwing that data away. You’ll be wasting time (and memory) on the operation.

irb(main):018:0> Model.where('special_data is not null').length
  Model Load (647.9ms)  SELECT ...
.
.
.
=> 41828

You also can’t call length on a model’s class name, as it is not a collection itself:

irb(main):020:0> Model.length
Traceback (most recent call last):
        1: from (irb):20
NoMethodError (undefined method `length' for #<Class:0x00007f810ed6ec28>)

size

size also requires a collection, but does not attempt to load that collection, instead wrapping a COUNT around its query:

irb(main):022:0> Model.where('special_data is not null').count
   (22.8ms)  SELECT COUNT(*) FROM `models` WHERE (special_data is not null)
=> 41828

Like with length, this doesn’t work:

irb(main):023:0> Model.size
Traceback (most recent call last):
        1: from (irb):20
NoMethodError (undefined method `size' for #<Class:0x00007f810ed6ec28>)

Conclusion

The behavior of these methods isn’t all that surprising, but sometimes we can let our guard down in Ruby and think of methods as synonyms when they actually have distinct behaviors. This is especially risky if you are working in more than one language or framework and might otherwise gravitate toward a method such as length because it’s second nature elsewhere.


Getting a count of the rows in all of your ActiveRecord models

Warning: Going through ActiveRecord is likely not something you want to do in production unless you have a replica database that your script or your model can point at. See the second part for MySQL and PostgreSQL examples in SQL.

How do you enumerate your models?

For apps created under Rails 5 app or later, all of your ActiveRecord models should be descendants ApplicationRecord as well. If this is an option, you’ll be able to filter out some noise from descendants such as SchemaMigration, etc…

Using the descendants of ApplicationRecord you can map the statistics… You’ll need to catch errors on tables that might only be in production-like environments if you’re running locally:

counts = ApplicationRecord.descendants.map do |klass|
  [klass.name, klass.table_name, klass.count]
rescue Mysql2::Error, ActiveRecord::StatementInvalid # if certain tables aren't present in certain environments
  nil
end


=> []

Oops. I’m in development so… no eager loading of the models.


Rails.application.eager_load! # if in development

counts = ApplicationRecord.descendants.map do |klass|
  [klass.name, klass.table_name, klass.count]
rescue Mysql2::Error, ActiveRecord::StatementInvalid # if certain tables aren't present in certain environments
  nil
end

# lots of SELECT COUNT(*) logs if you're on :debug level logging
# followed by an array of name/table name/counts

You might also want to compact that result due to nils:


Rails.application.eager_load!
counts = ApplicationRecord.descendants.map do |klass|
  [klass.name, klass.table_name, klass.count]
rescue Mysql2::Error, ActiveRecord::StatementInvalid # if certain tables aren't present in certain environments
  nil
end

From there, you can export to CSV:

Rails.application.eager_load!

counts = ApplicationRecord.descendants.map do |klass|
  [klass.name, klass.table_name, klass.count]
rescue Mysql2::Error, ActiveRecord::StatementInvalid # if certain tables aren't present in certain environments
  nil
end

output_filename = "/tmp/counts.csv"

CSV.open(output_filename, "wt") do |csv|
  csv << ['class name', 'table name', 'row count']
  counts.compact.each do |row|
    csv << row
  end
end

What about from the database itself?

The below options are *significantly* faster than going through the Rails stack, so unless you need additional information from Rails, you should refine the below options instead (especially for production)

  • MySQL:
select table_name, sum(table_rows) from information_schema.tables where table_schema = 'app_dbname_development' group by table_name;
select table_schema, 
       table_name, 
       (xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count
from (
  select table_name, table_schema, 
         query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '') as xml_count
  from information_schema.tables
  where table_schema = 'public' --<< change here for the schema you want
) t order by 3 DESC


Rails’ and Ruby’s Hash transform_values

Rails: 4.2.1-5.2.3 and Ruby >= 2.5.5 have a transform_values method on Hash that allows you to pass a block to the method and transform the values of the key-value map in the hash based on the block contents. Essentially, it’s map but for the Hash values only, and with no weird Hash/array-element syntax. From the linked API documents above:

h = { a: 1, b: 2, c: 3 }
h.transform_values {|v| v * v + 1 }  #=> { a: 2, b: 5, c: 10 }
h.transform_values(&:to_s)           #=> { a: "1", b: "2", c: "3" }
h.transform_values.with_index {|v, i| "#{v}.#{i}" }
                                     #=> { a: "1.0", b: "2.1", c: "3.2" }


XML Handler Bypassing ParseError Handling Rack Middleware on Rails 6

In a project using actionpack-xml_parser, and implementing a modified version of Catching Invalid JSON Parse Errors with Rack Middleware to also handle XML errors, an upgrade to Rails 6 broke the handling of the ParseError. Interestingly enough, JSON was being handled as expected.

TL;DR: The short answer to this is that config/initializers/wrap_parameters.rb only had json mentioned as a format. This can be remedied in the initializer or on a per controller basis. The entire app fix was just adding :xml to the initializer:

ActiveSupport.on_load(:action_controller) do
  wrap_parameters format: [:json, :xml]
end

In looking through the stack traces/call stack for where the parse error occurs (ActionDispatch::Http::Parameters#parse_formatted_parameters) using the caller array in the gem source and backtrace in the rescue, I was able to catch that ActionController::Metal::ParamsWrapper#_wrapper_enabled? was called for JSON, but not for XML. Ultimately, I was able to track down the difference to ActionController::Metal::ParamsWrapper#_wrapper_formats?, which was only returning [:json], which led to the wrap_parameters functionality in Rails.


rspec-rails tricks: Hacking Your Routes for a Spec

I don’t know that this is so much a reminder of “What can be done with rspec-rails” as a note that, “If you do this will rspec-rails, you will also need to undo it.”

Today’s note: If you hack the application routes for a controller test, you have to reload routes.

The original problem

This all came down to a spec that was attempting to test a method in ApplicationController. In our Rails 3 setup (rspec-core and rspec-rails 3.7.0), the tests only needed a monkey patched ApplicationController to happen prior to the test:

before do
  class ApplicationController
    def dummy_action
      dummy_method_actually_under_test
    end
  end
end

it 'example do test' do
  get :dummy_action #etc...
  # validate dummy_method_actually_under_test did the thing
end

The fix

In Rails 4 with the same gem versions, we’d end up with the “No route matches…” which could be remedied by redefining the routes.

before do
  class ApplicationController
    def dummy_action
      dummy_method_actually_under_test
    end
  end
  Rails.application.routes.draw do
    get 'application/dummy_action/:id', to: 'application#dummy_action'
  end
end

Great! The test passes now! (Insert philosophical argument about whether the person writing the test should have tested a method in this way.)

Or not…

..until you run the rest of the suite, of course. Now *every* subsequent controller test has a “No route matches” issue. (Insert philosophical argument about whether you should be writing controller tests.)

This should serve as a periodic reminder to clean up after your tests, which in this case is:

after do
  Rails.application.reload_routes!
end

Multiple Postgres Schemas and Rails db:test:prepare

In our Rails databases, we use multiple Postgres schemas. This is partly for partitioning archive data from fresh data. On a new Rails 5 project, I started having tests fail once the models had the “archivable” concern added, which depended on an archive schema existing.

Ultimately, the problem is that the archive schema wasn’t being created in the test database, because rake db:test:prepare performs a db:test:load which was loading the db/schema.rb, which is completely ignorant of the Postgres-specific schema concept.

In order to fix this without attempting ugly hacks, the simplest solution was just to switch to using structure.sql for the Rails schema dump.

In config/application.rb, insert the following in your Application class definition:


config.active_record.schema_format = :sql

If you haven’t already added the additional schema(s) to your schema_search_path in your config/database.yml.


database: whatever_db
schema_search_path: 'main,main_archive,public'

You’ll need to run rake db:migrate to force structure.sql generation before running tests again. Also, be sure to switch from schema.rb to structure.sql if you’re committing this file.


RailsSettings vs parallel_tests

RailsSetting stubbing problem

I have an area of functionality that is controlled by a Settings class that is a subclass of RailsSettings::CachedSettings (rails-settings-cached gem) and changing that setting for tests was interfering with the parallel_tests gem when I ran my rspec tests.

I tried namespacing the cache, making TTL 0, invalidating the cache, monkeypatching the lookup… Every convoluted solution possible… to try and get tests to play nicely together.

The humbling thing about programming is how simple actual solutions are compared the things the brain comes up with.

The solution? Wrap the access to the Setting in a layer of abstraction (which I had partially done already.)

class ThatFunctionality
  def self.enabled?
    Setting['that_functionality.enabled']
  end
  def self.enabled=(value)
    Setting['that_functionality.enabled']=value
  end

And then in tests:

before do
  allow(ThatFunctionality).to receive(:enabled?).and_return(true) # or false
end

Trying again to stub RailsSettings

The class method approach is explicit, but once a large set of settings is needed, it can get unwieldy. If you haven’t found a better strategy for wrapping settings values, you can go back to stubbing the :[] method. The caveat is that you need to save the original Settings behavior for any settings that you want to leave unchanged. By stubbing the :[] first to passthrough to .and_call_original, you can add additional stubs on top.

# a module included
module StubSettings
def stub_setting(setting_name, setting_value)
allow(Setting).to receive(:[]).with(setting_name).and_return(setting_value)
end
def stub_setting_assignment
allow(Setting).to receive(:[]=) { |setting, value| stub_setting(setting, setting) }
end
end
# in spec_helper.rb
config.before(:each) do
# detect if Settings being set without you knowing about it
allow(Setting).to receive(:[]=)
.and_raise("stub_setting(setting_name, setting_value) to set the value of a Setting instead")
# passthrough read-only settings calls (necessary if you don't want .with to block other settings
allow(Setting).to receive(:[]).and_call_original
end
# in rspec example
describe 'settings will be set in code invoked from here' do
# automatically stubs assigned value to new stub
before { stub_setting_assignment }
end
# stubbing a setting value
describe 'setting to be forced to a value' do
before { stub_setting('some_toggle_setting', true) }
end

view raw
stub_settings.rb
hosted with ❤ by GitHub


Rails 4, phantom ArgumentError: wrong number of arguments (0 for 1) on UserSession.find for AuthLogic

This was a total pain to locate, as the exceptions being raised were pointed to UserSession.find in Authlogic::Session::Persistence.

gems in question:

  activerecord-session_store 0.1.1
  authlogic 3.4.6
  activerecord 4.2.3
  rails 4.2.3

I noticed that the only meaningful difference between the two environments was the following instance variable in Rails.application.config:

irb> pp Rails.application.config
.
.
.
 @logger=
  #,
   @level=1>
.
.
.

Commenting out this initialization in the environment allowed the application to work again.

   config.logger = ActiveSupport::TaggedLogging.new(Syslog::Logger.new('name of log'))

Searching on “usersession syslog rails 4” results in the following issue as the third search result: Version 0.1.1 breaks Syslog::Logger setups There is a fix in master, but it doesn’t seem to have been published. I had promoted activerecord-session_store to 0.1.1 because DEPRECATION WARNING: `#quietly` is deprecated in rails-4.2.0.beta4, but it looks like locking the gem at 0.1 will work otherwise except for noisy tests.