Aug 13, 2015 Tags: programming, ruby
As a way to better familiarize myself with Ruby’s best practices for gem creation, I recently published a fairly trivial gem: ruby-upworthy. (Many thanks to Mike Lacher for letting me translate his Upworthy Generator to do it.)
Normally I wouldn’t make a blog post on something as everyday as learning a new tool, but creating ruby-upworthy made me aware of a neat little feature of Ruby as well as its shortcomings and a potential solution to them.
Like many interpreted languages, Ruby has string interpolation:
>> name = "William"
# => "William"
>> "My name is #{name}!"
# => "My name is William!"
This should be obvious, but interpolation doesn’t play well with undefined variables:
>> "My favorite food is #{food}."
# => NameError: undefined local variable or method `food' for main:Object
To make up for this, like other languages, Ruby has a form of string templating aided by the % operator (which is really Kernel::sprintf). Aside from normal C-like sprintf formatting, Ruby’s sprintf can also format a hash into a string:
>> fmt = { name: "William" }
# => Hash
>> "My name is %{name}!" % fmt
# => "My name is William!"
This is really handy and conceptually fits snugly with ruby-upworthy, which needs to convert template strings like "What happens when one %{actoradjective} %{actor} %{action}" into real titles.
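As a small sketch of what that conversion looks like with a plain hash: the values below are made up for illustration and are not taken from the gem's real data. The second half shows what happens when a plain hash (one with no default) is missing a key the template asks for.

```ruby
template = "What happens when one %{actoradjective} %{actor} %{action}"

# Illustrative values only; the real gem draws these from larger lists.
values = {
  actoradjective: "brave",
  actor: "physicist",
  action: "speaks up",
}

title = template % values
puts title

# A plain hash that lacks some of the referenced keys makes formatting fail.
begin
  template % { actor: "physicist" }
rescue KeyError => e
  puts "KeyError: #{e.message}"
end
```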
Naturally, however, there’s a snag. ruby-upworthy wouldn’t be much fun if each template were only ever filled in with one value (i.e. if there were only one actor, action, etc.), so each key needs to be formatted with a random value. This poses a problem, as the content hash maps each key to an array of possible values:
CONTENT = {
  # ...
  actor: [
    "new mom",
    "seventh-grader",
    "physicist",
    "bullied teen",
    "science teacher",
    "eight year-old",
    "man",
    # ...
  ],
  # ...
}
Therefore, in order to produce reasonable results, the CONTENT hash needs to be manipulated or accessed in some way such that a key lookup returns a random element from the value array.
My first thought on that front was to use Ruby’s dynamic hash functionality, which is pretty awesome. Dynamic hashes are Hash objects just like normal Ruby hashes, except that they are initialized with a block (stored as a Proc) that is evaluated lazily whenever a lookup misses, not unlike Haskell’s approach to data structure population. Unless the block stores its result, the hash contains no actual data. This means we can do cool things like this:
>> squares = Hash.new { |_, k| k * k }
# => Hash
>> squares[4]
# => 16
>> squares[10]
# => 100
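One detail is worth spelling out, since it matters later: a lookup through [] runs the block, but nothing is stored unless the block assigns it, and methods such as key? and fetch never consult the block at all. A minimal sketch (memo is just a name chosen for illustration):

```ruby
squares = Hash.new { |_, k| k * k }

p squares[4]                 # computed by the block on the fly
p squares.key?(4)            # false: the result was never stored
p squares.size               # 0
p squares.fetch(4, :missing) # :missing, because fetch skips the default block

# Assigning inside the block stores each value the first time it is computed.
memo = Hash.new { |hash, k| hash[k] = k * k }
p memo[4]
p memo.key?(4)               # true
```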
Naturally, I thought I could apply this kind of functionality to my own problem. Since CONTENT needs to be “flattened” from a Symbol -> Array association to a Symbol -> String (from Array) association, the dynamic hash I created looked like this:
CONTENT_ARRS = {
  # the original data arrays...
}
CONTENT = Hash.new { |_, k| CONTENT_ARRS[k].sample }
Boom! It works:
>> CONTENT[:actor]
# => "science teacher"
>> CONTENT[:actor]
# => "man"
>> CONTENT[:actor]
# => "physicist"
…or so I thought.
This approach was looking really great, until I actually tried to use it to format one of the template strings:
>> "That moment when a %{actor} %{action}" % CONTENT
# => KeyError: key{actor} not found
Uh oh.
After tearing my hair out for a few minutes, I decided to settle for a significantly uglier solution* and get some opinions from the fine folks in Freenode’s #ruby. They tracked Kernel::sprintf’s hash formatting behavior down fairly quickly:
sprintf.c:
if (sym != Qnil) nextvalue = rb_hash_lookup2(hash, sym, Qundef);
if (nextvalue == Qundef) {
    rb_enc_raise(enc, rb_eKeyError, "key%.*s not found", len, start);
}
and hash.c:
VALUE
rb_hash_lookup2(VALUE hash, VALUE key, VALUE def)
{
    st_data_t val;
    if (!RHASH(hash)->ntbl || !st_lookup(RHASH(hash)->ntbl, key, &val)) {
        return def; /* without Hash#default */
    }
    return (VALUE)val;
}
The choice of rb_hash_lookup2 in Kernel::sprintf is the cause of this behavior, as only rb_hash_aref retrieves the default value (or Proc) associated with the hash:
VALUE
rb_hash_aref(VALUE hash, VALUE key)
{
    st_data_t val;
    if (!RHASH(hash)->ntbl || !st_lookup(RHASH(hash)->ntbl, key, &val)) {
        return hash_default_value(hash, key);
    }
    return (VALUE)val;
}
Therefore, simply changing the hash lookup logic in sprintf.c to use rb_hash_aref first and fall back on rb_hash_lookup2 should allow dynamic hashes to be used to format strings. Things aren’t actually that simple (due in no small part to sprintf.c’s absurd complexity and heavy-handed use of state-modifying macros), but it’s a start.
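The same idea can be roughly approximated from the Ruby side, without touching sprintf.c, by resolving every key through Hash#[] before formatting. The sketch below is only an illustration: format_with_defaults is a hypothetical helper name (not part of Ruby or ruby-upworthy), and its regex only recognizes simple %{name} references, not %<name>s or escaped percent signs.

```ruby
# Hypothetical helper: look up each %{key} named in the template via Hash#[],
# which honors a default block, then format with the resulting plain hash.
def format_with_defaults(template, hash)
  keys = template.scan(/%\{(\w+)\}/).flatten.map(&:to_sym).uniq
  resolved = keys.map { |key| [key, hash[key]] }.to_h
  template % resolved
end

# A dynamic hash whose default block just upcases the key name.
shouty = Hash.new { |_, key| key.to_s.upcase }

puts format_with_defaults("That moment when a %{actor} %{action}", shouty)
```

Because each distinct key is resolved once, a key that appears twice in a template gets the same value both times.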
For the time being, ruby-upworthy will continue to use a workaround solution. I’ve sent an email out to the ruby-core mailing list in the hopes of attracting some attention to this problem, but I’m not too optimistic about these changes in behavior being made. Based on the explicit usage of rb_hash_lookup2 in sprintf.c, it’s very possible that Matz and the rest of the MRI maintainers are aware of this use case and have deliberately prevented it because of implementation complexities or the potential overhead involved in doing a second lookup.
While very powerful in a case like mine, the use of dynamic hashes for string formatting also muddies the concept of the missing key, as a hash with a Proc or default value is by definition “complete” for every possible key. As such, it would be pretty easy to shoot yourself in the foot with a dynamic hash while formatting and end up with a string full of garbage.
In summary, Ruby’s dynamic hashes are really cool and have a lot of potential power, but are limited in application (possibly on purpose, to boot). This leaves them in a strange state of limbo where some things work flawlessly with them, while others don’t work at all (like Kernel::sprintf). More uniform support for their features is something that the MRI maintainers should definitely consider for future releases, even if it means clouding the distinction between key presence and key availability via a default value or Proc.
- William
Afternotes:
* My workaround was to map the CONTENT hash as (key, value) pairs, creating array groups of [key, value.sample] that are then fed into Hash[] to construct the flattened associations. Compared to a dynamic hash, this wastes an extraordinary amount of memory (creating array pairs and an entire new hash) and is much more difficult to read.
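For reference, that workaround amounts to something like the following sketch. The content hash here is a cut-down stand-in (the action values are invented for the example), and random_title is a hypothetical helper name rather than the gem's actual API.

```ruby
content = {
  actor: ["new mom", "seventh-grader", "physicist"],
  # The action values are made up for this example.
  action: ["learns to code", "plants a tree"],
}

# Build a fresh Symbol -> String hash by sampling one value per key,
# then use it to fill in the template.
def random_title(template, content)
  flattened = Hash[content.map { |key, values| [key, values.sample] }]
  template % flattened
end

puts random_title("That moment when a %{actor} %{action}", content)
```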