ENOSUCHBLOG

Programming, philosophy, pedaling.


String Formatting With Ruby Dynamic Hashes

Aug 13, 2015     Tags: programming, ruby    

This post is at least a year old.

As a way to better familiarize myself with Ruby’s best practices for gem creation, I recently published a fairly trivial gem: ruby-upworthy (many thanks to Mike Lacher for letting me translate his Upworthy Generator to do it).

Normally I wouldn’t make a blog post on something as everyday as learning a new tool, but creating ruby-upworthy made me aware of a neat little feature of Ruby as well as its shortcomings and a potential solution to them.

Like many interpreted languages, Ruby has string interpolation:

>> name = "William"
# => "William"
>> "My name is #{name}!"
# => "My name is William!"

This should be obvious, but interpolation doesn’t play well with undefined variables:

>> "My favorite food is #{food}."
# => NameError: undefined local variable or method `food' for main:Object

To make up for this, Ruby (like other languages) has a form of string templating via the % operator (String#%, which is really Kernel::sprintf under the hood). Aside from normal C-like sprintf formatting, Ruby’s sprintf can also fill named placeholders in a string from a hash:

>> fmt = { name: "William" }
# => Hash
>> "My name is %{name}!" % fmt
# => "My name is William!"
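
For reference, here is a small sketch of the two named-placeholder styles that format strings accept: %{name} drops the value in as-is, while %<name> followed by a conversion applies ordinary sprintf formatting to it. The hash contents are made up for illustration.

```ruby
# Hypothetical values, for illustration only.
person = { name: "William", score: 3.14159 }

# %{key} inserts the value's string form directly.
plain = "My name is %{name}!" % person

# %<key>CONVERSION applies a regular sprintf conversion to the value.
rounded = format("%<name>s scored %<score>.2f", person)

puts plain    # My name is William!
puts rounded  # William scored 3.14
```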

This is really handy and conceptually fits snugly with ruby-upworthy, which needs to convert template strings like "What happens when one %{actoradjective} %{actor} %{action}" into real titles.

Naturally, however, there’s a snag. Since ruby-upworthy wouldn’t be much fun if each template was only filled in with one value (i.e. if there was only one actor, action, etc), it’s necessary to format the string with random values for each key. This poses a problem, as the content hash maps each key to an array of possible values rather than to a single string:

CONTENT = {
	# ...

	actor: [
		"new mom",
		"seventh-grader",
		"physicist",
		"bullied teen",
		"science teacher",
		"eight year-old",
		"man",
		# ...
	],

	# ...
}

Therefore, in order to produce reasonable results, the CONTENT hash needs to be manipulated or accessed in some way such that a key lookup returns a random element from the value array.

My first thought on that front was to use Ruby’s dynamic hash functionality, which is pretty awesome. Dynamic hashes are Hash objects just like normal Ruby hashes, except that they are initialized with a block (kept as a Proc) that is evaluated lazily whenever a missing key is looked up, not unlike Haskell’s approach to data structure population. Unless the block stores its result, the hash never needs to contain any actual data. This means we can do cool things like this:

>> squares = Hash.new { |_, k| k * k }
# => Hash
>> squares[4]
# => 16
>> squares[10]
# => 100
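
As a quick sketch of what "no actual data" means in practice, the snippet below counts how often the block runs and compares it with a variant whose block stores its result in the hash. The counter and the cached variant are additions for illustration only.

```ruby
call_count = 0

# Non-storing default block: recomputed on every lookup, hash stays empty.
squares = Hash.new do |_, k|
  call_count += 1
  k * k
end

squares[4]
squares[4]
puts call_count      # 2
puts squares.size    # 0

# Storing variant: the block assigns into the hash, so later lookups
# of the same key are ordinary hits and never reach the block again.
cached = Hash.new { |hash, k| hash[k] = k * k }
cached[4]
puts cached.size     # 1
puts cached.key?(4)  # true
```

For picking random values, the non-storing form is the one that matters: a stored result would be returned unchanged on every later lookup.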

Naturally, I thought I could apply this kind of functionality to my own problem. Since CONTENT needs to be “flattened” from a Symbol -> Array association to a Symbol -> String (from Array) association, the dynamic hash I created looked like this:

CONTENT_ARRS = {
	# the original data arrays...
}

CONTENT = Hash.new { |_, k| CONTENT_ARRS[k].sample }

Boom! It works:

>> CONTENT[:actor]
# => "science teacher"
>> CONTENT[:actor]
# => "man"
>> CONTENT[:actor]
# => "physicist"

…or so I thought.

The “gotcha!”

This approach was looking really great, until I actually tried to use it to format one of the template strings:

>> "That moment when a %{actor} %{action}" % CONTENT
# => KeyError: key{actor} not found

Uh oh.

After tearing my hair out for a few minutes, I decided to settle for a significantly uglier solution* and get some opinions from the fine folks in Freenode’s #ruby. They tracked Kernel::sprintf’s hash formatting behavior down fairly quickly:

sprintf.c:

if (sym != Qnil) nextvalue = rb_hash_lookup2(hash, sym, Qundef);
if (nextvalue == Qundef) {
    rb_enc_raise(enc, rb_eKeyError, "key%.*s not found", len, start);
}

and hash.c:

VALUE
rb_hash_lookup2(VALUE hash, VALUE key, VALUE def)
{
    st_data_t val;

    if (!RHASH(hash)->ntbl || !st_lookup(RHASH(hash)->ntbl, key, &val)) {
		return def; /* without Hash#default */
    }
    return (VALUE)val;
}

The choice of rb_hash_lookup2 in Kernel::sprintf is the cause of this behavior, as only rb_hash_aref retrieves the default value (or Proc) associated with the hash:

VALUE
rb_hash_aref(VALUE hash, VALUE key)
{
    st_data_t val;

    if (!RHASH(hash)->ntbl || !st_lookup(RHASH(hash)->ntbl, key, &val)) {
		return hash_default_value(hash, key);
    }
    return (VALUE)val;
}
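
The same split is visible from Ruby itself. In this sketch (using a made-up hash), Hash#[] consults the default block, while Hash#fetch and Hash#key? only look at what is actually stored, which is roughly the view of the hash that rb_hash_lookup2 provides.

```ruby
# Hypothetical dynamic hash: every lookup is answered by the block.
shouty = Hash.new { |_, k| k.to_s.upcase }

via_aref = shouty[:actor]    # goes through the default block

via_fetch =
  begin
    shouty.fetch(:actor)     # ignores the default block entirely
  rescue KeyError => e
    e.class
  end

puts via_aref                # ACTOR
puts via_fetch               # KeyError
puts shouty.key?(:actor)     # false
```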

Therefore, simply changing the hash lookup logic in sprintf.c to try rb_hash_aref first and fall back on rb_hash_lookup2 should allow dynamic hashes to be used to format strings. Things aren’t actually that simple (due in no small part to sprintf.c’s absurd complexity and heavy-handed use of state-modifying macros), but it’s a start.
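
Short of patching MRI, one way to get the same effect from plain Ruby is to skip sprintf and substitute the placeholders by hand, so that every lookup goes through Hash#[]. This is only a sketch and not what ruby-upworthy does: the sample data and the fill_template helper are made up for illustration, and it only understands simple %{name} placeholders.

```ruby
# Hypothetical sample data standing in for the real arrays.
content_arrs = {
  actor:  ["physicist", "science teacher"],
  action: ["stands up to a bully", "builds a robot"]
}

# Dynamic hash: each lookup samples a random entry for the key.
# fetch keeps unknown keys an error instead of calling nil.sample.
content = Hash.new { |_, k| content_arrs.fetch(k).sample }

# Hypothetical helper: replaces each %{key} via Hash#[], which does
# consult the default block. Only handles plain %{word} placeholders.
def fill_template(template, values)
  template.gsub(/%\{(\w+)\}/) { values[Regexp.last_match(1).to_sym] }
end

title = fill_template("That moment when a %{actor} %{action}", content)
puts title
```

Because the default block uses fetch on the underlying arrays, a placeholder with no matching key still raises a KeyError rather than silently producing garbage.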

Wrapup

For the time being, ruby-upworthy will continue to use a workaround solution.

I’ve sent an email out to the ruby-core mailing list in the hopes of attracting some attention to this problem, but I’m not too optimistic about these changes in behavior being made. Based on the explicit usage of rb_hash_lookup2 in sprintf.c, it’s very possible that Matz and the rest of the MRI maintainers are aware of this use case and have explicitly prevented it because of implementation complexities or the potential overhead involved in doing a second lookup.

While very powerful in a case like mine, the use of dynamic hashes for string formatting also muddies the concept of the missing key, as a hash with a Proc or default value is by definition “complete” for every possible key. As such, it would be pretty easy to shoot yourself in the foot with a dynamic hash while formatting and end up with a string full of garbage.

In summary, Ruby’s dynamic hashes are really cool and have a lot of potential power, but are limited in application (possibly on purpose, to boot). This leaves them in a strange state of limbo where some things work flawlessly with them, while others don’t work at all (like Kernel::sprintf). More uniform support for their features is something that the MRI maintainers should definitely consider for future releases, even if it means clouding the distinction between key presence and key availability via a default value or Proc.

- William

Afternotes:

* My workaround was to map the CONTENT hash as (key, value) pairs, creating array groups of [key, value.sample] that are then fed into Hash[] to construct the flattened associations. Compared to a dynamic hash, this wastes an extraordinary amount of memory (creating array pairs and an entire new hash) and is much more difficult to read.
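
A rough reconstruction of that workaround, based only on the description above rather than on the gem’s actual code (the sample data is made up for illustration):

```ruby
# Hypothetical sample data standing in for the real CONTENT arrays.
content = {
  actor:  ["physicist", "science teacher"],
  action: ["stands up to a bully", "builds a robot"]
}

# Build [key, random_value] pairs, then feed them into Hash[] to get
# a plain hash with exactly one string per key.
flattened = Hash[content.map { |key, values| [key, values.sample] }]

# A plain hash has real entries, so sprintf-style formatting finds them.
title = "That moment when a %{actor} %{action}" % flattened
puts title
```

The flattened hash has to be rebuilt for every title, since each one fixes a single random choice per key.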