09/12 2013

A Nicer Syntax for Nicer Search

Riak’s Yokozuna (or “yz” for short) full-text search feature has been a long time coming. The first commit by Ryan Zezeski is from June of 2012, and we’ll get it released with Riak 2.0 this year. Yokozuna integrates Apache Solr full-text search with Riak.

For the last few months of my two years (as of today!) at Basho, I’ve been on the Clients team, building the Riak Ruby client, which allows Ruby applications to store and retrieve data from Riak. As part of this, we’re getting the clients ready for Riak 2.0, which includes Yokozuna support.

The Competition

Rails 3 included the Active Record Query Interface, which makes writing complicated queries and getting objects back easy:

party.users.where(given_name: given_name).count

Instead of requiring the programmer to know the syntax of both Ruby and SQL, this query interface does a pretty good job of making the SQL semantics make sense in Ruby.

Solr is Not SQL and Riak is Not SQL

The Apache Solr full-text search engine is a database with a query language. It’s distinct from SQL: instead of describing and manipulating relations like SQL, it’s completely built around querying a collection of documents.

bryce:"hi" OR bryce:"hello"

Yokozuna is built on Solr. Saving a Riak object in a yz-indexed bucket triggers the indexing process, which parses the document, finds keywords, and puts them in an index. Depending on the document, it may have multiple fields to index. For example, a JSON document containing a hash has lots of fields, an XML document is full of fields, and a plain text document has one big field.

Querying for values in the index returns a list of matches, which include the Riak bucket and key where the document is stored: _yz_rb and _yz_rk for “Riak Bucket” and “Riak Key,” respectively. With that result set, the developer can then load the keys at their leisure.
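As a rough sketch of that last step, here is how the bucket and key pairs could be pulled out of a parsed result set. It assumes the matches arrive in the usual Solr JSON shape (a "response" object holding a "docs" array); the helper name sketch_riak_locations, the bucket name, and the sample document are made up for illustration, and nothing here talks to a real Riak node.

```ruby
require 'json'

# Hypothetical helper: collect [bucket, key] pairs from a parsed
# Solr-style response so the objects can be fetched later.
def sketch_riak_locations(solr_response)
  solr_response.fetch('response').fetch('docs').map do |doc|
    [doc.fetch('_yz_rb'), doc.fetch('_yz_rk')]
  end
end

# Made-up sample data in the assumed response shape.
sample_json = '{"response":{"numFound":1,"start":0,"docs":[' \
              '{"_yz_rb":"people","_yz_rk":"OL1quOfOKiYEmxYsqvjf9cyRmH3"}]}}'

sketch_riak_locations(JSON.parse(sample_json))
#=> [["people", "OL1quOfOKiYEmxYsqvjf9cyRmH3"]]
```

Each pair is enough to look the object up in Riak whenever the application actually needs it.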

Ruby is Not Solr

Putting Solr/Lucene query strings together is almost as awful as putting SQL together, and almost as dangerous too: Solr injection doesn’t have as many dangers as SQL injection, but they’re still there. Fortunately, we have a known-nice interface for building queries, and we can take what we need:

q = @bucket.query.
        where(name_t: '*e*').
        order(created_dt: 'asc').
        limit(5).
        offset(5)

This is implemented in the Riak Yokozuna query Ruby gem I put together on Sunday. It adds a query method to Riak::Bucket objects, which returns a Riak::YzQuery::QueryBuilder instance, on which the rest of the query expression is built.

A nice part about Solr is that the query itself only selects matching index entries (like SQL’s WHERE), while sorting (ORDER BY), row selection, and other query features are specified separately. This means that we can generally keep our concerns separate, and nothing’s really stopping us from building the “where” and “order” clauses as we go.

A QueryBuilder instance provides the four query methods seen above, and “chains” them so that they return a new QueryBuilder that represents the original one modified by the new query data:

q = @bucket.query.where name_t: 'Andrew'
q.keys #=> ["PtgA5YsxWpSg7RzTY2eJVJ81hDQ", "OL1quOfOKiYEmxYsqvjf9cyRmH3"]
q2 = q.where name_t: 'Stone'
q2.keys #=> ["OL1quOfOKiYEmxYsqvjf9cyRmH3"]

QueryBuilder instances also have keys and values methods: keys performs the query and returns the matching keys, and values returns the values stored under those keys.

Limit and Offset

These are just integers, so the QueryBuilder simply chains off a new instance with these new values.
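A minimal sketch of that “chain off a new instance” idea, assuming nothing about the gem’s internals: SketchQueryBuilder and its attribute names are hypothetical stand-ins, not the real Riak::YzQuery::QueryBuilder.

```ruby
# Hypothetical stand-in for a chaining query builder.
class SketchQueryBuilder
  attr_reader :limit_value, :offset_value

  def initialize(limit_value: nil, offset_value: nil)
    @limit_value = limit_value
    @offset_value = offset_value
  end

  # Each method returns a modified copy and leaves the receiver alone.
  def limit(count)
    self.class.new(limit_value: Integer(count), offset_value: offset_value)
  end

  def offset(count)
    self.class.new(limit_value: limit_value, offset_value: Integer(count))
  end
end

base = SketchQueryBuilder.new
page_two = base.limit(5).offset(5)

base.limit_value      #=> nil
page_two.limit_value  #=> 5
page_two.offset_value #=> 5
```

Because no call mutates its receiver, a base query can be kept around and paginated several different ways without the pages interfering with each other.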

Order

A Solr order clause is very similar to the ORDER BY clause in SQL. To order by a document’s “name” field ascending, and “created_at” descending when two documents have the same name, the clause in both SQL and Solr is “name asc, created_at desc”. The gem allows you to build this several ways:

q = @bucket.query.where(name_t: 'Andrew').order(created_dt: 'asc').keys
#=> ["OL1quOfOKiYEmxYsqvjf9cyRmH3", "PtgA5YsxWpSg7RzTY2eJVJ81hDQ"]
q = @bucket.query.where(name_t: 'Andrew').order('created_dt desc').keys
#=> ["PtgA5YsxWpSg7RzTY2eJVJ81hDQ", "OL1quOfOKiYEmxYsqvjf9cyRmH3"]
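One way to accept both forms is to normalize whatever order receives into a single sort string. This is only a sketch of the idea; sketch_order_clause is a hypothetical helper and not the gem’s actual code.

```ruby
# Hypothetical helper: turn the argument given to `order` into a
# "field direction, field direction" string.
def sketch_order_clause(order)
  case order
  when Hash
    # Ruby hashes keep insertion order, so the first pair sorts first.
    order.map { |field, direction| "#{field} #{direction}" }.join(', ')
  else
    # Strings (and anything else) are passed through as written.
    order.to_s
  end
end

sketch_order_clause({ name_t: 'asc', created_dt: 'desc' })
#=> "name_t asc, created_dt desc"
sketch_order_clause('created_dt desc')
#=> "created_dt desc"
```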

Where

This is where the complexity is: I wanted to build something as flexible as Active Record’s where clauses, one that would take hashes for equality matches, arrays for parameter escaping, and strings for more complicated querying. I also wanted to support ranges, times, and numbers in addition to strings. As a result, the WhereClause class is as complicated as the QueryBuilder.

When where clauses are created with hashes alone, handling them is fairly simple: just keep merging the hashes together, and at the end, join the pairs with “AND” and send that to Yokozuna.
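The hash-only case might look roughly like the following. The helper names and the title_t field are hypothetical, and values are interpolated without any escaping to keep the sketch short.

```ruby
# Hypothetical helpers, not the gem's WhereClause.
def sketch_merge_where(existing, added)
  existing.merge(added)
end

def sketch_hash_to_query(conditions)
  conditions.map { |field, value| "#{field}:#{value}" }.join(' AND ')
end

merged = sketch_merge_where({ name_t: 'Andrew' }, { title_t: 'engineer' })
sketch_hash_to_query(merged)
#=> "name_t:Andrew AND title_t:engineer"
```

One consequence of a plain Hash#merge is that a later value for the same field replaces the earlier one, so merging { name_t: 'Stone' } into { name_t: 'Andrew' } leaves only the 'Stone' condition. Whether the gem keeps both conditions in that case is not something this sketch settles.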

For strings (and arrays, which I treat like strings after deparameterizing them) I didn’t want to deal with parsing them back out to hashes, so I just parenthesize them and AND them together. If I’m joining a string and a hash, the hash gets converted to a string, and everything follows from there.
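Sketched out, the string path could look something like this. Everything here is an assumption for illustration: the helper names, the “?” placeholder style borrowed from Active Record’s array conditions, and the choice to escape values by wrapping them in a quoted phrase.

```ruby
# Hypothetical: quote a value as a phrase, escaping the two
# characters that matter inside double quotes.
def sketch_quote(value)
  escaped = value.to_s.gsub(/(["\\])/) { "\\#{$1}" }
  %("#{escaped}")
end

# Hypothetical: fill "?" placeholders with quoted values, in order.
def sketch_deparameterize(template, *values)
  remaining = values.dup
  template.gsub('?') { sketch_quote(remaining.shift) }
end

# Hashes become strings first; strings pass through untouched.
def sketch_to_clause(condition)
  case condition
  when Hash
    condition.map { |field, value| "#{field}:#{value}" }.join(' AND ')
  else
    condition.to_s
  end
end

# Parenthesize each side, then AND them together.
def sketch_and(left, right)
  [left, right].map { |c| "(#{sketch_to_clause(c)})" }.join(' AND ')
end

sketch_and('name_t:Andrew OR name_t:Drew', { title_t: 'engineer' })
#=> "(name_t:Andrew OR name_t:Drew) AND (title_t:engineer)"
sketch_and(sketch_deparameterize('name_t:?', 'Andrew Stone'), 'title_t:eng*')
#=> "(name_t:\"Andrew Stone\") AND (title_t:eng*)"
```

The parentheses keep an OR inside one string from leaking into the clauses joined next to it.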

Nicer

A big benefit of this approach is that it minimizes “string programming.” Instead of having huge monolithic Solr strings in your models, you have syntax-checked Ruby with no, or at least smaller, strings. The chaining means it’s easier to manipulate and paginate the queries prior to expensive fetch operations or showing the results to the user.

At the same time, it’s not strictly necessary, so it’ll remain separate from but dependent on the Riak Ruby client proper. It’s a nice way to build queries, but doesn’t really provide any of the plumbing necessary to run them itself.

Today and Tomorrow

The gem is published on RubyGems and the source is on GitHub, but it’s in no way production ready: the tests are hardcoded to my local Riak install, and it depends on an unstable development version of Riak (2.0.0pre1) and the very literal bleeding edge of the Riak Ruby client. I’d like to get it to 1.0 after Riak 2 comes out, but time will tell.