Sonic search: notes on its use

I have been working on a WordPress search plugin that uses Valerian Saliou’s Sonic search backend. Sonic is a lightweight alternative to backends like Elasticsearch. Its straightforward approach to normalizing natural language queries makes it useful for searching WordPress content, and its speed and feature set lend themselves to autocompletion.

Using Sonic in place of WordPress’s native search capability improves both performance and user experience. (WordPress relies on SQL’s notoriously slow LIKE pattern-matching operator to perform searches.)

Sonic is a Rust package that compiles to a single executable program. Internally it uses the RocksDB key-value store. It connects to client programs via TCP. PHP code inside WordPress accesses it via the psonic library.

Because its connection is plain TCP without TLS, Sonic is probably insecure except on a trusted local network. It can be configured to listen only on the loopback interface ([::1] or 127.0.0.1).
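
As a rough illustration of what a loopback-only connection looks like from the client side, here is a minimal Python sketch. It assumes Sonic's default port (1491) and its plain-text START handshake; the function names and the password are placeholders of mine, not part of psonic or of my plugin.

```python
import socket

SONIC_HOST = "::1"   # loopback only; assumes Sonic is configured to listen here
SONIC_PORT = 1491    # assumed: Sonic's default channel port


def start_command(mode: str, password: str) -> str:
    """Build the handshake line that selects a channel mode."""
    if mode not in ("search", "ingest", "control"):
        raise ValueError(f"unknown Sonic channel mode: {mode!r}")
    return f"START {mode} {password}"


def open_channel(mode: str, password: str):
    """Connect over plain TCP and start a channel; returns (socket, reader)."""
    sock = socket.create_connection((SONIC_HOST, SONIC_PORT), timeout=5)
    reader = sock.makefile("r", encoding="utf-8", newline="\n")
    reader.readline()  # server greeting, expected to begin with CONNECTED
    sock.sendall((start_command(mode, password) + "\r\n").encode("utf-8"))
    reply = reader.readline()  # expected to begin with STARTED
    if not reply.startswith("STARTED"):
        sock.close()
        raise ConnectionError(f"Sonic refused the channel: {reply.strip()}")
    return sock, reader
```

Nothing in this exchange is encrypted, including the password, which is why keeping the listener on the loopback interface matters.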

Sonic stores its content in the file system of its host machine. It works best on solid state drives.

Being a simple lightweight program, Sonic does not scale out to multiple host machines. At the same time, because it is simple and lightweight it can handle a large load on just one machine.

As of late October 2022, it can be installed using a Docker container or by building the Rust program. Ideally it would have .deb packages and other binary installers.

Storage

Sonic stores information in a three-level hierarchy: Collection, Bucket, and Object. It uses Unicode throughout.

Collections

In WordPress use, a Collection corresponds to an independent site or blog. In a WordPress multisite installation, for example, each blog (subsite) would have its own Collection. If a single Sonic server supported multiple tenant WordPress installations, each would have its own Collection.

For a Collection name my plugin uses a hash of the site URL concatenated with the subsite’s database prefix. For example, it uses a hash of https://plumislandmedia.net/wp_ for this site.
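
A minimal sketch of that naming scheme in Python follows. The choice of SHA-256 and the truncation to 32 hex characters are assumptions made for illustration; the text above only says that a hash is used.

```python
import hashlib


def collection_name(site_url: str, db_prefix: str) -> str:
    """Derive a stable, opaque Collection name for one site or subsite."""
    key = site_url + db_prefix  # e.g. https://plumislandmedia.net/wp_
    # Assumed: SHA-256, shortened to 32 hex characters.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


name = collection_name("https://plumislandmedia.net/", "wp_")
```

Because the database prefix is part of the hashed key, two subsites that share a URL root but use different prefixes still get distinct Collections.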

Buckets

In many cases a Collection has just one Bucket. However, an application requiring separate searching for different kinds of content (for example, posts, products, and users) might have a Bucket for each. As of late October 2022, my WordPress plugin has one Bucket per Collection.

Suggest and Search requests search over all Objects in a single Bucket.

Objects

Within each Bucket are multiple Objects. Each Object has an assigned ObjectID, and contains some text to be searched. In the WordPress application, I use these sorts of ObjectIDs.

  • ‘12345:title’ contains the post_title of post ID 12345.
  • ‘12345:summary’ contains the summary (post_excerpt) of that post.
  • ‘12345:content’ contains its post_content.
  • ‘12345:_yoast_wpseo_metadesc’ contains the description metadata gathered by the Yoast SEO plugin. In general, any relevant post metadata item can have its own ObjectID. The part of the ObjectID after the colon contains the meta_key value for the relevant metadata. (My plugin handles descriptions from several different SEO plugins.)

The plugin uses the second part of the ObjectID to determine the weights of search results. For example, searches matching a post’s title have higher weight than those matching the content.
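
The ObjectID convention above can be sketched in Python as follows. The weight values are hypothetical, chosen only to show titles outranking content; they are not the plugin's actual numbers.

```python
from typing import NamedTuple


class ObjectRef(NamedTuple):
    post_id: int
    field: str


# Hypothetical weights, for illustration only.
FIELD_WEIGHTS = {"title": 3.0, "summary": 2.0, "content": 1.0}
DEFAULT_WEIGHT = 1.5  # assumed weight for metadata such as SEO descriptions


def make_object_id(post_id: int, field: str) -> str:
    """Join a post ID and a field or meta_key into an ObjectID."""
    return f"{post_id}:{field}"


def parse_object_id(object_id: str) -> ObjectRef:
    """Split on the first colon only, so a meta_key may contain colons."""
    post_id, field = object_id.split(":", 1)
    return ObjectRef(int(post_id), field)


def field_weight(object_id: str) -> float:
    """Look up the weight for the second part of the ObjectID."""
    return FIELD_WEIGHTS.get(parse_object_id(object_id).field, DEFAULT_WEIGHT)
```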

Each object holds one or more text strings, each containing a sequence of words with HTML and other markup removed. Sonic handles those words case-insensitively. As of late October 2022 it does not do any diacritical mark normalization. For example, it treats “Français” and “Francais” as different words.
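
Removing markup before ingestion could look something like this Python sketch, which uses only the standard library. It is an illustration of the idea rather than the plugin's PHP code, and the function name is mine.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(html: str) -> str:
    """Return the words of an HTML fragment as one whitespace-normalized string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so words in adjacent elements do not run together,
    # then collapse runs of whitespace. A word split by inline tags
    # (ba<b>na</b>na) would come out as separate pieces.
    return " ".join(" ".join(parser.parts).split())
```

Because Sonic does not fold diacritical marks, text prepared this way keeps “Français” and “Francais” distinct.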

Locales (languages)

Sonic can process text according to a specified natural language, and it supports many languages. A client program specifies the language using the appropriate three-letter code from ISO 639-3. For example, “fra” specifies French, and “eng” English. When Sonic has a language code it applies language-specific stemming and stopword processing.

Requests

Sonic has three main types of request: Suggest, Search, and Ingest. It also has some control requests.

Suggest

Each Suggest request searches a single Bucket in a single Collection. It accepts a search term containing a single word of text. (It throws an exception if given more than one word.) It returns a list of similar words based on the contents of the Bucket. These words, when shown to a user, can help correct typographic errors. Sonic presents the list of words in lexical order.

If the search term starts with the same letters as words in the corpus of text being searched, it suggests possibly relevant words. But given a partial word with the first letters missing, it does seem to present the complete word as one of its suggestions. For example, given “anana” one of its suggestions is “banana”.
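
On the wire, a Suggest request is a single SUGGEST line sent on a search channel. The Python sketch below builds that line and enforces the one-word rule on the client side, before Sonic has a chance to reject it. The function name and the default limit are my own choices for illustration.

```python
def suggest_command(collection: str, bucket: str, word: str, limit: int = 5) -> str:
    """Build a SUGGEST request line for exactly one word."""
    word = word.strip()
    if len(word.split()) != 1:
        raise ValueError("Suggest accepts exactly one word")
    if '"' in word:
        raise ValueError("the word must not contain a double quote")
    return f'SUGGEST {collection} {bucket} "{word}" LIMIT({limit})'


line = suggest_command("mysite", "mybucket", "anana")
```

A plugin handling a multi-word query would call this once per word.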

Search

Each Search request searches a single Bucket in a single Collection. It accepts a search term containing multiple words of text in a single string. It returns a list of matching ObjectIDs. Client software should treat the order of the matching ObjectIDs as unpredictable.

Notice that it does not return the matched text, only the ObjectIDs. Sonic’s client software must use those ObjectIDs to retrieve the matched text. In my WordPress plugin an ObjectID might be ‘12345:content’. That means I can retrieve the matched text with

SELECT post_content FROM wp_posts WHERE ID = 12345;

The Search request is word-oriented, not phrase-oriented. It searches independently for each word in its search term.

For example, searching the previous paragraph for “search phrase” will find the paragraph, even though that particular sequence of words does not appear. Given multiple words in a search term, Sonic returns the ObjectIDs containing all the words.
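
Since a Search returns only ObjectIDs, the client's first job is to turn them back into post IDs. The Python sketch below assumes Sonic's asynchronous reply format, in which results arrive on a line beginning with EVENT QUERY followed by a marker and then the ObjectIDs. The marker and the IDs in the example are made up.

```python
def parse_query_event(line: str) -> list:
    """Extract the ObjectIDs from an 'EVENT QUERY <marker> <ids...>' line."""
    parts = line.split()
    if len(parts) < 3 or parts[:2] != ["EVENT", "QUERY"]:
        raise ValueError(f"not a query result line: {line!r}")
    return parts[3:]  # parts[2] is the marker that pairs reply with request


def group_by_post(object_ids) -> dict:
    """Map each post ID to the list of fields in which the search matched."""
    matches = {}
    for object_id in object_ids:
        post_id, field = object_id.split(":", 1)
        matches.setdefault(int(post_id), []).append(field)
    return matches


# Example reply; the marker and ObjectIDs are invented for illustration.
reply = "EVENT QUERY Bt2m2gYa 12345:title 12345:content 678:summary"
matches = group_by_post(parse_query_event(reply))
```

The keys of the resulting dictionary are the IDs to fetch from wp_posts, and the field lists show where each post matched.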

Ingest

The Ingest request accepts a string of words associated with a Collection, Bucket, and ObjectID. For example, a client program can tell it to Ingest with a (pseudocode) request like this:

Ingest mysite mybucket 12345:title "Most cherries are smaller than bananas"

Multiple strings of words can be placed into each ObjectID.

After a client program instructs Sonic to Ingest objects, it is necessary to give it a Consolidate request.
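
In Sonic's actual protocol the ingest command is named PUSH, and consolidation is requested with TRIGGER consolidate on a control channel. The Python sketch below builds those lines. The escaping rules shown (backslash-escaping quotes and backslashes, flattening newlines) are my assumption about what keeps the quoted text on a single line, not a statement of Sonic's specification.

```python
from typing import Optional

CONSOLIDATE_COMMAND = "TRIGGER consolidate"  # sent on a control channel


def escape_text(text: str) -> str:
    """Assumed escaping: protect backslashes and quotes, flatten newlines."""
    text = text.replace("\\", "\\\\").replace('"', '\\"')
    return text.replace("\r", " ").replace("\n", " ")


def push_command(collection: str, bucket: str, object_id: str,
                 text: str, lang: Optional[str] = None) -> str:
    """Build a PUSH request line, optionally tagged with an ISO 639-3 code."""
    command = f'PUSH {collection} {bucket} {object_id} "{escape_text(text)}"'
    if lang:
        command += f" LANG({lang})"
    return command


line = push_command("mysite", "mybucket", "12345:title",
                    "Most cherries are smaller than bananas", lang="eng")
```

Calling push_command again with the same ObjectID and different text places a second string of words into that Object.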

Flush

A client program can use a Flush request to remove the contents of an Object, a Bucket, or an entire Collection.
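
Sonic's protocol has one flush command per level of the hierarchy: FLUSHC for a Collection, FLUSHB for a Bucket, and FLUSHO for an Object. A small Python sketch that picks the right one, with a function name of my own invention:

```python
from typing import Optional


def flush_command(collection: str, bucket: Optional[str] = None,
                  object_id: Optional[str] = None) -> str:
    """Build the flush line for the narrowest scope that was given."""
    if object_id is not None:
        if bucket is None:
            raise ValueError("flushing an Object requires its Bucket")
        return f"FLUSHO {collection} {bucket} {object_id}"
    if bucket is not None:
        return f"FLUSHB {collection} {bucket}"
    return f"FLUSHC {collection}"
```

When a post is updated, for example, a client could flush ‘12345:title’ and then ingest the new title.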

Search strategy

My WordPress plugin uses the following strategy to find content.

When a user presents a list of one or more words (or partial words), the plugin

  • Uses Suggest to retrieve words related to each user-presented word.
  • Weights the related words by their Levenshtein distance from the user-presented word.
  • Uses Search on the user-presented words and the related words.
  • Weights the returned Objects by the Levenshtein distance of the matching word and by whether the match came from the title, summary, or content, with titles and summaries weighted more highly.
  • Presents results in weighted order.

WordPress then shows the results to the user on its standard search results page.
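
The weighting steps above can be sketched in Python for a single user-presented word. The scoring formula (1 / (1 + distance)), the field weights, and the decision to sum scores per post are all assumptions made for illustration; the plugin's real arithmetic may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, computed one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def word_weight(typed: str, candidate: str) -> float:
    """Assumed formula: closer words get weights nearer to 1."""
    return 1.0 / (1 + levenshtein(typed.lower(), candidate.lower()))


FIELD_WEIGHTS = {"title": 3.0, "summary": 2.0, "content": 1.0}  # hypothetical


def rank_posts(typed: str, hits: dict) -> list:
    """hits maps each searched word to the ObjectIDs returned for it."""
    scores = {}
    for word, object_ids in hits.items():
        weight = word_weight(typed, word)
        for object_id in object_ids:
            post_id, field = object_id.split(":", 1)
            key = int(post_id)
            scores[key] = scores.get(key, 0.0) + weight * FIELD_WEIGHTS.get(field, 1.0)
    # Highest score first; ties broken by post ID for a stable order.
    return sorted(scores, key=lambda post_id: (-scores[post_id], post_id))
```

For a multi-word query, the same scoring could be applied per word and the per-post scores added together.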

Autocomplete strategy

Autocompletion performs searches and offers the titles of the matching posts for selection.

Credit

Props to Valerian for reviewing this.
