Meandering Musing About Views
Sun Jul 20 23:19:24 EDT 2014
As happens periodically, I've been thinking about Domino views lately. When I get into one of these moods, I find it helps to take a step back to look at what an NSF is.
An NSF is, in its heart of hearts, a key/value store. Each entry has several keys, of which the useful ones are the Note ID and the UNID, which are 32-bit and 128-bit integers, respectively, and where the Note ID is fixed and the UNID is mutable. Each entry's value is a multimap with string keys and values that are either effectively blobs or multi-value strings, numbers, or date/times+ranges, plus metadata.
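As a rough sketch of that mental model, the key/value view of an NSF could be written down in Java like this. Every name here (Note, NoteStore, changeUnid) is made up for illustration and is not part of the Domino API, and java.util.UUID is used only as a convenient 128-bit stand-in for the UNID.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical model of a note; these are not Domino API classes.
final class Note {
    final int noteId; // 32-bit key, fixed once assigned
    UUID unid;        // stand-in for the 128-bit UNID, which can change
    // Item name -> multi-value list (strings, numbers, date/times) or a blob such as byte[]
    final Map<String, List<Object>> items = new LinkedHashMap<>();

    Note(int noteId, UUID unid) {
        this.noteId = noteId;
        this.unid = unid;
    }
}

// Hypothetical store: look up by either key, or iterate over everything.
final class NoteStore {
    private final Map<Integer, Note> byNoteId = new LinkedHashMap<>();
    private final Map<UUID, Note> byUnid = new HashMap<>();

    void put(Note note) {
        byNoteId.put(note.noteId, note);
        byUnid.put(note.unid, note);
    }

    Note getByNoteId(int noteId) {
        return byNoteId.get(noteId);
    }

    Note getByUnid(UUID unid) {
        return byUnid.get(unid);
    }

    Collection<Note> all() {
        return byNoteId.values();
    }

    // The UNID is mutable, so the store has to re-key the note when it changes.
    void changeUnid(Note note, UUID newUnid) {
        byUnid.remove(note.unid);
        note.unid = newUnid;
        byUnid.put(newUnid, note);
    }
}
```

Note that the only access paths are "by key" and "all", which is exactly the gap that an indexer fills.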
What it does not intrinsically have (conceptually) is an ability to collect, index, and query documents other than "all" or by key (db.search can be thought of as a specialized instance of indexing). Layered on top (again: conceptually) of the NSF are two IBM-supported indexing schemes: the view indexer (NIF) and the full-text index. Though these services are baked into Domino at an API level (including data transit over the same wire), they are in many ways no different from LDAP, the RnR manager, or third-party Domino addins: they are independent services that provide additional capabilities to the server.
This is a long-winded way of saying: there's nothing stopping anyone from doing their own indexer, particularly now that the primary use would be within an XPage running on the server directly, not a legacy client. So what would such an indexer need? The way I see it, there are four conceptual parts: document selection, index entry creation, index updating, and querying. Both of the two built-in methods handle these tasks in their own way. The full-text index's answers are:
- Document selection: all of them.
- Index entry creation: all processable string data in the document, with variations configured at setup time. There is no user-defined index-stored metadata and the resultant index is effectively a black box.
- Index updating: "immediately", by which it means "eventually". Specifically, it's handled by an updater task that I'll charitably assume batches changes for group processing. Can also be done on schedule or manually.
- Querying: a custom string-based DSL that allows for selection of documents and sorting by "relevance" or creation date. It also attempts to provide location/highlight information for matched data in the document, but it's best not to think about that.
NIF's answers are:
- Document selection: formula language, with the limitation that it can select only based on summary data.
- Index entry creation: a combination of formula language (also with the summary limitation) and column and view configuration, resulting in a combined array+tree structure of individual entries, each of which is a combined array+map structure. This also involves specifying categories and keys for later querying.
- Index updating: this is a bit more reliably quick than the FT update, and operates on similar lines: by default, it's triggered by DB change events, but can be updated manually and set to update less frequently.
- Querying: querying is done via a series of operations to read the index structure. These operations focus on getting a single column's values, selecting entries/documents by key, and traversing entries sequentially with some hierarchical operations. The additional information included during entry creation can be used to eliminate the need to access the actual documents later in some situations.
Click-to-sort columns in views are effectively separate indexes that share much of their configuration information. NIF can also be combined with the full-text search index to intersect the pre-selected contents of the view with the FT query result.
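To make the four conceptual parts concrete, here is a minimal in-memory sketch in Java. It is not how NIF or the full-text indexer work internally; SketchIndex and its method names are hypothetical, documents are simplified to a note ID plus a Map of item values, and each document contributes a single non-null sort key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical single-key index covering selection, entry creation, updating, and querying.
final class SketchIndex<K extends Comparable<K>> {
    private final Predicate<Map<String, Object>> selection;    // document selection
    private final Function<Map<String, Object>, K> keyFormula; // index entry creation (must not return null)
    private final TreeMap<K, TreeSet<Integer>> entries = new TreeMap<>();
    private final Map<Integer, K> currentKey = new HashMap<>();

    SketchIndex(Predicate<Map<String, Object>> selection,
                Function<Map<String, Object>, K> keyFormula) {
        this.selection = selection;
        this.keyFormula = keyFormula;
    }

    // Index updating: call this whenever a document is created or changed.
    void update(int noteId, Map<String, Object> doc) {
        remove(noteId); // drop any stale entry first
        if (selection.test(doc)) {
            K key = keyFormula.apply(doc);
            entries.computeIfAbsent(key, k -> new TreeSet<>()).add(noteId);
            currentKey.put(noteId, key);
        }
    }

    // Index updating: call this when a document is deleted.
    void remove(int noteId) {
        K old = currentKey.remove(noteId);
        if (old != null) {
            TreeSet<Integer> ids = entries.get(old);
            ids.remove(noteId);
            if (ids.isEmpty()) {
                entries.remove(old);
            }
        }
    }

    // Querying: select note IDs by exact key.
    List<Integer> byKey(K key) {
        TreeSet<Integer> ids = entries.get(key);
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    // Querying: traverse all entries sequentially in key order.
    List<Integer> inOrder() {
        List<Integer> result = new ArrayList<>();
        entries.values().forEach(result::addAll);
        return result;
    }
}
```

An index over "Person" documents sorted by last name would then be built with a selection predicate like d -> "Person".equals(d.get("Form")) and a key function like d -> (String) d.get("LastName"), with update called from whatever change notification is available.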
When I dabbled with Fancy Views years ago, I focused primarily on the first two components. For selection, I allowed either the standard formula-based selection or an FT-search query, and for the entry creation I took the framework established by NIF and expanded it to allow any JSR-223 language to return a value, to work transparently with MIMEBean values, and to allow storing any Serializable value. The updating was skeletal - basically whenever I ran the agent - and the querying was half-assed, being limited to a single sort value and then just iteration. Still, this concept has promise: because the relative expense of generating the index is dwarfed on modern systems by the value of having a better resultant index, allowing complex operations like non-summary/MIMEBean access and alternative languages is very worthwhile.
The last couple days, I've been taking a look at CQEngine. CQEngine focuses on the final step - querying. It operates on Java Collections, which could range from an ArrayList of HashMaps to an arbitrarily-complex database index that implements Collection and for which the user provides key/value adapters. Where CQEngine shines is being able to build complex queries across multiple attributes and order the results, much like you would do in SQL.
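To illustrate the shape of such a query (several attributes combined, then an ordering), here is a sketch over an ArrayList-of-Maps style collection using only plain Java streams. This is deliberately not CQEngine's API: CollectionQuery and its helpers are hypothetical names, and the code does a simple linear scan where CQEngine's point is to answer the same kind of question from indexes.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helpers for SQL-like queries over a collection of maps.
final class CollectionQuery {
    // WHERE attr = value
    static Predicate<Map<String, Object>> eq(String attr, Object value) {
        return row -> Objects.equals(row.get(attr), value);
    }

    // WHERE attr > bound (rows without an Integer value for attr never match)
    static Predicate<Map<String, Object>> greaterThan(String attr, int bound) {
        return row -> row.get(attr) instanceof Integer && (Integer) row.get(attr) > bound;
    }

    // ORDER BY attr (assumes every selected row has a non-null String for attr)
    static Comparator<Map<String, Object>> byString(String attr) {
        return Comparator.comparing(row -> (String) row.get(attr));
    }

    // SELECT * FROM rows WHERE ... ORDER BY ...
    static List<Map<String, Object>> select(Collection<Map<String, Object>> rows,
                                            Predicate<Map<String, Object>> where,
                                            Comparator<Map<String, Object>> orderBy) {
        return rows.stream().filter(where).sorted(orderBy).collect(Collectors.toList());
    }
}
```

A call such as select(rows, eq("Form", "Person").and(greaterThan("Age", 30)), byString("LastName")) reads much like the equivalent SQL, with the query stating what it wants and leaving the how to the engine.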
I'm not 100% sold on the notion of CQEngine being a building block of a new view indexer, but it has some promise - and any new indexer doesn't have to be a full NIF replacement or even the only new indexer. The lack of a string-based query syntax makes it a bit awkward (would it be represented as a tree of XSP components?) and the fact that the built indexes aren't meant to be serialized means that they'd have to be rebuilt once per session (though the backing index itself wouldn't be). Combined with an initial indexer that takes the Couch* approach of a JavaScript/JSR-223 function to select documents and emit entry values and an index-update task, it could provide some interesting capabilities while being potentially much faster and more flexible than NIF for many operations.
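The Couch-style approach mentioned above folds selection and entry creation into one function that is handed each document and emits zero or more index entries. A minimal sketch follows, with a plain Java lambda standing in for the JavaScript/JSR-223 function; EmitIndexer, Emitter, MapFunction, and Entry are hypothetical names, and the "index" is just a list rebuilt from scratch and sorted by key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical map/emit indexer; a lambda stands in for the scripted function.
final class EmitIndexer {
    // One index entry: sort key, stored value, and the source document's note ID.
    static final class Entry {
        final String key;
        final Object value;
        final int noteId;

        Entry(String key, Object value, int noteId) {
            this.key = key;
            this.value = value;
            this.noteId = noteId;
        }
    }

    interface Emitter {
        void emit(String key, Object value);
    }

    // Decides whether a document belongs in the index by emitting (or not emitting) entries.
    interface MapFunction {
        void map(Map<String, Object> doc, Emitter out);
    }

    // Full rebuild: run the map function over every document, then sort by key.
    static List<Entry> build(Map<Integer, Map<String, Object>> docs, MapFunction fn) {
        List<Entry> entries = new ArrayList<>();
        docs.forEach((noteId, doc) ->
                fn.map(doc, (key, value) -> entries.add(new Entry(key, value, noteId))));
        entries.sort(Comparator.comparing((Entry e) -> e.key));
        return entries;
    }
}
```

Because the function can emit any number of entries per document, a single document can appear under several keys, and a document that emits nothing is simply left out of the index.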
Though this is currently all speculation, it's satisfying to know that, like with the OpenNTF Domino API, there's nothing standing in between speculation and a real system other than doing a bit of programming. It's also just one of many potential non-exclusive paths. One of the coolest aspects of the Cambrian explosion of NoSQL technology in recent years is that each system comes with its own take on indexing/querying and associated support systems have arisen that can be used side-by-side with a document store like Domino. The latter systems also have the side effect of further opening the window to the outside world.
So will I actually try to fully build out one of these index-replacement ideas? Eh, maybe. I get the itch every once in a while, either for performance concerns or my desire to index on MIMEBean data, and having a working index replacement could open up a world of new possibilities. So we'll see. For now, I put my CQEngine tinkering up on GitHub and I expect I'll keep the concept floating around in the back of my brain for the next couple days at least.
Nathan T. Freeman - Mon Jul 21 11:03:37 EDT 2014
+1 for the use of Cambrian Explosion reference. :-)
One thing you left out of your indexing criteria was Sorting. And this is kind of important because it goes to a point you made: "Click-to-sort columns in views are effectively separate indexes that share much of their configuration information. "
I've seen people echo this before, and admittedly it's implied by IBM's documentation, but it isn't actually true. Click-to-sort columns don't create an entirely separate *index*. They create a separate *collation sequence*. You can actually see this as multiple $Collation items in the view design note. But this is different from duplicating the $Collection item that handles all the other view information (like the values of the individual columns).
Jesse Gallagher - Mon Jul 21 11:10:12 EDT 2014
Yes indeed... I should have added another "(conceptually)" hedge on that sentence. Though I didn't look into the specifics of what is and isn't shared between the collation items, I figured it was something like that - but to the programmer using it, it's basically like another view with the same docs and columns (minus categories).
Sorting generally is definitely another consideration, and the way it's done depends on the specifics of the other aspects. For NIF-style views (and my original Fancy Views), it's absolutely crucial to specify sorting up front. For the CQEngine approach, it acts a bit more like SQL where the internal sorting of the data and indexes is obfuscated in favor of the query asking for what it wants and the system handling it however it needs.