Delving Into NSF Raw Item Data
Tue Jul 29 13:46:21 EDT 2014
For the Design package of the OpenNTF Domino API, one of my goals was to add the ability to access the supporting structures around XPages: file resources, images, CSS, and, crucially, Java classes and the XPages themselves. My weapon of choice for the whole API is DXL, and this covered the job nicely for most of those aspects. DXL has clean representations for the first three in the list: specialized XML formats containing the pertinent metadata and a single BASE64-encoded block of the actual file data.
That's not the case for Java classes and XPages, however. For those, DXL falls back to the "note" format (inaptly referred to as "Binary DXL" in the Designer prefs). What sets this format apart from the element-specific DXL is that it doesn't attempt to present the data in a particularly friendly way - no specialized file- or image-resource support, no converting of agent code to a readable format, and no converting of rich text to the HTML-like structure you've seen if you've worked with DXL. Not everything is stored as binary data in this format - text, numbers, (some) formulas, and (most) date/time values are still converted to a human-friendly form. But the point remains: this format is closer to the data you would get by using the C API. In fact, for the "rawitemdata" entries, it is the data you would get by using the C API. This came up originally in the Design API and again over the last couple of days in the "nsfdata" package I'm working on to replace the design innards.
The APIs
I'd like to step back a moment from the notion of getting the C API data to go over the various APIs of Domino. The normal Java API we work with - the lotus.domino.local one, not the OpenNTF one - is a third level of abstraction (or maybe fourth - I'm not sure if there's an extra layer of C++ stuff in here). The Java API is a fairly thin wrapper around the same back-end classes used in LotusScript ("LSXBE"). Other than presenting some data types as Java native types like String and int and providing a few convenience methods like DateTime#toJavaDate, it doesn't bring too much to the table.
The "LSXBE" objects have a more-complicated job (assuming there's not another layer): they bring object-oriented order to the lowest level, the C API. The C API is basically a big soup of data structures, constants, type aliases (e.g. DWORD = 32-bit integer), and functions to operate on these. As is usually the case with a C API, dealing with it is onerous, and so the higher-level APIs wrap these capabilities into the objects you know and love. This wrapping has a number of significant limitations in functionality and speed, but overall the capabilities you see in the objects have direct analogues at the C level.
Those limitations, incidentally, are laid bare by the XPage environment's cheating-like-crazy "napi" classes.
The point is that the higher-level APIs generally shield you from the harsh realities of the C structures, but those structures leak through in the "rawitemdata" elements in DXL. As it turns out, data in these elements, when BASE64-decoded, can be accessed according to the specifications in the C API struct documentation.
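As a minimal sketch of that first step, the following decodes the BASE64 text of a "rawitemdata" element into a buffer that can be read field by field. The class and method names (RawItemDataDecoder, readWord, readDword) are hypothetical and not part of any Domino or OpenNTF API, and the code assumes the decoded bytes use little-endian byte order and that the caller has already pulled the element text out of the DXL.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Base64;

// Hypothetical helper for illustration only; not part of any Domino API.
class RawItemDataDecoder {
    /** Decodes BASE64 text into a buffer positioned at the first byte. */
    static ByteBuffer decode(String base64Text) {
        // DXL output is typically line-wrapped, so use the MIME decoder,
        // which skips line breaks and other non-BASE64 characters.
        byte[] bytes = Base64.getMimeDecoder().decode(base64Text);
        // Assumption: multi-byte values are stored little-endian.
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    /** Reads an unsigned 16-bit value (a C API WORD) and advances the buffer. */
    static int readWord(ByteBuffer buf) {
        return buf.getShort() & 0xFFFF;
    }

    /** Reads an unsigned 32-bit value (a C API DWORD) and advances the buffer. */
    static long readDword(ByteBuffer buf) {
        return buf.getInt() & 0xFFFFFFFFL;
    }
}
```

From there, reading a struct is a matter of calling the read methods in the order the struct documentation lists its fields.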
Composite Data
Most of the item data structures are both pretty straightforward and not actually presented in DXL anyway. Numbers-list items, for example, are basically a count of entries followed by a series of 64-bit floating-point values (oddly, the API supports the concept of a number range like a date range, but NSF does not). There are a couple types that take some more understanding, though, and foremost among them is Composite Data.
Composite Data is primarily the format that NSF uses to store rich text data, though it's full of components to store non-RT data like file resources. A Composite Data item is essentially an array of "records", which consist of a few things:
- A number indicating the type of the record (potentially when combined with the next part). A record type is something like "TEXT", "TABLEBEGIN", or "LINK": one of the core components that make up Composite Data. You can think of these sort of like XML start and end tags.
- A number indicating either the stated size of the record or, if it's 0 or 255, the size of the number following this that indicates the stated size of the record. C APIs are full of crap like that.
- If the size number was 0 or 255, a 16- or 32-bit integer indicating the stated size of the record.
- Any fixed-size structure data stored in the record. For example, CDFILEHEADER, one of the records used in file resources, contains a number of fields indicating things like the length of the file extension (I don't know why), the number of CDFILESEGMENTs that make up the file, and so forth.
- Any variable-length data stored in this record. Not all record types contain any special data beyond the basic struct, but many do. This usually consists of any associated strings (such as the aforementioned file extension) or raw binary data like that of a file segment. There may also be padding bytes attached to strings when the number of bytes used to store them is non-even.
This structure is why RichTextNavigator acts the awkward way it does.
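The record layout in the list above can be sketched as a small scanner that walks a Composite Data item and reports each record's header. This is an illustration under stated assumptions, not the code from the Design API: CdRecordScanner and Header are hypothetical names, the bytes are assumed to be little-endian, a size byte of 255 is taken to mean a 16-bit size follows and 0 to mean a 32-bit size follows, the stated size is assumed to include the header itself, and each record is assumed to start on an even offset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner for illustration only; not part of any Domino API.
class CdRecordScanner {
    /** Header details for one record in the item. */
    static final class Header {
        final int type;        // first byte: the record type number
        final long length;     // stated size of the whole record
        final int offset;      // where the record starts in the item
        final int headerSize;  // 2, 4, or 6 bytes depending on the size form

        Header(int type, long length, int offset, int headerSize) {
            this.type = type;
            this.length = length;
            this.offset = offset;
            this.headerSize = headerSize;
        }
    }

    /** Walks the item from the start and collects every record header. */
    static List<Header> scan(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        List<Header> result = new ArrayList<>();
        while (buf.remaining() >= 2) {
            int offset = buf.position();
            int type = buf.get() & 0xFF;
            int sizeByte = buf.get() & 0xFF;
            long length;
            int headerSize;
            // A truncated header makes the reads below throw BufferUnderflowException.
            if (sizeByte == 0) {
                length = buf.getInt() & 0xFFFFFFFFL;  // assumed: 32-bit size follows
                headerSize = 6;
            } else if (sizeByte == 255) {
                length = buf.getShort() & 0xFFFF;     // assumed: 16-bit size follows
                headerSize = 4;
            } else {
                length = sizeByte;                    // the byte is the size itself
                headerSize = 2;
            }
            if (length < headerSize) {
                throw new IllegalArgumentException("Bad record size at offset " + offset);
            }
            result.add(new Header(type, length, offset, headerSize));
            // Assumption: records are padded so the next one starts on an even offset.
            long next = offset + length + (length & 1);
            if (next > data.length) {
                break; // final record is truncated; stop rather than read past the end
            }
            buf.position((int) next);
        }
        return result;
    }
}
```

The fixed-size and variable-length parts of a record then sit between offset + headerSize and offset + length, to be interpreted according to the struct documentation for that record type.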
The nice thing about understanding this is that, once you do, it opens a window into all sorts of low-level operations. You're no longer restricted to working with just the elements that "friendly" DXL presents: while you still have to do the work of understanding the pertinent records, the work is straightforward. I originally used this as a way to read and write file-resource-type notes, but my new classes are meant to be more generic, opening the door to arbitrary Composite Data manipulation.
So... okay... why?
As I mentioned at the start, the impetus for my delving this past week was to clean up the backend of the Design API to work more generically, rather than consisting of one-offs to deal with the items just in file resources. The side effects are intriguing in a "mad scientist" sort of way, though. In effect, what I've been building so far is a mechanism for accessing and manipulating native NSF data structures without the presence of an actual NSF (or, in fact, any Notes/Domino dependency at all), in the process completing the utility of DXL. The implications of a fully-functional non-NSF store of NSF data are fascinating.
It's also important foundational work if I deal with the C API more directly in the future. Years back, I dabbled in a Ruby wrapper around the API, and the concept has never quite left my brain. As the XPage NAPI demonstrates, there's a tremendous amount of speed and functionality benefit to be had in bypassing the legacy Java API completely, and I may travel more fully down that path one day.
But the final reason why is "because it's there". Learning more about these underlying structures provides tremendous insight into why Domino does the things it does. It's also just good for my brain to deal with an API with conventions other than the legacy Java API or Java-standard semantics.
So I'm interested to see where this will go in the long term. It's possible its life will be primarily as a slightly-better back end for the existing Design API functionality, but it opens up numerous potential roads for future capabilities.