How I Learned to Stop Worrying and Love C Structs

Aug 4, 2014 10:38 PM

Tags: c java

Since a large amount of on-site client time has put my framework work on hold for a bit, I figured I'd continue my dalliance in the world of raw item data.

Specifically, what I've been doing is filling out the collection of structs with Java wrappers, particularly in the "cd" subpackage. For someone like me who has only done bits and pieces of C over the years, this is a fruitful experience, particularly since I'm dealing with just the core concept of the Notes API data structures without the messy business of actually worrying about memory or crashing programs.

If you're not familiar with structs, what they are is effectively just the data portion of a class. They're like a blueprint of related pieces of data - ints, doubles, other structs, etc. - used both for creating new elements and for a contract for the type of data received from an API. The latter is what I'm dealing with: since the raw item data in DXL is presented as just a series of bytes, the struct documentation is vital in figuring out what to expect in which spot. Here is a representative example of a struct declaration from the Notes API, one which includes bonus concepts:

typedef struct{
	LSIG	Header;		/* Signature and Length */
	WORD 	FileExtLen;		/* Length of file extenstion */
	DWORD	FileDataSize;	/* Size (in bytes) of the file data */
	DWORD	SegCount;		/* Number of CDFILESEGMENT records expected to follow */
	DWORD	Flags;		/* Flags (currently unused) */
	DWORD	Reserved;		/* Reserved for future use */
	/*	Variable length string follows (not null terminated).
		This string is the file extension for the file. */
} CDFILEHEADER;

So... that's a thing, isn't it? It represents the start of a File Resource (or CSS file, Java class, XPage, etc.). I'll start from the top:

  1. LSIG Header: An "LSIG" is actually another struct, but a smaller one: its job is to let a processor (like my code) identify the following record, as well as providing the total size of the record. When processing the byte stream, my code looks to the first two bytes of each record to determine what to expect for the next block.
  2. WORD FileExtLen: To start with, "WORD" here is just an alias for an "unsigned short" - a 16-bit integer ranging from 0 to 65535 (64K). The term "word" itself refers to the computer-architecture term. This field specifically denotes the number of bytes to expect after the record for the file extension of the file resource being defined.
    • The "unsigned" above refers to the way the numbers are stored in memory, which is to say a series of binary digits. By default, the highest-value bit is reserved for the "sign" - a 0 for positive and a 1 for negative. Normal "signed" numbers can store positive values only half the size of their unsigned counterparts, because going from "0111 1111" (+127) wraps around to "1000 0000" (-127). This is why the "half" values - 32K, 2G - are seen as commonly as their "full" counterparts. As to why negative numbers are represented with all those zeros, you'll either have to read up independently or just trust me for now.
  3. DWORD FileDataSize: A "DWORD" is double the size of a "WORD" (hence the "D", presumably): it's an unsigned 32-bit number, ranging from 0 to 4,294,967,295 (4G). This number represents the total size of the file attachment, and so also presumably represents the maximum size of a file resource in Domino.
  4. DWORD SegCount: This indicates the number of "segment" records expected to follow. Segment records are similar to this header we're looking at, but contain the actual file data, split across multiple chunks and across multiple items. That "multiple items" bit is due to the overall structure of rich-text items, and is why you'll often see rich text item names (like "Body" or "$FileData") repeated many times within a note.
  5. DWORD Flags: As the comment in the API indicates, this field is currently unused. However, it's representative of something that is used very commonly in other structures: a set of bits that isn't useful as a number, but instead has values set to correspond to traits of the entity in question. This is referred to as a bit field and is an efficient way to store a fixed number of flags. So if you had a four-bit flags field for a text item, the bits may indicate whether it's summary data, a names field, a readers field, and/or an authors field - for example "1100" for a field that's summary and names, but not readers or authors. That is, incidentally, basically how those flags are implemented, albeit with a larger bit field.
    • There is a secondary type of "flag" in Notes: the "$Flags" and "$FlagsExt" fields in design elements. Those flags are less efficient - 8 bits per flag instead of 1 - but are generally more extensible and somewhat conceptually easier.
    • These bit fields are what those oddball bitwise operators are for, by the way.
  6. DWORD Reserved: Many (most?) C API data structures contain at least one block of bits like this that is reserved for future use. These are so that IBM can add features without breaking all existing code: API users are expected to leave anything there intact and to create new structures with all zeros. You can see this in action in a couple places, such as the addition of rudimentary theme support to legacy design elements, which used up a byte out of the previously-reserved block.
  7. Variable data: This is the fun part. Many structures contain "variable" data following the "fixed" portion. Whereas the previous parts are a predictable size - a DWORD will always be four bytes - the parts following the block can be anywhere from 0 bytes to whatever is the max value of their referring entity. Fortunately, the "SIG" at the beginning of the record tells the API user the length ahead of time, so it's not required to read it all in when dealing with an API entity. Still, the code to read this can be complex and bug-prone, particularly when there are multiple variable parts packed together. In this case, though, it's relatively simple: we get the value of "FileExtLen" and read that many bytes into an array and convert that from LMBCS to a respectable Unicode string.

Dealing with a stream of structs is... awkward at first, particularly when you're used to Java amenities like being able to just pour out and read back in serialized objects without even thinking twice about it. After a while, though, it gets easier, and you get an appreciation for a lower-level type of programming. And while you still may not be happy about Domino's 32K summary-data limit, you at least get an understanding for why it's there.

So if you're in a position to dive into this sort of thing once in a while, I recommend you do so. Though the value of what I'm writing specifically is mixed - I doubt the world needs, say, a Java wrapper for a CD record reflecting a DECS field association - the benefit to my brain is immense. Dealing with web and Java programming can cause you to become very disconnected from the fundamentals of programming, and something like this can bring you back to solid ground. Give it a shot!

Commenter Photo

Dan Sickles - Aug 4, 2014 11:45 PM

Java is one of the few high level languages without a similar construct. Tuples in Python and Scala for example. Tuples and pattern matching had me giddy the other day.  I don't get much giddy these days ;-)

Here's to Tuples in Java 9 (after tail calls of course)

I do love me some C though.

Commenter Photo

Jesse Gallagher - Aug 5, 2014 10:05 AM

Definitely - pattern matching is the sort of thing you discount when you don't have it (blub style) and tuples just seem awkward and weird at first, but you really start to miss them after using them for a bit. Java's weird about this sort of thing, too - it has nods to low-level programming (like the primitive types), but then misses natural structs, unsigned numbers (my code is riddled with bitwise-ANDs to "upgrade" unsigned values to a larger type just so the arithmetic works out), and so forth.

New Comment