Working with Rich Text's MIME Structure
Wed Jul 08 20:28:19 EDT 2015
My work lately has involved, among other things, processing and creating MIME entities in the format used by Notes for storage as rich text. This structure isn't particularly complicated, but there are some interesting aspects to it that are worth explaining for posterity. Which is to say, myself when I need to do this again.
As a quick primer, MIME is a format originally designed for email which has proven generally useful, including for HTTP and, for our needs, internal storage in NSF. Like many things in programming, it is organized as a tree, with each node consisting of a set of headers (generally, things like "Content-Type: text/html"), content, and children.
Domino stores the text part of rich text in MIME as HTML. In the simplest case, this ends up a one-element "tree", which you can see in the document's properties dialog:
Content-Type: text/html; charset="US-ASCII"

<font size=2 face="sans-serif">Hello <b>there</b></font>
There's slightly more to its full storage implementation (like the MIME_Version item), but the MIME Part items are the important bits. This simple structure can be abstracted to this tree:
- text/html
Things get a little more complicated when you add embedded images and/or attachments. When you do either of those, the MIME grows to multiple items and becomes a multi-node tree.
Embedded Images
When you add an embedded image in the rich text field, the storage grows to four same-named MIME Part items. Concatenated (and clipped for brevity), the items then look like:
Content-Type: multipart/related; boundary="=_related 006CEB9D85257E7C_="

This is a multipart message in MIME format.
--=_related 006CEB9D85257E7C_=
Content-Type: text/html; charset="US-ASCII"

<font size=3>Here's a picture:</font>
<br>
<br><img src=cid:_2_0C1832A80C182E18006CEB9885257E7C style="border:0px solid;">
<br>
<br><font size=3>Done.</font>
--=_related 006CEB9D85257E7C_=
Content-Type: image/jpeg
Content-ID: <_2_0C1832A80C182E18006CEB9885257E7C>
Content-Transfer-Encoding: base64

*snip*
--=_related 006CEB9D85257E7C_=--
You can see the same sort of HTML block as before contained in there, but it sprouted a lot of other stuff. To begin with, the starting part turned into "multipart/related". The "multipart" denotes that the top MIME entity has children, and the "related" is used when the children consist of an HTML body and inline images. There are delimiters used to separate each part, using the auto-generated convention of "related" plus an effectively-random number. The image itself is represented as a MIME Part of its own, in this case stored inline and Base64-encoded (it can be shifted off to an attachment by Notes/Domino after a certain size). This structure can be abstracted to:
- multipart/related
  - text/html
  - image/jpeg
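To check a tree like this against real data, here is a minimal sketch using Python's standard email package rather than the Notes API. The RAW string is a trimmed stand-in for the concatenated MIME Part items (the Base64 payload is a placeholder, not real image data), and the outline helper is a name I made up for illustration:

```python
from email import message_from_string

# Trimmed stand-in for the concatenated MIME Part items shown above.
# The base64 payload ("AAAA") is a placeholder, not real image data.
RAW = """\
Content-Type: multipart/related; boundary="=_related 006CEB9D85257E7C_="

This is a multipart message in MIME format.
--=_related 006CEB9D85257E7C_=
Content-Type: text/html; charset="US-ASCII"

<font size=3>Here's a picture:</font>
--=_related 006CEB9D85257E7C_=
Content-Type: image/jpeg
Content-ID: <_2_0C1832A80C182E18006CEB9885257E7C>
Content-Transfer-Encoding: base64

AAAA
--=_related 006CEB9D85257E7C_=--
"""


def outline(entity, depth=0):
    """Return one indented line per MIME entity, parents before children."""
    lines = ["  " * depth + "- " + entity.get_content_type()]
    if entity.is_multipart():
        # For multipart entities, get_payload() returns the list of children.
        for child in entity.get_payload():
            lines.extend(outline(child, depth + 1))
    return lines


tree = outline(message_from_string(RAW))
print("\n".join(tree))
```

The same helper works unchanged on the attachment and combined structures, since it only relies on each entity's content type and children.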
The HTML is designed so that there is an image tag that references the attached image using a "cid" URL, an email convention that basically means "find the entity in this related MIME structure with the following content ID" - you can then see the content ID reflected in the JPEG MIME Part. This sort of URL doesn't fly on the web, so anything displaying this field on a web page (or otherwise converting it to a non-MIME storage format) needs to translate that reference to something appropriate for its needs.*
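As a rough illustration of that translation step, the sketch below rewrites "cid" references in src attributes into ordinary URLs. The "/images/" prefix and the rewrite_cid_refs name are assumptions made up for this example; a real application would substitute whatever URL scheme it uses to serve the image parts. Note that the Content-ID header wraps the ID in angle brackets while the "cid" URL does not.

```python
import re

# Matches src=cid:ID with or without quotes around the value, as in the
# unquoted form Notes generated above.
CID_REF = re.compile(r"""(\bsrc\s*=\s*)(["']?)cid:([^"'\s>]+)\2""", re.IGNORECASE)


def rewrite_cid_refs(html, url_for_cid):
    """Replace each cid: reference with the URL returned by url_for_cid(content_id)."""
    def swap(match):
        # Always emit a quoted attribute value, whatever the input used.
        return '{}"{}"'.format(match.group(1), url_for_cid(match.group(3)))
    return CID_REF.sub(swap, html)


html = '<br><img src=cid:_2_0C1832A80C182E18006CEB9885257E7C style="border:0px solid;">'
# Hypothetical URL scheme: serve each image part under /images/<content id>.
rewritten = rewrite_cid_refs(html, lambda cid: "/images/" + cid)
print(rewritten)
```

A regular expression is enough for the markup Notes generates here; arbitrary HTML would be safer to run through a real parser.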
Attachments
When you have a rich text field with an attachment (in this case without the embedded image), you get a very similar structure:
Content-Type: multipart/mixed; boundary="=_mixed 006EBF7C85257E7C_="

This is a multipart message in MIME format.
--=_mixed 006EBF7C85257E7C_=
Content-Type: text/html; charset="US-ASCII"

<font size=3>Here's an attachment: <br> </font>
<br>
<br><font size=3><br> Done. </font>
--=_mixed 006EBF7C85257E7C_=
Content-Type: application/octet-stream; name="cert.cer"
Content-Disposition: attachment; filename="cert.cer"
Content-Transfer-Encoding: binary

cert.cer
--=_mixed 006EBF7C85257E7C_=--
The structure is the same sort of tree as previously, but the "related" content sub-type has changed to "mixed". This indicates that there are multiple types of content, but they're conceptually distinct. In any event, the tree looks like:
- multipart/mixed
  - text/html
  - application/octet-stream
"application/octet-stream" is a generic MIME type for, basically, "bag of bytes" - MIME-based tools use it when they either don't know the content type or, as in this case, don't care. Here, Notes/Domino splits out the content to be an NSF-style attachment and then references that in the MIME - this is an implementation detail, though, as the API returns the value regardless.
This also highlights a minor limitation in rich text storage: attachments do not have an inline representation in the HTML, and so they are always moved to the end of the field in Notes. At first, I was peeved by this limitation, but it makes a sort of sense: cid references are really about images, and I guess Lotus didn't want to override that for use in normal link elements.
That brings us to the final potential structure you're likely to run across:
Embedded Images And Attachments
When you include both embedded images and attachments, things get slightly more complicated. I'll skip the raw MIME and go straight to the tree:
- multipart/mixed
  - multipart/related
    - text/html
    - image/jpeg
  - application/octet-stream
So this becomes a combination of the two formats, and a bit of logic emerges. In Notes's structure, "multipart/mixed" always contains two or more children, and the first one is the textual body, whatever form that may take. One of those forms is just a single-part "text/html", and the other is a "multipart/related" subtree containing the "text/html" and one or more images.
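That logic can be sketched in a few lines. The example below uses Python's standard email package, not the Notes API, and the split_rich_text name is made up for illustration. It assumes the entity follows one of the three shapes described here and does no validation beyond that; the sample tree is built with placeholder bytes rather than a real image or certificate.

```python
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def split_rich_text(root):
    """Return (html_part, inline_images, attachments) for a Notes-style tree.

    Assumes one of the three shapes above: bare text/html, multipart/related,
    or multipart/mixed whose first child is either of those.
    """
    body, attachments = root, []
    if root.get_content_type() == "multipart/mixed":
        children = root.get_payload()
        body, attachments = children[0], children[1:]
    images = []
    if body.get_content_type() == "multipart/related":
        children = body.get_payload()
        body, images = children[0], children[1:]
    return body, images, attachments


# Build the combined structure with placeholder content.
related = MIMEMultipart("related")
related.attach(MIMEText("<font size=3>Hello</font>", "html"))
related.attach(MIMEImage(b"placeholder", _subtype="jpeg"))
mixed = MIMEMultipart("mixed")
mixed.attach(related)
mixed.attach(MIMEApplication(b"placeholder"))  # defaults to application/octet-stream

body, images, attachments = split_rich_text(mixed)
print(body.get_content_type(), len(images), len(attachments))
```

Peeling off "multipart/mixed" first and "multipart/related" second means the same function handles the single-part and image-only cases without special branches.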
Once you get a feel for these structures, it makes the task of reading and creating Notes-alike MIME items much less daunting. There are a number of other concerns I've been dealing with as well (such as the conversion of composite-data rich text to HTML and how there are two ways to do it), and maybe I'll make a followup post at some point about those.
* As a minor note on this point, it's an area where the Notes client and XPages diverge slightly. The Notes client (which generated the example above) leaves inline images "nameless" - they contain no "Content-Disposition" header and no name in the "Content-Type", instead sticking with just the "Content-ID" for identification. With XPages, however, presumably because it has filename information during the upload process, the result still contains (and is referenced by) the "Content-ID" value, but it also contains a line like:
Content-Disposition: inline; filename="foo.jpg"
This functions the same way for most purposes, but it may be significant. For example, if you happen to write processing code that uses the presence or absence of the "Content-Disposition" header as an indicator of whether a part is an attachment, knowing this ahead of time could save you a certain amount of headache. The right way to do it is to treat a part as inline when the header is either missing or has a basic value of "inline", and as an attachment only when the value is "attachment".
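A minimal sketch of that check, again using Python's standard email package rather than the Notes API (the is_attachment name and the three sample parts are just for illustration):

```python
from email.message import Message


def is_attachment(part):
    """True only when Content-Disposition has the basic value "attachment".

    get_content_disposition() returns the lowercased value without its
    parameters, or None when the header is missing.
    """
    return part.get_content_disposition() == "attachment"


# Notes client style: no Content-Disposition at all.
nameless = Message()
nameless["Content-Type"] = "image/jpeg"

# XPages style: inline, but with a filename.
named_inline = Message()
named_inline["Content-Type"] = "image/jpeg"
named_inline["Content-Disposition"] = 'inline; filename="foo.jpg"'

# A real attachment.
attached = Message()
attached["Content-Type"] = 'application/octet-stream; name="cert.cer"'
attached["Content-Disposition"] = 'attachment; filename="cert.cer"'

results = [is_attachment(p) for p in (nameless, named_inline, attached)]
print(results)
```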