Replicating Domino to Azure Data Lake With Darwino
Mon May 10 10:06:09 EDT 2021
Though Darwino is an app-dev platform, one of its main ongoing uses has been reporting on Domino data. By default, when replicating with Domino, Darwino uses its own table layout in its backing SQL database, making use of the various vendors' native JSON capabilities to store documents in a way analogous to Domino. Once the data is there, even if you don't actually build any apps on top of it, it's immediately useful for querying at scale with various reporting tools: BIRT, Power BI, Crystal Reports, what have you. Add in some views and, as needed, extra indexes, and you have an extraordinarily speedy way to report on the data.
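Just to make that concrete, here's a hedged sketch of what a report query against the replicated store can look like over plain JDBC, assuming a PostgreSQL backend. The schema and column names here (dominodata.docs, json_doc) are placeholders for illustration, not Darwino's actual table layout.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ReplicaReportExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details - adjust for your environment
        String jdbcUrl = "jdbc:postgresql://localhost:5432/darwino";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, "reporter", "secret")) {
            // PostgreSQL's jsonb operators let us pull Domino item values directly;
            // "json_doc" stands in for whatever column holds the replicated document JSON
            String sql = "SELECT json_doc->>'subject' AS subject, json_doc->>'status' AS status "
                + "FROM dominodata.docs WHERE json_doc->>'form' = ?";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setString(1, "Contact");
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("subject") + " - " + rs.getString("status"));
                    }
                }
            }
        }
    }
}
```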
Generic Replication
But one of the neat other uses comes in with that "by default" above. Darwino's replication engine is designed to be thoroughly generic, and that's how it replicates with Domino at all: since an adapter only needs to implement a handful of classes to expose the data in a Darwino-friendly way, the source or target's actual storage mechanism doesn't matter overmuch. Darwino's own storage is just "first among equals" as far as replication is concerned, and the protocol it uses to replicate with Domino is the same as it uses to replicate among multiple Darwino app servers.
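To give a sense of what such an adapter involves, here's a minimal sketch in Java. The interface and classes are hypothetical stand-ins invented for illustration - Darwino's real API is richer - but the shape of the work is roughly this: emit documents as JSON, accept them, and keep enough metadata to replicate incrementally.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a replication-target contract; Darwino's actual
// interfaces differ, but the responsibilities are roughly these.
interface ReplicationTarget {
    // Return documents changed since the given point, for delta replication
    List<Map<String, Object>> changesSince(Instant since);

    // Accept a JSON-ish document (here just a Map) from the other side
    void saveDocument(String unid, Map<String, Object> jsonDocument);

    // Record how far replication got, so the next run can be incremental
    void markReplicated(Instant asOf);
}

// A trivial in-memory implementation, showing how little a target needs to know
// about where the data originally came from.
class InMemoryTarget implements ReplicationTarget {
    private final Map<String, Map<String, Object>> store = new java.util.HashMap<>();
    private Instant lastReplicated = Instant.EPOCH;

    @Override
    public List<Map<String, Object>> changesSince(Instant since) {
        // A real target would consult its own change metadata here
        return List.copyOf(store.values());
    }

    @Override
    public void saveDocument(String unid, Map<String, Object> jsonDocument) {
        store.put(unid, jsonDocument);
    }

    @Override
    public void markReplicated(Instant asOf) {
        this.lastReplicated = asOf;
    }
}
```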
From time to time, we get a call to make use of this adaptability to target a different backend. In this case, a customer wanted to be able to push data to Azure Data Lake, which is a large-scale filesystem-like storage service Microsoft offers. The idea is that you get your gobs of data into Data Lake one way or another, and then they have a suite of tools to let you report on and analyze it to your heart's content. It's the sort of thing that lets businesspeople make charts to point to during meetings, so that's nice.
This customer had already been using Azure's SQL Server services for "normal" Darwino replication from Domino, but wanted to avoid doing something like having a script to transform the SQL data into Data-Lake-friendly formats after the fact. So that's where the custom adapter came in.
The Implementation
Since the requirement here was just going one way, from Domino to Data Lake, that took a bit of the work off our plate. It wouldn't be particularly onerous conceptually to write a mechanism to go the other way - mostly, it'd be a matter of finding an efficient way to identify changed documents - but the "loose" filesystem concept of Data Lake would make tracking those changes and dealing with arbitrary user-modified data weird.
The only real requirements for a Darwino replication target are that you can represent the data in JSON in some way and that you are able to identify changes for delta replication. That latter one is technically a soft requirement, since an adapter could in theory re-replicate the entire thing every time, but it's worlds better to be able to do continuous small replications rather than nightly or weekly data dumps. In Darwino's own storage, this is handled by indexed columns to find changes, while with Domino it uses normal NSFSearch-type capabilities to find modifications since a specific date.
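To illustrate that Domino-side approach - this is just a sketch of the general technique, not Darwino's actual code - the standard Domino Java API exposes it through Database.search with a cutoff date:

```java
import lotus.domino.Database;
import lotus.domino.DateTime;
import lotus.domino.Document;
import lotus.domino.DocumentCollection;
import lotus.domino.NotesException;
import lotus.domino.Session;

public class DeltaScanExample {
    // Finds documents modified since the last replication time; the caller is
    // assumed to have an open Session/Database from a NotesThread or agent context.
    public static void processChangesSince(Session session, Database database, java.util.Date lastRun)
            throws NotesException {
        DateTime cutoff = session.createDateTime(lastRun);
        try {
            // "@All" selects every document; the cutoff date restricts the result
            // to documents modified since the previous replication pass
            DocumentCollection changed = database.search("@All", cutoff, 0);
            Document doc = changed.getFirstDocument();
            while (doc != null) {
                Document next = changed.getNextDocument(doc);
                // A replication pass would convert the document to JSON here
                // and hand it to the target adapter
                System.out.println("Changed: " + doc.getUniversalID());
                doc.recycle();
                doc = next;
            }
        } finally {
            cutoff.recycle();
        }
    }
}
```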
Data Lake is a little "metadata light" in this way, since it represents itself primarily as a filesystem, but the lack of need to replicate changes back meant I didn't have to worry about searching around for an efficient call. I settled on a basic layout:
Within the directory for each NSF, there are a few entities:

- darwino.json keeps track of the last time the target was replicated to, so I can pick up on that for delta replication in the future
- docs houses the documents themselves, named like "(unid).json" and containing the converted JSON content of the Domino documents
- attachments has subfolders named for the UNIDs of the documents referenced, containing the attachments and embedded images, with names prefixed by the fields they're from
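For the Data Lake side, writing that layout out is fairly direct with the azure-storage-file-datalake SDK. This is a hedged sketch of the structure above rather than the adapter's actual code, and the account, key, file system, and document values are placeholders.

```java
import com.azure.storage.common.StorageSharedKeyCredential;
import com.azure.storage.file.datalake.DataLakeDirectoryClient;
import com.azure.storage.file.datalake.DataLakeFileClient;
import com.azure.storage.file.datalake.DataLakeFileSystemClient;
import com.azure.storage.file.datalake.DataLakeServiceClient;
import com.azure.storage.file.datalake.DataLakeServiceClientBuilder;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class DataLakeLayoutExample {
    public static void main(String[] args) {
        // Placeholder account, key, and file-system names - substitute your own
        DataLakeServiceClient service = new DataLakeServiceClientBuilder()
            .endpoint("https://exampleaccount.dfs.core.windows.net")
            .credential(new StorageSharedKeyCredential("exampleaccount", "example-account-key"))
            .buildClient();
        DataLakeFileSystemClient fileSystem = service.getFileSystemClient("replicated-nsfs");

        // One directory per NSF, holding darwino.json, docs/, and attachments/
        DataLakeDirectoryClient nsfDir = fileSystem.getDirectoryClient("names.nsf");
        if (!nsfDir.exists()) {
            nsfDir.create();
        }

        // A converted document lands in docs/(unid).json
        String unid = "ABCD1234ABCD1234ABCD1234ABCD1234";
        writeFile(nsfDir.getSubdirectoryClient("docs").getFileClient(unid + ".json"),
            "{\"form\":\"Contact\",\"lastName\":\"Example\"}");

        // darwino.json records how far replication got, for the next delta pass
        writeFile(nsfDir.getFileClient("darwino.json"),
            "{\"lastReplicated\":\"" + java.time.Instant.now() + "\"}");
    }

    private static void writeFile(DataLakeFileClient file, String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        // upload with overwrite=true creates or replaces the file at that path
        file.upload(new ByteArrayInputStream(bytes), bytes.length, true);
    }
}
```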
Back on the Domino side, I can set this up in the Sync Admin database the same way I do for traditional Darwino targets, and the Data Lake extension is picked up there. Once it's set up, I can turn on the replication schedule and let it do its thing in the background, and Data Lake will stay in step with the NSFs.
Conclusion
Admittedly, this is immediately of interest just to the client who wanted it, but I felt like it was a neat-enough trick to warrant an overview. It's also satisfying to see the layers working together: the side that reads the Domino data needed no changes to work with this entirely new target, and similarly the core replication engine needed no tweaks even though it's pointing at a fully custom destination.