Querying Documents With Different Structures in Azure Cosmos DB
This is a quick post to share how we can use the coalesce operator in Azure Cosmos DB (which is a schema-free, NoSQL database formerly known as DocumentDB) to handle situations when the data structure varies from file to file. Varying data structure is a common issue in big data and analytics projects. A schema-free database like DocumentDB allows us to ingest and store the data with varying structures without a lot of upfront effort. However, accommodating these varying data structures is challenging later when we want to analyze the data. When querying the data (think Schema on Read here), I do need to impose a consistent structure on the data to perform analytics.
Following is a highly simplified example which shows how I have a PrevVal in one document, but not the other:
Here are the results if I do a simple select statement in the Azure DocumentDB Query Explorer:
See how with the simple select above I get the PrevVal (aliased to PreviousDataValue) property returned (in results on the right) for Document1 but not Document2? No big surprise there of course. However, what I really want is a "standardized" structure that is consistent across all documents so that I can perform analytics on the data, and potentially store the valuable data in my data warehouse.
Enter coalesce. Coalesce checks for the existence of a property inside of a document. This makes it easier to deal with properties that don't *always* exist in all documents.
Here's what happens when I add the ?? coalesce operator to our select query:
In the above example, I've told the coalesce operator to return a 0 if the PrevValue doesn't exist. And...shazam! We now have a standardized structure output on the right. The trick to making this work of course is including every possible property that could possibly come through in the set of JSON documents.
If missing data values like PrevValue is an issue, you can find specific documents which are missing a specific property by using NOT IS_DEFINED like this:
I've found it very helpful to be able to 'standardize' data from file to file by using the coalesce operator in DocDB. I also like the ability to jump immediately to nested levels of the data without having to do extra work of parsing out levels.
Finding More Information
SQL Query and SQL Syntax in DocumentDB <--Note only a subset of SQL is supported in DocDB
You Might Also Like...
Data Lake Use Cases and Planning Considerations
Setting Up Azure Disk Encryption for a Virtual Machine with PowerShell
Where Azure Analysis Services Fits Into BI & Analytics Architecture