Usage of “Unknown Member” Rows in a Data Warehouse
Overview: Ideas for usage of Unknown Member rows (also referred to as “predefined members” by some folks) beyond just the default row.
Purpose for the “Unknown Member” Row
A bit of background first: When we load facts to a Kimball-style data warehouse, each fact row is associated to the appropriate dimension row. For instance, FactStudentAttendance fact table may relate to the DimStudent, DimClass, and DimDate – these relationships make the magic happen for slicing & dicing capabilities.
However … what happens when we have a fact that cannot be associated to a valid dimension row? It usually is mapped to an “unknown member” row in the dimension. A surrogate key of –1 is often used for the purpose, but it certainly could be any number. (Although, for unknown members, I do like to specify the values rather than let the database generate them. I always use negative numbers for this purpose, for clarity.)
Why bother with the unknown member row? Because it allows the measurement data to still be populated in the fact table – meaning your totals & counts will still be accurate, even if the details cannot all be fully described by the dimensional data. It also reduces the risk of dropping data when inner joins are utilized in queries. Basically, this process gives orphan fact rows a temporary home (hint hint: temporary, meaning that a process should exist to clean up the data quality issues).
The Default “Unknown Member” Row
I typically use –1 as my default “unknown member” row in all dimensions. I know that any facts associated to –1 represent an issue that needs to get resolved. However, you’re not restricted to just one unknown row in a dimension. Sometimes it offers value to have several “flavors” of unknown member rows.
Using an Additional “Unknown Member” Row for a N/A Situation
Besides the default row, you could have a row to map to for data which just doesn’t exist for that fact, but it’s not a problem.
Let’s say we have a FactClassAttendance table which has two relationships back to DimTeacher: one for the primary instructor, and another for the secondary instructor. In situations where we don’t have a secondary instructor, I wouldn’t set that fact record to –1; rather, I’d set it to a –2 row which is set to mean “No Instructor.” Personally, I prefer to have a fixed dimensional row to explicitly state this circumstance, versus allowing the foreign key in the fact to be null.
This also works well for a parent/child relationship when some rows have “No Parent.”
In this situation, just make sure you’re not messing with the granularity of your fact table.
Using Additional “Unknown Member” Rows When We Know Something about the Data
Another situation which warrants more than just the default row is when we have some data elements, but not everything we need to identify the dimension row. Let’s say we have a FactSurvey table. Maybe the student taking the survey was agreeable to providing their identity, but maybe not. Perhaps they wanted to be anonymous, but would fill in their gender. In this case, we could utilize DimStudent rows such as:
-1: Unknown Student (i.e., this is a data quality issue to be fixed)
-2: No Student (i.e., the survey was taken by someone other than a student, so an N/A status is okay)
-3: Anonymous Student (i.e., differentiated from the –1 so we know it’s not a problem)
-4: Male Student
-5: Female Student
For me, the main benefits to having something other than the default (-1) row are that:
- We can still do some amount of analysis on the data (such as slicing the responses from male versus female students) if there’s requirements to do so, and
- We can differentiate between what is a data quality issue versus a “no data” issue which doesn’t need further follow up.