Resilient and temporally correct dimension Id’s for fact as well as dimension tables
In a dimensional data model based on a Data Vault data model, are integer values still necessary as dimension Id’s? And if so, how can the dimension Id’s be provided correctly?
This is the question being asked by the Data Management Center of Excellence (DMCE) at FastChangeCo™. FastChangeCo™ is a fictitious company struggling with everyday challenges in and with the world of data. The DMCE team aims to make FastChangeCo™ fit for the future with new ideas and solutions. With all the uncertainties that the future holds.In a new use case that is about to be implemented, the DMCE team is confronted with the question outlined at the introduction: This is because, depending on the technologies used for the applications in the frontend to access the high-quality and temporal data in their data warehouse, different –technology-dependent– structures are required for an outgoing interface.
Structures in the form of an OASI table, a snowflake schema or the 3rd normal form are just a few examples that are currently used as outgoing interfaces at FastChangeCo™. Another variant, often used by the DMCE team, are structures in the form of a Star Schema. Here, dimension Id’s (also called surrogate keys or primary keys) are elementary components in the dimensions to link facts (fact tables – e.g. retail sales) with the associated descriptive context (dimension tables – e.g. product, date, store or customer) [KiR13].
The dimensional data models (Star Schemas) commonly created at FastChangeCo™ [TED] incorporate dimensions with different variations of the same data – such as contextual information about products from the current, today's perspective (SCD1, As-Is), or with a history about the products (SCD2, As-Was).
Dimension Id’s for FastChangeCo?
At FastChangeCo™, it was decided at the beginning of the current data warehouse project to use hash values as surrogate keys in the implementation of Data Vault. In the further course of the project, the DMCE team decided that hash values would also be used as surrogate keys for outgoing interfaces in the implementation [LiO16].
The DMCE team repeatedly questioned whether using hash values derived from the business key and an eventual timestamp is the only way to generate dimension Id’s. For example, the DMCE team has run performance tests with the multiple front-end application tools used at FastChangeCo™, which showed that integer values for dimension Id’s are more appropriate –than hash values or natural business keys– to provide downstream applications with data in some scenarios.
A new use case
However, Michael Mueller, Head of Data Management at FastChangeCo™, wonders "Where to take from, if not to steal?" – and means, in a metaphorical sense, the integer values for the dimension Id’s.
Michael would like to work with Xuefang Kaya, a young and ambitious data modeler in the DMCE team, and Philomena Pavlovic, an experienced application developer and business analyst, to first develop the requirements for an adequate solution for the new outbound interface needed.
Together, they develop an overview of the requirements that meet both the 'needs' for the technologies used in the use case and the business requirements. Completely decoupled from whether the data already exists in the data warehouse or not.
An extract:
- Content attributes and data formats for the interface structure
- Temporal consistency (As-Is and As-Was), as a base for auditability
- Defined limit values for the update time of the data in the application
- Integer values as dimension Id’s (for technical reasons related to performance in the application)
- Resilient (stable and resistant) dimension Id’s (for technical reasons in the application), e.g., integer values assigned once must not change during the life cycle of a data record
From these requirements, the DMCE team came up with the essential new specifications that in this use case with a Star schema as the outgoing interface, the integer values must be resilient, traceable, reusable and temporally correct for fact and dimension tables. The only problem is that the integer values required for this do not yet exist in the implemented data solution.
Fig. 1: Definition of the structure of a dimension - outgoing interface
An adequate solution
Except for the dimension Id’s, all data is already available in the raw data. No further effort is expected here to provide the data. To find an adequate solution for dimension Id’s, the team wants to focus on a few Data Vault tables (Fig. 2). This makes it easier to develop ideas and not have to deal with the complexity of a physical data model. To simplify the example, the trio consciously includes the complete time period (Ldts-Ledts, closed-open) in the example sketches and data. This makes it easier for everyone involved to understand about the data as well as its explanation and relationships.
Fig. 2: Data Vault base data model, from [Ler19].
As-Is Dimension-Ids
First, Xuefang considers, we should find a solution for the dimension-Id’s to the As-Is –the current– view on the data. Subsequently, we deal with how an implementation of the historical view is feasible.
Michael likes her suggestion. On the one hand, to proceed step by step and, on the other hand, because the first part of the use case can already be implemented for Philomena and she can go into implementation and testing with her application.
The rationale for a Dimension-Id in the case of the As-Is dimension is that there is a 1:1 relationship between the business key (Registration No) and the Dimension-Id. If we had created our Data Vault structures with a Sequence instead of hash values for the Surrogate Key (DV-SK), the solution would be in place. From this, we deduce that we need an additional Data Object that stores the Sequences –Dimension-Id’s– for us 1:1 in relation to the business key, Xuefang muses. Would a Satellite only for the dimension Id’s, without historization, be the solution?
Michael likes the solution because it creates a stable link between the business key and an integer value for the dimension Id that is immutable, as shown in the sketch they created in Figure 3. In addition, Xuefang sketches the matching dimension with sample data in the design, exactly as they had previously defined the interface together. The AsIsDimId column populates from the data contained in the sStudentAsIsDimId Satellite.
A sample prototype implemented at short notice by the DMCE team successfully confirms the assumptions made in the outlined solution. Thus, the DMCE team has met the requirements for the outgoing interface to map integer values for the dimension Id’s in a resilient, traceable, reusable, and temporally correct manner.
As-Was Dimension-Ids
For the As-Was dimension, the implementation is not quite as simple. Xuefang outlines different variants, but none even comes close to meet the specifications. In two of her numerous technical books ([Gil19] and [Hul12]), she reads about the possibilities of a Point-In-Time (PIT) table, so-called performance tuning tables.
The approach seems promising to Xuefang because a PIT table already 'merges' all the timelines of the Satellites involved in a Hub into a common timeline: exactly what is needed for an As-Was dimension.
Xuefang adds an additional column AsWasDimId to the PIT table PitStudentAsWasDimId to get temporally correct dimension Id’s. As she knows from [KiR13], the following applies here for the dimension Id: When historizing descriptive data of a business key, it is very likely to occur in multiple rows. A continuous surrogate key is necessary to keep the data unique per business key in the dimension.
Michael notes that Data Vault performance objects are defined as non-permanent. They can be deleted and rebuilt again ([Hul12], [LiO16]. This would not meet the requirements for a resilient and reusable dimension Id.
Fig. 3: As-Is Dimension-Id’s and derived dimension
Philomena again explains the need for this requirement: A different dimension Id for the same business key can lead to inconsistencies in some of the applications used. The background to this, she explains, is that the applications cache the business keys with the associated dimension Id’s, but do not continuously match or update them with the interface data – for example, if a newly assigned dimension Id means that sales are suddenly assigned to other products because the dimension Id’s between the interface and application or dimension and facts are not synchronized and consistent.
With this information from Philomena, Xuefang considers a new variant (Fig. 5) in which the PIT table for generating the dimension Id’s no longer plays a role. Therefore, Xuefang has removed the PIT table from the sketch. For performance reasons, a PIT table is still conceivable here.
Fig. 4: As-Was Dimension-Id’s in PIT and derived dimension.
With the new variant, a stable As-Was dimension Id is possible – independent of the number of Satellites providing data for the dimension. Now even an As-Was dimension would be possible only with the data from sStudent. Just like the required combination of data from the two Satellites sStudent and sAddress. The formula Serial Id * 100 + Serial Id shows simplified the way to generate a common AsWasDimId.
Adding a Serial Id in each Satellite makes each historized version of the context data unique to a business key. Combining the Serial Id's present in the Satellites in a clever way results in a resilient, trackable, reusable and temporally correct Dimension Id for all business keys, Xuefang explains to both of them.
Philomena is intrigued. Surely this means that if we combine this idea with our approach to As-Is Dimension-Id's, we would have all the conditions met for our use-case requirements, right? An interface equally resistant to changes in the dimension-Id's, for both the As-Is and the As-Was dimension!
Michael and Xuefang agree with Philomena and draft the mutual design concept for this use case to present to the DMCE team:
The DMCE team has a question about the design concept: Why don't we take the Serial Id as AsIsDimId in the DimStudentAsIs dimension? Xuefang and Michael already discussed this question with their data modeling coach: in the case of the As-Is dimension, we always have to use the latest dataset. This does not meet our requirements for interfacing with a resilient dimension id. This is because each historization inevitably creates a new number in the Serial Id column.
Fig. 5: As-Was Dimension-Id’s in Satellites, combined in the dimension.
The final interface
From the DMCE team's point of view, the solution found –with the additional Satellite sStudentAsIsDimId and the additional Serial Id column in the existing Satellites– is the right approach for the given use case.
They are aware that completely different approaches are necessary in a different business context, different prerequisites in the physical data models or even a different technology used. It is therefore very important for the team to have regular coaching with a third and independent person. This enables the team on the one hand to question their own ideas and possibly deadlocked views, and on the other hand to further build up or deepen knowledge within the team. To be fit for a future that is known to be uncertain.
For example, Roelant Vos has also described the problem in the blog post Deterministic dimension keys in a virtual Data Vault.
The two ideas combined could be a possible solution for one of the upcoming projects at FastChangeCoTM.
Philomena Pavlovic is more than pleased with the result the DMCE team has delivered to her for her new dashboard. Users of the dashboard have the option to choose between a current and historical view of events and the queries, thanks to the new method of using numbers as dimension Id’s, are resilient to changes in the data and perform sufficiently well for the given requirements.
For Philomena, working with the DMCE team, listening to their requirements, both business and technical, is the key to success –if not the critical success factor– in implementing this use case.
Fig. 6: The final design concept for implementing the outgoing interface
Literatur
[Gil19] Giles, J.: The elephant in the fridge. TechnicsPub 2019 [Hul12] Hultgren, H.: Modeling the agile data warehouse, New Hamilton 2013 [KiR13] Kimball, R. / Roos, M.: The Data Warehouse Toolkit. 3. Aufl., Wiley 2013 [Ler19] Lerner, D.: TEDAMOH Academy – Temporal data in a fast-changing world. 2019 [LiO16] Linstedt, D. / Olschimke, M.: Building a scalable Data Warehouse. MK 2016 [TED] FastChangeCo – a fictitious company. https://tedamoh.com/de/tedamoh/238-tedamoh/88-fastchangeco-afictitious-company, abgerufen am 19.7.2022