Reference Data Stage Pipeline
With release 2.11.0, the ExMeX Framework Core introduced a new table stereotype for modeling a Reference Data Stage Pipeline within the data warehouse.
To use this function within PowerDesigner, at least version 2.0.0 of the plugin PowerDesigner Extension - TEDAMOH ExMeX Data Vault is required!
To populate non-historized Reference Data Stage tables, the ExMeX Framework Core introduced a new Load Process Step (LPS) pattern (LoadPatternId = 18).
- After data modelling in PowerDesigner, regenerate the metadata.
- After generating the metadata and adapting the scripts, install the metadata.
Reference Data Stage Pipeline (General)
Which Stereotypes for Reference Data Stage are available?
Table type: Reference - Stores all loaded non-historized reference data.
- Table stereotype: SL - Reference Zone
This table stores (via insert, update, and delete) all current reference data that was available within the current full load. The data is not versioned.
Hint: For historized reference data it is recommended to use Data Vault patterns in the History Zone of the Stage Layer.
- Mandatory column: Audit Trail Id (Name and Code as configured)
This column stores an Id that ties the loaded data to the ExMeX Logging Framework.
- Mandatory column: Status CDC (Code: StatusCDC)
This column stores a flag generated by the Reference Data Stage LPS: I, U
In the case of the Reference Data Stage LPS, only I and U are used. To ensure data quality, it is recommended to use a domain for the Status CDC column (see also the following topics).
How to design a domain Status CDC?
To ensure data quality it is recommended to use a domain for the Status CDC column. Set up a domain with the following characteristics.
See section How to design a domain Status CDC? in Chapter Change Data Capture Pipeline.
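Where no PowerDesigner domain is in place, the same rule can also be enforced directly in the database. The following sketch is illustrative only (the table name is hypothetical, not part of the framework) and shows a CHECK constraint restricting Status CDC to the flags written by the Reference Data Stage LPS:

```sql
-- Illustrative only: enforce the Status CDC value list with a CHECK constraint.
-- The Reference Data Stage LPS writes only 'I' (insert) and 'U' (update).
ALTER TABLE Stage.RefCurrency  -- hypothetical SL - Reference Zone table
ADD CONSTRAINT CK_RefCurrency_StatusCDC
    CHECK (StatusCDC IN ('I', 'U'));
```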
Any background information?
DAPI
The DAPI MetadataZoneDapi.MetadataForGenerateStageReference is the basis for generating the LPS with the LoadPatternId 18.
Generator
The script for generating the LPS with LoadPatternId 18 is Reference-Stage.sql
Logging
Because of how the MERGE SQL statement logs data changed in the target, the ExMeX Logging Framework can display only the sum of inserted, updated, and deleted rows in the RowsInserted column of the FctProcessStepVapourTrail in the MetadataZone-Mart.
Quote from the Microsoft SQL Docs:
When used after MERGE, @@ROWCOUNT (Transact-SQL) returns the total number of rows inserted, updated, and deleted to the client.
Resources
- Microsoft Docs: MERGE (Transact-SQL)
- MSSQLTips: Using MERGE in SQL Server
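One way to obtain per-action counts from a MERGE statement, instead of the summed @@ROWCOUNT, is the OUTPUT clause with $action. The following sketch is illustrative only (table and column names are hypothetical) and does not claim to be the framework's implementation:

```sql
-- Illustrative sketch: count inserted/updated/deleted rows of a MERGE
-- separately by capturing $action via the OUTPUT clause.
DECLARE @actions TABLE (MergeAction NVARCHAR(10));

MERGE INTO Stage.RefCurrency AS target
USING Stage.VolatileCurrency AS source
    ON target.CurrencyCode = source.CurrencyCode
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CurrencyCode, CurrencyName)
    VALUES (source.CurrencyCode, source.CurrencyName)
WHEN MATCHED AND target.CurrencyName <> source.CurrencyName THEN
    UPDATE SET target.CurrencyName = source.CurrencyName
WHEN NOT MATCHED BY SOURCE THEN
    DELETE
OUTPUT $action INTO @actions;

-- One row per action ('INSERT', 'UPDATE', 'DELETE') with its row count
SELECT MergeAction, COUNT(*) AS ActionCount
FROM @actions
GROUP BY MergeAction;
```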
With release 2.13.0 this issue is solved: inserted, updated (changed), and physically deleted rows are logged correctly!
Reference Data Stage Pipeline (Non-Historized)
How to design a reference table?
Table of type SL - Volatile Zone
As always, a table of type Volatile Zone is required to load the data 1:1 from the source into the Stage Layer. For the Reference Data Stage Pipeline there are no special requirements to consider. The following picture shows an example of a Volatile Zone table.
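A minimal Volatile Zone table might look like the following sketch. All names and columns are illustrative; the actual structure is a 1:1 copy of the respective source:

```sql
-- Illustrative SL - Volatile Zone table: a plain 1:1 copy of the source data,
-- with no special requirements for the Reference Data Stage Pipeline.
CREATE TABLE Stage.VolatileCurrency (
    CurrencyCode NVARCHAR(3)   NOT NULL,
    CurrencyName NVARCHAR(100) NULL,
    ExchangeRate DECIMAL(18,6) NULL
);
```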
Table of type SL - Reference Zone
To populate all data stored in the previously shown Volatile Zone table into a non-historized reference data table, create a new table with the stereotype SL - Reference Zone.
A table of type SL - Reference Zone stores (via insert, update, and delete) all current reference data that was available within the full load. The data is not versioned.
This table can contain as many columns as the Volatile Zone table, or fewer. This depends on the design decisions of the data modeler and the requirements for the reference table.
Two additional (metadata) columns are mandatory:
- Audit Trail Id (Name and Code as configured)
- Status CDC
The metadata column Status CDC was already introduced in the section Change Data Capture Pipeline. It is reused here because it serves exactly the same purpose: to document the change. This column shows whether the row in the reference table is the initial insert (StatusCDC = 'I') or has already been updated (StatusCDC = 'U').
The primary key corresponds to the Compare Key described in the section Change Data Capture Pipeline.
Data is populated into the reference table for the following reasons, identified as follows:
- New data: The primary key does not exist in the reference table. The data will be inserted into the reference table.
- Changed (context) data: Changes are identified within all other columns, i.e. the non-metadata and non-primary-key columns. The data will be updated in the reference table.
- Deleted data: The primary key no longer exists in the Volatile Zone table. The data will be deleted from the reference table.
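The three cases above map naturally onto the branches of a MERGE statement. The following sketch is illustrative only (table and column names are hypothetical, and this is not the generated LPS code):

```sql
-- Illustrative sketch of the LoadPatternId 18 logic (not the generated code):
-- match on the primary key, insert new rows with StatusCDC = 'I',
-- update changed rows with StatusCDC = 'U', delete vanished rows.
MERGE INTO Stage.RefCurrency AS target
USING Stage.VolatileCurrency AS source
    ON target.CurrencyCode = source.CurrencyCode           -- primary key / Compare Key
WHEN NOT MATCHED BY TARGET THEN                            -- new data
    INSERT (CurrencyCode, CurrencyName, StatusCDC)
    VALUES (source.CurrencyCode, source.CurrencyName, 'I')
WHEN MATCHED AND target.CurrencyName <> source.CurrencyName THEN  -- changed context data
    UPDATE SET target.CurrencyName = source.CurrencyName
             , target.StatusCDC    = 'U'
WHEN NOT MATCHED BY SOURCE THEN                            -- deleted data
    DELETE;
```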
Hint: For historized reference data it is recommended to use Data Vault patterns in the History Zone of the Stage Layer.
The following picture shows two examples of a reference table: one with a composite primary key including the column Valid From to detect bitemporal reference data changes, and one with a single-column primary key.
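The bitemporal variant could be sketched as follows. All names are illustrative; only the two mandatory metadata columns and the composite primary key matter here:

```sql
-- Illustrative SL - Reference Zone table with a composite primary key
-- (ReferenceCode + ValidFrom) and the two mandatory metadata columns.
CREATE TABLE Stage.RefExchangeRate (
    ReferenceCode NVARCHAR(10)  NOT NULL,
    ValidFrom     DATE          NOT NULL,  -- part of the PK to detect bitemporal changes
    RateValue     DECIMAL(18,6) NULL,
    AuditTrailId  BIGINT        NOT NULL,  -- ties rows to the ExMeX Logging Framework
    StatusCDC     CHAR(1)       NOT NULL,  -- 'I' or 'U', written by the LPS
    CONSTRAINT PK_RefExchangeRate PRIMARY KEY (ReferenceCode, ValidFrom)
);
```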
How to design mappings for a reference table?
Mapping table of type SL - Volatile Zone to SL - Reference Zone
The mapping of the columns from a table of type SL - Volatile Zone to one of type SL - Reference Zone is kept simple: connect the columns to be loaded by drag & drop, give the mapping a suitable name, and that's it.
The two additional metadata columns must not be mapped; the Load Process Step takes care of loading these two columns:
- Audit Trail Id (Name and Code as configured)
- Status CDC
The following two pictures show exemplary mappings for a primary key with one column as well as with several columns (composite primary key).
Reference Data Stage Pipeline (Batch)
How to set up a batch for Reference Data Pipeline in Stage?
ExMeX Companion App
Starting with version 0.2.0 of the ExMeX Companion App, the data modeler can create a new batch for the Reference Data Stage Pipeline in the ExMeX Companion App.
ExMeX Framework Core Scripts
As always, the data modeler must add a new batch to the \etc\config\ConfigLoadBatchLogistic.sql file. The following SQL example shows a sample batch for the Reference Data Stage Pipeline.
-- Example Batch for Reference Data in Stage
MERGE INTO MetadataZone.LoadBatch AS target
USING ( SELECT 'Reference Data Stage'
,LoadBatchId
FROM MetadataZone.LoadBatch
WHERE BatchLabel = 'BigOne'
) AS source (
BatchLabel
,ParentLoadBatchId
)
ON (target.BatchLabel = source.BatchLabel)
WHEN MATCHED THEN
UPDATE SET
target.ParentLoadBatchId = source.ParentLoadBatchId
WHEN NOT MATCHED THEN
INSERT (
BatchLabel
,ParentLoadBatchId
)
VALUES (
source.BatchLabel
,source.ParentLoadBatchId
)
;
Next, the data modeler must place the new batch in the dependencies to other batches (file \etc\config\ConfigLoadBatchLogistic.sql). In the following example SQL, the batch is arranged after the Stage Layer load (batch 'Load Stage with SSIS') and before the batch 'Default - Hub'.
-- Stage to Reference Stage
MERGE INTO MetadataZone.LoadBatchDependency AS target
USING ( SELECT prelb.LoadBatchId
       ,suclb.LoadBatchId
  FROM MetadataZone.LoadBatch prelb
 CROSS JOIN MetadataZone.LoadBatch suclb
 WHERE prelb.BatchLabel = 'Load Stage with SSIS'
   AND suclb.BatchLabel = 'Reference Data Stage'
) AS source (
 PredecessorLoadBatchId
,SuccessorLoadBatchId
)
ON (target.PredecessorLoadBatchId = source.PredecessorLoadBatchId
AND target.SuccessorLoadBatchId = source.SuccessorLoadBatchId)
WHEN NOT MATCHED THEN
INSERT VALUES (
 source.PredecessorLoadBatchId
,source.SuccessorLoadBatchId
)
;
-- Stage to Core
MERGE INTO MetadataZone.LoadBatchDependency AS target
USING ( SELECT prelb.LoadBatchId
       ,suclb.LoadBatchId
  FROM MetadataZone.LoadBatch prelb
 CROSS JOIN MetadataZone.LoadBatch suclb
 WHERE prelb.BatchLabel = 'Reference Data Stage'
   AND suclb.BatchLabel = 'Default - Hub'
) AS source (
 PredecessorLoadBatchId
,SuccessorLoadBatchId
)
ON (target.PredecessorLoadBatchId = source.PredecessorLoadBatchId
AND target.SuccessorLoadBatchId = source.SuccessorLoadBatchId)
WHEN NOT MATCHED THEN
INSERT VALUES (
 source.PredecessorLoadBatchId
,source.SuccessorLoadBatchId
)
;
In the next step, the data modeler must add the Load Process Steps, i.e. the tables, to the batch. The assignment is done in the file \etc\config\ConfigLoadBatchProcessStepLogistic.sql. The following two SQL statements show, as an example, the assignment of all tables of type SL - Reference Zone of the data model STAGE_LAYER to the new batch 'Reference Data Stage'.
-- Delete Load Process Steps from Batch
DELETE FROM MetadataZone.LoadBatchProcessStep
 WHERE LoadBatchId = (SELECT LoadBatchId
                        FROM MetadataZone.LoadBatch
                       WHERE BatchLabel = 'Reference Data Stage'
                     )
;
-- DML to assign all Reference Load Process Steps to a batch
INSERT INTO MetadataZone.LoadBatchProcessStep (LoadBatchId, ProcessStepId)
SELECT LoadBatchId, ProcessStepId
  FROM (SELECT LoadBatchId
          FROM MetadataZone.LoadBatch
         WHERE BatchLabel = 'Reference Data Stage') lb
 CROSS JOIN (SELECT ProcessStepId
               FROM MetadataZone.EntityClass iec
              INNER JOIN MetadataZone.Mapping im
                 ON iec.EntityClassCode = im.TargetEntityClassCode
                AND iec.ModelCode = im.TargetModelCode
              INNER JOIN MetadataZone.LoadProcessStep ilps
                 ON ilps.MappingGUID = im.MappingGUID
              INNER JOIN ( SELECT ProcessStepTypeId
                             FROM MetadataZone.ProcessStepType
                            WHERE ProcessStepType = 'DEF'
                         ) pst
                 ON ilps.ProcessStepTypeId = pst.ProcessStepTypeId
              -- Assign Data Models to batch
              WHERE iec.ModelCode IN (
                     'STAGE_LAYER'
                    --,'StageLayerHiZ'
                    )
              -- Assign Stereotypes to batch
                AND iec.Stereotype IN (
                    -- 'Volatile Zone'
                    --,'History Zone'
                     'Reference Zone'
                    --,'CDC Complete'
                    --,'CDC One Data Load Before'
                    --,'CDC Partially'
                    )
              -- Assign tables (according to naming conventions) to batch
              -- e.g. %edSat% for EdDating-Satellites
                AND iec.EntityClassCode LIKE '%%'
              -- Include or exclude mappings according to naming conventions
              -- NoRun%   -> e.g. not used integrity runs
              -- TestRun% -> mappings not ready to run on quality or production
              --             but available in test-batches
                AND im.MappingName NOT LIKE 'NoRun%'
                AND im.MappingName NOT LIKE 'TestRun%'
                AND im.MappingName NOT LIKE 'BusinessRun%'
            ) lps
;