CDC Data Pipeline (Stage Layer)

How to design a CDC table?

Table of type SL - Volatile Zone As always, a table of stereotype SL - Volatile Zone is required to…

Table of type SL - Volatile Zone

As always, a table of stereotype SL - Volatile Zone is required to load the data 1:1 from the source into the Stage Layer. For the CDC Data Pipeline there are some requirements to consider. The following picture shows an example of a Volatile Zone table.

CDC Data Pipeline Stage Volatile Zone Table Artwork In MuseumCDC Data Pipeline Stage Volatile Zone Table Artwork In Museum

As already written in Which Stereotypes for CDC are available? there are two new stereotypes for columns in the CDC Data Pipeline:

  • CDC - Compare Key (Mandatory)
  • CDC - Ignore (Optional)

In the Volatile Zone table shown above, the column Artwork is set to stereotype CDC - Compare Key. This means for the CDC Load Process Step (LPS) that this column (or columns) is used to identify the new, changed or deleted records.

All other columns are used in the Change Data Capture process. This means that the CDC LPS checks these columns for changes.

The columns Other Column 4 and Other Column 5 are set to stereotype CDC - Ignore. This means for the CDC LPS that this columns (or a single column) are ignored in the Change Data Capture process.

Changed data in columns with the column stereotype CDC - Ignore are only updated in the table of type SL - CDC One Data Load Before, if the content of a non-CDC - Ignore column has also changed.

As described above, changed data will be ignored by CDC - Ignore and therefore changed data will not be transported into the OLDB or CDC table. As long as no non-CDC - Ignore column has changed!

If the columns in the Core Layer (Data Vault) are not needed, then the columns should be removed completely! We do NOT recommend to use the column Stereotype CDC - Ignore to "remove" columns! This could lead to misunderstandings why or why data has not been historized in the satellite.

View of type SL - Volatile Zone View

To reduce the data flow to a partial data flow a view of stereotype SL - Volatile Zone View is required. For the CDC Data Pipeline there are some requirements to consider. The following picture shows an example of a Volatile Zone view to create a partial data flow.

CDC Data Pipeline Stage Volatile Zone View for partial data flow Artwork In MuseumCDC Data Pipeline Stage Volatile Zone View for partial data flow Artwork In Museum

As already written in Which Stereotypes for CDC are available? there are two new stereotypes for columns in the CDC Data Pipeline:

  • CDC - Compare Key (Mandatory)
  • CDC - Ignore (Optional)

In the Volatile Zone view shown above, the column Artwork is set to stereotype CDC - Compare Key. This means for the CDC Load Process Step (LPS) that this column (or columns) is used to identify the new, changed or deleted records.

All other columns are used in the Change Data Capture process. This means that the CDC LPS checks these columns for changes.

Changed data in columns with the column stereotype CDC - Ignore are only updated in the table of type SL - CDC One Data Load Before, if the content of a non-CDC - Ignore column has also changed.

As described above, changed data will be ignored by CDC - Ignore and therefore changed data will not be transported into the OLDB or CDC table. As long as no non-CDC - Ignore column has changed!

If the columns in the Core Layer (Data Vault) are not needed, then the columns should be removed completely! We do NOT recommend to use the column Stereotype CDC - Ignore to "remove" columns! This could lead to misunderstandings why or why data has not been historized in the satellite.

Table of type SL - CDC One Data Load Before

For the CDC LPS a new table type is necessary with the stereotype SL - CDC One Data Load Before (ODLB).

This table contains the version of the data that the LPS loaded into the table of type SL - Volatile Zone during the previous execution.

In the case of the Partial CDC:
ODLB and HiZ tables are not maintained (loaded) by the Load Process Step (LPS). The loading of the ODLB or HiZ table must be ensured by another, additional LPS.

This table can contain as many or fewer columns compared to the Volatile Zone table. Depends on the design decision of the data modeler and the requirements for the ODLB and CDC table.

One additional column is mandatory:

  • Hash-Diff CDC (Code: HashDiffCDC)

The Hash-Diff CDC column contains a hash value that the LPS has calculated from all columns that do not have a stereotype in the Volatile Zone Table. By comparing this hash value, the LPS detects whether there are any changes in content between the current and the previous run in any of the columns.

The column Hash-Diff CDC is only necessary in the CDC Data Pipeline where the CDC table has the stereotype SL - CDC Complete.

The following picture shows an example of an ODLB table in the Stage Layer.

CDC Data Pipeline Stage ODLB Table Artwork In Museum ODLBCDC Data Pipeline Stage ODLB Table Artwork In Museum ODLB

 

Variation

It is possible for the CDC LPS, instead of creating a new OLDB table, to use an existing table from the History Zone. However, we do NOT recommend this for a Complete/Partially CDC Data Pipeline because a) the use of the column Hash-Diff CDC would only be possible with considerable effort and b) the existing table will NOT be longer the table with originally intended characteristics, e.g. a history table (see warning below).

If a table from the history zone is used for a Partially CDC Data Pipeline, it should be noted that for the CDC LPS only exactly one previous data version is visible in the table of the history zone. This could be achieved e.g. via

  • an updatable view or
  • the housekeeping (e.g. set HousekeepingUnit = ‘ProcessStepRun’ and HousekeepingAmount = 1).

Keep in mind that the MERGE statement will not only write deleted rows into the CDC table but also delete them in the History Zone table!

Table of type SL - CDC Complete

For the Complete CDC data pipeline a last new table type is necessary with the stereotype SL - CDC Complete.

This table contains only changes (Insert, Update and Delete) but all of the data that the CDC LPS identified. Data which has not changed are finally ignored by the CDC LPS.

This table can contain as many or fewer columns compared to the Volatile Zone table. Depends on the design decision of the data modeler and the requirements for the ODLB and CDC table.

One additional column is mandatory:

  • Status CDC (Code: StatusCDC) - By the way (as of release 2.11.0), it does not matter where in the table this column is. It does not have to be in the last position.

The following picture shows an example of an Complete CDC table in the Stage Layer.

CDC Data Pipeline Stage Complete CDC Table Artwork In Museum CDCCDC Data Pipeline Stage Complete CDC Table Artwork In Museum CDC

As of Release 2.11.0, you can assign a stereotype to a column in tables with type SL - CDC Complete/Partially:

  • MD - Audit Trail Id (Optional)

This stereotype identifies one column which will store the Audit Trail Id provided by ExMeX Logging Framework.

The following picture shows an example of an Complete CDC table in the Stage Layer with a column of type MD - Audit Trail Id.

CDC Data Pipeline Stage Complete CDC Table Stg Clever CDC with Audit Trail IdCDC Data Pipeline Stage Complete CDC Table Stg Clever CDC with Audit Trail Id

The following pictures showing all parts for a Complete CDC Data Pipeline assembled.

CDC Data Pipeline Stage Complete and Partial Data FlowComplete CDC data pipeline and with Partial Data Flow via View

CDC Data Pipeline Stage Complete CDC Table Clever CDC with Audit Trail IdCDC Data Pipeline Stage Complete CDC Table Clever CDC with Audit Trail Id

Table of type SL - CDC Partially

For the Partially CDC data pipeline a new table type is necessary with the stereotype SL - CDC Partially and an additional view which unions identified deleted rows and the full load of the Volatile Zone table.

ODLB and HiZ tables are not maintained (loaded) by the Load Process Step (LPS). The loading of the ODLB or HiZ table must be ensured by another, additional LPS.

This table contains only deleted data that the CDC LPS identified. Data which has not changed, new or updated are finally ignored by the (partially) CDC LPS.

This table can contain as many or fewer columns compared to the Volatile Zone table. Depends on the design decision of the data modeler and the requirements for the ODLB and CDC table.

One additional column is mandatory:

  • Status CDC (Code: StatusCDC) - By the way (as of release 2.11.0), it does not matter where in the table this column is. It does not have to be in the last position.

The following picture shows an example of an PartiallyCDC table in the Stage Layer.

CDC Data Pipeline Stage Partially CDC Table Vehicle License Plate CDCCDC Data Pipeline Stage Partially CDC Table Vehicle License Plate CDC

Now the data modeler has to design the view for unifying the full load and the deleted data sets identified by the CDC LPS. The following picture shows a view that combines the Volatile Zone (set the StatusCDC = 'X' in SQL) table and the Partially CDC (set by the CDC LPS - StatusCDC = 'D') table with UNION ALL.

CDC Data Pipeline Stage Partially CDC View Vehicle License Plate CDCCDC Data Pipeline Stage Partially CDC View Vehicle License Plate CDC

The following picture shows all parts for a Partially CDC Data Pipeline assembled.

CDC Data Pipeline Stage PartiallyCDC Data Pipeline Stage Partially

Optimizing CDC Performance

To improve the performance of the CDC (MERGE statement), we recommend the following index guidelines:

  • Create an index on the CDC Compare Key columns (join columns) in the source table that is unique and covering.
  • Create a unique clustered index on the join columns (associated to the CDC Compare Key columns in the Stage Table) in the target table (ODLB or history).

These indexes ensure that the join keys are unique and the data in the tables is sorted. Query performance is improved because the query optimizer does not need to perform extra validation processing to locate and update duplicate rows and additional sort operations are not necessary.

Posted 1 year agoby Dirk.Lerner

How to design mappings for a CDC data Pipeline with ODLB?

Depending on how the data models are structured in the project, it is necessary that the data modeler creates the…

Depending on how the data models are structured in the project, it is necessary that the data modeler creates the Stage Layer Data Model (with the tables of type SL - Volatile Zone) for the Stage Layer Data Model (with the tables of type SL - CDC One Data Load Before) as data source.

For clarity, the Data Modeller can create the Stage Layer Data Model (with the tables of type SL - CDC One Data Load Before) for the Stage Layer Data Model (with the tables of type SL - CDC Complete/Partially) a second time as Data Source. However, this is not necessary, but optional!

CDC Data Pipeline Stage List of Data Sources

First: Mapping table of type SL - Volatile Zone (or SL - Volatile Zone View) to SL - One Data Load Before

The mapping of the columns from the table of type SL - Volatile Zone (or SL - Volatile Zone View) to type SL - One Data Load Before is kept simple. Connect the columns to be loaded by drag & drop, give the mapping a suitable name and that's it.

The additional Hash-Diff CDC column must not be mapped. The Load Process Step takes care of the loading of the column.

The following picture shows an exemplary mapping.

CDC Stage to ODLB Mapping EditorCDC Stage to ODLB Mapping Editor

Only columns that are present and mapped in the table of type SL - One Data Load Before will be loaded by the CDC LPS and written to the table of type SL - CDC Complete/Partially at the end. Not to be mistaken for the optional column stereotype CDC - Ignore.

Second: Mapping table of type SL - One Data Load Before to SL - CDC Complete/Partially

The mapping of the columns from the table of type SL - One Data Load Before to type SL - CDC Complete/Partially, again, is kept simple. Connect the columns to be loaded by drag & drop, give the mapping a suitable name and that's it.

The additional Status CDC column must not be mapped. The Load Process Step takes care of the loading of the column.

The following picture shows an exemplary mapping.

CDC ODLB to CDC Mapping EditorCDC ODLB to CDC Mapping Editor

Posted 1 year agoby Dirk.Lerner

How to design mappings for a CDC data Pipeline with History Zone?

First: Mapping table of type SL - Volatile Zone (or SL - Volatile Zone View) to SL - History Zone The…

First: Mapping table of type SL - Volatile Zone (or SL - Volatile Zone View) to SL - History Zone

The mapping of the columns from the table of type SL - Volatile Zone (or SL - Volatile Zone View) to type SL - History Zone is kept simple. Connect the columns to be loaded by drag & drop, give the mapping a suitable name and that's it.

If you have chosen this option, this should already have happened elsewhere.

Second: Mapping table of type SL - Volatile Zone to SL - CDC Complete/Partially

The mapping of the columns from the table of type SL - Volatile Zone to type SL - CDC Complete/Partially, again, is kept simple. Connect the columns to be loaded by drag & drop, give the mapping a suitable name and that's it.

The additional Status CDC column must not be mapped. The Load Process Step takes care of the loading of the column.

The following picture shows an exemplary mapping.

CDC Stage to CDC HiZ Mapping EditorCDC Stage to CDC HiZ Mapping Editor

Posted 1 year agoby Dirk.Lerner

Any background information?

DAPI The DAPI MetadataZone.MetadataForGenerateStageCDC is the basis for generating the LPS with the LoadPatternId 16 (ODLB table) and LoadPatternId 17…

DAPI

The DAPI MetadataZone.MetadataForGenerateStageCDC is the basis for generating the LPS with the LoadPatternId 16 (ODLB table) and LoadPatternId 17 (History Zone table).

Generator

The script for generating the LPS with LoadPatternId 16 is CDC-ODLB-Stage.sql

The script for generating the LPS with LoadPatternId 17 is CDC-HiZ-Stage.sql

Logging

Because of how the MERGE SQL statement logs data changed in the target, the ExMeX Logging Framework can display only the sum of inserted, updated, and deleted rows in the RowsInserted column of the FctProcessStepVapourTrail in the MetadataZone-Mart.

Quote from the Microsoft SQL Docs:

When used after MERGE, @@ROWCOUNT (Transact-SQL) returns the total number of rows inserted, updated, and deleted to the client.

Resources

Posted 1 year agoby Dirk.Lerner