In July 2016 Mathias Brink and I had given a webinar how to implement Data Vault on a EXASOL database. Read more about in my previous blogpost or watch the recording on Youtube.
Afterward I became a lot of questions per our webinar. I’ll now answer all questions I got till today. If you have further more questions feel free to ask via my contact page,via Twitter, or write a comment right here.
Asked and answered
Question: Why are the Surrogate Keys implemented as DECIMAL(18,0) and not as INTEGER? Are there special performance or space consumption reasons?
Answer: No, none of them. It’s because DECIMAL(18,0) and INTEGER is EXASOL internally the same.
Q: Why is there no partitioning (distribution) on tables? Is there no need for a MPP partitioning strategy, no impact on query performance?
A: We didn’t use Distributions Keys for a) keeping the model more simple and b) the DDL is developed for the EXASolo (a one node system) which does not need Distribution Keys. Distribution Keys are important, when the EXASOL database has more than one node.
Q: Are there any consideration to do a time line partition on Satellites? Is there (multi-)partitioning at all available on EXASOL?
A: No, for simplicity we didn’t model any kind of partitioning. On the implementation site EXASOL does a lot self-optimizations and therefore there is no need for physical partitioning.
Q: Does EXASOL have temporal tables options?
A: No, not yet.
Q: You implemented physical Foreign Key constraints. Are there any impacts on load performance? With Teradata normally constraints are disabled.
A: We used enabled Foreign Key constraints due to data quality reasons. And no, we didn’t make tests onto differences of load performance with disabled or enabled Foreign Key constraints. As usual, you have the ability to disable Foreign Key constraints before load jobs will start and enable Foreign Key constraints again after data is populated.
Constraint Enabled: Will check if the value of a Foreign Key attribute is available in the referenced Primary Key.
Constraint Disabled: Like trusted or reliable Foreign Key constraints on other databases checks will not be performed on loading data.
If you don’t want to specify Enable/Disable on each ALTER TABLE statement there’s a global default available:
ALTER SYSTEM SET CONSTRAINT_STATE_DEFAULT = 'ENABLE';
Q: Are there benchmark comparisons to other databases available?
A: No.
Q: So, what is the main difference between your two Data Vault schema? How is the optimized one different?
A: The two key points are 1) the reference tables and 2) the redesigned Link LNK_LINE_ITEM. With this changes you’ll get a vast performance "upgrade" for the TPC-H benchmark SQLs. If there’ll be the same behavior on other databases like EXASOL is not tested.
Q: On your Primary Key and Foreign Key constraints, seems it would be better to name them rather than have the sys_122345 names in the DDL. I assume you created them with no names then reverse engineered the DDL to end up with those names?
A: Correct. They were first created without names and therefore got a system number. From a modeling perspective, it’s a bad approach not to name them. I’ve updated the download files.
Q: Any reason that the Hub Alternate Keys (aka Business Keys) inherited to the Links are optional? Ditto on Hub HUB_PART and HUB_ORDER where the Alternate Keys are optional.
A: No, it’s an error in the model. Shouldn't be there. I’ve updated the download files.
Q: Did you record the load metrics anywhere? How long to load the Hubs and how many rows?
A: EXASOL does record this information in Log-Tables (Hint: Therefor auditing needs to be enabled on database level, follow the link.), e.g.:
FROM EXA_DBA_AUDIT_SQL
WHERE COMMAND_CLASS = 'DML'
AND UPPER(SQL_TEXT) LIKE '%INSERT%HUB%';
So long,
Dirk