Generating large example data with TPC-H
- Geschrieben von Dirk Lerner
- Published in General Modeling
Several times I had the need for some large data sets to do some Data Vault tests at customer site, writing a blogpost, doing a demo or a webinar and many more. And sometimes I need data to do performance or data usage tests on different databases. Due to my work together with EXASOL I focused on the TPC-H tool DBGen to generate gigabytes of data.
To share my experience with DBGen generating large data sets I wrote this blogpost as a step by step instruction.
To compile the DBGen I’ve downloaded first the TPC-H sources and then Microsoft Visual Studio Community Edition and installed it.
After unzipping the tpc-h zip-file go to tpc-h-tool\tpch_2_17_0\dbgen and open the visual studio project file tpch.sln and build it.
There will be some errors but everything seems to build fine. Executing DBGen with the option -h I get some help:
C:\tpc-h-tool\tpch_2_17_0\dbgen\Debug>dbgen -h
TPC-H Population Generator (Version 2.17.0 build 0)
Copyright Transaction Processing Performance Council 1994 - 2010
USAGE:
dbgen [-{vf}][-T {pcsoPSOL}]
[-s <scale>][-C <procs>][-S <step>]
dbgen [-v] [-O m] [-s <scale>] [-U <updates>]
Basic Options
===========================
-C <n> -- separate data set into <n> chunks (requires -S, default: 1)
-f -- force. Overwrite existing files
-h -- display this message
-q -- enable QUIET mode
-s <n> -- set Scale Factor (SF) to <n> (default: 1)
-S <n> -- build the <n>th step of the data/update set (used with -C or -U)
-U <n> -- generate <n> update sets
-v -- enable VERBOSE mode
Advanced Options
===========================
-b <s> -- load distributions for <s> (default: dists.dss)
-d <n> -- split deletes between <n> files (requires -U)
-i <n> -- split inserts between <n> files (requires -U)
-T c -- generate cutomers ONLY
-T l -- generate nation/region ONLY
-T L -- generate lineitem ONLY
-T n -- generate nation ONLY
-T o -- generate orders/lineitem ONLY
-T O -- generate orders ONLY
-T p -- generate parts/partsupp ONLY
-T P -- generate parts ONLY
-T r -- generate region ONLY
-T s -- generate suppliers ONLY
-T S -- generate partsupp ONLY
To generate the SF=1 (1GB), validation database population, use:
dbgen -vf -s 1
To generate updates for a SF=1 (1GB), use:
dbgen -v -U 1 -s 1
But when I executed DBGen to generate data I got an error:
TPC-H Population Generator (Version 2.17.0)
Copyright Transaction Processing Performance Council 1994 - 2010
Open failed for .\dists.dss at c:\tpc-h-tool\tpch_2_17_0\dbgen\bm_utils.c:308
- Customers
- Nation
- Lineitem
- Orders
- Parts
- Partsupp
- Region
- Supplier
TPC-H Population Generator (Version 2.17.0)
Copyright Transaction Processing Performance Council 1994 - 2010
Generating data for suppliers table/
Preloading text ... 100%
done.
Generating data for customers tabledone.
Generating data for orders/lineitem tablesdone.
Generating data for part/partsupplier tablesdone.
Generating data for nation tabledone.
Generating data for region tabledone.

The data definition file (dss.ddl) and also the referential integrity file (dss.ri) is the folder tpc-h-tool\tpch_2_17_0\dbgen.

Then simply load the data into the model and make your tests on it. Have fun!