WortmarkeTEDAMOH

Generating large example data with TPC-H

Several times I had the need for some large data sets to do some Data Vault tests at customer site, writing a blogpost, doing a demo or a webinar and many more. And sometimes I need data to do performance or data usage tests on different databases. Due to my work together with EXASOL I focused on the TPC-H tool DBGen to generate gigabytes of data.

To share my experience with DBGen generating large data sets I wrote this blogpost as a step by step instruction.

To compile the DBGen I’ve downloaded first the TPC-H sources and then Microsoft Visual Studio Community Edition and installed it.

After unzipping the tpc-h zip-file go to tpc-h-tool\tpch_2_17_0\dbgen and open the visual studio project file tpch.sln and build it.

TPC H Build DBGen

There will be some errors but everything seems to build fine. Executing DBGen with the option -h I get some help:

C:\tpc-h-tool\tpch_2_17_0\dbgen\Debug>dbgen -h
TPC-H Population Generator (Version 2.17.0 build 0)
Copyright Transaction Processing Performance Council 1994 - 2010
USAGE:
dbgen [-{vf}][-T {pcsoPSOL}]
        [-s <scale>][-C <procs>][-S <step>]
dbgen [-v] [-O m] [-s <scale>] [-U <updates>]

Basic Options
===========================
-C <n> -- separate data set into <n> chunks (requires -S, default: 1)
-f     -- force. Overwrite existing files
-h     -- display this message
-q     -- enable QUIET mode
-s <n> -- set Scale Factor (SF) to  <n> (default: 1)
-S <n> -- build the <n>th step of the data/update set (used with -C or -U)
-U <n> -- generate <n> update sets
-v     -- enable VERBOSE mode

Advanced Options
===========================
-b <s> -- load distributions for <s> (default: dists.dss)
-d <n> -- split deletes between <n> files (requires -U)
-i <n> -- split inserts between <n> files (requires -U)
-T c   -- generate cutomers ONLY
-T l   -- generate nation/region ONLY
-T L   -- generate lineitem ONLY
-T n   -- generate nation ONLY
-T o   -- generate orders/lineitem ONLY
-T O   -- generate orders ONLY
-T p   -- generate parts/partsupp ONLY
-T P   -- generate parts ONLY
-T r   -- generate region ONLY
-T s   -- generate suppliers ONLY
-T S   -- generate partsupp ONLY

To generate the SF=1 (1GB), validation database population, use:
        dbgen -vf -s 1

To generate updates for a SF=1 (1GB), use:
        dbgen -v -U 1 -s 1

But when I executed DBGen to generate data I got an error:

C:\tpc-h-tool\tpch_2_17_0\dbgen\Debug>dbgen
TPC-H Population Generator (Version 2.17.0)
Copyright Transaction Processing Performance Council 1994 - 2010
Open failed for .\dists.dss at c:\tpc-h-tool\tpch_2_17_0\dbgen\bm_utils.c:308
 
To solve this issue just copy the tpc-h-tool\tpch_2_17_0\dbgen\dists.dss file to tpc-h-tool\tpch_2_17_0\dbgen\Debug. To avoid other issues, execute DBGen in the Debug folder.
If you just run DBGen without any options, it will generate one GB of data shared over eight different tables:
  • Customers
  • Nation
  • Lineitem
  • Orders
  • Parts
  • Partsupp
  • Region
  • Supplier
C:\Users\dirk.lerner\Downloads\tpc-h-tool\tpch_2_17_0\dbgen\Debug>dbgen -v
TPC-H Population Generator (Version 2.17.0)
Copyright Transaction Processing Performance Council 1994 - 2010
Generating data for suppliers table/
Preloading text ... 100%
done.
Generating data for customers tabledone.
Generating data for orders/lineitem tablesdone.
Generating data for part/partsupplier tablesdone.
Generating data for nation tabledone.
Generating data for region tabledone.
 
TPC H Files
 
If you want to generate some more data use the option -s <n> (Scale Factor (SF)). By default, <n> is set to 1 and generates 1 GB of data. So -s 10 generates 10GB and -s 100 generates 100GB.
The data definition file (dss.ddl) and also the referential integrity file (dss.ri) is the folder tpc-h-tool\tpch_2_17_0\dbgen.
 
For your convenience I’ve created a data model and DDLs for different databases you can download right here:
 
TPC H Datamodel
Then simply load the data into the model and make your tests on it. Have fun!
 
Download scripts here:
 
And that’s it folks. There are some upcoming blogposts according to Data Vault on Exasol and other interesting performance topics. Stay tuned.
 
Please leave some comments if you need more information or examples for other databases.
 
So long,
 
Dirk

2°C

Darmstadt-Dieburg

Mostly Cloudy

Humidity: 87%

Wind: 6.44 km/h

  • 03 Jan 2019 2°C -3°C
  • 04 Jan 2019 3°C 0°C