Generating dimension data for dates

Most analytical and BI databases have one or more date dimension tables, and one frequently needs to generate and populate such data. Below is a solution for generating that data, written in Python. Use the appropriate database driver/module for your specific database server (MySQL, SQL Server, Oracle, etc.) to load the output.

Notes:

1. The script takes two parameters, a start date and an end date, in YYYYMMDD format, both inclusive. Extensive error checking is built in, but let me know if you have comments/suggestions;

2. The script produces a Python dictionary (associative array) for each date and prints out its contents;

3. The output includes dayNumber: a day’s position in a year. For example, 2011-02-01 is the 32nd day of 2011, therefore its dayNumber is 32;

4. The output includes weekNumber: a week’s position in a year. The week number is based on the ISO standard. From the Python documentation: the ISO year consists of 52 or 53 full weeks, where a week starts on a Monday and ends on a Sunday. The first week of an ISO year is the first (Gregorian) calendar week of a year containing a Thursday. This is called week number 1, and the ISO year of that Thursday is the same as its Gregorian year.

So, 2011-01-01 has a weekNumber of 52, because it falls on a Saturday and belongs to the last week of 2010 (see the quick check right after these notes).

5. The output includes weekday information as well, in four different variations:
Sunday 0, Monday 1, and so on
Sunday 1, Monday 2, and so on
Monday 0, Tuesday 1, and so on
Monday 1, Tuesday 2, and so on

6. The script requires the argparse module, which ships with Python 2.7. Python versions prior to 2.7 do not include it by default, so you will need to install it separately.
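
As a quick check of the ISO week behavior described in note 4, here is a small snippet using nothing but the standard library:

from datetime import date

# isocalendar() returns a tuple of (ISO year, ISO week number, ISO weekday)
print date(2011, 1, 1).isocalendar()   # (2010, 52, 6): a Saturday in week 52 of ISO year 2010
print date(2011, 1, 3).isocalendar()   # (2011, 1, 1): the Monday that starts ISO week 1 of 2011

And here is the script itself: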

import argparse, sys, time
from datetime import date, timedelta

parser = argparse.ArgumentParser(description="Generating date dimension data")
parser.add_argument('-s', '--startDate', help='Start date in YYYYMMDD format', required=True, dest='startDate')
parser.add_argument('-e', '--endDate', help='End date in YYYYMMDD format', required=True, dest='endDate')

argList = parser.parse_args()

if (((not argList.startDate.isdigit()) or (not (len(argList.startDate) == 8))) or ((not argList.endDate.isdigit()) or (not (len(argList.endDate) == 8))) or (argList.startDate > argList.endDate)):
    print "Input(s) must be numeric in YYYYMMDD format and end date must not be earlier than start date"
    sys.exit(1)

try:
    startDate = date(int(argList.startDate[0:4]), int(argList.startDate[4:6]), int(argList.startDate[6:8]))
    endDate = date(int(argList.endDate[0:4]), int(argList.endDate[4:6]), int(argList.endDate[6:8]))
except ValueError:
    print "Input(s) must be valid date value in YYYYMMDD format"
    sys.exit(1)

start = time.time()

while startDate <= endDate:
    dateInfo = {'dateYYYYMMDD': startDate.strftime('%Y%m%d'),
                'calDate': startDate.strftime('%Y-%m-%d'),
                'calDay': startDate.day,
                'calMonth': startDate.month,
                'calYear': startDate.year}
    dateInfo['dayOfWeekSunday0Monday1'] = startDate.isoweekday() % 7
    dateInfo['dayOfWeekSunday1Monday2'] = startDate.isoweekday() % 7 + 1
    dateInfo['dayOfWeekSunday6Monday0'] = startDate.weekday()
    dateInfo['dayOfWeekSunday7Monday1'] = startDate.isoweekday()
    dateInfo['dayNumber'] = startDate.toordinal() - date(startDate.year - 1, 12, 31).toordinal()
    dateInfo['weekNumber'] = startDate.isocalendar()[1]
    print dateInfo
    startDate = startDate + timedelta(1)
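
To load the rows into a database instead of just printing them, you could replace the print statement inside the loop with an INSERT through your driver of choice. Below is a minimal sketch assuming MySQLdb and a hypothetical dim_date table; swap in pyodbc, cx_Oracle, etc., and your own table and column names as needed:

import MySQLdb  # assumption: MySQLdb is installed; use the driver that matches your server

conn = MySQLdb.connect(host='localhost', user='etl', passwd='secret', db='dw')  # hypothetical credentials
cur = conn.cursor()

# Inside the while loop, in place of "print dateInfo":
cur.execute("""INSERT INTO dim_date
                   (dateYYYYMMDD, calDate, calDay, calMonth, calYear,
                    dayNumber, weekNumber, dayOfWeekSunday7Monday1)
               VALUES (%s, %s, %s, %s, %s, %s, %s, %s)""",
            (dateInfo['dateYYYYMMDD'], dateInfo['calDate'], dateInfo['calDay'],
             dateInfo['calMonth'], dateInfo['calYear'], dateInfo['dayNumber'],
             dateInfo['weekNumber'], dateInfo['dayOfWeekSunday7Monday1']))

# After the loop finishes:
conn.commit()
conn.close()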

Data generation with TPC-H’s dbgen for load testing

2011-06-26 update:

I am not sure if there are any changes in the latest make and gcc packages. In any case, when I ran make, I encountered the message below:

make: g: Command not found
make: [qgen] Error 127 (ignored)

To fix this, find where gcc is, then create a symbolic link named g that points to gcc. All is well afterwards:
[root@ip-10-245-209-196 dbgen]# which gcc
/usr/bin/gcc
[root@ip-10-245-209-196 dbgen]# cd /usr/bin/
[root@ip-10-245-209-196 bin]# ln -s gcc g

End update
Recently I found myself doing some data loading benchmark testing with table partitioning. Data loading and storage for BI/DW/DSS work almost always involves partitioning. SQL Server has a nice feature called partition switch, which lets you swap data in and out of a partitioned table. MySQL has a similar feature called partition exchange, but it has not been released yet. Other RDBMSs may have similar features as well.

To facilitate testing, I needed some data that lends itself easily to partitioning, which led me to TPC-H’s dbgen tool.

dbgen is pretty easy to build and compile on Linux. It also has a Windows version, which I didn’t bother trying because building on Linux was such a quick and easy process. However, the documentation for dbgen, if it can be called that, leaves a lot to be desired. By experimenting and searching around, I was able to get what I needed.

Instructions here are for CentOS. Building it on other flavors of Linux should also be easy.

1. yum install make and yum install gcc if you don’t have that already;

2. wget source code;

3. tar xf tarball;

4. cd dbgen

5. vi makefile.suite and change 4 lines. See uncommented lines below:

################
## CHANGE NAME OF ANSI COMPILER HERE
################
CC      = gcc
# Current values for DATABASE are: INFORMIX, DB2, TDAT (Teradata)
#                                  SQLSERVER, SYBASE, ORACLE
# Current values for MACHINE are:  ATT, DOS, HP, IBM, ICL, MVS,
#                                  SGI, SUN, U2200, VMS, LINUX, WIN32
# Current values for WORKLOAD are:  TPCH
DATABASE= SQLSERVER
MACHINE = LINUX
WORKLOAD = TPCH

Note: there is no MySQL value (among others) for the DATABASE setting, but that doesn’t really matter, because the generated file, which is pipe (|) delimited, can be loaded into any system.

6. make -f makefile.suite
This creates the dbgen binary that we will use in the next step.

7. ./dbgen -h gives you a brief description of the switches that can be used. I am just interested in creating the lineitem table, because it has a good combination of integer, decimal, date, and character fields. The dates inside the table span 1992-01-01 to 1998-12-31, which is good for a partitioned table based on date. The DDL for this table is inside the dss.ddl file in the same directory; modify it as necessary. In fact, Lubor Kollar has it ready for SQL Server here, and Vadim Tkachenko has it for MySQL here.

Here is the command I used and its results:

[root@centos dbgen]# ./dbgen -v -T L -s 2
TPC-H Population Generator (Version 2.14.0)
Copyright Transaction Processing Performance Council 1994 - 2010
Generating data for lineitem table/
Preloading text ... 100%
Do you want to overwrite ./lineitem.tbl ? [Y/N]: Y
done.

-v: verbose, -T L: lineitem only, -s 2: scale factor of 2. My understanding is that the scale factor roughly indicates the resulting file will be close to 2 gig.

It turns out the end file is about 1.5 gig, with close to 12 million rows in it.

-rw-r--r-- 1 root root  1.5G Mar 30 21:33 lineitem.tbl
[root@centos dbgen]# wc -l lineitem.tbl 
11997996 lineitem.tbl

You can play around with different key switches to get the size of file you want. For example, quoting from Lubor Kollar:

dbgen -T L -s 4 -C 3 -S 1

Using the -s option, we set the scale to 4 that generates a Lineitem table of 3GB. Using the -C option we split the table into 3 portions, and then using the -S option we chose only the first 1GB portion of the Lineitem table.

Running the command above, I got a file named lineitem.tbl.1 with close to 8 million rows, about 976 meg.

When loading the file into SQL Server with an SSIS bulk copy task, remember that the field delimiter is | and the line separator is LF (line feed). For MySQL, something like this should do:

load data local infile '/root/dbgen/lineitem.tbl.1' into table lineitem fields terminated by '|' lines terminated by '\n';
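
If you want to sanity-check the generated file before loading it, a quick Python sketch like the one below works (the path is just an example). Note that dbgen typically writes a trailing | at the end of each line, so the last parsed field may be empty:

import csv

with open('/root/dbgen/lineitem.tbl.1', 'rb') as f:
    reader = csv.reader(f, delimiter='|')
    rowCount = 0
    for row in reader:
        rowCount += 1
        if rowCount <= 3:
            print row          # print the first few rows to eyeball the fields
    print 'total rows:', rowCount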

SSDs and their impact on database servers

Vadim Tkachenko published interesting benchmark results with PCI-E based SSDs here. I recently got a chance to benchmark FusionIO’s 320 GB PCI-E drive. It was really impressive. My results, done on Windows with sqlio, are consistent (not identical, of course, but in the same ballpark) with what Vadim reported in that blog post, done with sysbench on Linux.

sqlio is a popular IO throughput testing tool from Microsoft. I didn’t get to test throughput when the SSD was close to full. The key takeaways from my testing are:

1. I can confirm that there is no difference between random and sequential IO, in contrast to traditional spindle-based hard disks;

2. Reads are significantly faster than writes. With 64 threads, reads and writes achieved around 1.4 GB/s and 400 MB/s of throughput, respectively.

It is good to hear that another vendor, Virident, also offers similar PCI-E based SSDs at similar price points. As the technology behind SSDs matures and prices decrease, I think their impact on the database field could be significant.

Most, if not all, database platforms have put a lot of emphasis on achieving sequential reads and minimizing fragmentation. This is especially true for business intelligence databases. However, as SSDs gradually take over the high-end storage market for mission-critical databases, the absence of any difference between random and sequential access on those drives could relegate our obsession with sequential reads and fragmentation to the back burner, which is significant, in my opinion.

I am not suggesting that SSDs make the fragmentation issue disappear, mind you. Many database products, be it MySQL (the InnoDB storage engine in particular), SQL Server, or Oracle, use prefetching, read-ahead, or whatever the lingo may be, to anticipate demand and bring additional data into cache in advance in case it is needed next. This can be especially effective for business intelligence applications. With prefetching, fragmentation is still undesirable, but the speed of SSDs could make it a less pressing issue than it has been.

A few years back, when multi-core servers started emerging, database software vendors invested time and energy to take advantage of them, with new features added and white papers written for the NUMA architecture. I think it will be interesting to watch how emerging SSDs in the server market will be exploited. Percona’s white paper, Scaling MySQL Deployments With Percona Server and Virident tachIOn Drives, which talks about “scaling up” with SSDs instead of “scaling out” with sharding, presents a pretty interesting idea.

SQL code for SQL and Relational Theory

I am reading SQL and Relational Theory by C. J. Date. Baron Schwartz wrote a nice review of it. I am reading the online version, about halfway through, so I am not sure whether it comes with an accompanying CD of source code. In any case, if you want to play with the SQL code listed in the book, here is a script to generate the tables and rows (or should I say relations, at a certain point in time, with tuples?):

CREATE TABLE S
   ( SNO    VARCHAR(5)   NOT NULL ,
     SNAME  VARCHAR(25)  NOT NULL ,
     STATUS INTEGER      NOT NULL ,
     CITY   VARCHAR(20)  NOT NULL ,
     UNIQUE ( SNO ) ) ;

 CREATE TABLE P
   ( PNO    VARCHAR(6)   NOT NULL ,
     PNAME  VARCHAR(25)  NOT NULL ,
     COLOR  CHAR(10)     NOT NULL ,
     WEIGHT NUMERIC(5,1) NOT NULL ,
     CITY   VARCHAR(20)  NOT NULL ,
     UNIQUE ( PNO ) ) ;

 CREATE TABLE SP
   ( SNO    VARCHAR(5)   NOT NULL ,
     PNO    VARCHAR(6)   NOT NULL ,
     QTY    INTEGER      NOT NULL ,
     UNIQUE ( SNO , PNO ) ,
     FOREIGN KEY ( SNO )
        REFERENCES S ( SNO ) ,
     FOREIGN KEY ( PNO )
        REFERENCES P ( PNO ) ) ;

INSERT INTO S (SNO, SNAME, STATUS, CITY) VALUES ('S1', 'Smith', 20, 'London');
INSERT INTO S (SNO, SNAME, STATUS, CITY) VALUES ('S2', 'Jones', 10, 'Paris');
INSERT INTO S (SNO, SNAME, STATUS, CITY) VALUES ('S3', 'Blake', 30, 'Paris');
INSERT INTO S (SNO, SNAME, STATUS, CITY) VALUES ('S4', 'Clark', 20, 'London');
INSERT INTO S (SNO, SNAME, STATUS, CITY) VALUES ('S5', 'Adams', 30, 'Athens');

INSERT INTO P (PNO, PNAME, COLOR, WEIGHT, CITY) VALUES ('P1', 'Nut', 'Red', 12.0, 'London');
INSERT INTO P (PNO, PNAME, COLOR, WEIGHT, CITY) VALUES ('P2', 'Bolt', 'Green', 17.0, 'Paris');
INSERT INTO P (PNO, PNAME, COLOR, WEIGHT, CITY) VALUES ('P3', 'Screw', 'Blue', 17.0, 'Oslo');
INSERT INTO P (PNO, PNAME, COLOR, WEIGHT, CITY) VALUES ('P4', 'Screw', 'Red', 14.0, 'London');
INSERT INTO P (PNO, PNAME, COLOR, WEIGHT, CITY) VALUES ('P5', 'Cam', 'Blue', 12.0, 'Paris');
INSERT INTO P (PNO, PNAME, COLOR, WEIGHT, CITY) VALUES ('P6', 'Cog', 'Red', 19.0, 'London');

INSERT INTO SP (SNO, PNO, QTY) VALUES ('S1', 'P1', 300);
INSERT INTO SP (SNO, PNO, QTY) VALUES ('S1', 'P2', 200);
INSERT INTO SP (SNO, PNO, QTY) VALUES ('S1', 'P3', 400);
INSERT INTO SP (SNO, PNO, QTY) VALUES ('S1', 'P4', 200);
INSERT INTO SP (SNO, PNO, QTY) VALUES ('S1', 'P5', 100);
INSERT INTO SP (SNO, PNO, QTY) VALUES ('S1', 'P6', 100);
INSERT INTO SP (SNO, PNO, QTY) VALUES ('S2', 'P1', 300);
INSERT INTO SP (SNO, PNO, QTY) VALUES ('S2', 'P2', 400);
INSERT INTO SP (SNO, PNO, QTY) VALUES ('S3', 'P2', 200);
INSERT INTO SP (SNO, PNO, QTY) VALUES ('S4', 'P2', 200);
INSERT INTO SP (SNO, PNO, QTY) VALUES ('S4', 'P4', 300);
INSERT INTO SP (SNO, PNO, QTY) VALUES ('S4', 'P5', 400);

Thoughts on Data Masking

Oftentimes, production data needs to be moved to different environments for testing and development purposes. However, some of that data (people’s names, birthdays, addresses, account numbers, and so on) should not be visible to testers and/or developers, due to privacy and regulatory concerns. Hence the need to mask it. I can certainly see this need growing over time across all database platforms. There is software out there that handles this sort of task, or related ones such as data generation. Oracle has actually offered a Data Masking Pack since 10g for this purpose. Here are some of my thoughts on the topic.

One method of masking data is reshuffling, which randomly shuffles the values of the column(s) you want to protect across different rows.

Another way is data generation: for the target column(s), we simply replace the values with something else.

With reshuffling, the data element obviously remains meaningful. In other words, a reshuffled account number is still a valid account number; it is just now associated with a different owner. Depending on how stringent the requirements are, this may or may not be enough.
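
As a rough illustration of reshuffling (a sketch only; the file names and column name are made up), here is how one column of a CSV file could be shuffled across rows while everything else stays put:

import csv, random

# Hypothetical input/output files and target column
inFile, outFile, targetCol = 'customers.csv', 'customers_masked.csv', 'account_number'

with open(inFile, 'rb') as f:
    reader = csv.DictReader(f)
    rows = list(reader)
    fieldNames = reader.fieldnames

# Shuffle just the target column's values across all rows
shuffled = [r[targetCol] for r in rows]
random.shuffle(shuffled)
for r, v in zip(rows, shuffled):
    r[targetCol] = v

with open(outFile, 'wb') as f:
    writer = csv.DictWriter(f, fieldnames=fieldNames)
    writer.writeheader()
    writer.writerows(rows)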

For data generation, there is a question to consider: does the format of the generated data matter to us? If yes, then some intelligence needs to be built in so that the generated data follows the format we define. For instance, a valid credit card number is 16 digits long, has a certain prefix and/or suffix, its nth digit carries a certain meaning, and so on.

Another example is people’s names. Do we replace a name with random letters we concoct, or do we want the replacement to look realistic? If we want realistic names, we may have to supply a dictionary for the masking software to pull from.
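
For format-aware generation, the sketch below (the prefix, length, and name lists are arbitrary choices of mine, not tied to any real card scheme) produces a 16-digit number that passes a Luhn check, plus a "realistic" name pulled from a tiny dictionary:

import random

def luhnCheckDigit(payload):
    # Compute the Luhn check digit for a string of digits
    digits = [int(d) for d in payload]
    total = 0
    # Walk right to left; double every second digit starting with the rightmost payload digit
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def fakeCardNumber(prefix='4999'):
    # Random 15-digit body with a fixed prefix, plus the Luhn check digit
    body = prefix + ''.join(random.choice('0123456789') for _ in range(15 - len(prefix)))
    return body + str(luhnCheckDigit(body))

# A tiny name dictionary; a real masking job would use a much larger list
firstNames = ['Alice', 'Bob', 'Carol', 'David']
lastNames = ['Lee', 'Patel', 'Garcia', 'Chen']

print fakeCardNumber()
print random.choice(firstNames) + ' ' + random.choice(lastNames)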

In either case, we also have the unique and foreign key constraints to deal with, if there are any. In certain instances where more than one schema/database is involved, the complexity increases exponentially.

Regardless of the method used, the performance of the masking process is important to consider. If the volume of data to be masked is small, it may not be a big deal. But, as is often the case, if you have a huge transaction table with millions and millions of rows to mask, performance is a definite concern.

One idea I am toying with for the performance issue is low-level data manipulation: for instance, playing with the row ID in MySQL, or with the file ID, page ID, and such in SQL Server.

Another way around it is to do the masking in batches. In other words, divide the big task into smaller chunks and tackle them one at a time.
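
A sketch of that batching idea, again with a hypothetical table, key column, and masking expression; committing after each chunk keeps the transactions small:

import MySQLdb  # or whichever DB-API driver fits your platform

conn = MySQLdb.connect(host='localhost', user='etl', passwd='secret', db='dw')  # hypothetical credentials
cur = conn.cursor()

batchSize = 10000
cur.execute("SELECT MIN(id), MAX(id) FROM big_transaction_table")
minId, maxId = cur.fetchone()

low = minId
while low <= maxId:
    high = low + batchSize - 1
    # REVERSE() is only a placeholder; substitute whatever masking expression you need
    cur.execute("UPDATE big_transaction_table "
                "SET account_number = REVERSE(account_number) "
                "WHERE id BETWEEN %s AND %s", (low, high))
    conn.commit()          # commit each key range before moving on to the next
    low = high + 1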

Personally, I like the idea of data reshuffling. On the one hand, the data elements remain meaningful; I know I don’t want to work with randomly generated gibberish that makes no sense to me. On the other hand, if one wants to do performance testing in a test or development environment, one would like the data distribution to be as close to production as possible, and reshuffling can keep it pretty close.

In my next entry, I will share a simple C# program I wrote to reshuffle data inside a CSV file.