Generating dimension data for dates

Most analytical and BI databases have one or more date dimension tables, and one frequently needs to generate and populate their data. Below is a solution for such data generation, written in Python. Use the appropriate database driver/module to connect to your specific database server (MySQL, SQL Server, Oracle, etc.) for data population.

Notes:

1. The script takes 2 parameters, a start date and an end date, in YYYYMMDD format, both inclusive. Extensive error checking is built in, but let me know if you have comments/suggestions;

2. The script produces a Python dictionary (associative array) for each date and prints out its contents;

3. The output includes dayNumber: a day’s position in a year. For example, 2011-02-01 is the 32nd day of 2011, therefore its dayNumber is 32;

4. The output includes weekNumber: a week’s position in a year. The week number is based on the ISO standard. From the Python documentation: the ISO year consists of 52 or 53 full weeks, where a week starts on a Monday and ends on a Sunday. The first week of an ISO year is the first (Gregorian) calendar week of a year containing a Thursday. This is called week number 1, and the ISO year of that Thursday is the same as its Gregorian year.

So, 2011-01-01 has the weekNumber 52, because it falls on a Saturday and belongs to the last week of 2010.

5. The output includes weekday information as well. Four variations are included:
Sunday 0, Monday 1, and so on
Sunday 1, Monday 2, and so on
Monday 0, Tuesday 1, and so on
Monday 1, Tuesday 2, and so on

6. The script requires the argparse module, which comes with Python 2.7. Python versions prior to 2.7 do not include it by default, so you need to install it separately.
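The behavior described in notes 3–5 can be checked quickly with the standard datetime module; the dates below are chosen only for illustration:

```python
from datetime import date

# dayNumber: 2011-02-01 is the 32nd day of 2011
feb1 = date(2011, 2, 1)
day_number = feb1.toordinal() - date(feb1.year - 1, 12, 31).toordinal()
print(day_number)  # 32

# weekNumber: 2011-01-01 falls on a Saturday, so its ISO week is week 52 of 2010
jan1 = date(2011, 1, 1)
week_number = jan1.isocalendar()[1]
print(week_number)  # 52

# The four weekday variations for 2011-01-01 (a Saturday)
print(jan1.isoweekday() % 7)      # 6  (Sunday 0, Monday 1, ...)
print(jan1.isoweekday() % 7 + 1)  # 7  (Sunday 1, Monday 2, ...)
print(jan1.weekday())             # 5  (Monday 0, Tuesday 1, ...)
print(jan1.isoweekday())          # 6  (Monday 1, Tuesday 2, ...)
```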

import argparse, sys, time
from datetime import date, timedelta

parser = argparse.ArgumentParser(description="Generating date dimension data")
parser.add_argument('-s', '--startDate', help='Start date in YYYYMMDD format', required=True, dest='startDate')
parser.add_argument('-e', '--endDate', help='End date in YYYYMMDD format', required=True, dest='endDate')

argList = parser.parse_args()

if (not argList.startDate.isdigit()) or (len(argList.startDate) != 8) or (not argList.endDate.isdigit()) or (len(argList.endDate) != 8) or (argList.startDate > argList.endDate):
    print "Input(s) must be numeric in YYYYMMDD format and end date must not be earlier than start date"
    sys.exit(1)

try:
    startDate = date(int(argList.startDate[0:4]), int(argList.startDate[4:6]), int(argList.startDate[6:8]))
    endDate = date(int(argList.endDate[0:4]), int(argList.endDate[4:6]), int(argList.endDate[6:8]))
except ValueError:
    print "Input(s) must be valid date value in YYYYMMDD format"
    sys.exit(1)

while startDate <= endDate:
    dateInfo = {'dateYYYYMMDD': startDate.strftime('%Y%m%d'),
                'calDate': startDate.strftime('%Y-%m-%d'),
                'calDay': startDate.day,
                'calMonth': startDate.month,
                'calYear': startDate.year}
    dateInfo['dayOfWeekSunday0Monday1'] = startDate.isoweekday() % 7
    dateInfo['dayOfWeekSunday1Monday2'] = startDate.isoweekday() % 7 + 1
    dateInfo['dayOfWeekSunday6Monday0'] = startDate.weekday()
    dateInfo['dayOfWeekSunday7Monday1'] = startDate.isoweekday()
    dateInfo['dayNumber'] = startDate.toordinal() - date(startDate.year - 1, 12, 31).toordinal()
    dateInfo['weekNumber'] = startDate.isocalendar()[1]
    print dateInfo
    startDate = startDate + timedelta(1)
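As a sanity check, the per-date logic from the loop can be factored into a small helper function. This is a sketch for verification, not part of the original script; for 2011-02-01 it yields dayNumber 32 and weekNumber 5:

```python
from datetime import date

def date_info(d):
    # Build the same dictionary the script prints, for a single date.
    info = {'dateYYYYMMDD': d.strftime('%Y%m%d'),
            'calDate': d.strftime('%Y-%m-%d'),
            'calDay': d.day,
            'calMonth': d.month,
            'calYear': d.year}
    info['dayOfWeekSunday0Monday1'] = d.isoweekday() % 7
    info['dayOfWeekSunday1Monday2'] = d.isoweekday() % 7 + 1
    info['dayOfWeekSunday6Monday0'] = d.weekday()
    info['dayOfWeekSunday7Monday1'] = d.isoweekday()
    info['dayNumber'] = d.toordinal() - date(d.year - 1, 12, 31).toordinal()
    info['weekNumber'] = d.isocalendar()[1]
    return info

print(date_info(date(2011, 2, 1)))
# 2011-02-01 is a Tuesday: dayNumber 32, weekNumber 5
```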

A comparison of HandlerSocket and mysql client libraries with Python

I’ve done some benchmark testing of two Python modules for MySQL data retrieval: MySQLdb and pyhs. MySQLdb uses MySQL’s client libraries, whereas pyhs uses HandlerSocket, which bypasses MySQL’s client layer and talks to the InnoDB storage engine directly. In my testing, HandlerSocket retrieved about 82% more rows than the mysql client libraries in the same time window. The tests were conducted under different conditions: right after a server start with a cold cache, with a cache warmed up by running SELECT * FROM customer, and with the execution order of the two Python scripts alternated. The results are fairly consistent in that they all fall in the same range. Below is a sample output.

root@ubuntu:~# python hanSolo.py 
Using HandlerSocket, below is a report of how many customer's name and address can be retrieved based on customer key:
Seconds elapsed:  61.0000810623
Rows retrieved:  509863
root@ubuntu:~# python mclient.py 
Using mysql client libraries, below is a report of how many customer's name and address can be retrieved based on customer key:
Seconds elapsed:  61.0001530647
Rows retrieved:  280120
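For reference, the 82% figure follows directly from the row counts in the sample run above:

```python
# Derivation of the improvement figure from the sample output.
handlersocket_rows = 509863  # rows retrieved by hanSolo.py in ~61 seconds
mysql_client_rows = 280120   # rows retrieved by mclient.py in ~61 seconds

improvement = (handlersocket_rows - mysql_client_rows) / float(mysql_client_rows)
print(round(improvement * 100))  # roughly 82
```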

Here is my setup:

Hardware and software:
1. Rackspace Cloud server Ubuntu 10.04 Lucid Lynx, 1 gig memory, 40 gig hard disk space, 64-bit
Linux ubuntu 2.6.35.4-rscloud #8 SMP Mon Sep 20 15:54:33 UTC 2010 x86_64 GNU/Linux

2. Follow the instructions here to set up Percona’s APT repository;

3. apt-get install percona-server-client-5.5

4. apt-get install percona-server-server-5.5

Enable the HandlerSocket plugin. HandlerSocket is bundled with Percona Server 5.5, so you don’t have to download, configure, and build it from source yourself:

1. mysql> install plugin handlersocket soname 'handlersocket.so';

2. cp /usr/share/mysql/my-large.cnf /etc/mysql/my.cnf

3. Edit /etc/mysql/my.cnf, adding the following under the [mysqld] section:

loose_handlersocket_port = 9998
loose_handlersocket_port_wr = 9999
loose_handlersocket_threads = 16
loose_handlersocket_threads_wr = 1
open_files_limit = 65535

4. service mysql restart

Acquire Python’s MySQLdb and pyhs modules:

1. apt-get install libmysqlclient-dev
Necessary for building Python’s MySQLdb

2. apt-get install python-dev
Provides the Python header files (along with gcc, make, etc.) necessary for building Python modules

3. wget the appropriate egg from this page. Get the one for your version of Python.

4. sh eggFileDownloadedFromTheStepAbove

5. easy_install MySQL-python
MySQLdb module, which uses the mysql client libraries for MySQL access

6. easy_install python-handler-socket

Prepare testing data

1. Follow the instructions here to get dbgen compiled;

2. From the dbgen directory, run
./dbgen -v -T c
This generates a customer file with 150000 rows

3. Create customer table in test. Here is the DDL:

CREATE TABLE customer ( C_CUSTKEY     INTEGER NOT NULL,
C_NAME        VARCHAR(25) NOT NULL,
C_ADDRESS     VARCHAR(40) NOT NULL,
C_NATIONKEY   INTEGER NOT NULL,
C_PHONE       CHAR(15) NOT NULL,
C_ACCTBAL     DECIMAL(15,2)   NOT NULL,
C_MKTSEGMENT  CHAR(10) NOT NULL,
C_COMMENT     VARCHAR(117) NOT NULL,
primary key (C_CUSTKEY));

4. load data local infile '/root/dbgen/customer.tbl' into table customer fields terminated by '|' lines terminated by '\n';
Adjust file location as necessary.

Finally, here is the content of my throwaway Python test scripts. One highlight of mclient.py is that it demonstrates how to return results as a dict with MySQLdb.

hanSolo.py

import time
from pyhs import Manager

hs = Manager()

start = time.time()
i = 1
j= 0
while i < 150000:
	data = hs.get('test', 'customer', ['C_CUSTKEY', 'C_NAME', 'C_ADDRESS'], '%s' % i)
	i=i+1
	if i == 150000:
		i = 1
	end = time.time()
	j = j + 1
	if int(end - start) > 60:
		break

print "Using HandlerSocket, below is a report of how many customer's name and address can be retrieved based on customer key:"
print "Seconds elapsed: ", str(end - start)
print "Rows retrieved: ", str(j)

mclient.py

import sys, MySQLdb, time

my_host = "localhost"
my_user = "root"
my_pass = ""
my_db = "test"

try:
    db = MySQLdb.connect(host=my_host, user=my_user, passwd=my_pass, db=my_db)
except MySQLdb.Error, e:
     print "Error %d: %s" % (e.args[0], e.args[1])
     sys.exit (1)

cursor = db.cursor (MySQLdb.cursors.DictCursor)
i=1
j=0
start = time.time()
while i < 150000:
	sql = "select c_custkey, c_name, c_address from customer where c_custkey=%s" % i;
	cursor.execute(sql)
	results = cursor.fetchall()
	i=i+1
	if i==150000:
		i=1
	end = time.time()
	j=j+1
	
	if int(end - start) > 60:
		break
print "Using mysql client libraries, below is a report of how many customer's name and address can be retrieved based on customer key:"
print "Seconds elapsed: ", str(end - start)
print "Rows retrieved: ", str(j)
db.close()

Finding long running INNODB transactions

Notes:
1. The script prints out the elapsed time since the transaction started, the MySQL thread id, and the kill statement for transactions running longer than a defined threshold value, in seconds. Just copy, paste, and then execute the kill statement if you want to terminate the long transaction(s);
2. Adjust shellCmd variable;
3. Adjust longRunningThreshold value as needed. It is measured in seconds;
4. No special libraries/modules needed, as long as there is a working mysql client;
5. The re module is used for regex processing, so this is a good place to find examples of regular expression search and grouping. A flag variable is used to help locate the MySQL thread id once a transaction running longer than the defined threshold is found.

import re, shlex, subprocess 

def runCmd(cmd):
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
    out, err = proc.communicate()
    ret = proc.returncode
    return (ret, out, err)

shellCmd = 'mysql -h hostName -e "show innodb status\\G"'
longRunningThreshold = 600

returnCode, stdOut, stdErr = runCmd(shellCmd)

targetTransactionFound = False
for line in stdOut.split('\n'):
    if targetTransactionFound:
        match = re.search(r'^MySQL\sthread\sid\s(\d+),', line)
        if match:
            print 'MySQL thread id', match.group(1), ' has been running for ', secondsTransactionElapsed, ' seconds'
            print 'To kill it, run: kill', match.group(1)
            targetTransactionFound = False
    else:
        match = re.search(r'^---TRANSACTION\s\w+,\sACTIVE\s(\d+)\ssec', line)
        if match:
            if (long(match.group(1)) > longRunningThreshold):
                targetTransactionFound = True
                secondsTransactionElapsed = match.group(1)
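The two regular expressions can be exercised against sample lines shaped like SHOW INNODB STATUS output; the lines below are illustrative, not captured from a real server:

```python
import re

# Illustrative lines in the shape the script expects from the INNODB STATUS output.
transaction_line = '---TRANSACTION 1A2B3C, ACTIVE 745 sec'
thread_line = 'MySQL thread id 42, query id 9001 localhost root'

# Group 1 captures the seconds the transaction has been active.
active_seconds = re.search(r'^---TRANSACTION\s\w+,\sACTIVE\s(\d+)\ssec', transaction_line).group(1)
print(active_seconds)  # 745

# Group 1 captures the thread id to pass to the KILL statement.
thread_id = re.search(r'^MySQL\sthread\sid\s(\d+),', thread_line).group(1)
print(thread_id)  # 42
```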

Poor man’s MySQL replication monitoring

Using MySQL replication slave(s) for reporting (with potentially different storage engines) is a very popular way of scaling database read activities. As usual, you want to be on top of things when replication breaks so that end users can be notified and issues addressed. When Nagios, Zabbix, or other monitoring tools are not available or otherwise not accessible, there has to be another way. I wrote the following Python script for that purpose. The script can then be scheduled via crontab to check replication status at an interval you define. When things break, you get a notification email.

Notes:
1. I toyed with MySQLdb Python module for this task, but I don’t like the fact that I cannot easily retrieve values via column names in a MySQLdb cursor. If there is an easier way that I am not aware of due to my ignorance, I’d appreciate it if you could let me know.

I then tried pyodbc, but ran into a compilation issue with Python 2.4. I didn’t have time to troubleshoot it at the moment, so I opted for subprocess calling the mysql client and then processing its standard output. The added benefit of this approach is the lack of dependencies on additional Python modules;

2. Here is my test for replication status: I look at 3 replication slave status variables. If Slave_IO_Running == “Yes” and Slave_SQL_Running == “Yes” and Last_Errno == “0”, then I consider that to be success. Otherwise, I want to know about it. Let me know if this test is not sufficient or there is a better way.

3. This is a good place to scavenge Python code for running an external command, list splitting, removing the first and last list items, and building a dictionary.

import shlex, smtplib, string, subprocess, sys
from socket import gethostname

def runCmd(cmd):
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
    out, err = proc.communicate()
    ret = proc.returncode
    return (ret, out, err)

returnCode, stdOut, stdErr = runCmd('mysql -e "show slave status\\G"')
if returnCode:
        print >> sys.stderr, "There was an error (%d): \n" % returnCode
        print >> sys.stderr, stdErr

slaveStatusList = stdOut.split('\n')
#Remove the first and last item of the list
del slaveStatusList[0]
del slaveStatusList[-1]
slaveStatusDict={}
for i in slaveStatusList:
        #Split on the first colon only, since values (e.g. log file names, times) may contain colons
        slaveStatusDict[i.split(':', 1)[0].strip()] = i.split(':', 1)[1].strip()

if (slaveStatusDict["Slave_IO_Running"] == "Yes" and slaveStatusDict["Slave_SQL_Running"] == "Yes" and slaveStatusDict["Last_Errno"] == "0"):
        print "cool"
else:
        emailSubject = "Replication problem on slave " + gethostname()
        emailTo = "firstName.lastName@gmail.com"
        emailFrom = "robot@mycompany.com"
        emailBody = string.join((
                "From: %s" % emailFrom,
                "To: %s" % emailTo,
                "Subject: %s" % emailSubject ,
                "",
                stdOut
                ), "\r\n")
        server = smtplib.SMTP("localhost")
        server.sendmail(emailFrom, [emailTo], emailBody)
        server.quit()
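The health test from note 2 can be isolated into a small function, sketched here against made-up status dictionaries rather than live SHOW SLAVE STATUS output:

```python
def replication_ok(status):
    # Success means both replication threads are running and the last error is clear.
    return (status.get('Slave_IO_Running') == 'Yes'
            and status.get('Slave_SQL_Running') == 'Yes'
            and status.get('Last_Errno') == '0')

# Example with minimal, made-up status dicts:
healthy = {'Slave_IO_Running': 'Yes', 'Slave_SQL_Running': 'Yes', 'Last_Errno': '0'}
broken = {'Slave_IO_Running': 'Yes', 'Slave_SQL_Running': 'No', 'Last_Errno': '1062'}
print(replication_ok(healthy))  # True
print(replication_ok(broken))   # False
```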

Install MySQLdb module for Python

Update:

Commenter MarkR made a great point: if possible, use packaging tools to maintain proper dependencies, to the extent that is possible. Installing from source should be Plan B. So, try yum install MySQL-python first.

This is mostly for my own future reference. It’ll be icing on the cake if it helps you!

This is geared for CentOS or Red Hat. Use apt-get or other packaging tools for different flavours of Linux.

1. Get the Python setuptools package, which provides the easy_install command. I love easy_install, by the way; it’s sort of like CPAN for Perl modules;
2. To install the MySQLdb package, you would think easy_install MySQLdb would do. But that is not the case. I hope the developer fixes that. Instead, you need:

easy_install MySQL-python

3. If you have build errors, you may need:
yum install python-devel or yum install gcc or both.

Update: one more tip. If you encounter:

gcc: error trying to exec 'cc1plus': execvp: No such file or directory, you probably need:

yum install gcc-c++