Archive for Python

Generating dimension data for dates

Most analytical and BI databases have date dimension table(s). One frequently needs to generate and populate such data. I present a solution below for such data generation, written in Python. Please use different database drivers/modules to connect to your specific database server (MySQL, SQL Server, Oracle, etc.) for data population.

Notes:

1. It takes 2 parameters, start date and end date, in YYYYMMDD format, inclusive. Extensive error checking is built in, but let me know if you have comments/suggestions;

2. The script produce a Python dictionary (associated array) and print out its content;

3. The output includes dayNumber: a day’s position in a year. For example, 2011-02-01 is the 32ed day in 2011, therefore its dayNumber is 32;

4. The output includes weekNumber: a week’s position in a year. The week number in year is based on ISO standard. From documentation: the ISO year consists of 52 or 53 full weeks, where a week starts on a Monday and ends on a Sunday. The first week of an ISO year is the first (Gregorian) calendar week of a year containing a Thursday. This is called week number 1, and the ISO year of that Thursday is the same as its Gregorian year.

So, 2011-01-01 has the weekNumber 52, because it falls on a Saturday and belongs to the last week of 2010.

5. The output includes weekday information as well. 4 different variations are included:
Sunday 0, Monday 1, and so on
Sunday 1, Monday 2, and so on
Monday 0, Tuesday 1, and so on
Monday 1, Tuesday 2, and so on

6. The script requires the argparse module. It comes with Python 2.7. Python version prior to 2.7 does not have it by default, therefore you need to install it.

import argparse, sys, time
from datetime import date, timedelta
parser = argparse.ArgumentParser(description="Generating date dimension data")
parser.add_argument('-s', '--startDate', help='Start date in YYYYMMDD format', required=True, dest='startDate')
parser.add_argument('-e', '--endDate', help='end date in YYYYMMDD format', required=True, dest='endDate')
argList = parser.parse_args()
if (((not argList.startDate.isdigit()) or (not (len(argList.startDate) == 8))) or ((not argList.endDate.isdigit()) or (not (len(argList.endDate) == 8))) or (argList.startDate > argList.endDate)):
	print "Input(s) must be numeric in YYYYMMDD format and end date must not be earlier than start date"
	sys.exit (1)
try:
	startDate = date(int(argList.startDate[0:4]), int(argList.startDate[4:6]), int(argList.startDate[6:8]))
	endDate = date(int(argList.endDate[0:4]), int(argList.endDate[4:6]), int(argList.endDate[6:8]))
except ValueError:
	print "Input(s) must be valid date value in YYYYMMDD format"
	sys.exit (1)
start = time.time()
while startDate <= endDate:
	dateInfo = {'dateYYYYMMDD': startDate.strftime('%Y%m%d'), 'calDate': startDate.strftime('%Y-%m-%d'), 'calDay': startDate.day, 'calMonth': startDate.month, 'calYear': startDate.year}
	dateInfo['dayOfWeekSunday0Monday1'] = startDate.isoweekday() % 7
	dateInfo['dayOfWeekSunday1Monday2'] = startDate.isoweekday() % 7 + 1
	dateInfo['dayOfWeekSunday6Monday0'] = startDate.weekday()
	dateInfo['dayOfWeekSunday7Monday1'] = startDate.isoweekday()
	dateInfo['dayNumber'] = startDate.toordinal() - date(startDate.year - 1, 12, 31).toordinal()
	dateInfo['weekNumber'] = startDate.isocalendar()[1]
	print dateInfo
	startDate = startDate + timedelta(1)

Comments (1)

A comparison of HandlerSocket and mysql client libraries with Python

I’ve done some benchmark testing of 2 Python modules for MySQL data retrieval: MySQLdb and pyhs. MySQLdb uses MySQL’s client libraries, whereas pyhs uses HandlerSocket that bypasses MySQL’s client layer and interfaces Innodb storage engine’s files directly. In my testing, HandlerSocket results in 82% improvement over mysql client libraries based on number of rows retrieved. The tests were conducted under different conditions: right after a start when cache is cold, a warmed up cache after running SELECT * FROM customer, and alternating the execution order of those 2 Python files. The results are fairly consistent in that they all fall in the same range. Below is a sample output.

root@ubuntu:~# python hanSolo.py
Using HandlerSocket, below is a report of how many customer's name and address can be retrieved based on customer key:
Seconds elapsed:  61.0000810623
Rows retrieved:  509863
root@ubuntu:~# python mclient.py
Using mysql client libraries, below is a report of how many customer's name and address can be retrieved based on customer key:
Seconds elapsed:  61.0001530647
Rows retrieved:  280120

Here is my setup:

Hardware and software:
1. Rackspace Cloud server Ubuntu 10.04 Lucid Lynx, 1 gig memory, 40 gig hard disk space, 64-bit
Linux ubuntu 2.6.35.4-rscloud #8 SMP Mon Sep 20 15:54:33 UTC 2010 x86_64 GNU/Linux

2. Following instruction here to get Percona’s APT repository;

3. apt-get install percona-server-client-5.5

4. apt-get install percona-server-server-5.5

Enable HandlerSocket plugin. HandlerSocket is bundled with Percona Server 5.5, so you don’t have to download source files, config, make, build yourself:

1. mysql> install plugin handlersocket soname 'handlersocket.so'

2 cp /usr/share/mysql/my-large.cnf /etc/mysql/my.cnf

3. vim /etc/mysql/my.cnf with the following under mysqld section

loose_handlersocket_port = 9998
loose_handlersocket_port_wr = 9999
loose_handlersocket_threads = 16
loose_handlersocket_threads_wr = 1
open_files_limit = 65535

4. service mysql restart

Acquire Python’s MySQLdb and pyhs modules:

1. apt-get install libmysqlclient-dev
Necessary for building Python’s MySQLdb

2. apt-get install python-dev
Necessary Python header files, gcc, make, etc., for building Python modules

2. wget the appropriate egg from this page. Get the one for your version of Python.

3. sh eggFileDownloadedFromTheStepAbove

4. easy_install MySQL-python
MySQLdb module, which uses mysql client for MySQL access

5. easy_install python-handler-socket

Prepare testing data

1. Follow instructions here to get dbgen compiled;

2. While at the proper directory, run
./dbgen -v -T c
It will generate a customer file that has 150000 rows

3. Create customer table in test. Here is the DDL:

CREATE TABLE customer ( C_CUSTKEY     INTEGER NOT NULL,
C_NAME        VARCHAR(25) NOT NULL,
C_ADDRESS     VARCHAR(40) NOT NULL,
C_NATIONKEY   INTEGER NOT NULL,
C_PHONE       CHAR(15) NOT NULL,
C_ACCTBAL     DECIMAL(15,2)   NOT NULL,
C_MKTSEGMENT  CHAR(10) NOT NULL,
C_COMMENT     VARCHAR(117) NOT NULL,
primary key (C_CUSTKEY));

4. load data local infile ‘/root/dbgen/customer.tbl’ into table customer fields terminated by ‘|’ lines terminated by ‘\n’;
Adjust file location as necessary.

Finally, here is the content of my throwaway Python test scripts. One highlight of mclient.py is it demonstrate how to return results back in as dict with MySQLdb.

hanSolo.py

import time
from pyhs import Manager
hs = Manager()
start = time.time()
i = 1
j= 0
while i < 150000:
	data = hs.get('test', 'customer', ['C_CUSTKEY', 'C_NAME', 'C_ADDRESS'], '%s' % i)
	i=i+1
	if i == 150000:
		i = 1
	end = time.time()
	j = j + 1
	if int(end - start) > 60:
		break
print "Using HandlerSocket, below is a report of how many customer's name and address can be retrieved based on customer key:"
print "Seconds elapsed: ", str(end - start)
print "Rows retrieved: ", str(j)

mclient.py

import sys, MySQLdb, time
my_host = "localhost"
my_user = "root"
my_pass = ""
my_db = "test"
try:
    db = MySQLdb.connect(host=my_host, user=my_user, passwd=my_pass, db=my_db)
except MySQLdb.Error, e:
     print "Error %d: %s" % (e.args[0], e.args[1])
     sys.exit (1)
cursor = db.cursor (MySQLdb.cursors.DictCursor)
i=1
j=0
start = time.time()
while i < 150000:
	sql = "select c_custkey, c_name, c_address from customer where c_custkey=%s" % i;
	cursor.execute(sql)
	results = cursor.fetchall()
	i=i+1
	if i==150000:
		i=1
	end = time.time()
	j=j+1
	if int(end - start) > 60:
		break
print "Using mysql client libraries, below is a report of how many customer's name and address can be retrieved based on customer key:"
print "Seconds elapsed: ", str(end - start)
print "Rows retrieved: ", str(j)
db.close()

Comments

Finding long running INNODB transactions

Notes:
1. The script prints out elapsed time since transaction started, MySQL thread id, and the kill statement for transactions running longer than a defined threshold value, in seconds. Just copy, paster, and then execute the kill statement if you want to terminate the long transaction(s);
2. Adjust shellCmd variable;
3. Adjust longRunningThreshold value as needed. It is measured in seconds;
4. No special libraries/modules needed, as long as there is a working mysql client;
5. re module is used for regex processing. Good place to find examples of regular expression search and grouping. A status variable is used to assist locating MySQL thread id once a transaction running longer than the defined threshold is found.

import re, shlex, subprocess
def runCmd(cmd):
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
    out, err = proc.communicate()
    ret = proc.returncode
    return (ret, out, err)
shellCmd = """mysql -h hostName -e "show innodb status\\G""""
longRunningThreshold = 600
returnCode, stdOut, stdErr = runCmd(shellCmd)
targetTransactionFound = False
for line in stdOut.split('\n'):
    if targetTransactionFound:
        match = re.search(r'^MySQL\sthread\sid\s(\d+),', line)
        if match:
            print 'MySQL thread id', match.group(1), ' has been running for ', secondsTransactionElapsed, ' seconds'
            print 'To kill it, run: kill', match.group(1)
            targetTransactionFound = False
    else:
        match = re.search(r'^---TRANSACTION\s\w+,\sACTIVE\s(\d+)\ssec', line)
        if match:
            if (long(match.group(1)) > longRunningThreshold):
                targetTransactionFound = True
                secondsTransactionElapsed = match.group(1)

Comments (3)

Poor man’s MySQL replication monitoring

Using MySQL replication slave(s) for reporting (with potentially different storage engines) is a very popular way of scaling database read activities. As usual, you want to be on top of things when replication breaks so end users can be notified and issues addressed. When Nagios, Zabbix, or whatever monitoring tools are not available or otherwise not accessible, there got to be another way. I wrote the following Python script for that purpose. The script can then be scheduled via crontab to check replication status in an interval you define. When things break, you get a notification email.

Notes:
1. I toyed with MySQLdb Python module for this task, but I don’t like the fact that I cannot easily retrieve values via column names in a MySQLdb cursor. If there is an easier way that I am not aware of due to my ignorance, I’d appreciate it if you could let me know.

I then tried pyodbc, but got a Python 2.4 compiling issue. I didn’t have time to troubleshoot at the moment, so I opted for subprocess calling mysql client and then processing standard output. The added benefit of this approach is the lack of dependencies on additional Python modules;

2. Here is my test for replication status: I look at 3 replication slave status variables. If Slave_IO_Running == “Yes” and Slave_SQL_Running == “Yes” and Last_Errno == “0″, then I consider that to be success. Otherwise, I want to know about it. Let me know if this test is not sufficient or there is a better way.

3. Good place to scavenge code for running external command, list splitting, removing the first and last list items, and dictionary building for Python.

import shlex, smtplib, string, subprocess, sys
from socket import gethostname
def runCmd(cmd):
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
    out, err = proc.communicate()
    ret = proc.returncode
    return (ret, out, err)
returnCode, stdOut, stdErr = runCmd('mysql -e "show slave status\\G"')
if returnCode:
        print >> sys.stderr, "There was an error (%d): \n" % returnCode
        print >> sys.stderr, stdErr
slaveStatusList = stdOut.split('\n')
#Remove the first and last item of the list
del slaveStatusList[0]
del slaveStatusList[-1]
slaveStatusDict={}
for i in slaveStatusList:
        slaveStatusDict[i.split(':')[0].strip()] = i.split(':')[1].strip()
if (slaveStatusDict["Slave_IO_Running"] == "Yes" and slaveStatusDict["Slave_SQL_Running"] == "Yes" and slaveStatusDict["Last_Errno"] == "0"):
        print "cool"
else:
        emailSubject = "Replication problem on slave " + gethostname()
        emailTo = "firstName.lastName@gmail.com"
        emailFrom = "robot@mycompany.com"
        emailBody = string.join((
                "From: %s" % emailFrom,
                "To: %s" % emailTo,
                "Subject: %s" % emailSubject ,
                "",
                stdOut
                ), "\r\n")
        server = smtplib.SMTP("localhost")
        server.sendmail(emailFrom, [emailTo], emailBody)
        server.quit()

Comments (3)

Install MySQLdb module for Python

Update:

Commenter MarkR made a great point: if possible, use some packaging tools, to try to maintain proper dependencies, to the extent that is possible. Install from the source should be Plan B. So, try yum install MySQL-python first.

This is mostly for my own future reference. It’ll be icing on the cake if it helps you!

This is geared for CentOS or Red Hat. Use apt-get or other packaging tools for different flavours of Linux.

1. Get Python module setuptools called easy_install. I love easy_install, by the way, sort of like CPAN for Perl modules;
2. To install MySQLdb package, you would think easy_install MySQLdb would do. But that is not the case. I hope the developer would fix that. Instead, you need:

easy_install MySQL-python

3. If you have build errors, you may need:
yum install python-devel or yum install gcc or both.

Update: one more tip. If you encounter:

gcc: error trying to exec ‘cc1plus’: execvp: No such file or directory, you probably need:

yum install gcc-c++

Comments (2)

Running external processes/programs with Python subprocess

Note: It is good to know when the external process runs and for how long, which is included in the code snippet below. For the “when” part, it would have been nice to know the time zone as well. But due to the limitation mentioned here, it is not as easy as it sounds with the datetime module.

It seems the time module in Python has better handling of time zone, but it ends at year 2038, which is a slight concern of mine, so I decided against it. The datetime module’s upper limit is year 9999. That is pretty far into the future, if we humans haven’t managed to extinct ourselves by then.

Speaking of the Linux/Unix epoch time, I wonder if we would encounter year 2038 issue, like the much talked about Y2K back in the day :)

import shlex, subprocess, sys
from datetime import datetime
def runCmd(cmd):
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
    out, err = proc.communicate()
    ret = proc.returncode
    return (ret, out, err)
startTime = datetime.now()
# "df -h" is a *nix command. To test on Windows, replace it with "notepad" or something like that.
returnCode, stdOut, stdErr = runCmd('df -h')
endTime = datetime.now()
print "The command started at " + str(startTime) + " and ran for " + str((endTime - startTime)) + "\n"
print stdOut
if returnCode:
	print >> sys.stderr, "There was an error (%d): \n" % returnCode
	print >> sys.stderr, stdErr

Update:

When I first posted this, I wrote that to test it on Windows, you can try the “dir” command. Actually, if you do, you will encounter the error below:

C:\Users\haidong\Documents\work\python>tt.py
Traceback (most recent call last):
  File "C:\Users\haidong\Documents\work\python\tt.py", line 14, in <module>
    returnCode, stdOut, stdErr = runCmd('df -h')
  File "C:\Users\haidong\Documents\work\python\tt.py", line 5, in runCmd
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
  File "C:\Python27\lib\subprocess.py", line 672, in __init__
    errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 882, in _execute_child
    startupinfo)
WindowsError: [Error 2] The system cannot find the file specified

The reason is that dir is a DOS command that needs to be run on top of the command line process. If you replace “dir” with some other independent executibles, it should work. I’ve tried both “notepad” and “sqlcmd”, and both worked for me fine. “sqlcmd” is really the one I am after.

Comments

Parse sqlio log file with Python pyparsing

A few weeks ago I posted some questions on 3 Python text processing modules: pyparsing, SimpleParse, and NLTK. Today I need to analyze a log file generated by sqlio. I decided to use pyparsing. I am pretty pleased with it.

I am aware that there are at least 2 utility scripts for this: one is in Perl and then other is in Windows PowerShell. But I wanted to write in Python.

Noteworthy points:

1. In analyzing and building up the grammar/structure of the text for Python, I opted for line by line analysis and coding. I think this brings clarity and is easier to read. It felt a bit tedious in building it up, but once it is done, it is pretty rewarding to see the end results;
2. After parsing, the script writes the results to a csv file. You can then open that in a spreadsheet program to create some charts. Seeing the results in a column diagram chart brings clarity and focus to the result. It worked pretty well for me.

All in all, a good and productive day. Below are the details. The top part is the sample log content to be parsed, followed by the program.

"""
c:\Program Files (x86)\SQLIO>sqlio -kW -t2 -s30 -dH -o1 -frandom -b64 -BH -LS Testfile.dat
sqlio v1.5.SG
using system counter for latency timings, 2208037 counts per second
2 threads writing for 30 secs to file H:Testfile.dat
	using 64KB random IOs
	enabling multiple I/Os per thread with 1 outstanding
	buffering set to use hardware disk cache (but not file cache)
using current size: 24576 MB for file: H:Testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  6106.50
MBs/sec:   381.65
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 5
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 99  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
c:\Program Files (x86)\SQLIO>sqlio -kW -t2 -s30 -dH -o2 -frandom -b64 -BH -LS Testfile.dat
sqlio v1.5.SG
using system counter for latency timings, 2208037 counts per second
2 threads writing for 30 secs to file H:Testfile.dat
	using 64KB random IOs
	enabling multiple I/Os per thread with 2 outstanding
	buffering set to use hardware disk cache (but not file cache)
using current size: 24576 MB for file: H:Testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  6951.19
MBs/sec:   434.44
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 6
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 87 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
"""
from pyparsing import *
output = open("sample.csv", "w")
input = open("hDriveSqlioResults.txt", "r")
data = input.read()
#------------------------------------------------------------------------
# Define Grammars
#------------------------------------------------------------------------
integer = Word(nums)
singleLetter = Word(alphas, max=1)
realNumber = Combine(Word(nums) + '.' + Word(nums))
reading = Literal("reading")
writing = Literal("writing")
readingOrWriting = reading | writing
random = Literal("random")
sequential = Literal("sequential")
randomOrSequential = random | sequential
#Line 1 is useful. We want to suck out the juicy parts.
sectionBeginning = CaselessLiteral("c:\\program files (x86)\\sqlio>sqlio ")
ioKind = "-k" + singleLetter("k")
numThreads = "-t" + integer("t")
numSeconds = "-s" + integer("s")
driveOrMountPath = "-d" + singleLetter("d")
outstandingRequests = "-o" + integer("o")
accessType = "-f" + Word(alphas)("f")
ioBlockSizeInKB = "-b" + integer("b")
#Line 2 can be safely ignored
line2 = "sqlio v" + restOfLine
#Line 3 can be safely ignored
line3 = "using system counter for latency timings" + restOfLine
#Line 4 can be safely ignored
line4 = integer + "threads" + readingOrWriting + restOfLine
#Line 5 can be safely ignored
line5 = "using" + integer + "KB" + randomOrSequential + "IOs" + restOfLine
#Line 6 can be safely ignored
line6 = "enabling multiple " + restOfLine
#Line 7 can be safely ignored
line7 = "buffering set to use" + restOfLine
#Line 8 we should get the file size out
line8 = "using current size: " + integer("fileSize") + Word(alphas)("fileSizeUnit") + restOfLine
#Line 9 can be safely ignored
line9 = "initialization done" + restOfLine
#Line 10 can be safely ignored
line10 = "CUMULATIVE" + restOfLine
#Line 11 can be safely ignored
line11 = "throughput" + restOfLine
#Line 12 we want IOPS
line12 = "IOs/sec:" + realNumber("IOPS") + restOfLine
#Line 13 we want MBPS
line13 = "MBs/sec:" + realNumber("MBPS") + restOfLine
#Line 14 can be safely ignored
line14 = "latency" + restOfLine
#Line 15 we need to get minLatency out
line15 = "Min_Latency(ms):" + integer("minLatency") + restOfLine
#Line 16 we need to get avgLatency out
line16 = "Avg_Latency(ms):" + integer("avgLatency") + restOfLine
#Line 17 we need to get maxLatency out
line17 = "Max_Latency(ms):" + integer("maxLatency") + restOfLine
#Line 18 can be safely ignored
line18 = "histogram" + restOfLine
#Line 19 can be safely ignored
line19 = "ms:" + restOfLine
#Line 20 can be safely ignored
line20 = "\%:" + restOfLine
extraStuff = ZeroOrMore(line14 | line15 | line16 | line17 | line18 | line19 | line20)
logEntry = sectionBeginning + ioKind + numThreads + numSeconds + driveOrMountPath + outstandingRequests + accessType + ioBlockSizeInKB + restOfLine + line2 + line3 + line4 + line5 + line6 + line7 + line8 + line9 + line10 + line11 + line12 + line13 + extraStuff
output.write("IO property, IOPS, MBPS, minLatencyInMS, avgLatencyInMS, maxLatencyInMS\n")
for tokens in logEntry.searchString(data):
	output.write("%(k)s%(t)s threads %(o)s queue %(f)s in %(b)s KB chunks,%(IOPS)s,%(MBPS)s,%(minLatency)s,%(avgLatency)s,%(maxLatency)s\n" % tokens)

Comments (5)

Utility script for instance level objects comparison between 2 SQL Server instances

Useful for SQL Server upgrade and migration, instance consistency check in the enterprise, et cetera.

Known assumptions:
1. pyodbc and proper SQL Server database driver setup and tested;
2. Windows Active Directory trusted connection to both instances. Login/password can also be used with very simple modification of the code;
3. Proper permission is set on both instances.

Known limitations:
1. It is based on object name comparison. In many cases, this is “good enough”;
2. It does not compare End Points or Server level triggers, although that should be added easily.

Command line usage:
scriptName -s sourceInstance -t targetInstance

import pyodbc, argparse
def getAllRows(cn, sql):
	cursor = cn.cursor()
	cursor.execute(sql)
	return cursor.fetchall()
def rowCompare(objectType, rowsSource, rowsTarget):
	if rowsSource == rowsTarget:
		print "%s(s) appeared to have been migrated." % objectType
	else:
		print "%s(s) not migrated yet:" % objectType
		for row in rowsSource:
			if row not in rowsTarget:
				print row[0]
parser = argparse.ArgumentParser(description='SQL Server migration/upgrade comparison: databases')
parser.add_argument('-s', '--source-server', help='Source server/instance',  required=True, dest='sInstance')
parser.add_argument('-t', '--target-server', help='Target server/instance', required=True, dest='dInstance')
argList = parser.parse_args()
try:
	cnSource = pyodbc.connect("DRIVER={SQL Server};SERVER=%s;DATABASE=master;Trusted_Connection=yes" % argList.sInstance)
	cnTarget = pyodbc.connect("DRIVER={SQL Server};SERVER=%s;DATABASE=master;Trusted_Connection=yes" % argList.dInstance)
except:
	print "Couldn't connect to %s. It is down or you might have had a typo."
"""1. Compare if all databases on the source server are present at the target server"""
print "\n"
sql = "select name from master.sys.databases order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('Database', rowsSource, rowsTarget)
"""2. Compare if all logins on the source server are present at the target server"""
print "\n"
sql = "select name from master.dbo.syslogins where name not like '##%' order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('Login', rowsSource, rowsTarget)
"""3. Compare if all linked servers on the source server are present at the target server"""
print "\n"
sql = "select name from master.sys.servers where server_id > 0 order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('Linked server', rowsSource, rowsTarget)
"""4. Compare if all jobs on the source server are present at the target server"""
print "\n"
sql = "select name from msdb.dbo.sysjobs order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('Job', rowsSource, rowsTarget)
"""5. Compare if all credentials on the source server are present at the target server"""
print "\n"
sql = "select name from master.sys.credentials order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('Credential', rowsSource, rowsTarget)
"""6. Compare if all system stored procedures inside master that start with sp_ on the source server are present at the target server"""
print "\n"
sql = "select name from master.sys.all_objects where type = 'p' and name like 'sp_%' and is_ms_shipped = 0 order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('System stored procedure', rowsSource, rowsTarget)

Comments

Search objects in SQL Server databases with pyodbc

Command line utility to search for objects based on names. Search results will be displayed in well formatted tabular format, left-justified.

1. pyodbc installed;
2. Use trusted authentication by default. Find the relevant code to adjust to login/password as needed;
3. -w for wild card search. Default is exact name match, case insensitive in almost all cases, depending on your SQL Server configuration;
4. If -d is not specified, all databases in instance will be searched, minus the ones that one not accessible at the moment. To specify database(s) you want to search into, use -d followed by database names. Separate them with comma if more than one;
5. Use -S to specify instance;
6. Good sample code for command line parameters processing and pyodbc usage;
7. Won’t work with SQL Server 2000 or lower.

Example 1: python objectSearch.py myobject -S instance1
Result: Database objects that are named myobject in instance1 will be displayed

Example 2: python objectSearch.py -S server1 -w -d db1,db2 myobject
Result: Database objects that has “myobject” in db1 and db2 in server1 will be displayed

import pyodbc, argparse
def pp(cursor, data=None, check_row_lengths=True):
    if not data:
        data = cursor.fetchall( )
    names = [  ]
    lengths = [  ]
    rules = [  ]
    for col, field_description in enumerate(cursor.description):
        field_name = field_description[0]
        names.append(field_name)
        field_length = field_description[2] or 12
        field_length = max(field_length, len(field_name))
        if check_row_lengths:
            # double-check field length, if it's unreliable
            data_length = max([ len(str(row[col])) for row in data ])
            field_length = max(field_length, data_length)
        lengths.append(field_length)
        rules.append('-' * field_length)
    format = " ".join(["%%-%ss" % l for l in lengths])
    result = [ format % tuple(names), format % tuple(rules) ]
    for row in data:
        result.append(format % tuple(row))
    return "\n".join(result)
def objectMatch(wildcardSearch, objectNamePattern, dbName, cn):
	if wildcardSearch:
		sql = "select '%s' as dbName, s.name schema_name, o.name, o.type from %s.sys.objects o inner join %s.sys.schemas s on o.schema_id = s.schema_id where o.name like '%%%s%%\' order by dbname, schema_name, o.name" % (dbName, dbName, dbName, objectNamePattern)
		cursor = cn.cursor()
		cursor.execute(sql)
		rows = cursor.fetchall()
		if len(rows) > 0:
			print pp(cursor, rows, 1)
	else:
		sql = "select '%s' as dbName, s.name schema_name, o.name, o.type from %s.sys.objects o inner join %s.sys.schemas s on o.schema_id = s.schema_id where o.name = '%s' order by dbname, schema_name, o.name" % (dbName, dbName, dbName, objectNamePattern)
		cursor = cn.cursor()
		cursor.execute(sql)
		rows = cursor.fetchall()
		if len(rows) > 0:
			print pp(cursor, rows, 1)
parser = argparse.ArgumentParser(description='SQL Server object search. Case insensitive')
parser.add_argument('-S', '--server', help='Instance you wish to connect to. myInstance is the default if not specified', dest='instance', default='myInstance')
parser.add_argument('-w', help='Wildcard search indicator. If specified, LIKE clause will be used for pattern matching. Otherwise it will look for an exact match. Exact match is default', action="store_true", dest='wild', default=False)
parser.add_argument('-d', '--database', help='Database(s) you want to search in. If more than one, separate them by comma only. It will search all databases by default', action="store", dest="dbs")
parser.add_argument('objectNamePattern', help='Object name pattern you want to search for', action="store")
argList = parser.parse_args()
try:
	cn = pyodbc.connect("DRIVER={SQL Server};SERVER=%s;DATABASE=master;Trusted_Connection=yes" % argList.instance)
except:
	print "Couldn't connect to %s. It is down or you might have had a typo." % argList.instance
if argList.dbs is None:
	cursor = cn.cursor()
	cursor.execute("select name from sys.databases where state = 0")
	rows = cursor.fetchall()
	for row in rows:
		objectMatch(argList.wild, argList.objectNamePattern, row.name, cn)
else:
	dbs = argList.dbs.split(',')
	for db in dbs:
		objectMatch(argList.wild, argList.objectNamePattern, db, cn)

Comments

Questions on Text processing with Python

No need to waste time on proving the importance of text processing, I suppose. Here is an automation use case I had in mind when I started my search: sucking out all domain\login inside a block of text.

Yes, I can build my own regular expressions, and I have done that in the past. But, another use case is log file processing: SQL Server, Apache, MySQL, and such. Therefore an existing module that is easy to code, read and maintain is better than me code everything. In the end, regular expression is still going to be used, but a layer of abstraction will help productivity and maintainability.

I came across 3 modules: pyparsing, SimpleParse, and NLTK. I am curious to hear your opinions on those 3 modules, or if you have suggestions other than the 3 mentioned here:

1. How easy/difficult to learn those modules? I haven’t tried SimpleParse or NLTK yet, but I have tried pyparsing, which looks easy to pick up and the author, Paul McGuire, is very helpful. NTLK might be an overkill for what I do, at first glance.

2. What about performance? In most of my use cases, this is probably not that important, but I’ve come across comments on StackOverflow saying that pyparsing does not perform very well when text volume is large.

3. What about support and ongoing development? Like I mentioned earlier, the author behind pyparsing seems to be very active in answering questions and incorporating new ideas.

Thanks in advance for any insights and Happy New Year!

PS, here are 2 solutions to get domain\login out with pyparsing that Paul helpfully provided when I asked:

from pyparsing import *
grammar = Combine(Word(alphanums) + "\\" + Word(alphanums))
matches = grammar.searchString("jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe")
for m in matches:
	print m[0]
########
for toks,start,end in grammar.scanString("jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe"):
	print "%d:%d:%s" % (start,end,toks[0])

Comments (3)

« Previous entries

Page optimized by WP Minify WordPress Plugin