Running external processes/programs with Python subprocess

Note: it is useful to know when the external process started and how long it ran, and both are captured in the code snippet below. For the “when” part, it would have been nice to record the time zone as well, but due to the limitation mentioned here, that is not as easy as it sounds with the datetime module.

It seems Python’s time module handles time zones better, but its range ends at year 2038 on many platforms, which concerned me slightly, so I decided against it. The datetime module’s upper limit is year 9999. That is pretty far into the future, assuming we humans haven’t managed to make ourselves extinct by then.

Speaking of Linux/Unix epoch time, I wonder if we will hit the year 2038 issue, like the much-talked-about Y2K back in the day 🙂
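The two ceilings mentioned above are easy to sanity-check. The year 9999 limit is part of datetime’s documented API; the 2038 limit only bites on platforms where time values are 32-bit:

```python
from datetime import datetime

# datetime's documented range runs from MINYEAR (1) to MAXYEAR (9999)
print(datetime.min.year)  # 1
print(datetime.max.year)  # 9999
```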

import shlex, subprocess, sys
from datetime import datetime

def runCmd(cmd):
    """Run an external command and return (returnCode, stdout, stderr)."""
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)

    # communicate() waits for the process to exit and avoids pipe deadlocks
    out, err = proc.communicate()
    return (proc.returncode, out, err)

startTime = datetime.now()

# "df -h" is a *nix command. To test on Windows, replace it with "notepad" or something like that.
returnCode, stdOut, stdErr = runCmd('df -h')

endTime = datetime.now()

print "The command started at %s and ran for %s\n" % (startTime, endTime - startTime)

print stdOut
if returnCode:
    print >> sys.stderr, "There was an error (%d):\n" % returnCode
    print >> sys.stderr, stdErr

Update:

When I first posted this, I wrote that you could test it on Windows with the “dir” command. Actually, if you do, you will encounter the error below:

C:\Users\haidong\Documents\work\python>tt.py
Traceback (most recent call last):
  File "C:\Users\haidong\Documents\work\python\tt.py", line 14, in <module>
    returnCode, stdOut, stdErr = runCmd('df -h')
  File "C:\Users\haidong\Documents\work\python\tt.py", line 5, in runCmd
    proc = subprocess.Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
  File "C:\Python27\lib\subprocess.py", line 672, in __init__
    errread, errwrite)
  File "C:\Python27\lib\subprocess.py", line 882, in _execute_child
    startupinfo)
WindowsError: [Error 2] The system cannot find the file specified

The reason is that dir is not a standalone executable; it is a built-in command of the Windows command interpreter, so it needs to run on top of the command-line process. If you replace “dir” with an independent executable, it should work. I have tried both “notepad” and “sqlcmd”, and both worked fine for me. “sqlcmd” is really the one I am after.
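If you really do need a shell built-in like dir, one workaround is to launch it through the command interpreter explicitly. A sketch using the modern subprocess.run (Python 3.7+); echo stands in here because it is a built-in on both cmd.exe and POSIX shells:

```python
import subprocess

# Shell built-ins (dir, echo, set, ...) have no executable of their own,
# so shell=True hands the command string to the platform's interpreter.
result = subprocess.run("echo hello", shell=True, capture_output=True, text=True)
print(result.returncode)      # 0
print(result.stdout.strip())  # hello

# On Windows, the explicit equivalent would be something like:
# subprocess.run(["cmd", "/c", "dir"], capture_output=True, text=True)
```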

Parse sqlio log file with Python pyparsing

A few weeks ago I posted some questions on 3 Python text processing modules: pyparsing, SimpleParse, and NLTK. Today I needed to analyze a log file generated by sqlio. I decided to use pyparsing, and I am pretty pleased with it.

I am aware that there are at least 2 utility scripts for this already: one in Perl and the other in Windows PowerShell. But I wanted to write one in Python.

Noteworthy points:

1. In analyzing the text and building up its grammar/structure for pyparsing, I opted for line-by-line analysis and coding. I think this brings clarity and makes the code easier to read. Building it up felt a bit tedious, but once it is done, seeing the end results is pretty rewarding;
2. After parsing, the script writes the results to a csv file, which you can then open in a spreadsheet program to create charts. Seeing the results in a column chart brings clarity and focus. It worked pretty well for me.
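For readers new to pyparsing, the mechanism the script leans on is “results names”: attach a name to a sub-expression, and the matched value can be pulled out by key afterwards. A minimal sketch against one line of the sqlio output:

```python
from pyparsing import Combine, Literal, Word, nums

# a real number such as 6106.50, given the results name "IOPS"
realNumber = Combine(Word(nums) + "." + Word(nums))
iopsLine = Literal("IOs/sec:") + realNumber("IOPS")

tokens = iopsLine.parseString("IOs/sec:  6106.50")
print(tokens["IOPS"])  # 6106.50
```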

All in all, a good and productive day. Below are the details. The top part is the sample log content to be parsed, followed by the program.

"""
c:\Program Files (x86)\SQLIO>sqlio -kW -t2 -s30 -dH -o1 -frandom -b64 -BH -LS Testfile.dat 
sqlio v1.5.SG
using system counter for latency timings, 2208037 counts per second
2 threads writing for 30 secs to file H:Testfile.dat
	using 64KB random IOs
	enabling multiple I/Os per thread with 1 outstanding
	buffering set to use hardware disk cache (but not file cache)
using current size: 24576 MB for file: H:Testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  6106.50
MBs/sec:   381.65
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 5
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 99  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

c:\Program Files (x86)\SQLIO>sqlio -kW -t2 -s30 -dH -o2 -frandom -b64 -BH -LS Testfile.dat 
sqlio v1.5.SG
using system counter for latency timings, 2208037 counts per second
2 threads writing for 30 secs to file H:Testfile.dat
	using 64KB random IOs
	enabling multiple I/Os per thread with 2 outstanding
	buffering set to use hardware disk cache (but not file cache)
using current size: 24576 MB for file: H:Testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  6951.19
MBs/sec:   434.44
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 6
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 87 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
"""
from pyparsing import *

output = open("sample.csv", "w")
input = open("hDriveSqlioResults.txt", "r")
data = input.read()

#------------------------------------------------------------------------
# Define Grammars
#------------------------------------------------------------------------
integer = Word(nums)
singleLetter = Word(alphas, max=1)
realNumber = Combine(Word(nums) + '.' + Word(nums))
reading = Literal("reading")
writing = Literal("writing")
readingOrWriting = reading | writing
random = Literal("random")
sequential = Literal("sequential")
randomOrSequential = random | sequential

#Line 1 is useful. We want to suck out the juicy parts.
sectionBeginning = CaselessLiteral("c:\\program files (x86)\\sqlio>sqlio ")
ioKind = "-k" + singleLetter("k")
numThreads = "-t" + integer("t")
numSeconds = "-s" + integer("s")
driveOrMountPath = "-d" + singleLetter("d")
outstandingRequests = "-o" + integer("o")
accessType = "-f" + Word(alphas)("f")
ioBlockSizeInKB = "-b" + integer("b")

#Line 2 can be safely ignored
line2 = "sqlio v" + restOfLine

#Line 3 can be safely ignored
line3 = "using system counter for latency timings" + restOfLine

#Line 4 can be safely ignored
line4 = integer + "threads" + readingOrWriting + restOfLine

#Line 5 can be safely ignored
line5 = "using" + integer + "KB" + randomOrSequential + "IOs" + restOfLine

#Line 6 can be safely ignored
line6 = "enabling multiple " + restOfLine

#Line 7 can be safely ignored
line7 = "buffering set to use" + restOfLine

#Line 8 we should get the file size out
line8 = "using current size: " + integer("fileSize") + Word(alphas)("fileSizeUnit") + restOfLine

#Line 9 can be safely ignored
line9 = "initialization done" + restOfLine

#Line 10 can be safely ignored
line10 = "CUMULATIVE" + restOfLine

#Line 11 can be safely ignored
line11 = "throughput" + restOfLine

#Line 12 we want IOPS
line12 = "IOs/sec:" + realNumber("IOPS") + restOfLine

#Line 13 we want MBPS
line13 = "MBs/sec:" + realNumber("MBPS") + restOfLine

#Line 14 can be safely ignored
line14 = "latency" + restOfLine

#Line 15 we need to get minLatency out
line15 = "Min_Latency(ms):" + integer("minLatency") + restOfLine

#Line 16 we need to get avgLatency out
line16 = "Avg_Latency(ms):" + integer("avgLatency") + restOfLine

#Line 17 we need to get maxLatency out
line17 = "Max_Latency(ms):" + integer("maxLatency") + restOfLine

#Line 18 can be safely ignored
line18 = "histogram" + restOfLine

#Line 19 can be safely ignored
line19 = "ms:" + restOfLine

#Line 20 can be safely ignored
line20 = "%:" + restOfLine

extraStuff = ZeroOrMore(line14 | line15 | line16 | line17 | line18 | line19 | line20)

logEntry = sectionBeginning + ioKind + numThreads + numSeconds + driveOrMountPath + outstandingRequests + accessType + ioBlockSizeInKB + restOfLine + line2 + line3 + line4 + line5 + line6 + line7 + line8 + line9 + line10 + line11 + line12 + line13 + extraStuff

output.write("IO property, IOPS, MBPS, minLatencyInMS, avgLatencyInMS, maxLatencyInMS\n")
for tokens in logEntry.searchString(data):
	output.write("%(k)s%(t)s threads %(o)s queue %(f)s in %(b)s KB chunks,%(IOPS)s,%(MBPS)s,%(minLatency)s,%(avgLatency)s,%(maxLatency)s\n" % tokens)
output.close()
input.close()

Utility script for instance level objects comparison between 2 SQL Server instances

Useful for SQL Server upgrades and migrations, instance consistency checks across the enterprise, et cetera.

Known assumptions:
1. pyodbc and a proper SQL Server database driver are installed and tested;
2. Windows Active Directory trusted connections to both instances. Login/password authentication can also be used with a very simple modification of the code;
3. Proper permissions are set on both instances.

Known limitations:
1. It is based on object name comparison. In many cases, this is “good enough”;
2. It does not compare endpoints or server-level triggers, although those could be added easily.

Command line usage:
scriptName -s sourceInstance -t targetInstance
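Under the hood, the comparison is just a name diff between two result sets. In plain Python the idea looks like this (the database names are made up for illustration):

```python
source = ["SalesDB", "HRDB", "StagingDB"]  # names found on the source instance
target = ["SalesDB", "HRDB"]               # names found on the target instance

# anything present on the source but missing on the target
missing = [name for name in source if name not in target]
print(missing)  # ['StagingDB']
```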

import pyodbc, argparse

def getAllRows(cn, sql):
	cursor = cn.cursor()
	cursor.execute(sql)
	return cursor.fetchall()

def rowCompare(objectType, rowsSource, rowsTarget):
	if rowsSource == rowsTarget:
		print "%s(s) appear to have been migrated." % objectType
	else:
		print "%s(s) not migrated yet:" % objectType
		for row in rowsSource:
			if row not in rowsTarget:
				print row[0]

parser = argparse.ArgumentParser(description='SQL Server migration/upgrade comparison: databases')
parser.add_argument('-s', '--source-server', help='Source server/instance',  required=True, dest='sInstance')
parser.add_argument('-t', '--target-server', help='Target server/instance', required=True, dest='dInstance')

argList = parser.parse_args()

try:
	cnSource = pyodbc.connect("DRIVER={SQL Server};SERVER=%s;DATABASE=master;Trusted_Connection=yes" % argList.sInstance) 
	cnTarget = pyodbc.connect("DRIVER={SQL Server};SERVER=%s;DATABASE=master;Trusted_Connection=yes" % argList.dInstance) 
except pyodbc.Error:
	print "Couldn't connect to %s or %s. One of them is down or you might have had a typo." % (argList.sInstance, argList.dInstance)
	raise SystemExit(1)

"""1. Compare if all databases on the source server are present at the target server"""
print "\n"
sql = "select name from master.sys.databases order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('Database', rowsSource, rowsTarget)

"""2. Compare if all logins on the source server are present at the target server"""
print "\n"
sql = "select name from master.dbo.syslogins where name not like '##%' order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('Login', rowsSource, rowsTarget)

"""3. Compare if all linked servers on the source server are present at the target server"""
print "\n"
sql = "select name from master.sys.servers where server_id > 0 order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('Linked server', rowsSource, rowsTarget)

"""4. Compare if all jobs on the source server are present at the target server"""
print "\n"
sql = "select name from msdb.dbo.sysjobs order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('Job', rowsSource, rowsTarget)

"""5. Compare if all credentials on the source server are present at the target server"""
print "\n"
sql = "select name from master.sys.credentials order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('Credential', rowsSource, rowsTarget)

"""6. Compare if all system stored procedures inside master that start with sp_ on the source server are present at the target server"""
print "\n"
sql = "select name from master.sys.all_objects where type = 'p' and name like 'sp_%' and is_ms_shipped = 0 order by name"
rowsSource = getAllRows(cnSource, sql)
rowsTarget = getAllRows(cnTarget, sql)
rowCompare('System stored procedure', rowsSource, rowsTarget)

Search objects in SQL Server databases with pyodbc

Command line utility to search for objects by name. Search results are displayed in a well-formatted, left-justified tabular layout.

1. pyodbc installed;
2. Uses trusted authentication by default. Find the relevant code and adjust it to login/password as needed;
3. -w enables wildcard search. The default is an exact name match, case insensitive in most cases, depending on your SQL Server collation settings;
4. If -d is not specified, all databases in the instance will be searched, minus the ones that are not accessible at the moment. To specify the database(s) you want to search, use -d followed by the database names, separated by commas if there is more than one;
5. Use -S to specify the instance;
6. Good sample code for command line parameter processing and pyodbc usage;
7. Won’t work with SQL Server 2000 or lower.

Example 1: python objectSearch.py myobject -S instance1
Result: Database objects named myobject in instance1 will be displayed

Example 2: python objectSearch.py -S server1 -w -d db1,db2 myobject
Result: Database objects whose names contain “myobject” in db1 and db2 on server1 will be displayed

import pyodbc, argparse

def pp(cursor, data=None, check_row_lengths=True):
    if not data:
        data = cursor.fetchall()
    names = []
    lengths = []
    rules = []
    for col, field_description in enumerate(cursor.description):
        field_name = field_description[0]
        names.append(field_name)
        field_length = field_description[2] or 12
        field_length = max(field_length, len(field_name))
        if check_row_lengths:
            # double-check field length, if it's unreliable
            data_length = max([ len(str(row[col])) for row in data ])
            field_length = max(field_length, data_length)
        lengths.append(field_length)
        rules.append('-' * field_length)
    fmt = " ".join(["%%-%ss" % l for l in lengths])
    result = [fmt % tuple(names), fmt % tuple(rules)]
    for row in data:
        result.append(fmt % tuple(row))
    return "\n".join(result)
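The left justification in pp comes from building a format string at run time: each column contributes a `%-<width>s` specifier sized to its widest value. For example, with two columns of widths 8 and 4 (widths chosen arbitrarily here):

```python
lengths = [8, 4]

# "%%-%ss" % 8 yields "%-8s", i.e. left-justify and pad to 8 characters
fmt = " ".join(["%%-%ss" % l for l in lengths])
print(fmt)                     # %-8s %-4s
print(fmt % ("name", "type"))
print(fmt % ("dbo.foo", "U"))
```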

def objectMatch(wildcardSearch, objectNamePattern, dbName, cn):
	if wildcardSearch:
		predicate = "o.name like '%%%s%%'" % objectNamePattern
	else:
		predicate = "o.name = '%s'" % objectNamePattern
	sql = "select '%s' as dbName, s.name schema_name, o.name, o.type from %s.sys.objects o inner join %s.sys.schemas s on o.schema_id = s.schema_id where %s order by dbName, schema_name, o.name" % (dbName, dbName, dbName, predicate)
	cursor = cn.cursor()
	cursor.execute(sql)
	rows = cursor.fetchall()
	if len(rows) > 0:
		print pp(cursor, rows, 1)

parser = argparse.ArgumentParser(description='SQL Server object search. Case insensitive')
parser.add_argument('-S', '--server', help='Instance you wish to connect to. myInstance is the default if not specified', dest='instance', default='myInstance')
parser.add_argument('-w', help='Wildcard search indicator. If specified, LIKE clause will be used for pattern matching. Otherwise it will look for an exact match. Exact match is default', action="store_true", dest='wild', default=False)
parser.add_argument('-d', '--database', help='Database(s) you want to search in. If more than one, separate them by comma only. It will search all databases by default', action="store", dest="dbs")
parser.add_argument('objectNamePattern', help='Object name pattern you want to search for', action="store")

argList = parser.parse_args()

try:
	cn = pyodbc.connect("DRIVER={SQL Server};SERVER=%s;DATABASE=master;Trusted_Connection=yes" % argList.instance)
except pyodbc.Error:
	print "Couldn't connect to %s. It is down or you might have had a typo." % argList.instance
	raise SystemExit(1)

if argList.dbs is None:
	cursor = cn.cursor()
	cursor.execute("select name from sys.databases where state = 0")
	rows = cursor.fetchall()
	for row in rows:
		objectMatch(argList.wild, argList.objectNamePattern, row.name, cn)
else:
	dbs = argList.dbs.split(',')
	for db in dbs:
		objectMatch(argList.wild, argList.objectNamePattern, db, cn)

Questions on Text processing with Python

No need to waste time proving the importance of text processing, I suppose. Here is an automation use case I had in mind when I started my search: pulling every domain\login pair out of a block of text.

Yes, I can build my own regular expressions, and I have done that in the past. But another use case is log file processing: SQL Server, Apache, MySQL, and such. For that, an existing module that is easy to code, read, and maintain beats coding everything myself. In the end, regular expressions are still used under the hood, but a layer of abstraction helps productivity and maintainability.
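For comparison, the hand-rolled regex route for the domain\login case is short too (the pattern below is my own sketch: word characters, a literal backslash, then word characters):

```python
import re

text = r"jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe"
# word characters, a literal backslash, then word characters
matches = re.findall(r"\w+\\\w+", text)
print(matches)  # ['fjlwowe\\jfoew', 'lwfweolwfo\\jofweojw']
```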

I came across 3 modules: pyparsing, SimpleParse, and NLTK. I am curious to hear your opinions on those 3 modules, or if you have suggestions other than the 3 mentioned here:

1. How easy/difficult is it to learn these modules? I haven’t tried SimpleParse or NLTK yet, but I have tried pyparsing, which looks easy to pick up, and its author, Paul McGuire, is very helpful. NLTK might be overkill for what I do, at first glance.

2. What about performance? In most of my use cases this is probably not that important, but I have come across comments on Stack Overflow saying that pyparsing does not perform very well when the text volume is large.

3. What about support and ongoing development? Like I mentioned earlier, the author behind pyparsing seems to be very active in answering questions and incorporating new ideas.

Thanks in advance for any insights and Happy New Year!

PS, here are 2 solutions for getting domain\login out with pyparsing that Paul helpfully provided when I asked:

from pyparsing import *
grammar = Combine(Word(alphanums) + "\\" + Word(alphanums))


matches = grammar.searchString("jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe")
for m in matches:
	print m[0]
########
for toks,start,end in grammar.scanString("jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe"):
	print "%d:%d:%s" % (start,end,toks[0])