Parse sqlio log file with Python pyparsing

A few weeks ago I posted some questions on 3 Python text processing modules: pyparsing, SimpleParse, and NLTK. Today I need to analyze a log file generated by sqlio. I decided to use pyparsing. I am pretty pleased with it.

I am aware that there are at least 2 utility scripts for this: one is in Perl and then other is in Windows PowerShell. But I wanted to write in Python.

Noteworthy points:

1. In analyzing and building up the grammar/structure of the text for Python, I opted for line by line analysis and coding. I think this brings clarity and is easier to read. It felt a bit tedious in building it up, but once it is done, it is pretty rewarding to see the end results;
2. After parsing, the script writes the results to a csv file. You can then open that in a spreadsheet program to create some charts. Seeing the results in a column diagram chart brings clarity and focus to the result. It worked pretty well for me.

All in all, a good and productive day. Below are the details. The top part is the sample log content to be parsed, followed by the program.

"""
c:\Program Files (x86)\SQLIO>sqlio -kW -t2 -s30 -dH -o1 -frandom -b64 -BH -LS Testfile.dat 
sqlio v1.5.SG
using system counter for latency timings, 2208037 counts per second
2 threads writing for 30 secs to file H:Testfile.dat
	using 64KB random IOs
	enabling multiple I/Os per thread with 1 outstanding
	buffering set to use hardware disk cache (but not file cache)
using current size: 24576 MB for file: H:Testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  6106.50
MBs/sec:   381.65
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 5
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 99  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

c:\Program Files (x86)\SQLIO>sqlio -kW -t2 -s30 -dH -o2 -frandom -b64 -BH -LS Testfile.dat 
sqlio v1.5.SG
using system counter for latency timings, 2208037 counts per second
2 threads writing for 30 secs to file H:Testfile.dat
	using 64KB random IOs
	enabling multiple I/Os per thread with 2 outstanding
	buffering set to use hardware disk cache (but not file cache)
using current size: 24576 MB for file: H:Testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec:  6951.19
MBs/sec:   434.44
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 6
histogram:
ms: 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 87 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
"""
from pyparsing import *

output = open("sample.csv", "w")
input = open("hDriveSqlioResults.txt", "r")
data = input.read()

#------------------------------------------------------------------------
# Define Grammars
#------------------------------------------------------------------------
integer = Word(nums)
singleLetter = Word(alphas, max=1)
realNumber = Combine(Word(nums) + '.' + Word(nums))
reading = Literal("reading")
writing = Literal("writing")
readingOrWriting = reading | writing
random = Literal("random")
sequential = Literal("sequential")
randomOrSequential = random | sequential

#Line 1 is useful. We want to suck out the juicy parts.
sectionBeginning = CaselessLiteral("c:\\program files (x86)\\sqlio>sqlio ")
ioKind = "-k" + singleLetter("k")
numThreads = "-t" + integer("t")
numSeconds = "-s" + integer("s")
driveOrMountPath = "-d" + singleLetter("d")
outstandingRequests = "-o" + integer("o")
accessType = "-f" + Word(alphas)("f")
ioBlockSizeInKB = "-b" + integer("b")

#Line 2 can be safely ignored
line2 = "sqlio v" + restOfLine

#Line 3 can be safely ignored
line3 = "using system counter for latency timings" + restOfLine

#Line 4 can be safely ignored
line4 = integer + "threads" + readingOrWriting + restOfLine

#Line 5 can be safely ignored
line5 = "using" + integer + "KB" + randomOrSequential + "IOs" + restOfLine

#Line 6 can be safely ignored
line6 = "enabling multiple " + restOfLine

#Line 7 can be safely ignored
line7 = "buffering set to use" + restOfLine

#Line 8 we should get the file size out
line8 = "using current size: " + integer("fileSize") + Word(alphas)("fileSizeUnit") + restOfLine

#Line 9 can be safely ignored
line9 = "initialization done" + restOfLine

#Line 10 can be safely ignored
line10 = "CUMULATIVE" + restOfLine

#Line 11 can be safely ignored
line11 = "throughput" + restOfLine

#Line 12 we want IOPS
line12 = "IOs/sec:" + realNumber("IOPS") + restOfLine

#Line 13 we want MBPS
line13 = "MBs/sec:" + realNumber("MBPS") + restOfLine

#Line 14 can be safely ignored
line14 = "latency" + restOfLine

#Line 15 we need to get minLatency out
line15 = "Min_Latency(ms):" + integer("minLatency") + restOfLine

#Line 16 we need to get avgLatency out
line16 = "Avg_Latency(ms):" + integer("avgLatency") + restOfLine

#Line 17 we need to get maxLatency out
line17 = "Max_Latency(ms):" + integer("maxLatency") + restOfLine

#Line 18 can be safely ignored
line18 = "histogram" + restOfLine

#Line 19 can be safely ignored
line19 = "ms:" + restOfLine

#Line 20 can be safely ignored
line20 = "\%:" + restOfLine

extraStuff = ZeroOrMore(line14 | line15 | line16 | line17 | line18 | line19 | line20)

logEntry = sectionBeginning + ioKind + numThreads + numSeconds + driveOrMountPath + outstandingRequests + accessType + ioBlockSizeInKB + restOfLine + line2 + line3 + line4 + line5 + line6 + line7 + line8 + line9 + line10 + line11 + line12 + line13 + extraStuff

output.write("IO property, IOPS, MBPS, minLatencyInMS, avgLatencyInMS, maxLatencyInMS\n")
for tokens in logEntry.searchString(data):
	output.write("%(k)s%(t)s threads %(o)s queue %(f)s in %(b)s KB chunks,%(IOPS)s,%(MBPS)s,%(minLatency)s,%(avgLatency)s,%(maxLatency)s\n" % tokens)

5 thoughts on “Parse sqlio log file with Python pyparsing”

  1. Love it! Nice to see that your log file cracking is going smoothly. Here are a couple of style points (adopt/reject at your own discretion):

    Ever since I introduced setResultsName() I’ve been unhappy with it visually – it really breaks up the grammar definition for me. But I think results names are an important feature, so I added a simplified syntax for defining them. What do you think about this reformatting of some of your parser:

    ioKind = “-k” + singleLetter(“k”)
    numThreads = “-t” + integer(“t”)
    numSeconds = “-s” + integer(“s”)
    driveOrMountPath = “-d” + singleLetter(“d”)
    outstandingRequests = “-o” + integer(“o”)
    accessType = “-f” + Word(alphas)(“f”)
    ioBlockSizeInKB = “-b” + integer(“b”)

    With your very nice results names, you can pass a ParseResults object as a mapping to Python’s string interpolator. See how the results names are used as “%(name)s” in the format string:

    output.write(“%(k)s%(t)s threads %(o)s queue %(f)s in %(b)s KB chunks,%(IOPS)s,%(MBPS)s,%(minLatency)s,%(avgLatency)s,%(maxLatency)s\n” % tokens)

    But these are nits, your code looks pretty clean to me – parse on!

    — Paul

  2. Paul,

    I am in total agreement with your comment on setResultsName(). I like the simplified syntax much better: much more intuitive and “flows” better.

    I tested with the simplified syntax and string interpolation, and it works. That’s what I will use in production, but I will leave the original post as is for now.

    Thanks a lot for your comment and your excellent work!

  3. Paul,

    I have updated the code listed here with the newer simplified syntax. I think that is the right thing to do for me and future visitors.

    Thanks again.

  4. One last bit that might help in your parsing speed a little. Over time, I’ve found that parsing terminals, especially real numbers, using “Combine(Word(nums) + ‘.’ + …etc.)” can be a real slog at parse time, and the may be a good place to bite the bullet and just use a small Regex – in your case this would be:

    realnumber = Regex(r”\d+\.\d+”)

    Since realnumber is a terminal, it will be tested *many* times during parsing, so a single Regex object will parse faster than a composite Combine(whatever). You don’t have to overdo this though. Where possible, Word and oneOf use regular expressions internally, so converting them to backslash-ridden Regex’s won’t run any faster. The single case the comes up most often though is definition of a real number. Give it a try!

    — Paul

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.