A few weeks ago I posted some questions on three Python text-processing modules: pyparsing, SimpleParse, and NLTK. Today I needed to analyze a log file generated by sqlio, and I decided to use pyparsing. I am pretty pleased with it.
I am aware that there are at least two utility scripts for this already: one in Perl and the other in Windows PowerShell. But I wanted to write one in Python.
Noteworthy points:
1. In analyzing the text and building up the grammar, I opted for a line-by-line approach: one pyparsing expression per line of the log. I think this brings clarity and makes the code easier to read. Building it up felt a bit tedious, but once it is done, seeing the end result is pretty rewarding;
2. After parsing, the script writes the results to a CSV file, which you can then open in a spreadsheet program to create some charts. Seeing the results in a column chart brings clarity and focus. It worked pretty well for me.
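(If you would rather stay in Python for the charting step too, something along the lines of the sketch below should also work; it assumes matplotlib is installed and that sample.csv has the column layout my script writes out. A spreadsheet does the job just as well.)

[sourcecode language="python"]
# A rough sketch: chart the IOPS column from sample.csv with matplotlib
# (assumes matplotlib is installed; a spreadsheet works just as well).
import csv
import matplotlib.pyplot as plt

labels, iops = [], []
with open("sample.csv") as f:
    reader = csv.reader(f)
    next(reader)                    # skip the header row
    for row in reader:
        labels.append(row[0])       # e.g. "W2 threads 1 queue random in 64 KB chunks"
        iops.append(float(row[1]))  # the IOPS column

plt.bar(range(len(iops)), iops)
plt.xticks(range(len(iops)), labels, rotation=45, ha="right")
plt.ylabel("IOs/sec")
plt.tight_layout()
plt.show()
[/sourcecode]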
All in all, a good and productive day. Below are the details. The top part is the sample log content to be parsed, followed by the program.
[sourcecode language="python"]
"""
c:\Program Files (x86)\SQLIO>sqlio -kW -t2 -s30 -dH -o1 -frandom -b64 -BH -LS Testfile.dat
sqlio v1.5.SG
using system counter for latency timings, 2208037 counts per second
2 threads writing for 30 secs to file H:Testfile.dat
using 64KB random IOs
enabling multiple I/Os per thread with 1 outstanding
buffering set to use hardware disk cache (but not file cache)
using current size: 24576 MB for file: H:Testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 6106.50
MBs/sec: 381.65
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 5
histogram:
ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c:\Program Files (x86)\SQLIO>sqlio -kW -t2 -s30 -dH -o2 -frandom -b64 -BH -LS Testfile.dat
sqlio v1.5.SG
using system counter for latency timings, 2208037 counts per second
2 threads writing for 30 secs to file H:Testfile.dat
using 64KB random IOs
enabling multiple I/Os per thread with 2 outstanding
buffering set to use hardware disk cache (but not file cache)
using current size: 24576 MB for file: H:Testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 6951.19
MBs/sec: 434.44
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 6
histogram:
ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 87 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
"""
from pyparsing import *
output = open("sample.csv", "w")
input = open("hDriveSqlioResults.txt", "r")
data = input.read()
#------------------------------------------------------------------------
# Define Grammars
#------------------------------------------------------------------------
integer = Word(nums)
singleLetter = Word(alphas, max=1)
realNumber = Combine(Word(nums) + '.' + Word(nums))
reading = Literal("reading")
writing = Literal("writing")
readingOrWriting = reading | writing
random = Literal("random")
sequential = Literal("sequential")
randomOrSequential = random | sequential
#Line 1 is useful. We want to suck out the juicy parts.
sectionBeginning = CaselessLiteral("c:\\program files (x86)\\sqlio>sqlio ")
ioKind = "-k" + singleLetter("k")
numThreads = "-t" + integer("t")
numSeconds = "-s" + integer("s")
driveOrMountPath = "-d" + singleLetter("d")
outstandingRequests = "-o" + integer("o")
accessType = "-f" + Word(alphas)("f")
ioBlockSizeInKB = "-b" + integer("b")
#Line 2 can be safely ignored
line2 = "sqlio v" + restOfLine
#Line 3 can be safely ignored
line3 = "using system counter for latency timings" + restOfLine
#Line 4 can be safely ignored
line4 = integer + "threads" + readingOrWriting + restOfLine
#Line 5 can be safely ignored
line5 = "using" + integer + "KB" + randomOrSequential + "IOs" + restOfLine
#Line 6 can be safely ignored
line6 = "enabling multiple " + restOfLine
#Line 7 can be safely ignored
line7 = "buffering set to use" + restOfLine
#Line 8 we should get the file size out
line8 = "using current size: " + integer("fileSize") + Word(alphas)("fileSizeUnit") + restOfLine
#Line 9 can be safely ignored
line9 = "initialization done" + restOfLine
#Line 10 can be safely ignored
line10 = "CUMULATIVE" + restOfLine
#Line 11 can be safely ignored
line11 = "throughput" + restOfLine
#Line 12 we want IOPS
line12 = "IOs/sec:" + realNumber("IOPS") + restOfLine
#Line 13 we want MBPS
line13 = "MBs/sec:" + realNumber("MBPS") + restOfLine
#Line 14 can be safely ignored
line14 = "latency" + restOfLine
#Line 15 we need to get minLatency out
line15 = "Min_Latency(ms):" + integer("minLatency") + restOfLine
#Line 16 we need to get avgLatency out
line16 = "Avg_Latency(ms):" + integer("avgLatency") + restOfLine
#Line 17 we need to get maxLatency out
line17 = "Max_Latency(ms):" + integer("maxLatency") + restOfLine
#Line 18 can be safely ignored
line18 = "histogram" + restOfLine
#Line 19 can be safely ignored
line19 = "ms:" + restOfLine
#Line 20 can be safely ignored
line20 = "%:" + restOfLine
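#Lines 14 through 20 (latency metrics and the histogram) - gather whichever of them follow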
extraStuff = ZeroOrMore(line14 | line15 | line16 | line17 | line18 | line19 | line20)
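#One complete log section: the sqlio command-line switches, lines 2 through 13, then the trailing latency/histogram lines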
logEntry = sectionBeginning + ioKind + numThreads + numSeconds + driveOrMountPath + outstandingRequests + accessType + ioBlockSizeInKB + restOfLine + line2 + line3 + line4 + line5 + line6 + line7 + line8 + line9 + line10 + line11 + line12 + line13 + extraStuff
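#Write a CSV header, then one row per sqlio run found in the log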
output.write("IO property, IOPS, MBPS, minLatencyInMS, avgLatencyInMS, maxLatencyInMS\n")
for tokens in logEntry.searchString(data):
    output.write("%(k)s%(t)s threads %(o)s queue %(f)s in %(b)s KB chunks,%(IOPS)s,%(MBPS)s,%(minLatency)s,%(avgLatency)s,%(maxLatency)s\n" % tokens)

output.close()
input.close()
[/sourcecode]
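For reference, with the sample log at the top, the resulting CSV should look roughly like this (the numbers are simply the IOPS, MBPS and latency figures from the two runs above):

[sourcecode language="text"]
IO property, IOPS, MBPS, minLatencyInMS, avgLatencyInMS, maxLatencyInMS
W2 threads 1 queue random in 64 KB chunks,6106.50,381.65,0,0,5
W2 threads 2 queue random in 64 KB chunks,6951.19,434.44,0,0,6
[/sourcecode]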
5 responses to "Parse sqlio log file with Python pyparsing"
Love it! Nice to see that your log file cracking is going smoothly. Here are a couple of style points (adopt/reject at your own discretion):
Ever since I introduced setResultsName() I’ve been unhappy with it visually – it really breaks up the grammar definition for me. But I think results names are an important feature, so I added a simplified syntax for defining them. What do you think about this reformatting of some of your parser:
ioKind = "-k" + singleLetter("k")
numThreads = "-t" + integer("t")
numSeconds = "-s" + integer("s")
driveOrMountPath = "-d" + singleLetter("d")
outstandingRequests = "-o" + integer("o")
accessType = "-f" + Word(alphas)("f")
ioBlockSizeInKB = "-b" + integer("b")
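For comparison, the same expressions written with the explicit setResultsName() call would read:
ioKind = "-k" + singleLetter.setResultsName("k")
numThreads = "-t" + integer.setResultsName("t")
and so on down the list.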
With your very nice results names, you can pass a ParseResults object as a mapping to Python’s string interpolator. See how the results names are used as “%(name)s” in the format string:
output.write("%(k)s%(t)s threads %(o)s queue %(f)s in %(b)s KB chunks,%(IOPS)s,%(MBPS)s,%(minLatency)s,%(avgLatency)s,%(maxLatency)s\n" % tokens)
But these are nits, your code looks pretty clean to me – parse on!
— Paul
Paul,
I am in total agreement with your comment on setResultsName(). I like the simplified syntax much better: much more intuitive and “flows” better.
I tested with the simplified syntax and string interpolation, and it works. That’s what I will use in production, but I will leave the original post as is for now.
Thanks a lot for your comment and your excellent work!
Paul,
I have updated the code listed here with the newer simplified syntax. I think that is the right thing to do for me and future visitors.
Thanks again.
One last bit that might help your parsing speed a little. Over time, I've found that parsing terminals, especially real numbers, using "Combine(Word(nums) + '.' + …etc.)" can be a real slog at parse time, and this may be a good place to bite the bullet and just use a small Regex; in your case this would be:
realnumber = Regex(r"\d+\.\d+")
Since realnumber is a terminal, it will be tested *many* times during parsing, so a single Regex object will parse faster than a composite Combine(whatever). You don't have to overdo this, though. Where possible, Word and oneOf use regular expressions internally, so converting them to backslash-ridden Regexes won't run any faster. The single case that comes up most often, though, is the definition of a real number. Give it a try!
— Paul