A few weeks ago I posted some questions about three Python text-processing modules: pyparsing, SimpleParse, and NLTK. Today I needed to analyze a log file generated by sqlio. I decided to use pyparsing, and I am pretty pleased with it.
I am aware that there are at least two utility scripts for this: one in Perl and the other in Windows PowerShell. But I wanted to write one in Python.
Noteworthy points:
1. In analyzing and building up the grammar/structure of the text, I opted for line-by-line analysis and coding. I think this brings clarity and makes the grammar easier to read. Building it up felt a bit tedious, but once it was done, seeing the end results was pretty rewarding;
2. After parsing, the script writes the results to a CSV file, which you can then open in a spreadsheet program to create charts. Seeing the results in a column chart brings clarity and focus. It worked pretty well for me.
All in all, a good and productive day. Below are the details. The top part is the sample log content to be parsed, followed by the program.
""" c:\Program Files (x86)\SQLIO>sqlio -kW -t2 -s30 -dH -o1 -frandom -b64 -BH -LS Testfile.dat sqlio v1.5.SG using system counter for latency timings, 2208037 counts per second 2 threads writing for 30 secs to file H:Testfile.dat using 64KB random IOs enabling multiple I/Os per thread with 1 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 24576 MB for file: H:Testfile.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 6106.50 MBs/sec: 381.65 latency metrics: Min_Latency(ms): 0 Avg_Latency(ms): 0 Max_Latency(ms): 5 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 99 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 c:\Program Files (x86)\SQLIO>sqlio -kW -t2 -s30 -dH -o2 -frandom -b64 -BH -LS Testfile.dat sqlio v1.5.SG using system counter for latency timings, 2208037 counts per second 2 threads writing for 30 secs to file H:Testfile.dat using 64KB random IOs enabling multiple I/Os per thread with 2 outstanding buffering set to use hardware disk cache (but not file cache) using current size: 24576 MB for file: H:Testfile.dat initialization done CUMULATIVE DATA: throughput metrics: IOs/sec: 6951.19 MBs/sec: 434.44 latency metrics: Min_Latency(ms): 0 Avg_Latency(ms): 0 Max_Latency(ms): 6 histogram: ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+ %: 87 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 """ from pyparsing import * output = open("sample.csv", "w") input = open("hDriveSqlioResults.txt", "r") data = input.read() #------------------------------------------------------------------------ # Define Grammars #------------------------------------------------------------------------ integer = Word(nums) singleLetter = Word(alphas, max=1) realNumber = Combine(Word(nums) + '.' 
+ Word(nums)) reading = Literal("reading") writing = Literal("writing") readingOrWriting = reading | writing random = Literal("random") sequential = Literal("sequential") randomOrSequential = random | sequential #Line 1 is useful. We want to suck out the juicy parts. sectionBeginning = CaselessLiteral("c:\\program files (x86)\\sqlio>sqlio ") ioKind = "-k" + singleLetter("k") numThreads = "-t" + integer("t") numSeconds = "-s" + integer("s") driveOrMountPath = "-d" + singleLetter("d") outstandingRequests = "-o" + integer("o") accessType = "-f" + Word(alphas)("f") ioBlockSizeInKB = "-b" + integer("b") #Line 2 can be safely ignored line2 = "sqlio v" + restOfLine #Line 3 can be safely ignored line3 = "using system counter for latency timings" + restOfLine #Line 4 can be safely ignored line4 = integer + "threads" + readingOrWriting + restOfLine #Line 5 can be safely ignored line5 = "using" + integer + "KB" + randomOrSequential + "IOs" + restOfLine #Line 6 can be safely ignored line6 = "enabling multiple " + restOfLine #Line 7 can be safely ignored line7 = "buffering set to use" + restOfLine #Line 8 we should get the file size out line8 = "using current size: " + integer("fileSize") + Word(alphas)("fileSizeUnit") + restOfLine #Line 9 can be safely ignored line9 = "initialization done" + restOfLine #Line 10 can be safely ignored line10 = "CUMULATIVE" + restOfLine #Line 11 can be safely ignored line11 = "throughput" + restOfLine #Line 12 we want IOPS line12 = "IOs/sec:" + realNumber("IOPS") + restOfLine #Line 13 we want MBPS line13 = "MBs/sec:" + realNumber("MBPS") + restOfLine #Line 14 can be safely ignored line14 = "latency" + restOfLine #Line 15 we need to get minLatency out line15 = "Min_Latency(ms):" + integer("minLatency") + restOfLine #Line 16 we need to get avgLatency out line16 = "Avg_Latency(ms):" + integer("avgLatency") + restOfLine #Line 17 we need to get maxLatency out line17 = "Max_Latency(ms):" + integer("maxLatency") + restOfLine #Line 18 can be safely 
ignored line18 = "histogram" + restOfLine #Line 19 can be safely ignored line19 = "ms:" + restOfLine #Line 20 can be safely ignored line20 = "\%:" + restOfLine extraStuff = ZeroOrMore(line14 | line15 | line16 | line17 | line18 | line19 | line20) logEntry = sectionBeginning + ioKind + numThreads + numSeconds + driveOrMountPath + outstandingRequests + accessType + ioBlockSizeInKB + restOfLine + line2 + line3 + line4 + line5 + line6 + line7 + line8 + line9 + line10 + line11 + line12 + line13 + extraStuff output.write("IO property, IOPS, MBPS, minLatencyInMS, avgLatencyInMS, maxLatencyInMS\n") for tokens in logEntry.searchString(data): output.write("%(k)s%(t)s threads %(o)s queue %(f)s in %(b)s KB chunks,%(IOPS)s,%(MBPS)s,%(minLatency)s,%(avgLatency)s,%(maxLatency)s\n" % tokens)
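For readers new to pyparsing, here is a minimal, self-contained sketch of the pattern the program relies on: capturing named results with the call syntax, scanning free-form text with searchString(), and feeding the resulting ParseResults object straight into "%" string interpolation as a mapping. The two-line sample uses values from the log above; the tiny grammar is illustrative, not the full one.

```python
from pyparsing import Word, nums, Combine, Literal

# A real number terminal, same shape as in the program above
realNumber = Combine(Word(nums) + "." + Word(nums))

# Two metric lines, each with a named result
iops = Literal("IOs/sec:") + realNumber("IOPS")
mbps = Literal("MBs/sec:") + realNumber("MBPS")
entry = iops + mbps

sample = """IOs/sec:  6106.50
MBs/sec:   381.65"""

# searchString scans the text; each match is a ParseResults,
# which works as the mapping for %-style interpolation
for tokens in entry.searchString(sample):
    row = "%(IOPS)s,%(MBPS)s" % tokens
    print(row)  # -> 6106.50,381.65
```

Because ParseResults supports dictionary-style lookup by results name, no intermediate dict is needed before formatting.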
Love it! Nice to see that your log file cracking is going smoothly. Here are a couple of style points (adopt/reject at your own discretion):
Ever since I introduced setResultsName() I’ve been unhappy with it visually – it really breaks up the grammar definition for me. But I think results names are an important feature, so I added a simplified syntax for defining them. What do you think about this reformatting of some of your parser:
ioKind = "-k" + singleLetter("k")
numThreads = "-t" + integer("t")
numSeconds = "-s" + integer("s")
driveOrMountPath = "-d" + singleLetter("d")
outstandingRequests = "-o" + integer("o")
accessType = "-f" + Word(alphas)("f")
ioBlockSizeInKB = "-b" + integer("b")
With your very nice results names, you can pass a ParseResults object as a mapping to Python's string interpolator. See how the results names are used as "%(name)s" in the format string:
output.write("%(k)s%(t)s threads %(o)s queue %(f)s in %(b)s KB chunks,%(IOPS)s,%(MBPS)s,%(minLatency)s,%(avgLatency)s,%(maxLatency)s\n" % tokens)
But these are nits, your code looks pretty clean to me – parse on!
— Paul
Paul,
I am in total agreement with your comment on setResultsName(). I like the simplified syntax much better: much more intuitive and “flows” better.
I tested with the simplified syntax and string interpolation, and it works. That’s what I will use in production, but I will leave the original post as is for now.
Thanks a lot for your comment and your excellent work!
Paul,
I have updated the code listed here with the newer simplified syntax. I think that is the right thing to do for me and future visitors.
Thanks again.
One last bit that might help your parsing speed a little. Over time, I've found that parsing terminals, especially real numbers, using "Combine(Word(nums) + '.' + …etc.)" can be a real slog at parse time, and that may be a good place to bite the bullet and just use a small Regex – in your case this would be:
realNumber = Regex(r"\d+\.\d+")
Since realNumber is a terminal, it will be tested *many* times during parsing, so a single Regex object will parse faster than a composite Combine(whatever). You don't have to overdo this though. Where possible, Word and oneOf use regular expressions internally, so converting them to backslash-ridden Regexes won't run any faster. The single case that comes up most often, though, is the definition of a real number. Give it a try!
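As a quick sanity check of this swap, the composite terminal and the single-Regex version should accept exactly the same token. A minimal sketch, using a metric line from the sample log above:

```python
from pyparsing import Combine, Regex, Word, nums

# The original composite terminal and Paul's suggested Regex replacement
composite = Combine(Word(nums) + '.' + Word(nums))
single = Regex(r"\d+\.\d+")

text = "MBs/sec:   381.65"

# searchString returns a list of matches; take the first token of the first match
a = composite.searchString(text)[0][0]
b = single.searchString(text)[0][0]
print(a, b)  # both should print 381.65
```

Since the two expressions are interchangeable, the Regex form can be dropped into the grammar without touching anything downstream.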
— Paul