Questions on Text processing with Python


No need to waste time on proving the importance of text processing, I suppose. Here is an automation use case I had in mind when I started my search: extracting every domain\login pair from a block of text.

Yes, I can build my own regular expressions, and I have done that in the past. But another use case is log file processing: SQL Server, Apache, MySQL, and such. So an existing module that is easy to code against, read, and maintain is better than coding everything myself. In the end, regular expressions are still going to be used, but a layer of abstraction will help productivity and maintainability.
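For reference, the hand-rolled regex baseline for the domain\login case might look something like the sketch below. It assumes a login is a run of ASCII letters and digits, a single backslash, then another such run; real Active Directory names can contain other characters, so treat the pattern and the `find_logins` helper name as illustrative only.

```python
import re

# Assumption: letters/digits, one backslash, letters/digits.
# Widen the character classes if your account names need it.
LOGIN_RE = re.compile(r"\b[A-Za-z0-9]+\\[A-Za-z0-9]+\b")


def find_logins(text):
    """Return every domain\\login-shaped substring found in text."""
    return LOGIN_RE.findall(text)


# Raw string so the backslashes in the sample are kept literally.
sample = r"jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe"
print(find_logins(sample))
```

This works, but every new log format means another pattern to write and maintain, which is why a parsing module is appealing.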

I came across 3 modules: pyparsing, SimpleParse, and NLTK. I am curious to hear your opinions on those 3 modules, or if you have suggestions other than the 3 mentioned here:

1. How easy or difficult are those modules to learn? I haven’t tried SimpleParse or NLTK yet, but I have tried pyparsing, which looks easy to pick up, and the author, Paul McGuire, is very helpful. NLTK might be overkill for what I do, at first glance.

2. What about performance? In most of my use cases, this is probably not that important, but I’ve come across comments on StackOverflow saying that pyparsing does not perform very well when text volume is large.

3. What about support and ongoing development? Like I mentioned earlier, the author behind pyparsing seems to be very active in answering questions and incorporating new ideas.
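On the performance question, one way to check rather than rely on comments would be a small timing harness like the sketch below. It uses only the standard library; `time_extractor` and `regex_extract` are hypothetical helper names, and the repeated sample text is a stand-in for a real log file, so any numbers it prints say little about your own data.

```python
import re
import timeit

# Baseline extractor; assumes letters/digits, one backslash, letters/digits.
LOGIN_RE = re.compile(r"\b[A-Za-z0-9]+\\[A-Za-z0-9]+\b")


def regex_extract(text):
    """Return all domain\\login-shaped substrings using the regex baseline."""
    return LOGIN_RE.findall(text)


def time_extractor(extract, text, repeat=5):
    """Return the best wall-clock time in seconds for one extract(text) call."""
    return min(timeit.repeat(lambda: extract(text), number=1, repeat=repeat))


# Build a larger input by repeating a small sample (note the trailing space).
sample = r"jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe "
big_text = sample * 10000

print("regex: %.4f s" % time_extractor(regex_extract, big_text))
```

Any other extractor that takes a string and returns a list of matches, such as a thin wrapper around a pyparsing grammar, can be passed to `time_extractor` the same way for a side-by-side comparison.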

Thanks in advance for any insights and Happy New Year!

PS, here are 2 solutions to get domain\login out with pyparsing that Paul helpfully provided when I asked:

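[sourcecode language="python"]
from pyparsing import Combine, Word, alphanums

# A login is two alphanumeric words joined by a single backslash.
grammar = Combine(Word(alphanums) + "\\" + Word(alphanums))

# Raw string so the backslashes are not treated as escape sequences.
text = r"jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe"

# Solution 1: searchString returns the tokens of every match.
for m in grammar.searchString(text):
    print(m[0])

# Solution 2: scanString also yields each match's start and end offsets.
for toks, start, end in grammar.scanString(text):
    print("%d:%d:%s" % (start, end, toks[0]))
[/sourcecode]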


3 responses to “Questions on Text processing with Python”

  1. NLTK is probably overkill. If you’re just trying to eliminate urls in text, I’m sure there are better options than pyparsing or SimpleParse, which won’t be any easier than doing your own regex. I’d suggest looking at BeautifulSoup or lxml as a way to find & eliminate urls.

  2. Thanks Jacob.

    By getting domain\login info out, I didn’t actually have url/html/xml processing in mind. The goal is to extract the logins for Windows Active Directory processing.

    Yes, I’ve done work with html/xml processing. I picked lxml over BeautifulSoup after some research. Speaking of which, I need to spend time building some sample code for lxml, ’cause I lost my Python toolbox files a while ago.
