No need to waste time proving the importance of text processing, I suppose. Here is an automation use case I had in mind when I started my search: extracting every domain\login pair from a block of text.
Yes, I can build my own regular expressions, and I have done so in the past. But another use case is log file processing: SQL Server, Apache, MySQL, and so on. An existing module that is easy to code against, read, and maintain beats me coding everything myself. Regular expressions will still be used under the hood, but a layer of abstraction should help productivity and maintainability.
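For reference, the do-it-yourself regex version I would otherwise write might look something like this (the sample text is borrowed from Paul's pyparsing examples at the end of this post; the pattern name is mine):

import re

# A run of alphanumerics, a literal backslash, then another run of
# alphanumerics -- the same shape as the pyparsing grammar below.
DOMAIN_LOGIN = re.compile(r"[A-Za-z0-9]+\\[A-Za-z0-9]+")

text = r"jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe"
print(DOMAIN_LOGIN.findall(text))

It works, but every such pattern I hand-roll is one more thing to re-decipher six months later, which is exactly why I'm shopping for a higher-level module.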
1. How easy or difficult are these modules to learn? I haven't tried SimpleParse or NLTK yet, but I have tried pyparsing, which looks easy to pick up, and its author, Paul McGuire, is very helpful. At first glance, NLTK might be overkill for what I do.
2. What about performance? In most of my use cases this probably won't matter, but I've come across comments on StackOverflow saying that pyparsing does not perform well when the text volume is large.
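I don't have hard numbers, but a quick sanity check along these lines (plain re on an artificially enlarged input; the names and sizes are just my choices) would at least tell me whether the volumes I care about are in the danger zone. The pyparsing equivalent could be timed the same way for comparison:

import re
import timeit

# Repeat a sample log-like line many times to simulate a large file,
# then time repeated extraction passes over it.
pattern = re.compile(r"[A-Za-z0-9]+\\[A-Za-z0-9]+")
big_text = "jwfoleow fjlwowe\\jfoew lwfweolwfo\\jofweojw lifewijowe\n" * 10000

seconds = timeit.timeit(lambda: pattern.findall(big_text), number=10)
print("10 passes over ~%d KB: %.3f s" % (len(big_text) // 1024, seconds))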
3. What about support and ongoing development? Like I mentioned earlier, the author behind pyparsing seems to be very active in answering questions and incorporating new ideas.
Thanks in advance for any insights and Happy New Year!
PS: here are two solutions for pulling domain\login out with pyparsing that Paul helpfully provided when I asked:
from pyparsing import *

grammar = Combine(Word(alphanums) + "\\" + Word(alphanums))

matches = grammar.searchString("jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe")
for m in matches:
    print m

########

for toks, start, end in grammar.scanString("jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe"):
    print "%d:%d:%s" % (start, end, toks)