Archive for Python

Questions on Text processing with Python

No need to waste time on proving the importance of text processing, I suppose. Here is an automation use case I had in mind when I started my search: sucking out all domain\login inside a block of text.

Yes, I can build my own regular expressions, and I have done that in the past. But, another use case is log file processing: SQL Server, Apache, MySQL, and such. Therefore an existing module that is easy to code, read and maintain is better than me code everything. In the end, regular expression is still going to be used, but a layer of abstraction will help productivity and maintainability.

I came across 3 modules: pyparsing, SimpleParse, and NLTK. I am curious to hear your opinions on those 3 modules, or if you have suggestions other than the 3 mentioned here:

1. How easy/difficult to learn those modules? I haven’t tried SimpleParse or NLTK yet, but I have tried pyparsing, which looks easy to pick up and the author, Paul McGuire, is very helpful. NTLK might be an overkill for what I do, at first glance.

2. What about performance? In most of my use cases, this is probably not that important, but I’ve come across comments on StackOverflow saying that pyparsing does not perform very well when text volume is large.

3. What about support and ongoing development? Like I mentioned earlier, the author behind pyparsing seems to be very active in answering questions and incorporating new ideas.

Thanks in advance for any insights and Happy New Year!

PS, here are 2 solutions to get domain\login out with pyparsing that Paul helpfully provided when I asked:

from pyparsing import *
grammar = Combine(Word(alphanums) + "\\" + Word(alphanums))
matches = grammar.searchString("jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe")
for m in matches:
	print m[0]
########
for toks,start,end in grammar.scanString("jwfoleow fjlwowe\jfoew lwfweolwfo\jofweojw lifewijowe"):
	print "%d:%d:%s" % (start,end,toks[0])

Comments (3)

List all database files in a SQL Server instance with Python and pyodbc

For SQL Server management from Linux, I’ve used pymssql on Linux before, but realized lately that it is hard to configure with the heterogeneous environment I am in. So I am giving pyodbc a try. The install on Windows is simple. Listed below is some code that I put together quickly for some testing. It lists all files used by a SQL Server instance. I will test pyodbc on Linux against a SQL Server database and will write my findings in a few days.

How to reproduce:
1. Install Python 2.7 on Windows. I am not bothering with 64-bit Python at the moment;
2. Download and install pyodbc 2.7 for Windows;
3. Add c:\Python27 to your PATH variable

""" The code below lists all files used by a SQL Server instance. Adjust instance and login credentials as necessary. ROWS means data file, LOG means log file """
import pyodbc
cn = pyodbc.connect('DRIVER={SQL Server};SERVER=MyInstance;DATABASE=master;Trusted_Connection=yes')
cursor = cn.cursor()
cursor.execute("select name from sys.databases")
rows = cursor.fetchall()
for row in rows:
	sql = ("select type_desc, physical_name from %s.sys.database_files" % row.name)
	cursor.execute(sql)
	l1rows = cursor.fetchall()
	print "Files for " + row.name + " :"
	for l1row in l1rows:
		print l1row.type_desc + " " + l1row.physical_name

Comments

Find and replace text in a file with Python

Continuing my Lego block building, here is the code for myself and whoever will find it useful.

Two blocks of code, the first one searches and replaces, then write the results to a new file, leaving the original intact. The second replaces things in place.

Highlights:

1. os.path.join is cool, but you’ve got to be careful while on Windows. To get c:\users\hji\junk\tt.txt, you can do it in two ways. See comment in code below.

2. For the block of code that does search and replace in place, pay attention to the comma right after the word line, which is necessary. I am not a big fan tricks like this, but it does get the job done.

3. I’ve done doctest, but I need to get unittest working properly as soon as possible. I haven’t got good sources explaining unittest, suggestion welcome.

#Block 1: Create a new file, leaving the original intact
import os
fin = open(os.path.join("c:\\", "users", "hji", "junk", "Lab_1_Ex_1_script_1.src"))
fout = open(os.path.join("c:/", "users", "hji", "junk", "tt.txt"), "w")
# Or you can do doubel back slash to escape the first one
# fout = open(os.path.join("c:\\", "users", "hji", "junk", "tt.txt"), "w")
for line in fin:
	fout.write( line.replace('XX', '01') )
fin.close()
fout.close()
#Block 2: Search and replace in place
import fileinput
def replaceAll(file,searchExp,replaceExp):
    for line in fileinput.input(file, inplace=1):
        if searchExp in line:
                line = line.replace(searchExp,replaceExp)
        print line,
replaceAll("Lab_1_Ex_1_script_1.SRC", "XX", "01")

Comments

Building up Python Lego pieces – some directory stuff

I lost all my Python code that I have written and collected over time, don’t ask why. So I am in the process of building that back up. Here is some code to create directories for a SQL Server gig I am working on.

Highlights:
1. Use os.path.join to take care of the forward and backward slashes on different operating systems. Somewhat similar to shlex module that take care of escaping. I’ve tested this successfully on both Windows and Linux;
2. Notice os.makedirs can make a directory and one of its sub-directories in one go.

import os
if not os.path.isdir(os.path.join("c:/", "storage", "data")):
	os.makedirs(os.path.join("c:/", "storage", "data"))
if not os.path.isdir(os.path.join("c:/", "storage", "log")):
	os.makedirs(os.path.join("c:/", "storage", "log"))
for i in [1,2,3,4,5,6]:
        for j in [1,2,3,4]:
        	for k in [0,1]:
			if not os.path.isdir(os.path.join("c:/", "storage", "data", "%s_%s_%s" % (i, j, k))):
				os.makedirs(os.path.join("c:/", "storage", "data", "%s_%s_%s" % (i,j, k)))
	if not os.path.isdir(os.path.join("c:/", "storage", "log", "%s_5_0" % i)):
		os.makedirs(os.path.join("c:/", "storage", "log", "%s_5_0" % i))
if not os.path.isdir(os.path.join("c:/", "FastTrack Labs", "Lab 2", "Exercise 2", "Output")):
        os.makedirs(os.path.join("c:/", "FastTrack Labs", "Lab 2", "Exercise 2", "Output"))

Comments

Having fun with MySQL and Python: converting MySQL character set to utf8

Lately I worked quite a bit with Python and Linux, writing monitoring and automation utilities. I am in a transition period, so I thought I ought to write some Python stuff interfacing with MySQL for fun, and start positioning myself for expanded career horizon, I hope.

To get started, I thought it would be fun to rewrite a Perl utility I wrote before with Python. That script converts MySQL character sets to utf8, a very common task for wikis and blogs during an upgrade. This time, I did everything from scratch: firing up an Amazon EC2 Linux instance, hand install and configuring MySQL 5.1.50 (creating mysql user, group, wget tarball, setting directory ownership and permissions, creating symbolic to MySQL binaries, editing my.cnf, /etc/init.d/ and chkconfig automatic startup, environmental variables, the works), compiled and configured Python 2.7, compiled and configured Python easy_install, compiled and installed MySQLdb module for Python, and finally successfully rewrote and tested the utility in Python. It is listed at the bottom of this post.

The original Perl program used some SQL code generation technique based on metadata within information_schema. As is hopefully well-known by now, information_schema should be used with caution because it can leads to server lockup. In fact, I noticed it when I ran my Perl program on a shared web hosting server. See Baron’s post here for an alternative way of doing it. But my Python program listed below didn’t use it, just to be consistent with the one used in Perl. I had a lot of fun going through this exercise.

Here are some notes I took.
1. It would be really nice if there is a catalog of every publicly available AMIs that contains detailed metadata that we can query, such as:
a. Version and distro of *nix;
b. What has been installed? Apache, MySQL, Python, PHP, Nagios, etc., etc.
c. For the software that has been installed, what is the version? What version(s) of MySQL was installed? What patches have been applied? Is it Python 2.4, 2.5, 2.6, 2.7, 3.0? Things like that.
d. Hardware related stuff: memory, disk size, etc.

Granted, I used AWS Management Console web interface to fire up and terminate instances, and accessed the servers via Putty. I lost my Linux workstation and haven’t built a new one yet. I do recall there were some EC2 command line tools on Linux that can do some of that when I last tried it a few months back. So it could be that things have improved and/or I just didn’t know the right way to get that out. If that is the case, I’d really appreciate it if you can give me some pointers. As it stands, the xml manifest file is too vague and not detailed enough.

2. Trying to do Python upgrade, say from 2.4 to 2.6 or 2.7, is generally not a good idea. Having multiple versions of Python side by side is probably a better way to go. Yes, it can be confusing, but I think in most cases, the cost/benefit analysis weighs against an upgrade: broken libraries and modules, failed cron jobs, you name it.

When having multiple versions of Python side by side, you can define a default version by some creative use of symbolic link. For instance, I did
./configure && make && sudo make install
for Python 2.7 on a server that has 2.4 pre-installed. I then did:
ln -s /usr/local/bin/python2.7 /usr/bin/python
to make Python 2.7 the default version. If I need Python 2.4, I will just call python2.4 binary.

While on that note, I found it surprising that 2 AMI instances (ami-3e836657 and ami-01b75668) I used both had Python 2.4 installed, which in my mind was pretty old.

3. Installing MySQLdb Python module was a bit problematic. On one server,
sudo easy_install mysql-python
did not work for me. I didn’t spend time digging because the following worked:
sudo yum install MySQL-python
On another server, easy_install mysql-python did work, but when I did import MySQLdb, I got:
ImportError: libmysqlclient_r.so.16: cannot open shared object file: No such file or directory
To fix that, do:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/mysql/lib
Here is the Python code

import MySQLdb
db = MySQLdb.connect(host="localhost", user="root", passwd="", db="charset_test", unix_socket="/tmp/mysql.sock")
cursor = db.cursor()
sql = "SELECT CONCAT('ALTER TABLE ', table_name, ' MODIFY ', column_name, ' ', REPLACE(column_type, 'char', 'binary')) FROM information_schema.columns WHERE table_schema = 'charset_test' and data_type LIKE '%char%'"
cursor.execute(sql)
results = cursor.fetchall()
for row in results:
        """Convert char types to binary types first"""
        cursor.execute(row[0])
        """Now we convert binary type back to char with CHARACTER SET utf8 defined"""
        cursor.execute(row[0].rpartition(' ')[0] + ' ' + row[0].rpartition(' ')[2].replace('binary', 'char') + ' CHARACTER SET utf8')
sql = "SELECT CONCAT('ALTER TABLE ', table_name, ' MODIFY ', column_name, ' ', REPLACE(column_type, 'text', 'blob')) FROM information_schema.columns WHERE table_schema = 'MyDatabase' and data_type LIKE '%text%'"
cursor.execute(sql)
results = cursor.fetchall()
for row in results:
        """Convert text types to blob types first"""
        cursor.execute(row[0])
        """Now we convert blob type back to text with CHARACTER SET utf8 defined"""
        cursor.execute(row[0].rpartition(' ')[0] + ' ' + row[0].rpartition(' ')[2].replace('blob', 'text') + ' CHARACTER SET utf8')
db.close()

Comments

Remove common lines from files with Python

I am digging Python. I am writing small pieces of code that does one thing and does it well, kind of like building a solid, reliable Lego piece. When I have a collection of them, I can snap ‘em together to make something useful. In fact, I’ve used Python to generate some content behind the wiki I built, http://www.haidongji.com/wiki

One useful thing that I wrote recently is to solve this problem: suppose you have two files, 1.txt and 2.txt, your objective is to remove lines that exist in both files from 1.txt. I came up with 4 lines of Python code (including the import statement) to solve it. I am a bit amused by this, although I don’t necessarily like this style of programming. It is clever, but can be hard to understand and maintain later on. Here is the code. Just for demo purposes, no error handling!

#!/usr/bin/env python
import fileinput
for line in fileinput.input("1.txt", inplace=1):
    if line not in open("2.txt", "r"):
        print line,

Note the comma at the end of the print statement. It is necessary, otherwise you will have extra newline characters in your file.

To create a simple test, create 1.txt with the English alphabet, with each letter occupying a line. Then create 2.txt, say with the letters in the word “haidong”, again with each letter taking a line. Run the code and see what happens.

Comments

Next entries »

Page optimized by WP Minify WordPress Plugin