I’ve done some benchmark testing of 2 Python modules for MySQL data retrieval: MySQLdb and pyhs. MySQLdb uses MySQL’s client libraries, whereas pyhs uses HandlerSocket that bypasses MySQL’s client layer and interfaces Innodb storage engine’s files directly. In my testing, HandlerSocket results in 82% improvement over mysql client libraries based on number of rows retrieved. The tests were conducted under different conditions: right after a start when cache is cold, a warmed up cache after running SELECT * FROM customer, and alternating the execution order of those 2 Python files. The results are fairly consistent in that they all fall in the same range. Below is a sample output.
root@ubuntu:~# python hanSolo.py Using HandlerSocket, below is a report of how many customer's name and address can be retrieved based on customer key: Seconds elapsed: 61.0000810623 Rows retrieved: 509863 root@ubuntu:~# python mclient.py Using mysql client libraries, below is a report of how many customer's name and address can be retrieved based on customer key: Seconds elapsed: 61.0001530647 Rows retrieved: 280120
Here is my setup:
Hardware and software:
1. Rackspace Cloud server Ubuntu 10.04 Lucid Lynx, 1 gig memory, 40 gig hard disk space, 64-bit
Linux ubuntu 188.8.131.52-rscloud #8 SMP Mon Sep 20 15:54:33 UTC 2010 x86_64 GNU/Linux
2. Following instruction here to get Percona’s APT repository;
3. apt-get install percona-server-client-5.5
4. apt-get install percona-server-server-5.5
Enable HandlerSocket plugin. HandlerSocket is bundled with Percona Server 5.5, so you don’t have to download source files, config, make, build yourself:
1. mysql> install plugin handlersocket soname 'handlersocket.so'
2 cp /usr/share/mysql/my-large.cnf /etc/mysql/my.cnf
3. vim /etc/mysql/my.cnf with the following under mysqld section
loose_handlersocket_port = 9998 loose_handlersocket_port_wr = 9999 loose_handlersocket_threads = 16 loose_handlersocket_threads_wr = 1 open_files_limit = 65535
4. service mysql restart
Acquire Python’s MySQLdb and pyhs modules:
1. apt-get install libmysqlclient-dev
Necessary for building Python’s MySQLdb
2. apt-get install python-dev
Necessary Python header files, gcc, make, etc., for building Python modules
2. wget the appropriate egg from this page. Get the one for your version of Python.
3. sh eggFileDownloadedFromTheStepAbove
4. easy_install MySQL-python
MySQLdb module, which uses mysql client for MySQL access
5. easy_install python-handler-socket
Prepare testing data
1. Follow instructions here to get dbgen compiled;
2. While at the proper directory, run
./dbgen -v -T c
It will generate a customer file that has 150000 rows
3. Create customer table in test. Here is the DDL:
CREATE TABLE customer ( C_CUSTKEY INTEGER NOT NULL, C_NAME VARCHAR(25) NOT NULL, C_ADDRESS VARCHAR(40) NOT NULL, C_NATIONKEY INTEGER NOT NULL, C_PHONE CHAR(15) NOT NULL, C_ACCTBAL DECIMAL(15,2) NOT NULL, C_MKTSEGMENT CHAR(10) NOT NULL, C_COMMENT VARCHAR(117) NOT NULL, primary key (C_CUSTKEY));
4. load data local infile ‘/root/dbgen/customer.tbl’ into table customer fields terminated by ‘|’ lines terminated by ‘\n’;
Adjust file location as necessary.
Finally, here is the content of my throwaway Python test scripts. One highlight of mclient.py is it demonstrate how to return results back in as dict with MySQLdb.
import time from pyhs import Manager hs = Manager() start = time.time() i = 1 j= 0 while i < 150000: data = hs.get('test', 'customer', ['C_CUSTKEY', 'C_NAME', 'C_ADDRESS'], '%s' % i) i=i+1 if i == 150000: i = 1 end = time.time() j = j + 1 if int(end - start) > 60: break print "Using HandlerSocket, below is a report of how many customer's name and address can be retrieved based on customer key:" print "Seconds elapsed: ", str(end - start) print "Rows retrieved: ", str(j)
import sys, MySQLdb, time my_host = "localhost" my_user = "root" my_pass = "" my_db = "test" try: db = MySQLdb.connect(host=my_host, user=my_user, passwd=my_pass, db=my_db) except MySQLdb.Error, e: print "Error %d: %s" % (e.args, e.args) sys.exit (1) cursor = db.cursor (MySQLdb.cursors.DictCursor) i=1 j=0 start = time.time() while i < 150000: sql = "select c_custkey, c_name, c_address from customer where c_custkey=%s" % i; cursor.execute(sql) results = cursor.fetchall() i=i+1 if i==150000: i=1 end = time.time() j=j+1 if int(end - start) > 60: break print "Using mysql client libraries, below is a report of how many customer's name and address can be retrieved based on customer key:" print "Seconds elapsed: ", str(end - start) print "Rows retrieved: ", str(j) db.close()