Mixing up some MySQL and SQL Server hacking activities can be fun. I’ve also noticed that when you work with different database platforms, it can bring additional clarity to certain concepts and techniques. One reason for that, I think, is that different terminologies and ways of explanations are used for people practicing on different platforms, and that tends to re-enforce or otherwise bring clarity to things that you may not have a good grasp to begin with.
Anyway, I am following Baron Schwartz et al’s High Performance MySQL book and thought the example of building your own hash index for URL lookup looks really useful and clever. On the topic of hash collison, the book discussed using words in /usr/share/dict/words as an example. By the way, High Performance MySQL 2nd Edition is a fantastic book, highly recommended!
I tried that myself. Here is the relavent code to play along, assuming you have the table and triggers set up:
mysql> load data infile ‘/usr/share/dict/words’ into table pseudohash (url);
Query OK, 98569 rows affected (1.99 sec)
Records: 98569 Deleted: 0 Skipped: 0 Warnings: 0
mysql> select url_crc, count(*) from pseudohash group by url_crc having count(*) > 1;
| url_crc | count(*) |
| 1774765869 | 2 |
1 row in set (0.37 sec)
mysql> select url, url_crc from pseudohash where url_crc = 1774765869;
| url | url_crc |
| codding | 1774765869 |
| gnu | 1774765869 |
2 rows in set (0.10 sec)
I then went to SQL Server to see if there is a built-in function that does CRC32 hashing. I didn’t see it in Books Online. One of those days, when I find some free time, I will write a C# function and expose it as a UDF inside SQL Server. I will then run some benchmark logical reads with this hash index idea implemented in SQL Server and see how it affects things.
A side benefit of this exercise is I learned the meaning of “codding”, which I found somewhat amusing.