Some notes and observations on ICE storage engine


I’ve used Vertica, a commercial column-based database storage engine, and was reasonably impressed. During O’Reilly MySQL conference last month, I checked out Infobright’s vendor booth and talked with some users. I became curious and wanted to test it out. Infobright has a free community version (ICE Infobright Community Edition)of its column-based storage engine that works with MySQL, which was what I used for my testing. I have no relationship with Infobright whatsoever, I happen to think that column-based storage can be a potentially disruptive technology in the BI/DW field. I’d love to hear your comments/experiences.

Here are some noteworthy points:

1. Setup is pretty easy. You can follow steps here. Note that the package has most relavant MySQL tools. A separate install of mysql client and server is NOT needed.

The ICE package has the following storage engines bundled:
BRIGHTHOUSE
MRG_MYISAM
CSV
MyISAM
MEMORY

2. I used TPC-H’s dbgen tool to generate data for testing. The raw text file is around 8.8 GB, about 72 million rows;

3. I used Rackspace’s cloud server, CentOS 5.5, 1 GB memory, 64-bit, 35 GB of hard drive space for testing. I created 2 databases, each with one table called lineitem: one table uses the BRIGHTHOUSE storage engine, the other uses the MyISAM storage engine. No index on the MyISAM table initially;

4. On this particular Rackspace server, below is the record of how long it took to load that amount of data into BRIGHTHOUSE table:

[sourcecode language=”text”]
# time mysql-ib infobright < load.sql real 22m46.974s user 0m2.320s sys 0m16.140s [/sourcecode] And here is the record of how long it takes to load into the MyISAM table: [sourcecode language="text"] # time mysql-ib test < load.sql real 6m11.966s user 0m1.960s sys 0m14.420s [/sourcecode] Here is what's inside load.sql: [sourcecode language="text"] load data local infile '/root/dbgen/lineitem.tbl' into table lineitem fields terminated by '|' lines terminated by '\n'; [/sourcecode] 5. Size comparison, in GB. Here you can see the power of the impressive compression rate of a column-based storage engine: Raw Text: 8.8 Raw Text After GZIP: 2.6 MyISAM File Size: 7.2 InfobrightFileSize: 1.5 6. I did a rudimentary performance comparison. The first one is on Infobright table, the second on MyISAM table: [sourcecode language="text"] mysql> select count(*) from lineitem where l_shipdate between ‘1993-01-01’ and ‘1995-01-01’;
+———-+
| count(*) |
+———-+
| 21880025 |
+———-+
1 row in set (15.06 sec)

mysql> use test;
Database changed
mysql> select count(*) from lineitem where l_shipdate between ‘1993-01-01’ and ‘1995-01-01’;
+———-+
| count(*) |
+———-+
| 21880025 |
+———-+
1 row in set (1 min 9.23 sec)
[/sourcecode]

I then created an index on l_shipdate for the MyISAM table, and improved the performance to a bit more than 10 seconds.


2 responses to “Some notes and observations on ICE storage engine”

  1. I used ICE at my last gig and I did quite a bit of testing of our data before we put it in production. I never tested it in a cloud environment and I found that the load was super painful if it wasn’t on fast disks. I put both mysql & ICE on similar hardware, same OS.

    In the end we changed the schema so no text what so ever went into the tables. NONE. Everything was an int or a decimal. This totally sped up the queries DRAMATICALLY.

    Due to diskspace limitations I never benchmarked MySQL past 2 Million rows in the table. But I did go on to test ICE with data upto 10BILLION rows & it didn’t fall down.

    I love ICE. But I do have some complaints:
    1) no read while loading.
    2) no replication.

    If I can get around #1 I would consider using it at my new gig.

    Between the data compression and the speed of queries ICE was/is extremely attractive. We had a dataset (granted with text) that had 350 million rows. BI query took 2.5 DAYS. moved it to all ints schema & on ICE it took 10.0 secs the 1st time I ran it & .006 after that! it was impressive. Now I probably could have gotten MySQL down to 8 hours with a tighter schema but that was still too long.

    YMMV
    erin

  2. Thanks Erin for the comment. It is great to hear candid discussions from real experience.

    I didn’t know the “no read while loading” limitation. I will check that out.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.