Index compression script generation

I previously wrote a script that generates ALTER INDEX statements for page compression of non-clustered indexes in SQL Server. I’ve since made improvements, and here is the updated version.

Highlights:

  • The script generates compression statements on a partition-by-partition basis if the underlying index is partitioned;
  • Like the previous version, it still works for non-partitioned indexes;
  • Like the previous version, it orders indexes from smallest to largest, progressively freeing space as it nibbles forward, so the rebuilds are unlikely to grow the data file(s);
  • Unlike the previous version, I’ve set MAXDOP = 0, which lets SQL Server decide the degree of parallelism;
  • There is no quick way that I know of to tell whether an underlying index is partitioned, hence the two temp tables below that separate the two cases for proper script generation.

IF Object_id('tempdb..#NonPartitionedIndex') IS NOT NULL 
  DROP TABLE #nonpartitionedindex 

SELECT object_id, 
       index_id 
INTO   #nonpartitionedindex 
FROM   sys.partitions 
WHERE  object_id > 255 
       AND data_compression IN ( 0, 1 ) -- NONE or ROW, i.e. not yet page-compressed 
       AND index_id > 1 
GROUP  BY object_id, 
          index_id 
HAVING Count(*) = 1 

IF Object_id('tempdb..#PartitionedIndex') IS NOT NULL 
  DROP TABLE #partitionedindex 

SELECT object_id, 
       index_id 
INTO   #partitionedindex 
FROM   sys.partitions 
WHERE  object_id > 255 
       AND data_compression IN ( 0, 1 ) -- NONE or ROW, i.e. not yet page-compressed 
       AND index_id > 1 
GROUP  BY object_id, 
          index_id 
HAVING Count(*) > 1 

SELECT s.NAME AS SchemaName, 
       t.NAME AS TableName, 
       i.NAME AS IndexName, 
       'ALTER INDEX ' + i.NAME + ' ON ' + s.NAME + '.' + t.NAME 
       + ' REBUILD WITH (SORT_IN_TEMPDB = ON, MAXDOP = 0, DATA_COMPRESSION = PAGE);' AS AlterRebuild, 
       Sum(a.total_pages) * 8 AS TotalSpaceKB, 
       Sum(a.used_pages) * 8 AS UsedSpaceKB, 
       ( Sum(a.total_pages) - Sum(a.used_pages) ) * 8 AS UnusedSpaceKB
FROM   sys.tables t 
       JOIN sys.schemas s 
         ON s.schema_id = t.schema_id 
       JOIN sys.indexes i 
         ON t.object_id = i.object_id 
       JOIN sys.partitions p 
         ON i.object_id = p.object_id 
            AND i.index_id = p.index_id 
       JOIN sys.allocation_units a 
         ON p.partition_id = a.container_id 
       JOIN #nonpartitionedindex npi 
         ON p.object_id = npi.object_id 
            AND p.index_id = npi.index_id 
WHERE  i.index_id > 1 -- Non-clustered indexes  
       AND p.data_compression IN ( 0, 1 ) -- NONE or ROW, i.e. not yet page-compressed  
       AND t.NAME <> 'dtproperties' -- Ignore certain tables  
       AND t.is_ms_shipped = 0 
       AND i.object_id > 255 -- Non-system objects  
GROUP  BY s.NAME, 
          t.NAME, 
          i.NAME, 
          p.partition_number 
UNION ALL 
SELECT s.NAME AS SchemaName, 
       t.NAME AS TableName, 
       i.NAME AS IndexName, 
       'ALTER INDEX ' + i.NAME + ' ON ' + s.NAME + '.' + t.NAME 
       + ' REBUILD WITH (SORT_IN_TEMPDB = ON, MAXDOP = 0, DATA_COMPRESSION = PAGE ON PARTITIONS (' 
       + Cast(p.partition_number AS VARCHAR(3)) + '));' AS AlterRebuild, 
       Sum(a.total_pages) * 8 AS TotalSpaceKB, 
       Sum(a.used_pages) * 8 AS UsedSpaceKB, 
       ( Sum(a.total_pages) - Sum(a.used_pages) ) * 8 AS UnusedSpaceKB
FROM   sys.tables t 
       JOIN sys.schemas s 
         ON s.schema_id = t.schema_id 
       JOIN sys.indexes i 
         ON t.object_id = i.object_id 
       JOIN sys.partitions p 
         ON i.object_id = p.object_id 
            AND i.index_id = p.index_id 
       JOIN sys.allocation_units a 
         ON p.partition_id = a.container_id 
       JOIN #partitionedindex pi 
         ON p.object_id = pi.object_id 
            AND p.index_id = pi.index_id 
WHERE  i.index_id > 1 -- Non-clustered indexes  
       AND p.data_compression IN ( 0, 1 ) -- NONE or ROW, i.e. not yet page-compressed  
       AND t.NAME <> 'dtproperties' -- Ignore certain tables  
       AND t.is_ms_shipped = 0 
       AND i.object_id > 255 -- Non-system objects  
GROUP  BY s.NAME, 
          t.NAME, 
          i.NAME, 
          p.partition_number 
ORDER  BY UsedSpaceKB
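
For a partitioned index, the AlterRebuild column holds a per-partition statement. As an illustration only (the table, index, and partition number below are made up), a generated row looks like this:

ALTER INDEX IX_Orders_OrderDate ON dbo.Orders REBUILD WITH (SORT_IN_TEMPDB = ON, MAXDOP = 0, DATA_COMPRESSION = PAGE ON PARTITIONS (3));

Since the result set is ordered by UsedSpaceKB ascending, execute the generated statements from top to bottom to compress the smallest indexes first.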

Yanking and sorting lines matching a pattern

One of the best investments I’ve ever made is becoming proficient with a good cross-platform editor, in my case Vim. It took me a good few months before I really became comfortable with it, but that struggle has paid huge dividends ever since!

So after years of Vim usage, I consider myself a power user. Yet from time to time I come across nifty tips that remind me why I fell in love with it in the first place: the sense of wonder, awe, and beauty, and the intelligence of its creators!

Here are two things I learned recently:

  • Sort based on a pattern
    I use :sort and :sort u all the time. :sort does what the word implies: it sorts all lines in the buffer. :sort u (unique) does the sort, but in addition removes duplicate lines. Those two commands are extremely useful.

    Yesterday I was doing some email log analysis, and had a bunch of email addresses in my file. And I thought, wouldn’t it be nice if I could sort those addresses based on domain names? So I searched the web, then looked through :help sort. Sure enough, I can absolutely do that.

    Say you’ve got the following lines:

    person1@b.com
    person2@a.com
    person3@a.com
    person4@c.com
    person5@a.com

    To sort them by domain name, run :sort /.\+@/ in normal mode. When given a pattern, :sort compares the text that follows the match, which here is the domain.
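
    Afterwards the lines are grouped by domain (the relative order within the same domain may vary):

    person2@a.com
    person3@a.com
    person5@a.com
    person1@b.com
    person4@c.com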

  • Yank all matching lines into a register
    I use :g/pattern/d fairly often. It deletes every line in the document that matches the pattern. Since the pattern is a regular expression, this can be pretty powerful.

    However, before deleting them, sometimes it is a good idea to save them away first. To do that, run
    :g/pattern/yank CapitalLetter

    This command puts the matching lines into a register. The capital letter matters: yanking into an uppercase register name appends to the corresponding lowercase register, so successive matches accumulate instead of overwriting one another. Let’s use X as an example. In a different buffer, you can run

    "Xp

    And it’ll paste those lines!
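
    Putting the two tips together, a session might look like this (ERROR is just a sample pattern; x is the register):

    :let @x=''
    :g/ERROR/yank X
    :g/ERROR/d

    The first command empties register x (needed because uppercase X appends rather than overwrites), the second collects every matching line, and the third deletes them from the buffer. Then, in another buffer, "Xp pastes everything that was collected.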

Demo: updating Ubuntu, installing Python and R packages, and a Firefox download add-on

Video demo:
1. How to update Ubuntu Linux;
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install build-essential
2. How to install Python packages;
sudo apt-get install python-pip
sudo apt-get install python-dev
sudo pip install numpy
sudo pip install ggplot…
3. How to install R and R packages;
sudo apt-get install r-base
sudo apt-get install openjdk-7-jdk
sudo R
install.packages("xlsx")
4. How to download videos quickly and conveniently:
Firefox, with the DownThemAll add-on
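
To verify the installations, a quick check along these lines should work (assuming the commands above completed without errors):

python -c "import numpy; print numpy.__version__"
Rscript -e 'library(xlsx)'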

Setting up a Chinese-language Linux virtual machine

Recently, while chatting in several QQ and WeChat IT groups, I noticed that quite a few people had never touched Linux in school or at work. Many IT fields, such as big data, machine learning, and server administration, require Linux skills, so a lot of people want to start learning it. I made the videos below for beginners, and I hope they help. This is my first screencast, and I would very much like to hear your criticism and suggestions.

Update: after I uploaded the video to Youku, the quality was not satisfactory. I then tried other video-sharing sites: LeTV, QQ Video, Sina Video, and Tudou. LeTV’s upload page offered no upload channel, perhaps because my IP address is outside China? I ran into the same problem on Sina Video. QQ Video did let me upload, but in the end told me "your video may contain content that the relevant authorities have explicitly prohibited, and therefore cannot pass review; please revise it and upload again", which is truly baffling.

I finally uploaded to Tudou, and the result is acceptable. I had merged several short videos into one file, but due to a mistake on my part, the opening segment introducing the VirtualBox software did not make it in. Just remember that VirtualBox runs free of charge on Windows, Linux, and Mac; download and install it and you are all set.

Convert Excel file with hyperlinks to CSV

It seems the venerable CSV file format (Character/Comma-Separated Values) never goes out of style. According to the CSV Wikipedia page, IBM Fortran started supporting the format in 1967, before many of us were born. CSV has been with us through thick and thin, silently but steadfastly, ready to spring into action when duty calls. It is surely one of a data professional’s best friends! Oftentimes we convert spreadsheet files, or dump data from a database, into CSV before it can be distributed and consumed downstream.

Major-league scripting languages such as Perl, Python, and Ruby all have their own ways of converting Excel spreadsheet files into CSVs. Here are their most popular libraries, based on my research: for Perl, there is Spreadsheet::ParseExcel; for Python, there is xlrd; for Ruby, there is roo.

However, none of these addressed a problem I had recently.

Here is my use case:
Given Excel files, in both xls and xlsx formats, that contain hyperlink columns, convert them to CSV. For the hyperlink columns, save the text value (also known as Friendly_name) but not the URL. None of the libraries mentioned above could handle this.

So I ended up trying PHP, and found a PHP library called PHPExcel that addressed my needs. Below is a quick CLI PHP program I wrote.

Follow the steps below to use it:

  1. Download PHPExcel library;
  2. Save the program below. On Linux, you can save it as excel2csv. On Windows, save it as excel2csv.php. Modify the require_once path as needed so it points to the directory where PHPExcel is located;
  3. On Linux, you may want to run
    chmod +x excel2csv
    On Windows you should be ok if your system knows to use PHP when it sees a .php extension;
  4. To use it, on the command line, run
    excel2csv inputExcel outputCsv
    Remember to replace the parameters to your liking!
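
For example, on Linux (the file names here are made up):

    ./excel2csv contacts.xlsx contacts.csv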

Hope it helps!

#!/usr/bin/php -q
<?php
// Adjust the path below to wherever the PHPExcel library is installed.
require_once('/Directory2PHPExcel/PHPExcel/Classes/PHPExcel.php');

$inputFile  = $argv[1];  // source .xls or .xlsx file
$outputFile = $argv[2];  // destination .csv file
Xls2Csv($inputFile, $outputFile);

function Xls2Csv($infile, $outfile)
{
	// Detect whether the input is xls or xlsx and pick the matching reader.
	$fileType = PHPExcel_IOFactory::identify($infile);
	$objReader = PHPExcel_IOFactory::createReader($fileType);

	// Read cell data only (no formatting); for hyperlink cells, the CSV
	// output will contain the displayed text, not the URL.
	$objReader->setReadDataOnly(true);
	$objPHPExcel = $objReader->load($infile);

	// Write the loaded workbook back out in CSV format.
	$objWriter = PHPExcel_IOFactory::createWriter($objPHPExcel, 'CSV');
	$objWriter->save($outfile);
}
?>