Convert Excel file with hyperlinks to CSV

It seems the venerable CSV (comma-separated values) format never goes out of style. According to the CSV Wikipedia page, IBM Fortran started supporting this format in 1967, before many of us were born. CSV has been with us through thick and thin, silently but steadfastly, ready to spring into action when duty calls. It is surely one of a data professional's best friends! Oftentimes we convert spreadsheet files, or dump data from a database, into a CSV before it is distributed and consumed downstream.

Major league scripting languages, such as Perl, Python, and Ruby, all have their own way of converting Excel spreadsheet files into CSVs. Here I list their most popular libraries, based on my research: for Perl, there is Spreadsheet::ParseExcel; for Python, there is xlrd; for Ruby, there is roo.

However, none of these addressed a problem I had recently.

Here is my use case:
Given Excel files, in both xls and xlsx format, that have hyperlink columns in them, convert them to CSV. For hyperlink columns, save the text value (also known as the friendly name) but not its URL. None of the libraries mentioned above can handle this.

So I ended up trying PHP, and found a PHP library called PHPExcel that addressed my needs. Below is a quick CLI PHP program I wrote.

Follow the steps below to use it:

  1. Download the PHPExcel library;
  2. Save the program below. On Linux, you can save it as excel2csv. On Windows, save it as excel2csv.php. Modify the library path as needed so that it points to the directory where PHPExcel is located;
  3. On Linux, you may want to run
    chmod +x excel2csv
    On Windows you should be ok if your system knows to use PHP when it sees a .php extension;
  4. To use it, on command line, run
    excel2csv inputExcel outputCsv
    Remember to replace the parameters to your liking!

Hope it helps!

#!/usr/bin/php -q
<?php
// Adjust this path so it points to where the PHPExcel library lives.
require_once 'Classes/PHPExcel.php';

$inputFile = $argv[1];
$outputFile = $argv[2];

function Xls2Csv($infile, $outfile)
{
	$fileType = PHPExcel_IOFactory::identify($infile);
	$objReader = PHPExcel_IOFactory::createReader($fileType);

	$objPHPExcel = $objReader->load($infile);

	// The CSV writer saves each cell's display value, so a hyperlink
	// cell is written out as its friendly name, not its URL.
	$objWriter = PHPExcel_IOFactory::createWriter($objPHPExcel, 'CSV');
	$objWriter->save($outfile);
}

Xls2Csv($inputFile, $outputFile);
Using rsync to back up remote 600 files

I had trouble rsyncing remote Linux 600 (rw-------) files today. I knew I had come across this issue before but couldn’t remember how I resolved it, so I had to waste time looking for and verifying a solution. Hence this blog post.

This is the problem I had earlier:

rsync -zr userA@remoteServer:/var/www/website/ /home/user/Documents/webSiteBackup/website/www/
rsync: send_files failed to open "/var/www/website/wp-config.php": Permission denied (13)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1655) [generator=3.1.0]

So the issue is that wp-config.php is a 600 file, meaning only its owner (root, in this case) can read and write it. Although userA@remoteServer has sudo privileges, I still need to edit the sudoers file with visudo so that no password is asked when this user runs rsync.

Here is the line I added via visudo:

userA    ALL=(root) NOPASSWD: /usr/bin/rsync

And here is the slightly modified bash command to run:

rsync --rsync-path="sudo rsync" -zr userA@remoteServer:/var/www/website/ /home/user/Documents/webSiteBackup/website/www/

Hope it helps you as well.

Grant full permission to Windows folders and files

For a Windows application or service, it is often necessary for either the user or the service startup account to have full control over its folders and files. In most cases this is transparent to the user; it is taken care of by various delegation mechanisms behind the scenes. However, it can become a problem during application upgrades or migrations, when one rearranges disk drive letters or mount point names, or moves things from one server to another. To resolve this, one could manually go to each file and folder property page through Windows Explorer, then the Security tab, then the Edit… button, then the Add… button, and so on. This method isn’t too bad if only a handful of files and folders are involved, but it is easy to see that the manual method gets old really quickly, is error prone, and does not scale.

For example, say you are migrating a SQL Server data warehouse instance where there are multiple mount points and hundreds of database files due to filegroup usage and table partitions. During the migration process, you’ll move them to a new location and attach those database files. Ensuring the SQL Server service SID has proper permissions on all of them is no small task, so we need an automated way.

Given this is Windows, I started searching for a PowerShell solution. You’d think this is a common problem and a solution would be easy to find, but that wasn’t the case, so I wrote my own. Have fun with it!


  • Make sure you start PowerShell with “Start as Administrator”. This is to get around the issue that the creator and permission granter are different identities;
  • The first thing the script does is to give ownership to the Administrators group. During execution, you’ll see output like:
    SUCCESS: The file (or folder): "x:\MountPoints\A1" now owned by the administrators group.
    that is to be expected.
  • The script’s execution is recursive, meaning all sub-folders and files are covered.

Without further ado, here is the PowerShell function:

function grantFullPermission ($folderPath, $userAccount) {
    # Take ownership first (recursively, as the Administrators group),
    # so we are allowed to modify the ACL afterwards.
    Invoke-Expression "takeown /f $folderPath /r /a /d Y"
    $aclWork = (Get-Item $folderPath).GetAccessControl("Access")
    $rule = New-Object System.Security.AccessControl.FileSystemAccessRule($userAccount, "FullControl", "ContainerInherit, ObjectInherit", "None", "Allow")
    $aclWork.SetAccessRule($rule)
    Set-ACL $folderPath $aclWork }

As an example, to grant proper permission to the 3 folders listed in the code, you can run the following script as administrator:

function grantFullPermission ($folderPath, $userAccount) {
    Invoke-Expression "takeown /f $folderPath /r /a /d Y"
    $aclWork = (Get-Item $folderPath).GetAccessControl("Access")
    $rule = New-Object System.Security.AccessControl.FileSystemAccessRule($userAccount, "FullControl", "ContainerInherit, ObjectInherit", "None", "Allow")
    $aclWork.SetAccessRule($rule)
    Set-ACL $folderPath $aclWork }

$folders = @("x:\MountPoints\A1", "x:\MountPoints\A2\", "x:\MountPoints\B1")

$folders | foreach {
    grantFullPermission $_ 'NT SERVICE\MSSQLSERVER' }

You could also make this a short script that takes two parameters: a folder and an account. Remember to quote the parameters if they contain spaces.


param([string]$DirectoryPath, [string]$AccountName)

function grantFullPermission ($folderPath, $userAccount) {
    Invoke-Expression "takeown /f $folderPath /r /a /d Y"
    $aclWork = (Get-Item $folderPath).GetAccessControl("Access")
    $rule = New-Object System.Security.AccessControl.FileSystemAccessRule($userAccount, "FullControl", "ContainerInherit, ObjectInherit", "None", "Allow")
    $aclWork.SetAccessRule($rule)
    Set-ACL $folderPath $aclWork }

grantFullPermission $DirectoryPath $AccountName

sed tricks

Over the last two days, I helped a charity rebuild a MySQL server and restore a database with a lot of data of longblob type. Fortunately there was a dump backup file for the database in question.

However, the tables with longblob column(s) were not defined with “ROW_FORMAT=COMPRESSED”. I’d like to restore the database with row compression in place before inserting the data, so I need to modify the dump SQL file. The problem is that the file is 2.5 GB and the server only has 4 GB of memory, so editing it in a regular editor is a challenge. Fortunately, Linux has sed to save the day. Don’t you love open source free software?

I am a power Vi/Vim user, so I am familiar with sed and have used it in the past. But there were still a few things I had to search around for quick answers on, so I’ll record the noteworthy points here. I’ve lost count of how many times my own past blog entries have helped me over the years, and I hope you’ll find this helpful too!

  • The -n switch is important. sed is a stream editor, and in many cases you’d like to suppress the default streaming of every line to stdout; -n does that. This was especially important in my case, because a) the file is large, and b) it contains blob data that may or may not be “fit to print”;
  • To see a particular line, say line a, use the p (print) command: sed -n 'ap' file
  • To see all lines between line a and b: sed -n 'a,bp' file
  • To see multiple, non-adjacent lines, say line a, e, g: sed -n 'ap;ep;gp' file
  • To edit big files, you’d like to make in-place changes. Therefore the -i switch. For example, to put in InnoDB row compression, this is the command I used: sed -i 's/CHARSET=utf8;/CHARSET=utf8 ROW_FORMAT=COMPRESSED;/' file
  • Similarly, to delete line a: sed -i 'ad' file. Range deletes work as well, e.g. sed -i 'a,bd' file
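The placeholders above can be made concrete with a small self-contained demonstration; the file name and its contents here are made up purely for illustration:

```shell
# Create a small sample file standing in for the big dump.
printf 'line1\nline2\nline3\nline4\nline5\n' > /tmp/sample.txt

# Print only line 3 (-n suppresses the default echo of every line).
sed -n '3p' /tmp/sample.txt

# Print lines 2 through 4.
sed -n '2,4p' /tmp/sample.txt

# Print the non-adjacent lines 1 and 5.
sed -n '1p;5p' /tmp/sample.txt

# In-place substitution, the same form used for the row-compression edit.
sed -i 's/^line3$/line3 ROW_FORMAT=COMPRESSED/' /tmp/sample.txt

# In-place delete of line 1.
sed -i '1d' /tmp/sample.txt
```

After the two -i commands, the file holds four lines, starting at line2, with the substitution applied to the former line3.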

By the way, when restoring an InnoDB database with a lot of blob data, it makes a lot of sense to enable the following setting in my.cnf, if it is not enabled already (ROW_FORMAT=COMPRESSED requires the Barracuda file format). It’ll make administration much easier down the road:
innodb_file_format = Barracuda

You may also need to tweak the max_allowed_packet and innodb_log_file_size parameters for successful restore.
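Put together, the relevant my.cnf fragment might look like the following sketch; the sizes are illustrative assumptions, not values from my actual server, so adjust them to your workload:

```ini
[mysqld]
# Barracuda is required for ROW_FORMAT=COMPRESSED tables.
innodb_file_format = Barracuda
# Compressed row format also requires each table in its own tablespace.
innodb_file_per_table = 1
# Allow large longblob rows through the client/server protocol.
max_allowed_packet = 256M
# A larger redo log helps bulk inserts of blob data.
innodb_log_file_size = 512M
```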

Something else to pay attention to:
If you use:

mysql -uuser -p database < dump.sql

to restore the database, the program may report the wrong line number when it hits a loading problem. In most cases you need to search the surrounding lines to find where the problem actually is, for example with sed -n '12340,12350p' dump.sql if line 12345 was reported.

Additionally, if you are in a hurry and want to load the data without troubleshooting every issue, you can add the -f (force) switch to the command above, so that mysql reports each error it encounters but continues with the next statement:

mysql -f -uuser -p database < dump.sql

Script to generate index rebuild with PAGE compression

For BI data warehouse databases, since the data does not change much and they typically require a lot of space, it makes a lot of sense to compress the indexes to save space.

I came across some BI databases whose indexes were created without compression. We are in the process of migrating those databases to a new server so I took this opportunity to completely rebuild those indexes with PAGE compression.

Two things are of interest:

  • Indexes are rebuilt from the smallest to the largest. The rationale is that one index rebuild needs roughly twice the actual index size. If we start with the largest index, the data file may need to grow unnecessarily; if we start with the smallest one, there may already be enough free space inside to accommodate that rebuild. Once that rebuild is done, more space is freed, leaving more room for the next rebuild. This way we can accommodate the index rebuilds with no or minimal additional space requirement;
  • Re-indexing is done on a new server with no or very few connections to it, so the script sets the MAXDOP option, to hopefully make the process faster.
Here is the script:

SELECT
    s.Name AS SchemaName,
    t.NAME AS TableName,
    i.name AS IndexName,
    'ALTER INDEX ' + i.name + ' ON ' + s.Name + '.' + t.NAME + ' REBUILD WITH (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE, MAXDOP = 20);' AS AlterIndex,
    SUM(a.total_pages) * 8 AS TotalSpaceKB,
    SUM(a.used_pages) * 8 AS UsedSpaceKB,
    (SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
FROM sys.tables t
INNER JOIN sys.schemas s ON s.schema_id = t.schema_id
INNER JOIN sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
WHERE t.NAME NOT LIKE 'dt%'
    AND t.is_ms_shipped = 0
    AND i.OBJECT_ID > 255
    AND i.index_id > 0
GROUP BY s.Name, t.Name, i.name
ORDER BY TotalSpaceKB;