Author Archives: Jay Luong

Vim – File Formats, Line Feed, and Carriage Return

When given a comma- or tab-delimited file, we usually want to import it into some kind of database. The first thing you need to find out is what type of file this is, as it could make or break your import.


System           Line terminator   ASCII code   Control key
Mac OS (pre-X)   CR                13           ^M
Mac OS X / Unix  LF                10           ^J
Windows          CRLF              13, 10       ^M$

This could easily be done at the command line with:

Doesn’t get much clearer than that. So this particular file is from a Windows system with CRLF terminators. Without checking, it’s a bit vague what we’re importing from. For example, if we assumed the text file was from an OS X/Unix system and ran this:

You’ll get some funky carriage return characters at the beginning of each row (and they will count as characters and truncate your data). Now, using the vim editor, you won’t see much wrong with the file with the “:set list” command. You’ll see ^I where the tabs are and a $ for end-of-line. All looks well, but that’s because vim auto-detected the file format. You could see this with the vim command:

Now to see its true colors, let’s force it to read as unix/DOS.

You’ll start to see the infamous ^M in the file.

In conclusion, find out exactly what kind of file you’re importing, and let the importer know. In the case of MySQL, terminating lines with “\r\n” will provide a proper import. I hope this solves the mystery of imports that go wrong!
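If you would rather check the terminators from a script before importing, here is a minimal sketch in Python. The function names are my own, and it only counts CR, LF, and CRLF byte sequences; it is not a replacement for a proper file-type tool.

```python
def detect_line_endings(data: bytes) -> str:
    """Classify the line terminators found in raw file bytes."""
    crlf = data.count(b"\r\n")
    # Lone CRs and lone LFs are whatever is left after removing the pairs.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    counts = {"CRLF": crlf, "CR": cr, "LF": lf}
    found = [name for name, n in counts.items() if n > 0]
    if not found:
        return "none"
    if len(found) > 1:
        return "mixed"
    return found[0]


def detect_file(path: str) -> str:
    """Read a file in binary mode so the terminators are not translated."""
    with open(path, "rb") as handle:
        return detect_line_endings(handle.read())


# A tab-delimited sample with Windows terminators.
print(detect_line_endings(b"id\tname\r\n1\tJay\r\n"))  # CRLF
```

A result of “CRLF” means the importer should be told to terminate lines with “\r\n”; “mixed” is a sign the file was edited on more than one system and deserves a closer look.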


How Unicode Plays a Part in Your Software – Encoding & Character Sets


There was a time when you could determine the size of a file by counting the number of characters it had. One character equates to one byte. Simple. In fact, it was how I found the office perpetrator who printed out a nasty letter for everyone to see. I went through all the print logs and counted the bytes.


In many cases, this is still true. However, for languages with thousands of characters, such as Chinese, 8 bits (2^8 = 256 values) is not enough. For this reason a multitude of encoding standards (ISO-8859, Mac OS Roman, Big5, MS-Windows character sets, etc.) have been implemented, but it has been a headache to keep them consistent across applications and delivery systems. In some cases, having multiple encodings or character sets in one document would require yet another encoding standard or would just be impossible. This applies not only to text documents, but to web pages and databases as well.


We needed a standard that encompasses it all. That standard is called Unicode.


What is Unicode?

Unicode is just a giant mapping table of numbers (code points) to characters. That’s about it. The kicker is that it includes every character imaginable on this planet. Basically it’s the superset of all character sets in existence today. It even includes ancient scripts like Egyptian Hieroglyphs, Cuneiform, and Gothic. The characters make up the code space.


Unicode encodings (e.g., UTF-8) specify how these numbers (the code points) are represented as bits.


The Unicode code space consists of 17 planes of 65,536 (= 2^16) code points each. That’s 1,114,112 code points, enough to map all past, present, and future characters created by mankind. The first plane, the Basic Multilingual Plane (BMP), contains the most commonly used characters.
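The arithmetic is easy to check in Python. This is just an illustration; plane_of is a helper name I made up, and it relies on ord() returning a character’s code point.

```python
PLANES = 17
POINTS_PER_PLANE = 2 ** 16  # 65,536 code points in each plane


def plane_of(char: str) -> int:
    """Return the Unicode plane (0 to 16) that a single character lives in."""
    return ord(char) // POINTS_PER_PLANE


print(PLANES * POINTS_PER_PLANE)       # 1114112
print(hex(ord("A")), plane_of("A"))    # 0x41 0
print(plane_of("\U00013000"))          # 1 (an Egyptian hieroglyph, outside the BMP)
```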


What’s the difference between a character set and an encoding?

Character sets are technically just lists of distinct characters and symbols. They could be used by multiple languages (e.g., Latin-1 is used for the Americas and Western Europe).


Encoding is the way these characters are stored in memory. An encoding maps these characters to a binary representation.


Character sets that have encodings are called coded character sets. Unsurprisingly, this is a bit confusing because many systems use the terms interchangeably. For example, MySQL refers to characters and their encodings simply as a character set. What it really means is a coded character set (or code page).


Every encoding must have a character set associated with it but a character set could have multiple encodings. The most relevant example of this is the Unicode character set with multiple encodings (UTF-8, UTF-16 BE, UTF-16 LE, UTF-32, etc). The same character in one encoding could be represented by a larger/smaller number of bytes in another encoding.
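To see the “same character, different number of bytes” point for yourself, here is a small Python sketch. The helper name encoded_sizes is mine, and I use the explicit little/big-endian codec names so that no byte order mark is added to the counts.

```python
def encoded_sizes(char: str) -> dict:
    """Return how many bytes one character takes in several Unicode encodings."""
    encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le"]
    return {name: len(char.encode(name)) for name in encodings}


# An ASCII letter, an accented Latin letter, a Chinese character, and an emoji.
for char in ["A", "é", "中", "\U0001F600"]:
    print(f"U+{ord(char):04X}", encoded_sizes(char))
```

The code point is identical in every row of the output; only the byte representation changes with the encoding.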

This W3C article does a fine job explaining this.

What are code pages?

A code page is mostly a Microsoft Windows-specific encoding that is based on a standard encoding with a few modifications. The term is also used generically for any coded character set.


UTF-8 vs UTF-16 vs UTF-32


UTF-8

  • Variable-length encoding with 8-bit code units
  • Backward compatible with ASCII without having to deal with endianness or byte order marks (BOM). The first 128 characters correspond one-to-one with ASCII.
  • Characters vary in length, which could make indexing and calculating a code point slow.

UTF-16

  • Variable-length encoding with 16-bit code units
  • Great if ASCII doesn’t dominate the document. e.g., most East Asian characters require 2 bytes in UTF-16 whereas in UTF-8 they require at least 3.
  • If using primarily US-ASCII strings, there will be lots of null bytes.

UTF-32

  • Fixed-length encoding with 32-bit code units
  • You don’t need to decode the code point as it’s given to you in its purest 32-bit format.
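A quick way to compare the three is to encode a plain ASCII string with each of them. This is a rough Python illustration; the little-endian codec names are chosen so that no BOM is prepended.

```python
text = "hello"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

# UTF-8 is byte-for-byte identical to ASCII for these characters.
print(utf8 == text.encode("ascii"))  # True

# UTF-16 pads every ASCII character with a null byte.
print(utf16)                 # b'h\x00e\x00l\x00l\x00o\x00'
print(utf16.count(b"\x00"))  # 5

# UTF-32 is fixed width, so the n-th code point sits at byte offset 4 * n.
n = 1
code_point = int.from_bytes(utf32[4 * n:4 * n + 4], "little")
print(code_point == ord(text[n]))  # True
```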

How do character sets and encodings relate to fonts?

A font defines the “glyphs”, usually for a single character set or a subset of a character set. If a character is undefined in the font, you’ll typically get a replacement character like a square box or question mark.


Basically, fonts are glyphs that are mapped to code points in a coded character set.


At this time, most systems are using UTF-8. It’s efficient as far as storage (as long as it’s mostly ASCII characters). It has the possibility of mapping any character imaginable so there’s really no reason not to use it.


When you type on your keyboard, you’re using a certain encoding scheme. When you save that file and display the text again using the same encoding, you’ll get consistent results. The biggest problem we run into is seeing random-looking characters in our files. The usual explanation for this is that the encoding used to view the file is not the one used to write it.
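Here is a small Python sketch of how those random-looking characters come about. The sample string and the choice of Windows-1252 as the wrong guess are my own assumptions for the demonstration.

```python
original = "café ☃"

stored = original.encode("utf-8")  # the bytes that end up on disk
wrong = stored.decode("cp1252")    # a viewer that guesses Windows-1252
right = stored.decode("utf-8")     # a viewer that uses the real encoding

print(wrong)              # garbled text
print(right == original)  # True

# The bytes themselves were never damaged, so the wrong guess can be undone.
print(wrong.encode("cp1252").decode("utf-8") == original)  # True
```

The bytes on disk are fine in both cases; only the interpretation differs, which is why nothing needs to be converted to fix the display.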


It’s important to note: conversion from one encoding to another is not for the faint of heart. You have to know what you’re doing or you’ll lose your original bits forever. Sometimes it’s not even possible to perform the conversion.
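As a sketch of what “not even possible” looks like in practice, here is a Python example that tries to squeeze Chinese characters into Latin-1. The helper can_convert is a name I made up for the demonstration.

```python
text = "naïve 中文"


def can_convert(value: str, encoding: str) -> bool:
    """Return True if every character of value exists in the target encoding."""
    try:
        value.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_convert(text, "utf-8"))    # True
print(can_convert(text, "latin-1"))  # False, Latin-1 has no Chinese characters

# Forcing the conversion replaces what doesn't fit, and the originals are gone.
lossy = text.encode("latin-1", errors="replace")
print(lossy.decode("latin-1"))       # naïve ??
```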


From this point forward, a byte no longer equates to a character. Be wary of the encoding scheme used, especially if you start to see a snowman and cellphones in your CSV file.


Bottom line: Use UTF-8.



How to Set Up Let’s Encrypt on Apache2 and Ubuntu 14.04 LTS

After years of having to manually renew certificates (I’ve used StartSSL in the past), Let’s Encrypt is finally live and will allow you to automate this process by installing an agent and a cron job.

Here I’m trying to install certificates on multiple blogs on the same server.



It’s stupidly easy to do:

  1. Go here:
  2. Follow the on-screen instructions
  3. THAT’S IT!

Amazing, right? Well, I did run into a few errors, but they were easily solved:

  • No valid IP addresses found for [website]
    • Make sure your DNS A and CNAME records are correct with the correct IP
  • Incorrect validation certification for TLS-SNI-01 challenge
    • This I found was due to two issues I had:
      • I forgot the site was no longer hosted on my server so the DNS record was pointing to another host anyway
      • You need to have SSLEngine, SSLCertificateFile, and SSLCertificateKeyFile values set in your Apache configuration. Even if they point to empty files, it’ll work. Of course, this is probably because I had values there earlier. I haven’t tested this, but if only “SSLEngine on” is set, it should still work.
  • DNS problem: NXDOMAIN looking up A for [website]
    • Make sure you have a CNAME record for the subdomain (e.g., “www”)
  • Redirect HTTP traffic to HTTPS no longer works
    • All you have to do is adjust the 000-default configuration to redirect HTTP to HTTPS.

Also be sure to protect your sites from POODLE. Analyze your site here:

That’s all folks. If you have any issues please let me know in the comments.

How to Revert/Undo Changes in Git

Generally there are three ways of reverting changes:

  1. checkout
  2. revert
  3. reset


If you just need to revert specific files, you could run git checkout to retrieve an exact version. In the example below, I wanted to revert the “app.rb” file so that it only contains “Some app work”.



Revert will create a new commit undoing the changes made during a specific commit. It undoes an entire commit without removing it from your project history. In this example, I’m going to undo the changes in the last commit. As you could see, the history of the revert is kept.




Unlike revert, reset will undo all subsequent commits. It has the potential to discard work, so only use this to undo local changes. Most people use reset to unstage files to match the most recent commit and perhaps create more focused commits/snapshots. The working directory is unchanged unless the “--hard” option is set.

You could also reset to a tag.
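To compare the three approaches side by side, here is a rough Python sketch that builds a throwaway repository and runs each command in turn. It assumes git is installed and on your PATH; the helper names, commit messages, and demo identity are all made up for the illustration, and the file name “app.rb” is borrowed from the example above.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside repo and return its standard output."""
    cmd = ["git", "-c", "user.name=Demo", "-c", "user.email=demo@example.com",
           "-c", "commit.gpgsign=false", *args]
    done = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout


def demo():
    """Return (file text after checkout, commits after revert, commits after reset)."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        app = repo / "app.rb"
        git(repo, "init", "-q")
        app.write_text("Some app work\n")
        git(repo, "add", "app.rb")
        git(repo, "commit", "-q", "-m", "first")
        app.write_text("Some app work\nMore work\n")
        git(repo, "commit", "-q", "-am", "second")

        # 1. checkout: pull a single file back from an earlier commit.
        git(repo, "checkout", "HEAD~1", "--", "app.rb")
        restored = app.read_text()
        git(repo, "checkout", "HEAD", "--", "app.rb")  # put the file back

        # 2. revert: add a new commit that undoes the last one; history grows.
        git(repo, "revert", "--no-edit", "HEAD")
        after_revert = int(git(repo, "rev-list", "--count", "HEAD"))

        # 3. reset --hard: move the branch back and drop the later commits.
        git(repo, "reset", "-q", "--hard", "HEAD~2")
        after_reset = int(git(repo, "rev-list", "--count", "HEAD"))
        return restored, after_revert, after_reset


if shutil.which("git"):
    print(demo())
```

The commit counts make the difference visible: revert leaves the repository with one more commit than before, while reset leaves it with fewer.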

How to Set Up a DigitalOcean Provider on Vagrant

If you haven’t done so, install Vagrant for your OS here.

Generate your key pairs. If using Windows, use puttygen. In Windows you’re going to have to use the OpenSSH key formats:


Edit the Vagrantfile. Make sure the private key does not have an extension. The public key should have the extension “.pub” with the same file name as the private key (e.g., “”).

Create your API V2 token from your DO control panel:


Most of these lines are self-explanatory. To get a list of images and regions you could click on “Create Droplet” from your web account or you could run the following:

Here’s a sample list of regions and images:

Regions:
  • nyc1
  • ams1
  • sfo1
  • nyc2
  • ams2
  • sgp1
  • lon1
  • nyc3
  • ams3
  • fra1

Images:
  • centos-5-8-x64
  • debian-6-0-x64
  • fedora-21-x64
  • ubuntu-12-04-x64
  • debian-7-0-x64
  • ruby-on-rails
  • wordpress

Now just change into the Vagrant project folder and run “vagrant up”.

How to Expand a Boot Disk on a GCE Instance

Google now allows you to specify a boot disk size larger than 10GB when you create your instance. In any case, if you need to resize your boot disk for any reason, these are the steps I followed.

Here I’ve attached a 250GB blank disk to the instance:

“pv” shows you the progress of the “dd” command. This will take some time. Once it’s finished:

Now we re-partition the disk:


Now you could run:

  1. Detach the disk.
  2. Clone the instance and choose this disk as the root disk.
  3. SSH and make sure everything looks good.
  4. Once it looks good, you could delete the original instance and create a new instance with this disk.

Google Cloud Services – How to Use Access Tokens Directly

If, for whatever reason, you can’t use gsutil, gcloud, or a client library, you could request an access token directly. This example uses PHP 5.2 (the PHP client library only works with PHP 5.3+). Please make sure your instance has access to the correct scopes and that a service account is enabled.
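The same idea can be sketched in Python rather than PHP. I’m assuming here that the token comes from the Compute Engine metadata server for the instance’s default service account; the function names are my own, and fetch_access_token will only work from inside a GCE instance.

```python
import json
import urllib.request

# Assumption: the metadata server's token endpoint for the default service account.
TOKEN_URL = ("http://metadata.google.internal/computeMetadata/v1/"
             "instance/service-accounts/default/token")


def build_token_request() -> urllib.request.Request:
    """Build (but do not send) the request; the header is required by the server."""
    return urllib.request.Request(TOKEN_URL, headers={"Metadata-Flavor": "Google"})


def parse_token(body: bytes) -> str:
    """Pull the access token out of the JSON response body."""
    return json.loads(body)["access_token"]


def fetch_access_token() -> str:
    """Ask the metadata server for a token. Only works on a GCE instance."""
    with urllib.request.urlopen(build_token_request(), timeout=5) as response:
        return parse_token(response.read())


def auth_header(token: str) -> dict:
    """Header to attach to subsequent Google API requests."""
    return {"Authorization": f"Bearer {token}"}
```

On an instance you would call auth_header(fetch_access_token()) and pass the result along with each API request. Tokens expire, so request a fresh one rather than storing it for long.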

Related Resources:


Example Usage of Ruby Client Library for Google Cloud Storage

“But it Works on My Box” – A Case for Vagrant

When I need to quickly spin up a server to do testing, debugging, development, etc., I usually spin up VirtualBox and build a VM based off a cloud image file from scratch. It is a tedious process but I figured it’s easy and quick to just clone the VM afterwards. The problem arises when I want to share this environment with my contract workers or other members of my team. These images tend to be large as they include the OS, disks, hardware configuration, etc.

Enter Vagrant.

Vagrant isn’t exactly a competitor to VirtualBox but more of a wrapper. It’s basically a VM manager.

Benefits of Vagrant:

  • You could check the Vagrant files into source control with your project (e.g., Git) and a user can have an identical environment by running a single command, “vagrant up”
    • You don’t have to share a 500MB+ VM image file. You only need two small files (Vagrantfile & a provision script)
    • Keeps absolute parity between all collaborators on a project. All they have to do is pull the source code and “vagrant up”
    • You could integrate with several Configuration Management systems such as Chef, Ansible, and Puppet for provisioning.
    • You could have different Virtual Machines for different purposes
    • The configuration is in plain-text so you have an idea of what kind of hardware is required to run the application and which apps will be installed.
  • It could manage and spin up multiple VM instances
    • If you need multiple servers for load-balancing web servers, clustering databases, etc., all you have to do is add these servers to the Vagrantfile and “vagrant up”
    • Simplifies multi-VM networks
    • It makes setting up a distributed system development environment very easy.
    • Centralized configuration of all VMs
  • Faster to set up a VM
    • Much quicker to get a VM up and running than with VirtualBox alone
    • Networking is automated
  • Share the environment
    • You could easily have users connect directly to your host via http, ssh, and custom ports.
  • You could manage other types of machines with Providers
    • VirtualBox
    • VMware
    • AWS
    • GCE
  • You don’t have to change your development environment
    • Vagrant shares the code files seamlessly between the host and the guest. You could use your own OS, editor, browser, and other development tools.

In the end, Vagrant saves a lot of time and it’s easy to get started with. Keep in mind that VMs do require actual hardware allocations and there could be limitations on your machine. If you need to run a large number of isolated environments, consider Docker containers instead.

Git Best Practices

  1. Commit often
  2. All is not Lost
    1. git log -g
    2. git fsck --unreachable
    3. git stash list
  3. Backups
    1. Although a clone is a backup it does not include git configs, working directory/index, non-standard refs, or dangling objects.
  4. Once you push, don’t change history.
  5. Choose a Workflow
  6. Logically divide into repositories
  7. Useful commit messages
  8. Stay up to date
    1. Rebasing
    2. git pull --rebase
    3. git merge --no-ff
  9. Maintenance
    1. git fsck
    2. git gc --aggressive
    3. git remote update --prune
    4. git stash list
  10. Enforce standards
    1. Regression tests
    2. Compilation tests
    3. Syntax/link checkers
    4. Commit message analysis
  11. Useful Tools
    1. gitolite
    2. gitslave
    3. gerrit
  12. Integrate with external tools
  13. Always name your stashes
  14. Protect against history rewriting