Vim – File Formats, Line Feed, and Carriage Return

When given a comma- or tab-delimited file, we usually want to import it into some kind of database. The first thing you need to find out is what type of file it is, as that could make or break your import.

g_ – moves to the last non-blank character of the line

OS              Line ending   ASCII     Shown in vim as
Mac OS (pre-X)  CR            13        ^M
Mac OS X/Unix   LF            10        ^J
Windows         CRLF          13 + 10   ^M$

This could easily be done at the command line with:
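Presumably the command here was file(1); a sketch, with the data file name being a placeholder:

```shell
# Ask file(1) how the lines are terminated (data.txt is a hypothetical name)
file data.txt
# A Windows-style file reports something like:
#   data.txt: ASCII text, with CRLF line terminators
```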

Doesn’t get much clearer than that. So this particular file is from a Windows system, with CRLF terminators. Without checking, it’s a guessing game as to what we’re importing. For example, if we assumed the text file was from an OS X/Unix system and ran this:

You’ll get funky carriage-return characters at the end of each row (and they count as characters, so they can truncate your data). Now, using vim, you won’t see much wrong with the file via the “:set list” command. You’ll see ^I where the tabs are and a $ for end-of-line. All looks well, but that’s because vim auto-detected the file format. You could see this with the vim command:
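Presumably the command in question was “:set fileformat?”; a sketch, with the file name being a placeholder:

```shell
vim data.txt
# then, inside vim:
#   :set list         " tabs display as ^I, end-of-line as $
#   :set fileformat?  " prints fileformat=dos when vim auto-detected CRLF
```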

Now, to see its true colors, let’s force vim to read the file as unix (LF-only).
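A sketch of forcing the format, using vim’s ++ff option:

```shell
vim data.txt
#   :e ++ff=unix      " re-read the file as LF-only; stray CRs now show as ^M
```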

You’ll start to see the infamous ^M in the file.

In conclusion: find out exactly what kind of file you’re importing, and let the importer know. In the case of MySQL, specifying lines terminated by “\r\n” will produce a proper import. I hope this solves the mystery of imports gone wrong!
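As a sketch of the two options (table and file names here are made up): either declare the real terminator to MySQL, or normalize the file to LF-only first:

```shell
# Option 1 (inside MySQL): declare the true terminator
#   LOAD DATA LOCAL INFILE 'data.txt' INTO TABLE imports
#   FIELDS TERMINATED BY '\t'
#   LINES TERMINATED BY '\r\n';
# Option 2 (shell): strip the carriage returns, then import with '\n'
tr -d '\r' < data.txt > data-unix.txt
```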

 

How Unicode Plays a Part in Your Software – Encoding & Character Sets

Introduction

There was a time when you could determine the size of a file by counting the number of characters it had. One character equates to one byte. Simple. In fact, it was how I found the office perpetrator who printed out a nasty letter for everyone to see. I went through all the print logs and counted the bytes.

 

In many cases, this is still true. However, for languages such as Chinese, with thousands of characters, 8 bits (2^8 = 256 values) is not enough. For this reason a multitude of encoding standards (ISO-8859, Mac OS Roman, Big5, the MS-Windows character sets, etc.) were created, but keeping them consistent across applications and delivery systems has been a headache. In some cases, having multiple encodings or character sets in one document would require yet another encoding standard, or would simply be impossible. This applies not only to text documents, but to web pages and databases as well.

 

We needed a standard that encompasses it all. That standard is called Unicode.

 

What is Unicode?

Unicode is just a giant mapping table of numbers (code points) to characters. That’s about it. The kicker is that it includes every character imaginable on this planet. Basically it’s the superset of all character sets in existence today. It even includes ancient scripts like Egyptian Hieroglyphs, Cuneiform, and Gothic. The characters make up the code space.

 

Unicode encodings (e.g., UTF-8) specify how these code points are represented as bits.

 

The Unicode code space consists of 17 planes of 65,536 (=2^16) code points each. That’s 1,114,112 code points in total: enough to map all past, present, and future characters created by mankind. The first plane, the Basic Multilingual Plane (BMP), contains the most commonly used characters.
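The arithmetic checks out:

```shell
# 17 planes x 65,536 code points per plane
echo $((17 * 65536))    # prints 1114112
```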

 

What’s the difference between a character set and an encoding?

Character sets are technically just lists of distinct characters and symbols. They can be shared by multiple languages (e.g., Latin-1 is used for the Americas and Western Europe).

 

Encoding is the way these characters are stored in memory: an encoding maps each character to a binary representation.

 

Character sets that have encodings are called coded character sets. Unsurprisingly, this gets confusing because many systems use the terms interchangeably. For example, MySQL refers to characters together with their encoding simply as a “character set.” What it really means is a coded character set (or code page).

 

Every encoding must have a character set associated with it, but a character set can have multiple encodings. The most relevant example is the Unicode character set with its multiple encodings (UTF-8, UTF-16 BE, UTF-16 LE, UTF-32, etc.). The same character could be represented by a larger or smaller number of bytes from one encoding to the next.

This W3C article does a fine job explaining this.

What are code pages?

A code page is mostly a Microsoft Windows-specific term for an encoding based on a standard encoding with a few modifications. It can also refer generically to a coded character set.

 

UTF-8 vs UTF-16 vs UTF-32

UTF-8

  • Variable-length 8 bit code units
  • Backward compatible with ASCII, without having to deal with endianness or byte order marks (BOM). The first 128 characters correspond one-to-one with ASCII.
  • Commonly used characters can take varying numbers of bytes, which makes indexing into a string and counting code points slow.

UTF-16

  • Variable-length 16 bit code units
  • Great if ASCII doesn’t dominate the document: East Asian characters typically take 2 bytes in UTF-16, whereas in UTF-8 they take at least 3.
  • If storing primarily US-ASCII strings, there will be lots of null bytes.

UTF-32

  • Fixed-length 32 bit code units
  • You don’t need to decode anything: each code point is given to you directly in its full 32-bit form.
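To make the size differences concrete, here is the euro sign (U+20AC) counted in bytes under each encoding, using iconv (assuming a glibc-style iconv is installed):

```shell
# U+20AC is e2 82 ac in UTF-8: 3 bytes
printf '\xe2\x82\xac' | wc -c                                # 3
printf '\xe2\x82\xac' | iconv -f UTF-8 -t UTF-16BE | wc -c   # 2
printf '\xe2\x82\xac' | iconv -f UTF-8 -t UTF-32BE | wc -c   # 4
```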

How do character sets and encodings relate to fonts?

A font defines the “glyphs,” usually for a single character set or a subset of one. If a character is undefined in the font, you’ll typically get a replacement character like a square box or question mark.

 

Basically, fonts are glyphs that are mapped to code points in a coded character set.

Conclusion

At this time, most systems use UTF-8. It’s storage-efficient (as long as the text is mostly ASCII characters), and it can map any character imaginable, so there’s really no reason not to use it.

 

When you type on your keyboard, you’re using a certain encoding scheme. When you save that file and display the text again using the same encoding, you’ll get consistent results. The biggest problem we run into is seeing random-looking characters in our files. The usual explanation is that the encoding used to view the file doesn’t match the one it was saved with.

 

It’s important to note: conversion from one encoding to another is not for the faint of heart. You have to know what you’re doing or you’ll lose your original bits forever. Sometimes the conversion isn’t even possible.
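For a concrete case, iconv converts between encodings; the file names here are placeholders:

```shell
# Convert a Latin-1 file to UTF-8; without -c, iconv stops on unmappable bytes
# instead of silently dropping them, which is what you want
iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt
```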

 

From this point forward, a byte no longer equates to a character. Be wary of the encoding scheme used, especially if you start to see snowmen and cellphones in your CSV file.

 

Bottom line: Use UTF-8.

 

 

How to Setup Let’s Encrypt on Apache2 and Ubuntu 14.04 LTS

After years of manually renewing certificates (I’ve used StartSSL in the past), Let’s Encrypt is finally live and lets you automate the process by installing an agent and a cron job.

Here I’m trying to install certificates on multiple blogs on the same server.



It’s stupidly easy to do:

  1. Go here: https://certbot.eff.org/
  2. Follow the on-screen instructions
  3. THAT’S IT!

Amazing, right? Well, I did run into a few errors, but they were easily solved:

  • No valid IP addresses found for [website]
    • Make sure your DNS A and CNAME records point at the correct IP
  • Incorrect validation certification for TLS-SNI-01 challenge
    • This I found was due to two issues:
      • I forgot the site was no longer hosted on my server, so the DNS record was pointing to another host anyway
      • You need to have SSLEngine, SSLCertificateFile, and SSLCertificateKeyFile set in your Apache configuration. Even if they point to empty files, it’ll work (though that’s probably because I had real values there earlier). I haven’t tested it, but “SSLEngine on” alone should also work. While you’re at it, turn off SSLv3.
  • DNS problem: NXDOMAIN looking up A for [website]
    • Make sure you have a CNAME record for the subdomain (e.g., “www”)
  • Redirect HTTP traffic to HTTPS no longer works
    • All you have to do is adjust the 000-default configuration to redirect to the HTTPS site:
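A minimal sketch of what that 000-default virtual host could look like (the domain is a placeholder; this assumes mod_alias’s Redirect directive):

```apache
<VirtualHost *:80>
    ServerName example.com
    # Send all plain-HTTP requests to the HTTPS site
    Redirect permanent / https://example.com/
</VirtualHost>
```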

Also be sure to protect your sites from POODLE.  Analyze your site here: https://www.ssllabs.com/ssltest/analyze.html

That’s all folks. If you have any issues please let me know in the comments.

google api client no such file to load

Getting the following error when running annotate for my models:

…/activesupport-4.2.6/lib/active_support/dependencies.rb:274:in `require': No such file to load -- google/api_client (LoadError)

Try downgrading your google-api-client to a version below 0.9. Modify your Gemfile so the google-api-client line reads:

gem 'google-api-client', '<0.9'

How to Revert/Undo Changes in Git

Generally there are three ways of reverting changes:

  1. checkout
  2. revert
  3. reset

Checkout

If you just need to revert specific files, you could run git checkout to retrieve an exact version. In the example below, I wanted to revert the “app.rb” file so that it only contains “Some app work”.

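A self-contained sketch of that workflow in a throwaway repository (names and contents are made up):

```shell
set -e
cd "$(mktemp -d)"                    # throwaway repo
git init -q
echo 'Some app work' > app.rb
git add app.rb
git -c user.name=demo -c user.email=demo@example.com commit -qm 'app work'
echo 'an edit we regret' >> app.rb
git checkout -- app.rb               # restore app.rb from the last commit
cat app.rb                           # prints: Some app work
```

To pull the file from an older commit instead, name the commit: git checkout <commit> -- app.rb.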

Revert

Revert will create a new commit undoing the changes made in a specific commit. It backs out an entire commit without rewriting your project history. In this example, I’m going to undo the changes in the last commit. As you can see, the history of the revert is kept.

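A self-contained sketch (the repository and commit messages are made up):

```shell
set -e
cd "$(mktemp -d)"
git init -q
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
echo 'one' > notes.txt && git add notes.txt && g commit -qm 'add notes'
echo 'two' >> notes.txt && git add notes.txt && g commit -qm 'bad change'
g revert --no-edit HEAD      # new commit that undoes 'bad change'
cat notes.txt                # prints: one
git log --oneline            # 'bad change' and its revert both remain in history
```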

 

Reset

Unlike revert, reset will undo all subsequent commits, and it has the potential to destroy work, so only use it to undo local changes that haven’t been shared. Most use reset to unstage files to match the most recent commit, and perhaps create more focused commits/snapshots. The working directory is unchanged unless the “--hard” option is set.

You could also reset to a tag.
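A self-contained sketch showing both a plain reset and a hard reset to a tag (names are made up):

```shell
set -e
cd "$(mktemp -d)"
git init -q
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
echo 'one' > app.txt && git add app.txt && g commit -qm 'first'
git tag v1.0
echo 'two' >> app.txt && git add app.txt && g commit -qm 'second'
git reset HEAD~1         # drop 'second' from history; working dir untouched
cat app.txt              # still has both lines; the edit survives as a local change
git reset --hard v1.0    # reset to the tag AND discard working-dir changes
cat app.txt              # prints: one
```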

How to Setup a DigitalOcean Provider on Vagrant

If you haven’t done so, install Vagrant for your OS here.

Generate your key pair. If using Windows, use PuTTYgen, and export the keys in OpenSSH format.

Edit your Vagrantfile. Make sure the private key file has no extension. The public key should have the extension “.pub” with the same base name as the private key (e.g., “do.pub”).
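A sketch of the relevant Vagrantfile bits, based on the vagrant-digitalocean plugin’s documented settings (the token, key path, image, and region below are placeholders):

```ruby
# Requires the plugin: vagrant plugin install vagrant-digitalocean
Vagrant.configure('2') do |config|
  config.vm.provider :digital_ocean do |provider, override|
    override.ssh.private_key_path = '~/.ssh/do'   # no extension; do.pub sits beside it
    override.vm.box = 'digital_ocean'
    provider.token  = 'YOUR_API_V2_TOKEN'
    provider.image  = 'ubuntu-12-04-x64'
    provider.region = 'nyc3'
  end
end
```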

Create your API V2 token from your DO control panel.

Most of these lines are self-explanatory. To get a list of images and regions you could click on “Create Droplet” from your web account or you could run the following:
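One way to list them without the web UI is to hit the DigitalOcean v2 API directly (the token variable is a placeholder; this assumes a token with read scope):

```shell
curl -s -H "Authorization: Bearer $DO_TOKEN" \
     "https://api.digitalocean.com/v2/regions"
curl -s -H "Authorization: Bearer $DO_TOKEN" \
     "https://api.digitalocean.com/v2/images?type=distribution"
```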

Here’s a sample list of regions and images:

Regions:
  • nyc1
  • ams1
  • sfo1
  • nyc2
  • ams2
  • sgp1
  • lon1
  • nyc3
  • ams3
  • fra1

Images:
  • centos-5-8-x64
  • debian-6-0-x64
  • fedora-21-x64
  • ubuntu-12-04-x64
  • debian-7-0-x64
  • ruby-on-rails
  • wordpress

Now just change into the Vagrant project folder and run “vagrant up”.

rails console hangs

“rails c” hangs with no error messages and no response.

Solution: stop spring using the following command:
“spring stop” or “bin/spring stop”

Spring will automatically start back up the next time you run rails c.

How to Expand a Boot Disk on a GCE Instance

Google now allows you to specify a boot disk larger than 10GB when you create your instance. In any case, if you need to resize your boot disk for any reason, these are the steps I followed.

Here I’ve attached a 250GB blank disk to the instance:
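The copy itself was presumably a dd pipeline; a sketch, where the device names are assumptions (sda is the original boot disk and sdb the new 250GB disk; double-check yours with lsblk first):

```shell
# Clone the whole boot disk onto the larger disk, with pv showing progress
sudo dd if=/dev/sda bs=4M | pv | sudo dd of=/dev/sdb bs=4M
```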

“pv” gives you a progress bar for the “dd” command. This will take some time. Once it’s finished, we re-partition the disk:

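A sketch of the re-partitioning, assuming the cloned disk is /dev/sdb with a single Linux partition; fdisk’s interactive prompts are shown as comments:

```shell
sudo fdisk /dev/sdb
#   p   print the table and note the partition's starting sector
#   d   delete the partition (the data on disk is untouched)
#   n   recreate it with the SAME starting sector and the default (max) end
#   w   write the new table
```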

Now you could run:
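The command here was presumably a filesystem check and resize; a sketch, assuming an ext-family filesystem on the first partition of the hypothetical /dev/sdb:

```shell
# Check the filesystem, then grow it to fill the enlarged partition
sudo e2fsck -f /dev/sdb1
sudo resize2fs /dev/sdb1
```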

  1. Detach the disk.
  2. Clone the instance and choose this disk as the root disk.
  3. SSH and make sure everything looks good.
  4. Once it looks good, you could delete the original instance and create a new instance with this disk.