String manipulation in Go

I find the way strings are implemented in Go really interesting, but also a bit confusing when you are first introduced to them. Go has native UTF-8 support, which flows from how the source code is written through to how strings and “runes” are represented (runes are the thing that really took me by surprise, but more on that later). So what does Go do differently with strings that is worth mentioning?

Strings are immutable objects

This is something Go has in common with many other languages: strings are immutable and cannot be changed once created. It is worth bearing in mind, because every “modification” actually produces a new string, which affects memory allocation and management, especially when loading and manipulating large strings.
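To make the immutability point concrete, here is a minimal sketch. The helper names replaceFirstByte and joinWords are made up for this illustration; only the standard fmt and strings packages are used. It shows that changing a string means building a new one, and that strings.Builder is one way to avoid creating a fresh intermediate string on every concatenation.

```go
package main

import (
	"fmt"
	"strings"
)

// replaceFirstByte (hypothetical helper) returns a new string equal to s
// with its first byte replaced by b. Converting to []byte copies the data,
// so the original string is left untouched.
func replaceFirstByte(s string, b byte) string {
	if len(s) == 0 {
		return s
	}
	buf := []byte(s)
	buf[0] = b
	return string(buf)
}

// joinWords (hypothetical helper) assembles one string from many pieces
// with strings.Builder instead of repeated += concatenation.
func joinWords(words []string) string {
	var sb strings.Builder
	for i, w := range words {
		if i > 0 {
			sb.WriteByte(' ')
		}
		sb.WriteString(w)
	}
	return sb.String()
}

func main() {
	s := "hello"
	// s[0] = 'j' // does not compile: a string cannot be assigned to by index
	t := replaceFirstByte(s, 'j')
	fmt.Println(s, t) // hello jello
	fmt.Println(joinWords([]string{"large", "strings", "add", "up"})) // large strings add up
}
```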

Internal string representation

As we said, Go has native UTF-8 support, and in UTF-8 every code point is represented by up to 4 bytes. The key thing here is the “up to” part of the sentence. Some of the code points are really familiar, like the characters I have used to type this post 😉, while others are less common, like the ⌘ symbol used in the example further down.

The “simple” Latin alphabet characters were being stored on computers long before UTF-8 existed, and a single byte was enough for each of them. More complex characters require more space, and therefore more bytes.

The interesting thing with UTF-8 is that it tries to save us space, so that we don’t have to allocate 4 bytes for characters (or better, code points) that we could previously store in a single byte. This means that when we store a string encoded as UTF-8, each of its characters might occupy 1, 2, 3 or 4 bytes 🤯!
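As a quick illustration of the variable width, the sketch below asks the standard unicode/utf8 package how many bytes each rune of a string needs. The helper name byteWidths and the sample string are assumptions made for this example, and the helper assumes its input is valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteWidths (hypothetical helper) returns, for each rune in s, the number
// of bytes UTF-8 uses to encode it. It assumes s is valid UTF-8.
func byteWidths(s string) []int {
	widths := make([]int, 0, len(s))
	for _, r := range s {
		widths = append(widths, utf8.RuneLen(r))
	}
	return widths
}

func main() {
	// One rune of each possible width: 1, 2, 3 and 4 bytes.
	fmt.Println(byteWidths("eπ⌘🤯")) // [1 2 3 4]
}
```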

So how does Go handle this? Internally, strings are represented as read-only byte slices. Individual Unicode code points are referred to as “runes” in Go (the rune type is an alias for int32). So a string is a series of bytes, and those bytes are decoded, one or more at a time, into runes.
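The difference between bytes and runes shows up as soon as you measure a string. The following is a small sketch (the counts helper and the sample string are made up for illustration): len reports bytes, utf8.RuneCountInString reports runes, and converting to []byte or []rune gives the two views of the same data.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts (hypothetical helper) returns the number of bytes and the number
// of runes in s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "2πr" // π takes 2 bytes in UTF-8
	b, r := counts(s)
	fmt.Println(b, r)      // 4 3
	fmt.Println([]byte(s)) // [50 207 128 114]
	fmt.Println([]rune(s)) // [50 960 114]
}
```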

Let’s consider the string “e⌘f”. This is a UTF-8 string literal consisting of 3 code points (or, in Go terms, runes). Internally it is a read-only slice of 5 bytes, shown here in hexadecimal and grouped by rune:

65 | e2 8c 98 | 66

Byte slice representation of the string

As we can see, the first and last runes require 1 byte each for storage, whereas the middle rune requires 3.
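If you want to check the byte layout yourself, a minimal sketch such as the one below prints the raw bytes of the string in hexadecimal. The hexBytes helper is a name invented for this example; the "% x" verb is standard fmt formatting.

```go
package main

import "fmt"

// hexBytes (hypothetical helper) renders the bytes of s as space-separated
// hexadecimal pairs.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	const s = "e⌘f"
	fmt.Println(len(s))      // 5
	fmt.Println(hexBytes(s)) // 65 e2 8c 98 66
}
```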

Iterating over a string

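So now we can see why iterating over a string is an entirely different story: a position in the underlying byte slice does not necessarily correspond to a whole character. There are 2 different ways we can iterate over strings in Go: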

  • with the for loop and the index positions of the underlying byte slice
  • with the for loop and the range clause

Let’s take the example from the previous section and write some code that iterates over the string using both methods. In both cases we use fmt.Printf to print the type of each element and its value.
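The original listing is not reproduced in this copy of the post, so what follows is a sketch of what such code might look like rather than the exact code. The helper names byteLines and runeLines are invented here, and the lines are collected with fmt.Sprintf before printing so that the two loops are easy to compare.

```go
package main

import "fmt"

// byteLines (hypothetical helper) visits s by index, so every element is a
// single byte of the underlying byte slice.
func byteLines(s string) []string {
	var lines []string
	for i := 0; i < len(s); i++ {
		lines = append(lines, fmt.Sprintf("%d: %T %v", i, s[i], s[i]))
	}
	return lines
}

// runeLines (hypothetical helper) visits s with range, which decodes one
// rune per iteration. The index is the byte offset where the rune starts.
func runeLines(s string) []string {
	var lines []string
	for i, r := range s {
		lines = append(lines, fmt.Sprintf("%d: %T %v", i, r, r))
	}
	return lines
}

func main() {
	const s = "e⌘f"

	// Prints:
	// 0: uint8 101
	// 1: uint8 226
	// 2: uint8 140
	// 3: uint8 152
	// 4: uint8 102
	for _, line := range byteLines(s) {
		fmt.Println(line)
	}

	// Prints:
	// 0: int32 101
	// 1: int32 8984
	// 4: int32 102
	for _, line := range runeLines(s) {
		fmt.Println(line)
	}
}
```

Note that %T reports uint8 and int32 because byte and rune are aliases for those types.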

Iterating over a UTF8 string in Go

In the output, the first for loop prints 5 lines, each containing a single byte of the string, whereas the second for loop prints only 3 lines, one for each rune in the string. The index reported by the range loop is the byte offset at which each rune starts, so for this string it goes 0, 1, 4.
