Clarify how string subscripting handles UTF-8.

2026-01-18 13:49:59 +01:00 · 2015-01-22 16:38:03 -08:00
parent a92e58c804
commit a5b00cebe7
8 changed files with 115 additions and 26 deletions
--- a/doc/site/core/string.markdown
+++ b/doc/site/core/string.markdown
@ -1,7 +1,37 @@
 ^title String Class
 ^category core

-A string of Unicode code points stored in UTF-8.
+Strings are immutable chunks of text. More formally, a string is a sequence of
+Unicode code points encoded in UTF-8.
+
+If you never work with any characters outside of the ASCII range, you can treat
+strings like a directly indexable array of characters. Once other characters
+get involved, it's important to understand the distinction.
+
+In UTF-8, a single Unicode code point (very roughly a single "character") may
+be encoded as one or more bytes. This means you can't directly index by code
+point. There's no way to find, say, the fifth code point in a string without
+walking the string from the beginning and counting them as you go.
+
+Because counting code points is relatively slow, string methods generally index
+by *byte*, not *code point*. When you do:
+
+    :::dart
+    someString[3]
+
+That means "get the code point starting at *byte* three", not "get the third
+code point in the string". This sounds scary, but keep in mind that the methods
+on strings *return* byte indices too. So, for example, this does what you want:
+
+    :::dart
+    var metalBand = "Fäcëhämmër"
+    var hPosition = metalBand.indexOf("h")
+    IO.print(metalBand[hPosition]) // "h"
+
+In general, methods on strings will work in terms of code points if they can do
+so efficiently, and will otherwise deal in bytes.
+
+## Methods

 ### **contains**(other)

@ -13,20 +43,20 @@ It is a runtime error if `other` is not a string.

 Returns the length of the string.

-### **endsWith(suffix)**
+### **endsWith**(suffix)

 Checks if the string ends with `suffix`.

 It is a runtime error if `suffix` is not a string.

-### **indexOf(search)**
+### **indexOf**(search)

-Returns the index of `search` in the string or -1 if `search` is not a
-substring of the string.
+Returns the byte index of the first occurrence of `search` in the string, or
+`-1` if `search` is not found.

 It is a runtime error if `search` is not a string.

-### **startsWith(prefix)**
+### **startsWith**(prefix)

 Checks if the string starts with `prefix`.

@ -48,8 +78,19 @@ Check if the string is not equal to `other`.

 ### **[**index**]** operator

-Returns a one character string of the value at `index`.
+Returns a string containing the code point starting at byte `index`.

-It is a runtime error if `index` is greater than the length of the string.
+    :::dart
+    IO.print("ʕ•ᴥ•ʔ"[5]) // "ᴥ".

-*Note: This does not currently handle UTF-8 characters correctly.*
+Since `ʕ` is two bytes in UTF-8 and `•` is three, byte index 5 is the start of
+the bear's nose.
+
+If `index` points into the middle of a UTF-8 sequence, this returns an empty
+string:
+
+    :::dart
+    IO.print("I ♥ NY"[3]) // "".
+
+It is a runtime error if `index` is greater than or equal to the number of
+bytes in the string.
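
The byte-indexed subscript documented above can be sketched outside the language itself. The following is a minimal Python emulation, not the actual implementation: the function name `byte_subscript` is hypothetical, and treating negative indices and any index at or past the end as an error is an assumption made for the sketch.

```python
def byte_subscript(text: str, index: int) -> str:
    """Return the code point starting at UTF-8 byte `index`.

    Returns "" when `index` lands inside a multi-byte sequence,
    mirroring the behaviour the documentation describes.
    """
    data = text.encode("utf-8")
    # Assumption: out-of-range (including negative) indices are errors.
    if index < 0 or index >= len(data):
        raise IndexError("byte index out of range")

    first = data[index]
    # Continuation bytes look like 10xxxxxx: we are mid-sequence.
    if (first & 0xC0) == 0x80:
        return ""

    # The lead byte tells us how long the encoded code point is.
    if first < 0x80:
        length = 1          # 0xxxxxxx: ASCII
    elif (first >> 5) == 0b110:
        length = 2          # 110xxxxx
    elif (first >> 4) == 0b1110:
        length = 3          # 1110xxxx
    else:
        length = 4          # 11110xxx
    return data[index:index + length].decode("utf-8")


# Byte offsets differ from code point offsets once non-ASCII text appears.
metal_band = "Fäcëhämmër"
h_byte = metal_band.encode("utf-8").index(b"h")   # byte offset of "h"
h_code_point = metal_band.index("h")              # code point offset of "h"
assert (h_byte, h_code_point) == (6, 4)
assert byte_subscript(metal_band, h_byte) == "h"
```

With this sketch, `byte_subscript("ʕ•ᴥ•ʔ", 5)` yields `"ᴥ"` and `byte_subscript("I ♥ NY", 3)` yields `""`, matching the two examples in the documentation. Reading a single byte at a known offset is constant time, which is the efficiency argument the text makes for indexing by byte.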