Clarify how string subscripting handles UTF-8.

Bob Nystrom
2015-01-22 16:38:03 -08:00
parent a92e58c804
commit a5b00cebe7
8 changed files with 115 additions and 26 deletions


@ -1,7 +1,37 @@
^title String Class
^category core
A string of Unicode code points stored in UTF-8.
Strings are immutable chunks of text. More formally, a string is a sequence of
Unicode code points encoded in UTF-8.
If you never work with any characters outside of the ASCII range, you can treat
strings like a directly indexable array of characters. Once other characters
get involved, it's important to understand the distinction.
In UTF-8, a single Unicode code point (very roughly a single "character") may
be encoded as one or more bytes. This means you can't directly index by code
point. There's no way to find, say, the fifth code point in a string without
walking the string from the beginning and counting them as you go.
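To make the cost of that walk concrete, here is a minimal sketch in Python (used only to model the byte-level behavior; it is not Wren code, and `byte_offset_of_code_point` is a hypothetical helper name). It scans the UTF-8 bytes from the start and counts every byte that is not a continuation byte (`10xxxxxx`), since each such byte begins a new code point.

```python
def byte_offset_of_code_point(data: bytes, n: int) -> int:
    """Return the byte offset where the n-th (zero-based) code point starts.

    Assumes `data` is valid UTF-8. The scan is linear: there is no shortcut
    that avoids looking at every byte before the answer.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts a code point.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("string has fewer than n + 1 code points")


# "Fäcëhämmër" written with escapes so the encoding is unambiguous.
data = "F\u00e4c\u00ebh\u00e4mm\u00ebr".encode("utf-8")
print(byte_offset_of_code_point(data, 4))  # 6: "h" is code point 4 but byte 6
```

The two-byte `ä` and `ë` are what push `h` from position 4 to byte 6.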
Because counting code points is relatively slow, string methods generally index
by *byte*, not *code point*. When you do:
:::dart
someString[3]
That means "get the code point starting at *byte* three", not "get the code
point at index three in the string". This sounds scary, but keep in mind that
the methods on strings *return* byte indices too. So, for example, this does what you want:
:::dart
var metalBand = "Fäcëhämmër"
var hPosition = metalBand.indexOf("h")
IO.print(metalBand[hPosition]) // "h"
In general, methods on strings will work in terms of code points if they can do
so efficiently, and will otherwise deal in bytes.
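The round trip between a search and a subscript can be modeled outside Wren. The sketch below is Python, not Wren, and `byte_index_of` and `subscript` are hypothetical stand-ins for the behavior described here, not the actual implementation. It assumes valid UTF-8 and, as a simplification, rejects any index outside `0..byteLength - 1`.

```python
def byte_index_of(text: str, search: str) -> int:
    """Model of indexOf: byte offset of `search` in UTF-8 `text`, or -1."""
    return text.encode("utf-8").find(search.encode("utf-8"))


def subscript(text: str, byte_index: int) -> str:
    """Model of the subscript: the code point starting at `byte_index`.

    Returns "" when the byte is in the middle of a multi-byte sequence.
    """
    data = text.encode("utf-8")
    if not 0 <= byte_index < len(data):
        raise IndexError("byte index out of range")
    first = data[byte_index]
    if first & 0xC0 == 0x80:
        return ""  # continuation byte: not the start of a code point
    # The lead byte's high bits give the length of the sequence.
    if first < 0x80:
        length = 1
    elif first >> 5 == 0b110:
        length = 2
    elif first >> 4 == 0b1110:
        length = 3
    else:
        length = 4
    return data[byte_index:byte_index + length].decode("utf-8")


band = "F\u00e4c\u00ebh\u00e4mm\u00ebr"  # "Fäcëhämmër"
h_position = byte_index_of(band, "h")
print(h_position)                   # 6 (a byte offset, not a code point count)
print(subscript(band, h_position))  # h
```

Because the search result and the subscript both speak in bytes, the caller never has to convert between the two kinds of index.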
## Methods
### **contains**(other)
@ -13,20 +43,20 @@ It is a runtime error if `other` is not a string.
Returns the length of the string.
### **endsWith(suffix)**
### **endsWith**(suffix)
Checks if the string ends with `suffix`.
It is a runtime error if `suffix` is not a string.
### **indexOf(search)**
### **indexOf**(search)
Returns the index of `search` in the string or -1 if `search` is not a
substring of the string.
Returns the index of the first byte matching `search` in the string or `-1` if
`search` was not found.
It is a runtime error if `search` is not a string.
### **startsWith(prefix)**
### **startsWith**(prefix)
Checks if the string starts with `prefix`.
@ -48,8 +78,19 @@ Check if the string is not equal to `other`.
### **[**index**]** operator
Returns a one character string of the value at `index`.
Returns a string containing the code point starting at byte `index`.
It is a runtime error if `index` is greater than the length of the string.
:::dart
IO.print("ʕ•ᴥ•ʔ"[5]) // "ᴥ".
*Note: This does not currently handle UTF-8 characters correctly.*
Since `ʕ` is two bytes in UTF-8 and `•` is three, byte index 5 points to the
first byte of the bear's nose.
If `index` points into the middle of a UTF-8 sequence, this returns an empty
string:
:::dart
IO.print("I ♥ NY"[3]) // "".
It is a runtime error if `index` is greater than the number of bytes in the
string.
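The byte arithmetic behind both examples can be checked with a short sketch. It is written in Python rather than Wren, and `code_point_offsets` is a hypothetical helper introduced only for illustration. It lists the byte offset at which each code point starts; any index not in that list falls inside a multi-byte sequence.

```python
def code_point_offsets(text):
    """Return (byte_offset, code_point) pairs for the UTF-8 encoding of `text`."""
    offsets = []
    position = 0
    for ch in text:
        offsets.append((position, ch))
        position += len(ch.encode("utf-8"))  # 1 to 4 bytes per code point
    return offsets


bear = "\u0295\u2022\u1d25\u2022\u0294"  # "ʕ•ᴥ•ʔ"
print([offset for offset, _ in code_point_offsets(bear)])   # [0, 2, 5, 8, 11]

heart = "I \u2665 NY"  # "I ♥ NY"
print([offset for offset, _ in code_point_offsets(heart)])  # [0, 1, 2, 5, 6, 7]
```

Offset 5 starts `ᴥ` in the first string, while offset 3 is missing from the second list because it lands on the second byte of `♥`.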