mirror of
https://github.com/wren-lang/wren.git
synced 2026-01-18 13:49:59 +01:00
Clarify how string subscripting handles UTF-8.
@ -1,7 +1,37 @@
^title String Class
^category core

A string of Unicode code points stored in UTF-8.
Strings are immutable chunks of text. More formally, a string is a sequence of
Unicode code points encoded in UTF-8.

If you never work with any characters outside of the ASCII range, you can treat
strings like a directly indexable array of characters. Once other characters
get involved, it's important to understand the distinction.

In UTF-8, a single Unicode code point (very roughly a single "character") may
be encoded as one or more bytes. This means you can't directly index by code
point. There's no way to find, say, the fifth code point in a string without
walking the string from the beginning and counting them as you go.
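
As a rough illustration of why this requires a scan, the sketch below walks UTF-8 bytes in Python rather than Wren, only because Python exposes the encoded bytes directly. The helper name `byte_offset_of_code_point` is hypothetical and is not part of Wren's API.

```python
def byte_offset_of_code_point(data: bytes, n: int) -> int:
    """Return the byte offset at which the n-th (zero-based) code point starts."""
    seen = 0
    for offset, byte in enumerate(data):
        # UTF-8 continuation bytes have the form 0b10xxxxxx; any other byte
        # begins a new code point.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("string has fewer than n + 1 code points")


encoded = "F\u00e4c\u00ebh\u00e4mm\u00ebr".encode("utf-8")  # "Fäcëhämmër"
print(len(encoded))                           # 14 bytes for 10 code points
print(byte_offset_of_code_point(encoded, 4))  # 6, the byte offset of "h"
```

The loop has to visit every byte before the target, so the cost grows with the position being looked up, whereas a lookup by byte offset is a single step.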

Because counting code points is relatively slow, string methods generally index
by *byte*, not *code point*. When you do:

    :::dart
    someString[3]

That means "get the code point starting at *byte* three", not "get the third
code point in the string". This sounds scary, but keep in mind that the methods
on string *return* byte indices too. So, for example, this does what you want:

    :::dart
    var metalBand = "Fäcëhämmër"
    var hPosition = metalBand.indexOf("h")
    IO.print(metalBand[hPosition]) // "h"

In general, methods on strings will work in terms of code points if they can do
so efficiently, and will otherwise deal in bytes.

## Methods

### **contains**(other)
@ -13,20 +43,20 @@ It is a runtime error if `other` is not a string.

Returns the length of the string.

### **endsWith(suffix)**
### **endsWith**(suffix)

Checks if the string ends with `suffix`.

It is a runtime error if `suffix` is not a string.

### **indexOf(search)**
### **indexOf**(search)

Returns the index of `search` in the string or -1 if `search` is not a
substring of the string.
Returns the index of the first byte matching `search` in the string or `-1` if
`search` was not found.

It is a runtime error if `search` is not a string.

### **startsWith(prefix)**
### **startsWith**(prefix)

Checks if the string starts with `prefix`.
@ -48,8 +78,19 @@ Check if the string is not equal to `other`.

### **[**index**]** operator

Returns a one character string of the value at `index`.
Returns a string containing the code point starting at byte `index`.

It is a runtime error if `index` is greater than the length of the string.
    :::dart
    IO.print("ʕ•ᴥ•ʔ"[5]) // "ᴥ".

*Note: This does not currently handle UTF-8 characters correctly.*
Since `ʕ` is two bytes in UTF-8 and `•` is three, byte index 5 points to the
bear's nose.

If `index` points into the middle of a UTF-8 sequence, this returns an empty
string:

    :::dart
    IO.print("I ♥ NY"[3]) // "".

It is a runtime error if `index` is greater than the number of bytes in the
string.
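
The subscript behavior described here can be modeled in a few lines. This is a sketch in Python rather than Wren, and it is not Wren's implementation: the function name `subscript` is hypothetical, and rejecting every index outside `0..byteCount - 1` is an assumption of the sketch, not something the text above specifies.

```python
def subscript(text: str, index: int) -> str:
    """Model of a byte-indexed string subscript (illustrative only)."""
    data = text.encode("utf-8")
    # Assumption of this sketch: any index outside 0..len(data)-1 is an error.
    if not 0 <= index < len(data):
        raise IndexError("byte index out of range")
    lead = data[index]
    if lead & 0xC0 == 0x80:
        return ""  # continuation byte: index is in the middle of a sequence
    # The lead byte's high bits give the length of the encoded sequence.
    if lead < 0x80:
        length = 1
    elif lead & 0xE0 == 0xC0:
        length = 2
    elif lead & 0xF0 == 0xE0:
        length = 3
    else:
        length = 4
    return data[index:index + length].decode("utf-8")


print(subscript("ʕ•ᴥ•ʔ", 5))         # ᴥ
print(repr(subscript("I ♥ NY", 3)))  # ''
```

Checking only the byte at `index` keeps the lookup constant-time, which is the trade-off the byte-based indexing is meant to buy.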