mirror of
https://github.com/wren-lang/wren.git
synced 2026-01-11 06:08:41 +01:00
Copy edit the string docs.
@@ -2,46 +2,49 @@
 ^category core
 
 A string is an immutable array of bytes. Strings usually store text, in which
-case it will be the UTF-8 encoding of the text's code points. But you can put
-any kind of byte values in there you want, including null bytes or invalid UTF-8
-sequences.
+case the bytes are the UTF-8 encoding of the text's code points. But you can put
+any kind of byte values in there you want, including null bytes or invalid
+UTF-8.
 
 There are a few ways to think of a string:
 
 * As a searchable chunk of text composed of a sequence of textual code points.
 
-* As an iterable sequence of numeric code points.
+* As an iterable sequence of code point numbers.
 
 * As a flat array of directly indexable bytes.
 
 All of those are useful for some problems, so the string API supports all three.
-The first one is the most common, so that's what most methods directly on the
-string class cater towards.
+The first one is the most common, so that's what methods directly on the string
+class cater to.
 
 In UTF-8, a single Unicode code point—very roughly a single
-"character"— may be encoded as one or more bytes. This means you can't
+"character"—may encode to one or more bytes. This means you can't
 efficiently index by code point. There's no way to jump directly to, say, the
-fifth code unit in a string without walking the string from the beginning and
+fifth code point in a string without walking the string from the beginning and
 counting them as you go.
 
-Because counting code units is relatively slow, the indexes passed to string
+Because counting code points is relatively slow, the indexes passed to string
 methods are *byte* offsets, not *code point* offsets. When you do:
 
 :::dart
 someString[3]
 
-That means "get the code unit starting at *byte* three", not "get the third
-code unit in the string". This sounds scary, but keep in mind that the methods
-on string *return* byte indices too. So, for example, this does what you want:
+That means "get the code point starting at *byte* three", not "get the third
+code point in the string". This sounds scary, but keep in mind that the methods
+on strings *return* byte indexes too. So, for example, this does what you want:
 
 :::dart
 var metalBand = "Fäcëhämmër"
 var hPosition = metalBand.indexOf("h")
 IO.print(metalBand[hPosition]) // "h"
 
-If you want to work with a string as a sequence numeric code points, call the `codePoints` getter. It returns a [Sequence](sequence.html) that will decide UTF-8 and iterate over the code points, returning each as a number.
+If you want to work with a string as a sequence numeric code points, call the
+`codePoints` getter. It returns a [Sequence](sequence.html) that decodes UTF-8
+and iterates over the code points, returning each as a number.
 
-If you want to get at the raw bytes, call `bytes`. This returns a Sequence that ignores any UTF-8 encoding and works directly at the byte level.
+If you want to get at the raw bytes, call `bytes`. This returns a Sequence that
+ignores any UTF-8 encoding and works directly at the byte level.
 
 ## Static Methods
 
@@ -156,7 +159,7 @@ Check if the string is not equal to `other`.
 
 ### **[**index**]** operator
 
-Returns a string containing the code unit starting at byte `index`.
+Returns a string containing the code point starting at byte `index`.
 
 :::dart
 IO.print("ʕ•ᴥ•ʔ"[5]) // "ᴥ".
@@ -165,8 +168,7 @@ Since `ʕ` is two bytes in UTF-8 and `•` is three, the fifth byte points to th
 bear's nose.
 
 If `index` points into the middle of a UTF-8 sequence or at otherwise invalid
-UTF-8, this returns a one-byte string containing the value of the byte at that
-index:
+UTF-8, this returns a one-byte string containing the byte at that index:
 
 :::dart
 IO.print("I ♥ NY"[3]) // One-byte string whose value is 153.
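
The two `[index]` examples in the hunks above can be checked outside Wren. What follows is a rough Python sketch, not Wren code. `char_at_byte` is a hypothetical helper that imitates the documented behavior on a string's UTF-8 bytes, on the assumption that "otherwise invalid UTF-8" covers any byte that does not begin a well-formed sequence.

```python
def char_at_byte(text: str, index: int) -> bytes:
    """Return the UTF-8 bytes of the code point that starts at byte `index`.

    Hypothetical helper (not part of Wren): if `index` does not land on the
    start of a well-formed sequence, return just the single byte found there.
    """
    data = text.encode("utf-8")
    first = data[index]

    # The lead byte tells us how long the encoded sequence should be.
    if first < 0x80:
        length = 1          # ASCII
    elif 0xC0 <= first < 0xE0:
        length = 2
    elif 0xE0 <= first < 0xF0:
        length = 3
    elif 0xF0 <= first < 0xF8:
        length = 4
    else:
        length = 1          # continuation byte or invalid lead byte

    chunk = data[index:index + length]
    try:
        chunk.decode("utf-8")      # validate the candidate sequence
    except UnicodeDecodeError:
        return data[index:index + 1]
    return chunk


bear = "\u0295\u2022\u1d25\u2022\u0294"    # the bear face from the docs
nose = char_at_byte(bear, 5)               # bytes 0-1 and 2-4 come first
middle = char_at_byte("I \u2665 NY", 3)    # byte 3 is inside the heart
print(nose.decode("utf-8"), list(middle))
```

The heart occupies bytes 2 through 4 as `E2 99 A5`, so byte 3 is the continuation byte `0x99`, which is 153 in decimal and matches the value quoted in the diff.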
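
To make the first hunk's distinction between byte offsets and code point offsets concrete, here is a small sketch in Python rather than Wren. It assumes the band name is spelled with precomposed characters. `bytes.find` only stands in for the byte offset that the docs say `indexOf` returns, and none of these Python names belong to Wren's API.

```python
# Precomposed (NFC) spelling of the band name used in the docs example.
metal_band = "F\u00e4c\u00ebh\u00e4mm\u00ebr"
data = metal_band.encode("utf-8")

# Two different ways to say "where is the h?".
code_point_index = metal_band.index("h")   # counts code points
byte_offset = data.find(b"h")              # counts bytes

# The other two views of the string described in the docs:
code_points = [ord(ch) for ch in metal_band]   # sequence of code point numbers
raw_bytes = list(data)                         # flat array of bytes

print(code_point_index, byte_offset)      # 4 6
print(len(code_points), len(raw_bytes))   # 10 14
print(data[byte_offset:byte_offset + 1])  # b'h'
```

Each of the four accented letters takes two bytes, so the two counts diverge after the first `ä`. Indexing the bytes with a byte offset still lands on `h`, which is the round trip the docs rely on.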