Copy edit the string docs.

Bob Nystrom
2015-09-12 09:42:31 -07:00
parent fe143644b3
commit 92c2b2d5e0


@ -2,46 +2,49 @@
^category core
A string is an immutable array of bytes. Strings usually store text, in which
case it will be the UTF-8 encoding of the text's code points. But you can put
any kind of byte values in there you want, including null bytes or invalid UTF-8
sequences.
case the bytes are the UTF-8 encoding of the text's code points. But you can put
any kind of byte values in there you want, including null bytes or invalid
UTF-8.
There are a few ways to think of a string:
* As a searchable chunk of text composed of a sequence of textual code points.
* As an iterable sequence of numeric code points.
* As an iterable sequence of code point numbers.
* As a flat array of directly indexable bytes.
All of those are useful for some problems, so the string API supports all three.
The first one is the most common, so that's what most methods directly on the
string class cater towards.
The first one is the most common, so that's what methods directly on the string
class cater to.
In UTF-8, a single Unicode code point—very roughly a single
"character"— may be encoded as one or more bytes. This means you can't
"character"—may encode to one or more bytes. This means you can't
efficiently index by code point. There's no way to jump directly to, say, the
fifth code unit in a string without walking the string from the beginning and
fifth code point in a string without walking the string from the beginning and
counting them as you go.
Because counting code units is relatively slow, the indexes passed to string
Because counting code points is relatively slow, the indexes passed to string
methods are *byte* offsets, not *code point* offsets. When you do:
:::dart
someString[3]
That means "get the code unit starting at *byte* three", not "get the third
code unit in the string". This sounds scary, but keep in mind that the methods
on string *return* byte indices too. So, for example, this does what you want:
That means "get the code point starting at *byte* three", not "get the third
code point in the string". This sounds scary, but keep in mind that the methods
on strings *return* byte indexes too. So, for example, this does what you want:
:::dart
var metalBand = "Fäcëhämmër"
var hPosition = metalBand.indexOf("h")
IO.print(metalBand[hPosition]) // "h"
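To see why byte offsets line up, here is a rough Python sketch of the same lookup. It is only a model of the behavior described above, not how Wren is implemented: the string is treated as its UTF-8 bytes, and the variable names are made up for illustration.

```python
# Model the string as its UTF-8 bytes. Escapes are used so the accented
# letters are unambiguously single code points: "Fäcëhämmër".
metal_band = "F\u00e4c\u00ebh\u00e4mm\u00ebr".encode("utf-8")

# Searching the bytes yields a byte offset, not a code point count.
h_position = metal_band.find("h".encode("utf-8"))

# Each accented letter takes two bytes in UTF-8, so "h" starts at byte 6
# even though only four code points come before it.
code_points_before = len(metal_band[:h_position].decode("utf-8"))

# Indexing with the byte offset that the search returned gives back "h".
found = metal_band[h_position:h_position + 1].decode("utf-8")
print(h_position, code_points_before, found)
```

Because the search and the subscript agree on byte offsets, the result of one can be fed straight into the other without counting code points.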
If you want to work with a string as a sequence numeric code points, call the `codePoints` getter. It returns a [Sequence](sequence.html) that will decide UTF-8 and iterate over the code points, returning each as a number.
If you want to work with a string as a sequence of numeric code points, call the
`codePoints` getter. It returns a [Sequence](sequence.html) that decodes UTF-8
and iterates over the code points, returning each as a number.
If you want to get at the raw bytes, call `bytes`. This returns a Sequence that ignores any UTF-8 encoding and works directly at the byte level.
If you want to get at the raw bytes, call `bytes`. This returns a Sequence that
ignores any UTF-8 encoding and works directly at the byte level.
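As a loose illustration of the difference between those two views, this Python sketch produces both for one short string. The names `code_points` and `raw_bytes` are placeholders chosen here. They stand in for what iterating `codePoints` and `bytes` would yield, assuming the string holds valid UTF-8.

```python
text = "h\u00e9llo"  # "héllo": the second letter is U+00E9

# Code point view: one number per code point, after decoding UTF-8.
code_points = [ord(ch) for ch in text]

# Byte view: one number per byte, with no decoding at all.
raw_bytes = list(text.encode("utf-8"))

print(code_points)  # five numbers
print(raw_bytes)    # six numbers, since U+00E9 encodes to two bytes
```

The two lists differ only where a code point needs more than one byte, which is exactly where byte offsets and code point offsets drift apart.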
## Static Methods
@ -156,7 +159,7 @@ Check if the string is not equal to `other`.
### **[**index**]** operator
Returns a string containing the code unit starting at byte `index`.
Returns a string containing the code point starting at byte `index`.
:::dart
IO.print("ʕ•ᴥ•ʔ"[5]) // "ᴥ".
@ -165,8 +168,7 @@ Since `ʕ` is two bytes in UTF-8 and `•` is three, the fifth byte points to th
bear's nose.
If `index` points into the middle of a UTF-8 sequence or at otherwise invalid
UTF-8, this returns a one-byte string containing the value of the byte at that
index:
UTF-8, this returns a one-byte string containing the byte at that index:
:::dart
IO.print("I ♥ NY"[3]) // One-byte string whose value is 153.
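The two cases above can be modeled in a few lines of Python. This is a sketch of the documented behavior only, not Wren's actual implementation, and `code_point_at` is a hypothetical helper invented for this example.

```python
def code_point_at(data: bytes, index: int) -> bytes:
    """Return the bytes of the code point that starts at byte `index`.

    Falls back to the single byte at `index` when no valid UTF-8
    sequence starts there.
    """
    first = data[index]
    # The lead byte tells us how long the sequence should be.
    if first < 0x80:
        length = 1
    elif 0xC2 <= first <= 0xDF:
        length = 2
    elif 0xE0 <= first <= 0xEF:
        length = 3
    elif 0xF0 <= first <= 0xF4:
        length = 4
    else:
        # Continuation byte or invalid lead byte.
        return data[index:index + 1]

    chunk = data[index:index + length]
    try:
        # A strict decode rejects truncated or otherwise invalid sequences.
        chunk.decode("utf-8")
    except UnicodeDecodeError:
        return data[index:index + 1]
    return chunk


bear = "\u0295\u2022\u1d25\u2022\u0294".encode("utf-8")  # the bear face
heart = "I \u2665 NY".encode("utf-8")                     # "I ♥ NY"

nose = code_point_at(bear, 5)         # the three bytes of the nose
middle = code_point_at(heart, 3)      # second byte of the heart
print(nose.decode("utf-8"), list(middle))
```

Byte 3 of the second string is the middle of the three-byte heart, so the helper returns that one byte, whose value is 153.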