diff --git a/doc/site/core/string.markdown b/doc/site/core/string.markdown
index f5f1861d..2c68a7d9 100644
--- a/doc/site/core/string.markdown
+++ b/doc/site/core/string.markdown
@@ -2,46 +2,49 @@
 ^category core
 
 A string is an immutable array of bytes. Strings usually store text, in which
-case it will be the UTF-8 encoding of the text's code points. But you can put
-any kind of byte values in there you want, including null bytes or invalid UTF-8
-sequences.
+case the bytes are the UTF-8 encoding of the text's code points. But you can put
+any kind of byte values in there you want, including null bytes or invalid
+UTF-8.
 
 There are a few ways to think of a string:
 
 * As a searchable chunk of text composed of a sequence of textual code points.
 
-* As an iterable sequence of numeric code points.
+* As an iterable sequence of code point numbers.
 
 * As a flat array of directly indexable bytes.
 
 All of those are useful for some problems, so the string API supports all three.
-The first one is the most common, so that's what most methods directly on the
-string class cater towards.
+The first one is the most common, so that's what methods directly on the string
+class cater to.
 
 In UTF-8, a single Unicode code point—very roughly a single
-"character"— may be encoded as one or more bytes. This means you can't
+"character"—may encode to one or more bytes. This means you can't
 efficiently index by code point. There's no way to jump directly to, say, the
-fifth code unit in a string without walking the string from the beginning and
+fifth code point in a string without walking the string from the beginning and
 counting them as you go.
 
-Because counting code units is relatively slow, the indexes passed to string
+Because counting code points is relatively slow, the indexes passed to string
 methods are *byte* offsets, not *code point* offsets. When you do:
 
     :::dart
     someString[3]
 
-That means "get the code unit starting at *byte* three", not "get the third
-code unit in the string". This sounds scary, but keep in mind that the methods
-on string *return* byte indices too. So, for example, this does what you want:
+That means "get the code point starting at *byte* three", not "get the third
+code point in the string". This sounds scary, but keep in mind that the methods
+on strings *return* byte indexes too. So, for example, this does what you want:
 
     :::dart
     var metalBand = "Fäcëhämmër"
     var hPosition = metalBand.indexOf("h")
     IO.print(metalBand[hPosition]) // "h"
 
-If you want to work with a string as a sequence numeric code points, call the `codePoints` getter. It returns a [Sequence](sequence.html) that will decide UTF-8 and iterate over the code points, returning each as a number.
+If you want to work with a string as a sequence of numeric code points, call
+the `codePoints` getter. It returns a [Sequence](sequence.html) that decodes
+UTF-8 and iterates over the code points, returning each as a number.
 
-If you want to get at the raw bytes, call `bytes`. This returns a Sequence that ignores any UTF-8 encoding and works directly at the byte level.
+If you want to get at the raw bytes, call `bytes`. This returns a Sequence that
+ignores any UTF-8 encoding and works directly at the byte level.
 
 ## Static Methods
 
@@ -156,7 +159,7 @@ Check if the string is not equal to `other`.
 
 ### **[**index**]** operator
 
-Returns a string containing the code unit starting at byte `index`.
+Returns a string containing the code point starting at byte `index`.
 
     :::dart
     IO.print("ʕ•ᴥ•ʔ"[5]) // "ᴥ".
@@ -165,8 +168,7 @@ Since `ʕ` is two bytes in UTF-8 and `•` is three, the fifth byte points to th
 bear's nose.
 
 If `index` points into the middle of a UTF-8 sequence or at otherwise invalid
-UTF-8, this returns a one-byte string containing the value of the byte at that
-index:
+UTF-8, this returns a one-byte string containing the byte at that index:
 
     :::dart
     IO.print("I ♥ NY"[3]) // One-byte string whose value is 153.