Utf8String
implements
Stringable
uses
BackwardCompatibility
A class for manipulating UTF-8 strings.
This class is intended to be called from the string manipulation methods in SMF\Utils. It is generally better (and easier) to use those methods rather than creating instances of this class directly.
Table of Contents
Interfaces
- Stringable
Properties
- $language : string
- $string : string
- $use_intl_normalizer : bool
Methods
- __construct() : mixed
- Constructor.
- __toString() : string
- Returns $this->string.
- compose() : array<string|int, mixed>
- Helper method for normalizeC and normalizeKC.
- convertCase() : object
- Converts the case of this UTF-8 string.
- create() : object
- Static wrapper for constructor.
- decompose() : array<string|int, mixed>
- Helper method for normalizeD and normalizeKD.
- exportStatic() : void
- Provides a way to export a class's public static properties and methods to the global namespace.
- extractWords() : array<string|int, mixed>
- Extracts all the words in this string.
- isNormalized() : bool
- Checks whether a string is already normalized to a given form.
- normalize() : object
- Performs Unicode normalization on this string.
- sanitizeInvisibles() : object
- Helper function for Utils::sanitizeChars() that deals with invisible characters.
- semanticSplit() : array<string|int, mixed>
- Splits the string into parts using the Unicode word break algorithm.
- normalizeC() : object
- Normalizes via Canonical Decomposition then Canonical Composition.
- normalizeD() : object
- Normalizes via Canonical Decomposition.
- normalizeKC() : object
- Normalizes via Compatibility Decomposition then Canonical Composition.
- normalizeKCFold() : object
- Casefolds UTF-8 via Compatibility Composition Casefolding.
- normalizeKD() : object
- Normalizes via Compatibility Decomposition.
- preserveEmoji() : void
- Replaces emoji characters and sequences in $this->string with placeholders in order to preserve them from further processing.
- sanitizeJoinControls() : void
- Replaces allowed join controls inside words in $this->string with placeholders in order to preserve them from further processing.
- sanitizeVariationSelectors() : void
- Replaces sanctioned variation sequences in $this->string with placeholders in order to preserve them from further processing.
Properties
$language
public
string
$language
The two-character locale code for the language of this string. E.g. 'de', 'en', 'fr', 'zh', etc.
$string
public
string
$string
The scalar string.
$use_intl_normalizer
protected
static bool
$use_intl_normalizer
Whether we can use the intl extension's Normalizer class.
Methods
__construct()
Constructor.
public
__construct(string $string[, string $language = null ]) : mixed
Parameters
- $string : string
-
The string.
- $language : string = null
-
Two-character locale code for the language of this string. If null, assumes the language currently in use by SMF (meaning the user's language, falling back to the forum's default).
__toString()
Returns $this->string.
public
__toString() : string
Return values
string —The string.
compose()
Helper method for normalizeC and normalizeKC.
public
static compose(array<string|int, mixed> $chars) : array<string|int, mixed>
Parameters
- $chars : array<string|int, mixed>
-
Array of decomposed Unicode characters.
Return values
array<string|int, mixed> —Array of composed Unicode characters.
convertCase()
Converts the case of this UTF-8 string.
public
convertCase(string $case[, bool $simple = false ]) : object
Updates the value of $this->string. On failure, $this->string will be unset.
The supported cases are as follows:
- upper: Converts all letters to their upper case version.
- lower: Converts all letters to their lower case version.
- fold: Converts all letters to their default case version. For most languages that means lower case, but not all.
- title: Capitalizes the first letter of each word, and converts all other letters to lower case.
- ucwords: Like title case, except that letters that do not start a word are left as they are.
- ucfirst: Like ucwords, except that it acts only on the first word in the string.
Special conditional casing rules are applied for letters in certain languages that need them. These conditional casing rules are defined in the Unicode standard and implemented according to those instructions.
It is also worth noting that for certain letters in some languages, the capitalized form of the letter used for upper case may differ from the capitalized form used for title case. For example, the lower case digraph character 'ǳ' (U+01F3) becomes 'Ǳ' (U+01F1) in upper case, but 'ǲ' (U+01F2) in title case. All such special title case rules are specified in the Unicode data files and are applied automatically.
Parameters
- $case : string
-
One of 'upper', 'lower', 'fold', 'title', 'ucwords', or 'ucfirst'.
- $simple : bool = false
-
If true, use simple maps instead of full maps. Default: false.
Return values
object —A reference to this object for method chaining.
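The six modes listed above can be approximated with PHP's mbstring functions. The following is only a rough sketch, not SMF's implementation: convertCaseSketch() is a hypothetical name, the "start of a word" test is a simple regular expression rather than the Unicode word break algorithm, and no language-specific conditional casing is applied. It assumes PHP 7.4 or later with mbstring enabled.

```php
<?php
// Illustrative only: approximates the six documented modes with mbstring.
// convertCaseSketch() is a hypothetical name, not part of SMF.
function convertCaseSketch(string $string, string $case): string
{
    // A letter that is not preceded by a letter, mark, or digit.
    // This is a crude stand-in for "the first letter of a word".
    $word_start = '/(?<![\p{L}\p{M}\p{N}])\p{L}/u';

    // Title-cases a single matched letter.
    $to_title = fn(array $m): string => mb_convert_case($m[0], MB_CASE_TITLE, 'UTF-8');

    switch ($case) {
        case 'upper':
            return mb_convert_case($string, MB_CASE_UPPER, 'UTF-8');
        case 'lower':
            return mb_convert_case($string, MB_CASE_LOWER, 'UTF-8');
        case 'fold':
            return mb_convert_case($string, MB_CASE_FOLD, 'UTF-8');
        case 'title':
            return mb_convert_case($string, MB_CASE_TITLE, 'UTF-8');
        case 'ucwords':
            // Only word-initial letters change; the rest are left as they are.
            return preg_replace_callback($word_start, $to_title, $string) ?? $string;
        case 'ucfirst':
            // Same as ucwords, but limited to the first match.
            return preg_replace_callback($word_start, $to_title, $string, 1) ?? $string;
        default:
            throw new InvalidArgumentException("Unsupported case: $case");
    }
}
```

With the input 'hello wORLD', this sketch gives 'Hello World' for title, 'Hello WORLD' for ucwords, and 'Hello wORLD' for ucfirst, which is the distinction the list above draws between the three modes.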
create()
Static wrapper for constructor.
public
static create(string $string[, string $language = null ]) : object
This is just syntactic sugar to ease method chaining.
Parameters
- $string : string
-
The string.
- $language : string = null
-
Two-character locale code for the language of this string.
Return values
object —An instance of this class.
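The pattern behind create() is a static factory combined with methods that return the object itself. The class below is a hypothetical stand-in (DemoString and its upper() method are invented for illustration and are not SMF's Utf8String); it only shows how a static wrapper and a returned $this allow a whole operation to be written as one expression. It assumes PHP 8.0 or later for the Stringable interface.

```php
<?php
// DemoString is a hypothetical stand-in, not SMF's Utf8String.
final class DemoString implements Stringable
{
    public string $string;

    public function __construct(string $string)
    {
        $this->string = $string;
    }

    // Static wrapper for the constructor, so a chain can start from the class name.
    public static function create(string $string): self
    {
        return new self($string);
    }

    // Modifies the wrapped string and returns the object for chaining.
    public function upper(): self
    {
        $this->string = mb_strtoupper($this->string, 'UTF-8');

        return $this;
    }

    public function __toString(): string
    {
        return $this->string;
    }
}

// Create, transform, and cast back to a scalar string in one expression.
$shout = (string) DemoString::create('héllo')->upper();
```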
decompose()
Helper method for normalizeD and normalizeKD.
public
static decompose(array<string|int, mixed> $chars[, bool $compatibility = false ]) : array<string|int, mixed>
Parameters
- $chars : array<string|int, mixed>
-
Array of Unicode characters.
- $compatibility : bool = false
-
If true, perform compatibility decomposition. Default: false.
Return values
array<string|int, mixed> —Array of decomposed Unicode characters.
exportStatic()
Provides a way to export a class's public static properties and methods to the global namespace.
public
static exportStatic() : void
To do so:
- Use this trait in the class.
- At the END of the class's file, call its exportStatic() method.
Although it might not seem that way at first glance, this approach conforms to section 2.3 of PSR-1, since executing this method is simply a dynamic means of declaring functions when the file is included; it has no other side effects.
Regarding the $backcompat items:
A class's static properties are not exported to global variables unless explicitly included in $backcompat['prop_names'].
$backcompat['prop_names'] is a simple array where the keys are the names of one or more of a class's static properties, and the values are the names of global variables. In each case, the global variable will be set to a reference to the static property. Static properties that are not named in this array will not be exported.
Adding non-static properties to the $backcompat arrays will produce runtime errors. It is the responsibility of the developer to make sure not to do this.
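The reference binding described for $backcompat['prop_names'] can be illustrated in isolation. This sketch does not use the trait itself: DemoConfig and $demo_settings are hypothetical names, and the binding is written by hand at file scope rather than being generated by exportStatic(). It only demonstrates what "set to a reference to the static property" means in practice.

```php
<?php
// DemoConfig and $demo_settings are hypothetical names used only for illustration.
final class DemoConfig
{
    public static array $settings = ['theme' => 'default'];
}

// Bind a variable to the static property by reference.
$demo_settings = &DemoConfig::$settings;

// A write through either name is visible through the other,
// because both names refer to the same underlying value.
$demo_settings['theme'] = 'dark';
DemoConfig::$settings['lang'] = 'en';
```

Because the variable is a reference rather than a copy, legacy code that reads or writes the variable and newer code that uses the static property stay in sync.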
extractWords()
Extracts all the words in this string.
public
extractWords(int $level) : array<string|int, mixed>
Emoji characters count as words. Punctuation and other symbols do not.
Parameters
- $level : int
-
See documentation for Utf8String::sanitizeInvisibles().
Return values
array<string|int, mixed> —The words in this string.
isNormalized()
Checks whether a string is already normalized to a given form.
public
isNormalized(string $form) : bool
Parameters
- $form : string
-
One of 'd', 'c', 'kd', 'kc', or 'kc_casefold'.
Return values
bool —Whether the string is already normalized to the given form.
normalize()
Performs Unicode normalization on this string.
public
normalize([string $form = 'c' ]) : object
On failure, $this->string will be unset.
Parameters
- $form : string = 'c'
-
A Unicode normalization form: 'c', 'd', 'kc', 'kd' or 'kc_casefold'.
Return values
object —A reference to this object for method chaining.
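For readers unfamiliar with the normalization forms, the sketch below maps the documented form names onto the intl extension's Normalizer class. It is an assumption-laden illustration rather than this class's code: normalizeSketch() is a hypothetical helper, it returns null instead of unsetting a property on failure, it omits 'kc_casefold', and it simply gives up when intl is not installed.

```php
<?php
// normalizeSketch() is a hypothetical helper. It relies on PHP's intl extension
// and returns null when Normalizer is unavailable or the form is not recognized.
function normalizeSketch(string $string, string $form = 'c'): ?string
{
    if (!class_exists('Normalizer')) {
        return null;
    }

    // Map the documented form names to intl's constants ('kc_casefold' is omitted here).
    $forms = [
        'c' => Normalizer::FORM_C,
        'd' => Normalizer::FORM_D,
        'kc' => Normalizer::FORM_KC,
        'kd' => Normalizer::FORM_KD,
    ];

    if (!isset($forms[$form])) {
        return null;
    }

    // Skip the work when the string is already in the requested form.
    if (Normalizer::isNormalized($string, $forms[$form])) {
        return $string;
    }

    $result = Normalizer::normalize($string, $forms[$form]);

    return $result === false ? null : $result;
}
```

As an example of the difference between forms, "e" followed by U+0301 (combining acute accent) composes to the single character U+00E9 under form 'c', while the ligature U+FB01 is only expanded to "fi" by the compatibility forms 'kc' and 'kd'.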
sanitizeInvisibles()
Helper function for Utils::sanitizeChars() that deals with invisible characters.
public
sanitizeInvisibles(int $level, string $substitute) : object
This function deals with control characters, private use characters, non-characters, and characters that are invisible by definition in the Unicode standard. It does not deal with characters that are supposed to be visible according to the Unicode standard, and makes no attempt to compensate for possibly incomplete Unicode support in text rendering engines on client devices.
Parameters
- $level : int
-
Controls how invisible formatting characters are handled.
- 0: Allow valid formatting characters. Use for sanitizing text in posts.
- 1: Allow necessary formatting characters. Use for sanitizing usernames.
- 2: Disallow all formatting characters. Use for internal comparisons only, such as in the word censor, search contexts, etc.
- $substitute : string
-
Replacement string for the invalid characters.
Return values
object —A reference to this object for method chaining.
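The general idea of level-dependent sanitizing can be sketched with Unicode general categories in a regular expression. This is a heavily simplified illustration, not the method's actual logic: sanitizeInvisiblesSketch() is a hypothetical helper, keeping tab, line feed, and carriage return is an assumption, non-characters are not handled, and the distinction between levels 0 and 1 (which formatting characters count as "valid" versus "necessary") is not modeled at all.

```php
<?php
// sanitizeInvisiblesSketch() is a hypothetical, heavily simplified helper.
function sanitizeInvisiblesSketch(string $string, int $level, string $substitute = ''): string
{
    // Replaced at every level: control characters other than tab, line feed,
    // and carriage return (an assumption), plus private use characters.
    $patterns = ['/(?![\t\n\r])\p{Cc}/u', '/\p{Co}/u'];

    // Assumption: only the strictest level is modeled, by replacing
    // every format character (general category Cf).
    if ($level >= 2) {
        $patterns[] = '/\p{Cf}/u';
    }

    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace($patterns, $substitute, $string) ?? $string;
}
```

For instance, a zero width space (U+200B, a format character) survives at level 0 in this sketch but is replaced at level 2, which mirrors the documented intent that level 2 is for internal comparisons only.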
semanticSplit()
Splits the string into parts using the Unicode word break algorithm.
public
semanticSplit() : array<string|int, mixed>
Return values
array<string|int, mixed> —The parts of the string.
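To get a feel for what splitting on Unicode word boundaries produces, the intl extension's IntlBreakIterator can be used. This is offered as a comparison point only: semanticSplitSketch() is a hypothetical helper, it returns null when intl is not loaded, and nothing here implies that this class relies on IntlBreakIterator internally.

```php
<?php
// semanticSplitSketch() is a hypothetical helper built on the intl extension's
// IntlBreakIterator. It returns null when that extension is not loaded.
function semanticSplitSketch(string $string, string $locale = 'en'): ?array
{
    if (!class_exists('IntlBreakIterator')) {
        return null;
    }

    $iterator = IntlBreakIterator::createWordInstance($locale);
    $iterator->setText($string);

    // getPartsIterator() yields the text between consecutive boundaries,
    // so words, spaces, and punctuation each come back as separate parts.
    return iterator_to_array($iterator->getPartsIterator(), false);
}
```

Joining the returned parts with an empty separator reproduces the input, since a word break iterator only decides where to cut and never discards text.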
normalizeC()
Normalizes via Canonical Decomposition then Canonical Composition.
protected
normalizeC() : object
On failure, $this->string will be unset.
Return values
object —A reference to this object for method chaining.
normalizeD()
Normalizes via Canonical Decomposition.
protected
normalizeD() : object
On failure, $this->string will be unset.
Return values
object —A reference to this object for method chaining.
normalizeKC()
Normalizes via Compatibility Decomposition then Canonical Composition.
protected
normalizeKC() : object
On failure, $this->string will be unset.
Return values
object —A reference to this object for method chaining.
normalizeKCFold()
Casefolds UTF-8 via Compatibility Composition Casefolding.
protected
normalizeKCFold() : object
Used by idn_to_ascii polyfill in Subs-Compat.php.
On failure, $this->string will be unset.
Return values
object —A reference to this object for method chaining.
normalizeKD()
Normalizes via Compatibility Decomposition.
protected
normalizeKD() : object
On failure, $this->string will be unset.
Return values
object —A reference to this object for method chaining.
preserveEmoji()
Replaces emoji characters and sequences in $this->string with placeholders in order to preserve them from further processing.
protected
preserveEmoji(array<string|int, mixed> &$placeholders) : void
The placeholders are added to $placeholders.
Parameters
- $placeholders : array<string|int, mixed>
-
Array of placeholders that can be used to restore the original characters.
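The placeholder technique used by this and the following two methods can be sketched as a protect, process, restore cycle. Everything in this example is an assumption made for illustration: preserveMatchesSketch() is a hypothetical helper, the placeholder format (a counter wrapped in the control characters 0x02 and 0x03) is invented, and the emoji pattern is a rough code point range rather than the Unicode emoji data.

```php
<?php
// Hypothetical helper: swaps matches of $pattern for unique placeholders and
// records them in $placeholders (placeholder => original) for later restoration.
function preserveMatchesSketch(string $string, string $pattern, array &$placeholders): string
{
    return preg_replace_callback(
        $pattern,
        function (array $m) use (&$placeholders): string {
            // Reuse the existing placeholder if this sequence was seen before.
            $key = array_search($m[0], $placeholders, true);

            if ($key === false) {
                // Invented format: a counter wrapped in two control characters.
                $key = "\x02" . count($placeholders) . "\x03";
                $placeholders[$key] = $m[0];
            }

            return $key;
        },
        $string
    ) ?? $string;
}

$placeholders = [];
$text = "smile \u{1F600} \u{A9}";

// 1. Protect: a rough emoji range, used here only as an assumption.
$protected = preserveMatchesSketch($text, '/[\x{1F300}-\x{1FAFF}]/u', $placeholders);

// 2. Process: a stand-in "further processing" step that strips all other symbols.
$cleaned = preg_replace('/\p{So}/u', '', $protected);

// 3. Restore: swap the placeholders back for the original characters.
$restored = strtr($cleaned, $placeholders);
```

In this run the copyright sign is stripped by the processing step, while the emoji, which belongs to the same general category, survives because it was hidden behind a placeholder.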
sanitizeJoinControls()
Replaces allowed join controls inside words in $this->string with placeholders in order to preserve them from further processing.
protected
sanitizeJoinControls(array<string|int, mixed> &$placeholders, int $level, string $substitute) : void
Join controls are only allowed inside words in special circumstances. See https://unicode.org/reports/tr31/#Layout_and_Format_Control_Characters
The placeholders are added to $placeholders.
Parameters
- $placeholders : array<string|int, mixed>
-
Array of placeholders that can be used to restore the original characters.
- $level : int
- $substitute : string
-
Replacement string for the invalid characters.
sanitizeVariationSelectors()
Replaces sanctioned variation sequences in $this->string with placeholders in order to preserve them from further processing.
protected
sanitizeVariationSelectors(array<string|int, mixed> &$placeholders, string $substitute) : void
Unicode gives pre-defined lists of sanctioned variation sequences and says any use of variation selectors outside those sequences is unsanctioned.
The placeholders are added to $placeholders.
Parameters
- $placeholders : array<string|int, mixed>
-
Array of placeholders that can be used to restore the original characters.
- $substitute : string
-
Replacement string for the invalid characters.