Documentation

Utf8String
in package SMF

implements Stringable uses BackwardCompatibility

A class for manipulating UTF-8 strings.

This class is intended to be called from the string manipulation methods in SMF\Utils. It is generally better (and easier) to use those methods rather than creating instances of this class directly.

Interfaces

Stringable

Properties

$language : string
$string : string
$use_intl_normalizer : bool

Methods

__construct() : mixed: Constructor.
__toString() : string: Returns $this->string.
compose() : array<string|int, mixed>: Helper method for normalizeC and normalizeKC.
convertCase() : object: Converts the case of this UTF-8 string.
create() : object: Static wrapper for constructor.
decompose() : array<string|int, mixed>: Helper method for normalizeD and normalizeKD.
exportStatic() : void: Provides a way to export a class's public static properties and methods to the global namespace.
extractWords() : array<string|int, mixed>: Extracts all the words in this string.
isNormalized() : bool: Checks whether a string is already normalized to a given form.
normalize() : object: Performs Unicode normalization on this string.
sanitizeInvisibles() : object: Helper function for Utils::sanitizeChars() that deals with invisible characters.
semanticSplit() : array<string|int, mixed>: Splits the string into parts using the Unicode word break algorithm.
normalizeC() : object: Normalizes via Canonical Decomposition then Canonical Composition.
normalizeD() : object: Normalizes via Canonical Decomposition.
normalizeKC() : object: Normalizes via Compatibility Decomposition then Canonical Composition.
normalizeKCFold() : object: Casefolds UTF-8 via Compatibility Composition Casefolding.
normalizeKD() : object: Normalizes via Compatibility Decomposition.
preserveEmoji() : void: Replaces emoji characters and sequences in $this->string with placeholders in order to preserve them from further processing.
sanitizeJoinControls() : void: Replaces allowed join controls inside words in $this->string with placeholders in order to preserve them from further processing.
sanitizeVariationSelectors() : void: Replaces sanctioned variation sequences in $this->string with placeholders in order to preserve them from further processing.

$language


    public string $language

The two-character locale code for the language of this string. E.g. 'de', 'en', 'fr', 'zh', etc.

$string


    public string $string

The scalar string.

$use_intl_normalizer


    protected static bool $use_intl_normalizer

Whether we can use the intl extension's Normalizer class.

__construct()

Constructor.


    public __construct(string $string[, string $language = null ]) : mixed

Parameters

$string : string: The string.
$language : string = null: Two-character locale code for the language of this string. If null, assumes the language currently in use by SMF (meaning the user's language, falling back to the forum's default).

__toString()

Returns $this->string.


    public __toString() : string

Return values

string —

The string.

compose()

Helper method for normalizeC and normalizeKC.


    public static compose(array<string|int, mixed> $chars) : array<string|int, mixed>

Parameters

$chars : array<string|int, mixed>: Array of decomposed Unicode characters

Return values

array<string|int, mixed> —

Array of composed Unicode characters.

convertCase()

Converts the case of this UTF-8 string.


    public convertCase(string $case[, bool $simple = false ]) : object

Updates the value of $this->string. On failure, $this->string will be unset.

The supported cases are as follows:

upper: Converts all letters to their upper case version.
lower: Converts all letters to their lower case version.
fold: Converts all letters to their default case version. For most languages that means lower case, but not for all.
title: Capitalizes the first letter of each word, and converts all other letters to lower case.
ucwords: Like title case, except that letters that do not start a word are left as they are.
ucfirst: Like ucwords, except that it acts only on the first word in the string.

Special conditional casing rules are applied for letters in certain languages that need them. These conditional casing rules are defined in the Unicode standard and implemented according to those instructions.

It is also worth noting that for certain letters in some languages, the capitalized form of the letter used for upper case may differ from the capitalized form used for title case. For example, the lower case character 'ǳ' becomes 'Ǳ' in upper case, but 'ǲ' in title case. All such special title case rules are specified in the Unicode data files and are applied automatically.

Parameters

$case : string: One of 'upper', 'lower', 'fold', 'title', 'ucwords', or 'ucfirst'.
$simple : bool = false: If true, use simple maps instead of full maps. Default: false.

Return values

object —

A reference to this object for method chaining.
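The difference between full and simple case maps can be illustrated with PHP's own mbstring functions. This is only a sketch built on mb_convert_case() (PHP 7.3 or later), not the implementation used by this class, and the helper name upper_both_ways() is hypothetical.

```php
<?php
/**
 * Hypothetical helper: upper-cases a string with full and simple maps.
 * Returns null when the mbstring extension is not loaded.
 */
function upper_both_ways(string $text): ?array
{
    if (!function_exists('mb_convert_case')) {
        return null;
    }

    return [
        // Full maps may change the string length ('ß' becomes 'SS').
        'full' => mb_convert_case($text, MB_CASE_UPPER, 'UTF-8'),
        // Simple maps always map one character to exactly one character.
        'simple' => mb_convert_case($text, MB_CASE_UPPER_SIMPLE, 'UTF-8'),
    ];
}

$result = upper_both_ways('straße');
```

With mbstring available, the 'full' entry is expected to be 'STRASSE' while the 'simple' entry keeps the 'ß', because that letter has no single-character upper case mapping.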

create()

Static wrapper for constructor.


    public static create(string $string[, string $language = null ]) : object

This is just syntactical sugar to ease method chaining.

Parameters

$string : string: The string.
$language : string = null: Two-character locale code for the language of this string.

Return values

object —

An instance of this class.
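The chaining pattern that create() supports can be sketched with a small stand-in class. ChainableText and its trim() and upper() methods are hypothetical and are not part of SMF; they only show how a static wrapper plus methods that return the object itself allow a chain to be written in one expression (PHP 8.0 or later is assumed).

```php
<?php
// Hypothetical class, not part of SMF: shows the static-factory pattern.
final class ChainableText implements Stringable
{
    public function __construct(public string $string)
    {
    }

    // Static wrapper so a chain can start without wrapping `new` in parentheses.
    public static function create(string $string): self
    {
        return new self($string);
    }

    public function trim(): self
    {
        $this->string = trim($this->string);

        return $this; // Returning $this is what makes chaining possible.
    }

    public function upper(): self
    {
        // strtoupper() is sufficient here because the sample text is ASCII.
        $this->string = strtoupper($this->string);

        return $this;
    }

    public function __toString(): string
    {
        return $this->string;
    }
}

// The cast invokes __toString() at the end of the chain.
$text = (string) ChainableText::create('  hello ')->trim()->upper();
```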

decompose()

Helper method for normalizeD and normalizeKD.


    public static decompose(array<string|int, mixed> $chars[, bool $compatibility = false ]) : array<string|int, mixed>

Parameters

$chars : array<string|int, mixed>: Array of Unicode characters
$compatibility : bool = false: If true, perform compatibility decomposition. Default: false.

Return values

array<string|int, mixed> —

Array of decomposed Unicode characters.

exportStatic()

Provides a way to export a class's public static properties and methods to the global namespace.


    public static exportStatic() : void

To do so:

Use this trait in the class.
At the END of the class's file, call its exportStatic() method.

Although it might not seem that way at first glance, this approach conforms to section 2.3 of PSR-1, since executing this method is simply a dynamic means of declaring functions when the file is included; it has no other side effects.

Regarding the $backcompat items:

A class's static properties are not exported to global variables unless explicitly included in $backcompat['prop_names'].

$backcompat['prop_names'] is a simple array where the keys are the names of one or more of a class's static properties, and the values are the names of global variables. In each case, the global variable will be set to a reference to the static property. Static properties that are not named in this array will not be exported.

Adding non-static properties to the $backcompat arrays will produce runtime errors. It is the responsibility of the developer to make sure not to do this.
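The reference binding described for $backcompat['prop_names'] can be sketched as follows. The class ExampleSettings, its $prop_names map, and the exportProps() method are hypothetical stand-ins; this is an assumption about the general technique, not the trait's actual code.

```php
<?php
// Hypothetical class; the names are for illustration only.
final class ExampleSettings
{
    public static int $counter = 0;

    // Keys are static property names, values are global variable names.
    private static array $prop_names = ['counter' => 'example_counter'];

    public static function exportProps(): void
    {
        foreach (self::$prop_names as $static => $global) {
            // Bind the global variable to the static property by reference,
            // so that a write to either one is visible through the other.
            $GLOBALS[$global] = &self::${$static};
        }
    }
}

ExampleSettings::exportProps();

// Writing through the global variable updates the static property.
$GLOBALS['example_counter'] = 5;
```

Because the global is a reference rather than a copy, legacy code that reads or writes the global variable keeps working against the same value as code that uses the static property.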

extractWords()

Extracts all the words in this string.


    public extractWords(int $level) : array<string|int, mixed>

Emoji characters count as words. Punctuation and other symbols do not.

Parameters

$level : int: See documentation for Utf8String::sanitizeInvisibles().

Return values

array<string|int, mixed> —

The words in this string.

isNormalized()

Checks whether a string is already normalized to a given form.


    public isNormalized(string $form) : bool

Parameters

$form : string: One of 'd', 'c', 'kd', 'kc', or 'kc_casefold'

Return values

bool —

Whether the string is already normalized to the given form.

normalize()

Performs Unicode normalization on this string.


    public normalize([string $form = 'c' ]) : object

On failure, $this->string will be unset.

Parameters

$form : string = 'c': A Unicode normalization form: 'c', 'd', 'kc', 'kd' or 'kc_casefold'.

Return values

object —

A reference to this object for method chaining.
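What normalization to form C does to a string can be shown with the intl extension's Normalizer class, which $use_intl_normalizer refers to. This is a minimal sketch that assumes intl may or may not be loaded; the helper name nfc_with_intl() is hypothetical, and the sketch says nothing about how this class behaves when intl is unavailable.

```php
<?php
/**
 * Hypothetical helper: normalizes to form C with intl's Normalizer.
 * Returns null when intl is not loaded or normalization fails.
 */
function nfc_with_intl(string $text): ?string
{
    if (!class_exists('Normalizer')) {
        return null;
    }

    // Skip the work when the string is already in form C.
    if (Normalizer::isNormalized($text, Normalizer::FORM_C)) {
        return $text;
    }

    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);

    return $normalized === false ? null : $normalized;
}

// 'e' followed by U+0301 COMBINING ACUTE ACCENT (two code points).
$decomposed = "e\u{0301}";
$composed = nfc_with_intl($decomposed);
```

When intl is available, the two code points are composed into the single code point U+00E9.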

sanitizeInvisibles()

Helper function for Utils::sanitizeChars() that deals with invisible characters.


    public sanitizeInvisibles(int $level, string $substitute) : object

This function deals with control characters, private use characters, non-characters, and characters that are invisible by definition in the Unicode standard. It does not deal with characters that are supposed to be visible according to the Unicode standard, and makes no attempt to compensate for possibly incomplete Unicode support in text rendering engines on client devices.

Parameters

$level : int: Controls how invisible formatting characters are handled.
0: Allow valid formatting characters. Use for sanitizing text in posts.
1: Allow necessary formatting characters. Use for sanitizing usernames.
2: Disallow all formatting characters. Use for internal comparisons only, such as in the word censor, search contexts, etc.
$substitute : string: Replacement string for the invalid characters.

Return values

object —

A reference to this object for method chaining.
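As a rough illustration of the strictest kind of handling described above, the following sketch replaces every format (Cf) and private use (Co) character using a Unicode-aware regular expression. The helper replace_format_chars() is hypothetical and far cruder than this method: it makes no attempt to preserve join controls, variation sequences, or emoji sequences, and it does not implement the $level distinctions.

```php
<?php
/**
 * Hypothetical helper: replaces every format (Cf) and private use (Co)
 * character with $substitute.
 */
function replace_format_chars(string $text, string $substitute = ''): string
{
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    $result = preg_replace('/[\p{Cf}\p{Co}]/u', $substitute, $text);

    // preg_replace() returns null on failure, e.g. for invalid UTF-8 input.
    return $result ?? '';
}

// U+200B ZERO WIDTH SPACE hidden inside a word.
$clean = replace_format_chars("ad\u{200B}min");
```

Stripping such characters before comparing strings is the reason the text above recommends the strictest level for the word censor and search contexts.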

semanticSplit()

Splits the string into parts using the Unicode word break algorithm.


    public semanticSplit() : array<string|int, mixed>

Return values

array<string|int, mixed> —

The parts of the string.

normalizeC()

Normalizes via Canonical Decomposition then Canonical Composition.


    protected normalizeC() : object

On failure, $this->string will be unset.

Return values

object —

A reference to this object for method chaining.

normalizeD()

Normalizes via Canonical Decomposition.


    protected normalizeD() : object

On failure, $this->string will be unset.

Return values

object —

A reference to this object for method chaining.

normalizeKC()

Normalizes via Compatibility Decomposition then Canonical Composition.


    protected normalizeKC() : object

On failure, $this->string will be unset.

Return values

object —

A reference to this object for method chaining.

normalizeKCFold()

Casefolds UTF-8 via Compatibility Composition Casefolding.


    protected normalizeKCFold() : object

Used by the idn_to_ascii polyfill in Subs-Compat.php.

On failure, $this->string will be unset.

Return values

object —

A reference to this object for method chaining.

normalizeKD()

Normalizes via Compatibility Decomposition.


    protected normalizeKD() : object

On failure, $this->string will be unset.

Return values

object —

A reference to this object for method chaining.

preserveEmoji()

Replaces emoji characters and sequences in $this->string with placeholders in order to preserve them from further processing.


    protected preserveEmoji(array<string|int, mixed> &$placeholders) : void

The placeholders are added to $placeholders.

Parameters

$placeholders : array<string|int, mixed>: Array of placeholders that can be used to restore the original characters.

sanitizeJoinControls()

Replaces allowed join controls inside words in $this->string with placeholders in order to preserve them from further processing.


    protected sanitizeJoinControls(array<string|int, mixed> &$placeholders, int $level, string $substitute) : void

Join controls are only allowed inside words in special circumstances. See https://unicode.org/reports/tr31/#Layout_and_Format_Control_Characters

The placeholders are added to $placeholders.

Parameters

$placeholders : array<string|int, mixed>: Array of placeholders that can be used to restore the original characters.
$level : int: See documentation for Utf8String::sanitizeInvisibles().
$substitute : string: Replacement string for the invalid characters.

sanitizeVariationSelectors()

Replaces sanctioned variation sequences in $this->string with placeholders in order to preserve them from further processing.


    protected sanitizeVariationSelectors(array<string|int, mixed> &$placeholders, string $substitute) : void

Unicode gives pre-defined lists of sanctioned variation sequences and says any use of variation selectors outside those sequences is unsanctioned.

The placeholders are added to $placeholders.

Parameters

$placeholders : array<string|int, mixed>: Array of placeholders that can be used to restore the original characters.
$substitute : string: Replacement string for the invalid characters.
