Utf8String
in package
implements Stringable, uses BackwardCompatibility

A class for manipulating UTF-8 strings.

This class is intended to be called from the string manipulation methods in SMF\Utils. It is generally better (and easier) to use those methods rather than creating instances of this class directly.

Table of Contents

Interfaces

Stringable

Properties

$language  : string
$string  : string
$use_intl_normalizer  : bool

Methods

__construct()  : mixed
Constructor.
__toString()  : string
Returns $this->string.
compose()  : array<string|int, mixed>
Helper method for normalizeC and normalizeKC.
convertCase()  : object
Converts the case of this UTF-8 string.
create()  : object
Static wrapper for constructor.
decompose()  : array<string|int, mixed>
Helper method for normalizeD and normalizeKD.
exportStatic()  : void
Provides a way to export a class's public static properties and methods to global namespace.
extractWords()  : array<string|int, mixed>
Extracts all the words in this string.
isNormalized()  : bool
Checks whether a string is already normalized to a given form.
normalize()  : object
Performs Unicode normalization on this string.
sanitizeInvisibles()  : object
Helper function for Utils::sanitizeChars() that deals with invisible characters.
semanticSplit()  : array<string|int, mixed>
Splits the string into parts using the Unicode word break algorithm.
normalizeC()  : object
Normalizes via Canonical Decomposition then Canonical Composition.
normalizeD()  : object
Normalizes via Canonical Decomposition.
normalizeKC()  : object
Normalizes via Compatibility Decomposition then Canonical Composition.
normalizeKCFold()  : object
Casefolds UTF-8 via Compatibility Composition Casefolding.
normalizeKD()  : object
Normalizes via Compatibility Decomposition.
preserveEmoji()  : void
Replaces emoji characters and sequences in $this->string with placeholders in order to preserve them from further processing.
sanitizeJoinControls()  : void
Replaces allowed join controls inside words in $this->string with placeholders in order to preserve them from further processing.
sanitizeVariationSelectors()  : void
Replaces sanctioned variation sequences in $this->string with placeholders in order to preserve them from further processing.

Properties

$language

public string $language

The two-character locale code for the language of this string. E.g. 'de', 'en', 'fr', 'zh', etc.

$string

public string $string

The scalar string.

$use_intl_normalizer

protected static bool $use_intl_normalizer

Whether we can use the intl extension's Normalizer class.

Methods

__construct()

Constructor.

public __construct(string $string[, string $language = null ]) : mixed
Parameters
$string : string

The string.

$language : string = null

Two-character locale code for the language of this string. If null, assumes the language currently in use by SMF (meaning the user's language, falling back to the forum's default).

__toString()

Returns $this->string.

public __toString() : string
Return values
string

The string.

compose()

Helper method for normalizeC and normalizeKC.

public static compose(array<string|int, mixed> $chars) : array<string|int, mixed>
Parameters
$chars : array<string|int, mixed>

Array of decomposed Unicode characters

Return values
array<string|int, mixed>

Array of composed Unicode characters.

convertCase()

Converts the case of this UTF-8 string.

public convertCase(string $case[, bool $simple = false ]) : object

Updates the value of $this->string. On failure, $this->string will be unset.

The supported cases are as follows:

  • upper: Converts all letters to their upper case version.
  • lower: Converts all letters to their lower case version.
  • fold: Converts all letters to their default case version. For most languages, but not all, that means lower case.
  • title: Capitalizes the first letter of each word, and converts all other letters to lower case.
  • ucwords: Like title case, except that letters that do not start a word are left as they are.
  • ucfirst: Like ucwords, except that it acts only on the first word in the string.

Special conditional casing rules are applied for letters in certain languages that need them. These rules are defined in the Unicode standard and are implemented as that standard specifies.

It is also worth noting that for certain letters in some languages, the capitalized form used for upper case may differ from the capitalized form used for title case. For example, the lower case character 'ǆ' becomes 'Ǆ' in upper case, but 'ǅ' in title case. All such special title case rules are specified in the Unicode data files and are applied automatically.

Parameters
$case : string

One of 'upper', 'lower', 'fold', 'title', 'ucwords', or 'ucfirst'.

$simple : bool = false

If true, use simple maps instead of full maps. Default: false.

Return values
object

A reference to this object for method chaining.
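
The case names above can be approximated with PHP's own mbstring functions. The following is only a sketch under that assumption: convertCaseSketch() is a hypothetical helper, not SMF's implementation. It requires PHP 7.3 or later with mbstring, skips the language-specific conditional casing rules, and omits 'ucwords'. Its 'ucfirst' branch simply title-cases the first character.

```php
<?php
// Sketch only: approximates the documented case names with mbstring.
// convertCaseSketch() is a hypothetical name, not part of SMF.
function convertCaseSketch(string $string, string $case, bool $simple = false): string
{
    // Each entry holds the full-mapping mode first, then the simple-mapping mode.
    $modes = [
        'upper' => [MB_CASE_UPPER, MB_CASE_UPPER_SIMPLE],
        'lower' => [MB_CASE_LOWER, MB_CASE_LOWER_SIMPLE],
        'fold'  => [MB_CASE_FOLD, MB_CASE_FOLD_SIMPLE],
        'title' => [MB_CASE_TITLE, MB_CASE_TITLE_SIMPLE],
    ];

    if (isset($modes[$case])) {
        return mb_convert_case($string, $modes[$case][(int) $simple], 'UTF-8');
    }

    if ($case === 'ucfirst') {
        // Title-case only the first character and leave the rest untouched.
        $first = mb_substr($string, 0, 1, 'UTF-8');
        $rest = mb_substr($string, 1, null, 'UTF-8');

        return mb_convert_case($first, $modes['title'][(int) $simple], 'UTF-8') . $rest;
    }

    throw new InvalidArgumentException("Unsupported case: $case");
}

// Full maps may change the string length; simple maps never do.
$full = convertCaseSketch('Straße', 'upper');
$simple = convertCaseSketch('Straße', 'upper', true);
```

The difference between $full and $simple illustrates what the $simple parameter controls: full maps can expand one character into several, while simple maps always map one character to one character.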

create()

Static wrapper for constructor.

public static create(string $string[, string $language = null ]) : object

This is just syntactical sugar to ease method chaining.

Parameters
$string : string

The string.

$language : string = null

Two-character locale code for the language of this string.

Return values
object

An instance of this class.
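
To illustrate why a static wrapper eases method chaining, here is a minimal stand-in class. ChainableStringSketch and its upper() method are hypothetical and exist only for this example; the sketch assumes PHP 8.0 or later with mbstring and does not reproduce Utf8String itself.

```php
<?php
// Sketch only: a hypothetical stand-in class, not Utf8String.
final class ChainableStringSketch implements Stringable
{
    public function __construct(public string $string)
    {
    }

    // Static wrapper for the constructor, so a chain can start in one expression.
    public static function create(string $string): static
    {
        return new static($string);
    }

    // Returns $this so that further calls can be chained.
    public function upper(): static
    {
        $this->string = mb_strtoupper($this->string, 'UTF-8');

        return $this;
    }

    public function __toString(): string
    {
        return $this->string;
    }
}

// Create, transform, and cast to string without a temporary variable.
$result = (string) ChainableStringSketch::create('héllo')->upper();
```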

decompose()

Helper method for normalizeD and normalizeKD.

public static decompose(array<string|int, mixed> $chars[, bool $compatibility = false ]) : array<string|int, mixed>
Parameters
$chars : array<string|int, mixed>

Array of Unicode characters

$compatibility : bool = false

If true, perform compatibility decomposition. Default: false.

Return values
array<string|int, mixed>

Array of decomposed Unicode characters.
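
The helper methods compose() and decompose() work on arrays of Unicode characters rather than on strings. Assuming each array element holds exactly one UTF-8 encoded code point (an assumption, since the exact array layout is not documented here), a string can be split into such an array and joined again as sketched below. toCharArraySketch() is a hypothetical name.

```php
<?php
// Sketch only: splits a UTF-8 string into one array element per code point.
// toCharArraySketch() is a hypothetical helper, not part of SMF.
function toCharArraySketch(string $string): array
{
    // The empty pattern with the /u flag splits between code points.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false if the input is not valid UTF-8.
    return $chars === false ? [] : $chars;
}

// 'e', then U+0301 COMBINING ACUTE ACCENT, then 'x'.
$chars = toCharArraySketch("e\u{0301}x");

// Joining the elements restores the original string.
$rejoined = implode('', $chars);
```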

exportStatic()

Provides a way to export a class's public static properties and methods to global namespace.

public static exportStatic() : void

To do so:

  1. Use this trait in the class.
  2. At the END of the class's file, call its exportStatic() method.

Although it might not seem that way at first glance, this approach conforms to section 2.3 of PSR-1, since executing this method is simply a dynamic means of declaring functions when the file is included; it has no other side effects.

Regarding the $backcompat items:

A class's static properties are not exported to global variables unless explicitly included in $backcompat['prop_names'].

$backcompat['prop_names'] is a simple array where the keys are the names of one or more of a class's static properties, and the values are the names of global variables. In each case, the global variable will be set to a reference to the static property. Static properties that are not named in this array will not be exported.

Adding non-static properties to the $backcompat arrays will produce runtime errors. It is the responsibility of the developer to make sure not to do this.
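
The by-reference export of static properties described above can be sketched as follows. This is a simplified illustration, not the BackwardCompatibility trait itself: ExportStaticSketch, ExampleSettings, and the property and global names are all hypothetical, and only the 'prop_names' part is shown.

```php
<?php
// Sketch only: a simplified, hypothetical trait. It is not SMF's
// BackwardCompatibility trait and handles only 'prop_names'.
trait ExportStaticSketch
{
    public static function exportStatic(): void
    {
        foreach (self::$backcompat['prop_names'] ?? [] as $static => $global) {
            // The global variable becomes a reference to the static property.
            $GLOBALS[$global] = &self::${$static};
        }
    }
}

class ExampleSettings
{
    use ExportStaticSketch;

    public static int $counter = 1;

    // Not listed in $backcompat, so it is never exported.
    public static int $hidden = 5;

    // Keys are static property names; values are global variable names.
    private static array $backcompat = [
        'prop_names' => ['counter' => 'example_counter'],
    ];
}

// In a real class file this call would sit at the END of the file.
ExampleSettings::exportStatic();

// Changing the global also changes the static property, and vice versa.
$GLOBALS['example_counter']++;
```

Because the global is bound by reference rather than copied, legacy code that writes to the global variable and new code that reads the static property stay in sync.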

extractWords()

Extracts all the words in this string.

public extractWords(int $level) : array<string|int, mixed>

Emoji characters count as words. Punctuation and other symbols do not.

Parameters
$level : int

See documentation for Utf8String::sanitizeInvisibles().

Return values
array<string|int, mixed>

The words in this string.

isNormalized()

Checks whether a string is already normalized to a given form.

public isNormalized(string $form) : bool
Parameters
$form : string

One of 'd', 'c', 'kd', 'kc', or 'kc_casefold'

Return values
bool

Whether the string is already normalized to the given form.

normalize()

Performs Unicode normalization on this string.

public normalize([string $form = 'c' ]) : object

On failure, $this->string will be unset.

Parameters
$form : string = 'c'

A Unicode normalization form: 'c', 'd', 'kc', 'kd' or 'kc_casefold'.

Return values
object

A reference to this object for method chaining.
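
When the intl extension is available, the form names map naturally onto its Normalizer class. The sketch below assumes ext-intl is loaded and shows only that path; normalizeSketch() is a hypothetical helper, 'kc_casefold' is left out, and any fallback code SMF uses without intl is not reproduced.

```php
<?php
// Sketch only: maps the documented form names onto intl's Normalizer.
// normalizeSketch() is a hypothetical helper, not part of SMF.
function normalizeSketch(string $string, string $form = 'c'): ?string
{
    if (!class_exists('Normalizer')) {
        return null; // intl is not loaded, so there is nothing to demonstrate.
    }

    $forms = [
        'c'  => Normalizer::FORM_C,
        'd'  => Normalizer::FORM_D,
        'kc' => Normalizer::FORM_KC,
        'kd' => Normalizer::FORM_KD,
    ];

    if (!isset($forms[$form])) {
        throw new InvalidArgumentException("Unsupported form: $form");
    }

    $result = Normalizer::normalize($string, $forms[$form]);

    // Normalizer::normalize() returns false on failure.
    return $result === false ? null : $result;
}

$decomposed = "e\u{0301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
$composed = "\u{00E9}";    // precomposed U+00E9
```

With intl loaded, normalizeSketch($decomposed, 'c') yields the precomposed character, and Normalizer::isNormalized() can be used to check whether a string is already in a given form before doing any work.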

sanitizeInvisibles()

Helper function for Utils::sanitizeChars() that deals with invisible characters.

public sanitizeInvisibles(int $level, string $substitute) : object

This function deals with control characters, private use characters, non-characters, and characters that are invisible by definition in the Unicode standard. It does not deal with characters that are supposed to be visible according to the Unicode standard, and makes no attempt to compensate for possibly incomplete Unicode support in text rendering engines on client devices.

Parameters
$level : int

Controls how invisible formatting characters are handled:

  • 0: Allow valid formatting characters. Use for sanitizing text in posts.
  • 1: Allow necessary formatting characters. Use for sanitizing usernames.
  • 2: Disallow all formatting characters. Use for internal comparisons only, such as in the word censor, search contexts, etc.

$substitute : string

Replacement string for the invalid characters.

Return values
object

A reference to this object for method chaining.
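
As a rough illustration of the strictest setting only, the sketch below replaces every character in the Unicode general categories Cf (format) and Co (private use) with the substitute string. This is an assumption-laden approximation: stripFormatCharsSketch() is a hypothetical helper, and the actual per-level rules of sanitizeInvisibles() are more nuanced than a single regular expression.

```php
<?php
// Sketch only: a crude approximation of "disallow all formatting characters".
// stripFormatCharsSketch() is a hypothetical helper, not part of SMF.
function stripFormatCharsSketch(string $string, string $substitute = ''): string
{
    // \p{Cf} matches format characters (e.g. U+200B ZERO WIDTH SPACE);
    // \p{Co} matches private use characters (e.g. U+E000).
    $result = preg_replace('/[\p{Cf}\p{Co}]/u', $substitute, $string);

    // preg_replace() returns null on error, e.g. for invalid UTF-8 input.
    return $result ?? $string;
}

// A zero width space hidden inside a word, plus a private use character.
$clean = stripFormatCharsSketch("pass\u{200B}word\u{E000}");
```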

semanticSplit()

Splits the string into parts using the Unicode word break algorithm.

public semanticSplit() : array<string|int, mixed>
Return values
array<string|int, mixed>

The parts of the string.
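
The Unicode word break algorithm is also available through the intl extension's IntlBreakIterator, which makes it possible to sketch the idea without SMF. The example assumes ext-intl is loaded. semanticSplitSketch() and extractWordsSketch() are hypothetical helpers, and the word filter keeps only parts containing a letter or number, so unlike extractWords() it does not count emoji as words.

```php
<?php
// Sketch only: uses intl's IntlBreakIterator as a stand-in.
// Both functions are hypothetical helpers, not part of SMF.
function semanticSplitSketch(string $string, string $locale = 'en'): array
{
    $iterator = IntlBreakIterator::createWordInstance($locale);
    $iterator->setText($string);

    $parts = [];

    // The parts iterator yields the text between consecutive boundaries.
    foreach ($iterator->getPartsIterator() as $part) {
        $parts[] = $part;
    }

    return $parts;
}

function extractWordsSketch(string $string, string $locale = 'en'): array
{
    // Keep only the parts that contain at least one letter or number.
    return array_values(array_filter(
        semanticSplitSketch($string, $locale),
        fn (string $part): bool => preg_match('/[\p{L}\p{N}]/u', $part) === 1
    ));
}

// Only run the demonstration when intl is actually available.
if (class_exists('IntlBreakIterator')) {
    $parts = semanticSplitSketch('Hello, world!');
    $words = extractWordsSketch('Hello, world!');
}
```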

normalizeC()

Normalizes via Canonical Decomposition then Canonical Composition.

protected normalizeC() : object

On failure, $this->string will be unset.

Return values
object

A reference to this object for method chaining.

normalizeD()

Normalizes via Canonical Decomposition.

protected normalizeD() : object

On failure, $this->string will be unset.

Return values
object

A reference to this object for method chaining.

normalizeKC()

Normalizes via Compatibility Decomposition then Canonical Composition.

protected normalizeKC() : object

On failure, $this->string will be unset.

Return values
object

A reference to this object for method chaining.

normalizeKCFold()

Casefolds UTF-8 via Compatibility Composition Casefolding.

protected normalizeKCFold() : object

Used by idn_to_ascii polyfill in Subs-Compat.php.

On failure, $this->string will be unset.

Return values
object

A reference to this object for method chaining.

normalizeKD()

Normalizes via Compatibility Decomposition.

protected normalizeKD() : object

On failure, $this->string will be unset.

Return values
object

A reference to this object for method chaining.

preserveEmoji()

Replaces emoji characters and sequences in $this->string with placeholders in order to preserve them from further processing.

protected preserveEmoji(array<string|int, mixed> &$placeholders) : void

The placeholders are added to $placeholders.

Parameters
$placeholders : array<string|int, mixed>

Array of placeholders that can be used to restore the original characters.
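
The placeholder technique shared by preserveEmoji(), sanitizeJoinControls(), and sanitizeVariationSelectors() can be sketched in isolation. Everything here is an assumption made for illustration: preserveSketch() is a hypothetical helper, the placeholder format is invented, and the regular expression covers only a small block of emoji code points rather than full emoji sequences.

```php
<?php
// Sketch only: swaps matches for placeholders, then restores them later.
// preserveSketch() and the placeholder format are hypothetical.
function preserveSketch(string $string, array &$placeholders): string
{
    $result = preg_replace_callback(
        '/[\x{1F300}-\x{1FAFF}]/u', // a narrow, illustrative emoji range
        function (array $match) use (&$placeholders): string {
            // Private use delimiters keep the keys from colliding with text.
            $key = "\u{E000}" . count($placeholders) . "\u{E001}";
            $placeholders[$key] = $match[0];

            return $key;
        },
        $string
    );

    // preg_replace_callback() returns null on error.
    return $result ?? $string;
}

$placeholders = [];

// 1. Protect the emoji (U+1F389 PARTY POPPER).
$protected = preserveSketch("party \u{1F389} time", $placeholders);

// 2. Run some other processing that must not touch the emoji.
$processed = mb_strtoupper($protected, 'UTF-8');

// 3. Swap the placeholders back for the original characters.
$restored = strtr($processed, $placeholders);
```

Passing $placeholders by reference lets several protection steps add to the same array, so that one final strtr() call restores everything at once.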

sanitizeJoinControls()

Replaces allowed join controls inside words in $this->string with placeholders in order to preserve them from further processing.

protected sanitizeJoinControls(array<string|int, mixed> &$placeholders, int $level, string $substitute) : void

Join controls are only allowed inside words in special circumstances. See https://unicode.org/reports/tr31/#Layout_and_Format_Control_Characters

The placeholders are added to $placeholders.

Parameters
$placeholders : array<string|int, mixed>

Array of placeholders that can be used to restore the original characters.

$level : int

See documentation for Utf8String::sanitizeInvisibles().

$substitute : string

Replacement string for the invalid characters.

sanitizeVariationSelectors()

Replaces sanctioned variation sequences in $this->string with placeholders in order to preserve them from further processing.

protected sanitizeVariationSelectors(array<string|int, mixed> &$placeholders, string $substitute) : void

Unicode gives pre-defined lists of sanctioned variation sequences and says any use of variation selectors outside those sequences is unsanctioned.

The placeholders are added to $placeholders.

Parameters
$placeholders : array<string|int, mixed>

Array of placeholders that can be used to restore the original characters.

$substitute : string

Replacement string for the invalid characters.


        