UpdateUnicode extends BackgroundTask

This class contains code used to update SMF's Unicode data files.

Table of Contents

Constants

DATA_URL_CLDR  = 'https://raw.githubusercontent.com/unicode-org/cldr-json/main/cldr-json'
DATA_URL_IDNA  = 'https://www.unicode.org/Public/idna/latest'
DATA_URL_SECURITY  = 'https://www.unicode.org/Public/security/latest'
DATA_URL_UCD  = 'https://www.unicode.org/Public/UCD/latest/ucd'
URLs where we can fetch the Unicode data files.
RECEIVE_NOTIFY_ALERT  = 0x1
RECEIVE_NOTIFY_EMAIL  = 0x2
Constants for notification types.

Properties

$temp_dir  : string
$ucd_version  : string
$unicodedir  : string
$_details  : array<string|int, mixed>
$char_data  : array<string|int, mixed>
$derived_normalization_props  : array<string|int, mixed>
$full_decomposition_maps  : array<string|int, mixed>
$funcs  : array<string|int, mixed>
$prefetch  : array<string|int, mixed>
$script_aliases  : array<string|int, mixed>
$script_stats  : array<string|int, mixed>
$time_limit  : int

Methods

__construct()  : mixed
The constructor.
execute()  : bool
This executes the task.
export_funcs_to_file()  : void
Updates Unicode data functions in their designated files.
getMinUserInfo()  : array<string|int, mixed>
Loads minimal info for the previously loaded user IDs.
build_compressed_character_script_data()  : bool
Builds compressed character script data for the spoof detector.
build_confusables()  : bool
Builds confusables data for the spoof detector.
build_currencies()  : bool
Builds information about different currencies.
build_func_array()  : void
Helper for get_function_code_and_regex(). Builds the function's data array.
build_idna()  : bool
Builds maps and regex classes for IDNA purposes.
build_plurals()  : bool
Builds pluralization rules for different languages.
build_quick_check()  : bool
Builds regular expressions for normalization quick check.
build_regex_identifier_status()  : bool
Builds regex to distinguish characters' Identifier_Status value.
build_regex_indic()  : bool
Builds regex classes for join control tests in utf8_sanitize_invisibles.
build_regex_joining_type()  : bool
Builds regex classes for join control tests in utf8_sanitize_invisibles.
build_regex_properties()  : bool
Builds regular expression classes for extended Unicode properties.
build_regex_variation_selectors()  : bool
Builds regular expression classes for filtering variation selectors.
build_script_stats()  : bool
Helper function for build_regex_joining_type and build_regex_indic.
deltree()  : void
Deletes a directory and its contents.
fetch_unicode_file()  : string|bool
Fetches the contents of a Unicode data file.
finalize_decomposition_forms()  : bool
Finalizes all the decomposition forms.
get_function_code_and_regex()  : array<string|int, mixed>
Builds complete code for the specified element in $this->funcs to be inserted into the relevant PHP file. Also builds a regex to check whether a copy of the function is already present in the file.
lookup_ucd_version()  : bool
Sets $this->ucd_version to the latest version number of the UCD.
make_temp_dir()  : void
Makes a temporary directory to hold our working files, and sets $this->temp_dir to the path of the created directory.
process_casing_data()  : bool
Processes SpecialCasing.txt and CaseFolding.txt in order to get finalized versions of all case conversion data.
process_derived_normalization_props()  : bool
Processes DerivedNormalizationProps.txt in order to populate $this->derived_normalization_props.
process_main_unicode_data()  : bool
Processes UnicodeData.txt in order to populate $this->char_data, $this->full_decomposition_maps, and the 'data' element of most elements of $this->funcs.
should_update()  : bool
Compares version of SMF's local Unicode data with the latest release.
smf_file_header()  : string
Gets basic boilerplate for the PHP files that will be created.

Constants

DATA_URL_CLDR

public mixed DATA_URL_CLDR = 'https://raw.githubusercontent.com/unicode-org/cldr-json/main/cldr-json'

DATA_URL_IDNA

public mixed DATA_URL_IDNA = 'https://www.unicode.org/Public/idna/latest'

DATA_URL_SECURITY

public mixed DATA_URL_SECURITY = 'https://www.unicode.org/Public/security/latest'

DATA_URL_UCD

URLs where we can fetch the Unicode data files.

public mixed DATA_URL_UCD = 'https://www.unicode.org/Public/UCD/latest/ucd'

RECEIVE_NOTIFY_ALERT

public mixed RECEIVE_NOTIFY_ALERT = 0x1

RECEIVE_NOTIFY_EMAIL

Constants for notification types.

public mixed RECEIVE_NOTIFY_EMAIL = 0x2
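
The values 0x1 and 0x2 suggest that the notification constants are meant to be combined as bit flags. The following is a minimal sketch under that assumption; the global constants mirror the documented values, and notify_types() is a hypothetical helper that is not part of this class.

```php
<?php
// Stand-ins for the documented class constants (same values).
const RECEIVE_NOTIFY_ALERT = 0x1;
const RECEIVE_NOTIFY_EMAIL = 0x2;

/**
 * Hypothetical helper: reports which notification types a preference value enables.
 */
function notify_types(int $pref): array
{
    return [
        'alert' => ($pref & RECEIVE_NOTIFY_ALERT) !== 0,
        'email' => ($pref & RECEIVE_NOTIFY_EMAIL) !== 0,
    ];
}

// Combine both flags with a bitwise OR.
$both = RECEIVE_NOTIFY_ALERT | RECEIVE_NOTIFY_EMAIL;
var_dump(notify_types($both));
```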

Properties

$temp_dir

public string $temp_dir = ''

Path to temporary working directory.

$ucd_version

public string $ucd_version = ''

The latest official release of the Unicode Character Database.

$unicodedir

public string $unicodedir = ''

Convenience alias of Config::$sourcedir . '/Unicode'.

$_details

protected array<string|int, mixed> $_details

Holds the details for the task.

$char_data

private array<string|int, mixed> $char_data = []

Assorted info about Unicode characters.

$derived_normalization_props

private array<string|int, mixed> $derived_normalization_props = []

Character properties used during normalization.

$full_decomposition_maps

private array<string|int, mixed> $full_decomposition_maps = []

Key-value pairs of character decompositions.

$funcs

private array<string|int, mixed> $funcs = [
    [
        'file' => 'Metadata.php',
        'regex' => '/if \(!defined\(\'SMF_UNICODE_VERSION\'\)\)(?:\s*{)?\n\tdefine\(\'SMF_UNICODE_VERSION\', \'\d+(\.\d+)*\'\);(?:\n})?/',
        'data' => [
            // 0.0.0.0 will be replaced with correct value at runtime.
            "if (!defined('SMF_UNICODE_VERSION')) {\n\tdefine('SMF_UNICODE_VERSION', '0.0.0.0');\n}",
        ],
    ],
    'utf8_normalize_d_maps' => ['file' => 'DecompositionCanonical.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_d.'], 'return' => ['type' => 'array', 'desc' => 'Canonical Decomposition maps for Unicode normalization.'], 'data' => []],
    'utf8_normalize_kd_maps' => ['file' => 'DecompositionCompatibility.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_kd.'], 'return' => ['type' => 'array', 'desc' => 'Compatibility Decomposition maps for Unicode normalization.'], 'data' => []],
    'utf8_compose_maps' => ['file' => 'Composition.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_compose.'], 'return' => ['type' => 'array', 'desc' => 'Composition maps for Unicode normalization.'], 'data' => []],
    'utf8_combining_classes' => ['file' => 'CombiningClasses.php', 'key_type' => 'hexchar', 'val_type' => 'int', 'desc' => ['Helper function for utf8_normalize_d.'], 'return' => ['type' => 'array', 'desc' => 'Combining Class data for Unicode normalization.'], 'data' => []],
    'utf8_strtolower_simple_maps' => ['file' => 'CaseLower.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtolower.'], 'return' => ['type' => 'array', 'desc' => 'Uppercase to lowercase maps.'], 'data' => []],
    'utf8_strtolower_maps' => ['file' => 'CaseLower.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtolower.'], 'return' => ['type' => 'array', 'desc' => 'Uppercase to lowercase maps.'], 'data' => []],
    'utf8_strtoupper_simple_maps' => ['file' => 'CaseUpper.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtoupper.'], 'return' => ['type' => 'array', 'desc' => 'Lowercase to uppercase maps.'], 'data' => []],
    'utf8_strtoupper_maps' => ['file' => 'CaseUpper.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtoupper.'], 'return' => ['type' => 'array', 'desc' => 'Lowercase to uppercase maps.'], 'data' => []],
    'utf8_titlecase_simple_maps' => ['file' => 'CaseTitle.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_convert_case.'], 'return' => ['type' => 'array', 'desc' => 'Simple title case maps.'], 'data' => []],
    'utf8_titlecase_maps' => ['file' => 'CaseTitle.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_convert_case.'], 'return' => ['type' => 'array', 'desc' => 'Full title case maps.'], 'data' => []],
    'utf8_casefold_simple_maps' => ['file' => 'CaseFold.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Casefolding maps.'], 'data' => []],
    'utf8_casefold_maps' => ['file' => 'CaseFold.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Casefolding maps.'], 'data' => []],
    'utf8_default_ignorables' => ['file' => 'DefaultIgnorables.php', 'key_type' => 'int', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_kc_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Characters with the \'Default_Ignorable_Code_Point\' property.'], 'data' => []],
    'utf8_regex_properties' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'propfiles' => ['DerivedCoreProperties.txt', 'PropList.txt', 'emoji/emoji-data.txt', 'extracted/DerivedGeneralCategory.txt', 'auxiliary/WordBreakProperty.txt'], 'props' => ['ALetter', 'Bidi_Control', 'Case_Ignorable', 'Cn', 'Default_Ignorable_Code_Point', 'Emoji', 'Emoji_Modifier', 'Extend', 'ExtendNumLet', 'Format', 'Hebrew_Letter', 'Ideographic', 'Join_Control', 'Katakana', 'MidLetter', 'MidNum', 'MidNumLet', 'Numeric', 'Regional_Indicator', 'Variation_Selector', 'WSegSpace'], 'desc' => ['Helper function for utf8_sanitize_invisibles and utf8_convert_case.', '', 'Character class lists compiled from:', self::DATA_URL_UCD . '/DerivedCoreProperties.txt', self::DATA_URL_UCD . '/PropList.txt', self::DATA_URL_UCD . '/emoji/emoji-data.txt', self::DATA_URL_UCD . '/extracted/DerivedGeneralCategory.txt', self::DATA_URL_UCD . '/auxiliary/WordBreakProperty.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for various Unicode properties.'], 'data' => []],
    'utf8_regex_variation_selectors' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', self::DATA_URL_UCD . '/StandardizedVariants.txt', self::DATA_URL_UCD . '/emoji/emoji-variation-sequences.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for filtering variation selectors.'], 'data' => []],
    'utf8_regex_joining_type' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', self::DATA_URL_UCD . '/extracted/DerivedJoiningType.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for joining characters in certain scripts.'], 'data' => []],
    'utf8_regex_indic' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', self::DATA_URL_UCD . '/extracted/DerivedCombiningClass.txt', self::DATA_URL_UCD . '/IndicSyllabicCategory.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for Indic scripts that use viramas.'], 'data' => []],
    'utf8_regex_quick_check' => ['file' => 'QuickCheck.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_is_normalized.', '', 'Character class lists compiled from:', self::DATA_URL_UCD . '/extracted/DerivedNormalizationProps.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for disallowed characters in normalization forms.'], 'data' => []],
    'idna_maps' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Character maps for IDNA processing.'], 'data' => []],
    'idna_maps_deviation' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => '"Deviation" character maps for IDNA processing.'], 'data' => []],
    'idna_maps_not_std3' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Non-STD3 character maps for IDNA processing.'], 'data' => []],
    'idna_regex' => ['file' => 'Idna.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Regular expressions useful for IDNA processing.'], 'data' => []],
    'plurals' => ['file' => 'Plurals.php', 'key_type' => 'string', 'val_type' => 'array', 'desc' => ['Helper function for SMF\Localization\MessageFormatter::formatMessage.', '', 'Rules compiled from:', 'https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-core/supplemental/plurals.json', 'https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-core/supplemental/ordinals.json'], 'return' => ['type' => 'array', 'desc' => 'Pluralization rules for different languages'], 'data' => []],
    'currencies' => ['file' => 'Currencies.php', 'key_type' => 'string', 'val_type' => 'array', 'desc' => ['Helper function for SMF\Localization\MessageFormatter::formatMessage.', '', 'Rules compiled from:', 'https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-core/supplemental/currencyData.json'], 'return' => ['type' => 'array', 'desc' => 'Information about different currencies'], 'data' => []],
    'country_currencies' => ['file' => 'Currencies.php', 'key_type' => 'string', 'val_type' => 'array', 'desc' => ['Helper function for SMF\Localization\MessageFormatter::formatMessage.', '', 'Rules compiled from:', 'https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-core/supplemental/currencyData.json'], 'return' => ['type' => 'array', 'desc' => 'Information about currencies used in different countries'], 'data' => []],
    'utf8_confusables' => ['file' => 'Confusables.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for SMF\Unicode\SpoofDetector::getSkeletonString.', '', 'Returns an array of "confusables" maps that can be used for confusable string', 'detection.', '', 'Data compiled from:', self::DATA_URL_SECURITY . '/confusables.txt'], 'return' => ['type' => 'array', 'desc' => '"Confusables" maps.'], 'data' => []],
    'utf8_character_scripts' => ['file' => 'Confusables.php', 'key_type' => 'hexchar', 'val_type' => 'array', 'desc' => ['Helper function for SpoofDetector::resolveScriptSet.', '', 'Each key in the returned array defines the END of a range of characters that', 'all have the same script set. For example, the first key, "\x40", means the', 'range of characters from "\x0" to "\x40". Then the second key, "\x5A",', 'means the range from "\x41" to "\x5A".', '', 'The first entry in each value array indicates the primary script (i.e. the', 'value of the Script property) for that set of characters. If those characters', 'can also occur in a limited number of other scripts (i.e. the Script_Extensions', 'property for those characters is not empty), those additional scripts are', 'listed after the first.', '', 'See https://www.unicode.org/reports/tr24/ for more info.'], 'return' => ['type' => 'array', 'desc' => 'Script data for ranges of Unicode characters.'], 'data' => []],
    'utf8_regex_identifier_status' => ['file' => 'Confusables.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for SpoofDetector::checkHomographNames.', '', 'Returns an array of regexes that can be used to check the "identifier status"', 'of characters in a string.'], 'return' => ['type' => 'array', 'desc' => 'Character classes for identifier statuses.'], 'data' => []],
]

Info about functions to build in SMF's Unicode data files.

$prefetch

private array<string|int, mixed> $prefetch = [self::DATA_URL_UCD => ['CaseFolding.txt', 'DerivedAge.txt', 'DerivedCoreProperties.txt', 'DerivedNormalizationProps.txt', 'IndicSyllabicCategory.txt', 'PropertyValueAliases.txt', 'PropList.txt', 'ScriptExtensions.txt', 'Scripts.txt', 'SpecialCasing.txt', 'StandardizedVariants.txt', 'UnicodeData.txt', 'emoji/emoji-data.txt', 'emoji/emoji-variation-sequences.txt', 'extracted/DerivedGeneralCategory.txt', 'extracted/DerivedJoiningType.txt', 'auxiliary/WordBreakProperty.txt'], self::DATA_URL_IDNA => ['IdnaMappingTable.txt'], self::DATA_URL_CLDR => ['cldr-core/supplemental/plurals.json', 'cldr-core/supplemental/ordinals.json', 'cldr-core/supplemental/currencyData.json'], self::DATA_URL_SECURITY => ['confusables.txt', 'IdentifierStatus.txt']]

Files to fetch, grouped by the DATA_URL_* constant that serves as their base URL.
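
The $prefetch map pairs each base URL with a list of relative file names. As a rough illustration of how such a map expands into download URLs, the sketch below uses a shortened copy of the map; prefetch_urls() is a hypothetical helper, and the real class may traverse the property differently.

```php
<?php
// Values copied from the documented DATA_URL_* constants.
const DATA_URL_UCD = 'https://www.unicode.org/Public/UCD/latest/ucd';
const DATA_URL_IDNA = 'https://www.unicode.org/Public/idna/latest';

// Shortened copy of the documented $prefetch default.
$prefetch = [
    DATA_URL_UCD => ['UnicodeData.txt', 'emoji/emoji-data.txt'],
    DATA_URL_IDNA => ['IdnaMappingTable.txt'],
];

/**
 * Hypothetical helper: expands the map into a flat list of absolute URLs.
 */
function prefetch_urls(array $prefetch): array
{
    $urls = [];

    foreach ($prefetch as $base => $files) {
        foreach ($files as $file) {
            $urls[] = rtrim($base, '/') . '/' . ltrim($file, '/');
        }
    }

    return $urls;
}

$urls = prefetch_urls($prefetch);
print_r($urls);
```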

$script_aliases

private array<string|int, mixed> $script_aliases = []

Tracks associations between character scripts' short and long names.

$script_stats

private array<string|int, mixed> $script_stats = []

Statistical info about character scripts (e.g. Latin, Greek, Cyrillic).

$time_limit

private int $time_limit = 30

Used to ensure we exit long-running tasks cleanly.
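
This page does not show how $time_limit is enforced. One common pattern, sketched here purely as an assumption, is to check the elapsed time before starting each unit of work and stop early when the budget is used up. run_within_limit() is a hypothetical helper, not a method of this class.

```php
<?php
/**
 * Hypothetical helper: runs the steps in order and stops before starting a
 * new one once the time budget (in seconds) has been used up.
 *
 * @return array Names of the steps that were completed.
 */
function run_within_limit(array $steps, float $time_limit): array
{
    $start = microtime(true);
    $done = [];

    foreach ($steps as $name => $step) {
        if (microtime(true) - $start >= $time_limit) {
            // Out of time: leave the remaining steps for a later run.
            break;
        }

        $step();
        $done[] = $name;
    }

    return $done;
}

$done = run_within_limit([
    'first' => function (): void {},
    'second' => function (): void {},
], 30.0);
```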

Methods

__construct()

The constructor.

public __construct(array<string|int, mixed> $details) : mixed
Parameters
$details : array<string|int, mixed>

The details for the task.

execute()

This executes the task.

public execute() : bool
Tags
todo

PHP 8.2: This can be changed to return type: true.

Return values
bool

Always returns true.

export_funcs_to_file()

Updates Unicode data functions in their designated files.

public export_funcs_to_file() : void

getMinUserInfo()

Loads minimal info for the previously loaded user IDs.

public getMinUserInfo([array<string|int, mixed> $user_ids = [] ]) : array<string|int, mixed>
Parameters
$user_ids : array<string|int, mixed> = []
Tags
throws
Exception
Return values
array<string|int, mixed>

build_compressed_character_script_data()

Builds compressed character script data for the spoof detector.

private build_compressed_character_script_data() : bool
Return values
bool
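
The description of utf8_character_scripts in $funcs says each key marks the END of a range of characters that share a script set. A lookup over such a table could look like the sketch below. It uses integer code points as keys for readability (the generated data uses 'hexchar' keys), the script names are illustrative, and scripts_for() is a hypothetical helper.

```php
<?php
/**
 * Hypothetical lookup over a range-end table.
 *
 * @param int $cp A code point.
 * @param array $range_ends Range-end code point => script list, in ascending key order.
 * @return array|null The script list for the range containing $cp, or null.
 */
function scripts_for(int $cp, array $range_ends): ?array
{
    foreach ($range_ends as $end => $scripts) {
        // The first range whose end is at or past $cp contains it.
        if ($cp <= $end) {
            return $scripts;
        }
    }

    return null;
}

// Illustrative table: U+0000..U+0040, then U+0041..U+005A.
$table = [
    0x40 => ['Common'],
    0x5A => ['Latin'],
];

$scripts = scripts_for(0x41, $table);
```

A linear scan keeps the sketch short. Because the keys are sorted, a binary search over them would be the natural choice for a full-size table.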

build_confusables()

Builds confusables data for the spoof detector.

private build_confusables() : bool
Return values
bool
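
For orientation, data lines in confusables.txt consist of semicolon-separated fields: source code point(s), target code point(s), and a type, followed by a '#' comment. The sketch below parses one such line. parse_confusable_line() is a hypothetical helper, it assumes the mbstring extension is available for mb_chr(), and it does not claim to match how this method processes the file.

```php
<?php
/**
 * Hypothetical parser for one line of confusables.txt.
 *
 * @return array|null [source, target] as UTF-8 strings, or null for
 *    comment-only, blank, or malformed lines.
 */
function parse_confusable_line(string $line): ?array
{
    // Drop the trailing comment, if any.
    $line = trim(explode('#', $line, 2)[0]);

    if ($line === '') {
        return null;
    }

    $fields = array_map('trim', explode(';', $line));

    if (count($fields) < 2) {
        return null;
    }

    // Converts a space-separated list of hex code points to a UTF-8 string.
    $to_utf8 = function (string $hex_list): string {
        $out = '';

        foreach (preg_split('/\s+/', $hex_list, -1, PREG_SPLIT_NO_EMPTY) as $hex) {
            $out .= mb_chr((int) hexdec($hex), 'UTF-8');
        }

        return $out;
    };

    return [$to_utf8($fields[0]), $to_utf8($fields[1])];
}

$pair = parse_confusable_line("0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES → LATIN SMALL LETTER C");
```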

build_currencies()

Builds information about different currencies.

private build_currencies() : bool
Return values
bool

build_func_array()

Helper for get_function_code_and_regex(). Builds the function's data array.

private build_func_array(string &$func_code, array<string|int, mixed> $data, string $key_type, string $val_type) : void
Parameters
$func_code : string

The raw string that contains function code.

$data : array<string|int, mixed>

Data to format as an array.

$key_type : string

How to format the array keys.

$val_type : string

How to format the array values.
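
To make the role of $key_type and $val_type concrete, here is a simplified sketch that appends a formatted array literal to a code string passed by reference. It handles only the 'hexchar' and 'int' types, and the exact output format (byte-wise \xNN escapes, tab indentation) is an assumption. hexchar() and append_array() are hypothetical names, not the class's own code.

```php
<?php
/**
 * Hypothetical formatter: renders a string as a double-quoted PHP literal
 * made of \xNN byte escapes.
 */
function hexchar(string $s): string
{
    $out = '';

    foreach (str_split($s) as $byte) {
        $out .= '\\x' . strtoupper(bin2hex($byte));
    }

    return '"' . $out . '"';
}

/**
 * Hypothetical, simplified stand-in for build_func_array().
 */
function append_array(string &$func_code, array $data, string $key_type, string $val_type): void
{
    $format = function ($value, string $type): string {
        return $type === 'hexchar' ? hexchar((string) $value) : (string) (int) $value;
    };

    $func_code .= "\treturn [\n";

    foreach ($data as $key => $value) {
        $func_code .= "\t\t" . $format($key, $key_type) . ' => ' . $format($value, $val_type) . ",\n";
    }

    $func_code .= "\t];\n";
}

// U+0300 COMBINING GRAVE ACCENT has canonical combining class 230.
$code = '';
append_array($code, ["\u{0300}" => 230], 'hexchar', 'int');
echo $code;
```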

build_idna()

Builds maps and regex classes for IDNA purposes.

private build_idna() : bool
Return values
bool

build_plurals()

Builds pluralization rules for different languages.

private build_plurals() : bool
Return values
bool

build_quick_check()

Builds regular expressions for normalization quick check.

private build_quick_check() : bool
Return values
bool

build_regex_identifier_status()

Builds regex to distinguish characters' Identifier_Status value.

private build_regex_identifier_status() : bool
Return values
bool

build_regex_indic()

Builds regex classes for join control tests in utf8_sanitize_invisibles.

private build_regex_indic() : bool

Specifically, for Indic scripts like Devanagari.

Return values
bool

build_regex_joining_type()

Builds regex classes for join control tests in utf8_sanitize_invisibles.

private build_regex_joining_type() : bool

Specifically, for cursive scripts like Arabic.

Return values
bool

build_regex_properties()

Builds regular expression classes for extended Unicode properties.

private build_regex_properties() : bool
Return values
bool
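
A regular expression class for a Unicode property is essentially a compact list of code point ranges. The sketch below shows one way to collapse a set of code points into PCRE range notation. to_char_class() and format_range() are hypothetical helpers, and the output syntax is an assumption rather than the format this method emits.

```php
<?php
/**
 * Hypothetical helper: collapses code points into PCRE character class
 * content, merging consecutive values into ranges.
 */
function to_char_class(array $code_points): string
{
    $code_points = array_unique($code_points);
    sort($code_points);

    $parts = [];
    $start = $prev = null;

    foreach ($code_points as $cp) {
        if ($prev !== null && $cp === $prev + 1) {
            // Still inside the current run.
            $prev = $cp;
            continue;
        }

        if ($start !== null) {
            $parts[] = format_range($start, $prev);
        }

        $start = $prev = $cp;
    }

    if ($start !== null) {
        $parts[] = format_range($start, $prev);
    }

    return implode('', $parts);
}

/** Hypothetical helper: formats one range as \x{START}-\x{END}. */
function format_range(int $start, int $end): string
{
    $first = sprintf('\\x{%X}', $start);

    return $start === $end ? $first : $first . '-' . sprintf('\\x{%X}', $end);
}

$class = to_char_class([0x200C, 0x200D, 0xFE00, 0xFE01, 0xFE02]);

// The u modifier is required for \x{...} escapes above 0xFF.
$has_match = preg_match('/[' . $class . ']/u', "a\u{200D}b") === 1;
```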

build_regex_variation_selectors()

Builds regular expression classes for filtering variation selectors.

private build_regex_variation_selectors() : bool
Return values
bool

build_script_stats()

Helper function for build_regex_joining_type and build_regex_indic.

private build_script_stats() : bool
Return values
bool

deltree()

Deletes a directory and its contents.

private deltree(string $dir_path) : void
Parameters
$dir_path : string
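
Deleting a directory together with its contents is usually done depth-first, because rmdir() only removes empty directories. The following is a minimal sketch of that approach; delete_tree() is a hypothetical stand-in and may differ from the actual deltree() implementation.

```php
<?php
/**
 * Hypothetical stand-in for deltree(): removes $dir_path and everything
 * under it.
 */
function delete_tree(string $dir_path): void
{
    if (!is_dir($dir_path)) {
        return;
    }

    foreach (scandir($dir_path) as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }

        $path = $dir_path . DIRECTORY_SEPARATOR . $entry;

        if (is_dir($path) && !is_link($path)) {
            // Recurse into real subdirectories; do not follow symlinks.
            delete_tree($path);
        } else {
            unlink($path);
        }
    }

    rmdir($dir_path);
}

// Usage: build a small tree in the system temp directory, then remove it.
$dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'deltree_demo_' . uniqid();
mkdir($dir . DIRECTORY_SEPARATOR . 'sub', 0777, true);
file_put_contents($dir . DIRECTORY_SEPARATOR . 'sub' . DIRECTORY_SEPARATOR . 'a.txt', 'x');
delete_tree($dir);
```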

fetch_unicode_file()

Fetches the contents of a Unicode data file.

private fetch_unicode_file(string $filename, string $data_url) : string|bool

Caches a local copy for subsequent lookups.

Parameters
$filename : string

Name of a Unicode datafile, relative to $data_url.

$data_url : string

One of this class's DATA_URL_* constants.

Return values
string|bool

Path to locally saved copy of the file.
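
The cache-then-fetch behavior described above can be sketched as follows. This is an illustration under stated assumptions: the downloader is injected as a callable so the example runs offline, the local file naming scheme is invented, and returning false on failure is a guess based on the string|bool return type. fetch_cached() is a hypothetical name.

```php
<?php
/**
 * Hypothetical cache-then-fetch helper.
 *
 * @param string $filename File name relative to $data_url.
 * @param string $data_url Base URL.
 * @param string $cache_dir Directory for local copies.
 * @param callable $fetch Receives a URL, returns the file contents or false.
 * @return string|bool Path to the local copy, or false on failure.
 */
function fetch_cached(string $filename, string $data_url, string $cache_dir, callable $fetch): string|bool
{
    // Invented naming scheme: hash of the base URL plus a flattened file name.
    $local = $cache_dir . DIRECTORY_SEPARATOR . md5($data_url) . '_' . strtr($filename, '/\\', '__');

    if (is_file($local)) {
        // Already cached by an earlier call.
        return $local;
    }

    $contents = $fetch($data_url . '/' . ltrim($filename, '/'));

    if (!is_string($contents) || $contents === '') {
        return false;
    }

    if (file_put_contents($local, $contents) === false) {
        return false;
    }

    return $local;
}

// Usage with a fake downloader that counts how often it is called.
$calls = 0;
$fake_fetch = function (string $url) use (&$calls): string {
    $calls++;

    return 'data from ' . $url;
};

$name = 'demo_' . uniqid() . '.txt';
$first = fetch_cached($name, 'https://example.org/data', sys_get_temp_dir(), $fake_fetch);
$second = fetch_cached($name, 'https://example.org/data', sys_get_temp_dir(), $fake_fetch);
```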

finalize_decomposition_forms()

Finalizes all the decomposition forms.

private finalize_decomposition_forms() : bool

This is necessary because some characters decompose to other characters that themselves decompose further.

Return values
bool
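
The need for this step is easiest to see with a concrete case: U+1EA4 decomposes to U+00C2 U+0301, and U+00C2 in turn decomposes to U+0041 U+0302. The sketch below expands such chains recursively. It uses integer code points instead of the class's 'hexchar' strings, and fully_decompose() is a hypothetical helper.

```php
<?php
/**
 * Hypothetical helper: expands every mapping until no mapped code point
 * remains in any value. Assumes the mappings contain no cycles.
 *
 * @param array $maps Code point => list of code points it decomposes to.
 * @return array Code point => fully decomposed list of code points.
 */
function fully_decompose(array $maps): array
{
    $expand = function (int $cp) use (&$expand, $maps): array {
        if (!isset($maps[$cp])) {
            // Not decomposable: the code point stands for itself.
            return [$cp];
        }

        $out = [];

        foreach ($maps[$cp] as $part) {
            $out = array_merge($out, $expand($part));
        }

        return $out;
    };

    $full = [];

    foreach (array_keys($maps) as $cp) {
        $full[$cp] = $expand($cp);
    }

    return $full;
}

$maps = [
    0x1EA4 => [0x00C2, 0x0301],
    0x00C2 => [0x0041, 0x0302],
];

$full = fully_decompose($maps);
```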

get_function_code_and_regex()

Builds complete code for the specified element in $this->funcs to be inserted into the relevant PHP file. Also builds a regex to check whether a copy of the function is already present in the file.

private get_function_code_and_regex(string|int $func_name) : array<string|int, mixed>
Parameters
$func_name : string|int

Key of an element in $this->funcs. If an int is provided, it is considered raw code such as a header, and does not replace a function in the file.

Return values
array<string|int, mixed>

PHP code and a regular expression.
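
The pairing of generated code with a detection regex allows a "replace if present, otherwise append" update of the target file. The sketch below illustrates the idea for a function that takes no arguments and returns an array. The generated code layout and the regex are assumptions, and both function names are hypothetical.

```php
<?php
/**
 * Hypothetical helper: returns [code, regex] for a function named
 * $func_name whose body is $body.
 */
function function_code_and_regex(string $func_name, string $body): array
{
    $code = 'function ' . $func_name . "(): array\n{\n" . $body . "}\n";

    // Matches an existing definition, from "function name()" up to the
    // first line that consists solely of a closing brace.
    $regex = '/function ' . preg_quote($func_name, '/') . '\(\)(?::\s*array)?\s*\{.*?\n\}\n/s';

    return [$code, $regex];
}

/**
 * Hypothetical helper: replaces the function if the regex matches,
 * otherwise appends the code to the end of the file contents.
 */
function upsert_function(string $file_contents, string $code, string $regex): string
{
    if (preg_match($regex, $file_contents)) {
        // A callback avoids treating "$" or "\" in the generated code as
        // backreferences.
        return preg_replace_callback($regex, fn () => $code, $file_contents, 1);
    }

    return rtrim($file_contents) . "\n\n" . $code;
}

$old = "<?php\n\nfunction utf8_demo_maps(): array\n{\n\treturn [1];\n}\n";
[$code, $regex] = function_code_and_regex('utf8_demo_maps', "\treturn [2];\n");
$new = upsert_function($old, $code, $regex);
```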

lookup_ucd_version()

Sets $this->ucd_version to the latest version number of the UCD.

private lookup_ucd_version() : bool
Return values
bool

make_temp_dir()

Makes a temporary directory to hold our working files, and sets $this->temp_dir to the path of the created directory.

private make_temp_dir() : void
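
A minimal sketch of creating a uniquely named working directory follows. It returns the path instead of setting $this->temp_dir, and both the location (the system temp directory) and the 'Unicode_' prefix are assumptions. The function here is a hypothetical standalone stand-in, not the method itself.

```php
<?php
/**
 * Hypothetical standalone stand-in for make_temp_dir().
 *
 * @return string Path of the newly created directory.
 */
function make_temp_dir(string $prefix = 'Unicode_'): string
{
    $base = rtrim(sys_get_temp_dir(), '\\/');

    // Pick a random name that does not exist yet.
    do {
        $path = $base . DIRECTORY_SEPARATOR . $prefix . bin2hex(random_bytes(4));
    } while (file_exists($path));

    if (!mkdir($path, 0700) && !is_dir($path)) {
        throw new RuntimeException('Could not create ' . $path);
    }

    return $path;
}

$temp_dir = make_temp_dir();
```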

process_casing_data()

Processes SpecialCasing.txt and CaseFolding.txt in order to get finalized versions of all case conversion data.

private process_casing_data() : bool
Return values
bool

process_derived_normalization_props()

Processes DerivedNormalizationProps.txt in order to populate $this->derived_normalization_props.

private process_derived_normalization_props() : bool
Return values
bool

process_main_unicode_data()

Processes UnicodeData.txt in order to populate $this->char_data, $this->full_decomposition_maps, and the 'data' element of most elements of $this->funcs.

private process_main_unicode_data() : bool
Return values
bool
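
UnicodeData.txt stores one character per line as semicolon-separated fields: code point (field 0), name (1), general category (2), canonical combining class (3), decomposition (5), and the simple uppercase, lowercase, and titlecase mappings (12 to 14). The sketch below parses a single line into an array. The array layout and the function name are hypothetical and do not reflect how $this->char_data is actually structured.

```php
<?php
/**
 * Hypothetical parser for one line of UnicodeData.txt.
 */
function parse_unicode_data_line(string $line): array
{
    $f = explode(';', rtrim($line, "\r\n"));

    $decomp = ['type' => null, 'map' => []];

    if ($f[5] !== '') {
        $parts = explode(' ', $f[5]);

        // A leading <tag> marks a compatibility decomposition.
        $decomp['type'] = $parts[0][0] === '<' ? array_shift($parts) : 'canonical';
        $decomp['map'] = array_map('hexdec', $parts);
    }

    return [
        'code_point' => hexdec($f[0]),
        'name' => $f[1],
        'general_category' => $f[2],
        'combining_class' => (int) $f[3],
        'decomposition' => $decomp,
        'upper' => $f[12] !== '' ? hexdec($f[12]) : null,
        'lower' => $f[13] !== '' ? hexdec($f[13]) : null,
        'title' => $f[14] !== '' ? hexdec($f[14]) : null,
    ];
}

$row = parse_unicode_data_line(
    '00C2;LATIN CAPITAL LETTER A WITH CIRCUMFLEX;Lu;0;L;0041 0302;;;;N;LATIN CAPITAL LETTER A CIRCUMFLEX;;;00E2;'
);
```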

should_update()

Compares version of SMF's local Unicode data with the latest release.

private should_update() : bool
Return values
bool

Whether SMF should update its local Unicode data or not.
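
The comparison itself can be illustrated with PHP's version_compare(). In this sketch both versions are passed in as arguments; how the class obtains the local version (presumably from the SMF_UNICODE_VERSION constant seen in $funcs) and the latest one is not shown here. The function is a hypothetical standalone stand-in.

```php
<?php
/**
 * Hypothetical standalone stand-in for should_update().
 *
 * @return bool True if the local data is older than the latest release.
 */
function should_update(string $local_version, string $latest_version): bool
{
    return version_compare($local_version, $latest_version, '<');
}

$needs_update = should_update('15.0.0.0', '15.1.0');
```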

smf_file_header()

Gets basic boilerplate for the PHP files that will be created.

private smf_file_header() : string
Return values
string

Standard SMF file header.


        