UpdateUnicode
extends BackgroundTask
This class contains code used to update SMF's Unicode data files.
Table of Contents
Constants
- DATA_URL_CLDR = 'https://raw.githubusercontent.com/unicode-org/cldr-json/main/cldr-json'
- DATA_URL_IDNA = 'https://www.unicode.org/Public/idna/latest'
- DATA_URL_SECURITY = 'https://www.unicode.org/Public/security/latest'
- DATA_URL_UCD = 'https://www.unicode.org/Public/UCD/latest/ucd'
- URLs where we can fetch the Unicode data files.
- RECEIVE_NOTIFY_ALERT = 0x1
- RECEIVE_NOTIFY_EMAIL = 0x2
- Constants for notification types.
Properties
- $temp_dir : string
- $ucd_version : string
- $unicodedir : string
- $_details : array<string|int, mixed>
- $char_data : array<string|int, mixed>
- $derived_normalization_props : array<string|int, mixed>
- $full_decomposition_maps : array<string|int, mixed>
- $funcs : array<string|int, mixed>
- $prefetch : array<string|int, mixed>
- $script_aliases : array<string|int, mixed>
- $script_stats : array<string|int, mixed>
- $time_limit : int
Methods
- __construct() : mixed
- The constructor.
- execute() : bool
- This executes the task.
- export_funcs_to_file() : void
- Updates Unicode data functions in their designated files.
- getMinUserInfo() : array<string|int, mixed>
- Loads minimal info for the previously loaded user IDs.
- build_compressed_character_script_data() : bool
- Builds compressed script data for ranges of Unicode characters, for use by the spoof detector.
- build_confusables() : bool
- Builds confusables data for the spoof detector.
- build_currencies() : bool
- Builds information about different currencies.
- build_func_array() : void
- Helper for get_function_code_and_regex(). Builds the function's data array.
- build_idna() : bool
- Builds maps and regex classes for IDNA purposes.
- build_plurals() : bool
- Builds pluralization rules for different languages.
- build_quick_check() : bool
- Builds regular expressions for normalization quick check.
- build_regex_identifier_status() : bool
- Builds regex to distinguish characters' Identifier_Status value.
- build_regex_indic() : bool
- Builds regex classes for join control tests in utf8_sanitize_invisibles.
- build_regex_joining_type() : bool
- Builds regex classes for join control tests in utf8_sanitize_invisibles.
- build_regex_properties() : bool
- Builds regular expression classes for extended Unicode properties.
- build_regex_variation_selectors() : bool
- Builds regular expression classes for filtering variation selectors.
- build_script_stats() : bool
- Helper function for build_regex_joining_type and build_regex_indic.
- deltree() : void
- Deletes a directory and its contents.
- fetch_unicode_file() : string|bool
- Fetches the contents of a Unicode data file.
- finalize_decomposition_forms() : bool
- Finalizes all the decomposition forms.
- get_function_code_and_regex() : array<string|int, mixed>
- Builds complete code for the specified element in $this->funcs to be inserted into the relevant PHP file. Also builds a regex to check whether a copy of the function is already present in the file.
- lookup_ucd_version() : bool
- Sets $this->ucd_version to latest version number of the UCD.
- make_temp_dir() : void
- Makes a temporary directory to hold our working files, and sets $this->temp_dir to the path of the created directory.
- process_casing_data() : bool
- Processes SpecialCasing.txt and CaseFolding.txt in order to get finalized versions of all case conversion data.
- process_derived_normalization_props() : bool
- Processes DerivedNormalizationProps.txt in order to populate $this->derived_normalization_props.
- process_main_unicode_data() : bool
- Processes UnicodeData.txt in order to populate $this->char_data, $this->full_decomposition_maps, and the 'data' element of most elements of $this->funcs.
- should_update() : bool
- Compares version of SMF's local Unicode data with the latest release.
- smf_file_header() : string
- Gets basic boilerplate for the PHP files that will be created.
Constants
DATA_URL_CLDR
public
mixed
DATA_URL_CLDR
= 'https://raw.githubusercontent.com/unicode-org/cldr-json/main/cldr-json'
DATA_URL_IDNA
public
mixed
DATA_URL_IDNA
= 'https://www.unicode.org/Public/idna/latest'
DATA_URL_SECURITY
public
mixed
DATA_URL_SECURITY
= 'https://www.unicode.org/Public/security/latest'
DATA_URL_UCD
URLs where we can fetch the Unicode data files.
public
mixed
DATA_URL_UCD
= 'https://www.unicode.org/Public/UCD/latest/ucd'
RECEIVE_NOTIFY_ALERT
public
mixed
RECEIVE_NOTIFY_ALERT
= 0x1
RECEIVE_NOTIFY_EMAIL
Constants for notification types.
public
mixed
RECEIVE_NOTIFY_EMAIL
= 0x2
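Because the two notification constants are distinct powers of two, they can be combined and tested as bit flags. The following is a minimal sketch of that idea; the helper name `notification_types()` and the idea of a stored preference bitmask are assumptions made for illustration, not part of this class.

```php
<?php
// Values copied from the constants listed above.
const RECEIVE_NOTIFY_ALERT = 0x1;
const RECEIVE_NOTIFY_EMAIL = 0x2;

/**
 * Hypothetical helper: reports which notification types a bitmask enables.
 */
function notification_types(int $pref): array
{
    $types = [];

    if ($pref & RECEIVE_NOTIFY_ALERT) {
        $types[] = 'alert';
    }

    if ($pref & RECEIVE_NOTIFY_EMAIL) {
        $types[] = 'email';
    }

    return $types;
}

// Combine both flags with bitwise OR, then inspect the result.
$both = RECEIVE_NOTIFY_ALERT | RECEIVE_NOTIFY_EMAIL;
$enabled = notification_types($both);
```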
Properties
$temp_dir
public
string
$temp_dir
= ''
Path to temporary working directory.
$ucd_version
public
string
$ucd_version
= ''
The latest official release of the Unicode Character Database.
$unicodedir
public
string
$unicodedir
= ''
Convenience alias of Config::$sourcedir . '/Unicode'.
$_details
protected
array<string|int, mixed>
$_details
Holds the details for the task.
$char_data
private
array<string|int, mixed>
$char_data
= []
Assorted info about Unicode characters.
$derived_normalization_props
private
array<string|int, mixed>
$derived_normalization_props
= []
Character properties used during normalization.
$full_decomposition_maps
private
array<string|int, mixed>
$full_decomposition_maps
= []
Key-value pairs of character decompositions.
$funcs
private
array<string|int, mixed>
$funcs
= [
    ['file' => 'Metadata.php', 'regex' => '/if \(!defined\(\'SMF_UNICODE_VERSION\'\)\)(?:\s*{)?\n\tdefine\(\'SMF_UNICODE_VERSION\', \'\d+(\.\d+)*\'\);(?:\n})?/', 'data' => [
        // 0.0.0.0 will be replaced with correct value at runtime.
        "if (!defined('SMF_UNICODE_VERSION')) {\n\tdefine('SMF_UNICODE_VERSION', '0.0.0.0');\n}",
    ]],
    'utf8_normalize_d_maps' => ['file' => 'DecompositionCanonical.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_d.'], 'return' => ['type' => 'array', 'desc' => 'Canonical Decomposition maps for Unicode normalization.'], 'data' => []],
    'utf8_normalize_kd_maps' => ['file' => 'DecompositionCompatibility.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_kd.'], 'return' => ['type' => 'array', 'desc' => 'Compatibility Decomposition maps for Unicode normalization.'], 'data' => []],
    'utf8_compose_maps' => ['file' => 'Composition.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_compose.'], 'return' => ['type' => 'array', 'desc' => 'Composition maps for Unicode normalization.'], 'data' => []],
    'utf8_combining_classes' => ['file' => 'CombiningClasses.php', 'key_type' => 'hexchar', 'val_type' => 'int', 'desc' => ['Helper function for utf8_normalize_d.'], 'return' => ['type' => 'array', 'desc' => 'Combining Class data for Unicode normalization.'], 'data' => []],
    'utf8_strtolower_simple_maps' => ['file' => 'CaseLower.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtolower.'], 'return' => ['type' => 'array', 'desc' => 'Uppercase to lowercase maps.'], 'data' => []],
    'utf8_strtolower_maps' => ['file' => 'CaseLower.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtolower.'], 'return' => ['type' => 'array', 'desc' => 'Uppercase to lowercase maps.'], 'data' => []],
    'utf8_strtoupper_simple_maps' => ['file' => 'CaseUpper.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtoupper.'], 'return' => ['type' => 'array', 'desc' => 'Lowercase to uppercase maps.'], 'data' => []],
    'utf8_strtoupper_maps' => ['file' => 'CaseUpper.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_strtoupper.'], 'return' => ['type' => 'array', 'desc' => 'Lowercase to uppercase maps.'], 'data' => []],
    'utf8_titlecase_simple_maps' => ['file' => 'CaseTitle.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_convert_case.'], 'return' => ['type' => 'array', 'desc' => 'Simple title case maps.'], 'data' => []],
    'utf8_titlecase_maps' => ['file' => 'CaseTitle.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_convert_case.'], 'return' => ['type' => 'array', 'desc' => 'Full title case maps.'], 'data' => []],
    'utf8_casefold_simple_maps' => ['file' => 'CaseFold.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Casefolding maps.'], 'data' => []],
    'utf8_casefold_maps' => ['file' => 'CaseFold.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Casefolding maps.'], 'data' => []],
    'utf8_default_ignorables' => ['file' => 'DefaultIgnorables.php', 'key_type' => 'int', 'val_type' => 'hexchar', 'desc' => ['Helper function for utf8_normalize_kc_casefold.'], 'return' => ['type' => 'array', 'desc' => 'Characters with the \'Default_Ignorable_Code_Point\' property.'], 'data' => []],
    'utf8_regex_properties' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'propfiles' => ['DerivedCoreProperties.txt', 'PropList.txt', 'emoji/emoji-data.txt', 'extracted/DerivedGeneralCategory.txt', 'auxiliary/WordBreakProperty.txt'], 'props' => ['ALetter', 'Bidi_Control', 'Case_Ignorable', 'Cn', 'Default_Ignorable_Code_Point', 'Emoji', 'Emoji_Modifier', 'Extend', 'ExtendNumLet', 'Format', 'Hebrew_Letter', 'Ideographic', 'Join_Control', 'Katakana', 'MidLetter', 'MidNum', 'MidNumLet', 'Numeric', 'Regional_Indicator', 'Variation_Selector', 'WSegSpace'], 'desc' => ['Helper function for utf8_sanitize_invisibles and utf8_convert_case.', '', 'Character class lists compiled from:', self::DATA_URL_UCD . '/DerivedCoreProperties.txt', self::DATA_URL_UCD . '/PropList.txt', self::DATA_URL_UCD . '/emoji/emoji-data.txt', self::DATA_URL_UCD . '/extracted/DerivedGeneralCategory.txt', self::DATA_URL_UCD . '/auxiliary/WordBreakProperty.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for various Unicode properties.'], 'data' => []],
    'utf8_regex_variation_selectors' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', self::DATA_URL_UCD . '/StandardizedVariants.txt', self::DATA_URL_UCD . '/emoji/emoji-variation-sequences.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for filtering variation selectors.'], 'data' => []],
    'utf8_regex_joining_type' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', self::DATA_URL_UCD . '/extracted/DerivedJoiningType.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for joining characters in certain scripts.'], 'data' => []],
    'utf8_regex_indic' => ['file' => 'RegularExpressions.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_sanitize_invisibles.', '', 'Character class lists compiled from:', self::DATA_URL_UCD . '/extracted/DerivedCombiningClass.txt', self::DATA_URL_UCD . '/IndicSyllabicCategory.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for Indic scripts that use viramas.'], 'data' => []],
    'utf8_regex_quick_check' => ['file' => 'QuickCheck.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for utf8_is_normalized.', '', 'Character class lists compiled from:', self::DATA_URL_UCD . '/extracted/DerivedNormalizationProps.txt'], 'return' => ['type' => 'array', 'desc' => 'Character classes for disallowed characters in normalization forms.'], 'data' => []],
    'idna_maps' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Character maps for IDNA processing.'], 'data' => []],
    'idna_maps_deviation' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => '"Deviation" character maps for IDNA processing.'], 'data' => []],
    'idna_maps_not_std3' => ['file' => 'Idna.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Non-STD3 character maps for IDNA processing.'], 'data' => []],
    'idna_regex' => ['file' => 'Idna.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for idn_to_* polyfills.'], 'return' => ['type' => 'array', 'desc' => 'Regular expressions useful for IDNA processing.'], 'data' => []],
    'plurals' => ['file' => 'Plurals.php', 'key_type' => 'string', 'val_type' => 'array', 'desc' => ['Helper function for SMF\Localization\MessageFormatter::formatMessage.', '', 'Rules compiled from:', 'https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-core/supplemental/plurals.json', 'https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-core/supplemental/ordinals.json'], 'return' => ['type' => 'array', 'desc' => 'Pluralization rules for different languages'], 'data' => []],
    'currencies' => ['file' => 'Currencies.php', 'key_type' => 'string', 'val_type' => 'array', 'desc' => ['Helper function for SMF\Localization\MessageFormatter::formatMessage.', '', 'Rules compiled from:', 'https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-core/supplemental/currencyData.json'], 'return' => ['type' => 'array', 'desc' => 'Information about different currencies'], 'data' => []],
    'country_currencies' => ['file' => 'Currencies.php', 'key_type' => 'string', 'val_type' => 'array', 'desc' => ['Helper function for SMF\Localization\MessageFormatter::formatMessage.', '', 'Rules compiled from:', 'https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-core/supplemental/currencyData.json'], 'return' => ['type' => 'array', 'desc' => 'Information about currencies used in different countries'], 'data' => []],
    'utf8_confusables' => ['file' => 'Confusables.php', 'key_type' => 'hexchar', 'val_type' => 'hexchar', 'desc' => ['Helper function for SMF\Unicode\SpoofDetector::getSkeletonString.', '', 'Returns an array of "confusables" maps that can be used for confusable string', 'detection.', '', 'Data compiled from:', self::DATA_URL_SECURITY . '/confusables.txt'], 'return' => ['type' => 'array', 'desc' => '"Confusables" maps.'], 'data' => []],
    'utf8_character_scripts' => ['file' => 'Confusables.php', 'key_type' => 'hexchar', 'val_type' => 'array', 'desc' => ['Helper function for SpoofDetector::resolveScriptSet.', '', 'Each key in the returned array defines the END of a range of characters that', 'all have the same script set. For example, the first key, "\x40", means the', 'range of characters from "\x0" to "\x40". Then the second key, "\x5A",', 'means the range from "\x41" to "\x5A".', '', 'The first entry in each value array indicates the primary script (i.e. the', 'value of the Script property) for that set of characters. If those characters', 'can also occur in a limited number of other scripts (i.e. the Script_Extensions', 'property for those characters is not empty), those additional scripts are', 'listed after the first.', '', 'See https://www.unicode.org/reports/tr24/ for more info.'], 'return' => ['type' => 'array', 'desc' => 'Script data for ranges of Unicode characters.'], 'data' => []],
    'utf8_regex_identifier_status' => ['file' => 'Confusables.php', 'key_type' => 'string', 'val_type' => 'string', 'desc' => ['Helper function for SpoofDetector::checkHomographNames.', '', 'Returns an array of regexes that can be used to check the "identifier status"', 'of characters in a string.'], 'return' => ['type' => 'array', 'desc' => 'Character classes for identifier statuses.'], 'data' => []],
]
Info about functions to build in SMF's Unicode data files.
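The `utf8_character_scripts` description above says that each key marks the END of a range of characters sharing one script set. The sketch below shows how such a range-end map could be queried. It is only an illustration: the function name `script_set_for()` is hypothetical, integer code points stand in for the `hexchar` keys, and the script names in the sample data are illustrative rather than the exact values SMF generates.

```php
<?php
/**
 * Hypothetical lookup over a range-end map.
 *
 * $ranges must be sorted by key in ascending order. Each key is the last
 * code point of a range; the range starts just after the previous key.
 */
function script_set_for(int $code_point, array $ranges): ?array
{
    foreach ($ranges as $range_end => $scripts) {
        // The first range whose end is at or beyond the code point contains it.
        if ($code_point <= $range_end) {
            return $scripts;
        }
    }

    // The code point lies beyond the last range in the map.
    return null;
}

// Sample data mirroring the two ranges named in the description.
$ranges = [
    0x40 => ['Common'], // "\x0" to "\x40"
    0x5A => ['Latin'],  // "\x41" to "\x5A"
];

$scripts = script_set_for(0x41, $ranges); // ['Latin']
```

A linear scan keeps the sketch short. For a full-size map, a binary search over the sorted keys would be the more natural choice.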
$prefetch
private
array<string|int, mixed>
$prefetch
= [self::DATA_URL_UCD => ['CaseFolding.txt', 'DerivedAge.txt', 'DerivedCoreProperties.txt', 'DerivedNormalizationProps.txt', 'IndicSyllabicCategory.txt', 'PropertyValueAliases.txt', 'PropList.txt', 'ScriptExtensions.txt', 'Scripts.txt', 'SpecialCasing.txt', 'StandardizedVariants.txt', 'UnicodeData.txt', 'emoji/emoji-data.txt', 'emoji/emoji-variation-sequences.txt', 'extracted/DerivedGeneralCategory.txt', 'extracted/DerivedJoiningType.txt', 'auxiliary/WordBreakProperty.txt'], self::DATA_URL_IDNA => ['IdnaMappingTable.txt'], self::DATA_URL_CLDR => ['cldr-core/supplemental/plurals.json', 'cldr-core/supplemental/ordinals.json', 'cldr-core/supplemental/currencyData.json'], self::DATA_URL_SECURITY => ['confusables.txt', 'IdentifierStatus.txt']]
Files to fetch from unicode.org and from the CLDR JSON repository on GitHub.
$script_aliases
private
array<string|int, mixed>
$script_aliases
= []
Tracks associations between character scripts' short and long names.
$script_stats
private
array<string|int, mixed>
$script_stats
= []
Statistical info about character scripts (e.g. Latin, Greek, Cyrillic, etc.)
$time_limit
private
int
$time_limit
= 30
Used to ensure we exit long running tasks cleanly.
Methods
__construct()
The constructor.
public
__construct(array<string|int, mixed> $details) : mixed
Parameters
- $details : array<string|int, mixed>
  The details for the task.
execute()
This executes the task.
public
execute() : bool
Return values
bool —Always returns true.
export_funcs_to_file()
Updates Unicode data functions in their designated files.
public
export_funcs_to_file() : void
getMinUserInfo()
Loads minimal info for the previously loaded user IDs.
public
getMinUserInfo([array<string|int, mixed> $user_ids = [] ]) : array<string|int, mixed>
Parameters
- $user_ids : array<string|int, mixed> = []
Return values
array<string|int, mixed>
build_compressed_character_script_data()
Builds compressed script data for ranges of Unicode characters, for use by the spoof detector.
private
build_compressed_character_script_data() : bool
Return values
bool
build_confusables()
Builds confusables data for the spoof detector.
private
build_confusables() : bool
Return values
bool
build_currencies()
Builds information about different currencies.
private
build_currencies() : bool
Return values
bool
build_func_array()
Helper for get_function_code_and_regex(). Builds the function's data array.
private
build_func_array(string &$func_code, array<string|int, mixed> $data, string $key_type, string $val_type) : void
Parameters
- $func_code : string
  The raw string that contains function code.
- $data : array<string|int, mixed>
  Data to format as an array.
- $key_type : string
  How to format the array keys.
- $val_type : string
  How to format the array values.
build_idna()
Builds maps and regex classes for IDNA purposes.
private
build_idna() : bool
Return values
bool
build_plurals()
Builds pluralization rules for different languages.
private
build_plurals() : bool
Return values
bool
build_quick_check()
Builds regular expressions for normalization quick check.
private
build_quick_check() : bool
Return values
bool
build_regex_identifier_status()
Builds regex to distinguish characters' Identifier_Status value.
private
build_regex_identifier_status() : bool
Return values
bool
build_regex_indic()
Builds regex classes for join control tests in utf8_sanitize_invisibles.
private
build_regex_indic() : bool
Specifically, for Indic scripts like Devanagari.
Return values
bool
build_regex_joining_type()
Builds regex classes for join control tests in utf8_sanitize_invisibles.
private
build_regex_joining_type() : bool
Specifically, for cursive scripts like Arabic.
Return values
bool
build_regex_properties()
Builds regular expression classes for extended Unicode properties.
private
build_regex_properties() : bool
Return values
bool
build_regex_variation_selectors()
Builds regular expression classes for filtering variation selectors.
private
build_regex_variation_selectors() : bool
Return values
bool
build_script_stats()
Helper function for build_regex_joining_type and build_regex_indic.
private
build_script_stats() : bool
Return values
bool
deltree()
Deletes a directory and its contents.
private
deltree(string $dir_path) : void
Parameters
- $dir_path : string
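Deleting a directory and its contents means removing children before their parents. The following is a minimal sketch of one common way to do that in PHP using the SPL iterators. The function name `delete_tree()` is hypothetical, and this is not necessarily how deltree() itself is implemented.

```php
<?php
/**
 * Hypothetical stand-in for deltree(): removes $dir_path and everything in it.
 */
function delete_tree(string $dir_path): void
{
    if (!is_dir($dir_path)) {
        return;
    }

    // CHILD_FIRST yields a directory's contents before the directory itself.
    $items = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir_path, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::CHILD_FIRST
    );

    foreach ($items as $item) {
        // Treat symlinks as files so the link is removed, not its target.
        if ($item->isDir() && !$item->isLink()) {
            rmdir($item->getPathname());
        } else {
            unlink($item->getPathname());
        }
    }

    rmdir($dir_path);
}
```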
fetch_unicode_file()
Fetches the contents of a Unicode data file.
private
fetch_unicode_file(string $filename, string $data_url) : string|bool
Caches a local copy for subsequent lookups.
Parameters
- $filename : string
  Name of a Unicode datafile, relative to $data_url.
- $data_url : string
  One of this class's DATA_URL_* constants.
Return values
string|bool —Path to locally saved copy of the file.
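The fetch-then-cache behavior described for fetch_unicode_file() can be sketched as follows. This is an illustration under stated assumptions: the function name `fetch_cached()`, the explicit `$temp_dir` argument, and the injected `$download` callable are all hypothetical, and returning `false` on failure is a guess at what the `bool` half of the return type means.

```php
<?php
/**
 * Hypothetical caching fetcher.
 *
 * $download is an assumed callable that takes a URL and returns the file
 * contents as a string, or false on failure.
 */
function fetch_cached(string $filename, string $data_url, string $temp_dir, callable $download): string|false
{
    $local_path = $temp_dir . '/' . $filename;

    // Reuse the local copy on subsequent lookups.
    if (is_file($local_path)) {
        return $local_path;
    }

    $contents = $download($data_url . '/' . $filename);

    if (!is_string($contents)) {
        return false;
    }

    // Names such as 'emoji/emoji-data.txt' need their subdirectory first.
    $dir = dirname($local_path);

    if (!is_dir($dir) && !mkdir($dir, 0777, true)) {
        return false;
    }

    return file_put_contents($local_path, $contents) === false ? false : $local_path;
}
```

Passing the downloader in as a callable keeps the sketch independent of any particular HTTP client and makes the caching step easy to exercise without network access.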
finalize_decomposition_forms()
Finalizes all the decomposition forms.
private
finalize_decomposition_forms() : bool
This is necessary because some characters decompose to other characters that themselves decompose further.
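The need for this step can be illustrated with a small recursive expansion. The sketch below assumes a map from integer code points to lists of code points; the function name `fully_decompose()` and that data layout are hypothetical and are not taken from this class.

```php
<?php
/**
 * Hypothetical sketch: expands every mapping until no part decomposes further.
 *
 * Keys are code points; values are lists of code points.
 */
function fully_decompose(array $maps): array
{
    $expand = function (int $cp) use (&$expand, $maps): array {
        // A code point without a mapping is already fully decomposed.
        if (!isset($maps[$cp])) {
            return [$cp];
        }

        $out = [];

        foreach ($maps[$cp] as $part) {
            array_push($out, ...$expand($part));
        }

        return $out;
    };

    $full = [];

    foreach (array_keys($maps) as $cp) {
        $full[$cp] = $expand($cp);
    }

    return $full;
}

// U+1E69 decomposes to U+1E63 U+0307, and U+1E63 decomposes to U+0073 U+0323.
$maps = [
    0x1E63 => [0x0073, 0x0323],
    0x1E69 => [0x1E63, 0x0307],
];

$full = fully_decompose($maps);
// $full[0x1E69] is [0x0073, 0x0323, 0x0307]
```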
Return values
bool
get_function_code_and_regex()
Builds complete code for the specified element in $this->funcs to be inserted into the relevant PHP file. Also builds a regex to check whether a copy of the function is already present in the file.
private
get_function_code_and_regex(string|int $func_name) : array<string|int, mixed>
Parameters
- $func_name : string|int
  Key of an element in $this->funcs. If an int is provided, it is considered raw code such as a header, and does not replace a function in the file.
Return values
array<string|int, mixed> —PHP code and a regular expression.
lookup_ucd_version()
Sets $this->ucd_version to latest version number of the UCD.
private
lookup_ucd_version() : bool
Return values
bool
make_temp_dir()
Makes a temporary directory to hold our working files, and sets $this->temp_dir to the path of the created directory.
private
make_temp_dir() : void
process_casing_data()
Processes SpecialCasing.txt and CaseFolding.txt in order to get finalized versions of all case conversion data.
private
process_casing_data() : bool
Return values
bool
process_derived_normalization_props()
Processes DerivedNormalizationProps.txt in order to populate $this->derived_normalization_props.
private
process_derived_normalization_props() : bool
Return values
bool
process_main_unicode_data()
Processes UnicodeData.txt in order to populate $this->char_data, $this->full_decomposition_maps, and the 'data' element of most elements of $this->funcs.
private
process_main_unicode_data() : bool
Return values
bool
should_update()
Compares version of SMF's local Unicode data with the latest release.
private
should_update() : bool
Return values
bool —Whether SMF should update its local Unicode data or not.
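A version comparison of this kind can be sketched with PHP's built-in `version_compare()`. The helper name `needs_update()` is hypothetical, and the real method may apply additional checks that this reference does not describe.

```php
<?php
/**
 * Hypothetical helper: true when the local data is older than the latest release.
 */
function needs_update(string $local_version, string $latest_version): bool
{
    return version_compare($local_version, $latest_version, '<');
}

// Example with four-part version strings, matching the '0.0.0.0' placeholder
// format shown in $funcs.
$stale = needs_update('15.0.0.0', '15.1.0.0');   // true
$current = needs_update('15.1.0.0', '15.1.0.0'); // false
```

Note that `version_compare()` treats `'15.1.0'` as lower than `'15.1.0.0'`, so both strings should use the same number of components before they are compared.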
smf_file_header()
Gets basic boilerplate for the PHP files that will be created.
private
smf_file_header() : string
Return values
string —Standard SMF file header.