https://github.com/neitanod/forceutf8
# ForceUTF8

ForceUTF8 is a PHP library that converts strings to UTF-8 regardless of their original encoding. It handles mixed-encoded strings containing Latin1 (ISO 8859-1), Windows-1252, and UTF-8 characters, detecting and converting the non-UTF-8 bytes without corrupting content that is already valid UTF-8.

The library solves a common problem: applying PHP's native `utf8_encode()` to a string that is already UTF-8 produces garbled output. ForceUTF8 provides static methods that handle the conversion safely, including repairing double-encoded or multiply-encoded UTF-8 strings that were garbled by repeated encoding passes.

## Encoding::toUTF8

Converts any string to UTF-8, detecting whether the input is Latin1 (ISO 8859-1), Windows-1252, UTF-8, or a mix of these. Valid UTF-8 sequences are left unchanged and all other bytes are converted. The method also accepts arrays and converts every string value recursively.

```php
<?php
require 'vendor/autoload.php'; // assumes the library was installed with Composer

use \ForceUTF8\Encoding;

// Convert a Latin1-encoded string to UTF-8
$latin1_string = "Caf\xe9"; // "Café" in Latin1
$utf8_string = Encoding::toUTF8($latin1_string);
echo $utf8_string;
// Output: Café

// Strings that are already UTF-8 pass through unchanged
$already_utf8 = "Fédération Camerounaise de Football";
$result = Encoding::toUTF8($already_utf8);
echo $result;
// Output: Fédération Camerounaise de Football

// Mixed-encoding strings are handled correctly
$mixed = "Caf\xe9 " . "résumé"; // Latin1 + UTF-8 mixed
$normalized = Encoding::toUTF8($mixed);
echo $normalized;
// Output: Café résumé

// Arrays are processed recursively
$data = array(
    "name" => "Jos\xe9",        // Latin1
    "city" => "São Paulo",      // UTF-8
    "nested" => array(
        "text" => "Stra\xdfe"   // Latin1 (German "Straße")
    )
);
$converted = Encoding::toUTF8($data);
print_r($converted);
// Output:
// Array
// (
//     [name] => José
//     [city] => São Paulo
//     [nested] => Array
//         (
//             [text] => Straße
//         )
// )
```

## Encoding::fixUTF8

Repairs garbled UTF-8 strings that have been double-encoded or multiply-encoded. This commonly happens when UTF-8 data is mistakenly treated as Latin1 and encoded again. The method decodes and re-encodes repeatedly until the string stops changing, and accepts optional iconv flags that control how special characters are handled.

```php
<?php
require 'vendor/autoload.php'; // assumes the library was installed with Composer

use \ForceUTF8\Encoding;

// Fix double-encoded UTF-8 strings
$garbled1 = "FÃ©dÃ©ration Camerounaise de Football";
echo Encoding::fixUTF8($garbled1);
// Output: Fédération Camerounaise de Football

// Fix triple-encoded strings
$garbled2 = "FÃƒÂ©dÃƒÂ©ration Camerounaise de Football";
echo Encoding::fixUTF8($garbled2);
// Output: Fédération Camerounaise de Football

// Fix quadruple-encoded strings
$garbled3 = "FÃƒÆ’Ã‚Â©dÃƒÆ’Ã‚Â©ration Camerounaise de Football";
echo Encoding::fixUTF8($garbled3);
// Output: Fédération Camerounaise de Football

// Handle extreme multiple encoding (five passes)
$garbled4 = "FÃƒÆ’Ã†â€™Ãƒâ€šÃ‚Â©dÃƒÆ’Ã†â€™Ãƒâ€šÃ‚Â©ration Camerounaise de Football";
echo Encoding::fixUTF8($garbled4);
// Output: Fédération Camerounaise de Football

// Using iconv options for special characters (Windows-1252 specific)
$str_with_dash = "Fédération Camerounaise—de—Football"; // U+2014 em dash
echo Encoding::fixUTF8($str_with_dash); // May break the em dash
// Output: Fédération Camerounaise?de?Football

echo Encoding::fixUTF8($str_with_dash, Encoding::ICONV_IGNORE); // Preserves the em dash
// Output: Fédération Camerounaise—de—Football

echo Encoding::fixUTF8($str_with_dash, Encoding::ICONV_TRANSLIT); // Transliterates if needed
// Output: Fédération Camerounaise—de—Football

// Handle characters outside ISO 8859-1 / Windows-1252
$baltic_chars = "čęėįšųūž";
echo Encoding::fixUTF8($baltic_chars, Encoding::ICONV_TRANSLIT);
// Output: ceeišuuž (transliterated where possible)
```

## Encoding::toLatin1

Converts UTF-8 strings back to a single-byte encoding. The library treats Latin1 (ISO 8859-1) and Windows-1252 as interchangeable targets here, although strictly speaking Windows-1252 differs from ISO 8859-1 in the byte range 0x80-0x9F. This is useful when interfacing with legacy systems that require single-byte encoding. The method has the aliases `toWin1252()` and `toISO8859()`.

```php
<?php
require 'vendor/autoload.php'; // assumes the library was installed with Composer

use \ForceUTF8\Encoding;

// Convert UTF-8 to Latin1
$utf8_string = "Café résumé";
$latin1 = Encoding::toLatin1($utf8_string);
echo bin2hex($latin1);
// Output: 436166e92072e973756de9 (one byte per character)

// Using alias methods (all equivalent)
$result1 = Encoding::toLatin1($utf8_string);
$result2 = Encoding::toWin1252($utf8_string);
$result3 = Encoding::toISO8859($utf8_string);
// All three produce identical output

// Process arrays
$data = array("José", "María", "Niño");
$latin1_array = Encoding::toLatin1($data);
// Converts all array elements to Latin1

// With iconv options for better character handling
$text = "Smart quotes: “Hello”";
$latin1_ignore = Encoding::toLatin1($text, Encoding::ICONV_IGNORE);
$latin1_translit = Encoding::toLatin1($text, Encoding::ICONV_TRANSLIT);
```

## Encoding::encode

A convenience method that normalizes an encoding label and converts text to the corresponding target encoding. It accepts several spellings of each encoding name (UTF8, utf-8, LATIN1, ISO88591, WIN1252) and standardizes them before conversion.
```php
<?php
require 'vendor/autoload.php'; // assumes the library was installed with Composer

use \ForceUTF8\Encoding;

// Convert to UTF-8 using various label formats
$text = "Caf\xe9";
echo Encoding::encode('UTF-8', $text); // Standard format
echo Encoding::encode('UTF8', $text);  // Without hyphen
echo Encoding::encode('utf', $text);   // Abbreviated
// All output: Café

// Convert to Latin1 using various label formats
$utf8_text = "Café";
Encoding::encode('ISO-8859-1', $utf8_text);  // Standard ISO format
Encoding::encode('LATIN1', $utf8_text);      // Latin alias
Encoding::encode('WIN1252', $utf8_text);     // Windows code page name
Encoding::encode('WINDOWS1252', $utf8_text); // Full Windows name
// All produce equivalent Latin1 output
```

## Encoding::UTF8FixWin1252Chars

Fixes UTF-8 strings that were converted from Windows-1252 as if they were ISO 8859-1. This addresses the specific case where the Windows-1252 special characters (byte values 0x80-0x9F) were not mapped to their proper Unicode characters during the initial conversion.

```php
<?php
require 'vendor/autoload.php'; // assumes the library was installed with Composer

use \ForceUTF8\Encoding;

// Fix Windows-1252 specific characters that were mishandled
// Characters like €, „, …, †, ‡, ˆ, ‰, Š, ‹, Œ, Ž, ‘, ’, “, ”, •, –, —, ˜, ™, š, ›, œ, ž, Ÿ
$broken_text = "Price: \xc2\x80 100"; // Broken euro sign
$fixed = Encoding::UTF8FixWin1252Chars($broken_text);
echo $fixed;
// Output: Price: € 100

// Fix trademark and other special characters
$text_with_tm = "Product\xc2\x99"; // Broken trademark sign
$fixed_tm = Encoding::UTF8FixWin1252Chars($text_with_tm);
echo $fixed_tm;
// Output: Product™
```

## Encoding::removeBOM

Removes the UTF-8 byte order mark (BOM) from the beginning of a string if present. The BOM (bytes EF BB BF) is sometimes added by text editors and can cause problems with JSON parsing, HTTP headers, or string comparisons.

```php
<?php
require 'vendor/autoload.php'; // assumes the library was installed with Composer

use \ForceUTF8\Encoding;

// Remove BOM from string
$with_bom = "\xef\xbb\xbf" . "Hello World";
$clean = Encoding::removeBOM($with_bom);
echo $clean;
// Output: Hello World

// Safe to call on strings without a BOM
$no_bom = "Hello World";
$result = Encoding::removeBOM($no_bom);
echo $result;
// Output: Hello World

// Useful when reading files that may start with a BOM
$file_content = file_get_contents('data.txt');
$clean_content = Encoding::removeBOM($file_content);
$json = json_decode($clean_content);
// Now JSON parsing won't fail because of the BOM
```

## Encoding Constants

The library provides constants that control iconv behavior in the `fixUTF8()` and `toLatin1()` methods. They determine how characters outside the target character set are handled.

```php
<?php
require 'vendor/autoload.php'; // assumes the library was installed with Composer

use \ForceUTF8\Encoding;

// Available constants:
// Encoding::WITHOUT_ICONV  - Default, don't use iconv (fastest, but may break some chars)
// Encoding::ICONV_TRANSLIT - Transliterate characters that can't be represented
// Encoding::ICONV_IGNORE   - Silently discard characters that can't be represented

// Example with different modes
$text = "Ąąą"; // Polish characters not in Latin1

// Without iconv - may produce incorrect results
$result1 = Encoding::fixUTF8($text, Encoding::WITHOUT_ICONV);

// With transliteration - converts to the nearest ASCII equivalent
$result2 = Encoding::fixUTF8($text, Encoding::ICONV_TRANSLIT);
echo $result2;
// Output: Aaa

// With ignore - removes unsupported characters
$result3 = Encoding::fixUTF8($text, Encoding::ICONV_IGNORE);
echo $result3;
// Output: (empty or partial)
```

## Summary

ForceUTF8 is aimed at PHP applications that process text from unreliable sources where a consistent encoding cannot be guaranteed. Common use cases include importing data from legacy databases with mixed encodings, processing user input from various locales, cleaning up data from web scraping, handling email content with incorrect charset headers, and normalizing text before storage in UTF-8 databases.
The library is particularly valuable when integrating with third-party APIs or services that may return inconsistently encoded responses.

Integration is straightforward via Composer (`neitanod/forceutf8`). The static method interface requires no instantiation: call `Encoding::toUTF8()` on any string or array. For batch processing, the automatic array handling removes the need for manual iteration. When working with data that may have been double-encoded (common in database migrations or multi-system data flows), `Encoding::fixUTF8()` repairs it without requiring knowledge of how many encoding passes occurred.
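
To make the double-encoding problem concrete, the following is a minimal sketch that reproduces it and undoes it with PHP's mbstring functions alone. It is not ForceUTF8's implementation: `misencode()` and `naive_repair()` are hypothetical helpers written for this illustration, the sketch assumes the mbstring extension is loaded and the file is saved as UTF-8, and it only covers plain ISO 8859-1 round trips (none of the Windows-1252 special characters).

```php
<?php
// Illustrative sketch only; not ForceUTF8's implementation.
// Assumes the mbstring extension is loaded and this file is saved as UTF-8.

// Hypothetical helper: reinterpret the bytes of $s as ISO 8859-1 and encode
// them as UTF-8. Applied to text that is already UTF-8, this is the mistake
// that produces double-encoded output.
function misencode(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical helper: undo misencode() passes until no further pass applies.
function naive_repair(string $s, int $maxPasses = 10): string
{
    for ($i = 0; $i < $maxPasses; $i++) {
        $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');

        // Pure ASCII (or otherwise unchanged): nothing left to undo.
        if ($decoded === $s) {
            break;
        }
        // The decode lost information (characters outside ISO 8859-1),
        // so $s cannot be the result of a misencode() pass.
        if (misencode($decoded) !== $s) {
            break;
        }
        // The decoded bytes are not valid UTF-8, so $s was already clean.
        if (!mb_check_encoding($decoded, 'UTF-8')) {
            break;
        }
        $s = $decoded;
    }
    return $s;
}

$clean  = "Fédération";
$double = misencode($clean);   // each "é" becomes "Ã©"
$triple = misencode($double);  // one more pass of the same mistake

var_dump($double === "F\xc3\x83\xc2\xa9d\xc3\x83\xc2\xa9ration"); // bool(true)
var_dump(naive_repair($triple) === $clean);                       // bool(true)
var_dump(naive_repair("plain ASCII") === "plain ASCII");          // bool(true)
```

The loop is a heuristic: text that legitimately contains a sequence such as "Ã©" would be decoded as well, which is why a repair of this kind is best reserved for data that is known to be garbled.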