Recently I ran into a very strange case while processing a simple XML file:
- the file was downloaded from an external server,
- the response had no header declaring the character encoding, but the browser (Chrome) could display it correctly,
- the file was read with the file_get_contents function.
The issue was that preg_match couldn’t find anything and str_replace was not… replacing :) Also, I couldn’t build a proper XML object inside the PHP script.
After checking the code I found that it was probably caused by a problem with detecting the character encoding. That’s why I used mb_detect_encoding, simple, right? …Partially :)
Unable to detect character encoding
Time to read some documentation.
mb_detect_encoding ( string $str [, mixed $encoding_list = mb_detect_order() [, bool $strict = FALSE ]] ) : string
We can pass “auto” as the second argument, which is expanded according to the value of mbstring.language. So… what is it exactly?
print ini_get('mbstring.language');
prints: neutral, which on my server resolves to UTF-8. So basically our mb_detect_encoding can detect UTF-8 only :)
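To see what the default detection list actually is on a given server, something like the following can help. It is only a quick sketch: it assumes the mbstring extension is loaded, and the printed values will differ between installations.

```php
<?php
// Print the configured language and the encodings that mb_detect_encoding
// tries when no explicit list is passed. Output depends on the server setup.
echo 'mbstring.language: ', ini_get('mbstring.language'), PHP_EOL;

// Called without arguments, mb_detect_order() returns the current list.
$order = mb_detect_order();
echo 'detect order: ', implode(', ', $order), PHP_EOL;
```

If the encoding you expect is missing from that list, detection with the defaults has no chance of finding it.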
Solution:
Pass a list of the encodings the text could be in. For example:
$encodings = [
    'CP1251',
    'UCS-2LE',
    'UCS-2BE',
    'UTF-8',
    'UTF-16',
    'UTF-16BE',
    'UTF-16LE',
    'UTF-32',
    'CP866',
];
mb_detect_encoding($content, $encodings, true);
Important: don’t forget to use strict mode (3rd parameter)! Without it the result can be misleading.
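Once the encoding is known, the content can be converted to UTF-8 so that preg_match, str_replace and the XML parser see what they expect. Below is a minimal sketch of that step. The helper name convertToUtf8 and the sample string are my own assumptions for illustration; only mb_detect_encoding and mb_convert_encoding are real mbstring functions.

```php
<?php
// Hypothetical helper (not part of mbstring): detect the encoding in strict
// mode and, if one of the candidates matches, convert the content to UTF-8.
function convertToUtf8(string $content, array $encodings): ?string
{
    $detected = mb_detect_encoding($content, $encodings, true);
    if ($detected === false) {
        return null; // none of the candidate encodings matched
    }
    return mb_convert_encoding($content, 'UTF-8', $detected);
}

// Sample input for illustration: a small XML snippet stored as UTF-16LE.
$content = mb_convert_encoding('<name>Zoë</name>', 'UTF-16LE', 'UTF-8');
$utf8 = convertToUtf8($content, ['UTF-8', 'UTF-16LE']);

if ($utf8 !== null) {
    // String functions can now work on the converted text.
    $found = preg_match('/<name>/', $utf8);
}
```

Returning null on failure keeps the "could not detect" case explicit instead of silently working on bytes in an unknown encoding.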
Alternative solution
Use bash! After saving the XML file to disk, we can run the file command:
$ file file.xml
and see:
file.xml: Little-endian UTF-16 Unicode text, with very long lines, with no line terminators
And we have it! UTF-16.
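When a shell is not available, a rough equivalent can be sketched in PHP by looking for a byte order mark (BOM) at the start of the content. This assumes the file actually begins with a BOM, which the case above does not confirm, and the helper name bomToUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: if the content starts with a known BOM, strip it and
// convert the rest to UTF-8. Returns null when no BOM is found.
function bomToUtf8(string $bytes): ?string
{
    // Longer BOMs come first, because the UTF-32LE BOM begins with the
    // same two bytes as the UTF-16LE BOM.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            $body = (string) substr($bytes, strlen($bom));
            return mb_convert_encoding($body, 'UTF-8', $encoding);
        }
    }
    return null; // no BOM: fall back to mb_detect_encoding with a list
}
```

Files without a BOM still need the list-based detection shown earlier, so this check works best as a first step rather than a replacement.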