Recently I ran into a very strange case while processing a simple XML file:
- the file was downloaded from the external server,
- there was no response header about content-encoding but the browser (Chrome) could display it correctly,
- the file was read with the file_get_contents function.
The issue was that preg_match couldn’t find anything and str_replace was not… replacing :) Also, I couldn’t build a proper XML object inside the PHP script.
After checking the code, I found it was probably caused by a problem with the character encoding. That’s why I reached for mb_detect_encoding. Simple, right? …Partially :)
Unable to detect character encoding
Time to read some documentation.
mb_detect_encoding ( string $str [, mixed $encoding_list = mb_detect_order() [, bool $strict = FALSE ]] ) : string
We can pass “auto” as the second argument, which is expanded according to the value of mbstring.language. So… what is it exactly?
On my server the setting is neutral, which resolves to UTF-8. So basically our mb_detect_encoding can detect UTF-8 only :)
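For reference, here is a minimal sketch of how these settings can be inspected. It is not necessarily the snippet I ran back then, and the values it prints depend on your php.ini:

```php
<?php
// The raw ini setting that "auto" is expanded from ("neutral" on my server).
echo ini_get('mbstring.language'), PHP_EOL;

// The language mbstring is currently using.
echo mb_language(), PHP_EOL;

// The encodings tried when no explicit list is passed to mb_detect_encoding.
print_r(mb_detect_order());
```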
Use a list of encodings the text could be encoded with. For example:
$encodings = [
    'CP1251',
    'UCS-2LE',
    'UCS-2BE',
    'UTF-8',
    'UTF-16',
    'UTF-16BE',
    'UTF-16LE',
    'UTF-32',
    'CP866',
];
mb_detect_encoding($content, $encodings, true);
Important: don’t forget to use strict mode (3rd parameter)! Without it the result can be misleading.
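One more thing worth handling: in strict mode mb_detect_encoding returns false when none of the candidates fits. A small sketch of a wrapper that makes this explicit; the name detectEncoding is made up for illustration, and which encoding wins for ambiguous input can differ between PHP versions:

```php
<?php
/**
 * Hypothetical wrapper: returns the detected encoding name,
 * or null when no candidate fits in strict mode.
 */
function detectEncoding(string $content, array $candidates): ?string
{
    $detected = mb_detect_encoding($content, $candidates, true);

    return $detected === false ? null : $detected;
}

// "\xFF" can never appear in valid UTF-8, so nothing is detected here.
var_dump(detectEncoding("\xFF", ['UTF-8'])); // NULL
```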
Use bash! After saving this XML file we can use the file command:
$ file file.xml
file.xml: Little-endian UTF-16 Unicode text, with very long lines, with no line terminators
And we have it! UTF-16.
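Once the encoding is known, the content can be converted to UTF-8 so that preg_match and str_replace behave again. Below is a rough sketch that assumes the file starts with a byte order mark (BOM), which is an assumption on my side and not guaranteed for every file. The helper name bomToUtf8 and the sample string are made up for illustration:

```php
<?php
/**
 * Hypothetical helper: convert text to UTF-8 based on its BOM.
 * Returns the input unchanged when no known BOM is found.
 */
function bomToUtf8(string $content): string
{
    // UTF-32 goes first: its little-endian BOM starts with the UTF-16LE one.
    $boms = [
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFF\xFE"         => 'UTF-16LE',
        "\xFE\xFF"         => 'UTF-16BE',
    ];

    foreach ($boms as $bom => $encoding) {
        if (strncmp($content, $bom, strlen($bom)) === 0) {
            // Drop the BOM bytes, then convert the rest.
            $body = substr($content, strlen($bom));

            return mb_convert_encoding($body, 'UTF-8', $encoding);
        }
    }

    return $content;
}

// Build a UTF-16LE sample with a BOM, similar to the downloaded file.
$utf16 = "\xFF\xFE" . mb_convert_encoding("<name>caf\u{E9}</name>", 'UTF-16LE', 'UTF-8');

var_dump(preg_match('/<name>/', $utf16));            // int(0)
var_dump(preg_match('/<name>/', bomToUtf8($utf16))); // int(1)
```

The first preg_match fails because every character in the UTF-16 string is followed by a zero byte, which is exactly the kind of silent mismatch described at the beginning.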