[PHP] Unable to detect character encoding – how to detect it properly?

Recently I ran into a very strange case while processing a simple XML file.

  1. the file was downloaded from an external server,
  2. the response headers carried no charset information (nothing like a charset in Content-Type), but the browser (Chrome) could display the file correctly,
  3. the file was read with the file_get_contents function.

The issue was that preg_match couldn’t find anything and str_replace was not… replacing :) Also, I couldn’t build a proper XML object inside the PHP script.

After checking the code, I found that the cause was probably the character encoding. That’s why I used mb_detect_encoding. Simple, right? …Partially :)

Unable to detect character encoding

Time to read some documentation.

mb_detect_encoding ( string $str [, mixed $encoding_list = mb_detect_order() [, bool $strict = FALSE ]] ) : string

We can pass “auto” as the second argument, which is expanded according to the value of mbstring.language. So… what is it exactly?

print ini_get('mbstring.language');

prints: neutral, which on my server resolves to UTF-8. So basically our mb_detect_encoding can detect only UTF-8 :)
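To see why detection comes back empty-handed, it helps to print the detect order and try a strict detection against it. This is only a sketch: the sample bytes are made up for illustration, and the explicit ['ASCII', 'UTF-8'] list is an assumption standing in for whatever a neutral mbstring.language resolves to on your server.

```php
<?php
// Encodings mb_detect_encoding() tries when no list is passed.
print_r(mb_detect_order());

// Made-up sample: "<a/>" stored as UTF-16LE, preceded by a byte order mark (FF FE).
$sample = "\xFF\xFE<\x00a\x00/\x00>\x00";

// The byte 0xFF is valid in neither ASCII nor UTF-8,
// so strict detection has no candidate left to return.
$result = mb_detect_encoding($sample, ['ASCII', 'UTF-8'], true);
var_dump($result);
```

If the list printed by mb_detect_order() does not contain the encoding the file really uses, mb_detect_encoding has no way to report it.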

Solution:

Pass an explicit list of the encodings the text could be in. For example:

$encodings = [
  'CP1251',
  'UCS-2LE',
  'UCS-2BE',
  'UTF-8',
  'UTF-16',
  'UTF-16BE',
  'UTF-16LE',
  'UTF-32',
  'CP866',
];
// Keep the result: the detected encoding name, or false if nothing matched.
$encoding = mb_detect_encoding($content, $encodings, true);

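Important: don’t forget to use strict mode (3rd parameter)! Without it, the function returns the closest match even when the content is not valid in any of the listed encodings, so the result can be misleading. With strict mode it returns false instead.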

Alternative solution

Use the shell! After saving the XML file, we can run the file command:

$ file file.xml

and see:

file.xml: Little-endian UTF-16 Unicode text, with very long lines, with no line terminators

And we have it! UTF-16, little-endian (UTF-16LE).
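Once the encoding is known, the content still has to be converted to UTF-8 before preg_match and str_replace behave as expected. Below is a minimal sketch that does this from inside PHP, assuming the file starts with a byte order mark (BOM). The helper name convert_bom_content_to_utf8 and the sample bytes are hypothetical, made up for this example; only substr, strlen, mb_convert_encoding and preg_match are real PHP functions.

```php
<?php
// Hypothetical helper (not part of PHP): detect the encoding from a leading BOM
// and return the content converted to UTF-8, or null when no BOM is present.
function convert_bom_content_to_utf8(string $content): ?string
{
    // 4-byte BOMs come first: the UTF-32LE BOM begins with the UTF-16LE BOM.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (substr($content, 0, strlen($bom)) === $bom) {
            // Drop the BOM, then convert the remaining bytes to UTF-8.
            return mb_convert_encoding(substr($content, strlen($bom)), 'UTF-8', $encoding);
        }
    }
    return null; // no BOM found, fall back to another detection method
}

// Made-up sample: "<a/>" as UTF-16LE with a BOM, as file_get_contents might return it.
$raw = "\xFF\xFE<\x00a\x00/\x00>\x00";
$xml = convert_bom_content_to_utf8($raw);

// After conversion, byte-oriented functions such as preg_match can find the tag.
var_dump(preg_match('/<a\/>/', $xml));
```

When the helper returns null, there is no BOM to go by, and the explicit encoding list with strict mode is still the fallback.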