Avoid naive string indexing in PHP

2015-04-23 (Thursday) | 300 words (~2 minutes reading)

PHP allows treating a string as an array, so that you can use indexing syntax to get or set a single position in the string:

<?php

echo 'hello'[0];
// h

$foo = 'bar';
$foo[2] = 'z';
echo $foo;
// baz

This seems handy, but you should never do it.

The reason to avoid string indexing like this is that PHP strings are not sequences of characters; they are just sequences of bytes. The index refers to a byte offset, not a character position.

You won’t notice this with plain ASCII strings like the above, because each ASCII character is exactly one byte, so byte offsets and character positions coincide.

As soon as you get a string containing multi-byte characters, which you will now that almost everything is UTF-8 and internationalised, that kind of naive string indexing will break.

<?php

echo '葛修远'[0];
// '�'

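// You might expect to get '葛' here, but you won't, as it's multi-byte.
// Instead you get a single byte (0xE8), the first of the three bytes in the
// UTF-8 encoding of '葛'. On its own that byte is not valid UTF-8, so it is
// typically displayed as the replacement character '�'.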
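
To see what the index is actually addressing, it can help to compare byte counts with character counts. This is a small sketch, assuming the source file is saved as UTF-8 and the mbstring extension is loaded; the variable name $s is only for illustration.

```php
<?php

$s = '葛修远';

// strlen() counts bytes; mb_strlen() counts characters (code points).
echo strlen($s), "\n";                             // 9
echo mb_strlen($s, 'UTF-8'), "\n";                 // 3

// Indexing returns one byte: the first byte of the first character.
echo bin2hex($s[0]), "\n";                         // e8

// The whole first character is three bytes long.
echo bin2hex(mb_substr($s, 0, 1, 'UTF-8')), "\n";  // e8919b
```

Each of the three characters takes three bytes in UTF-8, so $s[0] through $s[2] together make up '葛', and no single index returns a complete character.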

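Compare this to JavaScript, whose strings are sequences of UTF-16 code units rather than bytes. Indexing works for characters like these, which each fit in a single code unit (though characters outside the Basic Multilingual Plane, such as most emoji, are still split in two):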

"葛修远"[0];
// "葛"

The correct way to handle this in PHP is to avoid naive string indexing entirely and use the mb_ functions from the mbstring extension instead, in this case mb_substr:

<?php

echo mb_substr('葛修远', 0, 1);
// '葛'
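
The same idea covers the "set a single position" case from the first example. The following is a minimal sketch, not a standard API: charAt and withCharAt are hypothetical helper names, and it assumes UTF-8 input, a non-negative index, the mbstring extension, and PHP 7 or later for the type declarations. The encoding is passed explicitly so the result does not depend on the configured internal encoding.

```php
<?php

/**
 * Hypothetical helper: return the character at $index,
 * or '' if $index is past the end of the string.
 */
function charAt(string $s, int $index): string
{
    return mb_substr($s, $index, 1, 'UTF-8');
}

/**
 * Hypothetical helper: return a copy of $s with the character
 * at $index replaced by $char. The original string is not modified.
 */
function withCharAt(string $s, int $index, string $char): string
{
    return mb_substr($s, 0, $index, 'UTF-8')
        . $char
        . mb_substr($s, $index + 1, null, 'UTF-8'); // null length = to the end
}

echo charAt('葛修远', 0), "\n";           // 葛
echo withCharAt('bar', 2, 'z'), "\n";     // baz
echo withCharAt('葛修远', 1, '志'), "\n";  // 葛志远
```

Note that the mb_ functions work on code points, so these helpers are slower than byte indexing; that cost is the price of getting whole characters back.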

Unfortunately, neither of the most popular linters for PHP, PHPMD and PHPCS, seems to have a standard rule for banning naive string indexing. Such a rule would be handy for automatically rejecting it in a codebase.

NotesToSelf.Dev
