Stripping unprintable Unicode variation characters in PHP

Emoji characters like 🥳, 🎉 and 💖 can be represented in a string in different ways. The first way is to include a single Unicode code point. For example, 💖 is code point U+1F496, which can be included in a PHP string either directly ('💖') or, from PHP 7 onwards, using the Unicode code point escape syntax in a double-quoted string: "\u{1F496}".

You can try these on the PHP REPL (php -a):

php > echo '💖';
💖
php > echo "\u{1F496}";
💖

Emoji characters might also be represented using a zero width joiner (ZWJ, U+200D), which lets you combine two or more separate code points so that they are displayed as a single character. For example, the rainbow flag emoji 🏳️‍🌈 can be produced by the sequence [Waving white flag] [ZWJ] [Rainbow] (flags are a common use-case for the ZWJ character).
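
To see what a ZWJ sequence looks like inside a PHP string, here is a minimal sketch that builds the rainbow flag from escapes and lists its code points. It assumes PHP 7.4+ with the mbstring extension. Note that the fully qualified rainbow flag sequence also carries a variation selector (U+FE0F) after the white flag, so it is four code points rather than three.

```php
<?php

// White flag, variation selector-16, zero width joiner, rainbow.
$flag = "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}";

// Split into individual code points and format each as U+XXXX.
$codePoints = array_map(
    fn (string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
    mb_str_split($flag, 1, 'UTF-8')
);

echo implode(' ', $codePoints), PHP_EOL; // U+1F3F3 U+FE0F U+200D U+1F308
```

A renderer that understands the sequence draws one glyph; one that does not will draw the flag and the rainbow separately, or a placeholder for the parts it cannot display.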

See also: Fun Emoji Hacks: Zero Width Joiners.

As well as zero width joiners, there are also Unicode variation selectors. These can be used in conjunction with ZWJ sequences “where one or more characters in the sequence have text and emoji presentation”.

For example, the 💖 emoji above might end up with an invisible variation selector such as U+FE0F attached (&#xFE0F;💖 in HTML) by the time it makes it into your system.

If you’re trying to print text containing these sequences, or render them into an image, this can lead to unwanted question mark (?) characters appearing in the rendered text, e.g. ?💖.
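
Because variation selectors are invisible, it can help to confirm they are really there before blaming the renderer. The following is a small sketch, using a hypothetical input string, that dumps the raw bytes and counts the selectors in the U+FE00 to U+FE0F range:

```php
<?php

// Hypothetical input: U+FE0F (variation selector-16) followed by U+1F496.
$text = "\u{FE0F}\u{1F496}";

// The /u modifier makes PCRE treat the pattern and subject as UTF-8,
// so the character class matches whole code points rather than bytes.
$count = preg_match_all('/[\x{FE00}-\x{FE0F}]/u', $text);

echo bin2hex($text), PHP_EOL; // efb88ff09f9296
echo $count, PHP_EOL;         // 1
```

The leading ef b8 8f bytes are the UTF-8 encoding of U+FE0F; the remaining four bytes are the emoji itself.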

You can eliminate all or most of these in PHP like this:

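<?php

// Normalizer needs the intl extension; the leading backslash makes the
// call work from both namespaced and global code.
$normalized = \Normalizer::normalize(trim($text));

// Strip variation selectors (U+FE00 to U+FE0F) and keep the result.
$clean = preg_replace('/[\x{FE00}-\x{FE0F}]/u', '', $normalized);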

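That should strip the variation selectors (U+FE00 to U+FE0F) and leave a printable string. Note that the pattern does not match the zero width joiner itself (U+200D); add \x{200D} to the character class if you need to remove that too, bearing in mind that a ZWJ sequence will then print as its separate component emoji.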
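
If you need this in more than one place, the same idea can be wrapped in a function. This is only a sketch: the name stripVariationSelectors and the $stripZwj flag are made up for illustration, and it falls back to the trimmed input when the intl extension is missing or normalization fails.

```php
<?php

// Hypothetical helper: removes variation selectors, optionally ZWJ as well.
function stripVariationSelectors(string $text, bool $stripZwj = false): string
{
    $trimmed = trim($text);

    // Normalizer::normalize() returns false on failure (e.g. invalid UTF-8).
    $normalized = class_exists('Normalizer')
        ? \Normalizer::normalize($trimmed)
        : $trimmed;
    if ($normalized === false) {
        $normalized = $trimmed;
    }

    $pattern = $stripZwj
        ? '/[\x{FE00}-\x{FE0F}\x{200D}]/u'
        : '/[\x{FE00}-\x{FE0F}]/u';

    // preg_replace() returns null on error; keep the normalized text then.
    return preg_replace($pattern, '', $normalized) ?? $normalized;
}

$heart = stripVariationSelectors(" \u{FE0F}\u{1F496} ");
$flag  = stripVariationSelectors("\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}", true);
```

Here $heart ends up as the bare U+1F496, and $flag becomes the white flag followed by the rainbow as two separate emoji, which is usually easier for a limited font or printer to handle than the joined sequence.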

We use this to be able to print emoji characters that customers add to personal gift messages in Pop Robin Cards.

