[Discussion] Introducing str_mask() for partial string masking

3 hours ago by sepehrphpr@gmail.com

Hi everyone,

I would like to propose a new native utility function: str_mask().

The Problem:

Developers frequently need to mask sensitive information (like phone
numbers, email addresses, or token IDs) before displaying them or
logging them. Currently, this is achieved in userland using various
combinations of substr(), str_repeat(), or preg_replace(). These
implementations are often error-prone, especially when dealing with
multibyte character encodings.
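
For illustration, here is a minimal sketch of the kind of userland helper described above, assuming PHP 8. The name userland_str_mask() and the choice to clamp out-of-range offsets are assumptions made for this example, not part of the proposal. Because substr() counts bytes, the second call cuts a two-byte UTF-8 character in half.

```php
<?php
// Hypothetical userland helper built from substr() and str_repeat().
// Offsets are byte offsets; out-of-range values are clamped (an assumption).
function userland_str_mask(string $string, int $start, int $length, string $mask_char = '*'): string
{
    $total  = strlen($string);
    $start  = max(0, min($start, $total));
    $length = max(0, min($length, $total - $start));

    return substr($string, 0, $start)
        . str_repeat($mask_char, $length)
        . substr($string, $start + $length);
}

// ASCII input behaves as expected.
echo userland_str_mask('1234567890', 3, 4), "\n"; // 123****890

// "é" is two bytes in UTF-8 (C3 A9). Masking one "character" at offset 3
// replaces only the first byte and leaves a stray A9 behind.
echo bin2hex(userland_str_mask("caf\u{E9}", 3, 1)), "\n"; // 6361662aa9
```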

The Proposal:

I am suggesting a simple, native helper:

str_mask(string $string, int $start, int $length, string $mask_char = '*'): string

Why native?

Consistency: Providing a reliable, standard way to mask strings across projects.
Multibyte Support: Unlike custom userland implementations that might
break on non-Latin strings (like Persian/Arabic/CJK), a native
implementation can seamlessly handle multibyte characters.

Example:

// Standard usage (ASCII)
echo str_mask("1234567890", 3, 4, "*");
// Output: 123****890

// Multibyte support (UTF-8)
echo str_mask("Internalization", 2, 5, "#");
// Output: In#####lization

I have a draft implementation ready for review. I would appreciate
feedback on whether this utility fits within the scope of the PHP
core, or if there are specific concerns regarding such an addition.

Best regards,

Sepehr

3 hours ago by Rowan Tommins [IMSoP]

> Hi everyone,
>
> I would like to propose a new native utility function: str_mask().

I can definitely see the use case for this function - in fact, I was just reviewing a change which could have used it.

> Multibyte Support: Unlike custom userland implementations that might
> break on non-Latin strings (like Persian/Arabic/CJK), a native
> implementation can seamlessly handle multibyte characters.

As a rule, PHP's str_* functions operate on byte strings with no knowledge of encoding. Multibyte support would belong in the mbstring or intl extensions, which have conventions for specifying the encoding in use.
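
As a rough sketch of what an mbstring-style variant might look like, assuming PHP 8 with the mbstring extension loaded: the name mb_str_mask_sketch() is hypothetical, and the trailing $encoding parameter simply mirrors the convention used by existing mb_* functions such as mb_substr().

```php
<?php
// Hypothetical sketch; requires ext-mbstring. Offsets count code points
// in the given encoding rather than bytes.
function mb_str_mask_sketch(
    string $string,
    int $start,
    int $length,
    string $mask_char = '*',
    ?string $encoding = null
): string {
    // Fall back to the configured internal encoding, as mb_* functions do.
    $encoding ??= mb_internal_encoding();

    $total  = mb_strlen($string, $encoding);
    $start  = max(0, min($start, $total));
    $length = max(0, min($length, $total - $start));

    return mb_substr($string, 0, $start, $encoding)
        . str_repeat($mask_char, $length)
        . mb_substr($string, $start + $length, null, $encoding);
}

// The two-byte "é" is now treated as a single unit.
echo mb_str_mask_sketch("caf\u{E9}s", 3, 1, '*', 'UTF-8'), "\n"; // caf*s
```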

In fact, a correct Unicode implementation would need to operate on "graphemes", using the bindings to ICU in the intl extension. Otherwise, it would incorrectly handle things like combining diacritics and emoji variation selectors.
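
To make the grapheme point concrete, here is a minimal sketch assuming PHP 8 with the intl extension loaded and UTF-8 input. The name grapheme_str_mask_sketch() is hypothetical; grapheme_strlen() and grapheme_substr() are the existing intl functions. The sample string spells "é" as "e" followed by U+0301 (combining acute accent), which is two code points but one grapheme.

```php
<?php
// Hypothetical sketch; requires ext-intl. Offsets count grapheme clusters.
function grapheme_str_mask_sketch(string $string, int $start, int $length, string $mask_char = '*'): string
{
    $total = grapheme_strlen($string);
    if (!is_int($total)) {
        // grapheme_strlen() does not return an int for invalid UTF-8.
        throw new InvalidArgumentException('Input must be valid UTF-8.');
    }

    $start  = max(0, min($start, $total));
    $length = max(0, min($length, $total - $start));

    // Skip the substr calls at the edges so empty pieces stay empty strings.
    $head = $start > 0 ? grapheme_substr($string, 0, $start) : '';
    $tail = $start + $length < $total ? grapheme_substr($string, $start + $length) : '';

    return $head . str_repeat($mask_char, $length) . $tail;
}

$word = "cafe\u{0301}!"; // c, a, f, e + combining accent, !

// The base letter and its accent are masked together.
echo grapheme_str_mask_sketch($word, 3, 1), "\n"; // caf*!
```

A code-point-based approach would mask only the "e" here and leave the combining accent attached to the mask character.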

Rowan Tommins
[IMSoP]