Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:106002
To: internals@lists.php.net
Date: Thu, 20 Jun 2019 16:36:33 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101
 Thunderbird/60.7.1
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-GB
Content-Transfer-Encoding: 7bit
Subject: [Discussion] Scalar Object Strings and Multibyte Encodings
From: markyr@gmail.com (Mark Randall)
Message-ID: <php.internals-106002@news.php.net>

Greetings,

I have noticed a lot of recent comments, posts, and even Nikita's 
PHP Russia video discussing scalar objects, a potential future feature 
that I believe already has widespread support and would see widespread 
usage once it arrived.

I think most scalars would be self-explanatory, but spurred on by 
discussion here and elsewhere about string functions, I would like 
to debate the string object in particular, and specifically how 
encoding would work in combination with such a scalar object.

I see two options available:



## Option 1

Every String() scalar object would expose both the standard byte-safe 
string methods and separate ASCII and multibyte variants. This may 
result in something like:

"Hello".substr(1)
"Hello".mbSubstr(1)

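For illustration, the difference between the two method families can be 
shown with today's functions. This is only a minimal userland sketch, not 
proposed syntax: the Str class and its method names are hypothetical 
stand-ins for the scalar object, and it assumes the mbstring extension 
is loaded.

```php
<?php
// Hypothetical userland stand-in for the proposed string scalar object.
final class Str
{
    private $bytes;

    public function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    // Byte-based: offsets count bytes, like substr() today.
    public function substr(int $start): string
    {
        return (string) substr($this->bytes, $start);
    }

    // Character-based: offsets count code points in the given encoding.
    public function mbSubstr(int $start, string $encoding = 'UTF-8'): string
    {
        return mb_substr($this->bytes, $start, null, $encoding);
    }
}

$ascii = new Str('Hello');
$accented = new Str("H\u{00E9}llo"); // "Héllo"; é is two bytes in UTF-8

$asciiTail = $ascii->substr(1);       // same result either way for ASCII
$byteTail  = $accented->substr(2);    // cuts through the middle of é
$charTail  = $accented->mbSubstr(2);  // skips "H" and "é" as whole characters
```

With Option 1 the caller has to pick the right method every time; for 
pure ASCII input both give the same answer, which is what makes the 
mistake easy to miss.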


## Option 2

Allow the string to be bound to a specific encoding. This would require 
_zend_string to be extended with a pointer to a structure containing 
encoding helpers, and all of the php-src macros would need updating to 
take this into account.

The scalar object methods would then use that to detect which 
implementation to use.

"Hello".substr(1) // would work as expected regardless of encoding

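As a rough userland model of that dispatch (the real change would live 
in _zend_string, not in a class), the sketch below carries the encoding 
alongside the bytes and lets one substr() pick the implementation. The 
EncodedString class, its constructor, and bytes() are all hypothetical 
names, and mbstring is assumed to be available.

```php
<?php
// Hypothetical model of Option 2: the encoding travels with the value.
final class EncodedString
{
    private $bytes;
    private $encoding; // null means "plain byte string"

    public function __construct(string $bytes, ?string $encoding = null)
    {
        $this->bytes = $bytes;
        $this->encoding = $encoding;
    }

    // One method name; the bound encoding decides which implementation runs.
    public function substr(int $start): self
    {
        $result = $this->encoding === null
            ? (string) substr($this->bytes, $start)
            : mb_substr($this->bytes, $start, null, $this->encoding);

        // The result keeps the encoding of the string it came from.
        return new self($result, $this->encoding);
    }

    public function bytes(): string
    {
        return $this->bytes;
    }
}

$raw  = new EncodedString("H\u{00E9}llo");          // no encoding bound
$utf8 = new EncodedString("H\u{00E9}llo", 'UTF-8'); // bound to UTF-8

$rawTail  = $raw->substr(2)->bytes();  // byte offsets: splits the é
$utf8Tail = $utf8->substr(2)->bytes(); // character offsets: "llo"
```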
My question to everyone is, what mechanism would be used to mark a 
string as being of a specific encoding? Naturally a .toUTF8() would be 
possible, but I'm not sure that would be as tidy as it could be.

"Hello".toUTF8()
$_GET['example'].toUTF8()

In certain languages, such as C and C++, a string literal can be 
prefixed with L to make it a wide-character string (16 bits per 
character on Windows), which is particularly useful for Windows API 
calls. Perhaps that could be the way to go for interned strings in the 
code itself?

L"Hello"
L$_GET['example']

But most strings we use will be coming from an external source, such as 
user input or a database. What would be the cleanest way to mark them as 
having a specific encoding?

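One thing any marking mechanism would have to settle is whether it only 
asserts that the bytes are already in the encoding, or actually converts 
them. A small sketch of the two behaviours using existing mbstring 
functions follows; tagAsUtf8() and convertToUtf8() are hypothetical 
helper names, not part of PHP or of any proposal, and the ISO-8859-1 
source is just an assumed example.

```php
<?php
// "Marking": the bytes are assumed to be UTF-8 already, so only validate.
function tagAsUtf8(string $bytes): string
{
    if (!mb_check_encoding($bytes, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $bytes; // bytes are returned unchanged
}

// "Converting": the bytes are in another known encoding, so transcode.
function convertToUtf8(string $bytes, string $from): string
{
    return mb_convert_encoding($bytes, 'UTF-8', $from);
}

$fromLatin1Source = "caf\xE9"; // "café" encoded as ISO-8859-1
$converted = convertToUtf8($fromLatin1Source, 'ISO-8859-1');
$tagged = tagAsUtf8($converted); // passes validation after conversion
```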
Going a bit more out there, would this bring about the need to add a 
specific encoded native type, at least for the de facto encoding of the 
web?

function x(utf8_string $x): utf8_string { ... }

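The closest thing available today is a value class used as a type 
declaration. The sketch below is only an approximation of what a native 
utf8_string might guarantee: Utf8String, fromBytes() and shout() are 
hypothetical names, and mbstring is assumed.

```php
<?php
// Hypothetical stand-in for a native utf8_string type.
final class Utf8String
{
    private $bytes;

    private function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    // The only way in: bytes are validated once, at the boundary.
    public static function fromBytes(string $bytes): self
    {
        if (!mb_check_encoding($bytes, 'UTF-8')) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        return new self($bytes);
    }

    public function __toString(): string
    {
        return $this->bytes;
    }
}

// The signature itself now documents and enforces the encoding.
function shout(Utf8String $x): Utf8String
{
    return Utf8String::fromBytes(mb_strtoupper((string) $x, 'UTF-8'));
}

$loud = (string) shout(Utf8String::fromBytes("h\u{00E9}llo"));
```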
These are all just questions I have no answer to or firm opinion on, but 
I would be interested to know people's general ideas as to solutions.



--
Mark Randall