Newsgroups: php.internals
To: internals@lists.php.net
From: markyr@gmail.com (Mark Randall)
Date: Thu, 20 Jun 2019 16:36:33 +0100
Subject: [Discussion] Scalar Object Strings and Multibyte Encodings

Greetings,

I have noticed a lot of recent comments, posts, and even Nikita's recent PHP Russia video discussing scalar objects, a potential future feature that I believe already has widespread support and would see widespread usage once it arrived.

I think most scalars would be self-explanatory, but spurred on by discussion here and elsewhere about string functions, I would like to debate the string object in particular, and specifically the use of encoding in combination with such a scalar object.

I see two options available:

## Option 1

Every String() scalar object would expose separate methods for standard byte-safe strings, ASCII, and multibyte functions. This might result in something like:

    "Hello".substr(1)
    "Hello".mbSubstr(1)

## Option 2

Allow the string to be bound to a specific encoding, which would require _zend_string to be extended with a pointer to a structure containing encoding helpers. All of the php-src macros would need updating to take this into account. The scalar object methods would then use that structure to decide which implementation to call.

    "Hello".substr(1) // would work as expected regardless of encoding

My question to everyone is: what mechanism would be used to mark a string as being of a specific encoding?

Naturally a .toUTF8() method would be possible, but I'm not sure that would be as tidy as it could be.

    "Hello".toUTF8()
    $_GET['example'].toUTF8()

In certain languages, a string literal can be prefixed with L to make it a wide-character string (16-bit characters on Windows), which is particularly useful for Windows API calls. Perhaps that could be the way to go for interned strings in the code itself?

    L"Hello"
    L$_GET['example']

But most strings we use will come from an external source, such as user input or a database. What would be the cleanest way to mark them as having a specific encoding?

Going a bit more out there, would this make it necessary to add a native type for a specific encoding, at least for the de facto encoding of the web?

    function x(utf8_string $x): utf8_string { ... }

These are all just questions I have no answer to or firm opinion on, but I would be interested to know people's general ideas as to solutions.

--
Mark Randall
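
To make the difference between the two options more concrete, the behaviour can be roughly approximated in userland today. This is only a sketch under stated assumptions, not proposed API: the EncodedString class and its method names are hypothetical and invented for illustration, the mbstring extension is assumed to be loaded, and the encoding tag carried by the object merely stands in for the pointer that Option 2 would attach to _zend_string.

```php
<?php
declare(strict_types=1);

// Hypothetical userland stand-in for an encoding-bound string (Option 2).
// The class name and methods are invented for this example only.
final class EncodedString
{
    /** @var string raw bytes */
    private $bytes;

    /** @var string|null encoding name, or null for a plain byte string */
    private $encoding;

    public function __construct(string $bytes, ?string $encoding = null)
    {
        $this->bytes = $bytes;
        $this->encoding = $encoding;
    }

    // Marks the same bytes as UTF-8; no conversion or validation is done.
    public function toUTF8(): self
    {
        return new self($this->bytes, 'UTF-8');
    }

    // One method name; the implementation is chosen from the bound encoding.
    public function substr(int $start, ?int $length = null): self
    {
        if ($this->encoding === null) {
            // Byte-oriented path, equivalent to Option 1's substr().
            $part = $length === null
                ? substr($this->bytes, $start)
                : substr($this->bytes, $start, $length);
        } else {
            // Character-oriented path, equivalent to Option 1's mbSubstr().
            $part = mb_substr($this->bytes, $start, $length, $this->encoding);
        }

        return new self($part, $this->encoding);
    }

    public function __toString(): string
    {
        return $this->bytes;
    }
}

// "\u{e9}" is U+00E9, which is two bytes (C3 A9) in UTF-8.
$raw  = new EncodedString("h\u{e9}llo");
$utf8 = $raw->toUTF8();

echo bin2hex((string) $raw->substr(1, 1)), "\n"; // c3 (half of a character)
echo (string) $utf8->substr(1, 1), "\n";         // the whole U+00E9 character
```

With an untagged string, substr(1, 1) cuts a two-byte character in half; once the same bytes are tagged as UTF-8, the identical call returns the full character. That is the behaviour Option 2 would build into the string itself, whereas Option 1 leaves the choice between the two code paths to the caller.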