Newsgroups: php.internals
Xref: news.php.net php.internals:121860
Date: Wed, 29 Nov 2023 20:07:06 +0700
From: ayesh@php.watch (Ayesh Karunaratne)
To: youkidearitai
Cc: internals@lists.php.net
Subject: Re: [PHP-DEV] Deprecate declare(encoding='...') + zend.multibyte + zend.script_encoding + zend.detect_unicode ?
References: <1BA05C1A-AFAE-4E86-BAA2-420B22549519@gmail.com> <0D8856BC-DDEE-47F8-8C59-7F4DC7A64237@woofle.net>

> Shift_JIS is very ambiguous, What will we do if SJIS-2004 or SJIS-win comes?
> How do we guess(detect) SJIS-2004, SJIS-win and SJIS-mac?

I'm not the person you replied to in your previous email, but I thought I would weigh in with what I can. My native language also uses multi-byte characters, and I have done a fair bit of conversion from one character encoding to another.

Different character encodings exist precisely because they assign the same byte values to different real-life characters, so converting from a non-UTF charset always carries some risk of detecting the wrong source encoding. As Yuya Hamada mentioned in the rest of the previous email, 0xFC40, for example, can map to two different characters.
These are quite common occurrences, and there is even a word (Mojibake) for the result!

The most robust projects in this space are probably `enca` and `Chardet` (Python). In principle, however, every tool can only guess the text encoding, by inspecting common patterns and by checking whether all bytes map to a meaningful glyph. When there is not much text to inspect, these tools are very prone to producing wrong results.

When the source encoding is correctly detected or already known, it's easy to re-encode the files using `iconv`, followed by a quick `sed` to remove the `declare()` calls.

---

That said, I'm hugely in favor of dropping support for non-UTF-8 encodings. Because the source encoding is stated in the INI settings or in the declare statement, site owners should be able to mass-convert their files to UTF-8. Many languages, like Rust, only support UTF-8 (https://doc.rust-lang.org/reference/input-format.html), and I don't think any new PHP developer will expect PHP to work with non-UTF-8 encodings in the first place.
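
P.S. To make the two points above concrete, here is a rough Python sketch using only the standard library. It is an illustration, not a migration tool: the helper names `decode_candidates` and `reencode_php_source` and the regular expression are my own assumptions, and the pattern only handles a stand-alone `declare(encoding='...');` statement. The first helper shows why guessing is fragile (the same bytes can decode successfully under several encodings); the second is roughly the `iconv` + `sed` step when the encoding is already known.

```python
from __future__ import annotations

import re

# Hypothetical pattern for a stand-alone `declare(encoding='...');` statement.
# It does not handle declare() calls that carry several directives.
DECLARE_ENCODING = re.compile(
    r"""declare\s*\(\s*encoding\s*=\s*(['"])[^'"]*\1\s*\)\s*;[ \t]*\r?\n?""",
    re.IGNORECASE,
)


def decode_candidates(data: bytes, encodings: list[str]) -> dict[str, str | None]:
    """Decode the same bytes under each candidate encoding (None if invalid)."""
    results: dict[str, str | None] = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


def reencode_php_source(data: bytes, source_encoding: str) -> bytes:
    """Convert PHP source from a known encoding to UTF-8 and drop declare(encoding=...)."""
    # Raises UnicodeDecodeError if the stated encoding does not fit the bytes.
    text = data.decode(source_encoding)
    text = DECLARE_ENCODING.sub("", text)
    return text.encode("utf-8")


if __name__ == "__main__":
    # More than one reading may succeed; nothing in the bytes themselves
    # says which one the author intended.
    candidates = ["cp932", "shift_jis", "shift_jis_2004"]
    for name, text in decode_candidates(b"\xfc\x40", candidates).items():
        print(name, repr(text))
```

Note that `reencode_php_source` fails loudly when the stated encoding is wrong, which is preferable to silently writing Mojibake back to disk.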