Newsgroups: php.internals
From: youkidearitai@gmail.com (youkidearitai)
Date: Mon, 14 Oct 2024 16:37:53 +0900
Subject: Re: [PHP-DEV][DISCUSSION] Multibyte for levenshtein function
To: php internals
References: <754710ced33bdb2f9840d96ba0c58424@bastelstu.be>

On Sun, Oct 6, 2024 at 14:45, youkidearitai wrote:
>
> On Sat, Oct 5, 2024 at 1:20, Tim Düsterhus wrote:
> >
> > Hi
> >
> > On 2024-09-25 09:21, youkidearitai wrote:
> > > I tried implementing an mb_levenshtein function and created an RFC.
> > > https://wiki.php.net/rfc/mb_levenshtein
> > > https://github.com/php/php-src/pull/16043
> > >
> > > I would like to discuss it; feel free to comment.
> >
> > Thank you for your RFC. I share the concern raised by cmb in the PR
> > discussion:
> > https://github.com/php/php-src/pull/16043#issuecomment-2374574538
> >
> > Generally working with codepoints is going to be confusing for a user,
> > but sometimes it is necessary when dealing with external systems that
> > themselves work with codepoints (MySQL comes to my mind). However
> > calculating the Levenshtein distance is most certainly something that
> > purely is "user-facing" and not constrained by external systems.
> > Calculating the distance of codepoints is going to be extremely
> > confusing when dealing with things like Emoji. It would probably be best to
> > either only offer a `grapheme_*` function here or to leave this fully to
> > userland.
> >
> > Best regards
> > Tim Düsterhus
>
> Hi, Tim
>
> Thank you for your response.
> I have been thinking about what users want from a Levenshtein distance.
> I agree that the Levenshtein distance should be measured in terms of
> grapheme clusters.
>
> Most userland code is based on UTF-8, so moving to a grapheme
> function seems to make sense.
> I will think more about the use cases of levenshtein. I will probably go with a
> grapheme function.
>
> Thanks
> Yuya
>
> --
> ---------------------------
> Yuya Hamada (tekimen)
> - https://tekitoh-memdhoi.info
> - https://github.com/youkidearitai
> -----------------------------

Hi, internals

I have been thinking more about the use cases for mb_levenshtein.
I added a test case for mb_levenshtein that compares emoji code point by code point:
https://github.com/php/php-src/pull/16043/files#diff-d6aca000d2b0ac5982f9f9a0fe0425246cfd8411fdfb8645cdfe6f786d526597R86

It shows that comparing Unicode code points makes sense.
I think we need mb_levenshtein, and we also need grapheme_levenshtein.

What do you think?

Regards
Yuya

-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------
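
P.S. For readers following the thread, here is a minimal sketch of why the
unit of comparison matters. It is written in Python purely for illustration
and is not the code from the PR. The helper name `levenshtein` and the
hand-made grapheme segmentation are assumptions of this example: a real
grapheme-based function would rely on a Unicode segmentation library
instead of a list built by hand.

```python
def levenshtein(a, b):
    """Edit distance between two sequences of comparable units.

    The units can be bytes, code points, or pre-segmented grapheme
    clusters; the algorithm itself does not care which.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Family emoji: MAN + ZWJ + WOMAN + ZWJ + GIRL.
# Five code points, but a single user-perceived character.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
man = "\U0001F468"

# Unit = UTF-8 byte.
by_bytes = levenshtein(family.encode("utf-8"), man.encode("utf-8"))

# Unit = Unicode code point (iterating a Python str yields code points).
by_code_points = levenshtein(family, man)

# Unit = grapheme cluster. Assumption: segmented by hand, one cluster each.
by_graphemes = levenshtein([family], [man])

print(by_bytes, by_code_points, by_graphemes)
```

The same pair of strings gives a different distance for each unit, which is
the core of the discussion above: a code point distance is useful when the
other side of an interface also counts code points, while a grapheme
distance is closer to what a user sees on screen.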