Hello Internals.
I have implemented an mb_levenshtein function and written an RFC for it.
https://wiki.php.net/rfc/mb_levenshtein
https://github.com/php/php-src/pull/16043
I would like to discuss it, so please feel free to comment.
Thank you.
Yuya.
--
Yuya Hamada (tekimen)
Hi
On 2024-09-25 09:21, youkidearitai wrote:
I tried implement mb_levenshtein function and create an RFC.
https://wiki.php.net/rfc/mb_levenshtein
https://github.com/php/php-src/pull/16043
I would like discussion, feel free to comment.
Thank you for your RFC. I share the concern raised by cmb in the PR
discussion:
https://github.com/php/php-src/pull/16043#issuecomment-2374574538
Generally working with codepoints is going to be confusing for a user,
but sometimes it is necessary when dealing with external systems that
themselves work with codepoints (MySQL comes to my mind). However
calculating the Levenshtein distance is most certainly something that
purely is "user-facing" and not constrained by external systems.
Calculating the distance of codepoints is going to be extremely
confusing when dealing with things like Emoji. It would probably be best to
either offer only a grapheme_* function here or to leave this fully to
userland.
Best regards
Tim Düsterhus
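
To make the concern above concrete, here is a minimal sketch of how the same pair of strings yields three different distances depending on the unit of comparison. It is written in Python purely for illustration and is not the implementation from the RFC or the PR. The `levenshtein` helper is a hypothetical generic function, and the grapheme-level input is segmented by hand on the assumption that the ZWJ family emoji forms a single grapheme cluster; a real implementation would need a proper segmenter.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# "Family" emoji: MAN + ZWJ + WOMAN + ZWJ + GIRL (5 code points, shown as one symbol)
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
# Single MAN emoji (1 code point)
man = "\U0001F468"

# Byte level: compare the UTF-8 encodings byte by byte.
byte_distance = levenshtein(family.encode("utf-8"), man.encode("utf-8"))

# Code point level: Python strings iterate by code point.
codepoint_distance = levenshtein(family, man)

# Grapheme level: clusters segmented by hand (assumption, see above).
grapheme_distance = levenshtein([family], [man])

print(byte_distance, codepoint_distance, grapheme_distance)  # 14 4 1
```

A user who sees one symbol replaced by another would likely expect a distance of 1, which only the grapheme-level comparison gives here.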
On Sat, 5 Oct 2024, 1:20, Tim Düsterhus tim@bastelstu.be wrote:
Hi, Tim
Thank you for your response.
I have been thinking about what users expect from a Levenshtein distance.
I agree that the Levenshtein distance should be measured in terms of
grapheme clusters.
Most userland code is based on UTF-8, so moving this to a grapheme
function seems to make sense.
I will think more about the use cases of Levenshtein distance, but I will probably go with a grapheme function.
Thanks
Yuya
--
Yuya Hamada (tekimen)
On Sun, 6 Oct 2024, 14:45, youkidearitai youkidearitai@gmail.com wrote:
Hi, internals
I have been thinking more about the use cases of mb_levenshtein.
I added a test case for mb_levenshtein that compares emoji per code point.
https://github.com/php/php-src/pull/16043/files#diff-d6aca000d2b0ac5982f9f9a0fe0425246cfd8411fdfb8645cdfe6f786d526597R86
This suggests that comparing by Unicode code point also makes sense.
I think we need mb_levenshtein, and also grapheme_levenshtein.
What do you think?
Regards
Yuya
--
Yuya Hamada (tekimen)