mbstring does not support hexadecimal numeric entities in HTML code. For example:
echo urlencode( mb_convert_encoding("&#x0415;", "UTF-8", "HTML-ENTITIES") );
displays %F2%AF%B8%9F rather than the expected %D0%95.
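The wrong output can be reproduced by hand. The following is only a sketch, assuming the input is the hex entity &#x0415; (U+0415) and that the decoder buffer holds the text "&#x0415" when the ';' arrives. The names old_decode and utf8_percent are made up for this illustration and are not part of libmbfl. The old loop treats every byte after "&#" as a decimal digit, so 'x' counts as the "digit" 72 and the result is 720415 (U+AFE1F) instead of 1045.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the old loop: every byte after "&#" is
 * taken as a decimal digit, including the 'x' of a hex entity. */
static long old_decode(const char *buffer)
{
    long ent = 0;
    size_t pos, len = strlen(buffer);

    for (pos = 2; pos < len; pos++)
        ent = ent * 10 + (buffer[pos] - '0');
    return ent;
}

/* Hypothetical helper: encode cp as UTF-8 and percent-encode each byte,
 * the way urlencode() would show it. out needs room for 13 chars. */
static void utf8_percent(unsigned long cp, char *out)
{
    unsigned char b[4];
    int n = 0, i;

    if (cp < 0x80) {
        b[n++] = (unsigned char)cp;
    } else if (cp < 0x800) {
        b[n++] = (unsigned char)(0xC0 | (cp >> 6));
        b[n++] = (unsigned char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        b[n++] = (unsigned char)(0xE0 | (cp >> 12));
        b[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        b[n++] = (unsigned char)(0x80 | (cp & 0x3F));
    } else {
        b[n++] = (unsigned char)(0xF0 | (cp >> 18));
        b[n++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        b[n++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        b[n++] = (unsigned char)(0x80 | (cp & 0x3F));
    }
    out[0] = '\0';
    for (i = 0; i < n; i++)
        sprintf(out + 3 * i, "%%%02X", (unsigned)b[i]);
}
```

Feeding old_decode("&#x0415") into utf8_percent gives %F2%AF%B8%9F, while the correct code point 0x0415 gives %D0%95.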
This bug was detected by Nick Wedd nick@maproom.co.uk and reported in the
newsgroup comp.lang.php, Message-ID: EU9zOoNGJAVGFAaa@maproom.demon.co.uk.
I found the bug in the file ext/mbstring/libmbfl/filters/mbfilter_htmlent.c
and added these features:
- decode hex entities &#xHHHH;
- detect invalid digits
- detect missing digits
- detect values outside the range 0-0xffff
Invalid values are returned verbatim.
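The rules listed above can be tried in isolation with a small standalone sketch. This is not the patch itself: numeric_entity_value is a hypothetical helper that takes the whole entity text, including "&#" and ";", and returns -1 whenever the caller should output the text unchanged.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper, not part of libmbfl: returns the code point of
 * "&#ddd;" or "&#xhhhh;", or -1 if the entity must be kept verbatim. */
static int numeric_entity_value(const char *entity)
{
    size_t len = strlen(entity);
    size_t pos = 2;
    int base = 10, ent = 0;

    if (len < 4 || entity[0] != '&' || entity[1] != '#'
            || entity[len - 1] != ';')
        return -1;              /* not of the form &#...; */
    len--;                      /* ignore the trailing ';' */
    if (entity[pos] == 'x' || entity[pos] == 'X') {
        base = 16;
        pos++;
    }
    if (pos == len)
        return -1;              /* digits missing altogether */
    for (; pos < len; pos++) {
        unsigned char c = (unsigned char)entity[pos];
        int d;

        if (isdigit(c))
            d = c - '0';
        else if (base == 16 && isxdigit(c))
            d = tolower(c) - 'a' + 10;
        else
            return -1;          /* invalid digit for this base */
        ent = ent * base + d;
        if (ent > 0xffff)
            return -1;          /* outside the range 0-0xffff */
    }
    return ent;
}
```

A caller would emit the returned code point when it is >= 0 and copy the entity text to the output as-is otherwise, so that "&#x0415;" and "&#1045;" both become U+0415 while "&#x;", "&#12a;" and "&#x10000;" pass through untouched.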
Apparently the right place for this patch would be
http://cvs.sourceforge.jp/cgi-bin/viewcvs.cgi/php-i18n/
but the project is no longer hosted there.
The patch for ext/mbstring/libmbfl/filters/mbfilter_htmlent.c follows:
173a174,217
>
> static int mbfl_decode_numeric_entity(char *s, int s_len)
> /*
> 	s = numeric entity "ddd" or "xhhhh"
> 	return: numeric value or -1 if not inside [0,0xffff] or invalid digits
> */
> {
> 	int ent, pos, c, d;
>
> 	ent = 0;
> 	if (*s == 'x' || *s == 'X') {
> 		/* hexadecimal base */
> 		if ( s_len < 2 )
> 			return -1; /* no digits found */
> 		for (pos=1; pos<s_len; pos++) {
> 			c = s[pos];
> 			if (isdigit(c))
> 				d = c - '0';
> 			else if (isxdigit(c))
> 				d = tolower(c) - 'a' + 10;
> 			else
> 				return -1; /* invalid hex digit */
> 			ent = (ent << 4) + d;
> 			if (ent > 0xffff)
> 				return -1; /* too big */
> 		}
>
> 	} else {
> 		/* decimal base */
> 		if ( s_len < 1 )
> 			return -1; /* no digits found */
> 		for (pos=0; pos<s_len; pos++) {
> 			c = s[pos];
> 			if (! isdigit(c) )
> 				return -1; /* invalid dec char */
> 			ent = ent*10 + (c - '0');
> 			if (ent > 0xffff)
> 				return -1; /* too big */
> 		}
> 	}
>
> 	return ent;
> }
>
192,193c236,246
< for (pos=2; pos<filter->status; pos++) {
< ent = ent*10 + (buffer[pos] - '0');
---
> ent = mbfl_decode_numeric_entity(&buffer[2], filter->status - 2);
> if( ent >= 0 ){
> 	CK((*filter->output_function)(ent, filter->data));
> 	filter->status = 0;
> 	/*php_error_docref("ref.mbstring" TSRMLS_CC, E_NOTICE, "mbstring decoded '%s'=%d", buffer, ent);*/
> } else {
> 	/* failure */
> 	buffer[filter->status++] = ';';
> 	buffer[filter->status] = 0;
> 	/* php_error_docref("ref.mbstring" TSRMLS_CC, E_WARNING, "mbstring cannot decode '%s'", buffer); */
> 	mbfl_filt_conv_html_dec_flush(filter);
195,197d247
< CK((*filter->output_function)(ent, filter->data));
< filter->status = 0;
< /*php_error_docref("ref.mbstring" TSRMLS_CC, E_NOTICE, "mbstring decoded '%s'=%d", buffer, ent);*/
Best regards,
/|\ Umberto Salsi
/_/ www.icosaedro.it