Newsgroups: php.internals
Path: news.php.net
Xref: news.php.net php.internals:19627
Mailing-List: contact internals-help@lists.php.net; run by ezmlm
Message-ID: <20051016131222.28963.qmail@web31807.mail.mud.yahoo.com>
Date: Sun, 16 Oct 2005 06:12:22 -0700 (PDT)
To: Andrei Zmievski <andrei@yahoo-inc.com>, Tex Texin <tex@yahoo-inc.com>
Cc: internals <internals@lists.php.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Subject: Next batch of strings funcns ...
From: rollandsantimano@yahoo.com (Rolland Santimano)

... to be converted to handle Unicode - comments ?

TIA,
Rolland
--

[1] int strrpos(string haystack, string needle [, int offset])
Finds position of last occurrence of 'needle' within 'haystack' -
'offset' determines how much of the string is searched.
To be impl'ed in terms of u_strFindLast(), which matches codepoints.
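
A rough sketch of what "matches codepoints" means for a last-occurrence search over UTF-16 code units. This is a Python model, not the C implementation; find_last() and to_utf16() are hypothetical helpers, and returning a code-unit index (rather than a codepoint index) is an assumption. A hit is rejected when it would begin or end inside a surrogate pair.

```python
def is_lead(u):
    """True for a UTF-16 lead (high) surrogate."""
    return 0xD800 <= u <= 0xDBFF


def is_trail(u):
    """True for a UTF-16 trail (low) surrogate."""
    return 0xDC00 <= u <= 0xDFFF


def to_utf16(s):
    """Encode a Python str as a list of UTF-16 code units."""
    b = s.encode("utf-16-le")
    return [b[i] | (b[i + 1] << 8) for i in range(0, len(b), 2)]


def find_last(hay, needle):
    """Code-unit index of the last match on codepoint boundaries, or -1."""
    n = len(needle)
    if n == 0:
        return -1
    for i in range(len(hay) - n, -1, -1):
        if hay[i:i + n] != needle:
            continue
        # Reject a match that starts on the trail half of a pair.
        if is_trail(hay[i]) and i > 0 and is_lead(hay[i - 1]):
            continue
        end = i + n
        # Reject a match that ends on the lead half of a pair.
        if is_lead(hay[end - 1]) and end < len(hay) and is_trail(hay[end]):
            continue
        return i
    return -1
```

Searching for a lone trail surrogate inside text that only contains it as half of a pair therefore reports no match.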

[2] int stripos(string haystack, string needle [, int offset])
Case-insensitive version of strpos(), finds position of first
occurrence of 'needle' within 'haystack'.
To be impl'ed in terms of zend_u_memnstr().
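
One point a case-insensitive search has to settle is that case folding can change string length (e.g. U+00DF folds to "ss"), so positions in the folded text do not map 1:1 to the original. A minimal Python sketch, assuming full case folding is wanted; the post does not say which folding is intended, and this stripos() is a hypothetical model that folds 'haystack' incrementally from each candidate position so the returned index refers to the original string.

```python
def stripos(haystack, needle, offset=0):
    """Index in the original haystack of the first caseless match, or -1."""
    target = needle.casefold()
    if not target:
        return -1
    for i in range(offset, len(haystack)):
        folded = ""
        # Fold one original character at a time until enough text is built.
        for ch in haystack[i:]:
            folded += ch.casefold()
            if len(folded) >= len(target):
                break
        if folded == target:
            return i
    return -1
```

This is quadratic in the worst case; it is meant to show the offset problem, not to be efficient.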

[3] int strspn(string str, string mask [, int start [, int len]])
Returns length of initial segment of 'str' consisting entirely of
characters found in 'mask'.
[4] int strcspn(string str, string mask [, int start [, int len]])
Returns length of initial segment of 'str' consisting entirely of
characters NOT found in 'mask'.
Both are currently impl'ed by stepping through character arrays; Unicode
handling will process codepoints. u_strspn() & u_strcspn() are
available, but these require input to be NUL-terminated.
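
A small sketch of the codepoint-stepping approach with explicit bounds, so embedded NULs in either argument are treated as ordinary characters. It is written in Python, whose str is already a sequence of codepoints; u_span() and its 'complement' flag are hypothetical, and only non-negative 'start'/'length' are handled.

```python
def u_span(s, mask, start=0, length=None, complement=False):
    """Length of the initial run of s[start:] made of codepoints in mask.

    With complement=True the run consists of codepoints NOT in mask,
    i.e. the strcspn() behaviour.
    """
    end = len(s) if length is None else min(len(s), start + length)
    members = set(mask)
    count = 0
    for ch in s[start:end]:
        # Stop at the first codepoint on the wrong side of the mask.
        if (ch in members) == complement:
            break
        count += 1
    return count
```

u_span(s, mask) then models strspn() and u_span(s, mask, complement=True) models strcspn().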

[5] string stristr(string haystack, string needle [, bool part])
Case-insensitive version of strstr(), finds first occurrence of a
string within another - 'part' determines portion of 'haystack' to be
returned.
To be impl'ed in terms of zend_u_memnstr().

[6] int strripos(string haystack, string needle [, int offset])
Case-insensitive version of strrpos()[1]

[7] mixed str_replace(mixed search, mixed replace, mixed subject
    [, int &replace_count])
Replaces 'search' with 'replace' in 'subject' - any of these may be
arrays.
[8] mixed str_ireplace(mixed search, mixed replace, mixed subject
    [, int &replace_count])
Case-insensitive version of str_replace()[7].
Both are currently impl'ed in terms of zend_memnstr() to locate
'search' within 'subject'; Unicode handling will use zend_u_memnstr()
to match at codepoint boundaries.

[9] string strrchr(string haystack, string needle)
Finds the last occurrence of 'needle' character within 'haystack'.
Currently impl'ed in terms of libc strrchr(); Unicode handling can be
impl'ed in terms of zend_u_memnstr() or simply the U16_PREV()
macro. u_strrchr32() is available, but assumes input is
NUL-terminated.
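
The U16_PREV() route can be sketched as a backward walk that decodes one codepoint per step. Below is a Python model over a list of UTF-16 code units; u16_prev() imitates the macro's behaviour as I understand it, and strrchr_index() is a hypothetical helper that returns the code-unit index of the last occurrence rather than the tail of the string.

```python
def to_utf16(s):
    """Encode a Python str as a list of UTF-16 code units."""
    b = s.encode("utf-16-le")
    return [b[i] | (b[i + 1] << 8) for i in range(0, len(b), 2)]


def u16_prev(units, i):
    """Step back from index i by one codepoint; return (new_index, codepoint)."""
    i -= 1
    c = units[i]
    # A trail surrogate preceded by a lead surrogate forms one codepoint.
    if 0xDC00 <= c <= 0xDFFF and i > 0 and 0xD800 <= units[i - 1] <= 0xDBFF:
        i -= 1
        c = 0x10000 + ((units[i] - 0xD800) << 10) + (c - 0xDC00)
    return i, c


def strrchr_index(units, needle_cp):
    """Code-unit index of the last occurrence of codepoint needle_cp, or -1."""
    i = len(units)
    while i > 0:
        i, c = u16_prev(units, i)
        if c == needle_cp:
            return i
    return -1
```

Because the length is passed explicitly (here via len()), no terminator is needed and embedded NULs are harmless.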

[10] string strpbrk(string haystack, string char_list)
Search 'haystack' for any character in 'char_list'.
Currently impl'ed in terms of libc strpbrk(); Unicode handling to be
impl'ed in terms of U16_FWD() macro. u_strpbrk() is available, but
assumes input is NUL-terminated.

[11] string chunk_split(string str [, int chunklen [, string
ending]])
"Wraps" 'str' by inserting 'ending' every 'chunklen' number of
characters.
For Unicode, 'chunklen' will be counted in codepoints.
Q: Handling of base+combining sequences ?

[12] array str_split(string str [, int split_length])
Convert 'str' to an array by splitting it into 'split_length'-sized
chunks.
For Unicode, splits will be made at codepoint boundaries ?
Q: Handling of base+combining sequences ?
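
To make the base+combining question concrete, here is a Python sketch contrasting a split at codepoint boundaries with one that keeps combining marks attached to their base. clusters() is a hypothetical helper and a simplification: it only looks at the canonical combining class, not at full grapheme cluster rules.

```python
import unicodedata


def clusters(s):
    """Group each base codepoint with the combining marks that follow it."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch  # attach the mark to the preceding base
        else:
            out.append(ch)
    return out


def split_codepoints(s, n=1):
    """Split into chunks of n codepoints; may separate a base from its mark."""
    return [s[i:i + n] for i in range(0, len(s), n)]


def split_clusters(s, n=1):
    """Split into chunks of n base+combining sequences."""
    c = clusters(s)
    return ["".join(c[i:i + n]) for i in range(0, len(c), n)]
```

For "e" followed by U+0301 and then "a", the codepoint split yields three pieces with the accent on its own, while the cluster split yields two.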

[13] string strtr(string str, string from, string to)
Translates/replaces characters in 'str'; each character in 'from' is
replaced by the corresponding character in 'to'.

[14] string quotemeta(string str)
Escapes meta characters ".\+*?[]^$()" with backslash

[15] string nl2br(string str)
Inserts an HTML line break (<br />) before each newline (\n, \r, \r\n
or \n\r) in 'str'.

[16] array pathinfo(string path)
Breaks 'path' into dir, file & extension like [17] & [18] below

[17] string basename(string path [, string suffix])
Returns the filename component of 'path', after removing 'suffix'

[18] string dirname(string path)
Returns the directory name component of 'path'

[19] int strnatcmp(string s1, string s2)
Returns the result of string comparison using "natural" algorithm to
collate numeric portions.
[20] int strnatcasecmp(string s1, string s2)
Case-insensitive version of strnatcmp()[19]
For handling Unicode, adapt existing code to handle codepoints.

[21] int substr_compare(string main_str, string str, int offset
     [, int length [, bool case_sensitivity]])
Compares 'main_str', starting at 'offset', with 'str' for up to
'length' characters.
For handling Unicode, adapt existing code to handle codepoints.

[22] string wordwrap(string str [, int width [, string break [,
boolean cut]]])
Like chunk_split()[11], "wraps" 'str' by inserting 'break' every
'width' characters
Q: Handling of base+combining sequences ?

[23] string str_shuffle(string str)
Shuffles 'str'; returns one random permutation of its characters.
For Unicode, handle codepoints & base+combining sequences.
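
A minimal sketch of a shuffle that permutes base+combining sequences as units, so a combining mark never ends up on a different base or at the start of the string. It is a Python model; shuffle_clusters() is hypothetical and, as an assumption, uses only the canonical combining class to decide what sticks to the preceding base.

```python
import random
import unicodedata


def shuffle_clusters(s, rng=None):
    """Return s with its base+combining sequences in random order."""
    rng = rng or random.Random()
    units = []
    for ch in s:
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch  # keep the mark with its base
        else:
            units.append(ch)
    rng.shuffle(units)
    return "".join(units)
```

Passing a seeded random.Random makes the result reproducible for testing.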

[24] mixed str_word_count(string str, [int format [, string
charlist]])
Counts the number of words in 'str'. Words consist of alphabetic
chars, chars in 'charlist', single-quote & hyphen; everything else is
a delimiter. 'format' specifies how output is returned.
Existing algo can be adapted for Unicode. ICU provides a break
iteration API, but current docs do not specify facilities to add
custom rules - to be investigated further.
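
A sketch of adapting the existing rule as described above (alphabetic chars, 'charlist', single-quote and hyphen form words; everything else delimits), using the Unicode alphabetic test instead of an ASCII one. It is a Python model: word_list() is hypothetical, returns the words rather than a count, and does not reproduce any special handling of leading or trailing quotes and hyphens.

```python
def word_list(s, charlist=""):
    """Split s into words; len() of the result gives the word count."""
    extra = set(charlist) | {"'", "-"}
    words, cur = [], []
    for ch in s:
        if ch.isalpha() or ch in extra:
            cur.append(ch)
        elif cur:
            # Any other codepoint ends the current word.
            words.append("".join(cur))
            cur = []
    if cur:
        words.append("".join(cur))
    return words
```

Note that a combining mark is not alphabetic under this test, so decomposed text such as "cafe" + U+0301 is cut at the mark; this is the same base+combining question raised for the splitting functions.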