Newsgroups: php.internals
Message-ID: <500592AE.4060305@gmail.com>
Date: Tue, 17 Jul 2012 18:28:30 +0200
User-Agent: Thunderbird
MIME-Version: 1.0
To: Alex Aulbach <alex.aulbach@gmail.com>
CC: Anthony Ferrara <ircmaxell@gmail.com>, 
 Andrew Faulds <ajfweb@googlemail.com>,
 Nikita Popov <nikita.ppv@gmail.com>, 
 PHP internals <internals@lists.php.net>
References: <CAF+90c9c-LuEzXyWJYfkQ5ycNYYRTqBB+0NhZDEkhGeQQaeEpw@mail.gmail.com> <CAN+gjSe7OR2S0gUON27_thx4dPPvsnj_24a4dxfieL_0mzvFZg@mail.gmail.com> <CAAyV7nHz_0HEV3_E2qJzYMvKUZUzz3vui-2vHZbH=hu3wBLXSQ@mail.gmail.com> <CAKZjf5SUR_Tt4Y4jwK2oYgYusBq4dGvQ75YLOOf8J=kDD=nmDw@mail.gmail.com> <5004775D.601@gmail.com> <CAKZjf5R16k5MJ=EUZggX+E5mUjsV0Bqm-9oaEby-hhbfQsWiig@mail.gmail.com>
In-Reply-To: <CAKZjf5R16k5MJ=EUZggX+E5mUjsV0Bqm-9oaEby-hhbfQsWiig@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [PHP-DEV] Random string generation (á la password_make_salt)
From: keisial@gmail.com (Ángel González)

On 17/07/12 13:34, Alex Aulbach wrote:
>> That's more or less what I have thought.
>> If it's a string surrounded by square brackets, it's a character class,
>> else
>> treat as a literal list of characters.
>> ] and - can be provided with the old tricks: put "]" first in the
>> class, and make "-" the first or last character.
> Right thought. But introducing a new scheme of character-class
> identifiers, or a new way of describing character classes, is
> confusing. As a PHP developer I think "Oh no, not new magic
> charsets again".
Not really new. Those escaping rules are how you have always had to
handle them in character classes of traditional regular expressions.
But I agree it can be confusing. What about a flag parameter, then?

> I suggest again to use PCRE for that. The difference to your proposal
> is not so big. Examples:
>
> "/[[:alnum:]]/" will return "abc...XYZ0123456789". We can do this also
> with "/[a-zA-Z0-9]/". Or "/[a-z0-9]/i". Or "/[[:alpha:][:digit:]]/"
>
> You see: You can do things in much more different ways with PCRE. And
> you continue to use this "standard".
>
> [And PCRE supports UTF8. Currently not important. But who knows?]
>
> And maybe we can think about removing the beginning "/[" and the
> ending "]/", but a "/" at the end should be optionally possible to add
> some regex-parameters (like "/i").
Those could go in the flag. The / are not really needed; they are an
additional delimiter syntax that PHP layers over the regex (and the
delimiter can be a different character, although / is the usual choice).
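
To illustrate the point about delimiters and flags, here is a minimal
sketch. class_to_pattern() is a hypothetical helper made up for this
example (it is not an existing PHP function); it only assumes that the
caller passes the bare class body and the flags separately:

```php
<?php
// Hypothetical helper: wrap a bare character-class body and separate
// flags into a complete PCRE pattern using the given delimiter.
function class_to_pattern(string $classBody, string $flags = '', string $delimiter = '/'): string
{
    return $delimiter . '[' . $classBody . ']' . $delimiter . $flags;
}

$slash = class_to_pattern('a-z0-9', 'i');       // "/[a-z0-9]/i"
$hash  = class_to_pattern('a-z0-9', 'i', '#');  // "#[a-z0-9]#i"

// Both patterns describe the same class; only the delimiter differs.
var_dump(preg_match($slash, 'Q')); // int(1)
var_dump(preg_match($hash, 'Q'));  // int(1)
```

With the flags passed on their own, the user never has to type the
delimiters at all.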


>> Having to detect character limits makes it uglier.
> Exactly. That's why I think we don't need so much magic in the second
> parameter. The character list is just a list of characters. No magic.
> We can extend this with a third parameter to tell the function
> which charset it is in. And maybe a fourth to select the random
> algorithm, but I think it's perhaps better to have a function for each
> algorithm, because that's how random currently works.
>
> If I should write it with php this looks like that:
>
> pseudofunction str_random($len, $characters, $encoding = 'ASCII', $algo)
> {
>     $result = '';
>     $chlen = mb_strlen($characters,$encoding);
>     for ($i = 0; $i < $len; $i++) {
>         $result .= mb_substr($characters, myrandom(0, $chlen, $algo),1);
>     }
>     return $result;
> }
>
> Without testing anything. It's just an idea.
>
> This is a working PHP function; it does not use $algo yet:
>
> function str_random($len, $characters, $encoding = 'ASCII', $algo = null)
> {
>     $result = '';
>     $chlen = mb_strlen($characters, $encoding);
>     for ($i = 0; $i < $len; $i++) {
>         // rand() is inclusive at both ends, so the last valid index is
>         // $chlen - 1; mb_substr() also needs $encoding passed explicitly.
>         $result .= mb_substr($characters, rand(0, $chlen - 1), 1, $encoding);
>     }
>     return $result;
> }
>
>
>> About supporting POSIX classes, that could be cool. But you then need a way
>> to enumerate them. Note that isalpha() will be provided by the C library,
>> so you can't count on having its data. It's possible that PCRE, which we
>> bundle, contains the needed Unicode tables.
> It works without further thought in PHP, as written above, but I
> don't know whether the same can be done in C.
The above code doesn't support POSIX character classes; it only picks
characters out of a string (which I agree is simple).
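
Supporting POSIX classes would need an extra enumeration step first. A
rough sketch of that step, assuming only the ASCII range matters and
letting the bundled PCRE do the classification through preg_match().
expand_char_class() is a hypothetical name, not an existing function,
and the sketch assumes the class body does not contain the "~" used as
delimiter:

```php
<?php
// Hypothetical helper, not part of PHP: list the ASCII characters that a
// bracket-expression body (e.g. "[:alnum:]" or "a-f0-9") matches.
// Assumption: $classBody does not contain the '~' delimiter.
function expand_char_class(string $classBody, string $flags = ''): string
{
    $pattern = '~\A[' . $classBody . ']\z~' . $flags;
    $out = '';
    for ($code = 0; $code < 128; $code++) {
        $ch = chr($code);
        if (preg_match($pattern, $ch) === 1) {
            $out .= $ch;
        }
    }
    return $out;
}

echo expand_char_class('[:digit:]'), "\n";  // 0123456789
echo expand_char_class('a-c', 'i'), "\n";   // ABCabc
```

The result is a plain list of characters, so it could be fed to a
picker like the str_random() quoted above. Going beyond ASCII would
mean walking a much larger code point range.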


>>> 3. Because generating a string from character-classes is very handy in
>>> general for some other things (many string functions have it), I
>>> suggest that it is not part of random_string(). Make a new function
>>> str_from_character_class(), or if you use pcre like above
>>> pcre_str_from_character_class()?
>> How would you use such function? If you want to make a string out of them,
> Oh, there are many cases to use it.
>
> For example (I renamed the function to "str_charset()", because it is
> just a string of a charset):
>
> // Search spacer strings
> strpbrk ("Hello World", str_charset('/[\s]/'));
So you're expanding all whitespace characters and then iterating over
them with strpbrk(); a preg_match() would have been more efficient.
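
For comparison, a small sketch of both routes. The $spaces string is
hand-written here as an assumption of roughly what \s covers in ASCII,
standing in for whatever the proposed str_charset() would return:

```php
<?php
$subject = "Hello World";

// Expanded-list route: every whitespace character has to be
// materialized in a string before strpbrk() can scan for them.
$spaces = " \t\n\r\v\f";
$viaStrpbrk = strpbrk($subject, $spaces);

// Direct regex route: no intermediate character list is built.
preg_match('/\s/', $subject, $m, PREG_OFFSET_CAPTURE);
$viaPreg = substr($subject, $m[0][1]);

var_dump($viaStrpbrk === $viaPreg); // bool(true), both are " World"
```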

> // Remove invisible chars at the beginning or end (not much point,
> // because a regex is maybe faster in this case)
> trim("\rblaa\n", str_charset('/[^[:print:]]/'));
>
> // Remove invisible chars: when doing this with very big strings it
> // could be much faster than with a regex.
> str_replace(str_split(str_charset('/[^[:print:]]/')), '', "\rblaa\n");
I don't see why expanding to a string, then converting it to an array,
and finally calling str_replace() would be faster :S
Also, that str_split() over all non-printable characters will fail for
code points > 127, because str_split() works on bytes (and that is even
assuming you don't hit the memory limit with the many Unicode
characters you would get).
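
A short sketch of the byte problem, assuming UTF-8 input (the sample
string is written with byte escapes so that the file encoding does not
matter), followed by the direct regex alternative:

```php
<?php
// "é" is 2 bytes and "€" is 3 bytes in UTF-8.
$s = "\xC3\xA9\xE2\x82\xAC";

$bytes = str_split($s);                                   // splits per byte
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);  // splits per code point

var_dump(count($bytes)); // int(5): broken fragments, useless as search terms
var_dump(count($chars)); // int(2)

// Removing non-printable characters directly, without expanding a class:
$clean = preg_replace('/[^[:print:]]/', '', "\rblaa\n");
var_dump($clean); // string(4) "blaa"
```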


> There are many other more or less useful things you can do with a
> charset-string. :)
I'm not really convinced it's the right way to do them :)