Hi Dmitry,
I know :)
Interned strings in PHP5 were implemented as characters allocated in one
single buffer. Adding new strings into this buffer from different
threads would require synchronization (locks).
In PHP7 this implementation was changed, so it should probably be
possible to use interned strings in ZTS now. If we use separate HashTables
for interned strings in different threads, we may share some common part
of predefined interned strings and have new interned strings in each
thread independently. I'm not sure if it'll work well with opcache,
because it replaces the interned strings handling mechanism to use shared
memory. Maybe it'll work out of the box. BTW: I'm not interested in
implementing this myself.
Also, if we really like ZTS, maybe PHP7 is the time to switch to
native TLS and remove all these TSRMLS macros.
Even if it won't allow running ZTS on some platforms, it won't be the end
of the world, because ZTS is not really widely used now. I won't be
able to work on it actively, but I may provide some help.
Thanks. Dmitry.
Maybe it'd make sense to do it the other way round: first get rid of
TSRM, then look at what is doable with interned strings? I'd certainly be
in the game if there are enough interested people to actively do that.
Yesterday Joe pushed his approach to the TSRMLS_* removal:
http://git.php.net/?p=php-src.git;a=shortlog;h=refs/heads/native-tls
While trying to port it to Windows, I see some design issues and have an
idea how to solve them. The patch relies on a few things which Visual
Studio cannot handle. The first one:
TSRM_API extern TSRM_TLS void *tsrm_ls_cache;
Exporting the TSRM cache from a DLL won't work with VS. There it would
look like
__declspec(dllexport) extern __declspec(thread) void *tsrm_ls_cache;
The VS linker cannot share thread-local variables directly between DLL and EXE.
Furthermore, an even bigger issue could arise with modules loaded at runtime.
The second issue is that while ini entries are defined in static arrays,
when they link to some SAPI or extension globals, they have an id
passed. That id would be declared like
__declspec(dllimport) ts_rsrc_id cli_server_globals_id;
Although it is passed to the ini entry by reference, its address is still
not a constant value. So VS refuses to initialize an ini entry with
something that is not a constant expression (AFAIK that's ok for C89).
So much for the issues. How to solve them: after some research it looks
like one can share thread data using getter/setter functions (using the
DLL_THREAD_ATTACH event in DllMain to initialize the current thread's data).
That would probably require some more rewrites in the code. The main
question is whether the negative impact of the extra function calls
would make such porting senseless.
The flow would be like
- apache (or php) binary starts thread
- TSRM layer inits TLS data
- php inits globals / depends on TSRM layer
- extensions init/read globals / depend on TSRM layer
That's just a basic idea so far.
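
The getter/setter idea above can be sketched roughly as follows. This is only an illustration under assumptions: all demo_* names are hypothetical and not part of TSRM, and the thread-local slot is kept private to one binary while only plain functions would cross the DLL/EXE boundary (on Windows they would carry __declspec(dllexport), and demo_thread_attach would be called from the DLL_THREAD_ATTACH handler).

```c
#include <assert.h>
#include <stddef.h>

/* Portability shim: pick the compiler's thread-local keyword. */
#if defined(_MSC_VER)
# define DEMO_TLS __declspec(thread)
#else
# define DEMO_TLS __thread
#endif

/* Hypothetical per-thread state; stands in for the TSRM resource table. */
typedef struct demo_thread_data {
    int initialized;
} demo_thread_data;

/* The thread-local slot is static: it is never exported as a variable. */
static DEMO_TLS demo_thread_data *demo_current = NULL;

/* Setter: binds caller-provided storage to the calling thread. */
void demo_thread_attach(demo_thread_data *storage)
{
    assert(storage != NULL);
    storage->initialized = 1;
    demo_current = storage;
}

/* Getter: the only way other binaries would reach the per-thread data. */
demo_thread_data *demo_tls_get(void)
{
    return demo_current;
}
```

Every access from another module then costs one function call, which is exactly the overhead questioned above.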
For the second issue, a similar approach: instead of using
&some_globals_id, one could replace it with a function pointer, so that it
can deliver the right location per thread.
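
A minimal sketch of that function pointer idea, assuming a simplified ini entry layout. The demo_* types and names are made up for illustration and do not match the real zend_ini_entry. The point is that the static array stores the address of a local function, and the id is only read when the entry is resolved at runtime.

```c
#include <assert.h>
#include <stddef.h>

typedef int demo_rsrc_id;

/* Stands in for an id that is owned by another DLL and assigned at startup. */
demo_rsrc_id demo_server_globals_id = 0;

/* Local accessor: the id is looked up when called, not at initialization time. */
static demo_rsrc_id *demo_server_globals_id_getter(void)
{
    return &demo_server_globals_id;
}

/* Hypothetical, simplified ini entry. */
typedef struct demo_ini_entry {
    const char *name;
    demo_rsrc_id *(*globals_id)(void); /* replaces a stored &some_globals_id */
    size_t offset;                     /* offset of the setting in the globals struct */
} demo_ini_entry;

/* Static initialization now only needs the address of a local function. */
const demo_ini_entry demo_ini_entries[] = {
    { "demo.setting", demo_server_globals_id_getter, 0 },
    { NULL, NULL, 0 }
};

/* Resolve the id behind an entry at runtime. */
demo_rsrc_id demo_ini_resolve_id(const demo_ini_entry *entry)
{
    assert(entry != NULL && entry->globals_id != NULL);
    return *entry->globals_id();
}
```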
What do you think about my solution ideas? Maybe there are also some other
approaches we haven't taken into account yet? Depending on whether we can use
the function calls, maybe it'd even make sense to make something like
libtsrm to encapsulate all the TS features.
Thanks
Anatol
Hi Anatol,
I didn't completely get your ideas, but if tsrm_ls_cache can't be exported
on Windows directly, can we have a copy of tsrm_ls_cache in each DLL/EXE
and initialize it once?
Thanks. Dmitry.
On Sun, Sep 21, 2014 at 9:19 PM, Anatol Belski anatol.php@belski.net
wrote:
[...]
Hi Dmitry,
Joe and I have been working on this and there is a working version now.
Generally it suffers from some issues already present in master, but all
things considered it's a working cross-platform approach. Please look at the
native-tls branch.
For the current variant I used the idea from the original RFC, but removed
exporting the TSRM cache through a __thread variable, as it's not portable.
I've also removed the offset logic from the RFC patch, as that brought
additional hard-to-find bugs, especially into the current unstable version.
I don't think it's necessary to copy the arbitrary globals structs in
every ext; furthermore, I think it's not easily possible without some big
overhead. However, even with the current native-tls branch I'm able to run
wordpress and symfony, and ab -c 8 -n 2048 passes, also with multiple calls.
Still, some Apache bugs are already reported against master, and I can also
reproduce some others, mostly arbitrary shutdown crashes in Apache (so the
TS version). As they're in master, they're for sure in native-tls.
PHP happens to have always used TLS; however, the pointer was passed
directly to the functions. In TSRM.c, that's tsrm_tls_get/tsrm_tls_set.
Now, a function wrapper is used to fetch the TLS cache directly in the
TSRMG macro. This makes the whole thing slower, but allows getting rid of the
TSRMLS_* macros. The big question is how to optimize the function call to
speed up the whole. Maybe one can speed it up by saving a tsrm ls cache
pointer locally per extension or code area. ATM we're checking the functional
part; then one can proceed further with removing the TSRMLS_* macros. Any
speedup or improvement thoughts are welcome.
Possible directions of further work after the known bugs are fixed (in
master or in native-tls); some are mutually exclusive:
- reimplement the offset logic instead of arrays for the globals structs
- share the tsrm cache pointer globally to some scope, like extension or sapi
- remove the linked lists logic and use TLS explicitly
- improve locking
Thanks
Anatol
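
To make the trade-off described in this message concrete, here is a rough model of the two access styles. It is a sketch under assumptions, not the actual TSRM code: demo_fetch stands in for the function wrapper used inside the TSRMG macro, demo_sum_explicit for the old style where the pointer travels as an extra argument, and the call counter exists only to make the cost visible.

```c
#include <assert.h>
#include <stddef.h>

#if defined(_MSC_VER)
# define DEMO_TLS __declspec(thread)
#else
# define DEMO_TLS __thread
#endif

/* Hypothetical globals struct of some module. */
typedef struct demo_globals {
    int error_level;
} demo_globals;

static DEMO_TLS demo_globals *demo_slot = NULL;
/* Counts wrapper invocations; illustration only, not thread safe. */
static unsigned long demo_fetch_calls = 0;

void demo_bind(demo_globals *g)
{
    assert(g != NULL);
    demo_slot = g;
}

/* Function wrapper: one call per global access. */
demo_globals *demo_fetch(void)
{
    demo_fetch_calls++;
    return demo_slot;
}

unsigned long demo_fetch_count(void)
{
    return demo_fetch_calls;
}

/* Old style: the pointer is an explicit extra argument. */
int demo_sum_explicit(demo_globals *ls, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++) {
        sum += ls->error_level;
    }
    return sum;
}

/* Wrapper style: no extra argument, but a fetch on every access. */
#define DEMO_G(field) (demo_fetch()->field)

int demo_sum_wrapped(int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++) {
        sum += DEMO_G(error_level);
    }
    return sum;
}
```

Both functions return the same value; the wrapped variant simply pays one demo_fetch call per loop iteration, which is where a cached pointer per extension or code area would help.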
Hi Anatol.
I'll take a look on Tuesday or Wednesday.
Thanks. Dmitry.
On Sat, Sep 27, 2014 at 12:59 AM, Anatol Belski anatol.php@belski.net
wrote:
[...]
Hi,
I took a quick look over the patch.
I didn't get why it's named "native_tls" now, because it doesn't use
"__thread" variables anymore.
Actually, the patch now gets rid of the additional TSRMLS_* arguments, but
performs nearly the same thing as TSRMLS_FETCH() on each module global
access. This leads to a huge slowdown.
bench.php:
non-zts: 1.222 sec
zts: 1.362 sec
native_tls: 1.785 sec
I think the patch makes no sense in this state.
It looks like on Windows we can't use __declspec(thread) in DLLs loaded
using LoadLibrary() at all, so we won't be able to build mod_php for Apache or
use __declspec(thread) for module globals of extensions built as DLLs.
On Linux it must be possible, but it depends on the TLS model (gcc
-ftls-model=...). The "global-dynamic" model must work in all cases, but I'm
not sure about performance, because it'll lead to an additional function call
for each "__thread" variable access (maybe I'm wrong). Better models (like
"initial-exec") have some limitations. I don't have enough experience to
say if they could work for us.
Thanks. Dmitry.
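
For reference, GCC and Clang also allow the TLS model to be chosen per variable with the tls_model attribute instead of the global -ftls-model switch. The sketch below only shows the syntax; the demo_* names are hypothetical, and whether "initial-exec" is usable for PHP's shared modules is exactly the open question raised above.

```c
#include <assert.h>

#if defined(_MSC_VER)
# define DEMO_THREAD __declspec(thread)
#else
# define DEMO_THREAD __thread
#endif

/* The attribute is a GCC/Clang extension; other compilers get an empty macro. */
#if defined(__GNUC__)
# define DEMO_TLS_IE __attribute__((tls_model("initial-exec")))
#else
# define DEMO_TLS_IE
#endif

/* Model picked by the compiler defaults or by -ftls-model=... */
static DEMO_THREAD int demo_default_counter;

/* Model forced to initial-exec for this one variable. */
static DEMO_THREAD int demo_ie_counter DEMO_TLS_IE;

/* Both counters start at zero in every thread and are bumped together. */
int demo_bump_both(void)
{
    int a = ++demo_default_counter;
    int b = ++demo_ie_counter;
    assert(a == b); /* same thread, same number of bumps */
    return a;
}
```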
Hi Dmitry,
thanks for taking a look at this.
Hi,
I took a quick look over the patch.
I didn't get why it's named "native_tls" now, because it doesn't use
"__thread" variables anymore.
I was wondering myself but now I see (intentionally taking the 5.2 source)
http://lxr.php.net/xref/PHP_5_2/TSRM/TSRM.c#282
http://lxr.php.net/xref/PHP_5_2/TSRM/TSRM.c#329
We already use TLS :) It took quite some time to understand this.
Actually, the patch now gets rid of the additional TSRMLS_* arguments, but
performs nearly the same thing as TSRMLS_FETCH() on each module global
access. This leads to a huge slowdown.
bench.php:
non-zts: 1.222 sec
zts: 1.362 sec
native_tls: 1.785 sec
I think the patch makes no sense in this state.
Absolutely, this state is just to show we can drop the TSRMLS_* things
without hurting the functional part. At least I'm glad you have noticed no
regression in functionality, just the slowdown.
It looks like on Windows we can't use __declspec(thread) in DLLs loaded
using LoadLibrary() at all, so we won't be able to build mod_php for Apache or
use __declspec(thread) for module globals of extensions built as DLLs.
On Linux it must be possible, but it depends on the TLS model (gcc
-ftls-model=...). The "global-dynamic" model must work in all cases, but I'm
not sure about performance, because it'll lead to an additional function call
for each "__thread" variable access (maybe I'm wrong). Better models (like
"initial-exec") have some limitations. I don't have enough experience to
say if they could work for us.
As for the Linux part - yeah, the gcc linker does the great magic which
makes __thread variables work between shared objects. The function call is
needed specifically because on Windows it is not possible to do
__declspec(dllexport) __declspec(thread). As for __declspec(thread) being
unusable when explicitly loaded - that's not true anymore since Vista.
Please read here
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684175%28v=vs.85%29.aspx
"Windows Server 2003 and Windows XP: The Visual C++ compiler supports a
syntax that enables you to declare thread-local variables:
_declspec(thread). If you use this syntax in a DLL, you will not be able
to load the DLL explicitly using LoadLibrary on versions of Windows prior
to Windows Vista. If your DLL will be loaded explicitly, you must use the
thread local storage functions instead of _declspec(thread). For an
example, see Using Thread Local Storage in a Dynamic Link Library."
So this is not an issue, as we won't support XP in PHP7 anyway. The actual
issue is that a DLL cannot export a thread-specific variable, but it can
perfectly well gain access to it through an explicit TLS storage query.
MSDN also provides a snippet on how it works
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686997%28v=vs.85%29.aspx
While investigating this, I've added one more DLL which is loaded through
LoadLibrary, and it worked the same way as the DLL linked implicitly using
a .lib file. Let me know if you'd like to look at my investigation code for
this. Just to summarize:
- TLS is used all the way since at least 5.2
- the portable way to share the tls storage is by accessing it through
tsrm_tls_get/tsrm_tls_set macros
- function calls slow down everything
That is what I mean: the patch functionally does the same as what the
mainline does, but allows removing the TSRMLS_* macros.
Btw. the thread keys get allocated by Apache already. Maybe we have to
care about that in some other SAPI, but I'm not sure there is another one
besides mpm_worker/mpm_winnt which can exhaust all the TS potential. In
Apache it's spread over several sources; however, here is the essential
part
http://svn.apache.org/viewvc/apr/apr/trunk/threadproc/win32/threadpriv.c?view=markup
My current idea on how to speed it up: the __thread or __declspec(thread)
variables can be used and are both portable within the same binary unit
(say .so, .dll, .exe, etc.). Once we have a resource pointer, it can be
cached in a variable local to that unit. In some header, it would be
declared like
extern TSRM_TLS void *tsrm_cache;
And in one .c file it would be properly defined. The header would make it
accessible from all the .c files in the same binary unit (all object files
linked together). Of course, this variable has to be updated once per
thread before any globals can be read. What is needed is to find the
correct places to update that variable. Such a variable will
unfortunately have to be defined in every so/dll/exe and then updated
accordingly. But getting the tsrm cache per function call will still work.
Maybe something comes to your mind where such correct places should be?
zend_startup() and zend_activate()? Once it works, it might even solve
some perf issues on Linux too, if I understood it correctly.
Regards
Anatol
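
The per-binary cache proposed in this message could look roughly like the sketch below. Everything here is an assumption made for illustration: demo_core_get_ls stands in for the slow portable lookup exported by the core, demo_ext_cache for the thread-local pointer each so/dll/exe would define for itself, and demo_ext_update_cache for the code that would run at a yet to be determined place once per thread or request.

```c
#include <assert.h>
#include <stddef.h>

#if defined(_MSC_VER)
# define DEMO_TLS __declspec(thread)
#else
# define DEMO_TLS __thread
#endif

/* Hypothetical globals struct. */
typedef struct demo_globals {
    int precision;
} demo_globals;

/* "Core" side: slow lookup that every binary can call. */
static DEMO_TLS demo_globals demo_core_storage;
static unsigned long demo_slow_lookups = 0; /* illustration only, not thread safe */

void *demo_core_get_ls(void)
{
    demo_slow_lookups++;
    return &demo_core_storage;
}

unsigned long demo_slow_lookup_count(void)
{
    return demo_slow_lookups;
}

/* "Extension" side: the declaration would live in a header of that binary... */
extern DEMO_TLS void *demo_ext_cache;
/* ...and exactly one .c file of the same binary defines it. */
DEMO_TLS void *demo_ext_cache = NULL;

/* Must run in each thread before any global is read through the cache. */
void demo_ext_update_cache(void)
{
    demo_ext_cache = demo_core_get_ls();
}

/* Global access is now a plain thread-local load, no function call. */
#define DEMO_EXT_G(field) (((demo_globals *)demo_ext_cache)->field)

int demo_ext_read_precision(void)
{
    assert(demo_ext_cache != NULL);
    return DEMO_EXT_G(precision);
}
```

After one update per thread, repeated reads no longer touch the slow lookup; the price is that every binary unit has to keep its own copy of the pointer up to date.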
Hi Anatol,
I know, TSRM uses TLS APIs internally.
In my opinion, the simplest (and probably most efficient) way to get rid of
TSRMLS_DC arguments and TSRMLS_FETCH calls would be to introduce a global
thread-specific variable.
__thread void ***tsrm_ls;
As I understood, it won't work on Windows anyway, because the Windows linker
is not smart enough to use TLS variables across different DLLs.
Maybe it's possible to have a local thread-specific copy of tsrm_ls for
each DLL, but then we would have to keep them consistent...
Sorry, I can't give you any advice, and can't spend a lot of time on this
topic.
Maybe the description of TLS internals on ELF systems would give you some
ideas.
http://www.akkadia.org/drepper/tls.pdf
Thanks. Dmitry.
Hi Dmitry,
I've reworked this patch to take one pointer per shared unit. Please see
here
http://git.php.net/?p=php-src.git;a=commitdiff;h=76081df168829a5cc0409fac47c217d4927ec6f6
(though this was just the first in the series). Afterwards I've adapted
ext/standard and also converted ext/sockets as an exemplary item because
it's usually compiled shared.
With this change I see much better performance: the difference is in the
50-100 ms range compared to the master TS build. Particular positions in
bench.php even show somewhat better results.
However, this is not a global __thread variable, but one local to every
shared unit. That is, tsrm_ls will have to be declared in every so, dll or
exe and updated on request. For now I've put the update code into MINIT and
into the first ctor called (zmm is the one in php7ts.dll). The ctor seems
to be the only reliable place (but maybe I'm wrong); although it'll be
called for every request instead of once per thread, that won't be very bad.
I'd suggest going this way so we have the same flow everywhere.
Regards
Anatol
Hi Dmitry,
Here are just the results from Zend/bench.php done on 64 bit Linux and
Windows
Test (sec)      master ts   native-tls  master ts   native-tls
                linux       linux       windows     windows
simple          0.112       0.072       0.096       0.098
simplecall      0.036       0.048       0.046       0.048
simpleucall     0.129       0.180       0.137       0.107
simpleudcall    0.135       0.161       0.124       0.109
mandel          0.317       0.311       0.283       0.285
mandel2         0.340       0.322       0.346       0.338
ackermann(7)    0.086       0.128       0.089       0.093
ary(50000)      0.010       0.010       0.009       0.009
ary2(50000)     0.009       0.009       0.007       0.007
ary3(2000)      0.173       0.159       0.130       0.140
fibo(30)        0.291       0.394       0.231       0.250
hash1(50000)    0.027       0.029       0.023       0.025
hash2(500)      0.023       0.024       0.020       0.020
heapsort(20000) 0.070       0.067       0.078       0.080
matrix(20)      0.075       0.070       0.065       0.070
nestedloop(12)  0.188       0.129       0.162       0.189
sieve(30)       0.062       0.063       0.045       0.047
strcat(200000)  0.013       0.011       0.012       0.010
Total           2.095       2.186       1.903       1.925
Measured on real hardware, no VMs.
Regards
Anatol
Moin Dmitry,
Hi Dmitry,
Hi Anatol,
I know, TSRM uses TLS APIs internally.
In my opinion, the simplest (and probably efficient) way to get rid
of TSRMLS_DC arguments and TSRMLS_FETCH calls, would be introducing a
global thread specific variable.__thread void ***tsrm_ls;
As I understood it won't work on Windows anyway, because windows
linker is not smart enough to use TLS variables across different DLLs.
May be
it's possible to have a local thread specific copy of tsrm_ls for each
DLL, but
then we should make them to be consistent...Sorry, I can't give you any advice, and can't spend a lot of time on
this topic.May be description of TLS internals on ELF systems would give you
some ideas.http://www.akkadia.org/drepper/tls.pdf
Thanks. Dmitry.
I've reworked this patch to use one pointer per shared unit. Please
see here:
http://git.php.net/?p=php-src.git;a=commitdiff;h=76081df168829a5cc0409fac47c217d4927ec6f6
(though this was just the first in the series). Afterwards I adapted
ext/standard and also converted ext/sockets as an example, because
it's usually compiled shared.
With this change I see much better performance: the difference is in the
50-100 ms range compared to the master TS build. Some individual entries
in bench.php even show a better result.
However, this is not a global __thread variable, but one local to
every shared unit. That is, tsrm_ls has to be declared in every so, dll
or exe and updated on request. For now I've put the update code in MINIT
and into the first ctor called (zmm is the one in php7ts.dll). The
ctor seems to be the only reliable place (but maybe I'm wrong); although
it'll be called for every request instead of once per thread, that won't
be very bad.
I'd suggest going this way so we have the same flow everywhere.
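The per-unit approach can be sketched roughly as follows. This is only an illustration in a single file: the two "units" stand in for separately linked binaries (say the main DLL and a shared extension), and every name is hypothetical rather than taken from the patch. The point is that each unit keeps its own thread-specific pointer and refreshes it from one authoritative lookup in its startup hook.

```c
#include <assert.h>
#include <stddef.h>

#if defined(_MSC_VER)
# define DEMO_TLS __declspec(thread)
#else
# define DEMO_TLS __thread
#endif

/* "Core": owns the authoritative per-thread data (placeholder payload). */
static DEMO_TLS int demo_thread_data = 42;

/* Counts the expensive lookups; a plain int is enough for this demo. */
static int demo_slow_lookups = 0;

/* Stands in for an exported function that performs the real TLS lookup. */
static void *demo_get_ls(void)
{
    demo_slow_lookups++;
    return &demo_thread_data;
}

/* Unit A (think: the main binary) has its own cached pointer. */
static DEMO_TLS void *unit_a_cache = NULL;

static void unit_a_startup(void)
{
    unit_a_cache = demo_get_ls();
}

static int unit_a_read(void)
{
    assert(unit_a_cache != NULL);
    return *(int *)unit_a_cache;
}

/* Unit B (think: a shared extension) declares a separate cached pointer. */
static DEMO_TLS void *unit_b_cache = NULL;

static void unit_b_startup(void)
{
    unit_b_cache = demo_get_ls();
}

static int unit_b_read(void)
{
    assert(unit_b_cache != NULL);
    return *(int *)unit_b_cache;
}
```

After each unit has run its startup hook once in a thread, every later read goes through the unit's own pointer and never calls demo_get_ls() again. The price is that each unit needs a reliable place to do the refresh, which is the MINIT/ctor question raised above.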
The perf issue is fixed now. Only the core is converted so far, but here are
the Zend/bench.php results on 64-bit:
master ts linux
simple 0.158
simplecall 0.050
simpleucall 0.148
simpleudcall 0.151
mandel 0.310
mandel2 0.337
ackermann(7) 0.088
ary(50000) 0.010
ary2(50000) 0.009
ary3(2000) 0.154
fibo(30) 0.285
hash1(50000) 0.029
hash2(500) 0.023
heapsort(20000) 0.072
matrix(20) 0.082
nestedloop(12) 0.204
sieve(30) 0.062
strcat(200000) 0.014
Total 2.185
native-tls linux
simple 0.072
simplecall 0.036
simpleucall 0.163
simpleudcall 0.169
mandel 0.297
mandel2 0.354
ackermann(7) 0.123
ary(50000) 0.010
ary2(50000) 0.009
ary3(2000) 0.158
fibo(30) 0.396
hash1(50000) 0.030
hash2(500) 0.024
heapsort(20000) 0.072
matrix(20) 0.069
nestedloop(12) 0.130
sieve(30) 0.054
strcat(200000) 0.011
Total 2.178
master ts windows
simple 0.100
simplecall 0.048
simpleucall 0.146
simpleudcall 0.120
mandel 0.292
mandel2 0.364
ackermann(7) 0.091
ary(50000) 0.009
ary2(50000) 0.008
ary3(2000) 0.133
fibo(30) 0.238
hash1(50000) 0.025
hash2(500) 0.020
heapsort(20000) 0.076
matrix(20) 0.069
nestedloop(12) 0.168
sieve(30) 0.048
strcat(200000) 0.011
Total 1.965
native-tls windows
simple 0.100
simplecall 0.050
simpleucall 0.108
simpleudcall 0.110
mandel 0.292
mandel2 0.347
ackermann(7) 0.097
ary(50000) 0.009
ary2(50000) 0.008
ary3(2000) 0.140
fibo(30) 0.280
hash1(50000) 0.025
hash2(500) 0.021
heapsort(20000) 0.075
matrix(20) 0.072
nestedloop(12) 0.176
sieve(30) 0.048
strcat(200000) 0.010
Total 1.969
There is still some room for improvement (for instance the fibo results),
but overall the results show at least the same performance now. What do you
think, guys?
Regards
Anatol
Hi Anatol,
Thanks for the update. I'll take a look a bit later, but the performance
difference looks quite good now.
Dmitry.
Hi Anatol,
At first, I still saw the same big difference on Linux:
bench.php ZTS - 1.340 sec, native TLS - 1.785 sec.
As I understand it, this must be caused by incomplete changes in the build
scripts related to ZEND_ENABLE_STATIC_TSRMLS_CACHE. Right?
If I get it right, the main PHP binary should be compiled with
-DZEND_ENABLE_STATIC_TSRMLS_CACHE=1 and shared extensions without it. This
should lead to quite fast code in the main PHP binary and statically linked
extensions, but to slow code in shared extensions. Right?
I built PHP this way with all extensions linked statically. Now I see a
small slowdown on bench.php (although according to callgrind it executes
fewer instructions and should be faster). Wordpress became 2% faster.
So the patch becomes interesting. :)
However, many distributions prefer shared extensions, and it would be great
to invent some trick to make them fast too.
I would also prefer to keep the semantic patch small and not delete all the
FETCH_TSRM() calls in thousands of places (at this point).
Replacing the macro in one place must be easier.
It's not a problem to remove them in a second step if the PoC really
works.
Thanks. Dmitry.
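The build split described above could look roughly like the sketch below. It is written under assumptions: the macro and function names (DEMO_ENABLE_STATIC_CACHE, DEMO_G, demo_fetch and so on) are made up for illustration, and only the idea of switching a single accessor macro at compile time comes from the discussion.

```c
#include <assert.h>
#include <stddef.h>

#if defined(_MSC_VER)
# define DEMO_TLS __declspec(thread)
#else
# define DEMO_TLS __thread
#endif

/* Default to the fast path; a shared extension would be built with =0. */
#ifndef DEMO_ENABLE_STATIC_CACHE
# define DEMO_ENABLE_STATIC_CACHE 1
#endif

/* Hypothetical per-thread globals. */
typedef struct {
    int error_level;
} demo_globals;

static DEMO_TLS demo_globals demo_thread_globals = { 7 };
static int demo_fetch_calls = 0;

/* Slow path: one function call for every access. */
static demo_globals *demo_fetch(void)
{
    demo_fetch_calls++;
    return &demo_thread_globals;
}

#if DEMO_ENABLE_STATIC_CACHE
/* Fast path: fetch once, then read through a static thread-local pointer. */
static DEMO_TLS demo_globals *demo_cache = NULL;
# define DEMO_CACHE_UPDATE() ((void)(demo_cache = demo_fetch()))
# define DEMO_G(field)       (demo_cache->field)
#else
# define DEMO_CACHE_UPDATE() ((void)0)
# define DEMO_G(field)       (demo_fetch()->field)
#endif

/* The call site is identical in both builds; only the macros change. */
static int demo_sum_error_level(int n)
{
    int i, sum = 0;

    assert(n >= 0);
    DEMO_CACHE_UPDATE();
    for (i = 0; i < n; i++) {
        sum += DEMO_G(error_level);
    }
    return sum;
}
```

With the cache enabled, the loop costs one fetch in total; with it disabled, it costs one fetch per iteration. That is the fast-core versus slow-shared-extension gap mentioned above, and it also shows why changing the macro in one place is a smaller patch than touching every call site.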