In a similar vein to approving the use of software, Roman Pronskiy asked for my help putting together an RFC on collecting analytics for PHP.net.
https://wiki.php.net/rfc/phpnet-analytics
Of particular note:
- This is self-hosted, first-party only. No third parties get data, so no third parties can do evil things with it.
- There is no plan to collect any PII.
- The goal is to figure out how to most efficiently spend Foundation money improving php.net, something that is sorely needed.
Ideally we'd have this in place by the 8.4 release or shortly thereafter, though I realize that's a little tight on the timeline.
--
Larry Garfield
larry@garfieldtech.com
Hey Larry,
I have a couple concerns and questions:
Is there a way to track analytics with only transient data? As in, data
actually stored is always already anonymized enough that it would be
unproblematic to share it with everyone?
Or possibly, is there a retention period for the raw data after which
only anonymized data remains?
Do you actually have a plan for what to use that data for? The RFC mostly
talks about "high traffic". But does that mean anything? I look at a
documentation page because I need to look something specific up (what
was the order of arguments of strpos again?). I may only look at it
briefly. Maybe even often. But that has absolutely zero signal on whether
the documentation page is good enough. In that case I don't look at the
comments either. Comments are something you rarely look at, mostly only
the first time you even use a function.
Also, I don't buy the argument that none of this can be derived from server
logs. Let's see what the RFC names:
- Time-on-page
- Whether they read the whole page or just a part
- Whether they even saw comments
Yes, these need a client-side tracker. But I doubt the usefulness of the
signal. You don't know who is reading the page. Is it someone who is
already familiar with PHP and is looking up a detail? They'll quickly find
just the one part they need. Is it someone who is new to PHP and is trying
to understand it? They may well read the whole page. But you can't tell
the two apart.
Quality of documentation is measured by whether it's possible to grasp the
information easily, not by how long or how completely a page is being read.
- What percentage of users get to the docs through direct links vs the home page
That's something you can generally infer from server logs - was the home
page accessed from that IP right before another page was opened? It's
not as accurate, but for a general understanding of orders of magnitude
it's good enough.
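Roughly, that kind of analysis could look like the sketch below (the log
path, the combined log format and the five-minute window are assumptions
for illustration, not anything the RFC specifies):

<?php
// Estimate "came via the home page" vs. "direct link" per docs hit,
// from a standard combined-format access log.
$window = 300;        // seconds; arbitrary choice
$lastHomeHit = [];    // ip => timestamp of most recent "/" request
$viaHome = $direct = 0;

$log = fopen('/var/log/nginx/access.log', 'r');
if ($log === false) {
    exit("cannot open access log\n");
}
while (($line = fgets($log)) !== false) {
    // e.g. 1.2.3.4 - - [10/Oct/2024:13:55:36 +0000] "GET /manual/en/function.strpos.php HTTP/1.1" ...
    if (!preg_match('~^(\S+) \S+ \S+ \[([^\]]+)\] "GET ([^" ]+)~', $line, $m)) {
        continue;
    }
    [, $ip, $time, $path] = $m;
    $dt = DateTime::createFromFormat('d/M/Y:H:i:s O', $time);
    if ($dt === false) {
        continue;
    }
    $ts = $dt->getTimestamp();

    if ($path === '/' || $path === '/index.php') {
        $lastHomeHit[$ip] = $ts;
    } elseif (str_starts_with($path, '/manual/')) {
        if (isset($lastHomeHit[$ip]) && ($ts - $lastHomeHit[$ip]) <= $window) {
            $viaHome++;
        } else {
            $direct++;
        }
    }
}
fclose($log);

printf("docs hits via home page: %d, direct: %d\n", $viaHome, $direct);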
- If users are hitting a single page per browser window or navigating through the site, and if the latter, how?
Number of windows needs a client side tracker too. Knowing whether the
cross-referencing links (e.g. "See also") are used is possibly relevant.
And also "what functions are looked up after this function".
- How much are users using the search function? Is it finding what they want, or is it just a crutch?
How much is probably greppable from the server logs as well. Whether they
find what they want - I'm not sure how you'd determine that. I search for
something ... and possibly open a page. If that's not what I wanted, I'll
leave the site and e.g. use Google. If it is what I wanted, I'll also stop
looking after that page.
- Do people use the translations alone, or do they use both the English site and other languages in tandem? Does anyone use multiple translations?
That's likely also determinable by server logs.
And yeah, server logs are more locked down. But that's something you can
fix. I hope the raw analytics data will be just as locked down as the
server logs...
I get that "cached by another proxy" is a possible problem, but I think
it's a strawman. You don't need to be able to track all users, just many
of them.
Overall I feel like the signal we can get from using a JS tracker
specifically is comparatively low to the point it's not actually worth it.
Bob
- What percentage of users get to the docs through direct links vs the home page

That's something you can generally infer from server logs - was the home
page accessed from that IP right before another page was opened? It's
not as accurate, but for a general understanding of orders of magnitude
it's good enough.
Even better: If we're talking about internal navigation you can check
the referrer header and know for sure, since the docs don't add
rel=noreferrer on links or anything.
You shouldn't need server logs or client-side JS. A lot of this
tracking stuff could be done by just putting down a proxy or shim that
checks request headers. It looks like Matomo offers exactly this via
matomo/matomo-php-tracker.
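Something like this, for instance (a sketch only - the Matomo URL, site ID
and token are placeholders, and the exact MatomoTracker calls should be
checked against the matomo/matomo-php-tracker docs):

<?php
// Server-side shim: report one page view to a self-hosted Matomo instance
// using nothing but the incoming request headers. All concrete values
// (endpoint, site id, token) are placeholders.
require __DIR__ . '/vendor/autoload.php';

$tracker = new MatomoTracker(1, 'https://analytics.example.org/');
$tracker->setTokenAuth(getenv('MATOMO_TOKEN') ?: ''); // needed if we set the IP ourselves
$tracker->disableCookieSupport();                     // no client-side state at all

$tracker->setUrl('https://www.php.net' . ($_SERVER['REQUEST_URI'] ?? '/'));
$tracker->setUrlReferrer($_SERVER['HTTP_REFERER'] ?? '');
$tracker->setUserAgent($_SERVER['HTTP_USER_AGENT'] ?? '');

// Blank out the last IPv4 octet before it ever reaches Matomo (sketch;
// IPv6 addresses are passed through unchanged here).
$ip = $_SERVER['REMOTE_ADDR'] ?? '0.0.0.0';
$tracker->setIp(preg_replace('/\.\d+$/', '.0', $ip));

$tracker->doTrackPageView($_SERVER['REQUEST_URI'] ?? '/');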
I second Bob's general sentiment: there's no need for client-side tracking.
Further, most (all?) devs I know tend to use Pi-holes and other tracking blockers. Devs are notoriously hard people to track via client-side analytics. If we went with a client-side solution, I would hope that we use a dedicated domain for ingestion so that the tracking can be easily blocked. It will still be blocked either way, but some people would rather block the entire domain (or go to other mirrors/sites with the documentation) than be tracked.
For the case of detecting comment views server-side, you could always load the comments div asynchronously once the scroll position goes past a certain point and inject it into the DOM (see: htmx). This has pretty poor usability, but it works and might even make pages with lots of comments load faster. For people not using JavaScript, a simple button that reloads the page with comments (?comments=1, say) should be enough and would provide the desired analytics as well.
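The no-JS fallback could be as simple as the sketch below (hypothetical
helper and markup, nothing from the actual php.net templates - the point is
just that the ?comments=1 request itself becomes the server-side signal):

<?php
// User notes are only rendered when explicitly requested, so every
// "?comments=1" hit in the access log doubles as the "comments were viewed"
// signal. The helper and markup below are made up for illustration.
function renderUserNotes(): void
{
    echo '<div id="usernotes"><!-- user-contributed notes would go here --></div>';
}

if (($_GET['comments'] ?? '') === '1') {
    renderUserNotes();
} else {
    $url = htmlspecialchars($_SERVER['REQUEST_URI'] ?? '/');
    $sep = str_contains($url, '?') ? '&amp;' : '?';
    echo '<a href="' . $url . $sep . 'comments=1#usernotes">Show user notes</a>';
}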
— Rob
Is there a way to track analytics with only transient data? As in, data
actually stored is always already anonymized enough that it would be
unproblematic to share it with everyone?
Or possibly, is there a retention period for the raw data after which
only anonymized data remains?
The plan is to configure Matomo to not collect anything non-anonymous to begin with, to the extent possible. We're absolutely not talking about user-stalking like ad companies do, or anything even remotely close to that.
I'm not convinced that publishing raw, even anonymized, data is valuable or responsible. I don't know of any other sites offhand that publish their raw analytics, and I don't know what purpose that would serve other than just a principled "radical transparency" stance, which I generally don't agree with.
However, having an automated aggregate dashboard similar to https://analytics.bookstackapp.com/bookstackapp.com (made by a different tool, but same idea) that we could make public is the goal, but we don't want to do that until it's been running a while and we're sure that nothing personally identifiable could leak through that way.
Do you actually have a plan for what to use that data for? The RFC mostly
talks about "high traffic". But does that mean anything? I look at a
documentation page because I need to look something specific up (what
was the order of arguments of strpos again?). I may only look at it
briefly. Maybe even often. But that has absolutely zero signal on whether
the documentation page is good enough. In that case I don't look at the
comments either. Comments are something you rarely look at, mostly only
the first time you even use a function.
Right now, the key problem is that there's a lot of "we don't know what we don't know." We want to improve the site and docs, the Foundation wants to spend money on doing so, but other than "fill in the empty pages" we have no definition of "improve" to work from. The intent is that better data will give us a better sense of what "improve" even means.
It would also be useful for marketing campaigns, even on-site. E.g., if we spend the time to write a "How to convince your boss to use PHP" page... how useful is it? From logs, all we could get is a page count. That's it. Or the PHP-new-release landing page that we've put up for the last several releases: do people actually get value out of it? Do they bother to scroll down through each section, or do they just look at the first one or two and leave, meaning the time we spent on the other items is wasted? Right now, we have no idea whether the time spent on those pages is even worthwhile.
Another example off the top of my head: right now, the enum documentation is spread across a dozen sub-pages. I don't know exactly why I did it that way in the first place rather than as one huge page, other than "huge pages bad." But are they bad? Would it be better to combine the enum docs back into fewer pages, or to split the long visibility page up into smaller ones? I have no idea. We need data to answer that.
It's also absolutely true that analytics are not the end of data collection. User surveys, usability tests, etc. are also highly valuable, and can get you a different kind of data. We should likely do those at some point, but that doesn't make automated analytics not useful.
Another concern with just using raw logs is that it would be more work to set up, and have more moving parts to break. Let's be honest, PHP has an absolutely terrible track record when it comes to keeping our moving parts working, and the Infra Team right now is tiny. The bus factor there is a concern. Using a client-side tracker is the more-supported, fewer-custom-scripts approach, which makes it easier for someone new to pick it up when needed.
Logs also will fold anyone behind a NAT together into a single IP, and thus "user." IP address is in general a pretty poor way of uniquely identifying people with the number of translation layers on the Internet these days.
Overall I feel like the signal we can get from using a JS tracker
specifically is comparatively low to the point it's not actually worth it.
Some more things a client-side tracker could do that logs cannot:
- How many people are accessing the site from a desktop vs mobile?
- What speed connection do people have?
- How many people are using the in-browser Wasm code runner that is currently being worked on? cf: https://github.com/php/web-php/pull/1097
Also, for reference, most language sites do have some kind of analytics, usually Google:
https://www.python.org – Plausible.io, Google Analytics
https://go.dev/ – Google Analytics
https://www.rust-lang.org/ – N/A
https://nodejs.org/ – Google Analytics
https://www.typescriptlang.org/ – N/A
https://kotlinlang.org/ – Google Analytics
https://www.swift.org/ – Adobe Analytics
https://www.ruby-lang.org/ – Google Analytics
We'd be the only one with a self-hosted option, making it the most privacy-conscious of the bunch.
As far as blocking the analytics goes, Matomo uses a cookieless approach, so it's rarely blocked (and would not need a GDPR-compliance banner). Even if someone wanted to block it, meh. We'd still be getting enough signal to make informed decisions.
--Larry Garfield
Overall I feel like the signal we can get from using a JS tracker
specifically is comparatively low to the point it's not actually worth it.

Some more things a client-side tracker could do that logs cannot:
- How many people are accessing the site from a desktop vs mobile?
- What speed connection do people have?
- How many people are using the in-browser Wasm code runner that is currently being worked on? cf: https://github.com/php/web-php/pull/1097
For the first, there's the user agent (again, matomo-php-tracker), as well
as media queries for transparent tracking with <link /> or CSS.
Even if someone wanted to block it, meh. We'd still be getting enough signal to make informed decisions.
Firefox famously keeps killing features thinking no one uses them, because
the people who do use them are savvy enough to turn off tracking.
Even if you do use client-side tracking, that's a good argument for having
server-side tracking anyway as a fallback.
Transparency is a big deal. Server-side analytics are OK because PHP
devs know what goes into an HTTP request. (And it's fairly limited in
scope by definition.) We don't know what goes into a request sent from a
black-box blob of minified JS.
If your JS just consisted of if (wasm) fetch(), I would be fine with
that, but it's actually a 66 KB minified JS file.
Perhaps you could just start with server-side tracking and see how it
goes? I'd be much happier with client-side tracking in the future if it's
voted on one metric at a time rather than as one big opaque file.