Just because a test is good at measuring performance for one metric doesn't mean that it's good for all metrics. The other day I posted about some JavaScript Library Loading Speed Tests that were done by the PBWiki team. I drew some conclusions about JavaScript library loading speed that, I think, were pretty interesting; however, I also mentioned some browser load performance results (at the end of the post) which were especially problematic. This brings up an important point about those performance results:
User-generated performance results are a double-edged sword.
Assuming that there's no cheating involved (which is a big assumption), quietly collecting data from users can provide interesting results. HOWEVER, how that data is analyzed can wildly affect the quality of your results. Analyze it correctly and you can start to get a picture of how JavaScript libraries perform on page load; analyze it incorrectly and you might conclude that specific browsers are broken, slow, or producing incorrect results.
There are a ton of examples of misinformation relating to browsers within the "Browser Comparison" results. I'm just going to list a bunch of the issues, showing how much of a problem user-generated browser performance data can be.
- The results show the numbers for Opera being heavily skewed. At first glance one might assume "oh, that's because Opera is slower at loading JavaScript files", but this is not the case at all. A more plausible explanation is that some users were testing the site with Opera Mobile (which performs poorly compared to a desktop browser).
- Safari 2 and Safari 3 are grouped together, which is highly suspect. By a number of measurements Safari 3 is much faster than Safari 2, so merging the two does the results no favors.
- Firefox 3 only has two results. A commenter mentioned that this was because Firefox 3 was being grouped into the "Netscape 6" category, which is, in and of itself, a poor place to lump it (a quick sketch of how that sort of mis-grouping happens follows this list).
- IE 7 is shown as being faster than IE 6. This may be the case; however, it's far more likely that users who are running IE 7 are on newer hardware (think: a new computer with Vista installed), meaning that, on average, IE 7 will appear to run faster than IE 6.
- Firefox, Opera, and Safari for Windows users are, generally, early adopters and technically savvy, meaning that they're also more likely to have high-performance hardware (giving those browsers an artificial advantage in the results).
- No attempt at a per-platform comparison is made (for example, Safari on Windows vs. Firefox on Windows, and Safari on Mac vs. Firefox on Mac). Lumping the results together provides an inaccurate view of actual browser performance.
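As a hypothetical illustration of how that kind of mis-grouping can happen (this is a guess at the mechanism, not a description of PBWiki's actual detection code): detection keyed off navigator.appName lumps Firefox in with Netscape, while parsing the User-Agent string keeps the browsers and versions apart.

```javascript
// navigator.appName reports "Netscape" in Firefox, so any grouping keyed off
// appName will file Firefox results under Netscape instead of Firefox.
function naiveName() {
  return navigator.appName; // "Netscape" in Firefox, "Microsoft Internet Explorer" in IE
}

// Parsing navigator.userAgent instead keeps the make and version separate.
function userAgentName(ua) {
  var m = /Firefox\/(\d+)/.exec(ua);
  if (m) return "Firefox " + m[1];
  m = /MSIE (\d+)/.exec(ua);
  if (m) return "IE " + m[1];
  return "Other";
}

// e.g. userAgentName(navigator.userAgent) === "Firefox 3" in a Firefox 3 beta
```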
There's one message that should be taken away from this particular case: Don't trust random-user-generated browser performance data. Until you control for confounding factors like platform, system load, and even hardware, it's incredibly hard to get meaningful data that is relevant to most users, or even remotely useful to browser vendors.
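As a rough sketch of what controlling for just one of those factors might look like, you could group submitted results by platform and browser before comparing medians, so that "Safari on a new Mac" is never averaged against "IE 6 on an old PC". The record fields here are hypothetical, not PBWiki's actual data format.

```javascript
// Hypothetical shape of one submitted result; PBWiki's actual data format
// isn't public, so these field names are assumptions:
//   { browser: "Firefox 2", platform: "Windows", loadTimeMs: 420 }

function median(values) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Bucket results by platform *and* browser before comparing, so that each
// median only describes one environment.
function summarize(results) {
  var groups = {};
  for (var i = 0; i < results.length; i++) {
    var key = results[i].platform + " / " + results[i].browser;
    (groups[key] = groups[key] || []).push(results[i].loadTimeMs);
  }
  var summary = {};
  for (var g in groups) {
    summary[g] = { samples: groups[g].length, medianMs: median(groups[g]) };
  }
  return summary;
}
```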
Axel Hecht (February 7, 2008 at 3:55 am)
Seems like Churchill had more than one good quote on statistics:
“Statistics are like a drunk with a lampost: used more for support than illumination.”
(http://www.quotationspage.com/quote/35671.html)
Francesco (February 7, 2008 at 4:13 am)
Great post, I definitely agree with you.
Mauvis (February 7, 2008 at 5:21 am)
I think the above quote is nonsensical. I believe this is the real quote and author:
“He uses statistics as a drunken man uses lamp-posts—for support rather than illumination.”
Andrew Lang. Quoted in The Harvest of a Quiet Eye, compiled by Alan L. Mackay (1977).
solnic (February 7, 2008 at 5:23 am)
Platform can be significant; for instance, I've noticed that Firefox on Windows runs JavaScript tests a lot faster than on Linux.
Frederico Caldeira Knabben (February 7, 2008 at 6:33 am)
John, I agree with many of your thoughts on this, especially the improper grouping of browsers in the results.
Regarding the "Browser Comparison", I think it is a matter of point of view; it depends on the intent behind the analysis of the results.
You have framed the analysis as a pure "browser wars" situation, arguing that, to get correct results, you should run all browsers in the same environment. If you want to check whether "my browser is faster than yours", then that's the way to go, and the PBWiki results are most probably not that accurate.
On the other hand, those comparison results are quite useful if you are looking at the marketplace as a whole, to understand how those browsers behave "in the real world". I mean, PBWiki is not a browser vendor. They want to know how JS applications behave in the average environment for each specific browser. So, looking at the data, you understand that IE6 is a crappy browser, which runs on crappy platforms, with crappy hardware, while IE7 is much better not only because of the browser, but also because of the better systems real users have around it. Based on that, as a JS developer, I could make some coding decisions to optimize things for the average crappy IE6, not for the IE6 that runs on my powerful dev notebook.
This is not to criticize your (as always) interesting article. You have a foot inside the browser wars world, and your point of view is perfect in that sense. I just wanted to point out another way to look at that research: you may "trust random-user-generated browser performance data", depending on your research needs.
Fotios (February 7, 2008 at 9:30 am)
Hey John,
I’m not entirely caught up on the whole speed testing thing or exactly how PBWiki conducted theirs, but I had an idea.
What would it take to write an automated test suite and then have average, everyday users visit the page? Their browsers could then submit the speed test results along with browser and some system information (rough sketch below).
If that made it to Digg or /., you could probably have tens of thousands of samples with good coverage of the browser space.
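A minimal sketch of what such a client-side harness might look like, assuming a hypothetical /collect endpoint on the server; the reported fields are just what the browser exposes via navigator:

```javascript
// Dynamically load a script, time how long it takes, and report the result.
// The "/collect" endpoint and the reported fields are placeholders.
function timeScript(src, callback) {
  var start = new Date().getTime();
  var script = document.createElement("script");
  script.onload = script.onreadystatechange = function () {
    // The readyState check covers older IE; other browsers fire onload.
    if (!script.readyState || /loaded|complete/.test(script.readyState)) {
      script.onload = script.onreadystatechange = null;
      callback(new Date().getTime() - start);
    }
  };
  script.src = src;
  document.getElementsByTagName("head")[0].appendChild(script);
}

timeScript("/libs/jquery.js", function (elapsedMs) {
  var report = [
    "t=" + elapsedMs,
    "ua=" + encodeURIComponent(navigator.userAgent),
    "platform=" + encodeURIComponent(navigator.platform)
  ].join("&");
  // Fire-and-forget image beacon; the server-side collector isn't shown.
  new Image().src = "/collect?" + report;
});
```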
solnic (February 7, 2008 at 9:39 am)
That's a cool idea, Fotios. I've recently created something like this for testing jQuery and Prototype performance: http://blog.solnic.eu/test_runner/index.html. The only thing I'm missing is submitting system information, but that's an easy thing to add. It'd be a good starting point, I guess.
Wade Harrell (February 7, 2008 at 1:23 pm)
@Fotios: Anonymous data would be too untrustworthy to be useful; registered users would be a little better. Now, if you were to email a couple dozen CTOs inviting their QA teams to participate in registered testing, that could be reliable data. Even if you just got 3 QA teams to spend a few minutes running your test, those numbers would carry much more weight than thousands of anonymous users.
Just a thought.
funtomas (February 7, 2008 at 1:29 pm)
I agree with you, John; download times depend in great part on CPU speed. See a research paper on parallelism.
Brennan Stehling (February 7, 2008 at 7:26 pm)
It just proves that 67% of statistics are made up.
Fotios (February 8, 2008 at 9:19 am)
@wade: How would it be untrustworthy? It's not like you would have the users manually enter their "scores". Yeah, you'll get a few outliers, but that's what statistical analysis is for. When 95% of the results fall in a consistent range, chances are that is the actual range. Plus, it would be interesting to see how varied the "same" versions of browsers/systems are. Of course, there are always going to be uncontrollable factors. I'm sure plugins like Firebug will jack up some render times, but then again, if there was a way to gather that data, it might be interesting.
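A sketch of the kind of trimming that analysis could involve: sort the samples, drop the extreme tails, and describe the middle chunk. The 95% cutoff mirrors the figure above; any robust statistic (median, interquartile range) would work just as well.

```javascript
// Keep the middle keepFraction of the sorted samples and report that range.
// keepFraction = 0.95 mirrors the "95% of the results" figure above.
function middleRange(samples, keepFraction) {
  var sorted = samples.slice().sort(function (a, b) { return a - b; });
  var drop = Math.floor(sorted.length * (1 - keepFraction) / 2);
  var kept = sorted.slice(drop, sorted.length - drop);
  return { low: kept[0], high: kept[kept.length - 1], count: kept.length };
}

// e.g. middleRange(loadTimesMs, 0.95) might give { low: 180, high: 640, count: 950 }
```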
Brennan Stehling (February 8, 2008 at 10:34 am)
@Fotios: But in order for that 95% to be reliable you need a sample pool that is large enough to properly represent the whole. I am not sure enough numbers have been gathered in this survey yet. And mixing Safari 2 & 3 and Firefox 2 and Netscape 7 seems to weaken the results for me. I think John's point about IE7 users likely being on more current hardware is an important detail. Beyond that detail, there's the speed of the internet connection and the speed of each of the nodes between the user and the web site. And given the throttling that ISPs have been tinkering with lately, the ISP may also be a huge factor. All I can take from this survey is that the smaller my JavaScript library the better, and that shrinking it with JSMin instead of Packer may be better for performance when the script is interpreted. That is good enough information for me. All in all, it may only save a very small amount of time. More and more often users stay on an AJAX/Web 2.0 page for quite a while, so the time to load a script is really not terribly important. What is becoming increasingly important is runtime performance and memory leaks. Fortunately I can sit back and wait for browser fixes for most memory and performance problems, as well as updates to the latest JS library versions, which also address these issues. I feel like most of the heavy lifting is done for me.
Wade Harrell (February 8, 2008 at 11:22 am)
@Fotios: The system information is really suspect if it is in any way manually entered, and to get really valuable system information (RAM, processor, drive speed, etc.) it will need to be. Back in the mid '90s I did three years of tech support for dial-up internet access. Every possible factor is important: the more knowledge you have about the environment you are dealing with, the better you will understand the performance numbers you are getting and the decisions you can make based on them.
Open that up to anonymous entry and you are going to have every warring faction (Mac vs. PC, anti-IE, pro-FF, pro-Opera, etc.) trying to break the results.
Automated-only numbers are slightly better, I guess, but so many factors are left out that you don't get a clear picture, just one that you can make high-level determinations from (as Brennan pointed out).
Johan Sundström (February 10, 2008 at 6:23 pm)
On the contrary: do trust it (insofar as you trust it not to have been intentionally tampered with to skew the results), but know what it is you measure. And most of all: realize that you can't do browser shootouts if you don't measure connection bandwidth, latency, and end-user hardware in the same test.
Remember that it's the user-tangible download time experience you are measuring, and that whatever trends you can map to User-Agent strings only give you a graph of the statistical likelihood that a user of that make and version will have to wait x, y, or z seconds for a library to load; not whether those numbers are outstanding or abysmal, given the expectations set by the network, hardware, and software conditions that generated them.