Lies, Damned Lies and Performance Tests
Mark Twain attributed the phrase "Lies, Damned Lies and Statistics" to British Prime Minister Benjamin Disraeli, which suits the Prime Minister's known wit, although its provenance has been questioned. If Twain or Disraeli had lived in the days of computers and software, one of them would probably have coined the phrase as "Lies, Damned Lies and Performance Tests." Perhaps Twain's great novel of Americans touring the desolate Holy Land of the late 19th century would have been called "The Innocents Online."
As of last week, I have a new favourite example of abused performance tests. A video from Surge 2013's Lightning Talks, kindly shared with me by the head of a New York-based financial IT firm, is amazing. I encourage anyone who has ever performed any kind of numerical calculation, ever, to watch the clip. The link points directly to the 5-minute presentation beginning at 17:19.
- Question: What happens when a platform that absolutely could not perform worse than another does exactly that?
- Answer: If your results smell bad, if they absolutely stink, something must be wrong.
What went wrong here? Good input to a good analysis engine was passed through an awful pre-processor to "clean the data up."
The engineer writing the tests realized, correctly, that the engine could only handle numbers. Obviously, "123,000" written as "123K" would not be acceptable, so the pre-processor chopped off the "K". Now, if I am comparing 123K and 100K, it is fair to eliminate the troublesome "K", since everything is on the same scale: 123,000:100,000 (1.23) is the same as 123:100 (also 1.23). But if not every number is on the same scale, you might start with 123K:1000 (a ratio of 123), chop off the "K", and end up with 123:1000, which is not the same thing at all (0.123)!
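To make the failure concrete, here is a minimal Python sketch of the two approaches. The function names and values are mine, not from the talk; the point is only that stripping the suffix is safe when every value carries it, and wildly wrong otherwise.

```python
def naive_clean(value: str) -> int:
    """Strip the 'K' suffix without rescaling -- the buggy approach."""
    return int(value.rstrip("K"))

def proper_clean(value: str) -> int:
    """Expand the 'K' suffix to its real magnitude before comparing."""
    if value.endswith("K"):
        return int(value[:-1]) * 1000
    return int(value)

a, b = "123K", "1000"                      # 123,000 vs 1,000 -- a true ratio of 123
print(naive_clean(a) / naive_clean(b))     # 0.123  -- wildly wrong
print(proper_clean(a) / proper_clean(b))   # 123.0  -- correct
```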
The engineer also realized that the analysis engine could handle only natural numbers, i.e. no fractions or decimals. No problem, he thought, we will filter out the period. Once again, if I am comparing 110.1 and 100.2, it is OK (if asking for trouble) to strip the periods, since 110.1:100.2 is the same as 1101:1002. But if not every number has the same count of digits to the right of the decimal point, it is very easy to end up believing that 110.1:100.234 is the same as 1101:100234!
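The same kind of sketch (again with hypothetical names) shows the decimal-point bug. Deleting the period is only safe when every value has the same number of digits after it; scaling every value by the same fixed power of ten is the safer way to feed an integer-only engine.

```python
def naive_strip(value: str) -> int:
    """Delete the period outright -- the buggy approach."""
    return int(value.replace(".", ""))

def scale_to_int(value: str, places: int = 3) -> int:
    """Scale every value by the same fixed power of ten instead."""
    return round(float(value) * 10 ** places)

a, b = "110.1", "100.234"
print(naive_strip(a) / naive_strip(b))     # 1101 / 100234  ~= 0.011 -- wrong
print(scale_to_int(a) / scale_to_int(b))   # 110100 / 100234 ~= 1.098 -- correct
```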
Needless to say, garbage in led to garbage out.
Although that was the proximate cause of the error, the root cause was the lack of human involvement: it takes intuition and knowledge to review the results, and human review to ensure that the inputs make sense.
I have seen numerous similar cases: performance tests comparing two processes where one performed better than the other in a scenario where that is nearly impossible, e.g. a process running directly on bare metal performing worse than the same process on a VM! Fortunately, most of the time the discrepancies are glaring enough for wiser heads to prevail and recognize that:
- Different generations of hardware, different CPU speeds and different disks will give different results.
- Different operating system installations and optimizations will give different results.
- The application itself is often a very poor measure of its own performance; external monitors are the only reliable way to test (if you can get DTrace, do), as sketched below.
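Here is a minimal sketch of what "measure from the outside" can mean in practice. The command name is a placeholder for whatever workload you are testing, and this is no substitute for DTrace-level observation; the point is simply that the clock lives outside the process being measured.

```python
import subprocess
import time

def external_wall_time(cmd: list[str], runs: int = 5) -> float:
    """Time a command with an external clock, averaged over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        samples.append(time.perf_counter() - start)
    return sum(samples) / len(samples)

# "./my_benchmark" is a hypothetical placeholder for the process under test.
print(external_wall_time(["./my_benchmark"]))
```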
Performance tests are crucial for applications. They let you:
- Compare to competitors' products
- Benchmark your own speed against requirements
- Compare revisions of your own product
- Save infrastructure costs
- Increase revenues as you deliver more for your customers
However, performance tests are only as useful as the intelligence of the people who create them and analyze them. Never analyze performance in isolation; context is everything.
Summary
Performance tests are an invaluable tool for improving your reliability, iteration speed, revenue and profits, but they are hard to do correctly. Without smart people analyzing in context, you can make poor decisions that hurt your business.
Do you have performance numbers? Do they make sense? Are you using them blindly or leveraging them wisely? Ask us to help set up your performance testing regimen.