The Future Desktop

I have been doing a bit of work with data visualization recently, using Tableau. It got me thinking about the way we use data to produce information, and how that is changing.

One of my early career challenges was to analyse what effect promotions had on overall product sales. In Unilever at that time the standard practice was to run product promotions with the supermarkets every few months. The idea was to gain more prominent shelf space, and so increase sales. The promotion had to offer something extra (money off, extra free, two for one) and manufacturing had to be geared up to support the extra volume. The annual financial plan had to be modelled on the anticipated peaks and troughs of volume. In fact you could say that the whole operation was geared around these promotions.

But the question was: did we actually increase overall profitability, or did we simply displace volume from one promotion cycle to another? My job was to look at the evidence and see what we could conclude about the effect of promotions on profit.

The trouble was, I had no tools. I could get data about production, physical sales to the supermarkets and market share by getting reports from the "mainframe", but I had no tools to analyse them. I had to draw graphs by hand. I plotted sales volume against market share and drew these up on paper and on acetates (remember those?). The results were presented to the Board, and I was asked to go and discuss them. I could make only the vaguest conclusions: promotions did not seem to increase market share in any sustained way; sales volume seemed to fall after a promotion by as much as it had increased; and average selling price and profitability went down as much as sales volume went up.

At that time there were no computers on desks. Now the purpose of the desk is to hold the computer. Today I would be able to draw nice graphs, with bubbles expanding and floating upwards. But would it make any difference? No, because there was no useful data to make the correlation between the promotion and the effect on consumer behaviour. The real difference between then and now is not the computer. It is the data.

One of my pet peeves is the phrase "the pace of change is increasing". No, it is not. The pace of change is a constant. If it were increasing, it would either have to change direction and start slowing down at some point, or it would have to increase ad infinitum, which would be an absurdity. The phrase is a rhetorical device to encourage action. But you have to consider that if your call to action is a logical absurdity then there is something wrong.

OK, so what is changing? Because something is. It is the availability of data about the world and our actions in it. The steadily falling cost of technology is making more and more data available, and giving us better tools to turn the data into usable information. We have more information, so we can act with more knowledge. We can use the data to gain new insight into the behaviour of the world. It may be what we guessed intuitively, without data, or it may be new. So instead of "the pace of change is increasing" we have "the availability of data and information and knowledge is constantly increasing". We can respond in two ways:

  1. Collect more data. This is what the Internet of Things is about.
  2. Use the data more effectively. This is what Data Visualization is about.

Performance Measurement

This is about our recent experience on a project to improve the performance and stability of a set of engineering applications after migration to a new datacentre. We had really excellent data produced by the application centre business analysts. These showed in detail that the applications were significantly slower than previously, across a wide range of transactions. On average, transactions were taking 25% longer (let’s say). Someone set the objective that we would not be satisfied until 90% of transactions were within the benchmark figure for each transaction.

On the face of it this was going to be difficult, because we knew that there would always be variability, and the new target effectively outlawed variation. We did not know the previous variability. If the benchmark transaction times had only been met, say, 70% of the time previously, then there was no reason to expect them now to be met 90% of the time.

The first and obvious variable was the user site. We found that, if we excluded the sites with known poor networks, or those sites which seemed to have a much higher incidence of poor results (because that is how we knew they had a poor network), then the number of transactions outside the benchmark dropped significantly. But it was still well above 10%. Obviously the site and the network did not account for all of the poor performance.
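To make that concrete, here is a rough sketch of the kind of check involved. The table, site names, benchmark figure and the pandas approach are all illustrative assumptions, not the project's data or tooling:

```python
import pandas as pd

# Hypothetical measurements: one row per timed transaction run.
runs = pd.DataFrame({
    "site":      ["A", "A", "B", "C", "C", "D", "D", "D"],
    "duration":  [4.1, 4.3, 9.8, 4.0, 12.5, 4.2, 4.4, 4.1],   # seconds
    "benchmark": [5.0] * 8,                                    # target time per run
})
poor_network_sites = {"B", "C"}   # sites already known to have weak links

def outside_benchmark(df: pd.DataFrame) -> float:
    """Fraction of runs slower than their benchmark time."""
    return (df["duration"] > df["benchmark"]).mean()

print(f"all sites:       {outside_benchmark(runs):.0%} outside benchmark")
filtered = runs[~runs["site"].isin(poor_network_sites)]
print(f"good sites only: {outside_benchmark(filtered):.0%} outside benchmark")
```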

The second obvious factor was the performance of the computing platform (Citrix XenDesktop). We could not tell whether a poor test result coincided with generally poor performance for other users of the platform at that precise time. But the general feeling was that the platform must have periods of poor performance. So the number of virtual machines was increased; the number of users per virtual machine was reduced; and in some cases the number of vCPUs per virtual machine was increased. It made no difference. There continued to be a significant number of transactions outside the benchmark times.

One of the issues for us was that we could not reproduce the problem on demand. The analysts had all experienced a bad transaction, but it was not repeatable. So we knew that we were looking for erratic rather than predictable results. When we looked at the test data again, we found that the Average time (the mean time taken across a number of instances of the same transaction) was very misleading. The Median value was indeed well below the benchmark transaction time. Most people were experiencing good performance most of the time, but some people were experiencing poor performance some of the time. The measurements taken during the periods of poor performance were extreme, so they dragged the averages upwards and made them much less useful.

The example I think of is taking a train to work. It normally takes 30 minutes. Four times out of five the train runs on time, but the fifth time it is cancelled and you wait 20 minutes for the next train, which also runs more slowly, taking 40 minutes. It is not useful to say that the journey takes 36 minutes on average. You would not be on time for work more often if you allowed 36 minutes. Instead the conclusion is that the service is unreliable, which is quite a different thing.
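The arithmetic is easy to check with Python's statistics module, using the five journeys described above:

```python
import statistics

# Five commutes: four on time (30 min), one with a cancelled train adding a
# 20-minute wait plus a slower 40-minute journey (60 min door to door).
journeys = [30, 30, 30, 30, 60]

print("mean:  ", statistics.mean(journeys))             # 36 -- "allow 36 minutes"?
print("median:", statistics.median(journeys))           # 30 -- the typical journey
print("stdev: ", round(statistics.stdev(journeys), 1))  # 13.4 -- the spread is the real story
```

Allowing the mean of 36 minutes still leaves you late one day in five; the median describes the typical journey, and the large spread is what actually matters.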

So we plotted the actual times in a scatter graph, and it was immediately clear that the real problem was not performance, but reliability. We also calculated the standard deviation, as a more accurate representation of variability, which told us the same thing. Examples:

[Scatter charts of recorded times for Transaction 7]
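A minimal sketch of the kind of plot and summary statistics involved follows; the timing figures are invented for illustration, and matplotlib is assumed to be available:

```python
import statistics
import matplotlib.pyplot as plt

# Invented timings for one transaction: mostly fast, with a few extreme outliers.
times = [4.2, 4.0, 4.5, 4.1, 3.9, 4.3, 31.0, 4.4, 4.2, 27.5, 4.1, 4.0]
benchmark = 5.0

print("mean:  ", round(statistics.mean(times), 1))    # pulled well up by the outliers
print("median:", round(statistics.median(times), 1))  # comfortably under the benchmark
print("stdev: ", round(statistics.stdev(times), 1))   # large: a reliability problem

# A scatter of individual runs makes the outliers obvious at a glance.
plt.scatter(range(len(times)), times, label="measured runs")
plt.axhline(benchmark, color="red", linestyle="--", label="benchmark")
plt.xlabel("test run")
plt.ylabel("time (s)")
plt.legend()
plt.show()
```

With data shaped like this, the mean sits above the benchmark while the median sits below it, and the standard deviation is large: the signature of a reliability problem rather than a performance problem.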

We decided that, instead of looking at the things that affect performance (vCPU, vRAM, disk latency, network latency), we would look at the things that affect reliability. We started by analysing each transaction with SysInternals Process Monitor and Wireshark, to understand exactly where the time was going (a sketch of that kind of analysis follows the list below). The results were a revelation. We found a set of causes that we would not have guessed existed:

  • A benchmark transaction had been exported from the old system without its version history. The transaction attempted to validate the version number by checking prior versions before giving up and running.
  • An export to Excel failed if Excel was already open in the background. It continued to fail silently until the user ran it with Excel closed.
  • A transaction called an external module signed with a certificate from the vendor. The transaction attempted to check the certificate's revocation status. If the user had an invalid proxy server configuration, there was a delay before the check timed out and the transaction continued. If the transaction was run a second time there was no check and it was fast.
  • When the user logged on, the application searched various non-existent locations for a user configuration. After around 20 seconds it found a configuration and began.
  • Running a transaction for the first time caused the data to be cached locally. The second time it ran from the cache and was fast, so the recorded time depended on which run was measured.
  • A report wrote to an Excel file at a network location. The data was transferred to the remote file in very small packets, taking a long time. Another report ran to a local file, which was then copied to the remote destination, and completed in a fraction of the time.
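Process Monitor can export a captured trace to CSV, and a small script of roughly this shape is enough to show which operations a transaction really spends its time on. The file name is hypothetical, and it assumes the trace was exported with the optional Duration column included:

```python
import csv
from collections import defaultdict

# Hypothetical CSV export from SysInternals Process Monitor, saved with the
# optional "Duration" column included in the displayed columns.
totals = defaultdict(float)   # (operation, result) -> total seconds
counts = defaultdict(int)     # (operation, result) -> number of events

with open("procmon_trace.csv", newline="", encoding="utf-8-sig") as f:
    for row in csv.DictReader(f):
        key = (row["Operation"], row["Result"])
        try:
            duration = float(row["Duration"])
        except (KeyError, ValueError):
            continue  # skip rows without a usable duration
        totals[key] += duration
        counts[key] += 1

# The operations that dominate total elapsed time point at the real cause:
# for example thousands of tiny writes to a network path, or repeated
# NAME NOT FOUND results while the application hunts for a configuration file.
for (operation, result), total in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{operation:<22} {result:<20} {counts[(operation, result)]:>6}x  {total:9.3f}s")
```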

The conclusions? It is important to look at the data statistically to see whether the problem is about performance or reliability; and you need to understand the makeup of the transaction to know what may cause it to take longer than expected.