The Troubleshooting Process

I do a lot of troubleshooting of problems in corporate infrastructure. By that, I mean investigating difficult technical problems in a structured way to find either a solution or a workaround. This post is a few thoughts on the process.

Here is a caricature of the process that often happens. It is imaginary, but I think most people will easily identify real cases that are similar.

  1. A user (or a group of users, for example at one site or using one application) experiences a problem. The help desk tries a few things, not very systematically. Sometimes different people from the help desk try the same thing again. Various updates and changes are made to see if they fix the problem. Sometimes the problem is resolved. Sometimes it goes away. And sometimes it is ignored as "just the way it is until we upgrade xyz".
  2. A user experiences a problem. The user is senior, or the problem affects something that senior people care about. A problem manager takes over the case. The problem manager is someone who co-ordinates actions but does not, themselves, possess the technical skills to resolve the problem. The problem is assigned to whichever team seems most likely to be responsible, e.g. networks, server, storage or application. The team works hard to demonstrate that the fault cannot be with them. The problem continues, with increasing levels of polite acrimony. Eventually a significant change is made which may, or may not, address the underlying cause, e.g. a platform component is replaced. The problem goes away, but no-one knows what actually caused it.

What is the perfect way? A user experiences a problem. A technical wizard understands every aspect of the problem without needing to be told anything about it, or to do any investigation. The technical wizard knows the solution, whether it is in the network, or server, or storage, or application layer. After some discussion, the fix is implemented and the problem is solved.

This perfect way is absurd, obviously. No-one can know what the problem is until they have seen it, thought about it, asked a few questions, gathered some data. No-one can be expert in enough technologies to know exactly what the fix is, without running some tests, trying a few things.

So we need a process that is not like 1) and 2), but also does not assume there is a perfect way.

First, some context. We already have an awareness of what is normal behaviour and what is not. If a user says that logon is slow, we assess it against a typical logon time. If the behaviour is not normal, then we can assume there is a fault. For the most part (not always!) vendor products do not show obvious faults in isolation. So, if the vendors do not know about this fault, then there must be a fault in our configuration, or an unknown fault in the product. There must be something we can change so that the problem does not occur. Our job is to find what to change.
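
To make "assessing against normal" concrete, here is a minimal sketch in Python; the baseline timings and the three-sigma threshold are invented for illustration, not taken from a real environment:

    # Compare a reported logon time against a baseline of typical timings.
    # The numbers and the threshold are hypothetical illustrations.
    from statistics import mean, stdev

    baseline_logons = [38.2, 41.5, 36.9, 44.0, 39.7, 42.3]  # seconds, typical users
    reported_logon = 95.0                                    # the "slow" logon

    avg, sd = mean(baseline_logons), stdev(baseline_logons)
    threshold = avg + 3 * sd  # treat anything beyond this as abnormal

    if reported_logon > threshold:
        print(f"Abnormal: {reported_logon:.1f}s against a typical {avg:.1f}s")
    else:
        print("Within normal variation; probably not a fault")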

The way I go about it is divided into five parts. These are not all sequential, but they involve different types of activity:

  1. Incident Handling
  2. Problem Definition
  3. Investigation
  4. Analysis
  5. Resolution.

I am not going to describe each part in detail: just give an idea.

Incident Handling

Being practical, I am going to assume that the great majority of day-to-day help desk calls do not require much rigour. The most important thing is to avoid doing things twice. As soon as we (on the help desk) realise that a problem is not simple, we need to follow a systematic process to gather the incident data in a way that avoids having to do it again.

Problem Definition

As soon as we realise that a fault may be too difficult to resolve with a simple change, we need to prepare a Problem Definition.

Very often I find that, when starting out, it is more difficult to write down an accurate statement of the problem than you might expect. The report might be: "The application crashes". But is that a Not Responding window or an application error? If there is an error, what is the exact message, and what happens when you click on OK? Is there an error shown in the Application Event Log? Does the application log show an error? How often does this occur? How many people are affected? What is the user doing in the moments leading up to the problem? Writing down what is and is not known goes a long way to defining what problem we are trying to solve.
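
It also helps to capture evidence rather than impressions when answering these questions. As a minimal sketch, assuming a Windows machine (wevtutil ships with the OS), the snippet below pulls the most recent Application log errors so they can be attached to the Problem Definition:

    # Pull the last 20 Application Event Log errors (Level 2) as text,
    # newest first, so the exact messages can be recorded verbatim.
    import subprocess

    result = subprocess.run(
        ["wevtutil", "qe", "Application",
         "/q:*[System[(Level=2)]]",   # errors only
         "/c:20", "/rd:true", "/f:text"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)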

It is not always easy. I have worked on a fault causing Windows to crash with a bug check, where it was difficult to know even whether there was a single problem, or more than one.

Investigation

This is not a distinct step. But I identify it separately because we attempt to gather data systematically to understand the problem better. The more we understand, the more specific the tools we might use to investigate further.

Initially we might configure application debug level logging. Or we might run a network capture. As we narrow it down we might get very specific.

For example, I recently had a problem where Outlook seemed to hang for a few seconds when typing. I first ran a network capture of all traffic to see if, perhaps, a peak was causing a delay in response. There was no significant level of traffic. There was no significant latency. No retransmission. No problems with the network connection in general.

Then I ran a capture only between the Outlook client and the Exchange server (using the relevant ports). I noticed a spike in traffic every 15 minutes, coinciding with the problem. But what was the spike? Was it normal, or a fault?
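
Spotting that kind of periodicity does not need specialised tooling. Here is a minimal sketch using scapy to bucket a capture into one-minute intervals and list where the traffic spikes fall; the capture file name and port filter are assumptions for the example, and the point is the technique rather than the exact filter:

    # Bucket a client-to-Exchange capture into one-minute intervals and list
    # the busiest ones; a spike repeating every 15 minutes would match the
    # symptom. File name and port are hypothetical.
    from collections import Counter
    from scapy.all import rdpcap, TCP

    packets = rdpcap("outlook_to_exchange.pcap")
    bytes_per_minute = Counter()

    for pkt in packets:
        if TCP in pkt and 443 in (pkt[TCP].sport, pkt[TCP].dport):
            minute = int(pkt.time) // 60
            bytes_per_minute[minute] += len(pkt)

    for minute, total in bytes_per_minute.most_common(10):
        print(f"minute {minute}: {total} bytes")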

So then, on the server, I configured Remote Operation (ROP) logging to capture the exact MAPI operations being performed by the client. The problem was caused by a MAPI action to re-read the mail database and refresh the view in Outlook. This occurred exactly when the user experienced the problem.

Analysis

I will mention just two aspects of the analysis.

One is that we must have a mental model of the part of the system we are examining, and all its components. We then need to rule in or out the different components we think could be involved. Often I am told the problem is in Citrix. When I investigate, it is not in Citrix at all. It is in, for example, the configuration of the application on Citrix. Or, in one very difficult case, it was port exhaustion in an unpatched Windows server. Or it might be a configuration error in Citrix after all.

The second is conjecture. I use the term to mean that we need to develop an idea of a single cause of all the different symptoms we have found in the investigation. I don’t mean a guess. For example, sometimes people will say "I think it must be the anti-virus". This is a guess. But a conjecture would be: "A Windows driver fault is occurring in the I/O stack. It is a type of fault that does not appear in testing using Driver Verifier". This is exactly what the problem was, but it was extremely difficult to find.

Resolution

An interesting aspect of troubleshooting is that the final part is often really easy. Often, I don’t need to be involved any further, once the exact cause is identified.

You might think that, after investigation and analysis, we may have a good idea but we cannot be certain. Of course, nothing is ever entirely certain. But going back to what I said about the context, the problem must be caused by a specific fault; otherwise the behaviour would be normal. When you find the cause, you just know this is it.

The problem with certifications

Certifications sound like a great idea, and if I were in HR recruiting IT people, I could be forgiven for thinking that they tell me something important about a person’s skill level. But I would be wrong.

The idea of certifications is not wrong. If I were having a boiler installed, I would probably want the technician to be certified to work with gas. This would be an industry certification, perhaps with an independent assessment body. They might also have done some training for the specific type of boilers they install. The IT industry does not do this. It has vendor certifications that are intended to demonstrate a skill level with a specific technology:

  • Cisco Certified Internetwork Expert (CCIE) and Cisco Certified Design Expert (CCDE)
  • Microsoft Certified Solutions Expert (MCSE)
  • VMware Certified Design Expert (VCDX)

The certifications are hard. Anyone who has a CCIE certification has demonstrated the ability to study and has the aptitude to pass a certain type of exam. They may also have been fortunate to have an employer willing to pay the steep fees for courses. The question, however, is whether they demonstrate real expertise in the technology.

The problem is that the course material is created with the idea of enabling an exam, and the exam is created purely as a test of the course material. An example will show what I mean.

This example is taken from the Cisco material for TSHOOT. This is one of the exams for the Cisco Certified Network Professional (CCNP) Routing and Switching certification. It covers the skills for troubleshooting and maintaining Cisco IP Networks. Cisco certifications are some of the best, so this example is not an adverse comment on Cisco. It is just an example of a certification topic.

Troubleshooting an IP network requires a good understanding of TCP/IP, and how packets flow through a network from server to switch to WAN and client, and back to the server. NetFlow is a way of recording information about flows, so that you can diagnose performance problems. There is quite a lot you need to know about flows in order to diagnose problems. The course material tells us that:

"A flow is a unidirectional stream of packets, between a given source and a destination, that have several components in common. The seven fields that need to match for packets to be considered part of the same flow are as follows:

  • Source IP Address
  • Destination IP Address
  • Source Port (protocol dependent)
  • Destination Port (protocol dependent)
  • Protocol (Layer 3 or 4)
  • Type of Service (ToS) Value (differentiated services code point [DSCP])
  • Input interface."

I suppose there are a number of concepts here. One is that the flow is a specific "conversation" between client and server. Now this is a bit surprising. It says "unidirectional". Does that mean that the response to a request is in a different flow? How can I tell if there is a network or server delay if the request and response are in different flows? Another concept is that you can’t jump between interfaces. You might have more than one network connection to a switch, but those would be separate flows. I don’t really need to know that there are precisely seven fields: I can always look that up. And I don’t need to know trick questions about what might be a field but is not. TCP/IP flows are a really interesting topic, and I would like to know more about them.
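
To make the idea of a flow key concrete, here is a minimal sketch that groups packet records by those fields. The records are invented for illustration, and it shows the point about direction: a request and its reply land in two different flows because their key fields are reversed:

    # Group packets into flows by the NetFlow key fields. The tuples are
    # invented; note that the reply ends up in a separate (unidirectional) flow.
    from collections import defaultdict

    # (src_ip, dst_ip, src_port, dst_port, protocol, tos, input_if, bytes)
    packets = [
        ("10.0.0.5", "192.168.1.10", 51000, 443, "TCP", 0, "Gi0/1", 1200),
        ("10.0.0.5", "192.168.1.10", 51000, 443, "TCP", 0, "Gi0/1", 800),
        ("192.168.1.10", "10.0.0.5", 443, 51000, "TCP", 0, "Gi0/2", 5400),  # reply
    ]

    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for *key, size in packets:
        flows[tuple(key)]["packets"] += 1
        flows[tuple(key)]["bytes"] += size

    for key, stats in flows.items():
        print(key, stats)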

Now here is the test question:

"Which of the following is not a NetFlow key field

  • Source IP Address
  • Layer 4 Source Port
  • ToS Byte (DSCP)
  • TTL
  • Input Interface."

Did you notice what happened there? I don’t need to know anything about flows. I just need to remember the list of seven fields. And I need to be aware of trick answers. Is Source Port really Layer 4? Is TOS value really a byte? Did I just forget TTL, or could there be a reason why Time to Live is a field that I have forgotten? None of this matters in the real world. In the real world we switch on NetFlow, and configure a collector like SolarWinds. The real work is in interpreting the NetFlow data. And NetFlow is expensive. And it can’t tell you what is happening on the parts of the WAN you do not control. And it does not tell you what response time the user actually experiences.

The problem here is the methodology. If the exam were in Algebra, there would be a vast body of knowledge, different course material, trained teachers and professional examiners. But there is no such body of knowledge or of educators for troubleshooting an IP network. Cisco has to get someone to prepare a course and someone else to prepare an exam. The exam is a test of the course.

Certification courses provide useful training. And simple exams that test whether you paid attention are OK. But certifications do not prove skills. They prove a willingness to study, and an aptitude for a certain kind of test.

Windows 10 S for Enterprise?

Windows 10 S is the new edition of the client OS that is restricted to run only applications from the Windows Store. The advantage is that it is more stable and secure than an OS where the user can install software from anywhere. Microsoft has positioned the OS for the education market. But perhaps it has possibilities for the enterprise too.

Windows 10 S was released in May 2017. It is only available as an OEM installation, for example on the Microsoft Surface Laptop.

Vendors with Windows 10 S laptops currently include Dell, HP and others. Prices are in a similar range to other laptops and notebooks.

The marketing from Microsoft is aimed at the education market, but what interests me is the scope for using Windows 10 S in the enterprise. Mobility is costly, and this OS might bring the cost down.

The main problem for enterprise mobility is making it secure. One approach to this is the managed laptop:

  • a custom Windows 10 Enterprise image
  • joined to the domain
  • encrypted
  • authenticated by a certificate
  • no admin rights for the user, OR admin rights with more active detection and blocking
  • SSL VPN client
  • web proxy client.

This has more complexity and higher support costs than a standard desktop.

An alternative approach is to do away with the idea of validating the device at all, and provide access to enterprise data and applications only through a virtual desktop. In this case mobility is provided by any device running the remote access software, such as Citrix Receiver or the VMware Horizon client. It can be a Mac, a Chromebook or a mobile thin client. The problem here is that, if you want to work offline, you need to store data and you need local applications. If you do that, you again need a managed device, and you add further costs.

Windows 10 S may provide a new option. Use a regular desktop in the office, and a Windows 10 S laptop for mobility. As the Windows 10 S laptop cannot run applications except from the Windows Store, the level of protection and additional support required is much lower. You can still run Office applications like Outlook. You can still edit a PowerPoint presentation or work on an Excel spreadsheet offline. But the scope for malware is much reduced. If you need to use an enterprise application like SAP when working from home, say, then you can use remote access to connect to a virtual desktop or a published application. But in this case the virtual desktop needs to be provided only to the mobile users and not to all users.

Windows 10 S supports these enterprise features:

  • Trusted Platform Module (depending on the OEM hardware)
  • Certificate Store for enterprise certificates
  • BitLocker disk encryption
  • Azure domain join and Windows Hello authentication
  • mobile device management with Intune, AirWatch or similar
  • desktop applications from Microsoft, Adobe, etc. as long as they are available from the Windows Store.

The typical company laptop is an expensive compromise. It needs to be powerful enough to run enterprise applications, light enough to carry around easily, secure enough to hold enterprise data, flexible enough to allow the user to work offline. I think on balance I would prefer to use a regular desktop in the office, and a Windows 10 S laptop for mobility.

End of an era

We are seeing the end of an era in how we think of, and manage, the corporate desktop.

The corporate desktop is only about 12 to 15 years old. In a short burst, Microsoft introduced a range of technologies that made it practical to administer personal computers on a large scale: Active Directory, Group Policy, Windows Installer, etc. Microsoft called it IntelliMirror, although that name has disappeared. We take it all for granted now. It’s how desktops work.

Having an administered desktop like this was very important to the overall architecture of IT services. Devices on the LAN were safe and were allowed to access corporate data. Other devices were not. That’s why software like Flash, Java and Adobe Reader could be allowed to be out of date, and why people stuck with Windows XP and IE 8. They were on the LAN, so they were safe.

As things have evolved, it is getting to the point where this just isn’t the case anymore. The basic design has come to the end of the road. The effort to keep it up to date and secure is too great, and the benefit is no longer there.

I know you can keep the desktop up to date and secure. But it’s a lot of work and it is easy for it to break down. For the user this is all a waste of effort and cost. There’s no benefit to them. It is just a cost, a nuisance, and a constraint. As a minimum you need:

  1. Disk encryption, with boot PIN or password.
  2. Constant updates to Java, Flash, Adobe Reader, Chrome, Firefox. Not just regular, like every three months, but every few days.
  3. Every app virtualised, except the ones that won’t virtualise.
  4. Special treatment for web apps that need older versions of Internet Explorer and Java.
  5. A certificate infrastructure, and network access control, to test whether the device is one of yours or not.
  6. Security and audit controls to prevent, detect and respond to intrusions.

But mostly now the aim is to allow people to reach the main corporate services, like e-mail, from any device, and from anywhere. Not in all organisations, I know, but mostly I think. And why not?

If I can get to a service with Chrome, then I don’t need a company desktop to get to it. Any device with a browser will do. Web services and Cloud services don’t require a corporate desktop, and in many cases can’t tell if the client is a corporate desktop or not.

Take Office 365 as an example. I see a lot of organisations adopting it. The whole point of Office 365 is that you can use it on and off the network, and from any device (more or less). Office 365 has no method to detect whether your device is a corporate desktop or not. It can detect the IP address, and the type of device (Windows, iOS etc.), but it can’t detect whether the computer is joined to your domain, or has a machine certificate, or is encrypted, or the software is up to date – all the things that make a corporate desktop.

I think now we are looking ahead to a different paradigm.

  1. Enrollment of any kind of device with something like Intune or AirWatch (see the sketch after this list)
  2. A corporate user identity, with different levels of authentication and authorisation for different services e.g. an Azure AD identity with Yubikey MFA for the finance and HR systems.
  3. Corporate applications delivered as separate services that you sign up to, and delivered mostly virtually or as web services, with no data on the end device.
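
As a sketch of what enrollment looks like from the management side, the snippet below lists Intune-managed devices through Microsoft Graph. It assumes an Azure AD app registration with the DeviceManagementManagedDevices.Read.All permission and an access token obtained elsewhere (e.g. with MSAL); it is illustrative rather than a complete client:

    # List enrolled devices from Intune via Microsoft Graph.
    # ACCESS_TOKEN is a placeholder; acquiring it is out of scope here.
    import requests

    ACCESS_TOKEN = "..."  # hypothetical: obtain via MSAL or similar
    url = "https://graph.microsoft.com/v1.0/deviceManagement/managedDevices"
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()

    for device in resp.json().get("value", []):
        print(device.get("deviceName"),
              device.get("operatingSystem"),
              device.get("complianceState"))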

I think this also means we will not need the monolithic, outsourced, integrated IT organisation. When IT is delivered as separate managed services, it does not need to be managed as a single entity. I would expect to see: Corporate Systems; Line of Business Systems; Local Systems.

How would this work in practice? Let’s say I am in engineering in a UK subsidiary of a global business. I get an Azure AD identity and a Yubikey from HR when I join. I pick my devices (a phone, a laptop) from a list, and they are delivered direct to me by the vendor. If I want, I download a corporate clean image, otherwise I just use Windows 10 OEM. I go to the Corporate Intranet new starters page, and enroll both devices in the Device Management system. They auto-discover the Office 365 e-mail and chat. I get a phone ID, which I key in to the phone on my desk.

From a portal I download the apps for my expenses and time reporting from Corporate Services. They only download onto an enrolled device. If I un-enroll, or fail to authenticate, they are wiped. Most of them will be virtual or web apps.

My engineering apps, like Autodesk, come from my Engineering Services. They will only install on an enrolled device. I can do what I like with the app, but I can’t get any important data without my Yubikey.

My own department pays the vendor for the devices. It pays Corporate Services per employee. It has whatever Local Services it wants, for example its own helpdesk. Apps have a subscription per month.

OK, it’s not perfect, but it is a lot less complicated and easier to manage. It makes IT a set of services instead of an organisation.

Complexity in IT

In a previous post I said I thought that problems in IT are caused by complexity, and not by the pace of change, poor management or lack of skills (although any of those may contribute).

Here are some interesting thoughts from David Gelernter. Gelernter is Professor of Computer Science at Yale.

In his book Mirror Worlds he says:

"Information structures are, potentially, the most complicated structures known to man. Precisely because software is so easy to build, complexity is a deadly software killer."

"Programs that amount to a quarter of a million lines of text (there are about 7000 lines in this book, so picture 36 volumes of programs) are not in the least unusual." "It is very hard to make programs come out right. After a decent amount of effort they tend to be mostly right, with just a few small bugs. Fixing those small bugs (a bug can be small but catastrophic under the wrong circumstances) might take ten times longer than the entire specification, design, construction, and testing effort up to this point."

"If you are a software designer and you can’t master and subdue monumental complexity, you’re dead: your machines don’t work….Hence "managing complexity" must be your goal."

I don’t think many businesses or customers of IT fully recognise this. They think of IT as fiddly, not for them, full of jargon. They are happy to say they don’t really understand it. They don’t realise that hardly anyone does!

Why is IT so difficult?

A friend of mine, a very experienced and senior non-executive director, asked me why, in all the organisations he knows, IT is the area that causes the most difficulty. There are several common explanations, but I am not sure they add up. This leads me to a different explanation, with interesting consequences.

IT causes difficulty in many ways, for example:

  • results not achieved on time, not what was expected or promised, and not within budget
  • catastrophic failure of critical systems, or loss of data
  • systems hard to use, difficult to change, not integrated with other systems, expensive to maintain, hard to replace
  • problems with staff, and with suppliers: poor quality, high turnover, unreliable.

So my friend can be reasonably confident that a five-year, multi-billion pound engineering project will be completed successfully, while a one-year, million pound IT project is unlikely to run to plan. Why is that?

Possible explanations:

  1. IT is changing so fast that whatever you plan is obsolete within a short time
  2. People in IT generally lack professional training and skills
  3. People in the business don’t understand IT, and the people in IT don’t understand the business.

I have doubts about these explanations. They have a superficial truth, but for me they don’t explain the level of difficulty in managing IT successfully.

1. The rate of change

The IT industry is constantly producing new things, that’s true. But in other respects the rate of change is fairly slow. The way we do computing is not fundamentally very different from say ten years ago. Many of the same companies are selling many of the same products. If you started a project five years ago, no matter how large, it is difficult to see what technology has changed sufficiently to cause the project to fail.

2. Training and skills

Because things in IT go wrong, it is easy retrospectively to identify faults in the skills of the individuals as the cause, but it is not necessarily so. When things are difficult or impossible to achieve even the highest level of skill may not be sufficient. It is hard to imagine that in general the training and skills of people in IT are lower than in Sales, Marketing, Procurement, Distribution. Maybe those areas just aren’t as difficult, and so the managers appear to be more successful.

3. Understanding

There is a high threshold in getting to grips with the language of IT, certainly. But at the level at which IT and other people in the business need to communicate this really should not be relevant. Medicine has its own language, but doctors don’t seem to have the same problem communicating with patients. I suspect that problems in understanding are more to do with trust than with language.

So if these explanations don’t account for the difficulty with IT, what are we left with? My view is that the root cause is complexity. IT systems are the link between a human intention and a computer chip. Human intentions are imprecise and hard to define, but chips are strictly binary. The layers of software in between intention and chip are hugely complex. To produce a predictable outcome is extremely difficult.

If it is true that the root cause of difficulty in managing IT is complexity then there are two consequences. The first is that we should aim to minimise the complexity in every possible way; and the second is that we need people who manage complexity very well.

Migrating applications to Windows 7

One of the biggest challenges when upgrading to Windows 7 is in testing and preparing applications. This post puts together a few conclusions that might assist you in planning the work.

The extended lifespan of Windows XP and Server 2003 has been a sort of "peace dividend" or "pension holiday". When you do come to upgrade it is important not to underestimate the cost and uncertainty involved in application compatibility. But at the same time you don’t need to accept that the migration will take forever.

The problem is that applications can be incompatible with Windows 7 in many different ways. Some of these are trivial and easily solved. Some are harder to solve. Some are hard to find and impossible to solve. You don’t know until you test. The same applies to running the applications in Citrix on Server 2008 R2, with the added complication of 64-bit. Here are a few examples to illustrate:

Standard third party application: Lotus Notes

  • The current version 8.5.1 is not certified on Windows 7. Does it work OK or not? Do you wait until later this year for 8.5.2, or go ahead with 8.5.1? There is a patch, Fix Pack 1, that is certified but it adds complexity to the installation.
  • You would think it would be quite simple to find out: ask the vendor. But most vendors do not certify previous versions. That does not mean they don’t run perfectly well. In this case, although 8.5.1 is not certified, the release notes for Fix Pack 1 contain only trivial changes and 8.5.1 appears to work fine, so there is no reason to delay.

Specialised third party application: legal software

  • The installation fails on Vista/Windows 7. Examination of the logs and the Windows Installer file shows there is a custom action to copy templates into the user profile path. The path is hard coded and fails.
  • The solution is to customise the installer to remove the custom action and replicate it in a custom script (a sketch of such a script follows below). Inform the vendor so they can modify the installer.
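
A minimal sketch of what that replacement script might do, copying the templates into the current user’s profile by environment variable rather than a hard-coded path; the source folder and file pattern are hypothetical:

    # Replicate the removed custom action: copy templates into the user's
    # profile using environment variables instead of a hard-coded path.
    # Source folder and file pattern are hypothetical.
    import os
    import shutil
    from pathlib import Path

    source = Path(r"C:\Program Files (x86)\LegalApp\Templates")
    target = Path(os.environ["APPDATA"]) / "LegalApp" / "Templates"

    target.mkdir(parents=True, exist_ok=True)
    for template in source.glob("*.dot*"):
        shutil.copy2(template, target)
        print(f"copied {template.name} -> {target}")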

Custom system: membership database

  • This is an old system with a SystemBuilder 4GL graphical interface to a Unidata database. The version of SystemBuilder being used is not certified or even tested on Vista/Windows 7. The SBClient application contains an OEM Pervasive SQL client that is also obsolete. The client does mail merge to Word 2003, so it would need to be tested if used with Word 2007 or 2010.
  • There is a new version of SystemBuilder that, amazingly for such an old product, is certified both on Windows 7 and on Server 2008 R2 Terminal Services. The new version seems to work perfectly with the old system. However you need to change the client side and the server side of the graphical interface at the same time, so it would be a big bang change to a critical system.
  • But, after packaging the old version using Wise Package Studio, it seems to work fine on both Windows 7 and on Server 2008 Terminal Services, so there is no need to upgrade.

Other Gotchas

  • Applications with OEM versions of Crystal Reports 11 or earlier do not install on Windows 7. Crystal Reports 11 is not supported on Windows 7, and you can’t upgrade an OEM edition, but it can be modified to install successfully.
  • Applications using the common VB6 SendKeys function raise an error on Windows 7. SendKeys does not work with UAC. UAC can only be turned off per computer, not per application, so there is no workaround except to turn UAC off entirely.
  • In XP you can use the Printer button on the PageSetupDialog to set different printer properties for an application. In the Vista/Windows 7 API it’s gone. There’s no error, it’s just not there. But in .NET Framework it’s still there! This might seem rather obscure, but the point is: you would have to do a lot of testing to discover this and then find out whether it matters to your users of that application.

Obviously you could wait till your applications are fully tested or upgraded to the latest certified versions, but this could take impossibly long. If you have just one core application that is not ready, you can’t upgrade the desktop.

A lot of people seem to be combining application virtualization with a Windows 7 rollout. Perhaps surprisingly, application virtualization is largely irrelevant to compatibility across OSs. With a virtualized app, the same DLLs run within the OS with exactly the same results. If the application faults natively, it will fault when virtualized. Virtualization can be used to implement a compatibility fix, but you still need the fix.

The best way to approach this is with a structured testing environment and a full set of delivery options. Then, for the difficult applications, you can set a time limit.

Structured Testing Environment

  • Wise Package Studio or similar, to examine the internal structure of the application and check for conflicts between applications.
  • A full VMware testing environment with VMware Workstation and ESXi, so you can create all the packaging and testing environments you need and, most importantly, leave them running so that users can log on remotely to test.
  • Scripted or automated tests and test accounts for each application.
  • Microsoft Application Compatibility Toolkit for testing and distributing fixes
  • Thorough documentation and audit trail of the testing.

Delivery options

  • Native installation for compatible and well behaved applications
  • Citrix XenApp published applications, or perhaps virtual desktop, for incompatible applications
  • Virtualization for conflicting applications (e.g. applications that require different versions of common components) or badly behaved applications (e.g. applications that change the default behaviour of the OS)

Most larger organisations already use several delivery options. What is new is to work out the interdependencies of different applications and which platforms they need to sit on. For example, if the incompatible app does a mail merge to Word or a report export to Excel, then the back end platform needs to have Office. It won’t be able to merge and export to the front end. This means that you also have to consider the user profile settings across different delivery platforms. If the user changes a default printer on the Windows 7 front end, should the same change be made to the back end or not?

With this approach, structured testing and multiple delivery options, you can set a time limit for preparing applications for Windows 7 migration. You can migrate the core desktop to Windows 7, while migrating older applications when they are ready.

Intel and McAfee

Intel announced on 19 Aug 2010 that it will buy McAfee for around $8bn. This has caused some surprise. Intel does not sell directly to the end-user, and it does not develop application software. It is not obvious what it achieves by acquiring a software vendor. Here’s my guess as to why Intel is doing it.

There is a complex pattern of change going on in the architecture of the server computer. As parts get cheaper and more powerful, they can be reconfigured in many ways. The basic model of one box and one chip per business function (e.g. the mail server, the domain controller) no longer exists.

Virtualisation and Cloud computing are just marketing words, but underneath is a continuous evolution and adaptation of components. The BIOS (very small bit of control code) becomes the EFI (much larger) and then the Hypervisor (even larger bit of control code). Virtualisation is not a new concept. It just signifies that the hardware has temporarily outstripped the operating system in the ability to run diverse tasks. The hardware is sitting there saying "give me more", but the OS can’t isolate the tasks enough, so we put a thin layer in between to share the hardware. The next step is for the "OS" to shrink to being task-based, like Windows Server Core. Likewise, cloud computing is not a new technology. It signifies that fibre optic networks are cheap enough to move servers off site, where they can share resources like cooling and power supply more easily.

One aspect of this continuous evolution and adaptation is that the security risks are changing. It used to be accepted that "inside" was inherently safer than "outside". Outside you need two factor authentication and strong encryption. Inside you can get away with the odd admin password passed over the network in the clear. Now you can’t assume this. For example, on shared hardware you need to process security keys (used for disk encryption) outside of shared memory, where they might otherwise be discovered by other virtual machines on the same physical host.

As a result there is a lot of work going on to improve the manageability and security of computers below the operating system layer.

  • faster and stronger encryption
  • better protection of encryption keys and passwords
  • more isolation of different virtual machines
  • detection of unexpected state changes.

For Intel this includes initiatives such as: Active Management Technology (AMT); Virtualization Technology (VT); and Trusted Execution Technology (TXT). These have also been evolving over the past five years and more. Here is a really good insight into what AMT does: AMT

So I think Intel must have acquired McAfee in order to adapt their antivirus technology for implementation in hardware. This would enable the physical host to scan virtual guests and preserve the integrity of the system. The host would be able to detect if the guest had been altered. It would also be able to detect if shared drivers for graphics and audio had been tampered with. It might even be easier to stop the AV process running away with the CPU, which happens frequently in software.

Why McAfee? I don’t know. I am not aware of any technical superiority between different AV vendors. Perhaps because they have a reasonably good name, client base and income stream. Why not invent from scratch? Only because it would take too long. These are just guesses, mind you.

Outsourcing IT is not the answer

Most large businesses I have come across at some point come to wonder how better to manage their IT operations. IT consumes a lot of money, but often does not seem to be doing what you want, almost wilfully. You ask for something to be done, and three weeks later nothing at all seems to have happened. Surely they are all just incompetent. Outsourcing has been around a long time as a solution to this problem of feeling a lack of control.

Outsourcing sounds like it should make sense. ICHA (or whoever) do lots of IT and must know how to do it better than we do. They are specialists where we are amateurs. They must have lots of highly skilled experts who can be called on to deal with the tricky technical stuff only when required. It all sounds so efficient. And now they even have technical experts and support centres in India and China, where costs are so much lower. How could it fail to be both more effective and less costly than our current operations?

And yet. When you start talking costs, they always seem remarkably close to your current costs. And service levels always sound more as though they are trying to avoid things rather than commit to them. TUPE means of course that you simply can’t release your staff (who were so incompetent, remember?) and use ICHA’s. And the shared data centres you were going to use instead of paying for your own, well, it would cost millions to make the move and actually the services are going to be run from your own data centre after all. In the end it seems as though your own people and facilities are going to be sold back to you at a premium, but managed by someone else. So the pitch comes down to this: "Don’t worry your pretty little head about this IT stuff. Just tell us what you want and we will manage it for you". Core business is the key word. By the time you have got this far down the track, it would be really embarrassing to go back to the Board and say, "It doesn’t add up, I must have misunderstood what IT is about", so it goes ahead anyway.

Here’s why Outsourcing in this way doesn’t work.

Most of IT Operations is simply deploying vendors’ kit. It may be in large quantities, it may be very expensive, but it is still just kit. Most kit from most vendors is at the upper bounds of complexity and capability. As a random example, RSA SecurID can provide strong authentication for five plumbers, or for 100,000 staff spread around international offices. It works the same way. To implement this stuff effectively you need to be fairly expert. But then day to day it requires little more than following the book for how you add users, change settings or whatever. Mostly it just works. And when it doesn’t you really need the expert to fix it.

Now the problem is that it does not make sense for IT Operations to hire experts. You only set it up once, and change it rarely. But you administer it every day. So you tend to hire the administrators, and then try to get by on that. Systems are put together by people who are not experts, and so they don’t get done or they fail. I don’t mean to say that the people in IT Operations are not very capable. You may have a small group of people who are indeed expert in some things. It’s just that they can’t possibly have the variety and depth of experience of people who do this all the time. Is it enough? Well, perhaps, but probably not.

And then when you go to the market for outside help, the transaction costs are high. It takes time to brief people and for them to understand what you are trying to do, and that time one way or another must be covered in their costs. It also looks like real money. £50,000 to do a project is a lot of cash to justify, with business cases and cost-benefit analysis. Fred not achieving anything very much in a year is much harder to see.

Outsourcing does not solve this.

The outsourcer is going to sell you back your own staff and kit. Yes, there may be some changes in the way some things are done, and you may have a few redundancies. But fundamentally you have the same faulty systems being run by the same people. Just when you would think you would have access to experts to solve problems or make things work better, they don’t seem to be available. Why is that? Well, an expert in something like Active Directory can be charged out at high rates to client implementation projects. He is not going to be assigned to your problem just because you’d like it. If he is assigned to a chargeable project to help you, he won’t know any more about you than any other new supplier.