Government Commercial Problems with IT Procurement

Working in IT, I come across procurement problems frequently. The root cause, it seems to me, is that government procurement rules are implicitly designed for a steady state, whereas IT projects implement change, which is inherently imprecise. These rules need a radical overhaul. The new Procurement Bill, currently (Feb 2023) going through the House of Commons, aims to do this.


What sort of problems?

  1. Long delays. A procurement that might be a simple executive decision in the private sector can be a three or six month exercise in the public sector. On a project, delay has a cost. This cost often outweighs the potential benefit of the procurement process.
  2. Inflexibility as requirements evolve. Sometimes you don’t know exactly what you need until you talk to suppliers. But you can’t talk to suppliers without a formal procurement process.

I cannot give specific cases, for reasons of client confidentiality. But I can highlight the areas of the procurement rules that create these problems. The intention of the public procurement policy is clear and legitimate: to achieve “the best mix of quality and effectiveness for the least outlay over the period of use of the goods or services bought”. The question is whether the rules do this in practice.

I must say at the outset, these thoughts are from a “user” perspective. I have no great knowledge of the procurement rules, only my experience in performing procurements as part of an IT project. The amount of regulation and guidance applying to procurement is vast, and I don’t know how anyone could master it. The scope is vast too: hundreds of billions of pounds of contracts, of every conceivable type, ranging in value from billions of pounds down to £10,000. I don’t believe it is realistic to try to codify the rules for this vast enterprise, but that is what the policy does.

Long delays

I led a piece of work to implement a small piece of software that integrated two different systems. There are four products that do this. It is quite a niche area, with not much published information. The value of the purchase would be small, in relation to the two systems being integrated. The products are priced by volume of usage, with annual subscriptions. There were various technical complications about integrating with the two specific systems in our case.

The obvious thing to do was to buy a few licences and learn on the job. We were not allowed to do this. The rules said that no purchase of any kind could be made without a selection process, in this case to decide which ones to trial. The public information was not sufficient to justify the selection of a single product to trial. The next obvious thing was to talk to vendors. We were strictly not allowed to do this. Talking informally to any vendor would prejudice a fair selection.

So we developed our selection criteria as best we could (based on what we could glean from the published information), and then carried out a systematic trial of all four products sequentially. The trial involved actually implementing all four products, and asking staff to evaluate their experience when using them. The experience was almost identical, as we expected.

Some of our important selection criteria were technical, for example compliance with security requirements, and licensing terms. For these, we had to ask the vendors to respond to an RFP. As you can imagine, the responses were inadequate to provide any assurance, without speaking further to the vendors.

After going through the selection process, amazingly, we had not actually completed the procurement. All the vendors sold licences through resellers, as you would expect. So, after the selection, we needed to pick a reseller. You’ve guessed it! We needed a procurement to pick a reseller to sell us the licences for the product we had selected. Fortunately, we were able to use the Crown Commercial Services framework to ask for quotes.

The end result was that we purchased a few licences for the product we expected to pick at the beginning, but many months later and at considerably greater cost than the cost of the licences.

The basic problem here is that we do not live in a world of perfect information. At the outset, we cannot know all the ins and outs of different products. Vendors design their product information to highlight advantages and hide weaknesses. Vendors do not publish real prices. Vendors do not respond to RFPs with full and honest answers to questions.

Think of it from the vendor’s point of view. Some government department wants to make a small purchase. The department invents a long and complicated process and invites them to participate. What should they do? Obviously, just send them the data sheet and the price list. Why would they go to the effort and expense of responding when the total profit if they won would be less than the cost of responding?


I led a project to upgrade the technology of an existing system, the purpose of which was to enable integration with another system. Sorry if that is a bit obscure: the reason is confidentiality.

The original system was contracted for before the integration even existed. We were not allowed to select a new network supplier with the integration built into their product. This service was not in the scope of their contract, because no-one at the time knew we would need it. It would have required a completely fresh procurement of the primary product, which would have taken at least a year.

In this case we were allowed to vary the existing contract. The rules on variation are highly complex. They require a good understanding of Annex A – Regulations 72 and 73 of the Guidance on Amendments to Contracts 2016. We were allowed to vary the contract, but only on condition that the varied contract used different technology to do the same thing.

This gave us a few big challenges to negotiate. One, we needed a new type of support for the new technology not provided in the original contract. Two, we needed a third party (at additional cost) to provide a service to assist in the integration.

After something like a year we had completed the integration. At this point there was less than a year to run on the existing contract. But we could not extend the contract. The rules on extension are especially severe: they are one of the “red lines” for IT procurement. So the next stage had to be a full procurement of the whole service, having just completed the transformation of the previous service.

The basic problem here is that we don’t live in a world of isolated products and services. They are all inter-related in some way. It is not possible to have perfect foreknowledge of all the ways the services might need to change in the future.


I have a few observations.

  1. Procurement rules do not take account of the cost of complying, in relation to the value obtained.
  2. They assume the availability of adequate market information to make perfect choices without speaking to vendors.
  3. They also assume vendors can and will respond with accurate detailed information about what they offer.
  4. They do not take sufficient account of the relationships with other products and services, and the way these all evolve over time.
  5. It is simply not possible to comply with the rules intelligently, without having a large and skilled Commercial department.
  6. A Commercial department cannot have full knowledge of the product or service being procured, and therefore there will be extensive delays or bad choices made.
  7. Delay is built in to the system, and the cost of delay is not accounted for.
  8. The cost and delay of procurement means that people are incentivised to wrap up products and services into large contracts that preclude innovation and competition – the exact opposite of what is intended.

Procurement Bill

The original Public Contracts Regulations 2015 stemmed directly from the EU Public Contracts Directive. The intention was to make contracts open across Europe.

But the idea that you can regulate all procurement across all of Europe with a value of more than £138,760 (Jan 2022 threshold) seems unrealistic. Let’s say you have an organisation of 10,000 staff. Let’s say a contract might run for 5 years (printing, laptops, software etc.). The threshold means that any contract worth more than about £3 per member of staff per year must be subject to a full, open procurement. Let’s say the vendor profit on the procurement is 20%, or £27,752. The procurement process will cost more than that!
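As a sanity check on that back-of-the-envelope arithmetic (the figures are the ones quoted above, not official values):

```python
# Sanity check of the back-of-the-envelope figures quoted above.
threshold = 138_760   # Jan 2022 procurement threshold, in pounds
staff = 10_000        # size of the hypothetical organisation
years = 5             # hypothetical contract term

# Contract value per member of staff per year at the threshold.
per_staff_per_year = threshold / staff / years
print(round(per_staff_per_year, 2))  # 2.78, i.e. roughly 3 pounds

# Vendor profit at a 20% margin on a threshold-value contract.
print(round(threshold * 0.20))       # 27752 pounds
```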

The explicit aim of the current Public Procurement Policy is to obtain value for money. But people don’t need rules to enable them to obtain value for money when buying a holiday, or a car, or the weekly shopping. People will do this for themselves. What the public needs is rules to prevent corruption. Anything that knowingly does not obtain value for money is corrupt. The new Procurement Bill says it aims to do this: “Integrity must sit at the heart of the process. It means there must be good management, prevention of misconduct, and control in order to prevent fraud and corruption.”

I will leave it to others to describe the changes in the new bill. But it is interesting to consider how it might affect the two cases I mentioned.

  • A below-threshold contract is one worth more than £12,000 and less than (I think) £138,760
  • For a below-threshold contract, the contracting authority “may not restrict the submission of tenders by reference to an assessment of a supplier’s suitability to perform the contract [including technical ability]”. I take that to mean that all procurements must be open to all potential suppliers and not shortlisted. That is admirable, and I see no difficulty in making all these tenders public. But for obscure and specialised requirements the result is likely to be a deluge of irrelevant tenders and/or no valid submissions at all.
  • This does not apply to frameworks, so the best way to procure anything below-threshold will always be through a framework. But frameworks can only sell commodities. They can’t sell niche specialised products.
  • Modifying an existing contract is covered in Section 74 and Schedule 8. I think a contract extension is limited to 10% of the term, i.e. 6 months of a five year contract. This is still not enough where a change of circumstances occurs during the contract.
  • The provision for additional goods, services or works during a contract seems less restrictive than before. “A modification is a permitted modification if (a) the modification provides for the supply of goods, services or works in addition to the goods, services or works already provided for in the contract, (b) using a different supplier would result in the supply of goods, services or works that are different from, or incompatible with, those already provided for in the contract, (c) the contracting authority considers that the difference or incompatibility would result in (i) disproportionate technical difficulties in operation or maintenance or other significant inconvenience, and (ii) the substantial duplication of costs for the authority, and (d) the modification would not increase the estimated value of the contract by more than 50 per cent.” That seems to be a lot more flexible than before.
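Read literally, the quoted conditions (a) to (d) are conjunctive, with (c) requiring both of its sub-conditions. A minimal sketch of that reading (the function and its parameter names are my own illustration, not anything taken from the bill):

```python
def modification_permitted(additional_supply: bool,
                           supplier_change_incompatible: bool,
                           disproportionate_difficulty: bool,
                           duplicated_costs: bool,
                           value_increase_pct: float) -> bool:
    """Literal reading of the quoted conditions (a)-(d): all must hold,
    and (c) requires both disproportionate technical difficulty (i)
    and substantial duplication of costs (ii)."""
    return (additional_supply                  # (a) additional goods/services/works
            and supplier_change_incompatible   # (b) other supplier would be incompatible
            and disproportionate_difficulty    # (c)(i) technical difficulties
            and duplicated_costs               # (c)(ii) duplicated costs
            and value_increase_pct <= 50)      # (d) value increase capped at 50%

print(modification_permitted(True, True, True, True, 30))   # True
print(modification_permitted(True, True, True, True, 60))   # False: fails (d)
```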

The scope of government contracts, even just IT contracts, is vast and I don’t know how it is possible to codify the rules governing them except by introducing a great deal of bureaucracy and expense.

Curiously, the word “integrity”, despite being one of the bill’s objectives, only occurs once in the bill, other than in the statement of the objective. It occurs in the context of the supplier’s integrity. But, when a private sector organisation contracts with a vendor, the organisation is relying on the integrity of its own staff, not the vendor. If the staff act with integrity, the organisation is confident the best choice will be made.

Speaking for an SME, I’m glad the bill has provisions to make it easier for small businesses to obtain contracts from government. But I have difficulty seeing how that will work in practice. Bidding is an expensive process. The way a small business manages the cost of bidding is to screen the opportunities for a competitive advantage. This might be having a good reputation with previous clients, or offering a high quality of service, or having strong skills in a particular area. These are intangibles that are screened out in a bureaucratic tendering process.

Fault with Company Portal

This is a story about the complete failure of Microsoft Premier Support to diagnose and resolve a fault in the Company Portal.

It is difficult to put into words how complete the failure is. But it includes a failure to define the problem; a failure to capture it or reproduce it; and a failure to provide any diagnosis of the cause.

The Company Portal is the Modern App, or Store App, that enables a user to install applications that have been published and made available to them from Intune. It is an essential part of the Mobile Device Management (MDM) platform. Without the Company Portal, a user can only have the applications that are “Required”. So, after Autopilot, Company Portal will often be the first place a user goes, to obtain the rest of the applications that they need to work with. An example is specialist finance applications. These might be made available to a community of finance users, but each person will install the ones they need individually.

The problem we have had for several months is that the Company Portal will suddenly disappear from a user’s desktop. It is gone. The user can search for “Company Portal” and it is not there. Where has it gone? No idea. How do you get it back? Well, you obviously can’t use the Company Portal to get it!

The facts of the problem are simple and clear, though you would not believe it from the number of times we have been asked to explain and provide logs:

  • After Autopilot completes, Company Portal is present and working.
  • Some short time later, it has disappeared from the user’s Start menu.
  • If you run Get-AppXPackage as the user, the Company Portal is not listed. However, if you log on as an admin, and run Get-AppXPackage -AllUsers, then the portal is shown as installed and available, but only for the admin user.
  • The AppX event log does not show any obvious relevant events.
  • It seems to happen in episodes. And it seems to happen sometimes and not others.

We have been asked repeatedly to provide logs: Autopilot logs and One Data Collector logs. But, obviously, if you gather logs before it has disappeared, then there is nothing to see. If you gather logs after it has disappeared, then there is also nothing to see.

After a while, we asked Microsoft Premier Support to try to reproduce the fault themselves instead of continuously asking us for logs. Amazingly, they are unable to do this. Microsoft Premier Support does not have access to virtual machines, or physical machines, that can be used to reproduce faults in Intune and Autopilot. Just let that sink in. Premier Support is unable to attempt to reproduce a fault in Autopilot. It depends on the customer to reproduce it.

We had a long discussion with Premier Support about Offline versus Online apps. The Microsoft documentation for Autopilot recommends in several places that you should use the Offline version of Company Portal. This is counter-intuitive. Offline apps are designed, intended, to be used offline. The scenario given in Microsoft documentation is a kiosk or shared device that is not connected to the internet. The Offline app is installed by DISM in a task sequence, and is used offline. Company Portal, by definition, is of no use offline. It is used to install applications from Intune. If the device were offline, it would not connect to Intune. So why install the Offline version?

We eventually established, at least we think, that an Offline app is in some way cached by Intune; whereas an Online app is obtained directly from a Microsoft Store repository. This seems relevant to the case of the disappearing portal, but we never discovered more about the true difference.

In an early occurrence, we found an AppX event to say that the Company Portal was not installed because of a missing dependency. The missing dependency was the Microsoft Services Store Engagement app. This is the app that enables users to provide feedback. But this app is (apparently) an embedded part of Windows 10 and cannot be missing. We heard no more about this.

The Company Portal stopped disappearing for a while, and we deduced that the fault was in some way related to version updates. It occurred frequently around one version change, and it has started to occur frequently again with the current version. Of course, we have no idea how it is related to the update. We don’t even really know how an update of an Offline app happens.

Finally, Microsoft Premier Support has asked us to gather a SysInternals Procmon log, together with a UXTrace log. I have done a lot of troubleshooting with Procmon. It generates huge log files, of every file and registry operation, as well as some TCP operations. To use Procmon effectively, you need a way to stop it when the fault occurs. Microsoft Premier Support simply asked us to run it and stop it when the fault occurred.

There are several problems with this. The first is that the user needs to run UXTrace and Procmon with elevated rights. In our environment (as in almost any production environment) the user does not have admin rights and cannot elevate. The second is that Procmon creates huge logs. You can’t just run it for an unspecified length of time, then stop it and save the log.

Microsoft Premier Support were clearly unable to understand the problem of gathering the logs, let alone provide a solution. This is dismal. I would expect a competent second-line engineer to have the skills to work out a strategy for collecting logs. It is part of the basic skill of troubleshooting.

So, three months on, Microsoft Premier Support has no clue, and no practical problem-solving approach.

The thing we have found is that Premier Support seems to have no access to the people who develop and run Intune, Autopilot and Company Portal. They are just as much in the dark as you or I.

The Troubleshooting Process

I do a lot of troubleshooting of problems in corporate infrastructure. By that, I mean investigating difficult technical problems in a structured way to find either a solution or a workaround. This post is a few thoughts on the process.

Here is a caricature of the process that often happens. It is imaginary, but I think most people will easily identify real cases that are similar.

  1. A user (or a group of users, for example at one site or using one application) experiences a problem. The help desk tries a few things, not very systematically. Sometimes different people from the help desk try the same thing again. Various updates and changes are made to see if they fix the problem. Sometimes the problem is resolved. Sometimes it goes away. And sometimes it is ignored as "just the way it is until we upgrade xyz".
  2. A user experiences a problem. The user is senior, or the problem affects something that senior people care about. A problem manager takes over the case. The problem manager is someone who co-ordinates actions but does not, themselves, possess the technical skills to resolve the problem. The problem is assigned to whichever team seems most likely e.g. networks, server, storage, application. The team works hard to demonstrate that the fault cannot be with them. The problem continues, with increasing levels of polite acrimony. Eventually a significant change is made which may, or may not, be the cause e.g. a platform component is changed. The problem goes away, but no-one knows what actually caused it.

What is the perfect way? A user experiences a problem. A technical wizard understands every aspect of the problem without needing to be told anything about it, or to do any investigation. The technical wizard knows the solution, whether it is in the network, or server, or storage, or application layer. After some discussion, the fix is implemented and the problem is solved.

This perfect way is absurd, obviously. No-one can know what the problem is until they have seen it, thought about it, asked a few questions, gathered some data. No-one can be expert in enough technologies to know exactly what the fix is, without running some tests, trying a few things.

So we need a process that is not like 1) and 2), but also does not assume there is a perfect way.

First, some context. We already have an awareness of what is normal behaviour and what is not. If a user says that logon is slow, we assess it against a typical logon time. If the behaviour is not normal, then we can assume there is a fault. For the most part (not always!) vendor products do not show obvious faults in isolation. So, if the vendors do not know about this fault, then there must be a fault in our configuration, or an unknown fault in the product. There must be something we can change so that the problem does not occur. Our job is to find what to change.

The way I go about it is divided into five parts. These are not all sequential, but they involve different types of activity:

  1. Incident Handling
  2. Problem Definition
  3. Investigation
  4. Analysis
  5. Resolution.

I am not going to describe each part in detail: just give an idea.

Incident Handling

Being practical, I am going to assume that the great majority of day-to-day help desk calls do not require much rigour. The most important thing is to avoid doing things twice. As soon as we (on the help desk) realise that a problem is not simple, we need to follow a systematic process to gather the incident data in a way that avoids having to do it again.

Problem Definition

As soon as we realise that a fault may be more difficult to resolve than by providing a simple change, we need to prepare a Problem Definition.

Very often I find that, when starting out, it is more difficult to write down an accurate statement of the problem than you might expect. The report might be: "The application crashes". But is that a Not Responding window or an application error? If there is an error, what is the exact message, and what happens when you click on OK? Is there an error shown in the Application Event Log? Does the application log show an error? How often does this occur? How many people are affected? What is the user doing in the moments leading up to the problem? Writing down what is and is not known goes a long way to defining what problem we are trying to solve.

It is not always easy. I have worked on a fault causing Windows to crash with a bug check, where it was difficult to know even whether there was a single problem, or more than one.


Investigation

This is not a distinct step. But I identify it separately because we attempt to gather data systematically to understand the problem better. The more we understand, the more specific the tools we might use to investigate further.

Initially we might configure application debug level logging. Or we might run a network capture. As we narrow it down we might get very specific.

For example, I recently had a problem where Outlook seemed to hang for a few seconds when typing. I first ran a network capture of all traffic to see if, perhaps, a peak was causing a delay in response. There was no significant level of traffic. There was no significant latency. No retransmission. No problems with the network connection in general.

Then I ran a capture only between the Outlook client and the Exchange server (using the relevant ports). I noticed a spike in traffic every 15 minutes, coinciding with the problem. But what was the spike? Was it normal, or a fault?

So then, on the server, I configured Remote Operation (ROP) logging to capture the exact MAPI operations being performed by the client. The problem was caused by a MAPI action to re-read the mail database and refresh the view in Outlook. This occurred exactly when the user experienced the problem.


Analysis

I will mention just two aspects of the analysis.

One is that we must have a mental model of the part of the system we are examining, and all its components. We then need to rule in or out the different components we think could be involved. Often I am told the problem is in Citrix. When I investigate, it is not in Citrix at all. It is in, for example, the configuration of the application on Citrix. Or, in one very difficult case, it was port exhaustion in an unpatched Windows server. Or it might be a configuration error in Citrix after all.

The second is conjecture. I use the term to mean that we need to develop an idea of a single cause of all the different symptoms we have found in the investigation. I don’t mean a guess. For example, sometimes people will say "I think it must be the anti-virus". This is a guess. But a conjecture would be: "A Windows driver fault is occurring in the I/O stack. It is a type of fault that does not appear in testing using Driver Verifier". This is exactly what the problem was, but it was extremely difficult to find.


Resolution

An interesting aspect of troubleshooting is that the final part is often really easy. Often, I don’t need to be involved any further, once the exact cause is identified.

You might think that, after investigation and analysis, we may have a good idea but we cannot be certain. Of course, nothing is ever entirely certain. But going back to what I said about the context, the problem must be caused by a specific fault; otherwise the behaviour would be normal. When you find the cause, you just know this is it.

The problem with certifications

Certifications sound like a great idea, and if I were in HR recruiting IT people, I could be forgiven for thinking that they tell me something important about a person’s skills level. But I would be wrong.

The idea of certifications is not wrong. If I were having a boiler installed, I would probably want the technician to be certified to work with gas. This would be an industry certification, perhaps with an independent assessment body. They might also have done some training for the specific type of boilers they install. The IT industry does not do this. It has vendor certifications that are intended to demonstrate a skill level with a specific technology:

  • Cisco Certified Internetwork Expert (CCIE) and Cisco Certified Design Expert (CCDE)
  • Microsoft Certified Solutions Expert (MCSE)
  • VMware Certified Design Expert (VCDX)

The certifications are hard. Anyone who has a CCIE certification has demonstrated the ability to study and has the aptitude to pass a certain type of exam. They may also have been fortunate to have an employer willing to pay the steep fees for courses. The question, however, is whether they demonstrate real expertise in the technology.

The problem is that the course material is created with the idea of enabling an exam, and the exam is created purely as a test of the course material. An example will show what I mean.

This example is taken from the Cisco material for TSHOOT. This is one of the exams for the Cisco Certified Network Professional (CCNP) Routing and Switching certification. It covers the skills for troubleshooting and maintaining Cisco IP Networks. Cisco certifications are some of the best, so this example is not an adverse comment on Cisco. It is just an example of a certification topic.

Troubleshooting an IP network requires a good understanding of TCP/IP, and how packets flow through a network from server to switch to WAN and client, and back to the server. NetFlow is a way of recording information about flows, so that you can diagnose performance problems. There is quite a lot you need to know about flows in order to diagnose problems. The course material tells us that:

"A flow is a unidirectional stream of packets, between a given source and a destination, that have several components in common. The seven fields that need to match for packets to be considered part of the same flow are as follows:

  • Source IP Address
  • Destination IP Address
  • Source Port (protocol dependent)
  • Destination Port (protocol dependent)
  • Protocol (Layer 3 or 4)
  • Type of Service (ToS) Value (differentiated services code point [DSCP])
  • Input interface."

I suppose there are a number of concepts here. One is that the flow is a specific "conversation" between client and server. Now this is a bit surprising. It says "unidirectional". Does that mean that the response to a request is in a different flow? How can I tell if there is a network or server delay if the request and response are in different flows? Another concept is that you can’t jump between interfaces. You might have more than one network connection to a switch, but those would be separate flows. I don’t really need to know that there are precisely seven fields: I can always look that up. And I don’t need to know trick questions about what might be a field but is not. TCP/IP flows are a really interesting topic, and I would like to know a bit more about it.
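To make the "unidirectional" point concrete, here is a small sketch (the packet records and field names are invented for illustration) that groups packets by the seven-field key. Because a request and its response swap source and destination, they land in two different flows:

```python
from collections import defaultdict

# Invented packet records for illustration only.
packets = [
    # Request: client -> server
    {"src": "10.0.0.5", "dst": "192.0.2.10", "sport": 51000, "dport": 443,
     "proto": "TCP", "tos": 0, "iface": "Gi0/1"},
    # Response: server -> client (source and destination swapped)
    {"src": "192.0.2.10", "dst": "10.0.0.5", "sport": 443, "dport": 51000,
     "proto": "TCP", "tos": 0, "iface": "Gi0/2"},
]

def flow_key(p):
    # The seven NetFlow key fields quoted above.
    return (p["src"], p["dst"], p["sport"], p["dport"],
            p["proto"], p["tos"], p["iface"])

flows = defaultdict(list)
for p in packets:
    flows[flow_key(p)].append(p)

# The request and its response form two distinct unidirectional flows.
print(len(flows))  # 2
```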

Now here is the test question:

"Which of the following is not a NetFlow key field?

  • Source IP Address
  • Layer 4 Source Port
  • ToS Byte (DSCP)
  • TTL
  • Input Interface."

Did you notice what happened there? I don’t need to know anything about flows. I just need to remember the list of seven fields. And I need to be aware of trick answers. Is Source Port really Layer 4? Is TOS value really a byte? Did I just forget TTL, or could there be a reason why Time to Live is a field that I have forgotten? None of this matters in the real world. In the real world we switch on NetFlow, and configure a collector like SolarWinds. The real work is in interpreting the NetFlow data. And NetFlow is expensive. And it can’t tell you what is happening on the parts of the WAN you do not control. And it does not tell you what response time the user actually experiences.

The problem here is the methodology. If the exam were in Algebra, there would be a vast body of knowledge, different course material, trained teachers and professional examiners. But there is no such body of knowledge or of educators for troubleshooting an IP network. Cisco has to get someone to prepare a course and someone else to prepare an exam. The exam is a test of the course.

Certification courses provide useful training. And simple exams that test whether you paid attention are OK. But certifications do not prove skills. They prove a willingness to study, and an aptitude for a certain kind of test.

Windows 10 S for Enterprise?

Windows 10 S is the new edition of the client OS that is restricted to run only applications from the Windows Store. The advantage is that it is more stable and secure than an OS where the user can install software from anywhere. Microsoft has positioned the OS for the education market. But perhaps it has possibilities for the enterprise too.

Windows 10 S was released in May 2017. It is only available as an OEM installation, for example on the Microsoft Surface Laptop.


Vendors with Windows 10 S laptops currently include Dell, HP and others. Prices are in a similar range to other laptops and notebooks.

The marketing from Microsoft is aimed at the education market, but what interests me is the scope for using Windows 10 S in the enterprise. Mobility is costly, and this OS might bring the cost down.

The main problem for enterprise mobility is making it secure. One approach to this is the managed laptop:

  • a custom Windows 10 Enterprise image
  • joined to the domain
  • encrypted
  • authenticated by a certificate
  • no admin rights for the user, OR admin rights with more active detection and blocking
  • SSL VPN client
  • web proxy client.

This has more complexity and higher support costs than a standard desktop.

An alternative approach is to do away with the idea of validating the device at all, and provide access to enterprise data and applications only through a virtual desktop. In this case mobility is provided by any device running the remote access software: like Citrix Receiver or the VMware Horizon client. It can be a Mac, a Chromebook or a mobile thin client. The problem here is that, if you want to work offline, you need to store data and you need local applications. If you do that, you again need a managed device, and you add further costs.

Windows 10 S may provide a new option. Use a regular desktop in the office, and a Windows 10 S laptop for mobility. As the Windows 10 S laptop cannot run applications except from the Windows Store, the level of protection and additional support required is much lower. You can still run Office applications like Outlook. You can still edit a PowerPoint presentation or work on an Excel spreadsheet offline. But the scope for malware is much reduced. If you need to use an enterprise application like SAP when working from home, say, then you can use remote access to connect to a virtual desktop or a published application. But in this case the virtual desktop needs to be provided only to the mobile users and not to all users.

Windows 10 S supports these enterprise features:

  • Trusted Platform Module (depending on the OEM hardware)
  • Certificate Store for enterprise certificates
  • BitLocker disk encryption
  • Azure domain join and Windows Hello authentication
  • mobile device management with Intune, AirWatch or similar
  • desktop applications from Microsoft, Adobe, etc. as long as they are available from the Windows Store.

The typical company laptop is an expensive compromise. It needs to be powerful enough to run enterprise applications, light enough to carry around easily, secure enough to hold enterprise data, and flexible enough to allow the user to work offline. I think on balance I would prefer to use a regular desktop in the office, and a Windows 10 S laptop for mobility.

End of an era

We are seeing the end of an era in how we think of, and manage, the corporate desktop.

The corporate desktop is only about 12 to 15 years old. In a short burst, Microsoft introduced a range of technologies that made it practical to administer personal computers on a large scale: Active Directory, Group Policy, Windows Installer and so on. Microsoft called it IntelliMirror, although that name has disappeared. We take it all for granted now. It’s how desktops work.

Having an administered desktop like this was very important to the overall architecture of IT services. Devices on the LAN were safe and were allowed to access corporate data. Other devices were not. That’s why software like Flash, Java and Adobe Reader could be allowed to be out of date, and why people stuck with Windows XP and IE 8. They were on the LAN, so they were safe.

As things have evolved, it is getting to the point where this just isn’t the case anymore. The basic design has come to the end of the road. The effort to keep it up to date and secure is too great, and the benefit is no longer there.

I know you can keep the desktop up to date and secure. But it’s a lot of work and it is easy for it to break down. For the user this is all a waste of effort and cost. There’s no benefit to them. It is just a cost, a nuisance, and a constraint. As a minimum you need:

  1. Disk encryption, with boot PIN or password.
  2. Constant updates to Java, Flash, Adobe Reader, Chrome and Firefox. Not just at regular intervals, like every three months, but every few days.
  3. Every app virtualised, except the ones that won’t virtualise.
  4. Special treatment for web apps that need older versions of Internet Explorer and Java.
  5. A certificate infrastructure, and network access control, to test whether the device is one of yours or not.
  6. Security and audit controls to prevent, detect and respond to intrusions.

But mostly now the aim is to allow people to reach the main corporate services, like e-mail, from any device, and from anywhere. Not in all organisations, I know, but mostly I think. And why not?

If I can get to a service with Chrome, then I don’t need a company desktop to get to it. Any device with a browser will do. Web services and Cloud services don’t require a corporate desktop, and in many cases can’t tell whether the client is a corporate desktop or not.

Take Office 365 as an example. I see a lot of organisations adopting it. The whole point of Office 365 is that you can use it on and off the network, and from any device (more or less). Office 365 has no method to detect whether your device is a corporate desktop or not. It can detect the IP address, and the type of device (Windows, iOS etc.), but it can’t detect whether the computer is joined to your domain, or has a machine certificate, or is encrypted, or the software is up to date – all the things that make a corporate desktop.
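The point can be made concrete by contrasting the two sets of attributes. The names below are my own illustrative labels, not Office 365 API fields: everything the service can observe comes from the request itself, and none of it overlaps with the properties that define a corporate desktop.

```python
# Illustrative sketch: attributes a cloud service can observe in an incoming
# request, versus the properties that define a corporate desktop.
# The attribute names are illustrative, not real Office 365 fields.

visible_to_cloud_service = {
    "source_ip", "device_type", "user_agent", "credentials",
}

defines_corporate_desktop = {
    "domain_joined", "machine_certificate",
    "disk_encrypted", "software_up_to_date",
}

# The two sets do not overlap, so the service cannot infer desktop status.
print(visible_to_cloud_service & defines_corporate_desktop)  # set()
```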

I think now we are looking ahead to a different paradigm.

  1. Device enrollment of any kind of device with something like Intune or AirWatch
  2. A corporate user identity, with different levels of authentication and authorisation for different services e.g. an Azure AD identity with Yubikey MFA for the finance and HR systems.
  3. Corporate applications delivered as separate services that you sign up to, and delivered mostly virtually or as web services, with no data on the end device.
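Point 2 above, different levels of authentication for different services, can be sketched as a simple policy table. The service names and the two-level password/MFA model are hypothetical simplifications for illustration.

```python
# Illustrative sketch (service names and levels are hypothetical): a single
# corporate identity, with stronger authentication required for sensitive
# services such as finance and HR.

REQUIRED_AUTH = {
    "email": "password",
    "intranet": "password",
    "finance": "mfa",  # e.g. an Azure AD identity plus a Yubikey
    "hr": "mfa",
}

AUTH_STRENGTH = {"password": 1, "mfa": 2}

def can_access(service: str, auth_method: str) -> bool:
    """Grant access only if the presented method meets the service's bar."""
    required = REQUIRED_AUTH[service]
    return AUTH_STRENGTH[auth_method] >= AUTH_STRENGTH[required]

print(can_access("email", "password"))    # True
print(can_access("finance", "password"))  # False
print(can_access("finance", "mfa"))       # True
```

In practice this is the shape of a conditional access policy: the identity is constant, and the authentication requirement varies with the risk of the service.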

I think this also means we will not need the monolithic, outsourced, integrated IT organisation. When IT is delivered as separate managed services, it does not need to be managed as a single entity. I would expect to see: Corporate Systems; Line of Business Systems; Local Systems.

How would this work in practice? Let’s say I am in engineering in a UK subsidiary of a global business. I get an Azure AD identity and a Yubikey from HR when I join. I pick my devices (a phone, a laptop) from a list, and they are delivered direct to me by the vendor. If I want, I download a corporate clean image, otherwise I just use Windows 10 OEM. I go to the Corporate Intranet new starters page, and enroll both devices in the Device Management system. They auto-discover the Office 365 e-mail and chat. I get a phone ID, which I key in to the phone on my desk.

From a portal I download the apps for my expenses and time reporting from Corporate Services. They only download onto an enrolled device. If I un-enroll, or fail to authenticate, they are wiped. Most of them will be virtual or web apps.

My engineering apps, like Autodesk, come from my Engineering Services. They will only install on an enrolled device. I can do what I like with the app, but I can’t get any important data without my Yubikey.

My own department pays the vendor for the devices. It pays Corporate Services per employee. It has whatever Local Services it wants, for example its own helpdesk. Apps are paid for by a subscription per month.

OK, it’s not perfect, but it is a lot less complicated and easier to manage. It makes IT a set of services instead of an organisation.

Complexity in IT

In a previous post I said I thought that problems in IT are caused by complexity, and not by the pace of change, poor management or lack of skills (although any of those may contribute).

Here are some interesting thoughts from David Gelernter, Professor of Computer Science at Yale.

In his book Mirror Worlds he says:

"Information structures are, potentially, the most complicated structures known to man. Precisely because software is so easy to build, complexity is a deadly software killer."

"Programs that amount to a quarter of a million lines of text (there are about 7000 lines in this book, so picture 36 volumes of programs) are not in the least unusual."

"It is very hard to make programs come out right. After a decent amount of effort they tend to be mostly right, with just a few small bugs. Fixing those small bugs (a bug can be small but catastrophic under the wrong circumstances) might take ten times longer than the entire specification, design, construction, and testing effort up to this point."

"If you are a software designer and you can’t master and subdue monumental complexity, you’re dead: your machines don’t work….Hence "managing complexity" must be your goal."

I don’t think many businesses or customers of IT fully recognise this. They think of IT as fiddly, not for them, full of jargon. They are happy to say they don’t really understand it. They don’t realise that hardly anyone does!

Why is IT so difficult?

A friend of mine, a very experienced and senior non-executive director, asked me why, in all the organisations he knows, IT is the area that causes the most difficulty. There are several common explanations, but I am not sure they add up. This leads me to a different explanation, with interesting consequences.

IT causes difficulty in many ways, for example:

  • results not achieved on time, not what was expected or promised, and not within budget
  • catastrophic failure of critical systems, or loss of data
  • systems hard to use, difficult to change, not integrated with other systems, expensive to maintain, hard to replace
  • problems with staff, and with suppliers: poor quality, high turnover, unreliable.

So my friend can be reasonably confident that a five-year, multi-billion-pound engineering project will be completed successfully, while a one-year, million-pound IT project is unlikely to run to plan. Why is that?

Possible explanations:

  1. IT is changing so fast that whatever you plan is obsolete within a short time
  2. People in IT generally lack professional training and skills
  3. People in the business don’t understand IT, and the people in IT don’t understand the business.

I have doubts about these explanations. They have a superficial truth, but for me they don’t explain the level of difficulty in managing IT successfully.

1. The rate of change

The IT industry is constantly producing new things, it’s true. But in other respects the rate of change is fairly slow. The way we do computing is not fundamentally very different from, say, ten years ago. Many of the same companies are selling many of the same products. If you started a project five years ago, no matter how large, it is difficult to see what technology has changed sufficiently to cause the project to fail.

2. Training and skills

Because things in IT go wrong, it is easy to identify faults in the skills of individuals retrospectively as the cause, but it is not necessarily so. When things are difficult or impossible to achieve, even the highest level of skill may not be sufficient. It is hard to imagine that, in general, the training and skills of people in IT are lower than in Sales, Marketing, Procurement or Distribution. Maybe those areas just aren’t as difficult, and so the managers appear to be more successful.

3. Understanding

There is certainly a high barrier to getting to grips with the language of IT. But at the level at which IT and other people in the business need to communicate this really should not be relevant. Medicine has its own language, but doctors don’t seem to have the same problem communicating with patients. I suspect that problems in understanding are more to do with trust than with language.

So if these explanations don’t account for the difficulty with IT, what are we left with? My view is that the root cause is complexity. IT systems are the link between a human intention and a computer chip. Human intentions are imprecise and hard to define, but chips are strictly binary. The layers of software in between intention and chip are hugely complex. To produce a predictable outcome is extremely difficult.

If it is true that the root cause of difficulty in managing IT is complexity, then there are two consequences. The first is that we should aim to minimise complexity in every possible way; the second is that we need people who manage complexity very well.