Fault with Company Portal

This is a story about the complete failure of Microsoft Premier Support to diagnose and resolve a fault in the Company Portal.

It is difficult to put into words how complete the failure is. But it includes a failure to define the problem; a failure to capture it or reproduce it; and a failure to provide any diagnosis of the cause.

The Company Portal is the Modern App, or Store App, that enables a user to install applications that have been published and made available to them from Intune. It is an essential part of the Mobile Device Management (MDM) platform. Without the Company Portal, a user can only have the applications that are “Required”. So, after Autopilot, Company Portal will often be the first place a user goes, to obtain the rest of the applications that they need to work with. An example is specialist finance applications. These might be made available to a community of finance users, but each person will install the ones they need individually.

The problem we have had for several months is that the Company Portal will suddenly disappear from a user’s desktop. It is gone. The user can search for “Company Portal” and it is not there. Where has it gone? No idea. How do you get it back? Well, you obviously can’t use the Company Portal to get it!

The facts of the problem are simple and clear, though you would not believe it from the number of times we have been asked to explain and provide logs:

  • After Autopilot completes, Company Portal is present and working.
  • Some short time later, it has disappeared from the user’s Start menu.
  • If you run Get-AppXPackage as the user, the Company Portal is not listed. However, if you log on as an admin, and run Get-AppXPackage -AllUsers, then the portal is shown as installed and available, but only for the admin user (a sketch of these checks follows this list).
  • The AppX event log does not show any obvious relevant events.
  • It seems to happen in episodes. And it seems to happen sometimes and not others.
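
For reference, this is roughly how we check the state of the portal on an affected machine. It is only a sketch, and it assumes the package name Microsoft.CompanyPortal; check the name in your own tenant.

    # Run in the affected user's own session: is the portal registered for this user?
    Get-AppxPackage -Name 'Microsoft.CompanyPortal'

    # Run elevated as an admin: is it staged for any user on the machine, and which ones?
    Get-AppxPackage -AllUsers -Name 'Microsoft.CompanyPortal' |
        Select-Object Name, Version, PackageUserInformation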

We have been asked repeatedly to provide logs: Autopilot logs and One Data Collector logs. But, obviously, if you gather logs before it has disappeared, then there is nothing to see. If you gather logs after it has disappeared, then there is also nothing to see.
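
For what it is worth, this is the kind of collection we run each time we are asked. It is a sketch only: the output path is arbitrary, and the areas accepted by the built-in MdmDiagnosticsTool vary between Windows builds.

    # Gather Autopilot / enrollment diagnostics into a CAB for the support case.
    # Run elevated; the output path is arbitrary.
    MdmDiagnosticsTool.exe -area Autopilot -cab C:\Temp\autopilot-diag.cab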

After a while, we asked Microsoft Premier Support to try to reproduce the fault themselves instead of continually asking us for logs. Amazingly, they are unable to do this. Microsoft Premier Support does not have access to virtual machines, or physical machines, that can be used to reproduce faults in Intune and Autopilot. Just let that sink in. Premier Support is unable to attempt to reproduce a fault in Autopilot. It depends on the customer to reproduce it.

We had a long discussion with Premier Support about Offline versus Online apps. The Microsoft documentation for Autopilot recommends in several places that you should use the Offline version of Company Portal. This is counter-intuitive. Offline apps are designed, intended, to be used offline. The scenario given in Microsoft documentation is a kiosk or shared device that is not connected to the internet. The Offline app is installed by DISM in a task sequence, and is used offline. Company Portal, by definition, is of no use offline. It is used to install applications from Intune. If the device were offline, it would not connect to Intune. So why install the Offline version?

We eventually established, at least we think, that an Offline app is in some way cached by Intune; whereas an Online app is obtained directly from a Microsoft Store repository. This seems relevant to the case of the disappearing portal, but we never discovered more about the true difference.

In an early occurrence, we found an AppX event to say that the Company Portal was not installed because of a missing dependency. The missing dependency was the Microsoft Services Store Engagement app. This is the app that enables users to provide feedback. But this app is (apparently) an embedded part of Windows 10 and cannot be missing. We heard no more about this.

The Company Portal stopped disappearing for a while, and we deduced that the fault was in some way related to version updates. It occurred frequently when the version changed from 11.1.107.0 to 11.1.146.0. It has started to occur frequently again now that the version is 11.1.177.0. Of course, we have no idea how it is related to the update. We don’t even really know how an update of an Offline app happens.

Finally, Microsoft Premier Support has asked us to gather a Sysinternals Procmon log, together with a UXTrace log. I have done a lot of troubleshooting with Procmon. It generates huge log files, of every file and registry operation, as well as some TCP operations. To use Procmon effectively, you need a way to stop it when the fault occurs. Microsoft Premier Support simply asked us to run it and stop it when the fault occurred. There are several problems with this. The first is that the user needs to run UXTrace and Procmon with elevated rights. In our environment (as in almost any production environment) the user does not have admin rights and cannot elevate. The second is that Procmon creates huge logs. You can’t just run it for an unspecified length of time, then stop it and save the log. Microsoft Premier Support was clearly unable to understand the problem of gathering the logs, let alone provide a solution. This is dismal. I would expect a competent second-line engineer to have the skills to work out a strategy for collecting logs. It is part of the basic skill of troubleshooting.
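
For the record, the elevation problem at least has a workable answer: run Procmon non-interactively under SYSTEM from a scheduled task, writing to a backing file, and trigger a second task to stop it when the portal disappears. This is only a sketch, under our own assumptions about paths and task names; the Procmon switches (/AcceptEula, /Quiet, /Minimized, /BackingFile, /Terminate) are standard Sysinternals options. It does not solve the log size problem, but it removes the need for the user to elevate.

    # Hypothetical paths; adjust for your own tooling share.
    $procmon = 'C:\Tools\Procmon64.exe'
    $log     = 'C:\Logs\CompanyPortal.pml'

    # Task that starts the trace under SYSTEM, writing to a backing file.
    $start = New-ScheduledTaskAction -Execute $procmon `
             -Argument "/AcceptEula /Quiet /Minimized /BackingFile $log"
    Register-ScheduledTask -TaskName 'StartProcmonTrace' -Action $start `
        -User 'SYSTEM' -RunLevel Highest -Force

    # Task that stops the trace cleanly and flushes the log.
    $stop = New-ScheduledTaskAction -Execute $procmon -Argument '/Terminate'
    Register-ScheduledTask -TaskName 'StopProcmonTrace' -Action $stop `
        -User 'SYSTEM' -RunLevel Highest -Force

    # Start tracing just before the repro window, stop it when the portal has gone.
    Start-ScheduledTask -TaskName 'StartProcmonTrace'
    # ... wait for the Company Portal to disappear ...
    Start-ScheduledTask -TaskName 'StopProcmonTrace'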

So, three months on, Microsoft Premier Support has no clue, and no practical problem-solving approach.

The thing we have found is that Premier Support seems to have no access to the people who develop and run Intune, Autopilot and Company Portal. They are just as much in the dark as you or I.

The Troubleshooting Process

I do a lot of troubleshooting of problems in corporate infrastructure. By that, I mean investigating difficult technical problems in a structured way to find either a solution or a workaround. This post is a few thoughts on the process.

Here is a caricature of the process that often happens. It is imaginary, but I think most people will easily identify real cases that are similar.

  1. A user (or a group of users, for example at one site or using one application) experiences a problem. The help desk tries a few things, not very systematically. Sometimes different people from the help desk try the same thing again. Various updates and changes are made to see if they fix the problem. Sometimes the problem is resolved. Sometimes it goes away. And sometimes it is ignored as "just the way it is until we upgrade xyz".
  2. A user experiences a problem. The user is senior, or the problem affects something that senior people care about. A problem manager takes over the case. The problem manager is someone who co-ordinates actions but does not, themselves, possess the technical skills to resolve the problem. The problem is assigned to whichever team seems most likely e.g. networks, server, storage, application. The team works hard to demonstrate that the fault cannot be with them. The problem continues, with increasing levels of polite acrimony. Eventually a significant change is made which may, or may not, be the cause e.g. a platform component is changed. The problem goes away, but no-one knows what actually caused it.

What is the perfect way? A user experiences a problem. A technical wizard understands every aspect of the problem without needing to be told anything about it, or to do any investigation. The technical wizard knows the solution, whether it is in the network, or server, or storage, or application layer. After some discussion, the fix is implemented and the problem is solved.

This perfect way is absurd, obviously. No-one can know what the problem is until they have seen it, thought about it, asked a few questions, gathered some data. No-one can be expert in enough technologies to know exactly what the fix is, without running some tests, trying a few things.

So we need a process that is not like 1) and 2), but also does not assume there is a perfect way.

First, some context. We already have an awareness of what is normal behaviour and what is not. If a user says that logon is slow, we assess it against a typical logon time. If the behaviour is not normal, then we can assume there is a fault. For the most part (not always!) vendor products do not show obvious faults in isolation. So, if the vendors do not know about this fault, then there must be a fault in our configuration, or an unknown fault in the product. There must be something we can change so that the problem does not occur. Our job is to find what to change.

The way I go about it is divided into five parts. These are not all sequential, but they involve different types of activity:

  1. Incident Handling
  2. Problem Definition
  3. Investigation
  4. Analysis
  5. Resolution.

I am not going to describe each part in detail: just give an idea.

Incident Handling

Being practical, I am going to assume that the great majority of day-to-day help desk calls do not require much rigour. The most important thing is to avoid doing things twice. As soon as we (on the help desk) realise that a problem is not simple, we need to follow a systematic process to gather the incident data in a way that avoids having to do it again.

Problem Definition

As soon as we realise that a fault may be more difficult to resolve than by providing a simple change, we need to prepare a Problem Definition.

Very often I find that, when starting out, it is more difficult to write down an accurate statement of the problem than you might expect. The report might be: "The application crashes". But is that a Not Responding window or an application error? If there is an error, what is the exact message, and what happens when you click on OK? Is there an error shown in the Application Event Log? Does the application log show an error? How often does this occur? How many people are affected? What is the user doing in the moments leading up to the problem? Writing down what is and is not known goes a long way to defining what problem we are trying to solve.
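
As a trivial example of turning those questions into data, this is the kind of first query I would run against the Application Event Log. It is a sketch: the provider name 'MyApp' is a placeholder for the application’s actual event source.

    # Pull the last week of errors for the suspect application, so the problem
    # statement can quote exact event IDs and messages rather than "it crashes".
    Get-WinEvent -FilterHashtable @{
        LogName      = 'Application'
        ProviderName = 'MyApp'              # placeholder event source
        Level        = 2                    # 2 = Error
        StartTime    = (Get-Date).AddDays(-7)
    } | Select-Object TimeCreated, Id, Message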

It is not always easy. I have worked on a fault causing Windows to crash with a bug check, where it was difficult to know even whether there was a single problem, or more than one.

Investigation

This is not a distinct step. But I identify it separately because we attempt to gather data systematically to understand the problem better. The more we understand, the more specific the tools we might use to investigate further.

Initially we might configure application debug level logging. Or we might run a network capture. As we narrow it down we might get very specific.

For example, I recently had a problem where Outlook seemed to hang for a few seconds when typing. I first ran a network capture of all traffic to see if, perhaps, a peak was causing a delay in response. There was no significant level of traffic. There was no significant latency. No retransmission. No problems with the network connection in general.

Then I ran a capture only between the Outlook client and the Exchange server (using the relevant ports). I noticed a spike in traffic every 15 minutes, coinciding with the problem. But what was the spike? Was it normal, or a fault?
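
One way to run that host-filtered capture, without installing anything extra, is the built-in netsh trace. A sketch, with a placeholder server address; the port-level filtering was applied afterwards in the analysis tool.

    # Capture only traffic to and from the Exchange server (placeholder address),
    # then stop the trace once the hang has been observed.
    netsh trace start capture=yes tracefile=C:\Temp\outlook.etl maxsize=512 IPv4.Address=10.0.0.25
    # ... reproduce the Outlook hang ...
    netsh trace stop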

So then, on the server, I configured Remote Operation (ROP) logging to capture the exact MAPI operations being performed by the client. The problem was caused by a MAPI action to re-read the mail database and refresh the view in Outlook. This occurred exactly when the user experienced the problem.

Analysis

I will mention just two aspects of the analysis.

One is that we must have a mental model of the part of the system we are examining, and all its components. We then need to rule in or out the different components we think could be involved. Often I am told the problem is in Citrix. When I investigate, it is not in Citrix at all. It is in, for example, the configuration of the application on Citrix. Or, in one very difficult case, it was port exhaustion in an unpatched Windows server. Or it might be a configuration error in Citrix after all.

The second is conjecture. I use the term to mean that we need to develop an idea of a single cause of all the different symptoms we have found in the investigation. I don’t mean a guess. For example, sometimes people will say "I think it must be the anti-virus". This is a guess. But a conjecture would be: "A Windows driver fault is occurring in the I/O stack. It is a type of fault that does not appear in testing using Driver Verifier". This is exactly what the problem was, but it was extremely difficult to find.

Resolution

An interesting aspect of troubleshooting is that the final part is often really easy. Often, I don’t need to be involved any further, once the exact cause is identified.

You might think that, after investigation and analysis, we may have a good idea but we cannot be certain. Of course, nothing is ever entirely certain. But going back to what I said about the context, the problem must be caused by a specific fault; otherwise the behaviour would be normal. When you find the cause, you just know this is it.

The problem with certifications

Certifications sound like a great idea, and if I were in HR recruiting IT people, I could be forgiven for thinking that they tell me something important about a person’s skills level. But I would be wrong.

The idea of certifications is not wrong. If I were having a boiler installed, I would probably want the technician to be certified to work with gas. This would be an industry certification, perhaps with an independent assessment body. They might also have done some training for the specific type of boilers they install. The IT industry does not do this. It has vendor certifications that are intended to demonstrate a skill level with a specific technology:

  • Cisco Certified Internetwork Expert (CCIE) and Cisco Certified Design Expert (CCDE)
  • Microsoft Certified Solutions Expert (MCSE)
  • VMware Certified Design Expert (VCDX)

The certifications are hard. Anyone who has a CCIE certification has demonstrated the ability to study and has the aptitude to pass a certain type of exam. They may also have been fortunate to have an employer willing to pay the steep fees for courses. The question, however, is whether they demonstrate real expertise in the technology.

The problem is that the course material is created with the idea of enabling an exam, and the exam is created purely as a test of the course material. An example will show what I mean.

This example is taken from the Cisco material for TSHOOT. This is one of the exams for the Cisco Certified Network Professional (CCNP) Routing and Switching certification. It covers the skills for troubleshooting and maintaining Cisco IP Networks. Cisco certifications are some of the best, so this example is not an adverse comment on Cisco. It is just an example of a certification topic.

Troubleshooting an IP network requires a good understanding of TCP/IP, and how packets flow through a network from server to switch to WAN and client, and back to the server. NetFlow is a way of recording information about flows, so that you can diagnose performance problems. There is quite a lot you need to know about flows in order to diagnose problems. The course material tells us that:

"A flow is a unidirectional stream of packets, between a given source and a destination, that have several components in common. The seven fields that need to match for packets to be considered part of the same flow are as follows:

  • Source IP Address
  • Destination IP Address
  • Source Port (protocol dependent)
  • Destination Port (protocol dependent)
  • Protocol (Layer 3 or 4)
  • Type of Service (ToS) Value (differentiated services code point [DSCP])
  • Input interface."

I suppose there are a number of concepts here. One is that the flow is a specific "conversation" between client and server. Now this is a bit surprising. It says "unidirectional". Does that mean that the response to a request is in a different flow? How can I tell if there is a network or server delay if the request and response are in different flows? Another concept is that you can’t jump between interfaces. You might have more than one network connection to a switch, but those would be separate flows. I don’t really need to know that there are precisely seven fields: I can always look that up. And I don’t need to know trick questions about what might be a field but is not. TCP/IP flows are a really interesting topic, and I would like to know a bit more about them.
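
To make the "unidirectional" point concrete, here is a rough sketch of what a flow key amounts to. The reply packet swaps source and destination, so it produces a different key, and therefore a different flow.

    # Rough sketch of a NetFlow-style key: the reply swaps source and destination,
    # so request and response end up in two separate (unidirectional) flows.
    function Get-FlowKey($SrcIP, $DstIP, $SrcPort, $DstPort, $Protocol, $Tos, $Interface) {
        "$SrcIP|$DstIP|$SrcPort|$DstPort|$Protocol|$Tos|$Interface"
    }

    Get-FlowKey 10.0.0.5  10.0.0.25 50312 443 TCP 0 'Gi0/1'   # client to server
    Get-FlowKey 10.0.0.25 10.0.0.5  443 50312 TCP 0 'Gi0/2'   # server to client: a different key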

Now here is the test question:

"Which of the following is not a NetFlow key field

  • Source IP Address
  • Layer 4 Source Port
  • ToS Byte (DSCP)
  • TTL
  • Input Interface."

Did you notice what happened there? I don’t need to know anything about flows. I just need to remember the list of seven fields. And I need to be aware of trick answers. Is Source Port really Layer 4? Is ToS value really a byte? Did I just forget TTL, or could there be a reason why Time to Live is a field that I have forgotten? None of this matters in the real world. In the real world we switch on NetFlow, and configure a collector like SolarWinds. The real work is in interpreting the NetFlow data. And NetFlow is expensive. And it can’t tell you what is happening on the parts of the WAN you do not control. And it does not tell you what response time the user actually experiences.

The problem here is the methodology. If the exam were in Algebra, there would be a vast body of knowledge, different course material, trained teachers and professional examiners. But there is no such body of knowledge or of educators for troubleshooting an IP network. Cisco has to get someone to prepare a course and someone else to prepare an exam. The exam is a test of the course.

Certification courses provide useful training. And simple exams that test whether you paid attention are OK. But certifications do not prove skills. They prove a willingness to study, and an aptitude for a certain kind of test.

Windows 10 S for Enterprise?

Windows 10 S is the new edition of the client OS that is restricted to run only applications from the Windows Store. The advantage is that it is more stable and secure than an OS where the user can install software from anywhere. Microsoft has positioned the OS for the education market. But perhaps it has possibilities for the enterprise too.

Windows 10 S was released in May 2017. It is only available as an OEM installation, for example on the Microsoft Surface Laptop.

Vendors with Windows 10 S laptops currently include Dell, HP and others. Prices are in a similar range to other laptops and notebooks.

The marketing from Microsoft is aimed at the education market, but what interests me is the scope for using Windows 10 S in the enterprise. Mobility is costly, and this OS might bring the cost down.

The main problem for enterprise mobility is making it secure. One approach to this is the managed laptop:

  • a custom Windows 10 Enterprise image
  • joined to the domain
  • encrypted
  • authenticated by a certificate
  • no admin rights for the user, OR admin rights with more active detection and blocking
  • SSL VPN client
  • web proxy client.

This has more complexity and higher support costs than a standard desktop.

An alternative approach is to do away with the idea of validating the device at all, and provide access to enterprise data and applications only through a virtual desktop. In this case mobility is provided by any device running the remote access software, like Citrix Receiver or the VMware Horizon client. It can be a Mac, a Chromebook or a mobile thin client. The problem here is that, if you want to work offline, you need to store data and you need local applications. If you do that, you again need a managed device, and you add further costs.

Windows 10 S may provide a new option. Use a regular desktop in the office, and a Windows 10 S laptop for mobility. As the Windows 10 S laptop cannot run applications except from the Windows Store, the level of protection and additional support required is much lower. You can still run Office applications like Outlook. You can still edit a PowerPoint presentation or work on an Excel spreadsheet offline. But the scope for malware is much reduced. If you need to use an enterprise application like SAP when working from home, say, then you can use remote access to connect to a virtual desktop or a published application. But in this case the virtual desktop needs to be provided only to the mobile users and not to all users.

Windows 10 S supports these enterprise features:

  • Trusted Platform Module (depending on the OEM hardware)
  • Certificate Store for enterprise certificates
  • BitLocker disk encryption
  • Azure domain join and Windows Hello authentication
  • mobile device management with Intune, AirWatch or similar
  • desktop applications from Microsoft, Adobe, etc. as long as they are available from the Windows Store.

The typical company laptop is an expensive compromise. It needs to be powerful enough to run enterprise applications, light enough to carry around easily, secure enough to hold enterprise data, flexible enough to allow the user to work offline. I think on balance I would prefer to use a regular desktop in the office, and a Windows 10 S laptop for mobility.

End of an era

We are seeing the end of an era in how we think of, and manage, the corporate desktop.

The corporate desktop is only about 12 to 15 years old. In a short burst, Microsoft introduced a range of technologies that made it practical to administer personal computers on a large scale: Active Directory, Group Policy, Windows Installer etc. Microsoft called it IntelliMirror, although that name has disappeared. We take it all for granted now. It’s how desktops work.

Having an administered desktop like this was very important to the overall architecture of IT services. Devices on the LAN were safe and were allowed to access corporate data. Other devices were not. That’s why software like Flash, Java and Adobe Reader could be allowed to be out of date, and why people stuck with Windows XP and IE 8. They were on the LAN, so they were safe.

As things have evolved, it is getting to the point where this just isn’t the case anymore. The basic design has come to the end of the road. The effort to keep it up to date and secure is too great, and the benefit is no longer there.

I know you can keep the desktop up to date and secure. But it’s a lot of work and it is easy for it to break down. For the user this is all a waste of effort and cost. There’s no benefit to them. It is just a cost, a nuisance, and a constraint. As a minimum you need:

  1. Disk encryption, with boot PIN or password.
  2. Constant updates to Java, Flash, Adobe Reader, Chrome, Firefox. Not just regular, like every three months, but every few days.
  3. Every app virtualised, except the ones that won’t virtualise.
  4. Special treatment for web apps that need older versions of Internet Explorer and Java.
  5. A certificate infrastructure, and network access control, to test whether the device is one of yours or not.
  6. Security and audit controls to prevent, detect and respond to intrusions.

But mostly now the aim is to allow people to reach the main corporate services, like e-mail, from any device, and from anywhere. Not in all organisations, I know, but mostly I think. And why not?

If I can get to a service with Chrome, then I don’t need a company desktop to get to it. Any device with a browser will do. Web services and Cloud services don’t require a corporate desktop, and in many cases can’t tell if the client is a corporate desktop or not.

Take Office 365 as an example. I see a lot of organisations adopting it. The whole point of Office 365 is that you can use it on and off the network, and from any device (more or less). Office 365 has no method to detect whether your device is a corporate desktop or not. It can detect the IP address, and the type of device (Windows, iOS etc.), but it can’t detect whether the computer is joined to your domain, or has a machine certificate, or is encrypted, or the software is up to date – all the things that make a corporate desktop.

I think now we are looking ahead to a different paradigm.

  1. Enrollment of any kind of device with something like Intune or AirWatch.
  2. A corporate user identity, with different levels of authentication and authorisation for different services e.g. an Azure AD identity with Yubikey MFA for the finance and HR systems.
  3. Corporate applications delivered as separate services that you sign up to, and delivered mostly virtually or as web services, with no data on the end device.

I think this also means we will not need the monolithic, outsourced, integrated IT organisation. When IT is delivered as separate managed services, it does not need to be managed as a single entity. I would expect to see: Corporate Systems; Line of Business Systems; Local Systems.

How would this work in practice? Let’s say I am in engineering in a UK subsidiary of a global business. I get an Azure AD identity and a Yubikey from HR when I join. I pick my devices (a phone, a laptop) from a list, and they are delivered direct to me by the vendor. If I want, I download a corporate clean image, otherwise I just use Windows 10 OEM. I go to the Corporate Intranet new starters page, and enroll both devices in the Device Management system. They auto-discover the Office 365 e-mail and chat. I get a phone ID, which I key in to the phone on my desk.

From a portal I download the apps for my expenses and time reporting from Corporate Services. They only download onto an enrolled device. If I un-enroll, or fail to authenticate, they are wiped. Most of them will be virtual or web apps.

My engineering apps, like Autodesk, come from my Engineering Services. They will only install on an enrolled device. I can do what I like with the app, but I can’t get any important data without my Yubikey.

My own department pays the vendor for the devices. It pays Corporate Services per employee. It has whatever Local Services it wants, for example its own helpdesk. Apps have a subscription per month.

OK, it’s not perfect, but it is a lot less complicated and easier to manage. It makes IT a set of services instead of an organisation.

Complexity in IT

In a previous post I said I thought that problems in IT are caused by complexity, and not by the pace of change, poor management or lack of skills (although any of those may contribute).

Here are some interesting thoughts from David Gelernter. Gelernter is Professor of Computer Science at Yale.

In his book Mirror Worlds he says:

"Information structures are, potentially, the most complicated structures known to man. Precisely because software is so easy to build, complexity is a deadly software killer."

"Programs that amount to a quarter of a million lines of text (there are about 7000 lines in this book, so picture 36 volumes of programs) are not in the least unusual." "It is very hard to make programs come out right. After a decent amount of effort they tend to be mostly right, with just a few small bugs. Fixing those small bugs (a bug can be small but catastrophic under the wrong circumstances) might take ten times longer than the entire specification, design, construction, and testing effort up to this point."

"If you are a software designer and you can’t master and subdue monumental complexity, you’re dead: your machines don’t work….Hence "managing complexity" must be your goal."

I don’t think many businesses or customers of IT fully recognise this. They think of IT as fiddly, not for them, full of jargon. They are happy to say they don’t really understand it. They don’t realise that hardly anyone does!

Why is IT so difficult?

A friend of mine, a very experienced and senior non-executive director, asked me why, in all the organisations he knows, IT is the area that causes the most difficulty. There are several common explanations, but I am not sure they add up. This leads me to a different explanation, with interesting consequences.

IT causes difficulty in many ways, for example:

  • results not achieved on time, not what was expected or promised, and not within budget
  • catastrophic failure of critical systems, or loss of data
  • systems hard to use, difficult to change, not integrated with other systems, expensive to maintain, hard to replace
  • problems with staff, and with suppliers: poor quality, high turnover, unreliable.

So my friend can be reasonably confident that a five year multi-billion pound engineering project will be completed successfully, while a one year million pound IT project is unlikely to run to plan. Why is that?

Possible explanations:

  1. IT is changing so fast that whatever you plan is obsolete within a short time
  2. People in IT generally lack professional training and skills
  3. People in the business don’t understand IT, and the people in IT don’t understand the business.

I have doubts about these explanations. They have a superficial truth, but for me they don’t explain the level of difficulty in managing IT successfully.

1. The rate of change

The IT industry is constantly producing new things, that’s true. But in other respects the rate of change is fairly slow. The way we do computing is not fundamentally very different from say ten years ago. Many of the same companies are selling many of the same products. If you started a project five years ago, no matter how large, it is difficult to see what technology has changed sufficiently to cause the project to fail.

2. Training and skills

Because things in IT go wrong, it is easy retrospectively to identify faults in the skills of the individuals as the cause, but it is not necessarily so. When things are difficult or impossible to achieve even the highest level of skill may not be sufficient. It is hard to imagine that in general the training and skills of people in IT are lower than in Sales, Marketing, Procurement, Distribution. Maybe those areas just aren’t as difficult, and so the managers appear to be more successful.

3. Understanding

There is a high threshold in getting to grips with the language of IT, certainly. But at the level at which IT and other people in the business need to communicate this really should not be relevant. Medicine has its own language, but doctors don’t seem to have the same problem communicating with patients. I suspect that problems in understanding are more to do with trust than with language.

So if these explanations don’t account for the difficulty with IT, what are we left with? My view is that the root cause is complexity. IT systems are the link between a human intention and a computer chip. Human intentions are imprecise and hard to define, but chips are strictly binary. The layers of software in between intention and chip are hugely complex. To produce a predictable outcome is extremely difficult.

If it is true that the root cause of difficulty in managing IT is complexity then there are two consequences. The first is that we should aim to minimise the complexity in every possible way; and the second is that we need people who manage complexity very well.

Migrating applications to Windows 7

One of the biggest challenges when upgrading to Windows 7 is in testing and preparing applications. This post puts together a few conclusions that might assist you in planning the work.

The extended lifespan of Windows XP and Server 2003 has been a sort of "peace dividend" or "pension holiday". When you do come to upgrade it is important not to underestimate the cost and uncertainty involved in application compatibility. But at the same time you don’t need to accept that the migration will take forever.

The problem is that applications can be incompatible with Windows 7 in many different ways. Some of these are trivial and easily solved. Some are harder to solve. Some are hard to find and impossible to solve. You don’t know until you test. The same applies to running the applications in Citrix on Server 2008 R2, with the added complication of 64-bit. Here are a few examples to illustrate:

Standard third party application: Lotus Notes

  • The current version 8.5.1 is not certified on Windows 7. Does it work OK or not? Do you wait until later this year for 8.5.2, or go ahead with 8.5.1? There is a patch, Fix Pack 1, that is certified but it adds complexity to the installation.
  • You would think it would be quite simple to find out: ask the vendor. But most vendors do not certify previous versions. That does not mean they don’t run perfectly well. In this case, although 8.5.1 is not certified, the release notes for Fix Pack 1 contain only trivial changes and 8.5.1 appears to work fine, so there is no reason to delay.

Specialised third party application: legal software

  • The installation fails on Vista/Windows 7. Examination of the logs and the Windows Installer file shows there is a custom action to copy templates into the user profile path. The path is hard coded and fails.
  • The solution is to customise the installer to remove the custom action and replicate it in a custom script (a sketch of such a script follows this list). Inform the vendor so they can modify the installer.
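
A sketch of the kind of replacement script, with the source folder and file names entirely hypothetical:

    # Replacement for the broken custom action: copy the templates into the
    # current user's profile using environment variables instead of a hard-coded path.
    $source = Join-Path $env:ProgramFiles 'LegalApp\Templates'
    $target = Join-Path $env:APPDATA 'LegalApp\Templates'

    if (-not (Test-Path $target)) { New-Item -ItemType Directory -Path $target | Out-Null }
    Copy-Item -Path (Join-Path $source '*.dot*') -Destination $target -Force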

Custom system: membership database

  • This is an old system with a SystemBuilder 4GL graphical interface to a Unidata database. The version of SystemBuilder being used is not certified or even tested on Vista/Windows 7. The SBClient application contains an OEM Pervasive SQL client that is also obsolete. The client does mail merge to Word 2003, so it would need to be tested if used with Word 2007 or 2010.
  • There is a new version of SystemBuilder that, amazingly for such an old product, is certified both on Windows 7 and on Server 2008 R2 Terminal Services. The new version seems to work perfectly with the old system. However you need to change the client side and the server side of the graphical interface at the same time, so it would be a big bang change to a critical system.
  • But, after packaging the old version using Wise Package Studio, it seems to work fine on both Windows 7 and on Server 2008 Terminal Services, so there is no need to upgrade.

Other Gotchas

  • Applications with OEM versions of Crystal Reports 11 or earlier do not install on Windows 7. Crystal Reports 11 is not supported on Windows 7, and you can’t upgrade an OEM edition, but the installer can be modified to install successfully.
  • Applications using the common VB6 SendKeys function raise an error on Windows 7. SendKeys does not work with UAC. UAC can only be turned off per computer, not per application, so there is no workaround except to turn UAC off entirely.
  • In XP you can use the Printer button on the PageSetupDialog to set different printer properties for an application. In the Vista/Windows 7 API it’s gone. There’s no error, it’s just not there. But in .NET Framework it’s still there! This might seem rather obscure, but the point is: you would have to do a lot of testing to discover this and then find out whether it matters to your users of that application.

Obviously you could wait till your applications are fully tested or upgraded to the latest certified versions, but this could take impossibly long. If you have just one core application that is not ready, you can’t upgrade the desktop.

A lot of people seem to be combining application virtualization with a Windows 7 rollout. Perhaps surprisingly, application virtualization is largely irrelevant to compatibility across operating systems. With a virtualized app, the same DLLs run within the OS with exactly the same results. If the application faults natively, it will fault when virtualized. Virtualization can be used to implement a compatibility fix, but you still need the fix.

The best way to approach this is with a structured testing environment and a full set of delivery options. Then, for the difficult applications, you can set a time limit.

Structured Testing Environment

  • Wise Package Studio or similar, to examine the internal structure of the application and check for conflicts between applications.
  • A full VMware testing environment with VMware Workstation and ESXi, so you can create all the packaging and testing environments you need and, most importantly, leave them running so that users can log on remotely to test.
  • Scripted or automated tests and test accounts for each application.
  • Microsoft Application Compatibility Toolkit for testing and distributing fixes.
  • Thorough documentation and audit trail of the testing.

Delivery options

  • Native installation for compatible and well behaved applications
  • Citrix XenApp published applications, or perhaps virtual desktop, for incompatible applications
  • Virtualization for conflicting applications (e.g. applications that require different versions of common components) or badly behaved applications (e.g. applications that change the default behaviour of the OS)

Most larger organisations already use several delivery options. What is new is to work out the interdependencies of different applications and which platforms they need to sit on. For example, if the incompatible app does a mail merge to Word or a report export to Excel, then the back end platform needs to have Office. It won’t be able to merge and export to the front end. This means that you also have to consider the user profile settings across different delivery platforms. If the user changes a default printer on the Windows 7 front end, should the same change be made to the back end or not?

With this approach, structured testing and multiple delivery options, you can set a time limit for preparing applications for Windows 7 migration. You can migrate the core desktop to Windows 7, while migrating older applications when they are ready.