Windows Hello for Business and MFA

As an end-user computing specialist, I spend most of my time on security-related matters. Good cyber security is the most difficult part of a design to get right, because it requires a balance between security and ease of use. It is quite easy to implement the standard security controls. What is more difficult is to deal with all the exceptions and operational difficulties in a secure way.

One small example of this is the configuration of Windows Hello for Business (WHB). WHB is an excellent authentication method but, like anything, it has potential flaws too.

Before WHB

Before WHB, a member of staff could typically log on to any corporate device. It had to be a corporate device, because only that would recognise the domain account. But it could be any corporate device. In fact, roaming profiles were designed to enable anyone to log on to any device.

There are two problems with this. First, because it relies only on a simple password, the password needs to be reasonably long and complex. This increases the risk that the user will write the password down. Where do they do this? They know they should not put it on a post-it note stuck to the computer. So they write it down in a notebook kept with the computer. If the computer is stolen with the notebook, the thief has access to the computer as that person.

The second problem is that, if someone gets hold of a password (for example by phishing), they only need to get hold of a device, any device, to gain access. There is no protection other than knowledge of the password combined with access to any device. An insider might easily obtain a password, and have access to another device to use it. Indeed, people might even voluntarily disclose their password, or arrange to have a password changed, so that another person can use it on another device (e.g. maternity leave).

With WHB

WHB counters these problems. It uses a one-time event to create an association between a specific user and a specific device. The one-time event uses a second authentication method to verify the identity of the user. When the identity is confirmed, a unique PIN is created, valid only for that device. The association is bound up in the Trusted Platform Module (TPM), a hardware component on the motherboard of the computer. When the PIN is supplied, it validates the association between user and device and unlocks the credentials to be used for access to network resources, for example the email account. The email service (e.g. Exchange) knows absolutely nothing about the PIN. It doesn’t even know there is a PIN. What it knows (through Conditional Access) is that the user supplied valid credentials from a managed device protected by a TPM.

We all have experience of something similar, when we create a PIN for a mobile phone. And, just like a phone, facial recognition or fingerprint can be used with WHB as a proxy for the PIN. The difference is that, with the personal phone, there was no separate verification of the identity at the outset. The person with the PIN is just the person who set up the phone.

Two flaws

There are two flaws with this authentication method. The first is in the one-time event; the second is in the way WHB is configured.

For the first, you need to know that the person setting up WHB is who they say they are. That might be quite obvious if they come into an office to set it up. But if you send out devices to be set up at home, you don’t have an assurance that the device gets to the right person. There has to have been a secure association created in the first place, between the user and the method they use to verify their identity.

The way I think of the verification of identity, or multi-factor authentication (MFA), is that it is like showing your photo ID to pick up a building pass. You need to come into the building, where people can see you, and you need to supply a proof of identity. Then you pick up the pass, and the pass in future lets you into the building. But that depends on having a valid proof of identity in the first place. The second method (building pass) is piggy-backing on the first method (photo ID).

When setting up WHB for the first time, staff typically use the Microsoft Authenticator app on their mobile phone. But setting up the Authenticator app does not prove your identity. It only proves that you know the password. So there is a circular logic if you set up the Authenticator app at the same time as setting up WHB. The steps in this circular logic are:

  1. User starts to set up WHB on a device, by supplying a password
  2. If the account does not already have a second factor method associated with it, then the user is prompted to set it up
  3. User downloads Microsoft Authenticator app on phone
  4. User receives prompt on phone to validate their identity
  5. User sets up PIN associated with that identity.

At no time did the user prove their identity other than by supplying the password of the account. WHB does not know who owns the phone. In the future, any prompt for MFA will prove that it is the same person who set up the MFA; but not who that person really is. So the second factor (Microsoft Authenticator app on a mobile phone) must be set up in a secure way that validates the identity of the person setting it up.

This is actually quite difficult to do. When an account is first created, it does not have a second authentication factor associated with it, only a password. A vulnerability exists until the second factor is set up securely and verifiably by the owner of the account.

The physical way to do this is to set up the second factor for the account as a one-time event similar to obtaining a building pass. The member of staff comes into the office. Someone validates their identity and enables the registration of the phone as a second factor. Any pre-existing registration is deleted. Then the member of staff receives the device and sets up WHB. The logical way to do this is with a Conditional Access policy. The policy can require specific conditions to allow the user to register security information. For example, it can require this to be done from the corporate LAN. Now the steps in this logic are:

  1. User enters the building, where their identity is verified
  2. User proceeds, as before, to set up device with WHB, but this time the second factor is a phone locked to a verified identity.
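For illustration, here is a minimal sketch of such a Conditional Access policy, created with the Microsoft Graph PowerShell SDK. This is my own sketch, not a production policy: the display name is invented, and “AllTrusted” assumes the corporate LAN has been defined as a trusted named location. The key elements are the “register security information” user action and the block on untrusted locations.

    # Sketch only: requires the Microsoft.Graph module and Policy.ReadWrite.ConditionalAccess
    Connect-MgGraph -Scopes "Policy.ReadWrite.ConditionalAccess"

    $policy = @{
        displayName = "Protect security info registration"   # hypothetical name
        state       = "enabled"
        conditions  = @{
            users        = @{ includeUsers = @("All") }
            applications = @{ includeUserActions = @("urn:user:registersecurityinfo") }
            locations    = @{
                includeLocations = @("All")
                excludeLocations = @("AllTrusted")   # i.e. registration allowed only from a trusted location
            }
        }
        grantControls = @{ operator = "OR"; builtInControls = @("block") }
    }

    New-MgIdentityConditionalAccessPolicy -BodyParameter $policy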

The second flaw is that the standard configuration of WHB enables it. It does not enforce it. The user still has the option to sign in with a password. This means that anyone can sign in with only a password and gain full access to the device and the data of the user of that account. This is exactly the problem WHB was designed to solve. How did that happen? The user will be nagged to set up WHB, but they don’t have to.

The way to prevent this is to configure Conditional Access policies to require multi-factor authentication for every access, even on managed devices. You might say that is absurd. Surely the possession of a managed device is the second factor. You have the password, and you have the device. But the critical point is that the WHB PIN (not password) is what proves ownership of the device. When using the PIN, the user does not need to respond to an MFA prompt when they log on. Supplying the PIN counts as performing MFA, because it was created with MFA. The MFA is valid (by default) for 90 days and, every time you supply the PIN, you revalidate and extend the MFA.

This is just one example of what I mean about striking the right balance between security and ease of use. It is easy to enable WHB, but it takes a few extra steps to make sure it is associated with a verified identity.

AppLocker or WDAC?

This is a short piece on the question of whether to use AppLocker or Windows Defender Application Control (WDAC) for application control on a Windows desktop. As technicians, we can sometimes get too interested in what technology is best, or what is newest. But the more important matter is what best meets the requirement.

WDAC is the newer technology, and a significant advance on AppLocker. You can read about the differences here: Overview. So, in a Microsoft environment (Windows 10/11 desktop, 365 Apps, Intune, SharePoint etc.) we should assume we would use WDAC unless there are reasons not to. What could those reasons be?

Cyber security is important, of course. But it needs to be a part of a productive work environment. The most secure desktop is one that cannot be used. And it needs to be part of a holistic approach. For example, if we do not allow a user to have local administrator privileges on a device, the exposure to malware is much lower than if we do. If we require MFA to log on to a device, the risk of a malicious user is much lower than if we do not.

In my view, application control should be transparent to the user. Software that is legitimate should just run. Software that is illegitimate should not run, with a message about the reason. If a new piece of software is introduced, it should either just run, or not run. There should not be a long delay while IT staff rejig the rules to allow it to run. An example would be a piece of finance software. Let’s say we are coming up for year-end, and the finance team have an update to one of the applications they use. They should be able to install it, and it should run. It should not take a month to develop and test application control rules.

AppLocker is much easier and less risky to update than WDAC. AppLocker XML files are simple text files that you can edit manually. WDAC XML files are also text files, but it is not practical to edit them manually. AppLocker uses the Subject Name of a certificate to identify a signed file: it is the same subject name regardless of the certificate used to sign. WDAC uses the thumbprint, and the same name might be used in multiple different certificates with different thumbprints. A mistake in an AppLocker policy might cause some processes not to run. A mistake in a WDAC policy might cause Windows not to boot. If it cannot boot, the only solution is to re-image the device. Imagine doing that for 30,000 or 50,000 devices!

I think the right approach is to use WDAC, but with a process in place to make it relatively quick and safe to update. What is this approach?

  1. Use file path rules so that most administratively installed applications are allowed anyway
  2. Use “snippets” to extend the existing policies (snippets are policies created from a single application, and merged with the main policy)
  3. Use Supplemental policies for discrete areas of the business e.g. finance, or Assistive Technology, applications
  4. Use the WDAC Wizard for creating the base policy and applying updates
  5. Maintain a strict workflow for testing and deploying a policy update.

Let’s say you have a new application and it is blocked by current WDAC policy. There are several ways you could update the policy:

  • Scan the whole device and create a new policy. But this creates a significant risk of introducing new faults.
  • Read the event log or the Microsoft Defender audit of AppControl events to create rules for what was blocked. But this will only catch the first file that was blocked, not subsequent files that would have been blocked if that file had been allowed.
  • Scan the application itself, to create a policy that allows just that one application, then add this to the existing policy.

My preferred workflow is this:

  • Understand where the application saves all files including temp files and installation files
  • Copy all of them to a temp folder
  • Look to see whether the exe and dll files are signed or not. If they are, you will be able to use a Publisher rule. If they are not, see if you can install to a different location. For example, quite a few applications will allow a per-user or a per-machine install. Always use a per-machine install if you can, into a folder requiring admin rights. If you cannot, then you are going to have to use a hash rule, although this means any update of the file will invalidate the rule.
  • Scan that temp folder to create a snippet
  • Merge the snippet into the base, or create a supplemental policy
  • Apply to a selection of test devices and make sure they still boot!

You need to keep strict version control of policies and snippets. To achieve this, you should update the policy ID. Policies have several identifiers. The file name itself is irrelevant: when you import a policy into Windows, the installed file is named with the policy GUID. The “Name” and “Id” (visible in the policy) are also just labels. The “BasePolicyID” and “PolicyID” are the two GUIDs that Windows uses to identify the policy. When you merge two policies, or merge a policy and a snippet, these GUIDs are not changed, and you will see in the Event Log that Windows considers it to be the same policy. So, to keep track of which policy version is actually applied, you really want to update the GUID. You can do this in PowerShell with Set-CIPolicyIdInfo.
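To make that concrete, here is a minimal sketch of the snippet workflow in PowerShell, using the ConfigCI cmdlets that ship with Windows. The paths and rule levels are placeholders of my own; adjust them to your environment and test before deploying.

    # Sketch only: paths are hypothetical
    # 1. Scan the temp folder containing the application files to create a snippet
    New-CIPolicy -FilePath C:\Temp\Snippet.xml -ScanPath C:\Temp\AppFiles -Level Publisher -Fallback Hash -UserPEs

    # 2. Merge the snippet into the base policy
    Merge-CIPolicy -PolicyPaths C:\Policies\Base.xml, C:\Temp\Snippet.xml -OutputFilePath C:\Policies\Base-v2.xml

    # 3. Reset the policy GUIDs so the Event Log shows which version is applied
    Set-CIPolicyIdInfo -FilePath C:\Policies\Base-v2.xml -ResetPolicyID

    # 4. Convert the XML to the binary format for deployment to test devices
    ConvertFrom-CIPolicy -XmlFilePath C:\Policies\Base-v2.xml -BinaryFilePath C:\Policies\Base-v2.cip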

If you follow this approach, WDAC will work like a charm!

Government Commercial Problems with IT Procurement

Working in IT, I come across procurement problems frequently. The root cause, it seems to me, is that government procurement rules are implicitly designed for a steady state, whereas IT projects implement change, which is inherently imprecise. These rules need a radical overhaul. The new Procurement Bill, currently (Feb 2023) going through the House of Commons, aims to do this.

Problems

What sort of problems? 1) Long delays. A procurement that might be a simple executive decision in the private sector can be a three- or six-month exercise in the public sector. On a project, delay has a cost. This cost often outweighs the potential benefit of the procurement process. 2) Inflexibility as requirements evolve. Sometimes you don’t know exactly what you need until you talk to suppliers. But you can’t talk to suppliers without a formal procurement process.

I cannot give specific cases, for reasons of client confidentiality. But I can highlight the areas of the procurement rules that create these problems. The intention of the public procurement policy is clear and legitimate: to achieve “the best mix of quality and effectiveness for the least outlay over the period of use of the goods or services bought”. The question is whether the rules do this in practice.

I must say at the outset, these thoughts are from a “user” perspective. I have no great knowledge of the procurement rules, only my experience in performing procurements as part of an IT project. The amount of regulation and guidance applying to procurement is vast, and I don’t know how anyone could master it. The scope is vast too: hundreds of billions of pounds of contracts, of every conceivable type, ranging in value from £10,000 to billions. I don’t believe it is realistic to try to codify the rules for this vast enterprise, but that is what the policy does.

Long delays

I led a piece of work to implement a small piece of software that integrated two different systems. There are four products that do this. It is quite a niche area, with not much published information. The value of the purchase would be small, in relation to the two systems being integrated. The products are priced by volume of usage, with annual subscriptions. There were various technical complications about integrating with the two specific systems in our case.

The obvious thing to do was to buy a few licences and learn on the job. We were not allowed to do this. The rules said that no purchase of any kind could be made without a selection process, in this case to decide which ones to trial. The public information was not sufficient to justify the selection of a single product to trial. The next obvious thing was to talk to vendors. We were strictly not allowed to do this. Talking informally to any vendor would prejudice a fair selection.

So we developed our selection criteria as best we could (based on what we could glean from the published information), and then carried out a systematic trial of all four products sequentially. The trial involved actually implementing all four products, and asking staff to evaluate their experience when using them. The experience was almost identical, as we expected.

Some of our important selection criteria were technical, for example compliance with security requirements, and licensing terms. For these, we had to ask the vendors to respond to an RFP. As you can imagine, the responses were inadequate to provide any assurance, without speaking further to the vendors.

After going through the selection process, amazingly, we had not actually completed the procurement. All the vendors sold licences through resellers, as you would expect. So, after the selection, we needed to pick a reseller. You’ve guessed it! We needed a procurement to pick a reseller to sell us the licences for the product we had selected. Fortunately, we were able to use the Crown Commercial Services framework to ask for quotes.

The end result was that we purchased a few licences for the product we expected to pick at the beginning, but many months later and at considerably greater cost than the cost of the licences.

The basic problem here is that we do not live in a world of perfect information. At the outset, we cannot know all the ins and outs of different products. Vendors design their product information to highlight advantages and hide weaknesses. Vendors do not publish real prices. Vendors do not respond to RFPs with full and honest answers to questions.

Think of it from the vendor’s point of view. Some government department wants to make a small purchase. The department invents a long and complicated process and invites them to participate. What should they do? Obviously, just send them the data sheet and the price list. Why would they go to the effort and expense of responding when the total profit if they won would be less than the cost of responding?

Inflexibility

I led a project to upgrade the technology of an existing system, the purpose of which was to enable integration with another system. Sorry if that is a bit obscure: the reason is confidentiality.

The original system was contracted for before the integration even existed. We were not allowed simply to ask our new network supplier to provide the integration built into their product. The service was not in the scope of their new contract, because no-one at the time knew we would need it. Adding it would have required a completely fresh procurement of the primary product, which would have taken at least a year.

In this case we were allowed to vary the existing contract. The rules on variation are highly complex: they require a good understanding of Annex A – Regulations 72 and 73 of the Guidance on Amendments to Contracts 2016. We were allowed to vary the contract, but only provided the variation used different technology to do the same thing.

This gave us a few big challenges to negotiate. One, we needed a new type of support for the new technology not provided in the original contract. Two, we needed a third party (at additional cost) to provide a service to assist in the integration.

After something like a year we had completed the integration. At this point there was less than a year to run on the existing contract. But we could not extend the contract. The rules on extension are especially severe: they are one of the “red lines” for IT procurement. So the next stage had to be a full procurement of the whole service, having just completed the transformation of the previous service.

The basic problem here is that we don’t live in a world of isolated products and services. They are all inter-related in some way. It is not possible to have perfect foreknowledge of all the ways the services might need to change in the future.

Observations

I have a few observations.

  1. Procurement rules do not take account of the cost of complying, in relation to the value obtained.
  2. They assume the availability of adequate market information to make perfect choices without speaking to vendors.
  3. They also assume vendors can and will respond with accurate detailed information about what they offer.
  4. They do not take sufficient account of the relationships with other products and services, and the way these all evolve over time.
  5. It is simply not possible to comply with the rules intelligently without having a large and skilled Commercial department.
  6. A Commercial department cannot have full knowledge of the product or service being procured and, therefore, there will be extensive delays or bad choices made.
  7. Delay is built in to the system, and the cost of delay is not accounted for.
  8. The cost and delay of procurement means that people are incentivised to wrap up products and services into large contracts that preclude innovation and competition – the exact opposite of what is intended.

Procurement Bill

The original Public Contracts Regulations 2015 stemmed directly from the EU Public Contracts Directive. The intention was to make contracts open across Europe.

But the idea that you can regulate all procurement across all of Europe with a value of more than £138,760 (Jan 2022 threshold) seems unrealistic. Let’s say you have an organisation of 10,000 staff. Let’s say a contract might run for 5 years (printing, laptops, software etc.). The threshold means that any contract worth about £3 per member of staff per year must be subject to a full, open, procurement. Let’s say the vendor profit on the procurement is 20%, or £27,752. The procurement process will cost more than that!

The explicit aim of the current Public Procurement Policy is to obtain value for money. But people don’t need rules to enable them to obtain value for money when buying a holiday, or a car, or the weekly shopping. People will do this for themselves. What the public needs is rules to prevent corruption. Anything that knowingly does not obtain value for money is corrupt. The new Procurement Bill says it aims to do this: “Integrity must sit at the heart of the process. It means there must be good management, prevention of misconduct, and control in order to prevent fraud and corruption.”

I will leave it to others to describe the changes in the new bill. But it is interesting to consider how it might affect the two cases I mentioned.

  • A below-threshold contract is one worth more than £12,000 and less than (I think) £138,760
  • For a below-threshold contract, the contracting authority “may not restrict the submission of tenders by reference to an assessment of a supplier’s suitability to perform the contract [including technical ability]”. I take that to mean that all procurements must be open to all potential suppliers and not shortlisted. That is admirable, and I see no difficulty in making all these tenders public. But for obscure and specialised requirements the result is likely to be a deluge of irrelevant tenders and/or no valid submissions at all.
  • This does not apply to frameworks, so the best way to procure anything below-threshold will always be through a framework. But frameworks can only sell commodities. They can’t sell niche specialised products.
  • Modifying an existing contract is covered in Section 74 and Schedule 8. I think a contract extension is limited to 10% of the term, i.e. 6 months of a five year contract. This is still not enough where a change of circumstances occurs during the contract.
  • The provision for additional goods, services or works during a contract seems less restrictive than before. “A modification is a permitted modification if (a) the modification provides for the supply of goods, services or works in addition to the goods, services or works already provided for in the contract, (b) using a different supplier would result in the supply of goods, services or works that are different from, or incompatible with, those already provided for in the contract, (c) the contracting authority considers that the difference or incompatibility would result in (i) disproportionate technical difficulties in operation or maintenance or other significant inconvenience, and (ii) the substantial duplication of costs for the authority, and (d) the modification would not increase the estimated value of the contract by more than 50 per cent.” That seems to be a lot more flexible than before.

The scope of government contracts, even just IT contracts, is vast and I don’t know how it is possible to codify the rules governing them except by introducing a great deal of bureaucracy and expense.

Curiously, the word “integrity”, despite being one of the bill’s objectives, only occurs once in the bill, other than in the statement of the objective. It occurs in the context of the supplier’s integrity. But, when a private sector organisation contracts with a vendor, the organisation is relying on the integrity of the staff, not the vendor. If the staff act with integrity, the organisation is confident the best choice will be made.

Speaking for an SME, I’m glad the bill has provisions to make it easier for small businesses to obtain contracts from government. But I have difficulty seeing how that will work in practice. Bidding is an expensive process. The way a small business manages the cost of bidding is to screen the opportunities for a competitive advantage. This might be having a good reputation with previous clients, or offering a high quality of service, or having strong skills in a particular area. These are intangibles that are screened out in a bureaucratic tendering process.

Autopilot and Intune Faults

“When sorrows come, they come not single spies, but in battalions.” We are deploying thousands of devices with Autopilot and Intune, and the service faults come in battalions.

We have been tracking these faults for a while. There are two types:

  1. Microsoft identifies a fault with a service announcement
  2. We raise a ticket, but no cause is found for the fault, and there is no service announcement.

In mid-May, account setup failed to complete on pre-provisioned devices. The setup just hung. No cause found.

There was a service incident at the same time (now rolled over in the logs). Users unable to use Autopilot. Different problem, but possibly related.

Application failed to unzip after downloading. No cause found.

Application failed to download from Intune, with “endpoint failed to respond.” No cause found.

Late June, Autopilot failed at the beginning, before entering ESP. Error is 80072ee2. DNS query failed for “enterpriseregistration.microsoft.com”. Network timeout trying to register the device at DRS. No cause found.

From 21 June to 7 July, incident IT396955 “Users’ devices may have incorrectly appeared as non-compliant after Autopilot pre-provisioning in Microsoft Intune”. We don’t allow non-compliant devices to connect, so this caused a complete failure. Root cause: “A recent fix for an unrelated issue.” Although the incident dates from 21 June, it was only identified as an incident on 4 July.

On 21 July, incident IT402961 “Users and admins may have been unable to access the Microsoft Intune service or see limited functionality.” Root cause: “a network gateway outage.”

The facts show that the Autopilot service, with Intune, is fundamentally unreliable. If it were Intune alone, users would experience a failure of policy updates, or application deployments. But, during Autopilot, the result is a failed deployment.

At present, I recommend not using Autopilot to deploy devices, for the next year or so. It is too unreliable. My guess is that an internal service agreement has the wrong incentives.

Intune, WDAC and Managed Installer

WDAC has an option (Option 13) to allow apps installed by a Managed Installer. This sounds great! Everything you install using your preferred installer would be allowed, without going to the trouble of creating rules. But there’s a snag. There is no Configuration Service Provider (CSP) to deliver this policy in Intune.

The Managed Installer option actually uses the same method to allow executables to run as the Intelligent Security Graph option (Option 14). When a file is authorised by one of these methods, an extended attribute is written to the file. You can see this attribute with the fsutil utility. The method is documented here: Automatically allow apps deployed by a managed installer with Windows Defender Application Control.
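For example, you can check whether a file has been tagged with fsutil (the path here is just an illustration):

    # Query the extended attributes on a file laid down by the managed installer
    fsutil file queryEA "C:\Program Files\SomeApp\SomeApp.exe"

A tagged file shows an extended attribute named $KERNEL.SMARTLOCKER.ORIGINCLAIM.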

The documentation on Managed Installer is a little confusing. The main documentation shows a policy that allows the Intune Management Extension, as well as the SCCM extension.

<FilePublisherRule Id="55932f09-04b8-44ec-8e2d-3fc736500c56" Name="MICROSOFT.MANAGEMENT.SERVICES.INTUNEWINDOWSAGENT.EXE version 1.39.200.2 or greater in MICROSOFT® INTUNE™ from O=MICROSOFT CORPORATION, L=REDMOND, S=WASHINGTON, C=US" Description="" UserOrGroupSid="S-1-1-0" Action="Allow">
  <Conditions>
    <FilePublisherCondition PublisherName="O=MICROSOFT CORPORATION, L=REDMOND, S=WASHINGTON, C=US" ProductName="*" BinaryName="MICROSOFT.MANAGEMENT.SERVICES.INTUNEWINDOWSAGENT.EXE">
      <BinaryVersionRange LowSection="1.39.200.2" HighSection="*" />
    </FilePublisherCondition>
  </Conditions>
</FilePublisherRule>

So, looking at that, we would obviously be able to allow Intune apps in Intune, right? But we cannot. The reason is that the documentation also describes implementing this policy in a GPO. But in Intune we cannot use GPOs and, instead, we use Configuration Service Providers (CSPs). The Managed Installer option is implemented as an AppLocker policy, and the AppLocker CSP does not contain a section for the Managed Installer rule collection type.

Although we cannot implement this as an Intune policy (because there is no CSP), we could theoretically implement it another way. Take a registry key, for example: even if there were no CSP to configure it, we could simply add, change or delete it in a script. The same applies to AppLocker policies, because we can use PowerShell to create a policy from an XML file, using Set-AppLockerPolicy. So the solution is to deliver a custom AppLocker policy with PowerShell, to enable the Intune agent as a Managed Installer in WDAC.
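A minimal sketch of that delivery script, assuming the XML file contains the ManagedInstaller rule collection shown above (the path is a placeholder), might look like this:

    # Sketch only: merge the managed installer rules into the local AppLocker policy
    Set-AppLockerPolicy -XmlPolicy "C:\Policies\ManagedInstaller.xml" -Merge

    # Configure the AppID services for managed installer tracking
    # (documented in the managed installer article referenced above)
    appidtel.exe start -mionly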

There are two significant drawbacks:

  1. The effort and constraints in managing the policies manually through PowerShell. For example, there is no Remove cmdlet for a policy in PowerShell.
  2. Managed Installer tags the installed files, but not automatic updates. To allow the updates, you would either have to reinstall, or apply rules to allow the updated files, which would defeat the purpose.

Autopilot Faults and Logs

This is a post about where to look to find the cause when Autopilot fails.

By “Autopilot”, I am referring to the whole process of deploying, enrolling and setting up a Windows device. The process really contains several distinct parts:

  • The Out of Box Experience (OOBE) like selecting language, region and keyboard
  • Enrolment in Intune and joining the Azure AD domain (or hybrid)
  • Implementing all the policies and apps assigned to the user or device by Intune
  • The Enrollment Status Page (ESP), which monitors and controls the process nearly from beginning to end.

But I am using the term “Autopilot” to refer to all these, for convenience.

We can distinguish two types of failure. One, when setting the process up and testing it to see if it works. Another, during deployments when everything is supposed to be working. This post is about the second. For the first, you can generally follow the guides for setting up Autopilot and ESP, and search the documentation if it is not working. For the second, you need a good understanding of how the process works, what happens when it goes wrong, and where to look to find the cause.

Here is the best end-to-end diagram of the process: Windows Autopilot deployment process. And here is the page that best describes what happens in each phase of ESP: Enrollment Status Page tracking information. It is worth studying these in detail.

Getting the logs for Autopilot is straightforward. From a command prompt, run:

mdmdiagnosticstool.exe -area "DeviceEnrollment;DeviceProvisioning;Autopilot;Tpm" -cab C:\Temp\Autopilot.cab

You will need to run elevated to get the TPM diagnostics. You will also need to make sure that whoever runs the command is able to save in the location specified. If you are asking a standard user to run the diagnostics, you can use Settings > Accounts > Access work or school > Export your management log files.

Michael Niehaus has written scripts to provide a quick interpretation of the diagnostics logs: Get-AutopilotDiagnostics. Running this script against the cab file is the first place to start.
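Usage is a couple of lines. I believe the parameters look like this, but check the script’s help for the exact names:

    # Install the script from the PowerShell Gallery, then run it against the collected cab
    Install-Script -Name Get-AutopilotDiagnostics
    Get-AutopilotDiagnostics.ps1 -CABFile C:\Temp\Autopilot.cab -Online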

There are a few points to note about the diagnostics:

  • You need to run it as soon as possible after the problem occurs. Several of the event logs contained in the diagnostics collection roll over, so events will be lost if you only run it later.
  • If a fault is reproducible, you can add other tools like netsh, by breaking in with Shift+Fn10 before the fault occurs.
  • The diagnostic logging only captures events relating to Autopilot and enrolment. If you want a wider selection of logs, you may want to run One Data Collector at the same time. It only takes a few minutes to run.
  • Running Get-AutopilotDiagnostics with the -Online parameter will fetch the application name to match the ID, saving you a lot of time trawling logs to find the name of a failed app ID.

If the failure occurred before the ESP started, then this is Autopilot proper. The place to look is in the event log: Applications and Services logs > Microsoft > Windows > ModernDeployment-Diagnostics-Provider > Autopilot.

When the ESP starts, we can find the place where it fails in the registry, under HKLM\Software\Microsoft\Provisioning\AutopilotSettings. Each sub-category has a status of notStarted, inProgress, succeeded or failed. It would be great if these could be surfaced somewhere more accessible. As it is, they can be found in the diagnostics log file MdmDiagReport_RegistryDump.reg.
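On a live device, you can also dump the statuses directly from a Shift+Fn10 console:

    # Recursively list the ESP category statuses
    reg query HKLM\Software\Microsoft\Provisioning\AutopilotSettings /s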

For example, see this obscure fault: Error code 0x80180014 when re-enrolling using self-deployment or pre-provisioning mode. The documentation is not correct. Normally, when you do a “Fresh Start” or a “Wipe”, the object in Intune is soft-deleted. But, occasionally, this back-end process fails. As a result, when you do a reset, it will fail in Device Preparation, at the “Registering your device for mobile management” stage i.e. enrolling in Intune. The status is recorded at: HKLM\Software\Microsoft\Provisioning\AutopilotSettings\DevicePreparationStatus.Category.

This happens because the Intune object has not been deleted as it should be. The solution is to find the object and delete it manually. Knowing the stage it failed at enables you to investigate why this happened.

It can be useful to know exactly when the fault occurred. This helps us to correlate different logs. Bizarrely, the time is shown in the event log: Applications and Services logs > Microsoft > Windows > Shell-Core > Operational. Use the “Find” action to search for the word “failed”. This will show the CloudExperienceHost Web App Event with “Subcategory ID = DevicePreparation.MdmEnrollmentSubcategory; state = failed.” When you know the time, you can search other logs to see if anything distinctive happened at exactly that time.
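If you prefer PowerShell to clicking through Event Viewer, something like this finds the same events (matching on the word “failed”, as above):

    # List recent failure events from the Shell-Core log, newest first
    Get-WinEvent -LogName "Microsoft-Windows-Shell-Core/Operational" |
        Where-Object { $_.Message -match "failed" } |
        Select-Object TimeCreated, Id, Message -First 5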

As I said at the beginning, the purpose of this post is to describe how to investigate unexpected faults. Once you know where and when in the process the fault occurred, you can follow the trail to diagnose it. If you know the type of fault, you may well need to reproduce it with additional logging to find the cause.

Autopilot Faults and the Network

The list of network requirements for Intune and Autopilot is extensive. This post is about finding out if the client cannot connect to one or more of the required endpoints during Autopilot.

Microsoft publishes the list of required endpoints for Autopilot and Intune. There is no point in repeating the information here. The important points to note are:

  • The documents also contain references to secondary services. For example, Autopilot requires access to Windows Activation, Azure AD (for authentication), the NTP time service and others
  • In an enterprise, HTTP and HTTPS traffic will be intercepted by a proxy. Proxy servers may inspect the traffic, and require authentication. All this traffic needs to bypass the proxy; or bypass inspection and authentication. This is not easy to do, and it is easy to make mistakes.
  • The lists often use URLs with wildcards, for example *.microsoftaik.azure.net for TPM attestation.

As a result, you may need to troubleshoot things that should be working, but are not. You need to be able to find out what is blocked, or not connecting.

The simplest way, which everyone will be familiar with, is to use Wireshark. You can install Wireshark at the beginning of Autopilot, and use Alt+Tab to get back to it after the problem has occurred (see Autopilot and Shift+Fn10).

You may think you are looking for a dropped connection. But the place to look for that is on the firewall. In reality the packet capture will require a lot more work to analyse than that. Almost all the traffic is encrypted HTTPS. You cannot see the contents of the transactions. What you will see (roughly) is:

  • A DNS query to resolve one of the required endpoints
  • A DNS response resolving the query to an IP address corresponding to a traffic manager, load balancer, CDN
  • TCP SYN
  • TCP ACK
  • Client Hello
  • Server Hello, Certificate, Server Key Exchange, Server Hello Done
  • Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message
  • Application Data

Azure Traffic Manager is a DNS service that distributes client traffic to the closest available Azure endpoint. If the client makes a DNS query for (for example) fe2cr.update.microsoft.com, it will obtain the IP address of an endpoint corresponding to fe2cr.update.msft.com.trafficmanager.net.

This means you can’t expect to see traffic to and from the endpoints contained in the lists of endpoints. A query for account.live.com will result in traffic to and from l-0013.l-msedge.net.

If you use a display filter of “dns”, you will see all the queries made. You can check on the proxy that all those names bypass inspection and authentication. If they do not, you may expect to see:

  • No application data after a handshake
  • A TCP FIN followed by another attempt to connect with a new TCP SYN.
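One way to build the checklist of names to verify on the proxy is to extract every DNS query from the capture. A sketch using tshark (installed alongside Wireshark; the file path is a placeholder):

    # List every unique DNS query name seen during Autopilot
    & "C:\Program Files\Wireshark\tshark.exe" -r C:\Temp\autopilot.pcapng -Y dns -T fields -e dns.qry.name |
        Sort-Object -Unique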

As the Autopilot traffic is nearly all HTTPS, you could also use Fiddler to capture the exchange between client and server. When you install Fiddler, you choose to accept the Fiddler certificate. This means that the traffic can be proxied and decrypted by Fiddler. As a result, you should be able to see the content of the transactions.

I recently had a case where Autopilot failed at the start of the Account Setup phase. As it happens, I could see in the Microsoft-Windows-Shell-Core/Operational event log that an “AAD user token request failed”, but not why it failed. I wanted to do a packet capture during the time this request failed. But you cannot break into Autopilot during the Account Setup phase and, even if you could, it would be too late to start a capture. Ideally you want to start a capture at a stopped point in the process, so that you have time to set it up before continuing.

This was a case where the Network Shell utility (netsh.exe) can help. Netsh can record a trace file (*.etl) using different providers. Most importantly, it is not limited to packet capture. It records the activity of each provider specified in the command. Providers are bundled up in “scenarios” for different types of network problem. The “InternetClient” scenario, for example, contains the following providers that we may be interested in:

  • Web-Http
  • WinInet
  • WebIO
  • WinHttp
  • DNS-Client
  • TCPIP.

A unique advantage of Netsh is the “Persistent” parameter. This enables the trace to continue through a restart. By making it persistent, we can start a trace in the Device Setup phase, continue through a restart into the Account Setup phase, and then stop the trace when the fault has occurred.

A difficulty to overcome is that the trace needs to be started and stopped in the same security context. In Device Setup, the console user is DefaultUser0 with admin rights. But this account does not persist into Account Setup.

To get around this, we need to use PsExec. This is the SysInternals utility that enables me (among many other things) to run a command in the system context. If I start the trace in the system context, then I can do the same to stop it. So the full plan of attack is:

  • Shift+Fn10.
  • In the console, download PsExec or copy it from a USB stick.
  • Run: “psexec.exe -s -i cmd.exe”. This brings up a new console running in the System context.
  • In the new console, run: “netsh trace start capture=yes report=yes persistent=yes traceFile=[path to trace.etl] scenario=InternetClient”. Depending on the circumstances, you can use maxSize to limit the size of the trace file. By default, the logging is circular.
  • Continue through to when the fault occurs.
  • In this case, because Autopilot was failing in Account Setup, I needed to use the “Continue” option in the Enrollment Status Page (ESP) profile. That let me finish and get back to a user session, where I could run PsExec again and stop the trace with “netsh trace stop”.

A significant disadvantage of Netsh is that the best way to read the ETL file is with Microsoft Message Analyzer. But this is now obsolete and no longer available to download. There are utilities to convert the ETL to pcap, so it can be read in Wireshark, but that loses the benefit of the network provider logging, like DNS-Client and WinInet. I keep a copy of Microsoft Message Analyzer just for this.

So, this post has given three tools and methods to investigate a network fault causing Autopilot to fail. You will need to adapt the methods to the particular circumstances of the case.

Autopilot Faults and Shift+Fn10

Most people probably know that you can break into Autopilot by pressing Shift+Fn10 to bring up a command prompt. Here are some aspects that people may be less aware of.

The Command Prompt runs in the security context of DefaultUser0. This is a local admin account used for deployment. The account and the profile are supposed to be deleted at the end of Autopilot, but they are not. You can still see the profile in C:\Users.

Obviously, you can run any command line actions. But you can also bring up GUI applications:

  • “explorer” to open Windows Explorer. From there, you can navigate to run other GUI applications.
  • “control panel” to open the Control Panel (or “appwiz.cpl” to bring up Programs and Features)
  • “compmgmt” to bring up Computer Management, for Services, Device Manager, Disk Manager, Event Viewer, Scheduled Tasks, Users and Groups
  • “eventvwr” for Event Viewer
  • “taskmgr” for Task Manager
  • “start ms-settings:” to bring up the Settings app.

You can run “powershell.exe” to run PowerShell cmdlets in the Cmd console. Or you can use Explorer to navigate to PowerShell and run it from there.

You can open the Edge Chromium browser to download or upload stuff.

You can use Shift+Fn10 at the beginning, to create a configuration before the main Enrollment Status Page (ESP) phases begin. You can use it when Autopilot fails in Device Preparation or Device Setup. You can use it any time in between. But you cannot use it in the Account Setup phase because, in this phase, the session runs in the user security context and not DefaultUser0.

Some of the things you may want to do at the beginning:

  • Increase the size of the Microsoft-Windows-DeviceManagement-Enterprise-Diagnostics-Provider/Operational event log, from the default 1028 KB
  • Install Wireshark (but, if you do this, you need to make sure the VC++ dependency does not interfere with any VC++ app deployment)
  • Create a C:\Temp folder to save logs and exports.
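For example, the first and last of those can be done from the console like this (the new log size is an arbitrary choice of mine):

    # Increase the DeviceManagement diagnostics log to 50 MB so events are not lost
    wevtutil sl "Microsoft-Windows-DeviceManagement-Enterprise-Diagnostics-Provider/Operational" /ms:52428800

    # Create a folder for logs and exports
    mkdir C:\Temp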

After you open the Command Prompt, if you navigate back to the ESP, the console will be hidden again. But it is still running. Press Alt+Tab to open it again, or to select any of the GUI applications that are still running. For example, if you started Wireshark at the beginning, then you can go back to the same Wireshark at any time with Alt+Tab.

Troubleshooting Faults in Autopilot

I have experienced a lot of faults in the Autopilot service recently. Autopilot is the Microsoft service that configures a Windows desktop out of the box (OOBE), when it first boots up.

If you google for Autopilot Failure, or Enrollment Status Page (ESP) Failure, you will find a lot of Microsoft documentation, and secondary documentation expanding on similar faults. There’s a very long list of resolved issues, and a list of continuing known issues. To be fair, there are many variables, and many possible scenarios leading to different types of failure: virtual machine; TPM version; user-driven vs. self-deploying; hybrid AAD join. The most obvious cause of a failure is that you might actually have something wrong with the network, or with an application deployment.

But this post is not about these failures. It is about what to do if the service is failing and it does not seem to be caused by one of these documented faults. We have experienced quite a number of these. The fact is, the troubleshooting steps documented by Microsoft are based on known and expected faults, and not on unexpected faults; and certainly not on faults in the Microsoft infrastructure supporting the Autopilot service.

I think I will probably break this post out into separate snippets on how to perform various troubleshooting tasks. Otherwise it will be too cluttered. But let me first illustrate what I mean. If you go to the Intune Tenant Administration page, you will see a tab for the “Service health and message center”. The incidents here roll over very fast, but you will see a continuous flow of faults. We’ve found that only a percentage of faults are acknowledged in this way. But you can get an idea of what is happening from the explanations.

“Admins may have seen provisioning failures when attempting to create new Cloud PCs within Windows 365.”

  • Start time: Wednesday, January 26, 2022, at 12:00 PM UTC; end time: Friday, April 1, 2022, at 11:59 PM UTC [over two months later]
  • Root cause: An Azure Host Operating System change resulted in provisioning failures when admins attempted to create new Cloud PCs within Windows 365.

If you are lucky, when a fault occurs, you will find a known service health incident. But how many faults in infrastructure and code are actually acknowledged? Based on my experience it is less than half. We had a total outage of the Autopilot service from 14 Feb to 28 Feb. There was no service health message. But we received an explanation from an internal MS channel to tell us that a code update had gone wrong. I think the reason for this disparity is that Intune and Autopilot are not really homogeneous services. They are supposed to be. But it seems clear that they are really a collection of instances of services. The incidents often refer to “customers on the affected infrastructure”. How do you know if you are on the affected infrastructure? You don’t. How does Microsoft know if the affected infrastructure is working or not? It doesn’t. Microsoft incidents refer to telemetry. But telemetry will only tell you what the metrics are measuring. This does not seem to include actual successful completion of transactions. We have often asked Microsoft Premier Support to reproduce faults for themselves, instead of asking us to repeatedly collect logs for them. They can’t.

So, let’s get into the elements of how to find out what is going on.

  1. Shift+Fn10
  2. Network
  3. Logs

Basic Flaws in Intune Package for Microsoft 365 Apps

Intune (or Microsoft Endpoint Manager) has a built-in package for Microsoft 365 Apps. This is the obvious package to use to deploy Office 365. But there are several fatal flaws.

Intune deploys the Microsoft 365 Apps as a policy. This sounds strange, but it is deployed by the Office CSP. What happens is that this policy installs two small Click to Run MSIs, which then pull down the Microsoft 365 Apps like Word, Excel, Outlook etc.

Back in about 2018, Microsoft chained the Teams installation to Office 365. Consequently, you can select Teams as one of the suite of apps to install. What happens is that, sometime after the other apps have been installed, the Teams MSI runs to install the Teams Installer. The Teams Installer is the component that detects a user logon and installs Teams itself for the user, in the user’s %AppData% folder. Teams is not actually installed as part of the Office Click to Run. It is a separate MSI chained to the Office setup.

Because the Microsoft 365 Apps are installed by a policy, the installation is performed by the OMA-DM policy agent, and not by the Intune Management Extension (IME). The IME is the agent that is used to install all Win32 apps. Policies are applied as soon as Device Setup starts, so before even the IME agent (itself an MSI) has been installed, and before the Win32 apps to be installed have been evaluated. Because the Office Click to Run MSIs are small, they are installed before other app processing has started.

Microsoft warns that you should not mix Win32 apps and Line of Business (LOB) apps in an Autopilot deployment. The wording in Microsoft docs is particularly poor, because “LOB” can mean different things. I have spoken to several Premier Support engineers who clearly don’t know what it is supposed to mean. In this case it refers to the single MSI packages that were used before Microsoft introduced the IME agent. Before the IME agent, you could install applications but only if they used a single MSI file (no folders, no transform). This installation of a LOB MSI is performed by the OMA-DM agent.

As an example of poor documentation, this page Add a Windows line-of-business app to Microsoft Intune says the following: “When deploying Win32 apps using an installation file with the .msi extension (packaged in an .intunewin file using the Content Prep Tool), consider using Intune Management Extension. If you mix the installation of Win32 apps and line-of-business apps during AutoPilot enrollment, the app installation may fail.” But it fails to say what is meant by a “line-of-business” app (and it is Autopilot, not AutoPilot).

In contrast, this page Set up the Enrollment Status Page says the following: “It’s preferable to deploy the offline-licensed Microsoft Store for Business apps. Don’t mix LOB and Win32 apps. Both LOB (MSI) and Win32 installers use TrustedInstaller, which doesn’t allow simultaneous installations. If the OMA DM agent starts an MSI installation, the Intune Management Extension plugin starts a Win32 app installation by using the same TrustedInstaller. In this situation, Win32 app installation fails and returns an ‘Another installation is in progress, please try again later’ error message. In this situation, ESP fails. Therefore, don’t mix LOB and Win32 apps in any type of Autopilot enrollment.”

The reason not to mix LOB and Win32 apps is obvious. Any MSI is actually executed by msiexec.exe, which calls the Windows Installer service. If you have ever tried to run an MSI when one is already running, you will have seen a warning that another installation is in progress, please wait.

If you have two different agents running msiexec and calling Windows Installer, they will conflict. This is exactly what happens with the Teams MSI in the Microsoft 365 Apps package. In other words, the Microsoft 365 Apps package is guilty of exactly the fault that Microsoft warns you to avoid.

In the verbose Microsoft 365 Apps log, you will see this error:

01/01/2021 01:01:01.010        OFFICECL (0x2264)        0xdc4                Click-To-Run Non Task Error        co7l2        Monitorable        C2R::TeamsAddon::Install {"MachineId": "a3b97b64ce6fff4980eaef7d16bed3bd", "SessionID": "a7e42957-b65a-45f2-93d7-fdb88842597a", "GeoID": 242, "Ver": "16.0.14430.20342", "C2RClientVer": "16.0.14430.20314", "ContextData": "{\"message\":\"InstallTeams: MsiInstallProduct failed due to install already in progress.\",\"Status\":\"1618\"}"}

You get the verbose log by breaking into Autopilot and setting this registry key: HKLM\SOFTWARE\Microsoft\ClickToRun\OverRide: LogLevel = DWORD 3.
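From a Shift+Fn10 console, that is:

    # Enable verbose Click-to-Run logging (the key and value are described above)
    reg add HKLM\SOFTWARE\Microsoft\ClickToRun\OverRide /v LogLevel /t REG_DWORD /d 3 /f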

Amazingly, the Microsoft 365 Apps setup process does not detect that the Teams installation has failed. It carries on and exits with a success code. So it will never try again. In practice, what happens is that sometimes Teams wins the race for Windows Installer, and gets installed, and sometimes it doesn’t. If it wins, then the standard retry in the IME will ensure that whichever app lost to Teams will try again and get installed. So the result is that, apparently randomly, some devices will complete Autopilot without Teams, and with no way to get it back.

But there is more. The Intune package for Microsoft 365 Apps automatically includes OneDrive. You can’t select or unselect it. OneDrive is already natively installed on Windows, but as a per-user application in the user’s %AppData% folder. The Microsoft 365 Apps package installs the 32-bit per-device version of OneDrive, and removes the per-user version. But Microsoft recommends using the 64-bit version of OneDrive on standard Windows devices. So, if you create a package to install the 64-bit version, Intune will first install the 32-bit version, then (hopefully) uninstall it and install the 64-bit version.

And there is more. Visio and Project are both available in the Microsoft 365 Apps package. But, if you select (say) Visio without the other apps (Word, Excel, Outlook etc.), then all those other apps will be uninstalled. One Microsoft 365 Apps package entirely replaces, rather than adds to, another. If you select Visio with the other apps, then the existing installation of Office will be removed, and replaced with the new one. Obviously, there are different combinations: with Visio but not Project; with Project but not Visio; with Visio and Project. It is completely unworkable. In addition, the installation of Visio or Project will force-close other Office apps, which is not nice.

And there’s more! Microsoft Surface devices come pre-installed with Microsoft 365 Apps in 13 languages. If you run the built-in Intune package, it will not uninstall the pre-installed languages. You might say this doesn’t matter. But, if you do a Fresh Start on a device, then it will not have the 13 languages, and so you will have different devices with different configurations.

Taking all this into account, we decided to build our own packages for all Office apps. Once you have a basic script to run the setup, and after you generate the required XML, it is fairly easy to adapt to the different packages you need for Office 365, Teams, OneDrive, Visio and Project. The good thing is that you are still using exactly the same core method: setup.exe with an XML file. You can do some of this by using the XML option instead of the Configuration Designer option in the Intune package, but we needed to do more, including making the Visio and Project installations interactive, so we chose to go entirely custom.
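As an illustration of how small the core method is: a custom package is essentially the Office Deployment Tool’s setup.exe plus a configuration XML. This sketch is an example, not our production configuration; the product ID, channel and language are placeholders, and it installs the 64-bit suite while excluding the chained Teams installer:

    <Configuration>
      <Add OfficeClientEdition="64" Channel="MonthlyEnterprise">
        <Product ID="O365ProPlusRetail">
          <Language ID="en-us" />
          <ExcludeApp ID="Teams" />
        </Product>
      </Add>
      <Display Level="None" AcceptEULA="TRUE" />
    </Configuration>

The package then runs, for example, setup.exe /configure install-office.xml; the same pattern, with a different XML, covers Visio and Project.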