Autopilot Faults and Shift+Fn10

Most people probably know that you can break into Autopilot by pressing Shift+F10 (Shift+Fn+F10 on laptops where the function keys need the Fn modifier, written here as Shift+Fn10) to bring up a command prompt. Here are some aspects that people may be less aware of.

The Command Prompt runs in the security context of DefaultUser0. This is a local admin account used for deployment. The account and the profile are supposed to be deleted at the end of Autopilot, but they are not. You can still see the profile in C:\Users.
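If you want to confirm the security context for yourself, the built-in whoami and net commands can be run from the same prompt. This is only a quick check and the exact output will vary by device. I would expect whoami to report the defaultuser0 account, but treat that as something to verify rather than a guarantee.

```bat
rem Show the account the Command Prompt is running as
whoami

rem List the members of the local Administrators group to check the account's rights
net localgroup Administrators
```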

Obviously, you can run any command-line action. But you can also bring up GUI applications:

  • “explorer” to open Windows Explorer. From there, you can navigate to run other GUI applications.
  • “control panel” to open the Control Panel (or “appwiz.cpl” to bring up Programs and Features)
  • “compmgmt” to bring up Computer Management, for Services, Device Manager, Disk Management, Event Viewer, Task Scheduler, and Local Users and Groups
  • “eventvwr” for Event Viewer
  • “taskmgr” for Task Manager
  • “start ms-settings:” to bring up the Settings app.

You can run “powershell.exe” to run PowerShell cmdlets in the Cmd console. Or you can use Explorer to navigate to PowerShell and run it from there.

You can open the Microsoft Edge (Chromium) browser to download or upload files.

You can use Shift+Fn10 at the beginning, to create a configuration before the main Enrollment Status Page (ESP) phases begin. You can use it when Autopilot fails in Device Preparation or Device Setup. You can use it any time in between. But you cannot use it in the Account Setup phase because, in this phase, the session runs in the user security context and not DefaultUser0.

Some of the things you may want to do at the beginning:

  • Increase the size of the Microsoft-Windows-DeviceManagement-Enterprise-Diagnostics-Provider/Operational event log, from the default 1028 KB
  • Install Wireshark (but, if you do this, you need to make sure the VC++ dependency does not interfere with any VC++ app deployment)
  • Create a C:\Temp folder to save logs and exports.
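
As a rough sketch, the first and last items above could be done from the Command Prompt like this. The 20 MB size is only an illustrative choice, not a recommended value; wevtutil takes the maximum size in bytes.

```bat
rem Create a working folder for logs and exports (skipped if it already exists)
if not exist C:\Temp mkdir C:\Temp

rem Raise the maximum size of the MDM diagnostics log to 20 MB (20 * 1024 * 1024 bytes)
wevtutil sl "Microsoft-Windows-DeviceManagement-Enterprise-Diagnostics-Provider/Operational" /ms:20971520

rem Read the log configuration back to confirm the new maxSize value
wevtutil gl "Microsoft-Windows-DeviceManagement-Enterprise-Diagnostics-Provider/Operational"
```

Later on, the same log can be saved into that folder with “wevtutil epl”, giving the log name and a target file such as C:\Temp\mdm-diagnostics.evtx (a hypothetical file name).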

After you open the Command Prompt, if you navigate back to the ESP, the console will be hidden again. But it is still running. Press Alt+Tab to bring it back, or to select any of the GUI applications that are still running. For example, if you started Wireshark at the beginning, then you can go back to the same Wireshark at any time with Alt+Tab.

Troubleshooting Faults in Autopilot

I have experienced a lot of faults in the Autopilot service recently. Autopilot is the Microsoft service that configures a Windows desktop during the Out of Box Experience (OOBE), when it first boots up.

If you google for Autopilot Failure, or Enrollment Status Page (ESP) Failure, you will find a lot of Microsoft documentation, and secondary documentation expanding on similar faults. There’s a very long list of resolved issues, and a list of continuing known issues. To be fair, there are many variables, and many possible scenarios leading to different types of failure: virtual machine; TPM version; user-driven vs. self-deploying; hybrid AAD join. The most obvious cause of a failure is that you might actually have something wrong with the network, or with an application deployment.

But this post is not about these failures. It is about what to do if the service is failing and it does not seem to be caused by one of these documented faults. We have experienced quite a number of these. The fact is, the troubleshooting steps documented by Microsoft are based on known and expected faults, and not on unexpected faults; and certainly not on faults in the Microsoft infrastructure supporting the Autopilot service.

I think I will probably break this post out into separate snippets on how to perform various troubleshooting tasks. Otherwise it will be too cluttered. But let me first illustrate what I mean. If you go to the Intune Tenant Administration page, you will see a tab for the “Service health and message center”. The incidents here roll over very fast, but you will see a continuous flow of faults. We’ve found that only a fraction of faults are acknowledged in this way. But you can get an idea of what is happening from the explanations.

“Admins may have seen provisioning failures when attempting to create new Cloud PCs within Windows 365.”

  • Start time: Wednesday, January 26, 2022, at 12:00 PM UTC
  • End time: Friday, April 1, 2022, at 11:59 PM UTC [over two months later]
  • Root cause: An Azure Host Operating System change resulted in provisioning failures when admins attempted to create new Cloud PCs within Windows 365.

If you are lucky, when a fault occurs, you will find a known service health incident. But how many faults in infrastructure and code are actually acknowledged? Based on my experience it is less than half. We had a total outage of the Autopilot service from 14 Feb to 28 Feb. There was no service health message. But we received an explanation from an internal MS channel telling us that a code update had gone wrong.

I think the reason for this disparity is that Intune and Autopilot are not really homogeneous services. They are supposed to be. But it seems clear that they are really a collection of instances of services. The incidents often refer to “customers on the affected infrastructure”. How do you know if you are on the affected infrastructure? You don’t. How does Microsoft know if the affected infrastructure is working or not? It doesn’t.

Microsoft incidents refer to telemetry. But telemetry will only tell you what the metrics are measuring. This does not seem to include actual successful completion of transactions. We have often asked Microsoft Premier Support to reproduce faults for themselves, instead of asking us to repeatedly collect logs for them. They can’t.

So, let’s get into the elements of how to find out what is going on.

  1. Shift+Fn10
  2. Network
  3. Logs

Autopilot Failure

A few weeks ago, Autopilot stopped working. Autopilot is the service that builds a Windows desktop from scratch when it first boots up, a bit like MDT. If the device hardware ID is registered in the Autopilot service then, when it starts up, it contacts Autopilot and runs the Out of Box Experience (OOBE) according to the settings in the profile.

On 14 Feb 2022, routine setup of new devices stopped working. We did not at first know why, so we spent a couple of days testing and trying different things. Then we raised a ticket with Microsoft Premier Support. The symptoms were that Store Apps were failing to install (they said “blocked by AppLocker”) and Win32 apps were failing with random errors.

Microsoft Premier Support clearly failed to understand the basics of the problem. They ran through known issues, and asked us to check basic things, and to exclude things that were failing. But Autopilot was working before 14 Feb. Why would applications that were working before 14 Feb suddenly stop working?

We spent two weeks in pointless dialog. Then on 25 Feb Autopilot started working again. We made no changes. It just started to work, as it had before 14 Feb. Premier Support offered no analysis and no explanation.

If you were in the middle of a rollout, that could be hundreds or thousands of devices that would fail. How is it even possible that Microsoft had no clue that the service had failed, no clue about why, and no clue about how to fix it?

Some time later, from a different source, we received an explanation. The issue was caused by an error in the code, where devices would sometimes randomly hit a web exception when trying to fetch the content or content info of some Win32 apps from the service. Microsoft created a hotfix and, as a result, error 0x81036502 no longer occurs.

But Premier Support did not know about the fault, did not know about the hotfix, and did not even connect our problem with the fault. This is a failure on a grand scale.