Troubleshooting Faults in Autopilot

I have experienced a lot of faults in the Autopilot service recently. Autopilot is the Microsoft service that configures a Windows desktop out of the box (OOBE), when it first boots up.

If you google for Autopilot Failure, or Enrollment Status Page (ESP) Failure, you will find a lot of Microsoft documentation, and secondary documentation expanding on similar faults. There’s a very long list of resolved issues, and a list of continuing known issues. To be fair, there are many variables, and many possible scenarios leading to different types of failure: virtual machine; TPM version; user-driven vs. self-deploying; hybrid AAD join. The most obvious cause of a failure is that you might actually have something wrong with the network, or with an application deployment.

But this post is not about these failures. It is about what to do if the service is failing and it does not seem to be caused by one of these documented faults. We have experienced quite a number of these. The fact is, the troubleshooting steps documented by Microsoft are based on known and expected faults, and not on unexpected faults; and certainly not on faults in the Microsoft infrastructure supporting the Autopilot service.

I think I will probably break this post out into separate snippets on how to perform various troubleshooting tasks. Otherwise it will be too cluttered. But let me first illustrate what I mean. If you go to the Intune Tenant Administration page, you will see a tab for the “Service health and message center”. The incidents here roll over very fast, but you will see a continuous flow of faults. We’ve found that only a percentage of faults are acknowledged in this way. But you can get an idea of what is happening from the explanations.

“Admins may have seen provisioning failures when attempting to create new Cloud PCs within Windows 365.”

  • Start time: Wednesday, January 26, 2022, at 12:00 PM UTC End time: Friday, April 1, 2022, at 11:59 PM UTC [over two months later]
  • Root cause: An Azure Host Operating System change resulted in provisioning failures when admins attempted to create new Cloud PCs within Windows 365.

If you are lucky, when a fault occurs, you will find a known service health incident. But how many faults in infrastructure and code are actually acknowledged? Based on my experience it is less than half. We had a total outage of the Autopilot service from 14 Feb to 28 Feb. There was no service health message. But we received an explanation from an internal MS channel to tell us that a code update had gone wrong. I think the reason for this disparity is that Intune and Autopilot are not really homogenous services. They are supposed to be. But it seems clear that they are really a collection of instances of services. The incidents often refer to “customers on the affected infrastructure”. How do you know if you are on the affected infrastructure? You don’t. How does Microsoft know if the affected infrastructure is working or not? It doesn’t. Microsoft incidents refer to telemetry. But telemetry will only tell you what the metrics are measuring. This does not seem to include actual successful completion of transactions.  We have often asked Microsoft Premier Support to reproduce faults for themselves, instead of asking us to repeatedly collect logs for them. They can’t.

So, let’s get into the elements of how to find out what is going on.

  1. Shift+Fn10
  2. Network
  3. Logs

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.