What APD Means for NFS Datastores
For NFS, an APD essentially means the ESXi host has lost communication on the TCP connection to the NFS server for long enough that the storage stack starts treating the datastore as unavailable. An internal APD timer starts after a few seconds of no communication on the NFS TCP stream; if this continues for roughly 140 seconds (the default value of Misc.APDTimeout), the host declares an APD timeout for that datastore.
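You can confirm the timeout value and whether APD handling is enabled directly on a host. A minimal check, assuming the default advanced option paths, looks like this:
# Show the APD timeout (140 seconds by default) and the APD handling switch
esxcli system settings advanced list -o /Misc/APDTimeout
esxcli system settings advanced list -o /Misc/APDHandlingEnable
Raising Misc.APDTimeout only changes how long the host waits before declaring the timeout; it does not address why the connection to the NFS server was lost.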
Once a datastore is in APD, VM I/O continues to retry while management operations such as browsing the datastore, mounting ISOs, or snapshot consolidation can start to fail quickly. From a vSphere client perspective the datastore may appear dimmed or inaccessible, and VMs can look hung if they rely heavily on that datastore.
How APD Shows Up in ESXi Logs
When an APD event occurs, vmkernel and vobd are the primary places to look. On ESXi the live logs sit under /var/run/log/, and the familiar /var/log/vmkernel.log and /var/log/vobd.log paths are normally symlinks to the same files, so either location works on the host itself.
The lifecycle of a single APD usually looks like this in vobd.log:
APD start, for example:
[APDCorrelator] ... [vob.storage.apd.start] Device or filesystem with identifier [8a5a1336-3d574c6d] has entered the All Paths Down state.
APD timeout, after about 140 seconds:
[APDCorrelator] ... [esx.problem.storage.apd.timeout] Device or filesystem with identifier [8a5a1336-3d574c6d] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
APD exit when the host finally sees the storage again:
[APDCorrelator] ... [esx.problem.storage.apd.recovered] Device or filesystem with identifier [8a5a1336-3d574c6d] has exited the All Paths Down state.
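To pull these events out of a live host or a log bundle, a simple grep against vobd.log using the identifiers from the messages above is usually enough; adjust the path if your logs are redirected elsewhere:
# List APD start, timeout, and recovery events with their timestamps
grep -E 'vob.storage.apd.start|storage.apd.timeout|storage.apd.recovered' /var/run/log/vobd.log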
In vmkernel.log the same event is reflected by the APD handler messages. A typical sequence is:
StorageApdHandler: 248: APD Timer started for ident [8a5a1336-3d574c6d]
StorageApdHandler: 846: APD Start for ident [8a5a1336-3d574c6d]!
StorageApdHandler: 902: APD Exit for ident [8a5a1336-3d574c6d]
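If more than one datastore was affected, it helps to filter the handler messages by the identifier of the datastore you are investigating; using the example identifier from the messages above:
# APD timer start, APD start, and APD exit for a single identifier
grep StorageApdHandler /var/run/log/vmkernel.log | grep '8a5a1336-3d574c6d'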
On NFS datastores you usually see loss‑of‑connectivity messages around the same time, such as warnings from the NFS client that it has lost connection to the server or that latency has spiked:
WARNING: NFS: NFSVolumeLatencyUpdate: NFS volume <datastore> performance has deteriorated. I/O latency increased ... Exceeded threshold 10000(us)
These messages are often the early warning before APD actually triggers.
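A quick way to check whether such warnings preceded the APD is to search vmkernel.log for the NFS client's latency and connection messages. The exact wording varies between ESXi versions, so treat these patterns as a starting point rather than an exhaustive list:
# Early-warning NFS messages in the run-up to an APD
grep -E 'NFSVolumeLatencyUpdate|Lost connection to the server' /var/run/log/vmkernel.log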
NFSv4‑Specific Errors in vmkernel and vobd
With NFSv4.1 the client maintains stateful sessions and slot tables, so problems are not always simple timeouts. ESXi may log warnings such as NFS41SessionSlotUnassign when the number of available session slots drops too low; under heavy load this can lead to session resets and, if the session cannot be re-established cleanly, eventually to APD on that datastore.
Another category of issue is NFSv4 errors such as NFS4ERR_SHARE_DENIED, which show up when an OPEN call conflicts with an existing share reservation on the same file. These errors do not in themselves mean APD, but they often appear in the same time window, when applications are competing for locks or when the NFS server is under stress and struggling with state management; the end result can be perceived as I/O hangs on the ESXi side.
When reviewing logs, it is useful to separate pure connectivity problems (socket resets, RPC timeouts) from v4‑specific state problems (session slot issues, share or lock errors). The former almost always have a clear APD signature in vobd; the latter may manifest as intermittent stalls or file‑level errors without a full datastore APD.
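One way to make that separation concrete is to run two passes over vmkernel.log, one for connectivity-style errors and one for NFSv4.1 state errors. The patterns below are only rough buckets built from the messages discussed in this section:
# Connectivity-type problems: APD handler activity, RPC timeouts, resets
grep -iE 'StorageApdHandler|RPC|reset' /var/run/log/vmkernel.log
# NFSv4.1 state problems: session slots, share and lock conflicts
grep -E 'NFS41SessionSlotUnassign|NFS4ERR_SHARE_DENIED' /var/run/log/vmkernel.log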
What to Look For on the NFS Server
Once you have the APD start and exit timestamps from ESXi, the next step is to line those up with the storage array or NFS server logs. On an ONTAP‑style array, for example, APD windows on the ESXi side often correspond to connection reset entries such as:
kernel: Nblade.nfsConnResetAndClose:error]: Shutting down connection with the client ... network data protocol is NFS ... client IP address:port is x.x.x.x:yyyy ... reason is CSM error - Maximum number of rewind attempts has been exceeded
This type of message indicates that the NFS server terminated the TCP session to the ESXi host, typically due to internal error handling or congestion. If the server is busy or recovering from a failover, there might also be log lines for node failover, LIF migration, or high latency on the backend disks at the same time.
On general Linux NFS servers, the relevant information is usually in /var/log/messages or /var/log/syslog. Around the APD time you want to see whether there were RPC timeouts, transport errors, NIC resets, or NFS service restarts for the host IP that corresponds to the ESXi VMkernel interface. If the issue is configuration‑related (for example, export rules suddenly not matching, Kerberos failures, or NFSv4 grace periods), that also tends to show clearly in these logs.
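On a systemd-based distribution, a first pass might be to pull the NFS service journal for the APD window and to search syslog for the ESXi VMkernel IP. The unit name, log path, timestamps, and address below are placeholders that depend on the distribution and your environment:
# NFS service messages around the APD window (placeholder times)
journalctl -u nfs-server --since "2024-01-01 10:10:00" --until "2024-01-01 10:20:00"
# Kernel and NFS messages that mention the ESXi VMkernel IP (placeholder address)
grep '192.0.2.10' /var/log/messages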
Other platforms show similar patterns. Hyperconverged solutions may log controller failovers or filesystem service restarts in their own management logs at the same timestamps that ESXi reports APD. In many documented cases, APD is ultimately traced to a short loss of network connectivity or to the NFS service being restarted while ESXi still has active sessions.
Practical Troubleshooting Workflow
In practice, troubleshooting an NFS APD usually starts with a simple question: did all hosts and datastores see APD at the same time, or was the event limited to a single host, a single datastore, or a subset of the fabric? A single host and one datastore tends to point to a host‑side or network issue, such as a NIC problem or VLAN mis‑tag; simultaneous APDs across multiple hosts and the same datastore are more likely to be array‑side or network‑core events.
From the ESXi side, the first task is to build a clear timeline. Grab the vobd and vmkernel logs, extract all the vob.storage.apd messages, and list for each device or filesystem identifier when APD started, whether it hit the 140‑second timeout, and when it exited. Once you have the APD window, you can overlay any NFS warnings, networking errors, or TCP issues that appear in vmkernel around those times. This timeline is often more useful than individual error messages because it tells you exactly how long the host was blind to the datastore.
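Because ESXi log lines start with an ISO timestamp, the APD-related entries from both logs can be merged into a single chronological view, which is usually easier to read than the two files side by side. A simple sketch, assuming the default log locations:
# Merge APD-related lines from vobd and vmkernel into one time-ordered list
grep -hE 'storage.apd|StorageApdHandler' /var/run/log/vobd.log /var/run/log/vmkernel.log | sort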
In parallel, check the current state of the environment. On an affected host, esxcli storage filesystem list will confirm whether the NFS datastore is still mounted, in an inaccessible state, or has recovered. If the datastore is still visible but VMs are sluggish, look for ongoing NFS latency messages or packet‑loss symptoms; if the datastore has disappeared entirely from the host view, then the focus shifts more to export definitions, DNS, routing, and the NFS service itself.
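Alongside esxcli storage filesystem list, a few quick checks on the affected host confirm whether the mounts are still present and whether the NFS server is reachable at all; the VMkernel interface name and server address below are placeholders for your environment:
# Mount state of NFSv3 and NFSv4.1 datastores on this host
esxcli storage nfs list
esxcli storage nfs41 list
# Basic reachability from the VMkernel interface used for NFS
vmkping -I vmk1 192.0.2.20
nc -z 192.0.2.20 2049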
Once the ESXi view is clear, move to the NFS server and the switching infrastructure. Using the APD timestamps, review array or server logs for connection resets, session drops, failovers, or heavy latency. If, for instance, the array log shows that the connection from the ESXi IP was reset because of a TCP or congestion issue exactly at the APD start time, the root cause is probably somewhere between that controller and the host. In environments where network packet loss triggers slow‑start behavior and repeated retransmissions, the effective throughput can collapse to the point that ESXi perceives it as an APD even though the interface never technically goes down.
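If packet loss or retransmissions are suspected, the host-side NIC counters and the state of the TCP connection to the NFS server are worth a quick look before moving on to switch counters; the NIC name and address below are placeholders:
# Physical NIC error and drop counters
esxcli network nic stats get -n vmnic0
# TCP connections to the NFS server on port 2049
esxcli network ip connection list | grep '192.0.2.20:2049'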
A common outcome of this analysis is that the real problem is either a transient network issue (link flap, misconfigured MTU, queue drops) or a storage‑side transient (controller failover, NFS daemon restart). Addressing that underlying cause usually prevents further APDs. If the APD condition persists or if the host has been stuck in APD for an extended period, many vendors recommend a controlled reboot of affected ESXi hosts after the storage problem has been resolved, to clear any stale device state and residual APD references.