Atomic Test And Set Of Disk Block Returned False For Equality

The error message "Atomic test and set of disk block returned false for equality" typically indicates a locking failure within VMware ESXi environments using VMFS (Virtual Machine File System).

This occurs during an Atomic Test and Set (ATS) operation, a hardware-accelerated locking primitive where a host attempts to claim or update metadata on a shared storage array. When the "test" (checking whether the block's current value matches what the host expects) fails and returns false for equality, another host has most likely changed that block since it was last read, causing a miscompare.

Feature Overview: VAAI Atomic Test and Set (ATS)

ATS is part of the vStorage APIs for Array Integration (VAAI), designed to replace traditional, inefficient SCSI reservations.

Primary Function: It provides Hardware-Assisted Locking, allowing a host to lock only specific disk sectors/metadata blocks rather than the entire LUN.

Mechanism:

Test: The host reads a block and prepares a "compare" value.

Set: It issues a command to the storage array to update the block only if the current value still matches the "compare" value.

Atomic Nature: The array performs this check and write as a single, indivisible operation.

Benefit: Greatly improves performance in clusters by allowing parallel metadata access, which is critical during "boot storms" or simultaneous VM provisioning.

Why the Feature Fails ("False for Equality")

The failure usually stems from one of three areas:

Concurrency Contention: Too many hosts are trying to update the same metadata simultaneously (e.g., heavy VM power-on/off cycles), leading to frequent retries and miscompares.

Storage Latency: High I/O latency or "deteriorated performance" on the storage array can cause the ATS heartbeat to time out or mismatch.

Configuration Mismatch: Attempting to extend an "ATS-only" datastore with a non-ATS LUN, or issues with ATS Heartbeats on certain storage firmware.

Troubleshooting & Resolution

If you are seeing this error in your logs, consider these steps from industry guides:

Verify Storage Compatibility: Ensure your storage array fully supports VAAI ATS.

Check Performance Logs: Look for ScsiDeviceIO warnings in the VMkernel log that indicate high latency (e.g., jumps from 3ms to 300ms).

Adjust Heartbeat Settings: In some cases, disabling ATS heartbeats (while keeping ATS for metadata) can resolve connectivity drops caused by array timeouts.

Re-mount Datastore: For persistent mount failures, some admins found success by removing and re-adding the datastore via the esxcli command line.


The system tried to claim a specific block of data, but the "handshake" failed.

In computing, an atomic test-and-set is a "do-it-all-at-once" operation. It looks at a value, checks if it matches what it expects, and—if it does—updates it instantly. This prevents two different processes from accidentally grabbing the same resource at the exact same time. When it returns false for equality, it means:

Expectation vs. Reality: The system said, "I’ll take this block if it’s currently empty (0)."

The Conflict: It looked at the block and found something else (1), likely because another process got there a millisecond faster.

The Result: The operation failed to "set" the new value because the "test" didn't pass. In short: Someone else already has the keys to that block.
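The empty-block example above can be sketched in a few lines of Python. This is only an in-memory illustration: the `Block` class and its `test_and_set` method are hypothetical names, and a mutex stands in for the atomicity that real hardware or a storage array would provide.

```python
import threading


class Block:
    """A single lock word; the mutex makes the test and the set one step."""

    def __init__(self, value=0):
        self._value = value
        self._mutex = threading.Lock()

    def test_and_set(self, expected, new):
        """Write `new` only if the block still holds `expected`.

        Returns True when the equality test passed and the write happened,
        False when the block held something else (a miscompare).
        """
        with self._mutex:
            if self._value != expected:
                return False
            self._value = new
            return True

    def read(self):
        with self._mutex:
            return self._value


block = Block(0)
first = block.test_and_set(expected=0, new=1)   # block is empty, so the claim succeeds
second = block.test_and_set(expected=0, new=1)  # block now holds 1, so equality is false
print(first, second)
```

The second caller gets `False` without changing the block, which is the "returned false for equality" outcome in miniature.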

This error message typically appears in VMware ESXi logs (such as vmkernel.log) and indicates a failure in the Atomic Test and Set (ATS) locking mechanism, which is part of the vSphere Storage APIs for Array Integration (VAAI).

What it Means

When a host wants to lock a metadata block on a shared datastore, it sends an ATS command (specifically the SCSI COMPARE AND WRITE command) to the storage array.

The "Test": The host provides the data it expects to find in that disk block.

The "Equality": The storage array compares the actual data on the disk with the host's provided data.

The "False" Result: If the data on the disk does not match what the host expected, the equality check returns false (a "miscompare").

Because the comparison failed, the storage array refuses to perform the "Set" (write) operation. This is a safety mechanism to prevent data corruption when multiple hosts are competing for the same resource.

Common Causes

High Latency: Extreme I/O latency can cause a host to receive outdated information about a block before it tries to lock it, leading to a mismatch when the actual ATS command arrives.

Concurrency Conflicts: If another host successfully updated the block metadata just milliseconds before, the original host's "expected" data is now stale, triggering the miscompare.

Storage Array Issues: Firmware bugs or lack of proper VAAI support on the storage array can cause it to handle ATS commands incorrectly.

Multipathing/Driver Errors: Issues with the HBA (Host Bus Adapter) or the multipathing driver can disrupt the "handshake" between the host and the storage.

Troubleshooting Steps

Check Latency: Review your storage performance metrics for spikes in latency that coincide with these log entries.

Verify Compatibility: Ensure your storage array firmware and ESXi drivers are on the VMware Compatibility Guide.

Disable ATS Heartbeat: If you are seeing "Lost access to datastore" messages alongside this error, VMware often recommends disabling ATS for heartbeating as a workaround on affected arrays. Heartbeats then fall back to plain SCSI reads and writes, while ATS remains in use for other metadata locking.

Update Firmware: Check for known ATS-related bugs in your storage array's firmware version, as some vendors have specific patches for "false ATS miscompares".
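One way a "false" miscompare can arise is when a compare-and-write actually lands on the array but its acknowledgement never reaches the host, which then retries with the same expected data. The sketch below is a toy model of that sequence. The `Array` class, the `lock_with_lost_ack` helper, and the string values are all hypothetical, and the re-read check at the end is an assumption about how a host could tell the two cases apart, not a description of what ESXi does.

```python
class Array:
    """Toy storage array holding one metadata block."""

    def __init__(self, value):
        self.value = value

    def compare_and_write(self, expected, new):
        if self.value != expected:
            return False  # miscompare: the write is refused
        self.value = new
        return True


def lock_with_lost_ack(array, expected, new):
    """Simulate a host whose first command succeeds but whose reply is lost."""
    # The first attempt lands on the array; the host never sees the result.
    array.compare_and_write(expected, new)
    # The host retries with the same, now stale, expected data.
    retry_ok = array.compare_and_write(expected, new)
    # Assumed check: if the block already holds the value this host tried
    # to write, the earlier attempt won and the miscompare is "false".
    false_miscompare = (not retry_ok) and array.value == new
    return retry_ok, false_miscompare


array = Array(value="free")
retry_ok, false_miscompare = lock_with_lost_ack(
    array, expected="free", new="owned-by-host-A"
)
print(retry_ok, false_miscompare)
```

In this model the retry fails even though no other host touched the block, which is why latency and timeouts can produce miscompares without any real contention.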

In a storage context, the error "atomic test and set of disk block returned false for equality" typically indicates a locking failure in VMware ESXi environments using VAAI (vSphere Storage APIs for Array Integration).

It occurs when a host attempts to update a disk block (such as a VMFS metadata heartbeat) but finds that the data currently on the disk does not match what it expected to see before making the change.

Core Mechanism: Atomic Test and Set (ATS)

Traditional storage uses "SCSI Reservations" to lock an entire LUN (volume), which can cause performance bottlenecks. Modern systems use ATS (also known as Hardware Assisted Locking) to lock only specific disk blocks.

The "Test": The host sends the array a "test-image" (the data it expects to find), and the array compares it with the block currently on disk.

The "Set": If they match (equality), the array immediately writes the host's new data to the block as part of the same atomic operation.

The Failure: If the block on the disk has changed since the host last read it, the equality test returns false. The array then returns an "ATS Miscompare" error.

Common Causes of This Error

Race Conditions: Multiple ESXi hosts are trying to access or update the same metadata block at the same time.

Delayed I/O (Timeouts): An earlier ATS "set" command actually reached the disk even though the host thought it timed out. When the host retries with the original "test" data, it no longer matches the already-updated disk content.

Storage Array Issues: Firmware bugs or misconfigurations on the storage array can lead to incorrect reporting of block states.

Network/Fabric Instability: Dropped packets or high latency in the SAN can cause the host and storage to become out of sync regarding the lock state.

Troubleshooting Steps

Check VMkernel Logs: Look for "ATS Miscompare" or SCSI sense key MISCOMPARE (0xE or 14) in your ESXi logs.

Verify VAAI Support: Ensure your storage array's firmware is compatible with the version of ESXi you are running.

Monitor Path Latency: High latency often triggers the "timeout and retry" loop that leads to miscompares.

Consider Disabling ATS: As a last resort for stability, you can temporarily disable the ATS heartbeat so that heartbeats fall back to plain SCSI reads and writes, though this may impact performance.
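The log check above can be automated with a small parser. This sketch assumes the VMkernel line carries a fragment of the form `Valid sense data: 0xe 0x1d 0x0` (sense key, ASC, ASCQ). The `sample` line is a shortened illustration of that shape, not output captured from a real host, and the function names are hypothetical.

```python
import re

# A subset of SCSI sense keys; 0xE is MISCOMPARE.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
    0xE: "MISCOMPARE",
}

SENSE_RE = re.compile(
    r"Valid sense data:\s*(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)\s+(0x[0-9a-fA-F]+)"
)


def parse_sense(line):
    """Return (sense_key, asc, ascq) as integers, or None if absent."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    return tuple(int(group, 16) for group in match.groups())


def is_miscompare(line):
    """True when the line reports sense key 0xE (MISCOMPARE)."""
    sense = parse_sense(line)
    return sense is not None and sense[0] == 0xE


# Illustrative fragment only; real log lines carry more fields.
sample = "failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0."
sense = parse_sense(sample)
print(sense, SENSE_KEYS.get(sense[0]), is_miscompare(sample))
```

Filtering a log through `is_miscompare` and comparing the timestamps against latency spikes is one way to combine the first and third steps.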


This phrase seems to describe a low-level concurrency or transactional issue, likely in the context of database systems, file systems, or persistent memory. Here’s a technical review of what this could mean and the implications.

Related APIs and Commands

| API/Command | Purpose |
|-------------|---------|
| sync_file_range(2) + fdatasync(2) | Control write ordering |
| io_uring_ops with IORING_OP_COMPARE_AND_WRITE | Listed in some guides as a native compare-and-write on Linux block devices; no opcode by this name is known in mainline io_uring, so verify before relying on it |
| fcntl(F_OFD_SETLK) | POSIX file locking (not block-level) |
| nvme compare and nvme write | NVMe's compare-and-write primitives |
| rados cas (Ceph) | Object-level atomic compare-and-swap |

B. Hardware Atomicity (Rare/Specific)

Some advanced storage controllers support atomic operations directly on hardware sectors.

Scenario: A distributed system node attempts to write a "token" to a specific disk block to claim leadership.
Result: The test-and-set returns false.
Meaning: Another node has already written data to that block; the equality check against the "empty" signature failed.
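The leadership scenario above can be simulated with threads standing in for nodes. This is a sketch under stated assumptions: `SharedBlock`, `claim_leadership`, the 8-byte `EMPTY` signature, and the node names are all hypothetical, and the mutex stands in for the atomic compare-and-write a controller would provide.

```python
import threading

EMPTY = b"\x00" * 8  # hypothetical "empty" signature for an unclaimed block


class SharedBlock:
    """Toy disk block with an atomic compare-and-write."""

    def __init__(self):
        self._data = EMPTY
        self._mutex = threading.Lock()

    def compare_and_write(self, expected, new):
        with self._mutex:
            if self._data != expected:
                return False  # another node already wrote its token
            self._data = new
            return True

    def read(self):
        with self._mutex:
            return self._data


def claim_leadership(block, node_id, results):
    """Try to write this node's token into the block if it is still empty."""
    token = node_id.encode().ljust(8, b"\x00")
    results[node_id] = block.compare_and_write(EMPTY, token)


block = SharedBlock()
results = {}
threads = [
    threading.Thread(target=claim_leadership, args=(block, f"node{i}", results))
    for i in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

winners = [node for node, ok in results.items() if ok]
print(winners)
```

Whichever node wins, every other node sees `False`, because its equality check against the empty signature fails once the first token is written.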

Solution 2: Clear Orphaned Reservations

If a dead node left a reservation, clear it:

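# Register a new key (a new key is passed as the service action key)
sg_persist --out --register --param-sark=0x12345678 /dev/sdX
# Clear the reservation and all registrations, authenticating with that key
sg_persist --out --clear --param-rk=0x12345678 /dev/sdX

How to fix it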
You cannot fix this with a code patch (unless you own the database). You must fix the environment.
The Checklist:

Check your file system: Are you using tmpfs or a network drive (NFS)? Don't. Use ext4 or XFS with direct I/O disabled (or properly configured).
Check your block alignment: Is your partition starting on sector 2048? Use parted to verify. Misalignment is a commonly cited cause of this error on spinning disks.
Disable the disk cache (for testing): Run hdparm -W 0 /dev/sda (disable write cache). If the error stops, your SSD has a broken firmware cache.
Update firmware: Seriously. Old Samsung EVO drives and early NVMe drives had infamous "stale read" bugs that triggered exactly this error.
Check for hypervisor quirks: If running on VMware or VirtualBox, disable "Host I/O Cache" for the virtual disk.