Saturday, July 15, 2023

ZFS zpool replace "no such device in pool"

I built my ZFS pool in 2016 with 4x 1TB SSDs in a Thunderbolt 2 enclosure. I ran out of space in 2019 and upgraded the SSDs to 4x 2TB. I didn't have an extra enclosure, so I took a fairly risky approach: yank each SSD out, put in a new SSD, and let ZFS resilver. It only worked because my pool is a raidz1, which tolerates one missing disk. During each resilver, the pool was in a degraded state with no redundancy left.

And I am just about to run out of space again after 4 years. I also lucked out because SSDs are heavily discounted right now due to oversupply. The chip makers are trying to reduce production, so the price might rise again.

This time, I bought 4x 4TB M.2 NVMe (non-affiliate B&H link) and put them in a Thunderbolt 3 enclosure (non-affiliate vendor link). With the old and new disks attached at the same time, I can do a proper replace without putting the pool in a degraded state.

But when I tried to replace the SSDs, I kept getting the "no such device in pool" error. I tried these variations:

$ sudo zpool replace tank disk2 disk13
cannot replace disk2 with disk13: no such device in pool
$ sudo zpool replace tank disk2 /dev/disk13
cannot replace disk2 with /dev/disk13: no such device in pool
$ sudo zpool replace tank disk2 media-728C3B9F-647B-4C40-9DDF-E140E55D8C89
cannot replace disk2 with media-728C3B9F-647B-4C40-9DDF-E140E55D8C89: no such device in pool
$ sudo zpool replace tank disk2 /var/run/disk/by-id/media-728C3B9F-647B-4C40-9DDF-E140E55D8C89
cannot replace disk2 with /var/run/disk/by-id/media-728C3B9F-647B-4C40-9DDF-E140E55D8C89: no such device in pool

None of the above worked, and it was driving me nuts. When I was younger, I would have pulled my hair out. Now that my hair is more sparse, I've learned to take better care of it by showing restraint.

It turned out that "no such device in pool" actually meant ZFS could not find disk2 in the pool, even though it was able to find disk13 just fine for some reason. This worked:

$ sudo zpool replace tank media-E235653B-0924-2F46-B708-FDEFFAB5CB15 disk13
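
The old disk has to be named the way the pool itself records it. One place to look for that name is the NAME column of `zpool status`. As a rough sketch, the helper below prints just the leaf device names from that output. `pool_device_names` is a made-up name, and the parsing assumes the usual config table layout: a `NAME STATE ...` header row, then the pool name, then `raidz` or `mirror` group rows above the disks.

```shell
# Hypothetical helper: print the device names exactly as the pool records them.
# Reads `zpool status` output on stdin, e.g.:
#   zpool status tank | pool_device_names
pool_device_names() {
  awk '
    # The header row starts the config table.
    $1 == "NAME" && $2 == "STATE" { in_table = 1; seen_pool = 0; next }
    # A blank line ends the table.
    in_table && NF == 0           { in_table = 0 }
    # The first row after the header is the pool itself; skip it.
    in_table && !seen_pool        { seen_pool = 1; next }
    # Skip group rows such as raidz1-0 or mirror-0; print everything else.
    in_table && $1 !~ /^(raidz[0-9]*|mirror)-[0-9]+$/ { print $1 }
  '
}
```

Whatever this prints for the old disk is the name to pass as the first device argument to `zpool replace`.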

The colon in the error message makes it appear that "no such device in pool" is about the new disk, which is confusing: why would the new disk have to be added to the pool first? The error is actually about ZFS not being able to identify the existing disk in the pool that needs replacement. In fact, if it could not find the new disk, the error message would be different:

$ sudo zpool replace tank disk2 disk777
cannot open 'disk777': no such device in /dev
must be a full path or shorthand device name

The error message could be improved, for sure. I wonder if baldness among people in the tech industry could have been prevented if more error messages were helpful.

Another thing I learned is that when upgrading raidz1, it is far more efficient to replace all the disks at once and let ZFS resilver them in parallel. It takes the same amount of time to resilver one disk as to resilver all disks. The reason is that resilvering requires reading all disks in order to check the parity, whether it is resilvering one disk or many. In my case, the process is bottlenecked by the read, not the write.
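
As a rough sketch of what "all at once" could look like, the helper below only prints one `zpool replace` command per old:new pair, so the list can be reviewed before anything is run. `plan_replacements` is a hypothetical name, and only the first pair comes from the commands above. The other device names are placeholders.

```shell
# Hypothetical helper: print (do not run) one `zpool replace` per old:new pair.
# Device names must not contain ":" for the split below to work.
plan_replacements() {
  pool=$1
  shift
  for pair in "$@"; do
    old=${pair%%:*}   # text before the first ":" is the name the pool knows
    new=${pair#*:}    # text after the first ":" is the new device
    printf 'sudo zpool replace %s %s %s\n' "$pool" "$old" "$new"
  done
}

# Only the first pair is real; the rest are placeholders for illustration.
plan_replacements tank \
  media-E235653B-0924-2F46-B708-FDEFFAB5CB15:disk13 \
  old-disk-b:disk14 \
  old-disk-c:disk15 \
  old-disk-d:disk16
```

Once the printed commands look right, they can be run one after another. How ZFS schedules the resulting resilvers may depend on the version, so it is worth checking `zpool status tank` afterwards.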