When Good Drives Try To Go Bad

I logged into my FreeNAS box to check a few things and maybe do a software update. When I looked at the notifications, I saw something that every admin hates to see.

Since I was still able to log in to the web interface and access the volumes, I decided it wasn't a complete and utter failure. I logged into the server via SSH and ran good old zpool status.

  pool: freenas-boot
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 00:07:53 with 2 errors on Wed Jul 15 03:52:53 2020
config:

	NAME                                          STATE     READ WRITE CKSUM
	freenas-boot                                  DEGRADED     2     0     0
	  gptid/7d4cfaf2-cf70-11e5-8a07-000c29242dfe  DEGRADED     2     0    60  too many errors

errors: 2 data errors, use '-v' for a list

Okay, only two data errors; that's not too bad. Let's take a look at which files they are with zpool status -v freenas-boot.

  pool: freenas-boot
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 0 days 00:07:53 with 2 errors on Wed Jul 15 03:52:53 2020
config:

	NAME                                          STATE     READ WRITE CKSUM
	freenas-boot                                  DEGRADED     2     0     0
	  gptid/7d4cfaf2-cf70-11e5-8a07-000c29242dfe  DEGRADED     2     0    60  too many errors

errors: Permanent errors have been detected in the following files:

        freenas-boot/ROOT/11.3-U2.1@2016-02-09-14:33:33:/usr/local/lib/python2.7/email/message.pyc
        freenas-boot/ROOT/11.3-U2.1@2016-02-09-14:33:33:/usr/local/lib/python2.7/uu.pyc

They seem to be two Python library files that I'm not even sure are used for anything. Still, it's not great, since it may mean the flash drive I've been using as a boot drive is failing. Before doing anything else, the first thing I did was export my FreeNAS configuration file so I could restore it if I needed to reinstall.
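
For anyone following along, the config export lives under System → General → Save Config in the web UI. If you'd rather grab it over SSH, the running configuration is (as far as I remember) a single SQLite database, so a copy along these lines should do the same job. The path and hostname here are my assumptions about a typical install, so adjust them for your own box:

# Copy the FreeNAS configuration database off the NAS to another machine.
# /data/freenas-v1.db is where recent FreeNAS versions keep the config,
# but double-check that path on your install before relying on it.
scp root@thorn.local:/data/freenas-v1.db ~/backups/freenas-config-$(date +%Y-%m-%d).db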

I didn't have a flash drive handy, I had already backed up the configuration, and the NAS data pool was healthy, so I decided to do something I normally wouldn't recommend: run a system update in the hopes it would replace the busted files and I wouldn't have to devote a chunk of time to rebuilding my FreeNAS box.

Of course, if this made the problem worse I was going to need to rebuild the damn thing anyway, but I decided to go for the (potentially) quick option. The good news is that, on first glance, it seemed to have worked.


  pool: freenas-boot
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0 days 00:07:24 with 0 errors on Wed Jul 15 22:21:45 2020
config:

	NAME                                          STATE     READ WRITE CKSUM
	freenas-boot                                  DEGRADED     0     0     0
	  gptid/7d4cfaf2-cf70-11e5-8a07-000c29242dfe  DEGRADED     0     0     0  too many errors

errors: No known data errors

Well, it kind of worked; now I'm getting a slightly different error. Time to check it with a zpool scrub freenas-boot.
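
For reference, kicking off the scrub and checking on it is just the two commands below; the status output that follows is what I got once it had finished. On a drive this small it only takes a few minutes:

# Start a scrub of the boot pool, then check on its progress.
zpool scrub freenas-boot
zpool status freenas-boot    # the 'scan:' line shows progress and completion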

root@Thorn:~ # zpool status freenas-boot
  pool: freenas-boot
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0 days 00:08:39 with 0 errors on Wed Jul 15 23:24:26 2020
config:

	NAME                                          STATE     READ WRITE CKSUM
	freenas-boot                                  DEGRADED     0     0     0
	  gptid/7d4cfaf2-cf70-11e5-8a07-000c29242dfe  DEGRADED     0     0     0  too many errors

errors: No known data errors
root@Thorn:~ #

Since the run of zpool scrub didn't seem to identify any file errors, I may have gotten lucky with the system update. I decided to run zpool clear freenas-boot to clear the errors and then re-run zpool scrub. A little under eight minutes later it came back clean.
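
The whole sequence was just two commands, with a wait in the middle for the scrub to finish:

# Clear the logged errors, then scrub again to confirm the pool is actually healthy.
zpool clear freenas-boot
zpool scrub freenas-boot
# ...wait for the scrub to complete, then check the result:
zpool status freenas-boot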

root@Thorn:~ # zpool status freenas-boot
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:07:49 with 0 errors on Wed Jul 15 23:38:26 2020
config:

	NAME                                          STATE     READ WRITE CKSUM
	freenas-boot                                  ONLINE       0     0     0
	  gptid/7d4cfaf2-cf70-11e5-8a07-000c29242dfe  ONLINE       0     0     0

errors: No known data errors

Knowing I likely got incredibly lucky, I ordered a new flash drive online. Once it shows up I'll clone the existing boot drive onto it and see if that works, or I may just do a fresh install for paranoia's sake. Thankfully this is something that plenty of other people have done, so there are lots of guides out there. I'll update this post when I get that working.
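
If I go the clone route, the plan is a straight block-level copy with dd from the old stick to the new one, done while the system isn't booted from it. The device names below are placeholders for my setup, so check yours with gpart show first, because getting if= and of= backwards will wipe the good drive:

# Block-for-block copy of the old boot stick (da0) onto the new one (da1).
# da0/da1 are assumptions for my hardware; verify yours before running this.
dd if=/dev/da0 of=/dev/da1 bs=1m conv=sync,noerror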

I really should get around to setting up some sort of external monitoring, but that's a project for another post.