LXD error: Error: LXD unix socket not accessible: Get "http://unix.socket/1.0": EOF

I recently rebooted my server, and afterwards none of the LXD containers would start, which nearly drove me up the wall.

Listing the containers produced the following error:

$ sudo lxc list
Error: LXD unix socket not accessible: Get "http://unix.socket/1.0": EOF

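Since I had only rebooted and done nothing else, I suspected the ZFS storage, given that two of the disks are mechanical hard drives, so I checked the zpool status: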

$ sudo zpool status -v
  pool: lxd
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 9h3m with 1 errors on Sun Apr 14 09:27:53 2024
config:

	NAME                                      STATE     READ WRITE CKSUM
	lxd                                       ONLINE       0     0     0
	  /var/snap/lxd/common/lxd/disks/lxd.img  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        lxd/containers/amr:/rootfs/home/amr/Data/manipulated_sequences/Deepfakes/c40/videos/DF_C40/41756.png

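Sure enough, zpool reported an error: an image file at one path inside one container is corrupted. Before touching anything I backed up the entire lxd.img image in case something got lost later. Fortunately it is only 3 TB, so it fits on the NAS.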
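
When the list of damaged files is longer than one entry, it can help to pull it out of the status output mechanically. The following is only a minimal sketch: the function name list_damaged is my own, and it assumes the output layout shown above, where the entries follow the "errors: Permanent errors" line.

```shell
# Print the entries listed under "Permanent errors" in `zpool status -v` output.
# Reads the status text from stdin; list_damaged is a hypothetical helper name.
list_damaged() {
    awk '
        /^errors: Permanent errors/ { in_list = 1; next }   # start of the list
        in_list && NF { sub(/^[ \t]+/, ""); print }          # strip the indent
    '
}

# Usage (assumes the pool is named lxd, as in this post):
#   sudo zpool status -v lxd | list_damaged
```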

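This did not look too serious: according to what I found online, ZFS has commands that scan for errors and repair them automatically.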

I ran sudo zpool scrub and sudo zpool clear twice, but after a check of about 9 hours each time the problem was still there, so I had to delete the damaged file by hand.

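Mount the affected container's dataset at /mnt, change into the directory in question, and delete the damaged file manually. For convenience I did this as root. The commands were: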

$ sudo mount -t zfs lxd/containers/amr /mnt

$ su
# cd /mnt/rootfs/home/amr/Data/manipulated_sequences/Deepfakes/c40/videos/DF_C40/
# rm -f 41756.png
# exit

$ sudo umount /mnt
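
The directory used above is simply the mount point plus the part of the zpool entry after the colon. As a rough sketch (entry_to_path is a name I made up), that mapping can be written as a small function. It deliberately refuses entries such as <0x1ea187>, which no longer correspond to a path.

```shell
# Turn a "dataset:/path" entry from `zpool status -v` into a path under a
# mount point. entry_to_path is a hypothetical helper, not a ZFS command.
entry_to_path() {
    mountpoint=$1
    entry=$2
    path=${entry#*:}              # keep everything after the first colon
    case $path in
        /*) printf '%s%s\n' "${mountpoint%/}" "$path" ;;
        *)  return 1 ;;           # e.g. "<0x1ea187>": no path to resolve
    esac
}

# Usage:
#   entry_to_path /mnt 'lxd/containers/amr:/rootfs/home/amr/file.png'
```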

At this point zpool status reports the following:

$ sudo zpool status -v
  pool: lxd
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 8h50m with 1 errors on Sat Apr 20 04:36:35 2024
config:

	NAME                                      STATE     READ WRITE CKSUM
	lxd                                       ONLINE       0     0     0
	  /var/snap/lxd/common/lxd/disks/lxd.img  ONLINE       0     0     4

errors: Permanent errors have been detected in the following files:

        lxd/containers/amr:<0x1ea187>

After running sudo zpool scrub once more:

$ sudo zpool status -v
  pool: lxd
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 9h20m with 0 errors on Sat Apr 20 19:52:37 2024
config:

	NAME                                      STATE     READ WRITE CKSUM
	lxd                                       ONLINE       0     0     0
	  /var/snap/lxd/common/lxd/disks/lxd.img  ONLINE       0     0     4

errors: No known data errors

The data error appears to be gone, but the output still asks for sudo zpool clear. After running it:

$ sudo zpool status -v
  pool: lxd
 state: ONLINE
  scan: scrub repaired 0B in 9h17m with 0 errors on Sun Apr 21 05:41:04 2024
config:

	NAME                                      STATE     READ WRITE CKSUM
	lxd                                       ONLINE       0     0     0
	  /var/snap/lxd/common/lxd/disks/lxd.img  ONLINE       0     0     0

errors: No known data errors
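
Across the outputs above I was really watching two things: the CKSUM column and the final errors line. A small sketch that condenses a status dump to just those two values might look like this. The name pool_summary is my own, and it assumes the column layout shown above, with CKSUM as the fifth field.

```shell
# Summarise `zpool status` text from stdin as "cksum=<sum> errors=<message>".
# pool_summary is a hypothetical helper; it assumes CKSUM is the fifth column.
pool_summary() {
    awk '
        /^[ \t]*NAME[ \t]+STATE/ { in_cfg = 1; next }   # header of device table
        in_cfg && NF == 0        { in_cfg = 0 }         # blank line ends table
        in_cfg && NF >= 5        { cksum += $5 }        # add up CKSUM counters
        /^errors:/               { sub(/^errors: /, ""); err = $0 }
        END { printf "cksum=%d errors=%s\n", cksum, err }
    '
}

# Usage (pool name lxd as in this post):
#   sudo zpool status lxd | pool_summary
```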

At this point the ZFS file system used by LXD is healthy again, but sudo lxc list still fails:

Error: LXD unix socket "/var/snap/lxd/common/lxd/unix.socket" not accessible: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: resource temporarily unavailable

The LXD debug output is as follows:

$ sudo lxd --debug --group lxd
DEBUG  [2024-04-21T08:08:14+08:00] Connecting to a local LXD over a Unix socket
DEBUG  [2024-04-21T08:08:14+08:00] Sending request to LXD                        etag= method=GET url="http://unix.socket/1.0"
INFO   [2024-04-21T08:08:14+08:00] LXD is starting                               mode=normal path=/var/snap/lxd/common/lxd version=5.21.1
INFO   [2024-04-21T08:08:14+08:00] Kernel uid/gid map:
INFO   [2024-04-21T08:08:14+08:00]  - u 0 0 4294967295
INFO   [2024-04-21T08:08:14+08:00]  - g 0 0 4294967295
INFO   [2024-04-21T08:08:14+08:00] Configured LXD uid/gid map:
INFO   [2024-04-21T08:08:14+08:00]  - u 0 1000000 1000000000
INFO   [2024-04-21T08:08:14+08:00]  - g 0 1000000 1000000000
INFO   [2024-04-21T08:08:14+08:00] Kernel features:
INFO   [2024-04-21T08:08:14+08:00]  - closing multiple file descriptors efficiently: no
INFO   [2024-04-21T08:08:14+08:00]  - netnsid-based network retrieval: yes
INFO   [2024-04-21T08:08:14+08:00]  - pidfds: no
INFO   [2024-04-21T08:08:14+08:00]  - core scheduling: no
INFO   [2024-04-21T08:08:14+08:00]  - uevent injection: yes
INFO   [2024-04-21T08:08:14+08:00]  - seccomp listener: yes
INFO   [2024-04-21T08:08:14+08:00]  - seccomp listener continue syscalls: yes
INFO   [2024-04-21T08:08:14+08:00]  - seccomp listener add file descriptors: no
INFO   [2024-04-21T08:08:14+08:00]  - attach to namespaces via pidfds: no
INFO   [2024-04-21T08:08:14+08:00]  - safe native terminal allocation : yes
INFO   [2024-04-21T08:08:14+08:00]  - unprivileged file capabilities: yes
INFO   [2024-04-21T08:08:14+08:00]  - cgroup layout: hybrid
WARNING[2024-04-21T08:08:14+08:00]  - Couldn't find the CGroup blkio.weight, disk priority will be ignored
WARNING[2024-04-21T08:08:14+08:00]  - Couldn't find the CGroup memory swap accounting, swap limits will be ignored
INFO   [2024-04-21T08:08:14+08:00]  - idmapped mounts kernel support: no
INFO   [2024-04-21T08:08:14+08:00] Instance type operational                     driver=lxc features="map[]" type=container
ERROR  [2024-04-21T08:08:14+08:00] Unable to run feature checks during QEMU initialization: Unable to locate the file for firmware "OVMF_CODE.4MB.fd"
WARNING[2024-04-21T08:08:14+08:00] Instance type not operational                 driver=qemu err="QEMU failed to run feature checks" type=virtual-machine
INFO   [2024-04-21T08:08:14+08:00] Initializing local database
DEBUG  [2024-04-21T08:08:14+08:00] Refreshing identity cache with local trusted certificates
INFO   [2024-04-21T08:08:14+08:00] Set client certificate to server certificate  fingerprint=7bfa6d5710e943f5f23524bcca9f0a51bb5f58f819d1b9fb3e1d843facc0a20b
DEBUG  [2024-04-21T08:08:14+08:00] Initializing database gateway
INFO   [2024-04-21T08:08:14+08:00] Starting database node                        id=1 local=1 role=voter
ERROR  [2024-04-21T08:08:14+08:00] Failed to start the daemon                    err="Failed to start dqlite server: raft_start(): io: load closed segment 0000000000185550-0000000000185550: entries batch 45 starting at byte 487448: entries count in preamble is zero"
INFO   [2024-04-21T08:08:14+08:00] Starting shutdown sequence                    signal=interrupt
INFO   [2024-04-21T08:08:14+08:00] Not unmounting temporary filesystems (instances are still running)
INFO   [2024-04-21T08:08:14+08:00] Daemon stopped
Error: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000185550-0000000000185550: entries batch 45 starting at byte 487448: entries count in preamble is zero

The key part of the error is an I/O failure while loading the database segment 0000000000185550-0000000000185550.
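
To avoid copying the long segment name by hand, it can be cut out of the log. This is a sketch only: bad_segment is a name I chose, and the pattern is based solely on the error text shown above, so other LXD versions may word it differently.

```shell
# Extract the segment name from a "load closed segment ..." error on stdin.
# bad_segment is a hypothetical helper; it prints the first match only.
bad_segment() {
    sed -n 's/.*load closed segment \([0-9][0-9]*-[0-9][0-9]*\):.*/\1/p' | head -n 1
}

# Usage:
#   sudo lxd --debug --group lxd 2>&1 | bad_segment
```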

The database lives under /var/snap/lxd/common/lxd/database/global/.

In that directory, delete 0000000000185550-0000000000185550 and every segment file numbered after it. Back everything up before deleting.
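
Picking out "this file and everything numbered after it" is easy to get wrong by eye, so here is a hedged sketch that only lists the candidates. The helper name segments_from and the backup path in the usage comment are my own assumptions. It matches only names of the digits-digits form seen in the error message and relies on the zero padding so that string order equals numeric order.

```shell
# List files in directory $1 named <digits>-<digits> that sort at or after $2.
# segments_from is a hypothetical helper; it lists names and deletes nothing.
segments_from() {
    ls "$1" | awk -v first="$2" '
        # Appending "" forces a string comparison; zero padding keeps it ordered.
        /^[0-9]+-[0-9]+$/ && ($0 "") >= (first "")
    '
}

# Usage (stop LXD and copy the whole directory first; /root/lxd-db.bak is
# just an example destination):
#   sudo snap stop lxd
#   sudo cp -a /var/snap/lxd/common/lxd/database /root/lxd-db.bak
#   sudo sh -c 'ls /var/snap/lxd/common/lxd/database/global/'   # compare by eye
```

Reviewing the printed list before removing anything keeps the destructive step manual, which seems the safer choice for a database directory.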

After the deletion, sudo lxc list no longer fails, but every container is in the stopped state and cannot be started. Checking LXD's storage gives another error:

$ sudo lxc storage list
Error: Required tool 'zpool' is missing

Yet zpool is installed.

Checking the LXD version:

$ snap list lxd
Name  Version         Rev    Tracking     Publisher   Notes
lxd   5.21.1-98dad8f  28323  5.21/stable  canonical✓  -

It is already on the latest version, 5.21.1, so I tried downgrading and then upgrading again:

$ sudo snap refresh lxd --channel=5.20/stable
$ sudo snap refresh lxd --channel=5.21/stable
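
Before switching channels it is worth noting which one the snap currently tracks, so the second refresh goes back to the right place. A minimal sketch follows, assuming the column layout of the snap list output shown above, with Tracking as the fourth column. The name snap_tracking is made up.

```shell
# Print the Tracking column for the lxd snap from `snap list` text on stdin.
# snap_tracking is a hypothetical helper; it assumes Tracking is column 4.
snap_tracking() {
    awk '$1 == "lxd" { print $4 }'
}

# Usage:
#   channel=$(snap list lxd | snap_tracking)
#   sudo snap refresh lxd --channel="$channel"
```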

After that the problem was resolved:

$ sudo lxc storage list
+------+--------+----------------------------------------+-------------+---------+---------+
| NAME | DRIVER |                 SOURCE                 | DESCRIPTION | USED BY |  STATE  |
+------+--------+----------------------------------------+-------------+---------+---------+
| lxd  | zfs    | /var/snap/lxd/common/lxd/disks/lxd.img |             | 30      | CREATED |
+------+--------+----------------------------------------+-------------+---------+---------+

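All containers start normally, with almost no data loss. My guess is that simply reinstalling LXD would have been enough. After the downgrade, LXD's debug output reported this error: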

Error: Failed to initialize global database: failed to ensure schema: schema version '73' is more recent than expected '69'

I took this to be a version mismatch, so I upgraded back, and all the errors were gone.

The takeaway: ZFS is a good thing to have, but back up regularly, build redundancy into your disks, and avoid mechanical hard drives where you can.

I am recording this here in the hope that it helps you.
