I recently rebooted my server, and afterwards none of the LXD containers would start, which very nearly drove me up the wall.

Listing the containers failed with the following error:

$ sudo lxc list
Error: LXD unix socket not accessible: Get "http://unix.socket/1.0": EOF

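Since I had only rebooted and done nothing else, I suspected a problem with the ZFS storage, especially as two of the drives are mechanical hard disks. So I checked the zpool status: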

$ sudo zpool status -v
pool: lxd
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: scrub repaired 0B in 9h3m with 1 errors on Sun Apr 14 09:27:53 2024
config:

NAME STATE READ WRITE CKSUM
lxd ONLINE 0 0 0
/var/snap/lxd/common/lxd/disks/lxd.img ONLINE 0 0 0

errors: Permanent errors have been detected in the following files:

lxd/containers/amr:/rootfs/home/amr/Data/manipulated_sequences/Deepfakes/c40/videos/DF_C40/41756.png

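Sure enough, zpool reported an error: an image file at one particular path inside one container was corrupted. Before touching anything, I backed up the entire lxd.img image in case something got lost later. Luckily it is only 3TB, so it fits on the NAS.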
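For a backup this large it is worth confirming that the copy matches the original. Below is a minimal sketch, not the exact procedure used here: the function name `backup_and_verify` is made up for illustration, and it assumes GNU coreutils (`cp`, `sha256sum`) are available. Ideally the image is idle while it is being copied.

```shell
# Copy a file and confirm the copy by comparing SHA-256 digests.
# Both paths are supplied by the caller; nothing here is specific to LXD.
backup_and_verify() {
    src=$1
    dst=$2
    cp -- "$src" "$dst" || return 1
    # Hash both files and keep only the digest field.
    src_sum=$(sha256sum < "$src" | cut -d' ' -f1)
    dst_sum=$(sha256sum < "$dst" | cut -d' ' -f1)
    if [ "$src_sum" = "$dst_sum" ]; then
        echo "backup verified: $dst"
    else
        echo "checksum mismatch: $dst" >&2
        return 1
    fi
}
```

From a root shell this could be called as `backup_and_verify /var/snap/lxd/common/lxd/disks/lxd.img /path/on/nas/lxd.img.bak`, where the destination path is hypothetical.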

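Fortunately this did not look too serious: according to what I found online, ZFS has commands that scan for errors and repair them automatically.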

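I ran sudo zpool scrub followed by sudo zpool clear twice, but after a 9-hour verification each time the problem was still there, so I had no choice but to delete the damaged file by hand.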
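Since each scrub takes hours, it helps to read the result straight off the `scan:` line instead of scanning the whole status output. This is a small sketch under the assumption that the line has the same shape as in the output shown in this post; `scrub_error_count` is a hypothetical helper name.

```shell
# Print the error count from the "scan:" line of `zpool status` output
# read on stdin. Prints nothing if no finished scrub is reported.
scrub_error_count() {
    sed -n 's/^ *scan: scrub repaired .* with \([0-9][0-9]*\) errors.*/\1/p'
}

# Typical use (pool name "lxd" taken from the status output in this post):
#   sudo zpool scrub lxd
#   ... wait for the scrub to finish ...
#   sudo zpool status lxd | scrub_error_count
```

A result of 0 means the last scrub finished without errors; anything else means the damaged data is still there.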

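Mount the affected container's dataset at /mnt, change into the relevant directory, and delete the damaged file manually. For convenience I did this as root. The commands are as follows: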

$ sudo mount -t zfs lxd/containers/amr /mnt

$ su
# cd /mnt/rootfs/home/amr/Data/manipulated_sequences/Deepfakes/c40/videos/DF_C40/
# rm -f 41756.png
# exit

$ sudo umount /mnt
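The dataset to mount and the path inside it both come from the `dataset:/path` entry that `zpool status -v` prints. A minimal sketch for splitting such an entry follows; the two function names are hypothetical, and it assumes the dataset name itself contains no colon.

```shell
# Split a "dataset:/path" entry from `zpool status -v` into its two parts.
# Assumes the first colon separates the dataset name from the file path.
error_dataset() { printf '%s\n' "${1%%:*}"; }   # text before the first colon
error_path()    { printf '%s\n' "${1#*:}"; }    # text after the first colon
```

With the dataset mounted at /mnt, the damaged file is then at `/mnt` followed by whatever `error_path` prints.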

Checking the zpool status at this point still shows an error:

$ sudo zpool status -v
pool: lxd
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: scrub repaired 0B in 8h50m with 1 errors on Sat Apr 20 04:36:35 2024
config:

NAME STATE READ WRITE CKSUM
lxd ONLINE 0 0 0
/var/snap/lxd/common/lxd/disks/lxd.img ONLINE 0 0 4

errors: Permanent errors have been detected in the following files:

lxd/containers/amr:<0x1ea187>

Run sudo zpool scrub again, then check the status:

$ sudo zpool status -v
pool: lxd
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: scrub repaired 0B in 9h20m with 0 errors on Sat Apr 20 19:52:37 2024
config:

NAME STATE READ WRITE CKSUM
lxd ONLINE 0 0 0
/var/snap/lxd/common/lxd/disks/lxd.img ONLINE 0 0 4

errors: No known data errors

The error seems to have disappeared, but the output still asks for sudo zpool clear, so I ran it:

$ sudo zpool status -v
pool: lxd
state: ONLINE
scan: scrub repaired 0B in 9h17m with 0 errors on Sun Apr 21 05:41:04 2024
config:

NAME STATE READ WRITE CKSUM
lxd ONLINE 0 0 0
/var/snap/lxd/common/lxd/disks/lxd.img ONLINE 0 0 0

errors: No known data errors

At this point the ZFS file system used by LXD is healthy again, but running sudo lxc list still fails:

Error: LXD unix socket "/var/snap/lxd/common/lxd/unix.socket" not accessible: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: resource temporarily unavailable

LXD's debug output looks like this:

$ sudo lxd --debug --group lxd
DEBUG [2024-04-21T08:08:14+08:00] Connecting to a local LXD over a Unix socket
DEBUG [2024-04-21T08:08:14+08:00] Sending request to LXD etag= method=GET url="http://unix.socket/1.0"
INFO [2024-04-21T08:08:14+08:00] LXD is starting mode=normal path=/var/snap/lxd/common/lxd version=5.21.1
INFO [2024-04-21T08:08:14+08:00] Kernel uid/gid map:
INFO [2024-04-21T08:08:14+08:00] - u 0 0 4294967295
INFO [2024-04-21T08:08:14+08:00] - g 0 0 4294967295
INFO [2024-04-21T08:08:14+08:00] Configured LXD uid/gid map:
INFO [2024-04-21T08:08:14+08:00] - u 0 1000000 1000000000
INFO [2024-04-21T08:08:14+08:00] - g 0 1000000 1000000000
INFO [2024-04-21T08:08:14+08:00] Kernel features:
INFO [2024-04-21T08:08:14+08:00] - closing multiple file descriptors efficiently: no
INFO [2024-04-21T08:08:14+08:00] - netnsid-based network retrieval: yes
INFO [2024-04-21T08:08:14+08:00] - pidfds: no
INFO [2024-04-21T08:08:14+08:00] - core scheduling: no
INFO [2024-04-21T08:08:14+08:00] - uevent injection: yes
INFO [2024-04-21T08:08:14+08:00] - seccomp listener: yes
INFO [2024-04-21T08:08:14+08:00] - seccomp listener continue syscalls: yes
INFO [2024-04-21T08:08:14+08:00] - seccomp listener add file descriptors: no
INFO [2024-04-21T08:08:14+08:00] - attach to namespaces via pidfds: no
INFO [2024-04-21T08:08:14+08:00] - safe native terminal allocation : yes
INFO [2024-04-21T08:08:14+08:00] - unprivileged file capabilities: yes
INFO [2024-04-21T08:08:14+08:00] - cgroup layout: hybrid
WARNING[2024-04-21T08:08:14+08:00] - Couldn't find the CGroup blkio.weight, disk priority will be ignored
WARNING[2024-04-21T08:08:14+08:00] - Couldn't find the CGroup memory swap accounting, swap limits will be ignored
INFO [2024-04-21T08:08:14+08:00] - idmapped mounts kernel support: no
INFO [2024-04-21T08:08:14+08:00] Instance type operational driver=lxc features="map[]" type=container
ERROR [2024-04-21T08:08:14+08:00] Unable to run feature checks during QEMU initialization: Unable to locate the file for firmware "OVMF_CODE.4MB.fd"
WARNING[2024-04-21T08:08:14+08:00] Instance type not operational driver=qemu err="QEMU failed to run feature checks" type=virtual-machine
INFO [2024-04-21T08:08:14+08:00] Initializing local database
DEBUG [2024-04-21T08:08:14+08:00] Refreshing identity cache with local trusted certificates
INFO [2024-04-21T08:08:14+08:00] Set client certificate to server certificate fingerprint=7bfa6d5710e943f5f23524bcca9f0a51bb5f58f819d1b9fb3e1d843facc0a20b
DEBUG [2024-04-21T08:08:14+08:00] Initializing database gateway
INFO [2024-04-21T08:08:14+08:00] Starting database node id=1 local=1 role=voter
ERROR [2024-04-21T08:08:14+08:00] Failed to start the daemon err="Failed to start dqlite server: raft_start(): io: load closed segment 0000000000185550-0000000000185550: entries batch 45 starting at byte 487448: entries count in preamble is zero"
INFO [2024-04-21T08:08:14+08:00] Starting shutdown sequence signal=interrupt
INFO [2024-04-21T08:08:14+08:00] Not unmounting temporary filesystems (instances are still running)
INFO [2024-04-21T08:08:14+08:00] Daemon stopped
Error: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000185550-0000000000185550: entries batch 45 starting at byte 487448: entries count in preamble is zero

The key part of the error is an I/O error while loading the database segment 0000000000185550-0000000000185550.
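To avoid mistyping a 33-character file name, the segment name can be pulled out of the error text. This is only a sketch that assumes the message keeps the `load closed segment <name>:` wording seen in the debug output here; `bad_segment` is a hypothetical helper name.

```shell
# Print the segment name mentioned in a dqlite "load closed segment" error
# read on stdin. Only the first match is printed, because the same message
# appears on both the ERROR line and the final Error line.
bad_segment() {
    sed -n 's/.*load closed segment \([0-9][0-9]*-[0-9][0-9]*\):.*/\1/p' | head -n 1
}

# Typical use:
#   sudo lxd --debug --group lxd 2>&1 | bad_segment
```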

The database files live in /var/snap/lxd/common/lxd/database/global/

In that directory, delete 0000000000185550-0000000000185550 along with every database segment file numbered after it. Make a backup before deleting anything.
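One way to combine the backup and the deletion is to move the affected files into a separate directory rather than removing them. The sketch below is not the exact procedure used in this post. It assumes segment files are named `<first>-<last>` with zero-padded decimal indices, as in the file named by the error, and it leaves every other file alone. `quarantine_segments` and `strip_zeros` are hypothetical names. LXD should presumably be stopped first (for example with `sudo snap stop lxd`).

```shell
# Drop leading zeros so the shell does not read the number as octal.
strip_zeros() {
    n=${1#"${1%%[!0]*}"}
    printf '%s\n' "${n:-0}"
}

# Move segment files whose first index is >= START from DB_DIR to BACKUP_DIR.
# Usage: quarantine_segments DB_DIR START BACKUP_DIR
quarantine_segments() {
    db_dir=$1
    start=$(strip_zeros "$2")
    backup_dir=$3
    mkdir -p "$backup_dir" || return 1
    for f in "$db_dir"/*-*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        first=${name%%-*}
        case $first in
            ''|*[!0-9]*) continue ;;   # not a numeric segment name, skip it
        esac
        if [ "$(strip_zeros "$first")" -ge "$start" ]; then
            mv -- "$f" "$backup_dir"/ && printf 'moved %s\n' "$name"
        fi
    done
}
```

For the case in this post, the call from a root shell would look like `quarantine_segments /var/snap/lxd/common/lxd/database/global 0000000000185550 /root/lxd-db-backup`, where the backup directory is a made-up example. If LXD then starts correctly, the moved files can be deleted later.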

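After the deletion, sudo lxc list no longer reports an error, but every container is in the stopped state and cannot be started. Checking the LXD storage status gives an error: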

$ sudo lxc storage list
Error: Required tool 'zpool' is missing

Yet zpool is installed.

Check the LXD version:

$ snap list lxd
Name Version Rev Tracking Publisher Notes
lxd 5.21.1-98dad8f 28323 5.21/stable canonical✓ -

It is already on 5.21.1, the latest version, so I tried downgrading and then upgrading again:

$ sudo snap refresh lxd --channel=5.20/stable
$ sudo snap refresh lxd --channel=5.21/stable

After that, the problem was solved:

$ sudo lxc storage list
+------+--------+----------------------------------------+-------------+---------+---------+
| NAME | DRIVER | SOURCE | DESCRIPTION | USED BY | STATE |
+------+--------+----------------------------------------+-------------+---------+---------+
| lxd | zfs | /var/snap/lxd/common/lxd/disks/lxd.img | | 30 | CREATED |
+------+--------+----------------------------------------+-------------+---------+---------+

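All containers start normally again, with almost no data lost. In hindsight, I suspect that simply reinstalling LXD would have been enough. After the downgrade, LXD's debug output reported the following error: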

Error: Failed to initialize global database: failed to ensure schema: schema version '73' is more recent than expected '69'

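This is presumably a version mismatch: the database schema was already at the newer version than the downgraded LXD expected. So I upgraded back, and all the errors were gone.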

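The takeaway: ZFS is a great file system, but back it up regularly, give the disks proper redundancy, and avoid mechanical hard disks where possible.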

I am writing this down in the hope that it helps you~

References:
- Error: LXD unix socket “/var/snap/lxd/common/lxd/unix.socket” not accessible: Get “http://unix.socket/1.0”: dial unix /var/snap/lxd/common/lxd/unix.socket: connect: resource temporarily unavailable
- Permanent errors have been detected in the following files: #9705
- Clear a permanent ZFS error in a healthy pool
- Ubuntu 22.04, LXD 5.0.2 - “Required tool ‘zpool’ is missing” after apt upgrade