LXD error: Error LXD unix socket not accessible Get "http://unix.socket/1.0" EOF

2024-04-22 | lxd, zfs | Knowledge points

I recently rebooted my server, and afterwards none of the LXD containers would start, which nearly drove me to despair.

Listing the containers failed with the following error:

    $ sudo lxc list
    Error: LXD unix socket not accessible: Get "http://unix.socket/1.0": EOF

Since all I had done was reboot, I suspected the ZFS storage; two of the drives are mechanical hard disks, after all. So I checked the zpool status:

    $ sudo zpool status -v
      pool: lxd
     state: ONLINE
    status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
    action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
       see: http://zfsonlinux.org/msg/ZFS-8000-8A
      scan: scrub repaired 0B in 9h3m with 1 errors on Sun Apr 14 09:27:53 2024
    config:

        NAME                                      STATE     READ WRITE CKSUM
        lxd                                       ONLINE       0     0     0
          /var/snap/lxd/common/lxd/disks/lxd.img  ONLINE       0     0     0

    errors: Permanent errors have been detected in the following files:

        lxd/containers/amr:/rootfs/home/amr/Data/manipulated_sequences/Deepfakes/c40/videos/DF_C40/41756.png

Sure enough, zpool reported an error: one image file inside one of the containers was corrupted. Before touching anything I backed up the whole lxd.img image in case something got lost. Luckily it is only 3 TB, so it fit on the NAS.

The problem did not look serious. According to what I found online, ZFS has commands that scan for errors and repair them automatically.

I ran sudo zpool scrub lxd and sudo zpool clear lxd twice. Each scrub took about 9 hours and the problem remained, so I had to delete the damaged file by hand.

I mounted the affected container's dataset on /mnt, changed into the directory and deleted the damaged file. For convenience I did this as root:

    $ sudo mount -t zfs lxd/containers/amr /mnt
    $ su
    $ cd /mnt/rootfs/home/amr/Data/manipulated_sequences/Deepfakes/c40/videos/DF_C40/
    $ rm -f 41756.png
    $ exit
    $ sudo umount /mnt

Checking the zpool status at this point still showed an error. The file path has been replaced by an object ID, since the file itself no longer exists:

    $ sudo zpool status -v
      pool: lxd
     state: ONLINE
    status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
    action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
       see: http://zfsonlinux.org/msg/ZFS-8000-8A
      scan: scrub repaired 0B in 8h50m with 1 errors on Sat Apr 20 04:36:35 2024
    config:

        NAME                                      STATE     READ WRITE CKSUM
        lxd                                       ONLINE       0     0     0
          /var/snap/lxd/common/lxd/disks/lxd.img  ONLINE       0     0     4

    errors: Permanent errors have been detected in the following files:

        lxd/containers/amr:<0x1ea187>

After running sudo zpool scrub lxd once more:

    $ sudo zpool status -v
      pool: lxd
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://zfsonlinux.org/msg/ZFS-8000-9P
      scan: scrub repaired 0B in 9h20m with 0 errors on Sat Apr 20 19:52:37 2024
    config:

        NAME                                      STATE     READ WRITE CKSUM
        lxd                                       ONLINE       0     0     0
          /var/snap/lxd/common/lxd/disks/lxd.img  ONLINE       0     0     4

    errors: No known data errors

The data error appeared to be gone, but the output still asked for the error counters to be cleared, so I ran sudo zpool clear lxd:

    $ sudo zpool status -v
      pool: lxd
     state: ONLINE
      scan: scrub repaired 0B in 9h17m with 0 errors on Sun Apr 21 05:41:04 2024
    config:

        NAME                                      STATE     READ WRITE CKSUM
        lxd                                       ONLINE       0     0     0
          /var/snap/lxd/common/lxd/disks/lxd.img  ONLINE       0     0     0

    errors: No known data errors

At this point the ZFS filesystem used by LXD was healthy again, but sudo lxc list still failed:

    Error: LXD unix socket "/var/snap/lxd/common/lxd/unix.socket" not accessible: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: resource temporarily unavailable

The LXD debug output was as follows:

    $ sudo lxd --debug --group lxd
    DEBUG [2024-04-21T08:08:14+08:00] Connecting to a local LXD over a Unix socket
    DEBUG [2024-04-21T08:08:14+08:00] Sending request to LXD etag= method=GET url="http://unix.socket/1.0"
    INFO [2024-04-21T08:08:14+08:00] LXD is starting mode=normal path=/var/snap/lxd/common/lxd version=5.21.1
    INFO [2024-04-21T08:08:14+08:00] Kernel uid/gid map:
    INFO [2024-04-21T08:08:14+08:00] - u 0 0 4294967295
    INFO [2024-04-21T08:08:14+08:00] - g 0 0 4294967295
    INFO [2024-04-21T08:08:14+08:00] Configured LXD uid/gid map:
    INFO [2024-04-21T08:08:14+08:00] - u 0 1000000 1000000000
    INFO [2024-04-21T08:08:14+08:00] - g 0 1000000 1000000000
    INFO [2024-04-21T08:08:14+08:00] Kernel features:
    INFO [2024-04-21T08:08:14+08:00] - closing multiple file descriptors efficiently: no
    INFO [2024-04-21T08:08:14+08:00] - netnsid-based network retrieval: yes
    INFO [2024-04-21T08:08:14+08:00] - pidfds: no
    INFO [2024-04-21T08:08:14+08:00] - core scheduling: no
    INFO [2024-04-21T08:08:14+08:00] - uevent injection: yes
    INFO [2024-04-21T08:08:14+08:00] - seccomp listener: yes
    INFO [2024-04-21T08:08:14+08:00] - seccomp listener continue syscalls: yes
    INFO [2024-04-21T08:08:14+08:00] - seccomp listener add file descriptors: no
    INFO [2024-04-21T08:08:14+08:00] - attach to namespaces via pidfds: no
    INFO [2024-04-21T08:08:14+08:00] - safe native terminal allocation : yes
    INFO [2024-04-21T08:08:14+08:00] - unprivileged file capabilities: yes
    INFO [2024-04-21T08:08:14+08:00] - cgroup layout: hybrid
    WARNING[2024-04-21T08:08:14+08:00] - Couldn't find the CGroup blkio.weight, disk priority will be ignored
    WARNING[2024-04-21T08:08:14+08:00] - Couldn't find the CGroup memory swap accounting, swap limits will be ignored
    INFO [2024-04-21T08:08:14+08:00] - idmapped mounts kernel support: no
    INFO [2024-04-21T08:08:14+08:00] Instance type operational driver=lxc features="map[]" type=container
    ERROR [2024-04-21T08:08:14+08:00] Unable to run feature checks during QEMU initialization: Unable to locate the file for firmware "OVMF_CODE.4MB.fd"
    WARNING[2024-04-21T08:08:14+08:00] Instance type not operational driver=qemu err="QEMU failed to run feature checks" type=virtual-machine
    INFO [2024-04-21T08:08:14+08:00] Initializing local database
    DEBUG [2024-04-21T08:08:14+08:00] Refreshing identity cache with local trusted certificates
    INFO [2024-04-21T08:08:14+08:00] Set client certificate to server certificate fingerprint=7bfa6d5710e943f5f23524bcca9f0a51bb5f58f819d1b9fb3e1d843facc0a20b
    DEBUG [2024-04-21T08:08:14+08:00] Initializing database gateway
    INFO [2024-04-21T08:08:14+08:00] Starting database node id=1 local=1 role=voter
    ERROR [2024-04-21T08:08:14+08:00] Failed to start the daemon err="Failed to start dqlite server: raft_start(): io: load closed segment 0000000000185550-0000000000185550: entries batch 45 starting at byte 487448: entries count in preamble is zero"
    INFO [2024-04-21T08:08:14+08:00] Starting shutdown sequence signal=interrupt
    INFO [2024-04-21T08:08:14+08:00] Not unmounting temporary filesystems (instances are still running)
    INFO [2024-04-21T08:08:14+08:00] Daemon stopped
    Error: Failed to start dqlite server: raft_start(): io: load closed segment 0000000000185550-0000000000185550: entries batch 45 starting at byte 487448: entries count in preamble is zero

The key part of the error is an I/O failure while loading the database segment 0000000000185550-0000000000185550.

That file lives in /var/snap/lxd/common/lxd/database/global/.

I deleted 0000000000185550-0000000000185550 and every database file numbered after it in that directory. Back up the directory before deleting anything.

After that, sudo lxc list no longer failed, but every container was in the stopped state and none of them could be started. Checking the LXD storage gave another error:

    $ sudo lxc storage list
    Error: Required tool 'zpool' is missing

Yet zpool was installed.

I checked the LXD version:

    $ snap list lxd
    Name  Version         Rev    Tracking     Publisher   Notes
    lxd   5.21.1-98dad8f  28323  5.21/stable  canonical✓  -

It was already the latest version, 5.21.1, so I tried downgrading and then upgrading again:

    $ sudo snap refresh lxd --channel=5.20/stable
    $ sudo snap refresh lxd --channel=5.21/stable

That solved the problem:

    $ sudo lxc storage list
    +------+--------+----------------------------------------+-------------+---------+---------+
    | NAME | DRIVER |                 SOURCE                 | DESCRIPTION | USED BY |  STATE  |
    +------+--------+----------------------------------------+-------------+---------+---------+
    | lxd  | zfs    | /var/snap/lxd/common/lxd/disks/lxd.img |             | 30      | CREATED |
    +------+--------+----------------------------------------+-------------+---------+---------+

All containers started normally, with almost no data lost. My guess is that LXD simply needed to be reinstalled. While it was downgraded, the LXD debug output reported this error:

    Error: Failed to initialize global database: failed to ensure schema: schema version '73' is more recent than expected '69'

That is presumably a version mismatch: the database had already been written by 5.21 and the older 5.20 could not read it. After upgrading back, all the errors were gone.

To sum up: ZFS is a fine filesystem, but back up regularly, set up disk redundancy, and avoid mechanical hard disks where you can.

I am recording this here in the hope that it helps you.

References:

 - Error: LXD unix socket "/var/snap/lxd/common/lxd/unix.socket" not accessible: Get "http://unix.socket/1.0": dial unix /var/snap/lxd/common/lxd/unix.socket: connect: resource temporarily unavailable
 - Permanent errors have been detected in the following files: #9705
 - Clear a permanent ZFS error in a healthy pool
 - Ubuntu 22.04, LXD 5.0.2 - "Required tool 'zpool' is missing" after apt upgrade
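
For the step where the damaged database segment and everything numbered after it were removed, it can help to list the candidate files before touching anything. The following is only a rough sketch and not an LXD tool: `segments_from` is a hypothetical helper name, and it assumes the segment files are named `<start>-<end>` in zero-padded decimal, as the name 0000000000185550-0000000000185550 in the error message suggests. It prints names and deletes nothing, and it ignores any file in the directory that does not match that naming.

```shell
#!/bin/sh
# segments_from DIR FIRST
# Hypothetical helper: print the names of segment files in DIR whose starting
# index is greater than or equal to FIRST. It only lists; it deletes nothing.
# Assumption: files are named <start>-<end> in zero-padded decimal, like
# 0000000000185550-0000000000185550 in the error message.
segments_from() {
    dir=$1
    first=$2
    for path in "$dir"/[0-9]*-[0-9]*; do
        [ -e "$path" ] || continue      # the glob matched nothing
        name=${path##*/}                # strip the directory part
        start=${name%%-*}               # text before the first '-'
        end=${name#*-}                  # text after the first '-'
        # Skip anything that is not purely <digits>-<digits>.
        case $start$end in
            *[!0-9]*) continue ;;
        esac
        if [ "$start" -ge "$first" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

Called as segments_from /var/snap/lxd/common/lxd/database/global 0000000000185550 (as root, with LXD stopped), it would print the files to review. Copying the whole directory elsewhere first, and moving the listed files aside rather than deleting them, keeps the step reversible.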