Root-Cause Review of Seven Abnormal VPS Nodes

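Conclusion: the seven nodes do not share a single cause. Five of them (HK, KR, KRB, TK, US) show the same pattern: a leftover IP-Sentinel service (ip-sentinel-agent-daemon) exits repeatedly and floods the journal. AR shows short-lived Hermes gateway failure logs, and HKA has only a few historical Hermes gateway failure entries. On every node that could be reached, nginx and the core services show no overall outage, and memory and disk usage are normal.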

Nodes reviewed: 7
Nodes sharing the main cause (IP-Sentinel leftover): 5
Currently failed systemd units: 0
Confirmed OOM or disk-full events: 0

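Root-cause groups

IP-Sentinel leftover service restarting in a loop (5 nodes): HK, KR, KRB, TK, US
Hermes gateway short-lived or historical failures (2 nodes): AR, HKA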

Per-node summary

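| Node | IP | Status | Assessment | 24h keyword hits | Memory (used/total) | Root disk (used/total) |
|------|----|--------|------------|------------------|---------------------|------------------------|
| AR | 129.146.59.53 | Observe | Hermes gateway short-lived failure logs (core services currently running) | 27 | 4200/23974 (18%) | 59G/116G (51%) |
| KR | 131.186.27.212 | Action needed | Leftover IP-Sentinel service exits repeatedly and floods the logs | 32947 | 477/954 (50%) | 4.4G/48G (10%) |
| HK | 82.158.88.91 | Action needed | Leftover IP-Sentinel service exits repeatedly and floods the logs | 32944 | 453/1967 (23%) | 11G/39G (27%) |
| TK | 103.232.213.10 | Action needed | Leftover IP-Sentinel service exits repeatedly and floods the logs | 13338 | 263/957 (27%) | 8.8G/20G (47%) |
| KRB | 161.118.130.5 | Action needed | Leftover IP-Sentinel service exits repeatedly and floods the logs | 32940 | 441/954 (46%) | 3.1G/45G (8%) |
| US | 186.244.244.52 | Action needed | Leftover IP-Sentinel service exits repeatedly and floods the logs | 33678 | 971/3915 (25%) | 12G/29G (42%) |
| HKA | 38.76.188.244 | Observe | Hermes gateway short-lived failure logs (core services currently running) | 2 | 764/3915 (20%) | 13G/29G (43%) |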

Per-node evidence

AR 129.146.59.53

Assessment: Hermes gateway short-lived failure logs (core services currently running)
System/resources: Ubuntu 24.04.4 LTS; memory 4200/23974 (18%); root disk 59G/116G (51%)
Service status: xray:not-found/-; nginx:loaded/active; hermes-gateway:loaded/active; docker:loaded/active; ip-sentinel-agent-daemon:not-found/-; site_total:not-found/-
Failed units: none
24-hour keyword hits: 27

Main log patterns (occurrence count, then the normalized line):

  3 May N TIME instance-N-N python[N]: Traceback (most recent call last):
  2 May N TIME instance-N-N systemd[N]: hermes-gateway.service: Main process exited, code=exited, status=N/FAILURE
  2 May N TIME instance-N-N systemd[N]: hermes-gateway.service: Failed with result 'exit-code'.
  2 May N TIME instance-N-N systemd[N]: hermes-gateway-health.service: Failed with result 'exit-code'.
  2 May N TIME instance-N-N python[N]: The above exception was the direct cause of the following exception:
  2 May N TIME instance-N-N python[N]: File "/root/.hermes/hermes-agent/venv/lib/pythonN.N/site-packages/httpx/_transports/default.py", line N, in map_httpcore_exceptions
  2 May N TIME instance-N-N python[N]: self.gen.throw(typ, value, traceback)
  1 May N TIME instance-N-N systemd[N]: hermes-gateway-stock.service: Main process exited, code=exited, status=N/FAILURE
  1 May N TIME instance-N-N systemd[N]: hermes-gateway-stock.service: Failed with result 'exit-code'.
  1 May N TIME instance-N-N systemd[N]: hermes-gateway-news.service: Main process exited, code=exited, status=N/FAILURE
  1 May N TIME instance-N-N systemd[N]: hermes-gateway-news.service: Failed with result 'exit-code'.
  1 May N TIME instance-N-N systemd[N]: hermes-gateway-network.service: Main process exited, code=exited, status=N/FAILURE
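
The patterns above look like journal lines with timestamps and digit runs collapsed, then counted. The report does not say how this was done, so the following is only a sketch of one way to reproduce that shape. The regular expressions are assumptions inferred from the output, and the sample lines (dates, PIDs, hostnames) are made up for illustration.

```python
import re
from collections import Counter

# Assumed normalisation rules, inferred from the look of the patterns in this report.
_TIME = re.compile(r"\b\d{2}:\d{2}:\d{2}\b")
_DIGITS = re.compile(r"\d+")


def normalize(line: str) -> str:
    """Collapse HH:MM:SS to TIME, then every remaining digit run to N."""
    line = _TIME.sub("TIME", line.strip())
    return _DIGITS.sub("N", line)


def top_patterns(lines, limit=12):
    """Return (count, pattern) pairs, most frequent first."""
    counts = Counter(normalize(l) for l in lines if l.strip())
    return [(n, p) for p, n in counts.most_common(limit)]


# Hypothetical sample input, not taken from any of the reviewed nodes.
sample = [
    "May 12 03:14:07 instance-2-1 systemd[1]: hermes-gateway.service: Failed with result 'exit-code'.",
    "May 12 03:19:55 instance-2-1 systemd[1]: hermes-gateway.service: Failed with result 'exit-code'.",
    "May 12 03:14:07 instance-2-1 python[812]: Traceback (most recent call last):",
]

for count, pattern in top_patterns(sample):
    print(count, pattern)
```

Collapsing digits before counting is what lets thousands of restart messages with different PIDs and timestamps fall into a single bucket.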

KR 131.186.27.212

Assessment: leftover IP-Sentinel service exits repeatedly and floods the logs
System/resources: Ubuntu 24.04.4 LTS; memory 477/954 (50%); root disk 4.4G/48G (10%)
Service status: xray:loaded/active; nginx:loaded/active; hermes-gateway:not-found/-; docker:not-found/-; ip-sentinel-agent-daemon:loaded/activating; site_total:not-found/-
Failed units: none
24-hour keyword hits: 32947

Main log patterns:

  16470 May N TIME instance-N-N systemd[N]: ip-sentinel-agent-daemon.service: Main process exited, code=exited, status=N/n/a
  16470 May N TIME instance-N-N systemd[N]: ip-sentinel-agent-daemon.service: Failed with result 'exit-code'.
  6 May N TIME instance-N-N pythonN[N]: /usr/bin/apt-key: N: cannot create /dev/null: Permission denied
  1 May N TIME instance-N-N xray[N]: N/N/N TIME.N from N.N.N.N:N accepted tcp:zoom.us:N [z_direct_outbound] email: NcfdN-vless_reality_vision
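
To get a feel for how tight the restart loop is, the exit counts can be turned into an average interval. This is a back-of-the-envelope sketch, assuming each "Main process exited" line is one restart cycle and that the counts cover a full 24 hours. The figures are the ip-sentinel-agent-daemon exit counts from the per-node log patterns in this report.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# "Main process exited" counts per node, copied from this report's log patterns.
exit_counts = {"KR": 16470, "HK": 16472, "TK": 6669, "KRB": 16469, "US": 16466}


def implied_interval(exits_in_24h: int) -> float:
    """Average seconds between exits, assuming evenly spread restarts."""
    return SECONDS_PER_DAY / exits_in_24h


for node, exits in exit_counts.items():
    print(f"{node}: one exit roughly every {implied_interval(exits):.1f} s")
```

Under those assumptions KR, HK, KRB, and US fail about every 5.2 seconds and TK about every 13 seconds, which would be consistent with a fixed restart delay rather than sporadic crashes. The actual unit settings were not inspected here.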

HK 82.158.88.91

Assessment: leftover IP-Sentinel service exits repeatedly and floods the logs
System/resources: Ubuntu 24.04.4 LTS; memory 453/1967 (23%); root disk 11G/39G (27%)
Service status: xray:loaded/active; nginx:loaded/active; hermes-gateway:not-found/-; docker:not-found/-; ip-sentinel-agent-daemon:loaded/activating; site_total:not-found/-
Failed units: none
24-hour keyword hits: 32944

Main log patterns:

  16472 May N TIME serqNtNzyrkNxk systemd[N]: ip-sentinel-agent-daemon.service: Main process exited, code=exited, status=N/n/a
  16472 May N TIME serqNtNzyrkNxk systemd[N]: ip-sentinel-agent-daemon.service: Failed with result 'exit-code'.

TK 103.232.213.10

Assessment: leftover IP-Sentinel service exits repeatedly and floods the logs
System/resources: Ubuntu 22.04.5 LTS; memory 263/957 (27%); root disk 8.8G/20G (47%)
Service status: xray:loaded/active; nginx:loaded/active; hermes-gateway:not-found/-; docker:not-found/-; ip-sentinel-agent-daemon:loaded/activating; site_total:not-found/-
Failed units: none
24-hour keyword hits: 13338

Main log patterns:

  6669 May N TIME jpproN-N systemd[N]: ip-sentinel-agent-daemon.service: Main process exited, code=exited, status=N/n/a
  6669 May N TIME jpproN-N systemd[N]: ip-sentinel-agent-daemon.service: Failed with result 'exit-code'.

KRB 161.118.130.5

Assessment: leftover IP-Sentinel service exits repeatedly and floods the logs
System/resources: Ubuntu 24.04.4 LTS; memory 441/954 (46%); root disk 3.1G/45G (8%)
Service status: xray:not-found/-; nginx:loaded/active; hermes-gateway:not-found/-; docker:not-found/-; ip-sentinel-agent-daemon:loaded/activating; site_total:not-found/-
Failed units: none
24-hour keyword hits: 32940

Main log patterns:

  16469 May N TIME instance-N-N systemd[N]: ip-sentinel-agent-daemon.service: Main process exited, code=exited, status=N/n/a
  16469 May N TIME instance-N-N systemd[N]: ip-sentinel-agent-daemon.service: Failed with result 'exit-code'.
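
The service-status strings in these sections follow a `name:load/active` layout, so a unit that is installed but not settled in `active` can be picked out mechanically. A minimal sketch follows. The function names are illustrative, and the "loaded but not active" rule is an assumption about what deserves a second look, not the rule used for this review.

```python
def parse_service_states(line: str) -> dict:
    """Parse 'name:load/active; ...' into {name: (load_state, active_state)}."""
    states = {}
    for item in line.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, state = item.partition(":")
        load, _, active = state.partition("/")
        states[name.strip()] = (load.strip(), active.strip())
    return states


def needs_attention(states: dict) -> list:
    """Units that are installed (loaded) but not in the 'active' state."""
    return [name for name, (load, active) in states.items()
            if load == "loaded" and active != "active"]


# Status string as reported for KRB in this review.
krb = ("xray:not-found/-; nginx:loaded/active; hermes-gateway:not-found/-; "
       "docker:not-found/-; ip-sentinel-agent-daemon:loaded/activating; "
       "site_total:not-found/-")

print(needs_attention(parse_service_states(krb)))
```

For KRB this flags only ip-sentinel-agent-daemon, whose `activating` state fits a unit caught between restarts. Units reported as `not-found` are skipped because they are not installed on that node.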

US 186.244.244.52

Assessment: leftover IP-Sentinel service exits repeatedly and floods the logs
System/resources: Ubuntu 24.04.4 LTS; memory 971/3915 (25%); root disk 12G/29G (42%)
Service status: xray:not-found/-; nginx:loaded/active; hermes-gateway:not-found/-; docker:loaded/active; ip-sentinel-agent-daemon:loaded/activating; site_total:loaded/inactive
Failed units: none
24-hour keyword hits: 33678

Main log patterns:

  16466 May N TIME serN systemd[N]: ip-sentinel-agent-daemon.service: Main process exited, code=exited, status=N/n/a
  16466 May N TIME serN systemd[N]: ip-sentinel-agent-daemon.service: Failed with result 'exit-code'.
  370 May N TIME serN systemd[N]: site_total.service: Main process exited, code=exited, status=N/EXEC
  370 May N TIME serN systemd[N]: site_total.service: Failed with result 'exit-code'.
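
The `EXEC` status on site_total.service means systemd could not execute the configured command, which usually points at a missing or non-executable ExecStart target. If the service turns out to be still needed, a check along these lines could confirm that. It is a sketch only: the unit text below is hypothetical (the real unit file was not captured in this review), and the parser ignores line continuations and drop-in overrides.

```python
import os
import shlex


def exec_start_commands(unit_text: str) -> list:
    """Return the executable path of every ExecStart= line in a unit file."""
    paths = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line.startswith("ExecStart="):
            continue
        value = line[len("ExecStart="):].strip()
        if not value:
            # An empty ExecStart= resets previously listed commands in systemd.
            paths.clear()
            continue
        # Drop systemd's special executable prefixes (@ - : + !).
        value = value.lstrip("@-:+!")
        paths.append(shlex.split(value)[0])
    return paths


def missing_or_not_executable(paths) -> list:
    """Paths that are not regular files or lack the execute bit."""
    return [p for p in paths if not (os.path.isfile(p) and os.access(p, os.X_OK))]


# Hypothetical unit text for illustration; path and flags are invented.
unit = """[Service]
ExecStart=/opt/site_total/run.sh --port 8080
Restart=always
"""

print(exec_start_commands(unit))
```

On the node itself, the text would come from `systemctl cat site_total.service`, and `missing_or_not_executable` would be run against the extracted paths.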

HKA 38.76.188.244

Assessment: Hermes gateway short-lived failure logs (core services currently running)
System/resources: Ubuntu 24.04.4 LTS; memory 764/3915 (20%); root disk 13G/29G (43%)
Service status: xray:not-found/-; nginx:loaded/active; hermes-gateway:not-found/-; docker:loaded/active; ip-sentinel-agent-daemon:not-found/-; site_total:not-found/-
Failed units: none
24-hour keyword hits: 2

Main log patterns:

  1 May N TIME serN systemd[N]: hermes-gateway.service: Main process exited, code=exited, status=N/TEMPFAIL
  1 May N TIME serN systemd[N]: hermes-gateway.service: Failed with result 'exit-code'.

Recommendations

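  1. Deal first with the leftover IP-Sentinel service on the five nodes that share the cause: back up the unit state, stop whatever keeps respawning the leftover service, run daemon-reload and reset-failed, then verify that the 24-hour log volume no longer balloons.
  2. On US, separately confirm whether site_total.service has been retired. If it has, treat it as a leftover service too; if it is still needed, fix its Exec path.
  3. AR and HKA should not be handled as IP-Sentinel cases for now; just watch whether the short-lived Hermes gateway failures recur.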
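
The first recommendation can be laid out as an ordered command plan. The sketch below assumes the respawn loop is driven only by the systemd unit itself; if a timer, cron job, or watchdog script also restarts it, that would need separate handling. It defaults to a dry run that only prints the commands, in keeping with the read-only nature of this review.

```python
import shlex
import subprocess

UNIT = "ip-sentinel-agent-daemon.service"


def cleanup_plan(unit: str = UNIT) -> list:
    """Ordered commands for recommendation 1; each entry is an argv list."""
    return [
        # 1. Capture the unit definition and current state (save the output as a backup).
        ["systemctl", "cat", unit],
        ["systemctl", "status", unit, "--no-pager"],
        # 2. Stop the respawn loop and keep the unit from starting at boot.
        ["systemctl", "disable", "--now", unit],
        # 3. Reload unit files and clear the failed-state bookkeeping.
        ["systemctl", "daemon-reload"],
        ["systemctl", "reset-failed", unit],
        # 4. Verification: re-run later and confirm the entries stop accumulating.
        ["journalctl", "-u", unit, "--since", "24 hours ago", "--no-pager", "-q"],
    ]


def run(plan, execute: bool = False) -> None:
    """Print each command; run it only when execute=True."""
    for argv in plan:
        print("+", shlex.join(argv))
        if execute:
            # check=False: `systemctl status` exits non-zero for inactive units.
            subprocess.run(argv, check=False)


run(cleanup_plan())  # dry run: prints the commands only
```

Passing `execute=True` would actually change the node, so that step belongs to a separate, approved maintenance window.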

This review was read-only: no remote service was restarted, and no remote configuration was modified or deleted.