
Ceph mon fails to start

Version

Ceph 10.2.11

Problem

After the server was rebooted, checking the Ceph status produced the following errors:

[root@ceph-node1 ~]# ceph -s
2021-08-10 08:40:38.044398 7f62381d8700 0 -- :/851362361 >> 192.168.1.128:6789/0 pipe(0x7f6234063c00 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f623405c270).fault
2021-08-10 08:40:41.050401 7f6230ff9700 0 -- :/851362361 >> 192.168.1.128:6789/0 pipe(0x7f6228000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6228001f90).fault
2021-08-10 08:40:44.062299 7f62381d8700 0 -- :/851362361 >> 192.168.1.128:6789/0 pipe(0x7f62280052b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6228006570).fault
2021-08-10 08:40:47.071908 7f6230ff9700 0 -- :/851362361 >> 192.168.1.128:6789/0 pipe(0x7f6228000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6228002410).fault
2021-08-10 08:40:50.080834 7f62381d8700 0 -- :/851362361 >> 192.168.1.128:6789/0 pipe(0x7f62280052b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6228002f60).fault
2021-08-10 08:40:53.097169 7f6230ff9700 0 -- :/851362361 >> 192.168.1.128:6789/0 pipe(0x7f6228000c80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f62280036d0).fault

Restarting the ceph-mon service did not help; the service still failed to start:

[root@ceph-node1 ~]# systemctl status ceph-mon@ceph-node1.service 
● ceph-mon@ceph-node1.service - Ceph cluster monitor daemon
Loaded: loaded (/usr/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Tue 2021-08-10 08:47:04 CST; 35min ago
Main PID: 2084 (code=killed, signal=ABRT)

Aug 10 08:46:53 ceph-node1 ceph-mon[2084]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aug 10 08:46:53 ceph-node1 systemd[1]: ceph-mon@ceph-node1.service: main process exited, code=killed, status=6/ABRT
Aug 10 08:46:53 ceph-node1 systemd[1]: Unit ceph-mon@ceph-node1.service entered failed state.
Aug 10 08:46:53 ceph-node1 systemd[1]: ceph-mon@ceph-node1.service failed.
Aug 10 08:47:04 ceph-node1 systemd[1]: ceph-mon@ceph-node1.service holdoff time over, scheduling restart.
Aug 10 08:47:04 ceph-node1 systemd[1]: Stopped Ceph cluster monitor daemon.
Aug 10 08:47:04 ceph-node1 systemd[1]: start request repeated too quickly for ceph-mon@ceph-node1.service
Aug 10 08:47:04 ceph-node1 systemd[1]: Failed to start Ceph cluster monitor daemon.
Aug 10 08:47:04 ceph-node1 systemd[1]: Unit ceph-mon@ceph-node1.service entered failed state.
Aug 10 08:47:04 ceph-node1 systemd[1]: ceph-mon@ceph-node1.service failed.
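
The status output shows Result: start-limit, which means systemd has stopped accepting start requests for the unit after too many rapid failures. A minimal sketch of clearing that state before retrying is shown below. The run helper, the DRY_RUN variable and the retry_mon_start function are hypothetical names introduced for this example; with DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Unit name taken from the status output above.
UNIT="ceph-mon@ceph-node1.service"

# DRY_RUN=1 (the default in this sketch) only prints each command;
# set DRY_RUN=0 to actually execute them as root.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

retry_mon_start() {
    # Clear the failed state so systemd accepts a new start request.
    run systemctl reset-failed "$UNIT"
    run systemctl start "$UNIT"
    # Show the most recent log lines of the unit to see why it aborted.
    run journalctl -u "$UNIT" -n 50 --no-pager
}

retry_mon_start
```

Clearing the failed state only lets systemd try again. If the monitor store itself is damaged, the daemon will abort again with the same error.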

Starting the daemon directly from the command line reported the following error:

[root@ceph-node1 ~]# /usr/bin/ceph-mon -f --cluster ceph --id ceph-node1 --setuser ceph --setgroup ceph
starting mon.ceph-node1 rank 0 at 192.168.1.128:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph-node1 fsid ecc704c4-0f19-4ab1-9738-6a2658fa2387
mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f18f9e58780 time 2021-08-10 09:22:50.632707
mon/AuthMonitor.cc: 163: FAILED assert(ret == 0)
ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x55fedf0f6fe5]
2: (AuthMonitor::update_from_paxos(bool*)+0x1953) [0x55fedee20073]
3: (PaxosService::refresh(bool*)+0x1a5) [0x55feded30735]
4: (Monitor::refresh_from_paxos(bool*)+0x15b) [0x55fedecc743b]
5: (Monitor::init_paxos()+0x95) [0x55fedecc78d5]
6: (Monitor::preinit()+0x949) [0x55fedecda489]
7: (main()+0x242d) [0x55fedec1a24d]
8: (__libc_start_main()+0xf5) [0x7f18f7039555]
9: (()+0x2df77f) [0x55fedecb877f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2021-08-10 09:22:50.634298 7f18f9e58780 -1 mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f18f9e58780 time 2021-08-10 09:22:50.632707
mon/AuthMonitor.cc: 163: FAILED assert(ret == 0)

Recovery

The fix was to rebuild the ceph-mon on this node:

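cd /var/lib/ceph/mon/
# Fetch the current monmap; this needs a reachable monitor quorum
ceph mon getmap -o /tmp/monmap
# Save the mon keyring before the data directory is removed
cp -a ceph-ceph-node1/keyring /tmp/
# Remove the damaged mon store
rm -rf ceph-ceph-node1/
# Re-create the mon store from the saved monmap and keyring
ceph-mon --cluster ceph --id ceph-node1 --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
# The data directory is named <cluster>-<id>, i.e. ceph-ceph-node1, and must be owned by ceph
chown -R ceph:ceph ceph-ceph-node1/
systemctl start ceph-mon@ceph-node1.service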
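
The rebuild steps can also be wrapped in a small script. The sketch below is one more cautious variant, not the exact procedure used here. It assumes the other monitors still form a quorum, so that ceph mon getmap can succeed, and it assumes no earlier backup exists at the .bak path. It differs from the original steps in three ways: it stops the unit first, it moves the old store aside instead of deleting it, and it clears the systemd failed state before starting. The names run, rebuild_mon, DRY_RUN, MON_ID and CLUSTER are hypothetical and exist only in this example.

```shell
#!/bin/sh
set -eu

# Values from this post; override through the environment if needed.
MON_ID="${MON_ID:-ceph-node1}"
CLUSTER="${CLUSTER:-ceph}"
MON_DIR="/var/lib/ceph/mon/${CLUSTER}-${MON_ID}"
UNIT="ceph-mon@${MON_ID}.service"

# DRY_RUN=1 (the default in this sketch) only prints each command;
# set DRY_RUN=0 to actually execute them as root.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_mon() {
    # Make sure the crashed daemon is not being restarted underneath us.
    run systemctl stop "$UNIT"
    # Needs a working quorum among the remaining monitors.
    run ceph mon getmap -o /tmp/monmap
    # Keep the mon keyring, then move the damaged store aside as a backup.
    run cp -a "${MON_DIR}/keyring" /tmp/keyring
    run mv "${MON_DIR}" "${MON_DIR}.bak"
    # Re-create the store and hand it back to the ceph user.
    run ceph-mon --cluster "$CLUSTER" --id "$MON_ID" --mkfs \
        --monmap /tmp/monmap --keyring /tmp/keyring
    run chown -R ceph:ceph "${MON_DIR}"
    # Clear the start-limit failure before starting the unit again.
    run systemctl reset-failed "$UNIT"
    run systemctl start "$UNIT"
}

rebuild_mon
```

Run it once with the default DRY_RUN=1 and review the printed commands, then rerun with DRY_RUN=0. Once the monitor has rejoined the quorum and ceph -s looks healthy, the .bak directory can be removed.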