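To help us locate your problem faster, please provide the following information. Thank you.
【Details】A large number of warning messages on the FE nodes
【Background】Upgraded from 3.0 to 3.1
【Business impact】The FE Leader node may hang (become unresponsive); other risks are unknown
【Shared-data (storage-compute separation)】No
【StarRocks version】3.1.7
【Cluster size】5 FE (3 followers + 2 observers) + 3 BE
【Machine info】CPU virtual cores / memory / NIC: 16C / 64G / 10 GbE
【Contact】363698476@qq.com
【Attachments】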
2024-01-30 15:43:00,063 WARN (thrift-server-pool-1646|1909) [LeaderImpl.finishTask():239] task type: CLONE, status_code: RUNTIME_ERROR, clone failed., backendId: 10003, signature: 15765959
2024-01-30 15:43:20,362 WARN (thrift-server-pool-1656|1945) [LeaderImpl.finishTask():191] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:172.18.211.28, be_port:9060, http_port:8041), task_type:CLONE, signature:15765959, task_status:TStatus(status_code:RUNTIME_ERROR, error_msgs:[clone failed.]))
2024-01-30 15:43:20,362 WARN (thrift-server-pool-1656|1945) [LeaderImpl.finishTask():239] task type: CLONE, status_code: RUNTIME_ERROR, clone failed., backendId: 10003, signature: 15765959
2024-01-30 15:43:40,764 WARN (thrift-server-pool-1650|1933) [LeaderImpl.finishTask():191] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:172.18.211.28, be_port:9060, http_port:8041), task_type:CLONE, signature:15765959, task_status:TStatus(status_code:RUNTIME_ERROR, error_msgs:[clone failed.]))
2024-01-30 15:43:40,764 WARN (thrift-server-pool-1650|1933) [LeaderImpl.finishTask():239] task type: CLONE, status_code: RUNTIME_ERROR, clone failed., backendId: 10003, signature: 15765959
2024-01-30 15:44:01,031 WARN (thrift-server-pool-1646|1909) [LeaderImpl.finishTask():191] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:172.18.211.28, be_port:9060, http_port:8041), task_type:CLONE, signature:15765959, task_status:TStatus(status_code:RUNTIME_ERROR, error_msgs:[clone failed.]))
2024-01-30 15:44:01,031 WARN (thrift-server-pool-1646|1909) [LeaderImpl.finishTask():239] task type: CLONE, status_code: RUNTIME_ERROR, clone failed., backendId: 10003, signature: 15765959
2024-01-30 15:44:21,350 WARN (thrift-server-pool-1656|1945) [LeaderImpl.finishTask():191] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:172.18.211.28, be_port:9060, http_port:8041), task_type:CLONE, signature:15765959, task_status:TStatus(status_code:RUNTIME_ERROR, error_msgs:[clone failed.]))
2024-01-30 15:44:21,350 WARN (thrift-server-pool-1656|1945) [LeaderImpl.finishTask():239] task type: CLONE, status_code: RUNTIME_ERROR, clone failed., backendId: 10003, signature: 15765959
2024-01-30 15:44:29,976 WARN (starrocks-mysql-nio-pool-37|1696) [ReadListener.lambda$handleEvent$0():81] Exception happened in one session(com.starrocks.mysql.nio.NConnectContext@e235708).
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
at sun.nio.ch.IOUtil.read(IOUtil.java:245) ~[?:?]
at sun.nio.ch.IOUtil.read(IOUtil.java:223) ~[?:?]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:358) ~[?:?]
at org.xnio.nio.NioSocketConduit.read(NioSocketConduit.java:289) ~[xnio-nio-3.8.10.Final.jar:3.8.10.Final]
at org.xnio.conduits.ConduitStreamSourceChannel.read(ConduitStreamSourceChannel.java:127) ~[xnio-api-3.8.10.Final.jar:3.8.10.Final]
at org.xnio.channels.Channels.readBlocking(Channels.java:344) ~[xnio-api-3.8.10.Final.jar:3.8.10.Final]
at com.starrocks.mysql.nio.NMysqlChannel.realNetRead(NMysqlChannel.java:53) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.MysqlChannel.readAllPlain(MysqlChannel.java:162) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.MysqlChannel.readAll(MysqlChannel.java:155) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.MysqlChannel.fetchOnePacket(MysqlChannel.java:186) ~[starrocks-fe.jar:?]
at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:745) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:69) ~[starrocks-fe.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) ~[?:?]
2024-01-30 15:44:35,402 WARN (starrocks-mysql-nio-pool-37|1696) [ReadListener.lambda$handleEvent$0():81] Exception happened in one session(com.starrocks.mysql.nio.NConnectContext@237e3a4e).
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
at sun.nio.ch.IOUtil.read(IOUtil.java:245) ~[?:?]
at sun.nio.ch.IOUtil.read(IOUtil.java:223) ~[?:?]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:358) ~[?:?]
at org.xnio.nio.NioSocketConduit.read(NioSocketConduit.java:289) ~[xnio-nio-3.8.10.Final.jar:3.8.10.Final]
at org.xnio.conduits.ConduitStreamSourceChannel.read(ConduitStreamSourceChannel.java:127) ~[xnio-api-3.8.10.Final.jar:3.8.10.Final]
at org.xnio.channels.Channels.readBlocking(Channels.java:344) ~[xnio-api-3.8.10.Final.jar:3.8.10.Final]
at com.starrocks.mysql.nio.NMysqlChannel.realNetRead(NMysqlChannel.java:53) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.MysqlChannel.readAllPlain(MysqlChannel.java:162) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.MysqlChannel.readAll(MysqlChannel.java:155) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.MysqlChannel.fetchOnePacket(MysqlChannel.java:186) ~[starrocks-fe.jar:?]
at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:745) ~[starrocks-fe.jar:?]
at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:69) ~[starrocks-fe.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) ~[?:?]
2024-01-30 15:44:41,636 WARN (thrift-server-pool-1652|1935) [LeaderImpl.finishTask():191] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:172.18.211.28, be_port:9060, http_port:8041), task_type:CLONE, signature:15765959, task_status:TStatus(status_code:RUNTIME_ERROR, error_msgs:[clone failed.]))
2024-01-30 15:44:41,636 WARN (thrift-server-pool-1652|1935) [LeaderImpl.finishTask():239] task type: CLONE, status_code: RUNTIME_ERROR, clone failed., backendId: 10003, signature: 15765959
Corresponding BE node log:
W0130 16:35:56.820621 24251 tablet_updates.cpp:2147] remove_expired_versions failed, tablet updates is in error state: tablet:15765959 _apply_rowset_commit error: load primary index failed: Already exist: insert found duplicate key new(rssid=9874 rowid=87423) old(rssid=9873 rowid=87423) key=BCBP.USROBBINS SPENCER B.39807200%8 [424342502E55530000524F4242494E53205350454E43455220422E000033393830370000323030000080258338]
/build/starrocks/be/src/storage/primary_index.cpp:771 get_index_by_length(keys[idx_begin].size)->insert(rssid, rowids, pks, idx_begin, i) tablet:15765959 #version:2 [5547 5547@0 5548] pending: rowsets:9
9867 [seg:1 row:793493 del:793493 bytes:17631259 compaction:338960492 partial_update_by_column:false]
9868 [seg:1 row:793493 del:793493 bytes:17090517 compaction:336797524 partial_update_by_column:false]
9869 [seg:1 row:793493 del:793493 bytes:17182266 compaction:337164520 partial_update_by_column:false]
9870 [seg:1 row:793493 del:793493 bytes:16933073 compaction:336167748 partial_update_by_column:false]
9871 [seg:1 row:793493 del:793493 bytes:17222408 compaction:337325088 partial_update_by_column:false]
9872 [seg:1 row:793493 del:793493 bytes:17158764 compaction:337070512 partial_update_by_column:false]
9873 [seg:1 row:793493 del:793492 bytes:17287551 compaction:337585550 partial_update_by_column:false]
9874 [seg:1 row:793493 del:0 bytes:16924658 compaction:251510798 partial_update_by_column:false]
9875 [seg:1 row:793493 del:0 bytes:16993456 compaction:251442000 partial_update_by_column:false]
I0130 16:35:56.820638 24251 engine_clone_task.cpp:992] Loaded snapshot of tablet 15765959, removing directory /data/storage/data/375/15765959/381052829/clone
I0130 16:35:56.823736 24251 tablet_manager.cpp:891] Reporting tablet info. tablet_id=15765959
W0130 16:35:56.825347 24251 engine_clone_task.cpp:348] Fail to clone tablet. tablet_id:15765959, schema_hash:381052829, signature:15765959, version:5548, expected_version: 6920
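However, running show proc '/dbs/10170/15765953' does not return any table for this table ID.
Temporary workaround: recover the dropped table with RECOVER, then delete it again with DROP ... FORCE. After that, this error no longer appears.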
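To check whether the workaround has actually stopped the retries, it can help to count the failed CLONE reports per backend and tablet signature in the FE warning log. The following is only a minimal sketch: the regular expression is written against the `LeaderImpl.finishTask():239` lines shown in the FE log above, the function name `count_clone_failures` is hypothetical, and the log file name in the comment is an assumption about a typical deployment.

```python
import re
from collections import Counter

# Matches the one-line summary emitted at LeaderImpl.finishTask():239,
# as it appears in the FE log excerpt in this post.
CLONE_FAIL = re.compile(
    r"task type: CLONE, status_code: (?P<status>\w+), .*?"
    r"backendId: (?P<backend>\d+), signature: (?P<signature>\d+)"
)


def count_clone_failures(lines):
    """Count failed CLONE reports per (backendId, signature) pair."""
    counts = Counter()
    for line in lines:
        match = CLONE_FAIL.search(line)
        if match:
            key = (int(match["backend"]), int(match["signature"]))
            counts[key] += 1
    return counts


# Three lines copied from the FE log above; only the ():239 lines are counted,
# so the ():191 request dump does not inflate the total.
sample = [
    "2024-01-30 15:43:00,063 WARN (thrift-server-pool-1646|1909) [LeaderImpl.finishTask():239] task type: CLONE, status_code: RUNTIME_ERROR, clone failed., backendId: 10003, signature: 15765959",
    "2024-01-30 15:43:20,362 WARN (thrift-server-pool-1656|1945) [LeaderImpl.finishTask():191] finish task reports bad. request: TFinishTaskRequest(backend:TBackend(host:172.18.211.28, be_port:9060, http_port:8041), task_type:CLONE, signature:15765959, task_status:TStatus(status_code:RUNTIME_ERROR, error_msgs:[clone failed.]))",
    "2024-01-30 15:43:20,362 WARN (thrift-server-pool-1656|1945) [LeaderImpl.finishTask():239] task type: CLONE, status_code: RUNTIME_ERROR, clone failed., backendId: 10003, signature: 15765959",
]

counts = count_clone_failures(sample)
for (backend, signature), n in counts.most_common():
    print(f"backend {backend}, signature {signature}: {n} failed CLONE reports")

# To scan a real log instead (file name is an assumption, adjust to your setup):
# with open("fe.warn.log", encoding="utf-8", errors="replace") as f:
#     counts = count_clone_failures(f)
```

If the count for signature 15765959 stops growing after the table has been recovered and force-dropped, the FE is no longer scheduling the clone for that tablet.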
Did you perform any special operations before or after the upgrade that might have led to this problem? If so, could you share them?