To help us locate your issue faster, please provide the following information. Thank you.
【Details】Flink has been writing event-tracking data normally, roughly 300+ GB per day. After about a month of writes, checkpoint commits started failing frequently during ingestion. Flink writes to other, smaller tables still succeed, and those tables can be queried normally.
【Background】None
【Business impact】
【Shared-data (storage-compute separated)?】
【StarRocks version】3.3.1
【Cluster size】e.g., 3 FE (1 follower + 2 observers) + 5 BE (FE and BE deployed separately)
【Machine specs】CPU vCores / memory / NIC, e.g., 80C / 500G / 10 GbE
【Table model】e.g., Primary Key model
【Import/export method】e.g., Flink
【Contact】So we can reach you for logs while troubleshooting, please leave contact info, e.g., "Community group 4 - Xiao Li" or an email address. Thanks.
【Attachments】
- fe.log: the FE logs the following error every 30 seconds; not sure whether it is related
2024-09-24 00:00:09.552+08:00 ERROR (TableKeeper|130) [RepoExecutor.executeDDL():114] execute DDL error: ALTER TABLE statistics.task_run_history SET ('replication_num'='3')
com.starrocks.common.DdlException: This is a range partitioned table, you should specify partitions with MODIFY PARTITION clause. If you want to set default replication number, please use 'default.replication_num' instead of 'replication_num' to escape misleading.
at com.starrocks.server.LocalMetastore.modifyTableReplicationNum(LocalMetastore.java:3800)
at com.starrocks.alter.SchemaChangeHandler.analyzeAndCreateJob(SchemaChangeHandler.java:1829)
at com.starrocks.alter.SchemaChangeHandler.process(SchemaChangeHandler.java:2028)
at com.starrocks.alter.AlterJobMgr.processAlterTable(AlterJobMgr.java:568)
at com.starrocks.server.LocalMetastore.alterTable(LocalMetastore.java:2918)
at com.starrocks.server.MetadataMgr.alterTable(MetadataMgr.java:389)
at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.lambda$visitAlterTableStatement$17(DDLStmtExecutor.java:394)
at com.starrocks.common.ErrorReport.wrapWithRuntimeException(ErrorReport.java:113)
at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.visitAlterTableStatement(DDLStmtExecutor.java:393)
at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.visitAlterTableStatement(DDLStmtExecutor.java:181)
at com.starrocks.sql.ast.AlterTableStmt.accept(AlterTableStmt.java:64)
at com.starrocks.sql.ast.AstVisitor.visit(AstVisitor.java:71)
at com.starrocks.qe.DDLStmtExecutor.execute(DDLStmtExecutor.java:167)
at com.starrocks.load.pipe.filelist.RepoExecutor.executeDDL(RepoExecutor.java:112)
at com.starrocks.scheduler.history.TableKeeper.correctTable(TableKeeper.java:117)
at com.starrocks.scheduler.history.TableKeeper.run(TableKeeper.java:76)
at com.starrocks.scheduler.history.TableKeeper$TableKeeperDaemon.runAfterCatalogReady(TableKeeper.java:237)
at com.starrocks.common.util.FrontendDaemon.runOneCycle(FrontendDaemon.java:72)
at com.starrocks.common.util.Daemon.run(Daemon.java:107)
2024-09-24 00:00:09.555+08:00 ERROR (TableKeeper|130) [TableKeeper.run():84] error happens in Keeper: com.starrocks.common.DdlException: This is a range partitioned table, you should specify partitions with MODIFY PARTITION clause. If you want to set default replication number, please use 'default.replication_num' instead of 'replication_num' to escape misleading.
java.lang.RuntimeException: com.starrocks.common.DdlException: This is a range partitioned table, you should specify partitions with MODIFY PARTITION clause. If you want to set default replication number, please use 'default.replication_num' instead of 'replication_num' to escape misleading.
at com.starrocks.load.pipe.filelist.RepoExecutor.executeDDL(RepoExecutor.java:115)
at com.starrocks.scheduler.history.TableKeeper.correctTable(TableKeeper.java:117)
at com.starrocks.scheduler.history.TableKeeper.run(TableKeeper.java:76)
at com.starrocks.scheduler.history.TableKeeper$TableKeeperDaemon.runAfterCatalogReady(TableKeeper.java:237)
at com.starrocks.common.util.FrontendDaemon.runOneCycle(FrontendDaemon.java:72)
at com.starrocks.common.util.Daemon.run(Daemon.java:107)
Caused by: com.starrocks.common.DdlException: This is a range partitioned table, you should specify partitions with MODIFY PARTITION clause. If you want to set default replication number, please use 'default.replication_num' instead of 'replication_num' to escape misleading.
at com.starrocks.server.LocalMetastore.modifyTableReplicationNum(LocalMetastore.java:3800)
at com.starrocks.alter.SchemaChangeHandler.analyzeAndCreateJob(SchemaChangeHandler.java:1829)
at com.starrocks.alter.SchemaChangeHandler.process(SchemaChangeHandler.java:2028)
at com.starrocks.alter.AlterJobMgr.processAlterTable(AlterJobMgr.java:568)
at com.starrocks.server.LocalMetastore.alterTable(LocalMetastore.java:2918)
at com.starrocks.server.MetadataMgr.alterTable(MetadataMgr.java:389)
at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.lambda$visitAlterTableStatement$17(DDLStmtExecutor.java:394)
at com.starrocks.common.ErrorReport.wrapWithRuntimeException(ErrorReport.java:113)
at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.visitAlterTableStatement(DDLStmtExecutor.java:393)
at com.starrocks.qe.DDLStmtExecutor$StmtExecutorVisitor.visitAlterTableStatement(DDLStmtExecutor.java:181)
at com.starrocks.sql.ast.AlterTableStmt.accept(AlterTableStmt.java:64)
at com.starrocks.sql.ast.AstVisitor.visit(AstVisitor.java:71)
at com.starrocks.qe.DDLStmtExecutor.execute(DDLStmtExecutor.java:167)
at com.starrocks.load.pipe.filelist.RepoExecutor.executeDDL(RepoExecutor.java:112)
… 5 more
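For what it's worth, the exception text above suggests its own workaround for the internal `statistics.task_run_history` table that TableKeeper keeps trying to alter. A hedged sketch of the two forms the message points at (whether either silences the recurring TableKeeper job in 3.3.1 is an assumption, not confirmed):

```sql
-- Option 1 (suggested by the error message itself): set the table-level
-- default instead of 'replication_num' directly.
ALTER TABLE statistics.task_run_history
SET ("default.replication_num" = "3");

-- Option 2: change the replication number of the existing range
-- partitions explicitly, as the MODIFY PARTITION hint describes.
ALTER TABLE statistics.task_run_history
MODIFY PARTITION (*) SET ("replication_num" = "3");
```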
- be.INFO: the BE logs the following when the write error occurs
I0920 09:27:43.435473 1532 tablet_updates.cpp:3006] update compaction start tablet:55883 version:9985 score:536443072 merge levels:0 pick:2/valid:4/all:4 24036,24038 #pick_segments:2 #valid_segments:4 #rows:12898->12898 bytes:417.82 KB->417.82 KB(estimate)
I0920 09:27:43.451020 1532 rowset_merger.cpp:322] update compaction merge finished. tablet=55883 #key=2 algorithm=VERTICAL_COMPACTION column_group_size=5 chunk_size min:4096 max:4096 input(entry=2 rows=12898 del=0 actual=12898 bytes=417.82 KB) output(rows=12898 chunk=6 bytes=413.29 KB) duration: 15ms
I0920 09:27:43.451380 1532 tablet_updates.cpp:2112] commit compaction tablet:55883 version:9985.1 rowset:24039 #seg:1 #row:12898 size:413.29 KB #pending:0 state_memory:0
I0920 09:27:43.451488 79713 tablet_updates.cpp:2174] apply_compaction_commit start tablet:55883 version:9985.1 rowset:24039
I0920 09:27:43.508199 79713 tablet_updates.cpp:2422] apply_compaction_commit finish tablet:55883 version:9985.1 total del/row:6/508968 0% rowset:24039 #row:12898 #del:0 #delvec:1 duration:57ms(0/57/0)
W0920 09:27:43.539618 561 segment_replicate_executor.cpp:181] Failed to send rpc to SyncChannnel [host: 10.5.5.26, port: 8060, load_id: 5f478c5b-cc59-2adf-7b37-c67c17635297, tablet_id: 1116696, txn_id: 615298] err=Internal error: no associated load channel 5f478c5b-cc59-2adf-7b37-c67c17635297
W0920 09:27:43.539712 561 segment_replicate_executor.cpp:320] Failed to sync segment SyncChannnel [host: 10.5.5.26, port: 8060, load_id: 5f478c5b-cc59-2adf-7b37-c67c17635297, tablet_id: 1116696, txn_id: 615298] err Internal error: no associated load channel 5f478c5b-cc59-2adf-7b37-c67c17635297
be/src/storage/segment_replicate_executor.cpp:131 _wait_response(replicate_tablet_infos, failed_tablet_infos)
W0920 09:27:43.539839 54206 delta_writer.cpp:722] Cancelled: cancel
W0920 09:27:43.539923 54206 async_delta_writer.cpp:67] Fail to write or commit. txn_id: 615298 tablet_id: 1116696: Cancelled: cancel
W0920 09:27:43.540086 1569 async_delta_writer.cpp:186] Fail to execution_queue_execute: 22
W0920 09:27:43.540131 1569 async_delta_writer.cpp:186] Fail to execution_queue_execute: 22
W0920 09:27:43.540153 1569 async_delta_writer.cpp:186] Fail to execution_queue_execute: 22
W0920 09:27:43.540170 1569 async_delta_writer.cpp:186] Fail to execution_queue_execute: 22
W0920 09:27:43.540293 1572 async_delta_writer.cpp:186] Fail to execution_queue_execute: 22
W0920 09:27:43.540336 1572 async_delta_writer.cpp:186] Fail to execution_queue_execute: 22
W0920 09:27:43.540364 1572 async_delta_writer.cpp:186] Fail to execution_queue_execute: 22
W0920 09:27:43.540381 1572 async_delta_writer.cpp:186] Fail to execution_queue_execute: 22
W0920 09:27:43.540546 1563 async_delta_writer.cpp:186] Fail to execution_queue_execute: 22
- Full exception stack trace
Caused by: java.lang.Exception: Could not perform checkpoint 9229 for operator Source: sourceStream -> sourceStreamMap -> Sink: Unnamed (4/5)#0.
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:1184)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$13(StreamTask.java:1131)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:398)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:367)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:352)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
at java.lang.Thread.run(Thread.java:748)
Suppressed: java.lang.RuntimeException: com.starrocks.data.load.stream.exception.StreamLoadFailException: Transaction prepare failed, db: ods, table: appevent, label: flink-a6c795e0-3f82-45f6-8ee0-32f333d4b3b0,
responseBody: {
"Status": "INTERNAL_ERROR",
"Message": "[E1008]Reached timeout=300000ms @10.15.15.25:8060"
}
errorLog: null
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.AssertNotException(StreamLoadManagerV2.java:427)
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.flush(StreamLoadManagerV2.java:355)
at com.starrocks.connector.flink.table.sink.StarRocksDynamicSinkFunctionV2.close(StarRocksDynamicSinkFunctionV2.java:251)
at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.close(AbstractUdfStreamOperator.java:115)
at org.apache.flink.streaming.runtime.tasks.StreamOperatorWrapper.close(StreamOperatorWrapper.java:163)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.closeAllOperators(RegularOperatorChain.java:125)
at org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:1043)
at org.apache.flink.util.IOUtils.closeAll(IOUtils.java:255)
at org.apache.flink.core.fs.AutoCloseableRegistry.doClose(AutoCloseableRegistry.java:72)
at org.apache.flink.util.AbstractAutoCloseableRegistry.close(AbstractAutoCloseableRegistry.java:127)
at org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUp(StreamTask.java:951)
at org.apache.flink.runtime.taskmanager.Task.lambda$restoreAndInvoke$0(Task.java:934)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:934)
… 3 more
Caused by: com.starrocks.data.load.stream.exception.StreamLoadFailException: Transaction prepare failed, db: ods, table: appevent, label: flink-a6c795e0-3f82-45f6-8ee0-32f333d4b3b0,
responseBody: {
"Status": "INTERNAL_ERROR",
"Message": "[E1008]Reached timeout=300000ms @10.15.15.25:8060"
}
errorLog: null
at com.starrocks.data.load.stream.TransactionStreamLoader.prepare(TransactionStreamLoader.java:221)
at com.starrocks.data.load.stream.v2.TransactionTableRegion.commit(TransactionTableRegion.java:247)
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.lambda$init$0(StreamLoadManagerV2.java:210)
… 1 more
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 9229 for operator Source: sourceStream -> sourceStreamMap -> Sink: Unnamed (4/5)#0. Failure reason: Checkpoint was declined.
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:269)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:173)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:336)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.checkpointStreamOperator(RegularOperatorChain.java:228)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.buildOperatorSnapshotFutures(RegularOperatorChain.java:213)
at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.snapshotState(RegularOperatorChain.java:192)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:715)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:350)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$14(StreamTask.java:1299)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:93)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1287)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointAsyncInMailbox(StreamTask.java:1172)
… 14 more
Caused by: java.lang.RuntimeException: com.starrocks.data.load.stream.exception.StreamLoadFailException: Transaction prepare failed, db: ods, table: appevent, label: flink-a6c795e0-3f82-45f6-8ee0-32f333d4b3b0,
responseBody: {
"Status": "INTERNAL_ERROR",
"Message": "[E1008]Reached timeout=300000ms @10.15.15.25:8060"
}
errorLog: null
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.AssertNotException(StreamLoadManagerV2.java:427)
at com.starrocks.data.load.stream.v2.StreamLoadManagerV2.flush(StreamLoadManagerV2.java:355)
at com.starrocks.connector.flink.table.sink.StarRocksDynamicSinkFunctionV2.snapshotState(StarRocksDynamicSinkFunctionV2.java:264)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:88)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:222)
… 25 more
[CIRCULAR REFERENCE:com.starrocks.data.load.stream.exception.StreamLoadFailException: Transaction prepare failed, db: ods, table: appevent, label: flink-a6c795e0-3f82-45f6-8ee0-32f333d4b3b0,
responseBody: {
"Status": "INTERNAL_ERROR",
"Message": "[E1008]Reached timeout=300000ms @10.15.15.25:8060"
}
errorLog: null]
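Context on what was tried: the root failure is an `[E1008]` brpc timeout (300000 ms = 300 s) reported by the BE at `:8060` during transaction prepare. One commonly suggested first step while investigating BE-side pressure (e.g., the compaction activity visible in be.INFO) is to raise the Stream Load timeout that the Flink connector passes through via `sink.properties.*`. A hedged sketch in Flink SQL; host names, credentials, and column list are placeholders, and whether this header actually governs the E1008 RPC timeout is an assumption:

```sql
-- Illustrative Flink SQL sink; only 'sink.properties.timeout' is the
-- point of interest, everything else is a placeholder.
CREATE TABLE appevent_sink (
    event_id BIGINT,
    event_time TIMESTAMP(3)
) WITH (
    'connector'     = 'starrocks',
    'jdbc-url'      = 'jdbc:mysql://fe_host:9030',
    'load-url'      = 'fe_host:8030',
    'database-name' = 'ods',
    'table-name'    = 'appevent',
    'username'      = 'user',
    'password'      = 'password',
    -- Stream Load timeout in seconds, raised from the default 600.
    'sink.properties.timeout' = '1200'
);
```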