Data-loading task stuck at 99%

To help us locate your issue faster, please provide the following information. Thanks!
【Details】Loading data from Hive into the SR cluster with INSERT INTO SELECT; the task was submitted as follows (the forum stripped the asterisks from the hint, which should read /*+set_var(...)*/):
SUBMIT /*+set_var(query_timeout=259200)*/ TASK task_tv_total_info_1201
AS
INSERT INTO `enlightent_daily`.`tv_total_info`
SELECT md5(concat(`name`, `channel`)) name_channel_md5,
    `index`, `videoType`, `date`, `id`, `name`,
    `channel`, `play_times`, `dayPlayTimes`, `up`, `dayUp`, `down`, `dayDown`,
    `comment_count`, `totalComments`, `barrageCount`, `dayBarrageCount`,
    `rating`, `fake`, `dayPlayTimesPredicted`, `playTimesPredicted`
FROM hive.mysql_prod_enlightent_daily.tv_total_info_oss_table

【Background】The table is about 1.5 TB.
【Business impact】
【Shared-data cluster?】Yes, storage and compute are separated.
【StarRocks version】e.g. 3.1.4
【Cluster size】e.g. 1 FE + 2 BE (FE and BE co-located)
【Machine info】Alibaba Cloud, 10 GbE network
【Contact】Community group 17 - 不知不觉
【Attachments】

  • fe.log / be.INFO / relevant screenshots
    fe.warning: no warning logs
    be.INFO warnings report:
    W1201 14:40:19.600549 14906 query_context.cpp:610] Retrying ReportExecStatus: No more data to read.

Task status info:
Name |Value |
--------------------+--------------------------------------------------------------------------------------------+
JOB_ID |22554 |
LABEL |insert_5d0255e8-8fe9-11ee-84f2-00163e354101 |
DATABASE_NAME |enlightent_daily |
STATE |LOADING |
PROGRESS |ETL:100%; LOAD:99% |
TYPE |INSERT |
PRIORITY |NORMAL |
SCAN_ROWS |0 |
FILTERED_ROWS |0 |
UNSELECTED_ROWS |0 |
SINK_ROWS |2906109860 |
ETL_INFO | |
TASK_INFO |resource:N/A; timeout(s):259200; max_filter_ratio:0.0 |
CREATE_TIME |2023-12-01 09:31:35 |
ETL_START_TIME |2023-12-01 09:31:37 |
ETL_FINISH_TIME |2023-12-01 09:31:37 |
LOAD_START_TIME |2023-12-01 09:31:37 |
LOAD_FINISH_TIME | |
JOB_DETAILS |{"All backends":{"5d0255e8-8fe9-11ee-84f2-00163e354101":[10034]},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":480064304611,"InternalTableLoadRows":2906109860,"ScanBytes":0,"ScanRows":0,"TaskNumber":1,"Unfinished backends":{"5d0255e8-8fe9-11ee-84f2-00163e354101":[10034]}}|
ERROR_MSG | |
TRACKING_URL | |
TRACKING_SQL | |
REJECTED_RECORD_PATH| |

Could you check what errors this BE node reports? (be.INFO / be.warn.log)

It has been like this since a little past ten o'clock.

warning log


be.INFO log
I1201 15:15:40.951562 14990 starlet.cc:90] Report worker state to '172.16.132.124:6090'
I1201 15:15:45.205499 14944 tablet_manager.cpp:931] Report all 0 tablets info
I1201 15:15:47.449193 14863 pipeline_driver_executor.cpp:325] [Driver] Succeed to report exec state: fragment_instance_id=5d0255e8-8fe9-11ee-84f2-00163e354102
W1201 15:15:49.800873 14906 query_context.cpp:610] Retrying ReportExecStatus: No more data to read.
I1201 15:15:50.964874 14990 starlet.cc:90] Report worker state to '172.16.132.124:6090'
I1201 15:15:54.896476 14630 daemon.cpp:211] Current memory statistics: process(4558707960), query_pool(379791896), load(3150758912), metadata(441660868), compaction(0), schema_change(0), column_pool(58647493), page_cache(310893184), update(0), chunk_allocator(14475984), clone(0), consistency(0)
I1201 15:15:57.449023 14863 pipeline_driver_executor.cpp:325] [Driver] Succeed to report exec state: fragment_instance_id=5d0255e8-8fe9-11ee-84f2-00163e354102
I1201 15:16:00.979295 14990 starlet.cc:90] Report worker state to '172.16.132.124:6090'
I1201 15:16:07.449020 14863 pipeline_driver_executor.cpp:325] [Driver] Succeed to report exec state: fragment_instance_id=5d0255e8-8fe9-11ee-84f2-00163e354102
I1201 15:16:09.898350 14630 daemon.cpp:211] Current memory statistics: process(4555793272), query_pool(377879752), load(3150885344), metadata(441660868), compaction(0), schema_change(0), column_pool(58647493), page_cache(310893184), update(0), chunk_allocator(14475984), clone(0), consistency(0)
I1201 15:16:10.995126 14990 starlet.cc:90] Report worker state to '172.16.132.124:6090'
I1201 15:16:17.449082 14863 pipeline_driver_executor.cpp:325] [Driver] Succeed to report exec state: fragment_instance_id=5d0255e8-8fe9-11ee-84f2-00163e354102
W1201 15:16:19.804841 14906 query_context.cpp:610] Retrying ReportExecStatus: No more data to read.
I1201 15:16:21.008074 14990 starlet.cc:90] Report worker state to '172.16.132.124:6090'
I1201 15:16:24.900362 14630 daemon.cpp:211] Current memory statistics: process(4548798816), query_pool(374898944), load(3149770448), metadata(441660868), compaction(0), schema_change(0), column_pool(58647493), page_cache(310893184), update(0), chunk_allocator(14475984), clone(0), consistency(0)

Memory status

The BE node crashed:
W1201 15:21:08.768373 14718 rowset_update_state.cpp:61] limit: 3094705667; consumption: 9815585376; allocation: 19986974051768; deallocation: 19977350902872; label: 5d0255e88fe911ee-84f200163e354101; all tracker size: 3; limit trackers size: 3; parent is null: false;
W1201 15:21:08.768394 14718 rowset_update_state.cpp:41] bad RowsetUpdateState released tablet:22658
E1201 15:21:08.768399 14718 update_manager.cpp:671] lake primary table preload_update_state id:22658 error:Memory limit exceeded: Memory of 5d0255e88fe911ee-84f200163e354101 exceed limit. LoadSegments Used: 9745812392, Limit: 3094705667.
/build/starrocks/be/src/storage/lake/rowset.cpp:213 tls_thread_status.mem_tracker()->check_mem_limit("LoadSegments")
/build/starrocks/be/src/storage/lake/rowset.cpp:149 load_segments(&segments, false)
/build/starrocks/be/src/storage/lake/rowset_update_state.cpp:76 _do_load_upserts_deletes(op_write, *tablet_schema, tablet, rowset_ptr.get())
W1201 15:21:08.694662 14711 rowset_update_state.cpp:57] load RowsetUpdateState error: Memory limit exceeded: Memory of 5d0255e88fe911ee-84f200163e354101 exceed limit. LoadSegments Used: 9820508920, Limit: 3094705667.
/build/starrocks/be/src/storage/lake/rowset.cpp:213 tls_thread_status.mem_tracker()->check_mem_limit("LoadSegments")
/build/starrocks/be/src/storage/lake/rowset.cpp:149 load_segments(&segments, false)
/build/starrocks/be/src/storage/lake/rowset_update_state.cpp:76 _do_load_upserts_deletes(op_write, *tablet_schema, tablet, rowset_ptr.get()) tablet:22660 stack:
@ 0x47cc450 _ZZSt9call_onceIZN9starrocks4lake17RowsetUpdateState4loadERKNS1_16TxnLogPB_OpWriteERKNS1_16TabletMetadataPBElPNS1_6TabletEPKNS1_15MetaFileBuilderEbEUlvE_JEEvRSt9once_flagOT_DpOT0_ENUlvE0_4_FUNEv
@ 0x7fe47f4ad20b __pthread_once_slow
@ 0x47ce2c0 starrocks::lake::RowsetUpdateState::load()
@ 0x47c11d7 starrocks::lake::UpdateManager::preload_update_state()
@ 0x4cdd246 starrocks::lake::DeltaWriterImpl::finish()
@ 0x4cdd545 starrocks::lake::DeltaWriter::finish()
@ 0x593ca50 starrocks::lake::AsyncDeltaWriterImpl::execute()
@ 0x64e3a2c bthread::ExecutionQueueBase::_execute()
@ 0x64e47a8 bthread::ExecutionQueueBase::_execute_tasks()
@ 0x4f782b2 starrocks::ThreadPool::dispatch_thread()
@ 0x4f72d4a starrocks::thread::supervise_thread()
@ 0x7fe47f4aeea5 start_thread
@ 0x7fe47e8afb0d __clone
@ (nil) (unknown)

W1201 15:21:08.862403 20844 rowset_update_state.cpp:61] limit: 3094705667; consumption: 9854341016; allocation: 19987779215336; deallocation: 19978048901864; label: 5d0255e88fe911ee-84f200163e354101; all tracker size: 3; limit trackers size: 3; parent is null: false;
W1201 15:21:08.862429 20844 rowset_update_state.cpp:41] bad RowsetUpdateState released tablet:22644
E1201 15:21:08.862434 20844 update_manager.cpp:671] lake primary table preload_update_state id:22644 error:Memory limit exceeded: Memory of 5d0255e88fe911ee-84f200163e354101 exceed limit. LoadSegments Used: 9800252256, Limit: 3094705667.
/build/starrocks/be/src/storage/lake/rowset.cpp:213 tls_thread_status.mem_tracker()->check_mem_limit("LoadSegments")
/build/starrocks/be/src/storage/lake/rowset.cpp:149 load_segments(&segments, false)
/build/starrocks/be/src/storage/lake/rowset_update_state.cpp:76 _do_load_upserts_deletes(op_write, *tablet_schema, tablet, rowset_ptr.get())
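The errors above show the shared-data primary-key table's segment preload blowing past its per-load memory tracker (~9.8 GB used vs a ~3 GB limit), which fits a 2.9-billion-row load stalling at 99% and the BE going down. A common workaround (a sketch, not official guidance) is to split one huge INSERT ... SELECT into per-day batches so each load transaction only touches a bounded slice of the 1.5 TB source. The date list, output file name, and elided column list below are all hypothetical stand-ins:

```shell
# Hypothetical batching sketch: emit one INSERT ... SELECT per day.
# "SELECT ... *" stands in for the full column list from the original statement.
for day in 2023-11-27 2023-11-28 2023-11-29; do
  cat <<EOF
INSERT INTO enlightent_daily.tv_total_info
SELECT /* full column list from the original statement */ *
FROM hive.mysql_prod_enlightent_daily.tv_total_info_oss_table
WHERE \`date\` = '$day';
EOF
done > daily_batches.sql
wc -l daily_batches.sql   # 12 lines: 3 batches of 4 lines each
```

Each batch commits independently, so a failed day can be retried without redoing the whole load; the single-day retry further down in this thread (`where date = '2023-11-29'`) follows the same idea.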


Please paste be.out.

SLF4J: Found binding in [jar:file:/data/StarRocks-3.1.4/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
start time: Fri Nov 24 18:57:22 CST 2023
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/StarRocks-3.1.4/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/StarRocks-3.1.4/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
start time: Wed Nov 29 16:48:04 CST 2023
start time: Wed Nov 29 18:34:52 CST 2023
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/StarRocks-3.1.4/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/StarRocks-3.1.4/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
start time: Thu Nov 30 17:39:39 CST 2023
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/StarRocks-3.1.4/be/lib/jni-packages/starrocks-jdbc-bridge-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/StarRocks-3.1.4/be/lib/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

These are all yesterday's logs; be.out has nothing new.

Could you upload the full be.INFO log, including the time around the node crash?

One moment, let me check how to download it.


be.INFO.log.20231201-014927.zip (1.0 MB) @ricky


Could you also upload the fe.log for the corresponding time period?

2023-12-01 13:17:40,675 INFO (leaderCheckpointer|104) [BDBJEJournal.getFinalizedJournalId():275] database names: 316778
2023-12-01 13:17:40,675 INFO (leaderCheckpointer|104) [Checkpoint.runAfterCatalogReady():95] checkpoint imageVersion 316777 , checkpointVersion 0
2023-12-01 13:17:44,632 INFO (tablet stat mgr|29) [TabletStatMgr.updateLocalTabletStat():158] finished to get local tablet stat of all backends. cost: 0 ms
2023-12-01 13:17:44,633 INFO (tablet stat mgr|29) [TabletStatMgr.runAfterCatalogReady():126] finished to update index row num of all databases. cost: 0 ms
2023-12-01 13:17:45,182 INFO (ReportHandler|180) [ReportHandler.tabletReport():380] backend[10034] reports 0 tablet(s). report version: 17013371790000
2023-12-01 13:17:45,182 INFO (ReportHandler|180) [TabletInvertedIndex.tabletReport():301] finished to do tablet diff with backend[10034]. sync: 0. metaDel: 0. foundValid: 0. foundInvalid: 0. migration: 0. found invalid transactions 0. found republish transactions 0 cost: 0 ms
2023-12-01 13:17:50,413 INFO (tablet scheduler|34) [ClusterLoadStatistic.classifyBackendByLoad():163] classify backend by load. medium: SSD, avg load score: 0.0, low/mid/high: 0/1/0
2023-12-01 13:17:50,413 INFO (tablet scheduler|34) [TabletScheduler.updateClusterLoadStatistic():477] update cluster load statistic:
be id: 10034, is available: true, mediums: [{medium: HDD, replica: 0, used: 0, total: 0B, score: NaN},{medium: SSD, replica: 0, used: 0, total: 344.1GB, score: 0.0},], paths: [{path: /data/starrocks_data_be, path hash: 7613061313301635441, be: 10034, medium: SSD, used: 0, total: 369477971968},]

2023-12-01 13:17:54,567 INFO (colocate group clone checker|112) [ColocateTableBalancer.matchGroups():901] finished to match colocate group. cost: 0 ms, in lock time: 0 ms
2023-12-01 13:17:59,704 INFO (tablet checker|35) [TabletChecker.doCheck():419] finished to check tablets. isUrgent: true, unhealthy/total/added/in_sched/not_ready: 0/0/0/0/0, cost: 0 ms, in lock time: 0 ms, wait time: 0ms
2023-12-01 13:17:59,704 INFO (tablet checker|35) [TabletChecker.doCheck():419] finished to check tablets. isUrgent: false, unhealthy/total/added/in_sched/not_ready: 0/0/0/0/0, cost: 0 ms, in lock time: 0 ms, wait time: 0ms
2023-12-01 13:17:59,705 INFO (tablet checker|35) [TabletChecker.runAfterCatalogReady():209] TStat :
TStat num of tablet check round: 29627 (+1)
TStat cost of tablet check(ms): 2042 (+0)
TStat num of tablet checked in tablet checker: 0 (+0)
TStat num of unhealthy tablet checked in tablet checker: 0 (+0)
TStat num of tablet being added to tablet scheduler: 0 (+0)
TStat num of tablet schedule round: 592450 (+20)
TStat cost of tablet schedule(ms): 5146 (+1)
TStat num of tablet being scheduled: 0 (+0)
TStat num of tablet being scheduled succeeded: 0 (+0)

fe.log (29.3 MB) @ricky FE log

Uh oh. After I restarted the cluster and tried loading just one day of data, the progress has been stuck at 0% and won't move.
It feels like a lock is being held.
SUBMIT /*+set_var(query_timeout=259200)*/ TASK task_tv_total_info_1129_1
AS
INSERT INTO `enlightent_daily`.`tv_total_info`
SELECT md5(concat(`name`, `channel`)) name_channel_md5,
    `index`, `videoType`, `date`, `id`, `name`,
    `channel`, `play_times`, `dayPlayTimes`, `up`, `dayUp`, `down`, `dayDown`,
    `comment_count`, `totalComments`, `barrageCount`, `dayBarrageCount`,
    `rating`, `fake`, `dayPlayTimesPredicted`, `playTimesPredicted`
FROM hive.mysql_prod_enlightent_daily.tv_total_info_oss_table
WHERE `date` = '2023-11-29'

On the FE leader:
grep -i lock fe.log > fe_lock.log
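That grep can be made more useful with line numbers and surrounding context, so the transaction holding the lock is visible next to each hit. A sketch (the three-line fe.log below is fabricated purely for the demo; on the real cluster, run the grep against the leader FE's actual fe.log):

```shell
# Fabricated stand-in fe.log, only to demonstrate the grep flags.
printf 'INFO checkpoint done\nWARN get database write lock timeout\nINFO report tablets\n' > fe.log
# -i: case-insensitive, -n: show line numbers, -C 1: one line of context around each match
grep -in -C 1 'lock' fe.log > fe_lock.log
cat fe_lock.log
```

With GNU grep, matched lines are printed with `N:` and context lines with `N-`, which makes it easy to jump back into the full log at the right spot.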