Flink import fails after restarting from a checkpoint

Jeffy · December 6, 2021 03:13

[Details] After restarting the Flink job from a checkpoint, the job runs for a while and then fails.

Flink SQL configuration:
WITH (
  'connector' = 'starrocks',
  'jdbc-url' = 'jdbc:mysql://10.10.xx.xx:9030?useUnicode=true&characterEncoding=UTF-8&serverTimezone=GMT%2b8&useSSL=true&tinyInt1isBit=false',
  'load-url' = '10.10.xx.xx:8030',
  'database-name' = 'ODS',
  'table-name' = 'EBS_WIP_WIP_DISCRETE_JOBS',
  'username' = 'it',
  'password' = 'xxxxxx',
  'sink.semantic' = 'at-least-once',
  'sink.buffer-flush.interval-ms' = '30000',
  'sink.max-retries' = '3',
  'sink.properties.column_separator' = '\x01',
  'sink.properties.row_delimiter' = '\x02'
)

Each job runs with a parallelism of 1, and 78 jobs in total are importing data.

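[Background] The table uses the primary key data model, and data is imported through Flink.
[Business impact]
[StarRocks version] e.g. 1.19.1
[Cluster size] e.g. 3 FE (1 follower + 2 observers) + 7 BE
[Attachments]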

dongquan · December 6, 2021 03:12

What values are streaming_load_rpc_max_alive_time_sec and tablet_writer_rpc_timeout_sec set to in your be.conf?

Jeffy · December 6, 2021 03:16

Neither parameter has been explicitly configured, so both should be at their default values.

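U_1616566230846_9994 · December 6, 2021 07:06

I ran into the same problem with a wide agg-model table (20 dimension columns, 30+ metric columns, 1257 bytes per row). Since this is only a test, the table has just 5 buckets, distributed on one high-cardinality dimension.
Once flink2doris throughput reaches 80k rows/s, this error starts occurring frequently; below 60k QPS the problem does not appear. I changed the BE configuration as suggested on the official site and restarted, but the error is still triggered.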

streaming_load_rpc_max_alive_time_sec=2400
tablet_writer_open_rpc_timeout_sec=120

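Looking at the cluster (3 FE + 5 BE), CPU and memory usage are both low, while network I/O is fairly high, mostly from packet send and receive requests. Incremental compaction is merging 200 to 300 versions/s.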
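As a rough sanity check on the network load described above, the quoted row size and throughput can be turned into a payload rate. This is only a back-of-the-envelope sketch: the replication factor of 3 is an assumption (it is not stated in this thread), and the bytes actually sent between BEs will differ from the raw row size.

```python
# Back-of-the-envelope ingest bandwidth from the figures quoted in the post.
ROW_BYTES = 1257          # average row size reported above
ROWS_PER_SEC = 80_000     # throughput at which the errors start
REPLICAS = 3              # assumption: replication factor, not stated in the thread


def ingest_mb_per_sec(row_bytes: int, rows_per_sec: int, replicas: int = 1) -> float:
    """Return the payload rate in MB/s (1 MB = 10**6 bytes), multiplied by replicas."""
    return row_bytes * rows_per_sec * replicas / 1_000_000


raw = ingest_mb_per_sec(ROW_BYTES, ROWS_PER_SEC)
replicated = ingest_mb_per_sec(ROW_BYTES, ROWS_PER_SEC, REPLICAS)
print(f"raw payload: {raw:.2f} MB/s")
print(f"with {REPLICAS} replicas: {replicated:.2f} MB/s")
```

With only 5 buckets, that write traffic lands on a small number of tablets, which may be part of why the problem tracks import rate.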
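One more question: I could not find the tablet_writer_open_rpc_timeout_sec parameter in the StarRocks documentation, although it does appear in the Apache Doris documentation. This is confusing: does the parameter exist in StarRocks or not, or does it go by another name? The community documentation badly needs more work and review.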

dongquan · December 6, 2021 03:21

In StarRocks the parameter is tablet_writer_rpc_timeout_sec.
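To check which of the two parameters named in this thread are actually set on a BE, a small script along these lines may help. It is a minimal sketch that assumes be.conf consists of simple `key = value` lines with `#` comments; the sample text is hypothetical and only mirrors the values posted above.

```python
from typing import Dict, Optional


def parse_be_conf(text: str) -> Dict[str, str]:
    """Parse `key = value` lines, ignoring blanks and `#` comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# The two parameter names given for StarRocks in this thread.
EXPECTED = ("streaming_load_rpc_max_alive_time_sec", "tablet_writer_rpc_timeout_sec")


def report(text: str) -> Dict[str, Optional[str]]:
    """Map each expected parameter to its configured value, or None if it is not set."""
    settings = parse_be_conf(text)
    return {name: settings.get(name) for name in EXPECTED}


# Hypothetical be.conf excerpt using the Doris-style name from the earlier post.
sample = """
streaming_load_rpc_max_alive_time_sec = 2400
tablet_writer_open_rpc_timeout_sec = 120
"""
print(report(sample))
```

A value of None here means the parameter is not explicitly set under that name, so the BE would presumably run with its default.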

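U_1616566230846_9994 · December 6, 2021 03:26

Could you explain how stream load actually writes data, and is there any documentation you would recommend? I would like to understand how background compaction and data ingestion relate to and affect each other.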

chaoyli · December 6, 2021 04:36

Can you find the specific reason the import failed in be.INFO? Check what the actual error is; "index channel has intolerable failure" is only a symptom.

Jeffy · December 6, 2021 06:07

These are roughly the only logs I could find.

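U_1616566230846_9994 · December 6, 2021 06:14

My understanding is that the tablet writer's write RPC is timing out, but I don't know the underlying cause; at a high level it appears to be related to the import rate. The official site recommends tuning these two BE parameters, so you could change them and then watch how the job behaves.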

streaming_load_rpc_max_alive_time_sec=2400
tablet_writer_rpc_timeout_sec=120

chaoyli · December 6, 2021 06:17

The RPC timeout may be caused by disk flushing on the destination side, or by lock contention.

chaoyli · December 6, 2021 06:18

Take this thread ID and search for the surrounding log context. Also search by instance_id/fragment_id/query_id.

U_1616566230846_9994 · December 6, 2021 06:25

What can be learned from the Fragment information?

Fragment 2d4d3981-7bf3-1fcb-46d0-5303c8c26b9c:(Active: 200.669ms, non-child: 0.00%)
   - AverageThreadTokens: 1.00 
   - MemoryLimit: 2.00 GB
   - PeakMemoryUsage: 9.71 MB
   - PeakReservation: 0
   - PeakUsedReservation: 0
   - RowsProduced: 16.38K
  BlockMgr:
     - BlockWritesOutstanding: 0
     - BlocksCreated: 0
     - BlocksRecycled: 0
     - BufferedPins: 0
     - BytesWritten: 0
     - MaxBlockSize: 8.00 MB
     - TotalBufferWaitTime: 0.000ns
     - TotalReadBlockTime: 0.000ns
  OlapTableSink:(Active: 8.252ms, non-child: 4.11%)
     - CloseWaitTime: 0.000ns
     - ConvertBatchTime: 0.000ns
     - NonBlockingSendTime: 0.000ns
     - OpenTime: 605.259us
     - RowsFiltered: 0
     - RowsRead: 0
     - RowsReturned: 0
     - SendDataTime: 0.000ns
     - SerializeBatchTime: 0.000ns
     - ValidateDataTime: 0.000ns
  FILE_SCAN_NODE (id=0):(Active: 218.121ms, non-child: 100.00%)
     - BytesRead: 0
     - NumDiskAccess: 0
     - PeakMemoryUsage: 2.24 MB
     - PerReadThreadRawHdfsThroughput: 0.00 /sec
     - RowsRead: 0
     - RowsReturned: 16.38K
     - RowsReturnedRate: 75.11 K/sec
     - ScannerThreadsInvoluntaryContextSwitches: 0
     - ScannerThreadsTotalWallClockTime: 0.000ns
       - MaterializeTupleTime(*): 0.000ns
       - ScannerThreadsSysTime: 0.000ns
       - ScannerThreadsUserTime: 0.000ns
     - ScannerThreadsVoluntaryContextSwitches: 0
     - ScannerTotalTimer: 225.859ms
     - TotalRawReadTime(*): 0.000ns
     - TotalReadThroughput: 0.00 /sec
     - WaitScannerTime: 192.941ms
    FileScanner:
       - CastChunkTimer: 4.055ms
       - CreateChunkTimer: 154.641us
       - FillTimer: 0.000ns
       - MaterializeTimer: 1.221ms
       - ReadTimer: 0.000ns
      FilePRead:
         - FileReadTimer: 0.000ns
W1205 23:53:48.270437 57085 fragment_mgr.cpp:183] Fail to open fragment 2d4d3981-7bf3-1fcb-46d0-5303c8c26b9c: Internal error: index channel has intoleralbe failure

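chaoyli · December 6, 2021 06:34

The key is to find where thread 63099 first hit an error, and then see what that error is. It is also possible that a remote machine returned the error to it, in which case you need to take the IP printed in the log and check on that remote machine. There you can search by instance_id/fragment_id/query_id.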
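The search described above can be sketched as a small script. This is only an illustration: it assumes the BE log uses the glog-style prefix seen in the W1205 line quoted earlier, and the first sample line is hypothetical. The function names are made up for this example.

```python
import re
from typing import Iterable, List, Optional

# glog-style prefix, as in the W1205 line quoted earlier:
# <level><mmdd> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<source>[^\s\]]+)\] (?P<message>.*)$"
)


def first_problem(lines: Iterable[str], thread_id: str) -> Optional[dict]:
    """Return the first W/E/F record logged by thread_id, or None if there is none."""
    for line in lines:
        m = GLOG_LINE.match(line.rstrip("\n"))
        if m and m["thread"] == thread_id and m["level"] in "WEF":
            return m.groupdict()
    return None


def lines_mentioning(lines: Iterable[str], needle: str) -> List[str]:
    """Return every line containing needle, e.g. a fragment or query ID."""
    return [line.rstrip("\n") for line in lines if needle in line]


sample = [
    "I1205 23:53:48.100000 57085 example.cpp:1] hypothetical info line",
    "W1205 23:53:48.270437 57085 fragment_mgr.cpp:183] Fail to open fragment "
    "2d4d3981-7bf3-1fcb-46d0-5303c8c26b9c: Internal error: index channel has intoleralbe failure",
]
hit = first_problem(sample, "57085")
print(hit["source"], hit["message"])
```

On a real BE you would pass an open log file (for example be.INFO) instead of `sample`, then repeat the search on the remote machine using the fragment or query ID.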

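Jeffy · December 6, 2021 09:47

In my case it seems to be caused by the primary key model consuming too much memory. On top of that, the large table had no partitions, which led to the out-of-memory error.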

chaoyli · December 7, 2021 03:24

That is probably it. Has it been resolved now that you have added partitions?

Jeffy · December 7, 2021 05:20

It has improved; so far the error has not occurred again.