Querying an ES external table fails with "Operation timed out after 5000 milliseconds with 0 bytes received"

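[Details] Aggregations on an ES external table fail with "Operation timed out after 5000 milliseconds with 0 bytes received". A plain SELECT works fine, but as soon as an aggregate function such as count or sum is used, the error above is raised.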

[StarRocks version] 2.0.0-GA-2ffdf30

[Cluster size] 3 FE + 3 BE (FE and BE co-located on the same machines)

[Attachments]

  • fe.warn.log
    2022-03-10 13:55:18,190 WARN (thrift-server-pool-3494|237917) [Coordinator.updateFragmentExecStatus():1416] one instance report fail errorCode INTERNAL_ERROR Operation timed out after 5000 milliseconds with 0 bytes received, query_id=a15e1b70-a036-11ec-aa36-52540083914f instance_id=a15e1b70-a036-11ec-aa36-525400839151
    2022-03-10 13:55:18,190 WARN (thrift-server-pool-3494|237917) [Coordinator.updateStatus():733] one instance report fail throw updateStatus(), need cancel. job id: -1, query id: a15e1b70-a036-11ec-aa36-52540083914f, instance id: a15e1b70-a036-11ec-aa36-525400839151
    2022-03-10 13:55:18,192 WARN (starrocks-mysql-nio-pool-65924|238637) [Coordinator.getNext():752] get next fail, need cancel. status errorCode CANCELLED Cancelled BufferControlBlock::cancel, query id: a15e1b70-a036-11ec-aa36-52540083914f
    2022-03-10 13:55:18,192 WARN (starrocks-mysql-nio-pool-65924|238637) [Coordinator.getNext():773] query failed: Operation timed out after 5000 milliseconds with 0 bytes received

  • be.warn.log
    W0310 13:55:09.031574 95784 http_client.cpp:169] fail to execute HTTP client, errmsg=Operation timed out after 5001 milliseconds with 0 bytes received
    W0310 13:55:09.053457 95768 http_client.cpp:169] fail to execute HTTP client, errmsg=Operation timed out after 5002 milliseconds with 0 bytes received
    W0310 13:55:09.061493 95760 http_client.cpp:169] fail to execute HTTP client, errmsg=Operation timed out after 5000 milliseconds with 0 bytes received
    W0310 13:55:09.153553 95791 http_client.cpp:169] fail to execute HTTP client, errmsg=Operation timed out after 5001 milliseconds with 0 bytes received
    W0310 13:55:09.204666 95821 http_client.cpp:169] fail to execute HTTP client, errmsg=Operation timed out after 5001 milliseconds with 0 bytes received
    W0310 13:55:17.699506 95747 http_client.cpp:169] fail to execute HTTP client, errmsg=Operation timed out after 5001 milliseconds with 0 bytes received
    W0310 13:55:17.702486 95786 http_client.cpp:169] fail to execute HTTP client, errmsg=Operation timed out after 5001 milliseconds with 0 bytes received
    W0310 13:55:17.792493 95787 http_client.cpp:169] fail to execute HTTP client, errmsg=Operation timed out after 5001 milliseconds with 0 bytes received
    W0310 13:55:18.192759 151898 fragment_mgr.cpp:194] Fail to open fragment a15e1b70-a036-11ec-aa36-52540083915c: Cancelled: Cancelled SenderQueue::get_chunk
    W0310 13:55:18.192919 151896 fragment_mgr.cpp:194] Fail to open fragment a15e1b70-a036-11ec-aa36-525400839163: Cancelled: Cancelled SenderQueue::get_chunk
    W0310 13:55:18.193604 151929 fragment_mgr.cpp:194] Fail to open fragment a15e1b70-a036-11ec-aa36-52540083915f: Cancelled: Cancelled SenderQueue::get_chunk
    W0310 13:55:18.193786 151911 fragment_mgr.cpp:194] Fail to open fragment a15e1b70-a036-11ec-aa36-525400839161: Cancelled: Cancelled SenderQueue::get_chunk
    W0310 13:55:18.193789 151926 fragment_mgr.cpp:194] Fail to open fragment a15e1b70-a036-11ec-aa36-525400839168: Cancelled: Cancelled SenderQueue::get_chunk
    W0310 13:55:18.271019 151937 plan_fragment_executor.cpp:227] fail to open fragment, instance_id=a15e1b70-a036-11ec-aa36-525400839158, status=Internal error: Operation timed out after 5000 milliseconds with 0 bytes received
    W0310 13:55:18.271668 151937 fragment_mgr.cpp:194] Fail to open fragment a15e1b70-a036-11ec-aa36-525400839158: Internal error: Operation timed out after 5000 milliseconds with 0 bytes received
    W0310 13:55:19.490011 151924 plan_fragment_executor.cpp:227] fail to open fragment, instance_id=a15e1b70-a036-11ec-aa36-52540083915a, status=Internal error: Operation timed out after 5001 milliseconds with 0 bytes received
    W0310 13:55:19.490787 151924 fragment_mgr.cpp:194] Fail to open fragment a15e1b70-a036-11ec-aa36-52540083915a: Internal error: Operation timed out after 5001 milliseconds with 0 bytes received
    W0310 13:55:20.326599 151925 plan_fragment_executor.cpp:227] fail to open fragment, instance_id=a15e1b70-a036-11ec-aa36-525400839159, status=Internal error: Operation timed out after 5001 milliseconds with 0 bytes received
    W0310 13:55:20.327257 151925 fragment_mgr.cpp:194] Fail to open fragment a15e1b70-a036-11ec-aa36-525400839159: Internal error: Operation timed out after 5001 milliseconds with 0 bytes received
    W0310 13:55:21.138002 151950 plan_fragment_executor.cpp:227] fail to open fragment, instance_id=a15e1b70-a036-11ec-aa36-52540083915b, status=Internal error: Operation timed out after 5001 milliseconds with 0 bytes received
    W0310 13:55:21.138804 151950 fragment_mgr.cpp:194] Fail to open fragment a15e1b70-a036-11ec-aa36-52540083915b: Internal error: Operation timed out after 5001 milliseconds with 0 bytes received
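
To see at a glance where the timeouts originate, the BE warnings above can be tallied by source file. The following is only a rough sketch: the function name summarize_timeouts and both regular expressions are my own, written against the log lines pasted in this post, and may need adjusting for other log formats.

```python
import re
from collections import Counter

# Message text as it appears in be.warn.log, e.g.
# "Operation timed out after 5001 milliseconds with 0 bytes received"
TIMEOUT_RE = re.compile(
    r"Operation timed out after (\d+) milliseconds with (\d+) bytes received"
)
# Line prefix as pasted above, e.g. "W0310 13:55:09.031574 95784 http_client.cpp:169]"
PREFIX_RE = re.compile(
    r"^\s*[IWEF]\d{4} (\d{2}:\d{2}:\d{2})\.\d+ \d+ ([\w.]+):(\d+)\]"
)


def summarize_timeouts(lines):
    """Return (count of timeout warnings per source file, list of durations in ms)."""
    by_source = Counter()
    durations_ms = []
    for line in lines:
        timeout = TIMEOUT_RE.search(line)
        prefix = PREFIX_RE.match(line)
        if timeout is None or prefix is None:
            continue  # not a timeout warning, or not a BE log line
        by_source[prefix.group(2)] += 1
        durations_ms.append(int(timeout.group(1)))
    return by_source, durations_ms


if __name__ == "__main__":
    # Hypothetical path; point it at the real be.warn.log.
    with open("be.warn.log", encoding="utf-8") as fh:
        counts, durations = summarize_timeouts(fh)
    print(counts.most_common())
    if durations:
        print("min/max ms:", min(durations), max(durations))
```

If every duration clusters just above 5000 ms, as in the excerpt, that points to a fixed client-side limit being hit rather than to varying ES response times.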

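One more detail: the ES cluster has 9 nodes. One of them went down this morning, and since it was restarted the cluster has been rebalancing shards. I suspect the rebalancing is consuming enough system resources to make ES queries time out, although querying ES directly works fine. Can the limit behind "Operation timed out after 5001 milliseconds" be raised, and would raising it solve the problem?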
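
One way to check whether a direct ES request really stays under the 5-second mark is to time it from a BE host with the same client-side timeout. This is a minimal sketch under several assumptions: the scroll-style match_all request and the batch size of 1024 are guesses at what a scan might look like, not something confirmed in this thread, and the host and index names are placeholders. It does not change any StarRocks setting; it only measures ES.

```python
import json
import time
import urllib.request


def time_es_search(base_url, index, timeout_s=5.0, fetch=None):
    """POST a match_all scroll search; return (ok, elapsed_ms, detail)."""
    url = f"{base_url.rstrip('/')}/{index}/_search?scroll=5m"
    # Assumed request body: the real scan request may differ.
    body = json.dumps({"query": {"match_all": {}}, "size": 1024}).encode("utf-8")

    if fetch is None:
        def fetch(url, body, timeout_s):
            req = urllib.request.Request(
                url,
                data=body,
                headers={"Content-Type": "application/json"},
                method="POST",
            )
            with urllib.request.urlopen(req, timeout=timeout_s) as resp:
                return resp.read()

    start = time.monotonic()
    try:
        payload = fetch(url, body, timeout_s)
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return False, (time.monotonic() - start) * 1000.0, str(exc)
    return True, (time.monotonic() - start) * 1000.0, f"{len(payload)} bytes"


if __name__ == "__main__":
    # Placeholder host and index; replace with a real ES node and index name.
    print(time_es_search("http://es-host:9200", "my_index"))
```

Running it against each of the 9 ES nodes in turn while the shards are rebalancing would show whether only some nodes are slow to answer.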


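Update on progress:
After further investigation, I found that any query touching an ES external table fails with the error above as soon as it runs slowly, whereas slow queries on regular tables do not fail. I tried changing remote_fragment_exec_timeout_ms, thrift_rpc_timeout_ms, qe_slow_log_ms and similar parameters, but none of them had any effect.
I then restarted the ES cluster, and the error no longer occurs.
So the problem was most likely caused by the ES cluster. But where does this 5001 ms limit come from, and can it be adjusted?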