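【Details】Tables are missing tablets
【Background】The BE nodes were scaled down from 4 to 3
【Business impact】The tables are most likely inaccessible
【Shared-data (storage-compute separation)】No
【StarRocks version】3.0.5
【Cluster size】3 FE + 4 BE (FE and BE co-located)
【Machine info】48C/270G/10 Gigabit network
【Contact】
【Attachments】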
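We had four BE nodes. One BE node's server had a disk failure, so we removed that BE node from the cluster. As a result, queries against a large number of tables now fail with a tablet-not-found error: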
ERROR 1064 (HY000) at line 213: failed to get tablet. tablet_id=1198268039, with schema_hash=1269373243, reason=tablet does not exist backend:10.161.24.44
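Is the failing table a single-replica table? With a single replica, a disk failure means the data is lost; production tables must have 3 replicas. If it is not single-replica, connect to the leader FE, run show tablet xx, then execute the DetailCmd returned by that statement and post the result.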
ERROR 1064 (HY000) at line 2: failed to get tablet. tablet_id=1248968651, with schema_hash=1303794361, reason=tablet does not exist backend:10.161.24.44
show tablet 1248968651;
ods | ods_site_scan_dtl | p20231212 | ods_site_scan_dtl | 10367 | 985491023 | 1248968642 | 985491024 | true | SHOW PROC '/dbs/10367/985491023/partitions/1248968642/985491024/1248968651';
1248968652 | 10365 | 2 | 0 | 2 | 0 | -1 | 0 | -1 | 9082277 | 57870 | NORMAL | false | false | 2 | -1 | http://10.161.24.43:18040/api/meta/header/1248968651 | http://10.161.24.43:18040/api/compaction/show?tablet_id=1248968651&schema_hash=-1 | false | |
1248968653 | 10364 | 2 | 0 | 2 | 0 | -1 | 0 | -1 | 9082277 | 57870 | NORMAL | false | false | 2 | -1 | http://10.161.24.42:18040/api/meta/header/1248968651 | http://10.161.24.42:18040/api/compaction/show?tablet_id=1248968651&schema_hash=-1 | false | |
1248968654 | 1248704040 | 2 | 0 | 2 | 0 | -1 | 0 | -1 | 9082277 | 57870 | NORMAL | false | false | 2 | -1 | http://10.161.24.41:18040/api/meta/header/1248968651 | http://10.161.24.41:18040/api/compaction/show?tablet_id=1248968651&schema_hash=-1 | false | |
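All tables in the production database have three replicas; see the output in my reply above.
We ran into this once before and recreating the table fixed it, but this time recreating the table does not help either. Is there a temporary workaround? All the production jobs are failing now.
That doesn't look right. Did you touch the FEs? The error shows the query going to 24.44, but show tablet lists three replicas and none of them is on 24.44?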
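The mismatch pointed out above can be checked mechanically. Here is a minimal Python sketch, assuming only the error text and the meta URLs already posted in this thread; the helper names `backend_in_error` and `replica_hosts` are made up for illustration and are not part of StarRocks:

```python
import re
from urllib.parse import urlparse

# Error text copied from the thread.
ERROR = (
    "failed to get tablet. tablet_id=1248968651, with schema_hash=1303794361, "
    "reason=tablet does not exist backend:10.161.24.44"
)

# Meta URLs copied from the SHOW PROC output in the thread.
META_URLS = [
    "http://10.161.24.43:18040/api/meta/header/1248968651",
    "http://10.161.24.42:18040/api/meta/header/1248968651",
    "http://10.161.24.41:18040/api/meta/header/1248968651",
]


def backend_in_error(message):
    """Return the backend host named at the end of the error message."""
    match = re.search(r"backend:(\d+\.\d+\.\d+\.\d+)", message)
    if match is None:
        raise ValueError("no backend host found in message")
    return match.group(1)


def replica_hosts(urls):
    """Return the set of hosts that appear in the replicas' meta URLs."""
    return {urlparse(url).hostname for url in urls}


queried = backend_in_error(ERROR)
hosts = replica_hosts(META_URLS)
# True when the query was sent to a backend that holds no listed replica.
stale = queried not in hosts
print(f"queried backend: {queried}, replica hosts: {sorted(hosts)}, stale: {stale}")
```

When the backend in the error is not among the replica hosts, the query was probably planned from a replica location that the FE no longer lists, which is why the FE side is worth checking before the BEs.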
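As for the FEs: the one on the machine with the failed disk was stopped briefly and has since been brought back up. After the missing-tablet errors appeared, we upgraded the JDK from 1.8 to 11 and ran an alter system create image statement. All the FEs were also restarted today to try to fix the missing tablets, but that does not seem to have helped.
Besides the restarts, what else did you do? Right now the FE metadata doesn't look right: why is it still accessing 24.44 when this tablet's three replicas are on 24.41, 24.42, and 24.43?
At first the BE on machine 41 went down, so we removed 41 from the cluster. That left three machines, 42, 43, and 44, so replicas ended up on 44. Then we added the 41 BE back to the cluster, so the replicas now show up on 41, 42, and 43 again.
Did you do anything to the FEs besides restarting them? You didn't use recover, did you?
No other operations, and we did not use recover.
Please post the configuration of all 3 FEs.
After I upgraded the JDK to 11 this afternoon, I did not change the FE JVM memory, so it is still the default 8192m. With JDK 1.8 the JVM memory was 50g. Could that be the cause?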
# the output dir of stderr and stdout
LOG_DIR = ${STARROCKS_HOME}/log
DATE = "$(date +%Y%m%d-%H%M%S)"
JAVA_OPTS="-Dlog4j2.formatMsgNoLookups=true -Xmx51200m -XX:+UseMembar -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xloggc:$STARROCKS_HOME/log/fe.gc.log.$DATE"
# For jdk 9+, this JAVA_OPTS will be used as default JVM options
JAVA_OPTS_FOR_JDK_9="-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=7 -XX:+CMSClassUnloadingEnabled -XX:-CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -Xlog:gc*:$STARROCKS_HOME/log/fe.gc.log.$DATE:time"
# the lowercase properties are read by main program.
# DEBUG, INFO, WARN, ERROR, FATAL
sys_log_level = INFO
# store metadata, create it if it is not exist.
# Default value is ${STARROCKS_HOME}/meta
meta_dir = ${STARROCKS_HOME}/meta
http_port = 18030
rpc_port = 9020
query_port = 9030
edit_log_port = 19010
mysql_service_nio_enabled = true
# Choose one if there are more than one ip except loopback address.
# Note that there should at most one ip match this list.
# If no ip match this rule, will choose one randomly.
# use CIDR format, e.g. 10.10.10.0/24
# Default value is empty.
priority_networks = 10.161.24.42
# Advanced configurations
log_roll_size_mb = 1024
sys_log_dir = ${STARROCKS_HOME}/log
sys_log_roll_num = 10
sys_log_verbose_modules =
audit_log_dir = ${STARROCKS_HOME}/log
audit_log_modules = slow_query, query
audit_log_roll_num = 10
meta_delay_toleration_second = 10
qe_max_connection = 1024
max_conn_per_user = 100
qe_query_timeout_second = 300
qe_slow_log_ms = 5000
streaming_load_rpc_max_alive_time_sec=4800
tablet_writer_open_rpc_timeout_sec=480
max_routine_load_task_concurrent_num =12
max_routine_load_task_num_per_be =12
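The heap difference asked about above is visible directly in the two option strings. Here is a small Python sketch, assuming (as the comment in the config says) that JAVA_OPTS_FOR_JDK_9 is the string used on JDK 9 and later; `max_heap_bytes` is a hypothetical helper, and the option strings are shortened copies of the ones posted:

```python
import re

# Shortened copies of the two strings from the posted fe.conf;
# only the -Xmx flag matters for this check.
JAVA_OPTS = "-Dlog4j2.formatMsgNoLookups=true -Xmx51200m -XX:+UseMembar"
JAVA_OPTS_FOR_JDK_9 = "-Dlog4j2.formatMsgNoLookups=true -Xmx8192m -XX:SurvivorRatio=8"

# JVM size suffixes; a value without a suffix is in bytes.
UNIT_BYTES = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def max_heap_bytes(opts):
    """Return the -Xmx value of a JVM option string in bytes."""
    match = re.search(r"-Xmx(\d+)([kKmMgG]?)", opts)
    if match is None:
        raise ValueError("no -Xmx flag found")
    return int(match.group(1)) * UNIT_BYTES[match.group(2).lower()]


jdk8_heap = max_heap_bytes(JAVA_OPTS)
jdk9_heap = max_heap_bytes(JAVA_OPTS_FOR_JDK_9)
gib = 1024 ** 3
print(f"JAVA_OPTS heap: {jdk8_heap // gib} GiB")
print(f"JAVA_OPTS_FOR_JDK_9 heap: {jdk9_heap // gib} GiB")
if jdk9_heap < jdk8_heap:
    print("Heap shrinks after the JDK upgrade unless JAVA_OPTS_FOR_JDK_9 is edited too.")
```

With the posted values this reports 50 GiB for the JDK 1.8 string and 8 GiB for the JDK 9+ string, so an FE that moves to JDK 11 without editing the second string runs with a much smaller heap.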
Change the JVM setting back. Also, what state do the FEs show in show frontends right now? And please post the latest fe.log from the FE.
show frontends:
10.161.24.43_19010_1693280231501 | 10.161.24.43 | 19010 | 18030 | 9030 | 9020 | LEADER | 1745961488 | true | true | 492496337 | 2024-01-26 16:35:56 | true | 2024-01-26 13:40:32 | 3.0.5-cceec03 | |
10.161.24.42_19010_1668223778826 | 10.161.24.42 | 19010 | 18030 | 9030 | 9020 | FOLLOWER | 1745961488 | true | true | 492496326 | 2024-01-26 16:35:56 | true | 2024-01-26 13:42:51 | 3.0.5-cceec03 | |
10.161.24.41_19010_1668183330388 | 10.161.24.41 | 19010 | 18030 | 9030 | 9020 | FOLLOWER | 1745961488 | true | true | 492496326 | 2024-01-26 16:35:56 | true | 2024-01-26 13:37:42 | 3.0.5-cceec03 | |
FE logs from the leader node
fe.log:
2024-01-26 16:33:53,535 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1067781042. because: redundant replica is deleted
2024-01-26 16:33:53,535 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1232200870, backend id: 10365. reason: DECOMMISSION state, force: false replicas: 10365:2/-1/2/0:DECOMMISSION,10366:2/-1/2/0:NORMAL,10364:2/-1/2/1:NORMAL,1248704040:2/-1/2/1:NORMAL,
2024-01-26 16:33:53,535 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1232200870. because: redundant replica is deleted
2024-01-26 16:33:53,543 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1067780994, backend id: 10364. reason: DECOMMISSION state, force: false replicas: 10366:1/-1/1/1:NORMAL,10365:1/-1/1/1:NORMAL,10364:1/-1/1/1:DECOMMISSION,1248704040:1/-1/1/1:NORMAL,
2024-01-26 16:33:53,543 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1067780994. because: redundant replica is deleted
2024-01-26 16:33:53,544 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1062321925, backend id: 10364. reason: DECOMMISSION state, force: false replicas: 10366:1/-1/1/1:NORMAL,10364:1/-1/1/1:DECOMMISSION,10365:1/-1/1/1:NORMAL,1248704040:1/-1/1/1:NORMAL,
2024-01-26 16:33:53,544 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1062321925. because: redundant replica is deleted
2024-01-26 16:33:53,544 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1067781038, backend id: 10366. reason: DECOMMISSION state, force: false replicas: 10364:1/-1/1/1:NORMAL,10366:1/-1/1/1:DECOMMISSION,10365:1/-1/1/1:NORMAL,1248704040:1/-1/1/1:NORMAL,
2024-01-26 16:33:53,544 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1067781038. because: redundant replica is deleted
2024-01-26 16:33:53,544 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1067781026, backend id: 10364. reason: DECOMMISSION state, force: false replicas: 10366:1/-1/1/1:NORMAL,10364:1/-1/1/1:DECOMMISSION,10365:1/-1/1/1:NORMAL,1248704040:1/-1/1/1:NORMAL,
2024-01-26 16:33:53,544 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1067781026. because: redundant replica is deleted
2024-01-26 16:33:53,545 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1067781066, backend id: 10366. reason: DECOMMISSION state, force: false replicas: 10366:1/-1/1/1:DECOMMISSION,10364:1/-1/1/1:NORMAL,10365:1/-1/1/1:NORMAL,1248704040:1/-1/1/1:NORMAL,
2024-01-26 16:33:53,545 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1067781066. because: redundant replica is deleted
2024-01-26 16:33:53,545 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1067781122, backend id: 10365. reason: DECOMMISSION state, force: false replicas: 10365:1/-1/1/1:DECOMMISSION,10364:1/-1/1/1:NORMAL,10366:1/-1/1/1:NORMAL,1248704040:1/-1/1/1:NORMAL,
2024-01-26 16:33:53,545 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1067781122. because: redundant replica is deleted
2024-01-26 16:33:53,545 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1067781114, backend id: 10365. reason: DECOMMISSION state, force: false replicas: 10366:1/-1/1/1:NORMAL,10364:1/-1/1/1:NORMAL,10365:1/-1/1/1:DECOMMISSION,1248704040:1/-1/1/1:NORMAL,
2024-01-26 16:33:53,545 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1067781114. because: redundant replica is deleted
2024-01-26 16:33:53,545 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1062255409, backend id: 10365. reason: DECOMMISSION state, force: false replicas: 10364:1/-1/1/1:NORMAL,10365:1/-1/1/1:DECOMMISSION,10366:1/-1/1/1:NORMAL,1248704040:1/-1/1/1:NORMAL,
2024-01-26 16:33:53,546 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1062255409. because: redundant replica is deleted
2024-01-26 16:33:53,546 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1062321985, backend id: 10366. reason: DECOMMISSION state, force: false replicas: 10366:1/-1/1/1:DECOMMISSION,10364:1/-1/1/1:NORMAL,10365:1/-1/1/1:NORMAL,1248704040:1/-1/1/1:NORMAL,
2024-01-26 16:33:53,546 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1062321985. because: redundant replica is deleted
2024-01-26 16:33:53,546 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1062255369, backend id: 10366. reason: DECOMMISSION state, force: false replicas: 10366:1/-1/1/1:DECOMMISSION,10364:1/-1/1/1:NORMAL,10365:1/-1/1/1:NORMAL,1248704040:1/-1/1/1:NORMAL,
2024-01-26 16:33:53,546 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1062255369. because: redundant replica is deleted
2024-01-26 16:33:53,546 INFO (tablet scheduler|42) [TabletScheduler.deleteReplicaInternal():1308] delete replica. tablet id: 1062322045, backend id: 10365. reason: DECOMMISSION state, force: false replicas: 10365:1/-1/1/1:DECOMMISSION,10364:1/-1/1/1:NORMAL,10366:1/-1/1/1:NORMAL,1248704040:1/-1/1/1:NORMAL,
2024-01-26 16:33:53,547 INFO (tablet scheduler|42) [TabletScheduler.removeTabletCtx():1509] remove the tablet 1062322045. because: redundant replica is deleted
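To see which backends these deletions hit, the `delete replica` lines can be tallied. Below is a rough Python sketch, assuming the line layout shown in the excerpt above. The three sample lines are copied from it with the timestamp prefix trimmed, `parse_delete_replica` is a hypothetical helper, and the middle field of each replica entry (for example `2/-1/2/0`) is treated as opaque because the thread does not say what it means:

```python
import re
from collections import Counter

# Three lines from the fe.log excerpt, timestamp prefix trimmed.
SAMPLE = [
    "delete replica. tablet id: 1232200870, backend id: 10365. reason: DECOMMISSION state, force: false replicas: 10365:2/-1/2/0:DECOMMISSION,10366:2/-1/2/0:NORMAL,10364:2/-1/2/1:NORMAL,1248704040:2/-1/2/1:NORMAL,",
    "delete replica. tablet id: 1067780994, backend id: 10364. reason: DECOMMISSION state, force: false replicas: 10366:1/-1/1/1:NORMAL,10365:1/-1/1/1:NORMAL,10364:1/-1/1/1:DECOMMISSION,1248704040:1/-1/1/1:NORMAL,",
    "delete replica. tablet id: 1067781038, backend id: 10366. reason: DECOMMISSION state, force: false replicas: 10364:1/-1/1/1:NORMAL,10366:1/-1/1/1:DECOMMISSION,10365:1/-1/1/1:NORMAL,1248704040:1/-1/1/1:NORMAL,",
]

LINE_RE = re.compile(
    r"delete replica\. tablet id: (\d+), backend id: (\d+)\. "
    r"reason: (.+?), force: \w+ replicas: (.*)$"
)


def parse_delete_replica(line):
    """Parse one 'delete replica' line; return None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    tablet_id, backend_id, reason, replicas = match.groups()
    states = {}
    for entry in replicas.split(","):
        if not entry.strip():
            continue  # the list ends with a trailing comma
        backend, _opaque, state = entry.strip().split(":")
        states[backend] = state
    return {"tablet": tablet_id, "backend": backend_id,
            "reason": reason, "states": states}


parsed = [p for p in map(parse_delete_replica, SAMPLE) if p is not None]
deleted_per_backend = Counter(p["backend"] for p in parsed)
for p in parsed:
    normal = sum(1 for s in p["states"].values() if s == "NORMAL")
    print(p["tablet"], "deleted on", p["backend"], "NORMAL replicas left:", normal)
print(dict(deleted_per_backend))
```

In each sampled line the tablet has four replicas, the one being deleted is the DECOMMISSION one, and three NORMAL replicas remain, which matches the `redundant replica is deleted` messages.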
###################
fe.warn.log
2024-01-26 16:33:03,856 WARN (thrift-server-pool-107212|109574) [LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 1248704040, signature: -613886856
2024-01-26 16:33:05,867 WARN (thrift-server-pool-107245|109608) [LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 10366, signature: -695093235
2024-01-26 16:33:06,342 WARN (heartbeat mgr|25) [HeartbeatMgr.runAfterCatalogReady():173] get bad heartbeat response: type: BROKER, status: BAD, msg: java.net.ConnectException: Connection refused (Connection refused), name: broker41, host: 10.161.24.41, port: 8000
2024-01-26 16:33:08,308 WARN (thrift-server-pool-107344|109768) [LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 10364, signature: 23382790
2024-01-26 16:33:11,852 WARN (thrift-server-pool-107223|109585) [LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 10365, signature: 843235607
2024-01-26 16:33:39,730 WARN (thrift-server-pool-107511|109972) [LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 1248704040, signature: -613886856
2024-01-26 16:33:45,148 WARN (thrift-server-pool-107489|109949) [LeaderImpl.finishTask():209] cannot find task. type: PUBLISH_VERSION, backendId: 10366, signature: 135523536
2024-01-26 16:33:46,613 WARN (thrift-server-pool-107384|109808) [LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 10366, signature: -695093235
2024-01-26 16:33:50,110 WARN (thrift-server-pool-107497|109957) [LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 10364, signature: 23382790
2024-01-26 16:34:09,286 WARN (thrift-server-pool-107632|110107) [LeaderImpl.finishTask():209] cannot find task. type: PUBLISH_VERSION, backendId: 10366, signature: 135523637
2024-01-26 16:34:16,418 WARN (thrift-server-pool-107115|109477) [LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 10365, signature: 843235607
2024-01-26 16:34:34,920 WARN (thrift-server-pool-107840|110317) [LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 1248704040, signature: -613886856
2024-01-26 16:34:36,774 WARN (thrift-server-pool-107905|110382) [LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 10366, signature: -695093235
2024-01-26 16:34:49,220 WARN (thrift-server-pool-107985|110462) [LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 10364, signature: 23382790
2024-01-26 16:35:01,274 WARN (thrift-server-pool-107105|109467) [LeaderImpl.finishTask():209] cannot find task. type: PUBLISH_VERSION, backendId: 10365, signature: 135523835
2024-01-26 16:35:01,274 WARN (thrift-server-pool-108170|110647) [LeaderImpl.finishTask():209] cannot find task. type: PUBLISH_VERSION, backendId: 10366, signature: 135523835
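The `cannot find task` warnings are easier to read once grouped by task type, backend, and signature. Here is a small Python sketch, assuming the message layout in the excerpt above; the sample lines are copied from it with the timestamp and thread prefix trimmed, and `tally_missing_tasks` is a hypothetical helper:

```python
import re
from collections import Counter

# Lines from the fe.warn.log excerpt, prefix trimmed. The heartbeat line is
# included to show that unrelated warnings are skipped.
SAMPLE = [
    "[LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 1248704040, signature: -613886856",
    "[HeartbeatMgr.runAfterCatalogReady():173] get bad heartbeat response: type: BROKER, status: BAD, msg: java.net.ConnectException: Connection refused (Connection refused), name: broker41, host: 10.161.24.41, port: 8000",
    "[LeaderImpl.finishTask():209] cannot find task. type: UPDATE_TABLET_META_INFO, backendId: 1248704040, signature: -613886856",
    "[LeaderImpl.finishTask():209] cannot find task. type: PUBLISH_VERSION, backendId: 10366, signature: 135523536",
]

TASK_RE = re.compile(
    r"cannot find task\. type: (\w+), backendId: (\d+), signature: (-?\d+)"
)


def tally_missing_tasks(lines):
    """Count 'cannot find task' warnings per (type, backendId, signature)."""
    counts = Counter()
    for line in lines:
        match = TASK_RE.search(line)
        if match is not None:
            counts[match.groups()] += 1
    return counts


counts = tally_missing_tasks(SAMPLE)
for (task_type, backend_id, signature), n in counts.most_common():
    print(f"{task_type} backend={backend_id} signature={signature} count={n}")
```

In the full excerpt, each UPDATE_TABLET_META_INFO signature shows up two or three times within about two minutes, so the same reports keep arriving at the leader without a matching task.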
After changing the JVM setting back, do queries still fail?
I just changed it back to 50g. I'll keep watching.
No more errors now.