
Notes on an Elasticsearch Cluster Repair

source link: https://blog.leixin.wang/b07f71a1.html

0x00 Background

A while ago the company's log server went down, and the dev team complained they couldn't access it. Logging into one of the machines, I found that elasticsearch would not start, the data disk was not mounted, and fstab had no mount entry for it. The cluster runs on Alibaba Cloud, and the failed ES node had had a DHCP-failure remediation script run against it, so I suspected the script had rebooted the server.
After remounting the data disk and starting the ES node, the cluster status was still red, but the node count was back to 3.

$ curl -XGET -u user:passwd 'http://127.0.0.1:9200/_cluster/health?pretty'
{
  "cluster_name" : "es_cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 1124,
  "active_shards" : 1126,
  "relocating_shards" : 0,
  "initializing_shards" : 8,
  "unassigned_shards" : 1178,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 48.702422145328725
}
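The red status and the active_shards_percent_as_number figure are consistent: 1126 active shards out of 1126 + 8 + 1178 = 2312 total is about 48.7%. A quick arithmetic check in Go, using the numbers from the health output above:

```go
package main

import "fmt"

func main() {
	active := 1126.0
	initializing := 8.0
	unassigned := 1178.0
	total := active + initializing + unassigned
	// active_shards_percent_as_number = active / total * 100
	fmt.Printf("%.2f%% of %v shards active\n", active/total*100, total)
	// prints: 48.70% of 2312 shards active
}
```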

0x01 Red to Yellow

After watching for a while, unassigned_shards was not decreasing. A round of Googling pointed to shards failing to allocate, so I used the ES API to find the unassigned ones:

curl -XGET -u user:passwd http://127.0.0.1:9200/_cat/shards  | grep UNASSIGNED

There were a large number of unassigned shards and replicas; in the third column, p marks a primary shard and r a replica.

logstash-2018.12.11 2 r UNASSIGNED
logstash-2018.12.11 1 r UNASSIGNED
logstash-2018.12.11 0 r UNASSIGNED
logstash-2019.05.06 3 p UNASSIGNED
logstash-2019.05.06 3 r UNASSIGNED
logstash-2019.05.06 4 p UNASSIGNED
logstash-2019.05.06 4 r UNASSIGNED
logstash-2019.05.06 2 p UNASSIGNED
logstash-2019.05.06 2 r UNASSIGNED
logstash-2019.05.06 1 p UNASSIGNED
logstash-2019.05.06 1 r UNASSIGNED
logstash-2019.05.06 0 p UNASSIGNED
logstash-2019.05.06 0 r UNASSIGNED
logstash-2019.03.16 4 r UNASSIGNED
logstash-2019.03.16 2 r UNASSIGNED
logstash-2019.03.16 1 r UNASSIGNED
logstash-2019.02.20 3 r UNASSIGNED
logstash-2019.02.20 2 r UNASSIGNED
logstash-2019.02.20 0 r UNASSIGNED
logstash-2019.02.06 3 r UNASSIGNED
logstash-2019.02.06 1 r UNASSIGNED
logstash-2019.02.06 0 r UNASSIGNED
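The filtering that the grep does can also be sketched in Go; this hypothetical helper parses _cat/shards-style lines and keeps only the unassigned primaries (the rows the repair below cares about):

```go
package main

import (
	"fmt"
	"strings"
)

// unassignedPrimaries keeps (index, shard) pairs for lines whose
// shard type is "p" and whose state is "UNASSIGNED".
func unassignedPrimaries(catOutput string) [][2]string {
	var out [][2]string
	for _, line := range strings.Split(catOutput, "\n") {
		f := strings.Fields(line)
		// _cat/shards columns start with: index, shard, prirep, state
		if len(f) >= 4 && f[2] == "p" && f[3] == "UNASSIGNED" {
			out = append(out, [2]string{f[0], f[1]})
		}
	}
	return out
}

func main() {
	sample := `logstash-2018.12.11 2 r UNASSIGNED
logstash-2019.05.06 3 p UNASSIGNED
logstash-2019.05.06 3 r UNASSIGNED`
	for _, s := range unassignedPrimaries(sample) {
		fmt.Println(s[0], s[1])
	}
	// prints: logstash-2019.05.06 3
}
```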

I then tried allocating them to nodes by hand. Primary shards could be allocated, but allocating a replica failed with an "already exists" error.

# Shard allocation API: index and shard correspond to the first two
# columns of the _cat/shards output above; node is the target node name.
curl -H "Content-Type: application/json" -XPOST -u user:passwd http://127.0.0.1:9200/_cluster/reroute -d '{
  "commands" : [ {
    "allocate_empty_primary" : {
      "index" : "logstash-2019.05.06",
      "shard" : 3,
      "node" : "node-1",
      "accept_data_loss" : true
    }
  } ]
}'
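Rather than splicing shell variables into a JSON string, the reroute body can also be built with encoding/json. A minimal sketch (not the original tooling), using the example values from above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rerouteBody builds the _cluster/reroute payload for allocate_empty_primary.
func rerouteBody(index string, shard int, node string) ([]byte, error) {
	cmd := map[string]any{
		"commands": []any{
			map[string]any{
				"allocate_empty_primary": map[string]any{
					"index":            index,
					"shard":            shard,
					"node":             node,
					"accept_data_loss": true,
				},
			},
		},
	}
	return json.Marshal(cmd)
}

func main() {
	b, err := rerouteBody("logstash-2019.05.06", 3, "node-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```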

With the repair command tested OK, I wrote a script to fix all the shards:

#!/bin/bash
# First dump the unassigned primaries into the error_shard file:
#   curl -XGET -u user:passwd http://127.0.0.1:9200/_cat/shards | grep 'p UNASSIGNED' > error_shard
es_arry=(
  "node-1"
  "node-2"
  "node-3"
)
sum=0

while read -r line
do
  index=$(echo "$line" | awk '{print $1}')
  shard=$(echo "$line" | awk '{print $2}')
  sum=$((sum + 1))
  node=$((sum % 3))
  curl -H "Content-Type: application/json" -XPOST -u user:passwd http://127.0.0.1:9200/_cluster/reroute -d '{
    "commands" : [ {
      "allocate_empty_primary" : {
        "index" : "'"${index}"'",
        "shard" : '"${shard}"',
        "node" : "'"${es_arry[node]}"'",
        "accept_data_loss" : true
      }
    } ]
  }'
done < error_shard
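The `sum % 3` arithmetic is what spreads the primaries round-robin across the three nodes. The same mapping in Go, using the node names from the script:

```go
package main

import "fmt"

func main() {
	nodes := []string{"node-1", "node-2", "node-3"}
	// Shard numbers taken from the logstash-2019.05.06 rows above.
	shards := []string{"3", "4", "2", "1", "0"}
	sum := 0
	for _, s := range shards {
		sum++
		// The node index cycles 1, 2, 0, 1, 2, ... just like `sum % 3` in the script.
		fmt.Println("shard", s, "->", nodes[sum%3])
	}
	// prints: shard 3 -> node-2, shard 4 -> node-3, shard 2 -> node-1, ...
}
```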

After repairing all the unassigned shards, the cluster status went from red to yellow.

0x02 A Detour Caused by Deleting Data

After a night of watching, the cluster was still yellow. Checking disk space showed one server at 83% usage and the other two at 87%; the nodes had no room to take on the remaining shards and replicas, which were left hanging unassigned. After some discussion, we decided to keep only one month of data.

curl -u user:password -H 'Content-Type: application/json' -d '{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-30d",
        "format": "epoch_millis"
      }
    }
  }
}' -XPOST "http://127.0.0.1:9200/*-*/_delete_by_query?pretty"

The command above ran for half a day with no visible progress, and then the cluster status was red with a node count of 2. Stunned. The delete had knocked one of the ES nodes over.

0x03 Yellow to Green

After a series of operations, I pulled the dead ES node back into the cluster and got the status from red back to yellow. I then switched to deleting whole indices instead:

curl -X DELETE -u user:passwd 'http://127.0.0.1:9200/logstash-2019.03*'

Manually deleting a large batch of historical indices freed up a lot of disk space, and after a few hours of waiting the cluster returned to green.

0x04 Writing a Program to Delete Old Indices on a Schedule

package main

import (
	"flag"
	"fmt"
	"net/http"
	"strings"
	"time"
)

type values struct {
	index  string
	format string
	day    int
}

func main() {
	f := &values{}
	f.valueFlag()
	now := time.Now()
	// Compute the index date from n days ago.
	t := now.AddDate(0, 0, -f.day)
	err := deleteEsIndex(f.index + t.Format(f.format))
	if err != nil {
		fmt.Println(err)
	}
}

func (v *values) valueFlag() {
	index := flag.String("index", "logstash-", "index name prefix")
	format := flag.String("format", "yyyy.mm.dd", "date format used in index names")
	day := flag.Int("n", 30, "delete data older than n days")
	flag.Parse()
	// Rewrite the yyyy.mm.dd pattern into Go's reference-time layout.
	*format = strings.Replace(*format, "yyyy", "2006", -1)
	*format = strings.Replace(*format, "mm", "01", -1)
	*format = strings.Replace(*format, "dd", "02", -1)
	v.index = *index
	v.format = *format
	v.day = *day
}

func deleteEsIndex(index string) error {
	url := "http://user:passwd@127.0.0.1:9200/" + index
	request, err := http.NewRequest("DELETE", url, nil)
	if err != nil {
		return err
	}
	request.Header.Set("Content-Type", "application/json")
	client := &http.Client{}
	resp, err := client.Do(request)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("delete %s: unexpected status %s", index, resp.Status)
	}
	return nil
}

Other maintenance APIs

  • Set the number of replicas; setting it to 0 disables replicas
    curl -H "Content-Type: application/json" -XPUT -u user:password 'http://127.0.0.1:9200/index/_settings' -d '{"number_of_replicas": 0}'

  • List all indices
    curl -XGET -u user:password http://127.0.0.1:9200/_cat/indices\?v

  • View cluster health details
    curl -XGET -u user:password 'http://127.0.0.1:9200/_cluster/health?pretty'

