The Graph Indexer - Logs and Alarms

Description

Graph node writes logs as it runs, and there are many ways to manage them.

Since we run the Indexer in Kubernetes, we store all cluster logs in Elasticsearch, and we also use Graph node's native Elasticsearch integration to deliver its logs there.

Note: We are in fact using OpenSearch instead of Elasticsearch, but from Graph node's point of view the two look the same.

Graph node index names can't be changed, so we keep them as they are. Fluent Bit is used to deliver the rest of the logs.

| Index | Pods | Delivered | Time field |
| --- | --- | --- | --- |
| `fluentbit-logs-{yyyy.MM.dd}` | indexer-agent, indexer-service, query-node-proxy, query-node | Fluent Bit | `@timestamp` |
| `subgraph-logs` | index-node, query-node | Direct | `timestamp` |
| `block-ingestor-logs` | index-node, query-node | Direct | `timestamp` |
| `index-node-server-logs` | index-node, query-node | Direct | `timestamp` |
| `graphql-server-logs*` | index-node, query-node | Direct | `timestamp` |
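
The daily Fluent Bit index name and the time field to filter on can both be derived from the table above. Here is a minimal Python sketch; the helper names `fluentbit_index_for` and `time_field_for` are hypothetical, and the date format simply mirrors the `{yyyy.MM.dd}` suffix:

```python
from datetime import date

# Time field per index family, copied from the table above.
TIME_FIELDS = {
    "fluentbit-logs": "@timestamp",
    "subgraph-logs": "timestamp",
    "block-ingestor-logs": "timestamp",
    "index-node-server-logs": "timestamp",
    "graphql-server-logs": "timestamp",
}


def fluentbit_index_for(day: date) -> str:
    """Return the daily Fluent Bit index name for the given day."""
    return f"fluentbit-logs-{day.strftime('%Y.%m.%d')}"


def time_field_for(index: str) -> str:
    """Look up the time field for an index name or wildcard pattern."""
    for prefix, field in TIME_FIELDS.items():
        if index.startswith(prefix):
            return field
    raise KeyError(f"unknown index family: {index}")
```

This matters when writing monitors, because a range filter on the wrong time field silently matches nothing.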

Rotation

For log rotation in OpenSearch we use the Index State Management plugin, which performs a rollover for the hardcoded index names and also deletes old logs based on their creation date.
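
As a rough illustration of such a policy, the Python sketch below builds an Index State Management policy body with a rollover state and a delete state. The function name `build_ism_policy`, the ages `1d` and `30d`, and the priority are assumptions made for the example, not the values used in this setup. Rollover also needs a rollover alias configured on the indices, which this sketch does not cover.

```python
import json


def build_ism_policy(index_patterns, rollover_age="1d", delete_age="30d"):
    """Build an ISM policy body: roll over while 'hot', delete once old enough.

    The default ages are illustrative assumptions only.
    """
    return {
        "policy": {
            "description": "Roll over Graph node log indices and delete old ones",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    # Roll over to a new backing index once the index is old enough.
                    "actions": [{"rollover": {"min_index_age": rollover_age}}],
                    # Move to the delete state based on the index creation date.
                    "transitions": [
                        {
                            "state_name": "delete",
                            "conditions": {"min_index_age": delete_age},
                        }
                    ],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to newly created matching indices.
            "ism_template": {"index_patterns": list(index_patterns), "priority": 100},
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_ism_policy(["subgraph-logs*"]), indent=2))
```

The resulting JSON can be sent to the ISM policies endpoint (`PUT _plugins/_ism/policies/<policy_id>`) or pasted into the policy editor in OpenSearch Dashboards.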

Monitors

| # | Monitor | Index | Trigger | Severity | Notifications | Throttling | Description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | query-node-proxy-error | `fluentbit-logs*` | count > 5 | 3 - Medium | Slack | 10 minutes | Query nodes reverse proxy errors |
| 2 | query-node-error | `index-node-server-logs*` | count > 5 | 1 - Highest | Slack | 10 minutes | Query nodes errors |
| 3 | indexer-service-error | `fluentbit-logs*` | count > 5 | 1 - Highest | Slack | 10 minutes | Indexer service errors |
| 4 | indexer-agent-error | `fluentbit-logs*` | count > 5 | 1 - Highest | Slack | 10 minutes | Indexer agent errors |
| 5 | index-node-subgraph-error | `subgraph-logs*` | count > 0 | 1 - Highest | Slack | - | Index node subgraph indexing errors |
| 6 | index-node-error | `subgraph-logs*` | count > 300 | 2 - High | Slack | 10 minutes | Index node errors |
| 7 | graph-node-error | `block-ingestor-logs*` | count > 5 | 2 - High | Slack | 10 minutes | Index node errors |

For more information on how to configure Alerting in OpenSearch Dashboards, please follow the documentation.

There is also a simple step-by-step guide: Kibana email alert - extracting field results.
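
For anyone who prefers to script the monitors instead of clicking through Dashboards, each one can be described as a JSON body for the Alerting API. The sketch below assembles such a body in Python. The function name `build_monitor`, the 5-minute schedule, the `destination_id` value, and the shortened message template are assumptions for illustration. The field names follow my understanding of the query-level monitor format and may differ between OpenSearch versions, so check them against the documentation for yours.

```python
def build_monitor(name, index, query, threshold, severity, destination_id,
                  interval_minutes=5, throttle_minutes=10):
    """Assemble a query-level monitor body with one trigger and one Slack action.

    Pass throttle_minutes=None for a monitor without throttling.
    """
    action = {
        "name": f"{name}-slack",
        # Hypothetical placeholder: the ID of your own Slack destination.
        "destination_id": destination_id,
        "subject_template": {"source": "*{{ctx.monitor.name}}* - {{ctx.periodEnd}}"},
        # Shortened template; the full message bodies are listed per monitor.
        "message_template": {"source": "Count: {{ctx.results.0.hits.total.value}}"},
        "throttle_enabled": throttle_minutes is not None,
    }
    if throttle_minutes is not None:
        action["throttle"] = {"value": throttle_minutes, "unit": "MINUTES"}

    return {
        "type": "monitor",
        "name": name,
        "enabled": True,
        # Assumed schedule: run as often as the query window (now-5m) is long.
        "schedule": {"period": {"interval": interval_minutes, "unit": "MINUTES"}},
        "inputs": [{"search": {"indices": [index], "query": query}}],
        "triggers": [
            {
                "name": f"count > {threshold}",
                "severity": str(severity),
                "condition": {
                    "script": {
                        "source": f"ctx.results[0].hits.total.value > {threshold}",
                        "lang": "painless",
                    }
                },
                "actions": [action],
            }
        ],
    }
```

The body would be posted to the monitors endpoint (`POST _plugins/_alerting/monitors`). Keeping monitors as generated JSON makes it easier to review threshold and severity changes in version control.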

query-node-proxy-error

Index

fluentbit-logs*
Extraction query
{
    "size": 3,
    "query": {
        "bool": {
            "filter": [
                {
                    "match_all": {
                        "boost": 1
                    }
                },
                {
                    "match_phrase": {
                        "kubernetes.container_name": {
                            "query": "query-node-proxy",
                            "slop": 0,
                            "zero_terms_query": "NONE",
                            "boost": 1
                        }
                    }
                },
                {
                    "match_phrase": {
                        "log": {
                            "query": "error",
                            "slop": 0,
                            "zero_terms_query": "NONE",
                            "boost": 1
                        }
                    }
                },
                {
                    "range": {
                        "@timestamp": {
                            "from": "now-5m",
                            "to": "now",
                            "include_lower": true,
                            "include_upper": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "version": true
}

| Trigger name | Severity level | Trigger condition |
| --- | --- | --- |
| count > 5 | 3 - Medium | `ctx.results[0].hits.total.value > 5` |

Message subject

*{{ctx.monitor.name}}* - {{ctx.periodEnd}}

Message

grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger:  {{ctx.trigger.name}}
Count:    {{ctx.results.0.hits.total.value}}
Index:    fluentbit-logs*
Period:   {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.kubernetes.pod_name}} - {{_source.time}}`
```{{_source.log}}```
{{/ctx.results.0.hits.hits}}
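
All the extraction queries in this post share one shape: an optional container filter, a phrase match on the log text, and a range over the last five minutes. The Python sketch below generates that shape. The function name `build_extraction_query` and its parameters are hypothetical, and it uses the short `match_phrase` form and `gte`/`lte` bounds, which should be equivalent to the expanded form with `from`/`to` and both bounds inclusive.

```python
def build_extraction_query(phrase, time_field, log_field="text",
                           container=None, window="5m", size=3):
    """Build a monitor extraction query for a phrase within a recent time window."""
    filters = []
    if container is not None:
        # Only the Fluent Bit indices carry Kubernetes metadata.
        filters.append({"match_phrase": {"kubernetes.container_name": container}})
    # Match the phrase in the log message field ("log" or "text").
    filters.append({"match_phrase": {log_field: phrase}})
    # Restrict to the recent window, both bounds inclusive.
    filters.append({"range": {time_field: {"gte": f"now-{window}", "lte": "now"}}})
    # size limits how many sample hits are available to the message template.
    return {"size": size, "query": {"bool": {"filter": filters}}}
```

For example, `build_extraction_query("error", "@timestamp", log_field="log", container="query-node-proxy")` corresponds to the query-node-proxy-error monitor, and `build_extraction_query("Mapping aborted", "timestamp")` to index-node-subgraph-error.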

query-node-error

Index

index-node-server-logs*
Extraction query
{
    "size": 3,
    "query": {
        "bool": {
            "filter": [
                {
                    "match_phrase": {
                        "text": {
                            "query": "error",
                            "slop": 0,
                            "zero_terms_query": "NONE",
                            "boost": 1
                        }
                    }
                },
                {
                    "range": {
                        "timestamp": {
                            "from": "now-5m",
                            "to": "now",
                            "include_lower": true,
                            "include_upper": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    }
}

| Trigger name | Severity level | Trigger condition |
| --- | --- | --- |
| count > 5 | 1 - Highest | `ctx.results[0].hits.total.value > 5` |

Message subject

*{{ctx.monitor.name}}* - {{ctx.periodEnd}}

Message

grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger:  {{ctx.trigger.name}}
Count:    {{ctx.results.0.hits.total.value}}
Index:    index-node-server-logs*
Period:   {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.timestamp}}`
```{{_source.text}}```
{{/ctx.results.0.hits.hits}}
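
To make the relationship between the trigger condition and the message clearer, here is a small Python sketch that mimics both against a search response. The function names are hypothetical, and the response passed in is assumed to have the usual `hits.total.value` and `hits.hits` layout.

```python
def trigger_fires(results, threshold):
    """Python mirror of the condition ctx.results[0].hits.total.value > threshold."""
    return results[0]["hits"]["total"]["value"] > threshold


def render_hits(results):
    """Mimic the {{#ctx.results.0.hits.hits}} section: two lines per returned hit."""
    fence = "`" * 3  # Slack code-block delimiter, built here to keep this example readable
    lines = []
    for hit in results[0]["hits"]["hits"]:
        src = hit["_source"]
        lines.append(f"`{src['timestamp']}`")
        lines.append(f"{fence}{src['text']}{fence}")
    return "\n".join(lines)
```

Note that `hits.total.value` counts every matching document, while `hits.hits` holds at most `size` documents (3 in these queries), so a notification can report a high count but only show three sample log lines.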

indexer-service-error

Index

fluentbit-logs*
Extraction query
{
    "size": 3,
    "query": {
        "bool": {
            "filter": [
                {
                    "match_all": {
                        "boost": 1
                    }
                },
                {
                    "match_phrase": {
                        "kubernetes.container_name": {
                            "query": "indexer-service",
                            "slop": 0,
                            "zero_terms_query": "NONE",
                            "boost": 1
                        }
                    }
                },
                {
                    "match_phrase": {
                        "log": {
                            "query": "IndexerError",
                            "slop": 0,
                            "zero_terms_query": "NONE",
                            "boost": 1
                        }
                    }
                },
                {
                    "range": {
                        "@timestamp": {
                            "from": "now-5m",
                            "to": "now",
                            "include_lower": true,
                            "include_upper": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "version": true
}

| Trigger name | Severity level | Trigger condition |
| --- | --- | --- |
| count > 5 | 1 - Highest | `ctx.results[0].hits.total.value > 5` |

Message subject

*{{ctx.monitor.name}}* - {{ctx.periodEnd}}

Message

grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger:  {{ctx.trigger.name}}
Count:    {{ctx.results.0.hits.total.value}}
Index:    fluentbit-logs*
Period:   {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.kubernetes.pod_name}} - {{_source.time}}`
```{{_source.log}}```
{{/ctx.results.0.hits.hits}}

indexer-agent-error

Index

fluentbit-logs*
Extraction query
{
    "size": 3,
    "query": {
        "bool": {
            "filter": [
                {
                    "match_all": {
                        "boost": 1
                    }
                },
                {
                    "match_phrase": {
                        "kubernetes.container_name": {
                            "query": "indexer-agent",
                            "slop": 0,
                            "zero_terms_query": "NONE",
                            "boost": 1
                        }
                    }
                },
                {
                    "match_phrase": {
                        "log": {
                            "query": "IndexerError",
                            "slop": 0,
                            "zero_terms_query": "NONE",
                            "boost": 1
                        }
                    }
                },
                {
                    "range": {
                        "@timestamp": {
                            "from": "now-5m",
                            "to": "now",
                            "include_lower": true,
                            "include_upper": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "version": true
}

| Trigger name | Severity level | Trigger condition |
| --- | --- | --- |
| count > 5 | 1 - Highest | `ctx.results[0].hits.total.value > 5` |

Message subject

*{{ctx.monitor.name}}* - {{ctx.periodEnd}}

Message

grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger:  {{ctx.trigger.name}}
Count:    {{ctx.results.0.hits.total.value}}
Index:    fluentbit-logs*
Period:   {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.kubernetes.pod_name}} - {{_source.time}}`
```{{_source.log}}```
{{/ctx.results.0.hits.hits}}

index-node-subgraph-error

Index

subgraph-logs*
Extraction query
{
    "size": 3,
    "query": {
        "bool": {
            "filter": [
                {
                    "match_phrase": {
                        "text": {
                            "query": "Mapping aborted",
                            "slop": 0,
                            "zero_terms_query": "NONE",
                            "boost": 1
                        }
                    }
                },
                {
                    "range": {
                        "timestamp": {
                            "from": "now-5m",
                            "to": "now",
                            "include_lower": true,
                            "include_upper": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    },
    "version": true
}

| Trigger name | Severity level | Trigger condition |
| --- | --- | --- |
| count > 0 | 1 - Highest | `ctx.results[0].hits.total.value > 0` |

Message subject

*{{ctx.monitor.name}}* - {{ctx.periodEnd}}

Message

grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger:  {{ctx.trigger.name}}
Count:    {{ctx.results.0.hits.total.value}}
Index:    subgraph-logs*
Period:   {{ctx.periodStart}} - {{ctx.periodEnd}}
```

{{#ctx.results.0.hits.hits}}
`{{_source.subgraphId}} - {{_source.timestamp}}`
```{{_source.text}}```
{{/ctx.results.0.hits.hits}}

index-node-error

Index

subgraph-logs*
Extraction query
{
    "size": 3,
    "query": {
        "bool": {
            "filter": [
                {
                    "match_phrase": {
                        "text": {
                            "query": "error",
                            "slop": 0,
                            "zero_terms_query": "NONE",
                            "boost": 1
                        }
                    }
                },
                {
                    "range": {
                        "timestamp": {
                            "from": "now-5m",
                            "to": "now",
                            "include_lower": true,
                            "include_upper": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    }
}

| Trigger name | Severity level | Trigger condition |
| --- | --- | --- |
| count > 300 | 1 - Highest | `ctx.results[0].hits.total.value > 300` |

Message subject

*{{ctx.monitor.name}}* - {{ctx.periodEnd}}

Message

grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger:  {{ctx.trigger.name}}
Count:    {{ctx.results.0.hits.total.value}}
Index:    subgraph-logs*
Period:   {{ctx.periodStart}} - {{ctx.periodEnd}}
```

{{#ctx.results.0.hits.hits}}
`{{_source.subgraphId}} - {{_source.timestamp}}`
```{{_source.text}}```
{{/ctx.results.0.hits.hits}}

graph-node-error

Index

block-ingestor-logs*
Extraction query
{
    "size": 3,
    "query": {
        "bool": {
            "filter": [
                {
                    "match_phrase": {
                        "text": {
                            "query": "error",
                            "slop": 0,
                            "zero_terms_query": "NONE",
                            "boost": 1
                        }
                    }
                },
                {
                    "range": {
                        "timestamp": {
                            "from": "now-5m",
                            "to": "now",
                            "include_lower": true,
                            "include_upper": true,
                            "boost": 1
                        }
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1
        }
    }
}

| Trigger name | Severity level | Trigger condition |
| --- | --- | --- |
| count > 5 | 2 - High | `ctx.results[0].hits.total.value > 5` |

Message subject

*{{ctx.monitor.name}}* - {{ctx.periodEnd}}

Message

grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger:  {{ctx.trigger.name}}
Count:    {{ctx.results.0.hits.total.value}}
Index:    block-ingestor-logs*
Period:   {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.timestamp}}`
```{{_source.text}}```
{{/ctx.results.0.hits.hits}}