The Graph Indexer - Logs and Alarms
Description
While running, Graph node writes logs, and there are many different ways to manage them.
Since we run the Indexer in Kubernetes, we store all cluster logs in Elasticsearch and also use Graph node's native way of delivering its logs to Elasticsearch.
Note: in fact we use OpenSearch instead of Elasticsearch, but from Graph node's point of view it looks the same.
Graph node index names cannot be changed, so we keep them as is; Fluent Bit is used to deliver the rest of the logs.
Index | Pods | Delivered | Time field |
---|---|---|---|
fluentbit-logs-{yyyy.MM.dd} | indexer-agent, indexer-service, query-node-proxy, query-node | Fluent Bit | @timestamp |
subgraph-logs | index-node, query-node | Direct | timestamp |
block-ingestor-logs | index-node, query-node | Direct | timestamp |
index-node-server-logs | index-node, query-node | Direct | timestamp |
graphql-server-logs* | index-node, query-node | Direct | timestamp |
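Direct delivery is configured on the Graph node containers themselves. Below is a minimal sketch of the relevant part of an index-node container spec (Kubernetes accepts JSON manifests as well as YAML), assuming Graph node's --elasticsearch-url, --elasticsearch-user and --elasticsearch-password options; the endpoint, user and the OPENSEARCH_PASSWORD variable are placeholders, not our actual values.
```
{
  "name": "index-node",
  "image": "graphprotocol/graph-node",
  "args": [
    "--elasticsearch-url", "https://opensearch.logging.svc:9200",
    "--elasticsearch-user", "graph-node",
    "--elasticsearch-password", "$(OPENSEARCH_PASSWORD)"
  ]
}
```
Here $(OPENSEARCH_PASSWORD) is expanded by Kubernetes from a container environment variable, which can be backed by a Secret.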
Rotation
For log rotation in OpenSearch we use the Index State Management plugin, which performs a rollover for the hardcoded index names and deletes old logs based on the index creation date.
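As an illustration, a minimal ISM policy along these lines could look as follows (Dev Tools request format). The policy name, the 1d rollover age and the 14d retention are assumptions for the sketch, not the values actually deployed; rollover additionally requires the plugins.index_state_management.rollover_alias setting on the managed indices, and the date-suffixed fluentbit-logs indices only need the delete path.
```
PUT _plugins/_ism/policies/graph-node-logs
{
  "policy": {
    "description": "Roll over Graph node log indices daily, delete them after two weeks",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          { "rollover": { "min_index_age": "1d" } }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "14d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [
          { "delete": {} }
        ],
        "transitions": []
      }
    ],
    "ism_template": [
      {
        "index_patterns": ["subgraph-logs*", "block-ingestor-logs*", "index-node-server-logs*", "graphql-server-logs*"],
        "priority": 100
      }
    ]
  }
}
```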
Monitors
# | Monitor | Index | Trigger | Severity | Notifications | Throttling | Description |
---|---|---|---|---|---|---|---|
1 | query-node-proxy-error | fluentbit-logs* | count > 5 | 3 - Medium | Slack | 10 minutes | Query nodes reverse proxy errors |
2 | query-node-error | index-node-server-logs* | count > 5 | 1 - Highest | Slack | 10 minutes | Query nodes errors |
3 | indexer-service-error | fluentbit-logs* | count > 5 | 1 - Highest | Slack | 10 minutes | Indexer service errors |
4 | indexer-agent-error | fluentbit-logs* | count > 5 | 1 - Highest | Slack | 10 minutes | Indexer agent errors |
5 | index-node-subgraph-error | subgraph-logs* | count > 0 | 1 - Highest | Slack | - | Index node subgraph indexing errors |
6 | index-node-error | subgraph-logs* | count > 300 | 2 - High | Slack | 10 minutes | Index node errors |
7 | graph-node-error | block-ingestor-logs* | count > 5 | 2 - High | Slack | 10 minutes | Graph node block ingestor errors |
For more information on how to configure Alerting in OpenSearch Dashboards, please follow the documentation.
A simple step-by-step guide is also available: Kibana email alert - extracting field results.
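The Slack notifications in the table above go through an Alerting destination (in newer OpenSearch versions this is a channel in the Notifications plugin). A minimal sketch of creating such a destination, with the webhook URL as a placeholder:
```
POST _plugins/_alerting/destinations
{
  "name": "indexer-alerts-slack",
  "type": "slack",
  "slack": {
    "url": "https://hooks.slack.com/services/<redacted>"
  }
}
```
The ID returned by this call is what the monitor actions reference as destination_id.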
query-node-proxy-error
Index
fluentbit-logs*
Extraction query
{
"size": 3,
"query": {
"bool": {
"filter": [
{
"match_all": {
"boost": 1
}
},
{
"match_phrase": {
"kubernetes.container_name": {
"query": "query-node-proxy",
"slop": 0,
"zero_terms_query": "NONE",
"boost": 1
}
}
},
{
"match_phrase": {
"log": {
"query": "error",
"slop": 0,
"zero_terms_query": "NONE",
"boost": 1
}
}
},
{
"range": {
"@timestamp": {
"from": "now-5m",
"to": "now",
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
},
"version": true
}
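Before wiring the query into a monitor, it is handy to run it directly against the index pattern and check that it returns the hits you expect; the request below is the same extraction query trimmed to its essential filters:
```
GET fluentbit-logs*/_search
{
  "size": 3,
  "query": {
    "bool": {
      "filter": [
        { "match_phrase": { "kubernetes.container_name": "query-node-proxy" } },
        { "match_phrase": { "log": "error" } },
        { "range": { "@timestamp": { "gte": "now-5m", "lte": "now" } } }
      ]
    }
  }
}
```
hits.total.value in the response is the number the trigger condition below compares against.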
Trigger name | Severity level | Trigger condition |
---|---|---|
count > 5 | 3 - Medium | ctx.results[0].hits.total.value > 5 |
Message subject
*{{ctx.monitor.name}}* - {{ctx.periodEnd}}
Message
grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger: {{ctx.trigger.name}}
Count: {{ctx.results.0.hits.total.value}}
Index: fluentbit-logs*
Period: {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.kubernetes.pod_name}} - {{_source.time}}`
```{{_source.log}}```
{{/ctx.results.0.hits.hits}}
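For reference, the index, extraction query, trigger and message above all end up in a single monitor definition in the Alerting plugin. A rough sketch of how they fit together is shown below; the exact field layout differs between Alerting plugin versions, and the schedule interval and Slack destination ID are placeholders:
```
POST _plugins/_alerting/monitors
{
  "type": "monitor",
  "name": "query-node-proxy-error",
  "enabled": true,
  "schedule": { "period": { "interval": 1, "unit": "MINUTES" } },
  "inputs": [
    {
      "search": {
        "indices": ["fluentbit-logs*"],
        "query": { "...": "the extraction query above goes here" }
      }
    }
  ],
  "triggers": [
    {
      "name": "count > 5",
      "severity": "3",
      "condition": {
        "script": { "source": "ctx.results[0].hits.total.value > 5", "lang": "painless" }
      },
      "actions": [
        {
          "name": "notify-slack",
          "destination_id": "<slack-destination-id>",
          "throttle_enabled": true,
          "throttle": { "value": 10, "unit": "MINUTES" },
          "subject_template": { "source": "*{{ctx.monitor.name}}* - {{ctx.periodEnd}}", "lang": "mustache" },
          "message_template": { "source": "the Message body above goes here", "lang": "mustache" }
        }
      ]
    }
  ]
}
```
The remaining monitors follow the same shape, only with their own index pattern, extraction query, trigger condition and severity.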
query-node-error
Index
index-node-server-logs*
Extraction query
{
"size": 3,
"query": {
"bool": {
"filter": [
{
"match_phrase": {
"text": {
"query": "error",
"slop": 0,
"zero_terms_query": "NONE",
"boost": 1
}
}
},
{
"range": {
"timestamp": {
"from": "now-5m",
"to": "now",
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
}
Trigger name | Severity level | Trigger condition |
---|---|---|
count > 5 | 1 - Highest | ctx.results[0].hits.total.value > 5 |
Message subject
*{{ctx.monitor.name}}* - {{ctx.periodEnd}}
Message
grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger: {{ctx.trigger.name}}
Count: {{ctx.results.0.hits.total.value}}
Index: index-node-server-logs*
Period: {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.timestamp}}`
```{{_source.text}}```
{{/ctx.results.0.hits.hits}}
indexer-service-error
Index
fluentbit-logs*
Extraction query
{
"size": 3,
"query": {
"bool": {
"filter": [
{
"match_all": {
"boost": 1
}
},
{
"match_phrase": {
"kubernetes.container_name": {
"query": "indexer-service",
"slop": 0,
"zero_terms_query": "NONE",
"boost": 1
}
}
},
{
"match_phrase": {
"log": {
"query": "IndexerError",
"slop": 0,
"zero_terms_query": "NONE",
"boost": 1
}
}
},
{
"range": {
"@timestamp": {
"from": "now-5m",
"to": "now",
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
},
"version": true
}
Trigger name | Severity level | Trigger condition |
---|---|---|
count > 5 | 1 - Highest | ctx.results[0].hits.total.value > 5 |
Message subject
*{{ctx.monitor.name}}* - {{ctx.periodEnd}}
Message
grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger: {{ctx.trigger.name}}
Count: {{ctx.results.0.hits.total.value}}
Index: fluentbit-logs*
Period: {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.kubernetes.pod_name}} - {{_source.time}}`
```{{_source.log}}```
{{/ctx.results.0.hits.hits}}
indexer-agent-error
Index
fluentbit-logs*
Extraction query
{
"size": 3,
"query": {
"bool": {
"filter": [
{
"match_all": {
"boost": 1
}
},
{
"match_phrase": {
"kubernetes.container_name": {
"query": "indexer-agent",
"slop": 0,
"zero_terms_query": "NONE",
"boost": 1
}
}
},
{
"match_phrase": {
"log": {
"query": "IndexerError",
"slop": 0,
"zero_terms_query": "NONE",
"boost": 1
}
}
},
{
"range": {
"@timestamp": {
"from": "now-5m",
"to": "now",
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
},
"version": true
}
Trigger name | Severity level | Trigger condition |
---|---|---|
count > 5 | 1 - Highest | ctx.results[0].hits.total.value > 5 |
Message subject
*{{ctx.monitor.name}}* - {{ctx.periodEnd}}
Message
grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger: {{ctx.trigger.name}}
Count: {{ctx.results.0.hits.total.value}}
Index: fluentbit-logs*
Period: {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.kubernetes.pod_name}} - {{_source.time}}`
```{{_source.log}}```
{{/ctx.results.0.hits.hits}}
index-node-subgraph-error
Index
subgraph-logs*
Extraction query
{
"size": 3,
"query": {
"bool": {
"filter": [
{
"match_phrase": {
"text": {
"query": "Mapping aborted",
"slop": 0,
"zero_terms_query": "NONE",
"boost": 1
}
}
},
{
"range": {
"timestamp": {
"from": "now-5m",
"to": "now",
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
},
"version": true
}
Trigger name | Severity level | Trigger condition |
---|---|---|
count > 0 | 1 - Highest | ctx.results[0].hits.total.value > 0 |
Message subject
*{{ctx.monitor.name}}* - {{ctx.periodEnd}}
Message
grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger: {{ctx.trigger.name}}
Count: {{ctx.results.0.hits.total.value}}
Index: subgraph-logs*
Period: {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.subgraphId}} - {{_source.timestamp}}`
```{{_source.text}}```
{{/ctx.results.0.hits.hits}}
index-node-error
Index
subgraph-logs*
Extraction query
{
"size": 3,
"query": {
"bool": {
"filter": [
{
"match_phrase": {
"text": {
"query": "error",
"slop": 0,
"zero_terms_query": "NONE",
"boost": 1
}
}
},
{
"range": {
"timestamp": {
"from": "now-5m",
"to": "now",
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
}
Trigger name | Severity level | Trigger condition |
---|---|---|
count > 300 | 2 - High | ctx.results[0].hits.total.value > 300 |
Message subject
*{{ctx.monitor.name}}* - {{ctx.periodEnd}}
Message
grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger: {{ctx.trigger.name}}
Count: {{ctx.results.0.hits.total.value}}
Index: subgraph-logs*
Period: {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.subgraphId}} - {{_source.timestamp}}`
```{{_source.text}}```
{{/ctx.results.0.hits.hits}}
graph-node-error
Index
block-ingestor-logs*
Extraction query
{
"size": 3,
"query": {
"bool": {
"filter": [
{
"match_phrase": {
"text": {
"query": "error",
"slop": 0,
"zero_terms_query": "NONE",
"boost": 1
}
}
},
{
"range": {
"timestamp": {
"from": "now-5m",
"to": "now",
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
}
Trigger name | Severity level | Trigger condition |
---|---|---|
count > 5 | 2 - High | ctx.results[0].hits.total.value > 5 |
Message subject
*{{ctx.monitor.name}}* - {{ctx.periodEnd}}
Message
grafana.url - please investigate the issue
```
Severity: {{ctx.trigger.severity}}
Trigger: {{ctx.trigger.name}}
Count: {{ctx.results.0.hits.total.value}}
Index: block-ingestor-logs*
Period: {{ctx.periodStart}} - {{ctx.periodEnd}}
```
{{#ctx.results.0.hits.hits}}
`{{_source.timestamp}}`
```{{_source.text}}```
{{/ctx.results.0.hits.hits}}