🚨 Incident Response Runbooks

Version: 1.0.0
Date: January 25, 2026
Статус: PRODUCTION READY


📋 Table of Contents

  1. General Information
  2. Severity Classification
  3. Runbook: High Error Rate
  4. Runbook: High Latency
  5. Runbook: Service Unavailable
  6. Runbook: Database Issues
  7. Runbook: Message Queue Issues
  8. Runbook: Memory/CPU Issues
  9. Post-Incident Process

General Information

On-Call Rotation

| Role | Primary | Secondary |
|------|---------|-----------|
| Platform Engineer | @platform-oncall | @platform-backup |
| Database Admin | @dba-oncall | @dba-backup |
| Security | @security-oncall | @security-backup |

Communication Channels

  • Slack: #incidents (primary), #incidents-war-room (major)
  • PagerDuty: procurement-api service
  • Status Page: status.example.com
  • Video Bridge: meet.example.com/incident-room

Severity Classification

| Severity | Description | Response Time | Example |
|----------|-------------|---------------|---------|
| P1 - Critical | Complete service outage, data loss risk | < 5 min | API down, database corruption |
| P2 - High | Significant degradation, SLO breach | < 15 min | Error rate > 5%, latency > 2s |
| P3 - Medium | Minor degradation, approaching SLO | < 1 hour | Error rate > 1%, latency > 500ms |
| P4 - Low | Minor issue, no user impact | < 4 hours | Log errors, non-critical alerts |
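The severity levels above can be wired directly into paging. Below is a hedged sketch using the PagerDuty Events API v2; the `PD_ROUTING_KEY` variable, the helper names, and the P-level-to-PagerDuty-severity mapping are illustrative assumptions, not part of the official on-call configuration.

```shell
#!/usr/bin/env sh
# Map runbook severity (P1..P4) to a PagerDuty Events API v2 severity.
map_severity() {
  case "$1" in
    P1) echo critical ;;
    P2) echo error ;;
    P3) echo warning ;;
    *)  echo info ;;
  esac
}

# Trigger an incident, but only when a routing key is configured.
# PD_ROUTING_KEY is a placeholder for your service's integration key.
trigger_page() {
  level="$1"; summary="$2"
  [ -z "${PD_ROUTING_KEY:-}" ] && return 0
  curl -s -X POST https://events.pagerduty.com/v2/enqueue \
    -H "Content-Type: application/json" \
    -d "{
      \"routing_key\": \"${PD_ROUTING_KEY}\",
      \"event_action\": \"trigger\",
      \"payload\": {
        \"summary\": \"${summary}\",
        \"source\": \"procurement-api\",
        \"severity\": \"$(map_severity "$level")\"
      }
    }"
}

map_severity P1   # prints "critical"
```

Usage: `trigger_page P2 "Error rate above 5% on procurement-api"` would open a PagerDuty incident at severity `error`, matching the table's P2 row.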

Runbook: High Error Rate

Alert

SLO_SuccessRate_FastBurn_Critical
SLO_SuccessRate_FastBurn_Warning

Symptoms

  • Error rate exceeds 0.5% (SLO threshold)
  • Increased 5xx responses
  • User complaints about failures

Response Time: < 5 minutes (P1) / < 15 minutes (P2)

Step 1: Verify (1 minute)

# 1. Check current error rate
kubectl exec -it prometheus-0 -n monitoring -- \
  promtool query instant '(1 - sli:http_requests_success:ratio_rate5m) * 100'

# 2. Check error breakdown by endpoint
kubectl exec -it prometheus-0 -n monitoring -- \
  promtool query instant 'sum by (path, status) (rate(http_requests_total{status=~"5.."}[5m]))'

# 3. View recent errors in logs
kubectl logs -n procurement -l app=procurement-api --since=5m | grep -i error | tail -50

Step 2: Assess (2 minutes)

Questions to answer:

  • Is it affecting all endpoints or only specific ones?
  • Was there a recent deployment?
  • Is it correlated with external dependencies?
  • Is the database healthy?
  • Is the message queue healthy?
# Check recent deployments
kubectl rollout history deployment/procurement-api -n procurement

# Check dependency health
kubectl logs -n procurement -l app=procurement-api --since=5m | grep -E "connection|timeout|refused"

# Check database connections
kubectl exec -it postgres-0 -n procurement -- psql -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"

Step 3: Respond (2-5 minutes)

Option A: Rollback (if recent deployment)

# Rollback to previous version
kubectl rollout undo deployment/procurement-api -n procurement

# Verify rollback
kubectl rollout status deployment/procurement-api -n procurement

Option B: Scale up (if load-related)

# Increase replicas
kubectl scale deployment/procurement-api -n procurement --replicas=5

# Enable HPA emergency scaling
kubectl patch hpa procurement-api -n procurement \
  -p '{"spec":{"maxReplicas":10}}'

Option C: Enable circuit breaker fallback

# Set environment variable for circuit breaker
kubectl set env deployment/procurement-api -n procurement \
  CIRCUIT_BREAKER_FORCE_OPEN=true

Option D: Block problematic endpoint (temporary)

# Apply rate limiting via Istio (the patch needs a typed_config
# to actually enforce a limit; token bucket values here are examples)
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: emergency-rate-limit
  namespace: procurement
spec:
  workloadSelector:
    labels:
      app: procurement-api
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/udpa.type.v1.TypedStruct
          type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          value:
            stat_prefix: http_local_rate_limiter
            token_bucket:
              max_tokens: 100
              tokens_per_fill: 100
              fill_interval: 1s
            filter_enabled:
              default_value:
                numerator: 100
                denominator: HUNDRED
            filter_enforced:
              default_value:
                numerator: 100
                denominator: HUNDRED
EOF

Step 4: Monitor (5+ minutes)

# Watch error rate recovery
watch -n 5 'kubectl exec prometheus-0 -n monitoring -- \
  promtool query instant "(1 - sli:http_requests_success:ratio_rate5m) * 100"'

Step 5: Notify

# Post to Slack
curl -X POST https://slack.com/api/chat.postMessage \
  -H "Authorization: Bearer $SLACK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "channel": "#incidents",
    "text": "🚨 *High Error Rate Incident*\nStatus: Investigating\nImpact: API error rate elevated\nCurrent rate: X%"
  }'

Runbook: High Latency

Alert

SLO_Latency_P99_Breach
SLO_Latency_P95_Breach

Symptoms

  • P99 latency > 500ms
  • P95 latency > 200ms
  • User complaints about slow responses

Response Time: < 15 minutes

Step 1: Verify

# Check latency percentiles
kubectl exec -it prometheus-0 -n monitoring -- promtool query instant \
  'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="procurement-api"}[5m])) * 1000'

# Check latency by endpoint
kubectl exec -it prometheus-0 -n monitoring -- promtool query instant \
  'histogram_quantile(0.99, sum by (path, le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000'

Step 2: Identify Root Cause

# 1. Check database query times
kubectl logs -n procurement -l app=procurement-api --since=10m | \
  grep -E "query.*ms" | sort -t'=' -k2 -rn | head -20

# 2. Check external service latency (via Jaeger)
# Open: https://jaeger.example.com/search?service=procurement-api&operation=all

# 3. Check resource utilization
kubectl top pods -n procurement -l app=procurement-api

# 4. Check GC pauses
kubectl logs -n procurement -l app=procurement-api --since=10m | grep -i "gc\|garbage"

Step 3: Respond

Option A: Scale horizontally

kubectl scale deployment/procurement-api -n procurement --replicas=5

Option B: Clear cache (caution: expect a brief load spike while the cache repopulates)

kubectl exec -it redis-0 -n procurement -- redis-cli FLUSHDB

Option C: Enable query caching

kubectl set env deployment/procurement-api -n procurement \
  QUERY_CACHE_ENABLED=true \
  QUERY_CACHE_TTL=60

Option D: Disable non-critical features

kubectl set env deployment/procurement-api -n procurement \
  FEATURE_ANALYTICS=false \
  FEATURE_AUDIT_LOG=async

Runbook: Service Unavailable

Alert

SLO_Availability_Breach
KubePodNotReady

Symptoms

  • Health check failures
  • Pods in CrashLoopBackOff
  • No response from API

Response Time: < 5 minutes (P1)

Step 1: Verify

# Check pod status
kubectl get pods -n procurement -l app=procurement-api

# Check events
kubectl get events -n procurement --sort-by='.lastTimestamp' | tail -20

# Check service endpoints
kubectl get endpoints procurement-api -n procurement

Step 2: Diagnose

# Check pod logs for crash reason
kubectl logs -n procurement -l app=procurement-api --previous

# Check pod describe for resource issues
kubectl describe pods -n procurement -l app=procurement-api

# Check if nodes are healthy
kubectl get nodes
kubectl describe nodes | grep -A5 "Conditions:"

Step 3: Respond

Option A: Restart pods

kubectl rollout restart deployment/procurement-api -n procurement

Option B: Scale to healthy nodes

# Find unhealthy node
kubectl get pods -n procurement -o wide

# Cordon unhealthy node
kubectl cordon <node-name>

# Evict pods from node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

Option C: Emergency deployment from backup image

kubectl set image deployment/procurement-api -n procurement \
  procurement-api=procurement-api:v2.0.0-stable

Runbook: Database Issues

Alert

PostgreSQLDown
PostgreSQLHighConnections
PostgreSQLSlowQueries

Step 1: Verify

# Check database pods
kubectl get pods -n procurement -l app=postgres

# Check connections
kubectl exec -it postgres-0 -n procurement -- psql -c \
  "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Check active queries
kubectl exec -it postgres-0 -n procurement -- psql -c \
  "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
   FROM pg_stat_activity
   WHERE state = 'active' AND query NOT LIKE '%pg_stat_activity%'
   ORDER BY duration DESC LIMIT 10;"

Step 2: Respond

Kill long-running queries:

kubectl exec -it postgres-0 -n procurement -- psql -c \
  "SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'active'
   AND pid <> pg_backend_pid()
   AND query_start < now() - interval '5 minutes';"

Increase connection limit temporarily:

kubectl exec -it postgres-0 -n procurement -- psql -c \
  "ALTER SYSTEM SET max_connections = 200;"

# Note: max_connections is not reloadable; pg_reload_conf() will not apply it.
# Restart PostgreSQL (the StatefulSet recreates the pod) to pick up the change.
kubectl delete pod postgres-0 -n procurement
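When connections pile up, it often helps to see which sessions are blocked and by whom before terminating anything. A hedged sketch using `pg_blocking_pids()` (available in PostgreSQL 9.6+); the pod and namespace names follow the examples above.

```shell
# SQL to list blocked sessions and the PIDs blocking them (PostgreSQL 9.6+).
BLOCKERS_SQL="SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;"

# Run it inside the database pod (requires cluster access):
# kubectl exec -it postgres-0 -n procurement -- psql -c "$BLOCKERS_SQL"
```

Once a specific blocker is identified, it can be terminated with pg_terminate_backend(<pid>) rather than killing every long-running query.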

Runbook: Message Queue Issues

Alert

KafkaConsumerLag
RabbitMQQueueBacklog

Step 1: Verify

# Check Kafka consumer lag
kubectl exec -it kafka-0 -n messaging -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group procurement-consumers

# Check RabbitMQ queues
kubectl exec -it rabbitmq-0 -n messaging -- \
  rabbitmqctl list_queues name messages consumers

Step 2: Respond

Scale consumers:

kubectl scale deployment/procurement-worker -n procurement --replicas=5

Purge stuck queue (DANGEROUS):

kubectl exec -it rabbitmq-0 -n messaging -- \
  rabbitmqctl purge_queue procurement.dead-letter

Runbook: Memory/CPU Issues

Alert

SLO_CPUSaturation_High
SLO_MemorySaturation_High
ContainerOOMKilled

Step 1: Verify

# Check resource usage
kubectl top pods -n procurement -l app=procurement-api

# Check for OOM kills
kubectl get events -n procurement | grep -i oom

# Check container limits
kubectl describe deployment procurement-api -n procurement | grep -A5 "Limits:"

Step 2: Respond

Increase limits:

kubectl patch deployment procurement-api -n procurement -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"procurement-api","resources":{"limits":{"memory":"2Gi","cpu":"1000m"}}}]}}}}'

Scale horizontally:

kubectl scale deployment/procurement-api -n procurement --replicas=5

Post-Incident Process

Immediate (within 24 hours)

  1. Update status page to "Resolved"
  2. Notify stakeholders via email
  3. Create incident ticket in Jira
  4. Preserve evidence:

    # Save logs
    kubectl logs -n procurement -l app=procurement-api --since=2h > /tmp/incident-logs.txt
    
    # Save metrics
    curl "http://prometheus:9090/api/v1/query_range?query=http_requests_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" > /tmp/incident-metrics.json
    

Within 48 hours

  1. Schedule Post-Incident Review (PIR) meeting
  2. Write incident timeline
  3. Identify root cause
  4. Create action items
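The "Error budget consumed" figure in the PIR can be computed from the SLO target and the measured success ratio. A minimal sketch, assuming the 99.5% availability target implied by the 0.5% error-rate threshold above; the measured value and the 30-day query are illustrative.

```shell
# Error budget consumed = (1 - measured_success) / (1 - SLO_target).
# Pull the real ratio from Prometheus, e.g.:
#   promtool query instant 'avg_over_time(sli:http_requests_success:ratio_rate5m[30d])'
SLO_TARGET=0.995
MEASURED=0.992   # hypothetical 30-day success ratio

awk -v slo="$SLO_TARGET" -v m="$MEASURED" \
  'BEGIN { printf "Error budget consumed: %.0f%%\n", (1 - m) / (1 - slo) * 100 }'
# Prints: Error budget consumed: 160%
```

A result over 100% means the window's entire error budget is spent, which should be called out explicitly in the PIR's Impact section.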

PIR Template

# Post-Incident Review: [Incident Title]

**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**Severity**: P1/P2/P3/P4
**Author**: @name

## Summary

[Brief description of what happened]

## Timeline

- HH:MM - [Event]
- HH:MM - [Alert fired]
- HH:MM - [Response action]
- HH:MM - [Resolution]

## Impact

- Users affected: X
- Revenue impact: $X
- Error budget consumed: X%

## Root Cause

[Description of root cause]

## What Went Well

- [Item 1]
- [Item 2]

## What Could Be Improved

- [Item 1]
- [Item 2]

## Action Items

- [ ] [Action] - Owner: @name - Due: YYYY-MM-DD
- [ ] [Action] - Owner: @name - Due: YYYY-MM-DD

## Lessons Learned

[Key takeaways]

Quick Reference Commands

# View all alerts
kubectl exec -it prometheus-0 -n monitoring -- promtool query instant 'ALERTS{alertstate="firing"}'

# Check SLO status
kubectl exec -it prometheus-0 -n monitoring -- promtool query instant 'sli:http_requests_success:ratio_rate5m'

# Emergency rollback
kubectl rollout undo deployment/procurement-api -n procurement

# Emergency scale
kubectl scale deployment/procurement-api -n procurement --replicas=10

# View recent events
kubectl get events -n procurement --sort-by='.lastTimestamp' | tail -30

# Check all pod logs
stern -n procurement procurement-api --since=10m

# Port forward to debug
kubectl port-forward -n procurement svc/procurement-api 8080:8080

Last Updated: January 25, 2026
Next Review: April 25, 2026
Owner: Platform Engineering Team