# 🚨 Incident Response Runbooks

Version: 1.0.0
Date: January 25, 2026
Status: PRODUCTION READY

## 📋 Table of Contents
- General Information
- Severity Classification
- Runbook: High Error Rate
- Runbook: High Latency
- Runbook: Service Unavailable
- Runbook: Database Issues
- Runbook: Message Queue Issues
- Runbook: Memory/CPU Issues
- Post-Incident Process
## General Information

### On-Call Rotation
| Role | Primary | Secondary |
|------|---------|-----------|
| Platform Engineer | @platform-oncall | @platform-backup |
| Database Admin | @dba-oncall | @dba-backup |
| Security | @security-oncall | @security-backup |
### Communication Channels

- Slack: #incidents (primary), #incidents-war-room (major)
- PagerDuty: procurement-api service
- Status Page: status.example.com
- Video Bridge: meet.example.com/incident-room
## Severity Classification

| Severity | Description | Response Time | Example |
|----------|-------------|---------------|---------|
| P1 - Critical | Complete service outage, data loss risk | < 5 min | API down, database corruption |
| P2 - High | Significant degradation, SLO breach | < 15 min | Error rate > 5%, latency > 2s |
| P3 - Medium | Minor degradation, approaching SLO | < 1 hour | Error rate > 1%, latency > 500ms |
| P4 - Low | Minor issue, no user impact | < 4 hours | Log errors, non-critical alerts |
## Runbook: High Error Rate

### Alert

```
SLO_SuccessRate_FastBurn_Critical
SLO_SuccessRate_FastBurn_Warning
```
### Symptoms

- Error rate exceeds 0.5% (SLO threshold)
- Increased 5xx responses
- User complaints about failures

**Response Time**: < 5 minutes (P1) / < 15 minutes (P2)
### Step 1: Verify (1 minute)

```bash
# 1. Check current error rate
kubectl exec -it prometheus-0 -n monitoring -- \
  promtool query instant '(1 - sli:http_requests_success:ratio_rate5m) * 100'

# 2. Check error breakdown by endpoint
kubectl exec -it prometheus-0 -n monitoring -- \
  promtool query instant 'sum by (path, status) (rate(http_requests_total{status=~"5.."}[5m]))'

# 3. View recent errors in logs
kubectl logs -n procurement -l app=procurement-api --since=5m | grep -i error | tail -50
```
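The SLO expression above is just `(1 - success_ratio) * 100`; as a sanity check, the arithmetic can be reproduced locally. The request counts below are invented for illustration:

```shell
# Hypothetical counts for a 5-minute window (not from a real cluster)
total=100000
success=99400

# Same formula as the Prometheus expression: (1 - success/total) * 100
error_pct=$(awk -v s="$success" -v t="$total" 'BEGIN { printf "%.2f", (1 - s / t) * 100 }')
echo "error rate: ${error_pct}%"
# 0.60% exceeds the 0.5% SLO threshold, so the alert should be firing
```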
### Step 2: Assess (2 minutes)

Questions to answer: was there a recent deployment, are dependencies healthy, and is the database saturated?

```bash
# Check recent deployments
kubectl rollout history deployment/procurement-api -n procurement

# Check dependency health
kubectl logs -n procurement -l app=procurement-api --since=5m | grep -E "connection|timeout|refused"

# Check database connections
kubectl exec -it postgres-0 -n procurement -- psql -c \
  "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
```
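The assessment reduces to a simple decision tree over the response options in Step 3. A hypothetical sketch — the conditions, their order, and the option labels are illustrative, not an automated tool:

```shell
# Map assessment findings to a response option (illustrative decision tree)
choose_response() {
  recent_deploy=$1   # "yes"/"no": did a rollout happen recently?
  load_spike=$2      # "yes"/"no": is traffic well above baseline?
  dep_failing=$3     # "yes"/"no": is a downstream dependency erroring?
  if [ "$recent_deploy" = "yes" ]; then
    echo "Option A: rollback"
  elif [ "$load_spike" = "yes" ]; then
    echo "Option B: scale up"
  elif [ "$dep_failing" = "yes" ]; then
    echo "Option C: circuit breaker fallback"
  else
    echo "Option D: rate-limit the problematic endpoint"
  fi
}

choose_response no yes no
```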
### Step 3: Respond (2-5 minutes)

**Option A: Rollback (if recent deployment)**

```bash
# Rollback to previous version
kubectl rollout undo deployment/procurement-api -n procurement

# Verify rollback
kubectl rollout status deployment/procurement-api -n procurement
```

**Option B: Scale up (if load-related)**

```bash
# Increase replicas
kubectl scale deployment/procurement-api -n procurement --replicas=5

# Raise the HPA ceiling for emergency scaling
kubectl patch hpa procurement-api -n procurement \
  -p '{"spec":{"maxReplicas":10}}'
```

**Option C: Enable circuit breaker fallback**

```bash
# Force the circuit breaker open via environment variable
kubectl set env deployment/procurement-api -n procurement \
  CIRCUIT_BREAKER_FORCE_OPEN=true
```
**Option D: Block problematic endpoint (temporary)**

```bash
# Apply rate limiting via Istio
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: emergency-rate-limit
  namespace: procurement
spec:
  workloadSelector:
    labels:
      app: procurement-api
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.local_ratelimit
EOF
```
### Step 4: Monitor (5+ minutes)

```bash
# Watch error rate recovery
watch -n 5 'kubectl exec -it prometheus-0 -n monitoring -- \
  promtool query instant "(1 - sli:http_requests_success:ratio_rate5m) * 100"'
```
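The "FastBurn" alerts above are burn-rate alerts: burn rate = observed error ratio ÷ error budget (1 − SLO target). A quick check of the arithmetic, assuming a 99.5% success SLO (the observed error ratio below is hypothetical):

```shell
slo=0.995   # assumed SLO target (matches the 0.5% error threshold above)
err=0.02    # hypothetical observed error ratio (2%)

# Burn rate: how many times faster than allowed the budget is being spent
burn=$(awk -v e="$err" -v s="$slo" 'BEGIN { printf "%.1f", e / (1 - s) }')
echo "burn rate: ${burn}x"
# At 4.0x, a 30-day error budget would be exhausted in about 7.5 days
```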
### Step 5: Notify

```bash
# Post to Slack
curl -X POST https://slack.com/api/chat.postMessage \
  -H "Authorization: Bearer $SLACK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "channel": "#incidents",
    "text": "🚨 *High Error Rate Incident*\nStatus: Investigating\nImpact: API error rate elevated\nCurrent rate: X%"
  }'
```
## Runbook: High Latency

### Alert

```
SLO_Latency_P99_Breach
SLO_Latency_P95_Breach
```
### Symptoms

- P99 latency > 500ms
- P95 latency > 200ms
- User complaints about slow responses

**Response Time**: < 15 minutes
### Step 1: Verify

```bash
# Check latency percentiles
kubectl exec -it prometheus-0 -n monitoring -- promtool query instant \
  'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="procurement-api"}[5m])) * 1000'

# Check latency by endpoint
kubectl exec -it prometheus-0 -n monitoring -- promtool query instant \
  'histogram_quantile(0.99, sum by (path, le) (rate(http_request_duration_seconds_bucket[5m]))) * 1000'
```
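`histogram_quantile` finds the bucket containing the target rank and linearly interpolates within it. The mechanics can be reproduced on made-up cumulative bucket counts:

```shell
# Cumulative bucket counts for one endpoint (le = upper bound in seconds);
# the numbers are invented for illustration:
#   le=0.1 -> 900   le=0.25 -> 950   le=0.5 -> 990   le=1.0 -> 1000
p99=$(awk 'BEGIN {
  n = 4
  le[1] = 0.1;  c[1] = 900
  le[2] = 0.25; c[2] = 950
  le[3] = 0.5;  c[3] = 990
  le[4] = 1.0;  c[4] = 1000
  rank = 0.99 * c[n]              # target rank: the 990th observation
  for (i = 1; i <= n; i++) if (c[i] >= rank) break
  lo  = (i > 1) ? le[i-1] : 0     # lower bound of the containing bucket
  clo = (i > 1) ? c[i-1]  : 0
  # linear interpolation inside the bucket, as histogram_quantile does
  q = lo + (le[i] - lo) * (rank - clo) / (c[i] - clo)
  printf "%.0f", q * 1000
}')
echo "p99 ~ ${p99} ms"
```

This is also why coarse buckets make percentiles imprecise: the result can never be more accurate than the bucket boundaries.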
### Step 2: Identify Root Cause

```bash
# 1. Check database query times
kubectl logs -n procurement -l app=procurement-api --since=10m | \
  grep -E "query.*ms" | sort -t'=' -k2 -rn | head -20

# 2. Check external service latency (via Jaeger)
# Open: https://jaeger.example.com/search?service=procurement-api&operation=all

# 3. Check resource utilization
kubectl top pods -n procurement -l app=procurement-api

# 4. Check GC pauses
kubectl logs -n procurement -l app=procurement-api --since=10m | grep -i "gc\|garbage"
```
### Step 3: Respond

**Option A: Scale horizontally**

```bash
kubectl scale deployment/procurement-api -n procurement --replicas=5
```

**Option B: Clear cache (destructive: drops every key in the current Redis DB)**

```bash
kubectl exec -it redis-0 -n procurement -- redis-cli FLUSHDB
```

**Option C: Enable query caching**

```bash
kubectl set env deployment/procurement-api -n procurement \
  QUERY_CACHE_ENABLED=true \
  QUERY_CACHE_TTL=60
```

**Option D: Disable non-critical features**

```bash
kubectl set env deployment/procurement-api -n procurement \
  FEATURE_ANALYTICS=false \
  FEATURE_AUDIT_LOG=async
```
## Runbook: Service Unavailable

### Alert

```
SLO_Availability_Breach
KubePodNotReady
```

### Symptoms

- Health check failures
- Pods in CrashLoopBackOff
- No response from API

**Response Time**: < 5 minutes (P1)
### Step 1: Verify

```bash
# Check pod status
kubectl get pods -n procurement -l app=procurement-api

# Check events
kubectl get events -n procurement --sort-by='.lastTimestamp' | tail -20

# Check service endpoints
kubectl get endpoints procurement-api -n procurement
```
### Step 2: Diagnose

```bash
# Check pod logs for crash reason
kubectl logs -n procurement -l app=procurement-api --previous

# Check pod describe for resource issues
kubectl describe pods -n procurement -l app=procurement-api

# Check if nodes are healthy
kubectl get nodes
kubectl describe nodes | grep -A5 "Conditions:"
```
### Step 3: Respond

**Option A: Restart pods**

```bash
kubectl rollout restart deployment/procurement-api -n procurement
```

**Option B: Move pods off an unhealthy node**

```bash
# Find which node the failing pods are on
kubectl get pods -n procurement -o wide

# Cordon the unhealthy node
kubectl cordon <node-name>

# Evict pods from the node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```

**Option C: Emergency deployment from backup image**

```bash
kubectl set image deployment/procurement-api -n procurement \
  procurement-api=procurement-api:v2.0.0-stable
```
## Runbook: Database Issues

### Alert

```
PostgreSQLDown
PostgreSQLHighConnections
PostgreSQLSlowQueries
```
### Step 1: Verify

```bash
# Check database pods
kubectl get pods -n procurement -l app=postgres

# Check connections
kubectl exec -it postgres-0 -n procurement -- psql -c \
  "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Check active queries
kubectl exec -it postgres-0 -n procurement -- psql -c \
  "SELECT pid, now() - pg_stat_activity.query_start AS duration, query
   FROM pg_stat_activity
   WHERE state = 'active' AND query NOT LIKE '%pg_stat_activity%'
   ORDER BY duration DESC LIMIT 10;"
```
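When deciding whether the connection count is actually a problem, compare it against `max_connections`. The percentages below use invented numbers; in practice, substitute the counts returned by the queries above:

```shell
# Hypothetical values: 180 connections reported by pg_stat_activity,
# max_connections = 200 on the server
active=180
max=200

pct=$(awk -v a="$active" -v m="$max" 'BEGIN { printf "%d", a * 100 / m }')
echo "connection utilization: ${pct}%"
# Sustained utilization near the limit means new clients will start
# getting "too many connections" errors
```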
### Step 2: Respond

**Kill long-running queries:**

```bash
kubectl exec -it postgres-0 -n procurement -- psql -c \
  "SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state = 'active'
   AND query_start < now() - interval '5 minutes';"
```

**Increase connection limit temporarily:**

```bash
# Note: max_connections is a restart-only parameter; pg_reload_conf() will
# NOT apply it — the new value takes effect at the next PostgreSQL restart.
kubectl exec -it postgres-0 -n procurement -- psql -c \
  "ALTER SYSTEM SET max_connections = 200; SELECT pg_reload_conf();"
```
## Runbook: Message Queue Issues

### Alert

```
KafkaConsumerLag
RabbitMQQueueBacklog
```

### Step 1: Verify

```bash
# Check Kafka consumer lag
kubectl exec -it kafka-0 -n messaging -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group procurement-consumers

# Check RabbitMQ queues
kubectl exec -it rabbitmq-0 -n messaging -- \
  rabbitmqctl list_queues name messages consumers
```
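In the `kafka-consumer-groups.sh` output, lag per partition is simply `LOG-END-OFFSET - CURRENT-OFFSET`. A sketch summing it across partitions — the topic name and offsets below are fabricated sample output:

```shell
# Fabricated sample of `kafka-consumer-groups.sh --describe` output
sample='TOPIC              PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
procurement.events 0         1000           1500           500
procurement.events 1         2000           2200           200'

# Sum LOG-END-OFFSET - CURRENT-OFFSET over the data rows (skip the header)
total_lag=$(printf '%s\n' "$sample" | awk 'NR > 1 { sum += $4 - $3 } END { print sum }')
echo "total lag: ${total_lag} messages"
# Lag that keeps growing (rather than a stable plateau) is the signal
# to scale consumers in Step 2
```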
### Step 2: Respond

**Scale consumers:**

```bash
kubectl scale deployment/procurement-worker -n procurement --replicas=5
```

**Purge stuck queue (DANGEROUS — purged messages are irrecoverably deleted):**

```bash
kubectl exec -it rabbitmq-0 -n messaging -- \
  rabbitmqctl purge_queue procurement.dead-letter
```
## Runbook: Memory/CPU Issues

### Alert

```
SLO_CPUSaturation_High
SLO_MemorySaturation_High
ContainerOOMKilled
```

### Step 1: Verify

```bash
# Check resource usage
kubectl top pods -n procurement -l app=procurement-api

# Check for OOM kills
kubectl get events -n procurement | grep -i oom

# Check container limits
kubectl describe deployment procurement-api -n procurement | grep -A5 "Limits:"
```
### Step 2: Respond

**Increase limits:**

```bash
kubectl patch deployment procurement-api -n procurement -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"procurement-api","resources":{"limits":{"memory":"2Gi","cpu":"1000m"}}}]}}}}'
```

**Scale horizontally:**

```bash
kubectl scale deployment/procurement-api -n procurement --replicas=5
```
## Post-Incident Process

### Immediately after resolution

- Update status page to "Resolved"
- Notify stakeholders via email
- Create incident ticket in Jira
- Preserve evidence:

```bash
# Save logs
kubectl logs -n procurement -l app=procurement-api --since=2h > /tmp/incident-logs.txt

# Save metrics
curl "http://prometheus:9090/api/v1/query_range?query=http_requests_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" > /tmp/incident-metrics.json
```
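Note that `date -d '2 hours ago'` is a GNU coreutils extension and fails on BSD/macOS `date`. If the on-call laptop is a Mac, plain epoch arithmetic produces the same query window portably:

```shell
# Portable 2-hour query window: epoch arithmetic instead of GNU `date -d`
end=$(date +%s)
start=$((end - 7200))   # 7200 seconds = 2 hours
echo "query window: start=${start} end=${end}"
# Substitute ${start}/${end} into the query_range URL above
```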
### Within 48 hours

- Schedule Post-Incident Review (PIR) meeting
- Write incident timeline
- Identify root cause
- Create action items
### PIR Template

```markdown
# Post-Incident Review: [Incident Title]

**Date**: YYYY-MM-DD
**Duration**: X hours Y minutes
**Severity**: P1/P2/P3/P4
**Author**: @name

## Summary

[Brief description of what happened]

## Timeline

- HH:MM - [Event]
- HH:MM - [Alert fired]
- HH:MM - [Response action]
- HH:MM - [Resolution]

## Impact

- Users affected: X
- Revenue impact: $X
- Error budget consumed: X%

## Root Cause

[Description of root cause]

## What Went Well

- [Item 1]
- [Item 2]

## What Could Be Improved

- [Item 1]
- [Item 2]

## Action Items

- [ ] [Action] - Owner: @name - Due: YYYY-MM-DD
- [ ] [Action] - Owner: @name - Due: YYYY-MM-DD

## Lessons Learned

[Key takeaways]
```
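The "Error budget consumed" line in the template can be computed from the incident duration. A sketch assuming a 99.5% availability SLO over a 30-day window and a hypothetical 54-minute full outage:

```shell
slo=0.995                      # assumed availability target
window_min=$((30 * 24 * 60))   # 43200 minutes in a 30-day window

# Monthly error budget in minutes: (1 - SLO) * window
budget_min=$(awk -v s="$slo" -v w="$window_min" 'BEGIN { printf "%.0f", (1 - s) * w }')

incident_min=54                # hypothetical full-outage duration
consumed=$(awk -v i="$incident_min" -v b="$budget_min" 'BEGIN { printf "%.0f", i * 100 / b }')

echo "budget: ${budget_min} min/month, consumed by this incident: ${consumed}%"
```

Partial outages consume the budget proportionally to the fraction of failed requests, so the real figure is usually lower than this worst case.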
## Quick Reference Commands

```bash
# View all alerts
kubectl exec -it prometheus-0 -n monitoring -- promtool query instant 'ALERTS{alertstate="firing"}'

# Check SLO status
kubectl exec -it prometheus-0 -n monitoring -- promtool query instant 'sli:http_requests_success:ratio_rate5m'

# Emergency rollback
kubectl rollout undo deployment/procurement-api -n procurement

# Emergency scale
kubectl scale deployment/procurement-api -n procurement --replicas=10

# View recent events
kubectl get events -n procurement --sort-by='.lastTimestamp' | tail -30

# Check all pod logs (requires stern)
stern -n procurement procurement-api --since=10m

# Port forward to debug
kubectl port-forward -n procurement svc/procurement-api 8080:8080
```
---

**Last Updated**: January 25, 2026
**Next Review**: April 25, 2026
**Owner**: Platform Engineering Team