The SOC analyst stares at 2,400 alerts from the previous 24 hours. The SIEM has been running for three years. Nobody knows exactly why half the rules were created. Last Tuesday, a real ransomware pre-staging event sat in the queue for six hours before anyone noticed — buried under 340 “Medium” severity alerts about failed RDP logins from a server that hasn’t been patched since 2021. This is not a technology problem. It is a process problem. Here is how to fix it.
The Alert Fatigue Death Spiral
Alert fatigue follows a predictable pattern. A new threat emerges, so someone adds a detection rule. The rule produces false positives. Nobody tunes it because tuning takes time and there are already 3,000 more alerts to review. Analysts start skipping the noisy rule. A true positive eventually fires, gets missed, and becomes an incident. In the post-mortem, someone says “we need more alerts.” The spiral tightens.
According to IBM’s 2025 Cost of a Data Breach report, organizations with high SOC alert volume take on average 22 days longer to identify and contain breaches compared to organizations with well-tuned detection. The false positive rate in most enterprise SIEMs runs between 40% and 85%, depending on tuning maturity.
Step 1: Audit Your Alert Inventory
Before building a triage playbook, you need to know what you are actually working with. Run this audit against your SIEM:
# Splunk — find rules with high volume and no recorded analyst action
# (lots of alerts, nobody is doing anything with them)
index=notable
| stats count, values(action_taken) as actions by rule_name
| eval has_actions = if(isnull(actions) OR actions="", 0, 1)
| where count > 50 AND has_actions = 0
| sort -count
| head 20
# Elastic/Kibana — similar approach via the Elasticsearch query DSL
# Count alerts closed per rule over the last 30 days (a proxy for false
# positives if your workflow closes them rather than escalating)
GET .alerts-security.alerts-default/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-30d" } }
  },
  "aggs": {
    "by_rule": {
      "terms": { "field": "kibana.alert.rule.name", "size": 20 },
      "aggs": {
        "closed": {
          "filter": { "term": { "kibana.alert.workflow_status": "closed" } }
        }
      }
    }
  }
}
# Wazuh — pull weekly event volume from the manager API
# (per-rule hit counts live in the wazuh-alerts-* indices and can be
#  aggregated there with the same query-DSL approach as above)
curl -k -X GET "https://wazuh-manager:55000/manager/stats/weekly" \
  -H "Authorization: Bearer $TOKEN" | python3 -m json.tool
The output will likely be uncomfortable. You will find rules that have never generated a true positive. Rules that are 90%+ false positives. Rules that nobody has reviewed in years. That is your starting point.
Step 2: Classify Every Alert by MITRE ATT&CK Stage and Business Impact
Not all alerts are equal. A detection for T1003 (OS Credential Dumping) on a domain controller is categorically different from the same rule firing on a developer’s laptop during a CTF. The triage playbook must account for context.
Build a classification matrix for every active detection rule:
# Rule classification schema (store this in your SIEM or a CSV/database)
{
    "rule_id": "SIGMA-1042",
    "rule_name": "Mimikatz via LSASS Access",
    "mitre_tactic": "Credential Access",
    "mitre_technique": "T1003.001",
    "kill_chain_stage": "post-exploitation",
    "asset_criticality_multiplier": {
        "tier1_dc": 5,
        "tier2_server": 3,
        "tier3_workstation": 1
    },
    "baseline_fpr": 0.12,           # 12% false positive rate after tuning
    "required_sla_minutes": 15,     # SLA for first triage action
    "auto_actions": ["isolate_if_score_gt_8", "page_oncall_if_dc"],
    "triage_steps": "triage-playbook/T1003.001.md"
}
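To make the matrix usable during triage, keep it machine-readable and join it against incoming alerts. A minimal sketch, assuming the records above live as a JSON array in a hypothetical rule_classification.json and that each alert carries a rule_id and an asset tier (both names are assumptions, not a standard):
# Sketch only: the file name, field names, and tiers below are assumptions;
# adapt to however your SIEM exposes rule metadata.
import json

def load_classification(path="rule_classification.json"):
    """Load the classification matrix into a dict keyed by rule_id."""
    with open(path) as f:
        return {rec["rule_id"]: rec for rec in json.load(f)}

def asset_adjusted_severity(alert, matrix, base_severity):
    """Scale a rule's base severity by the tier of the asset it fired on."""
    rec = matrix.get(alert["rule_id"])
    if rec is None:
        return base_severity  # unclassified rule; flag it for the weekly review
    mult = rec["asset_criticality_multiplier"].get(alert["asset_tier"], 1)
    return base_severity * mult

matrix = load_classification()
alert = {"rule_id": "SIGMA-1042", "asset_tier": "tier1_dc"}
print(asset_adjusted_severity(alert, matrix, base_severity=7))  # 7 * 5 = 35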
Step 3: Build the Triage Decision Tree
Every alert type needs a documented decision tree that an analyst can follow in under five minutes for initial triage. Here is a complete example for one of the most common alert types — “Suspicious PowerShell Execution”:
# Triage Playbook: Suspicious PowerShell Execution (T1059.001)
# Estimated triage time: 3-5 minutes
# SLA: Initial triage within 30 minutes
## STEP 1: Context Check (60 seconds)
Questions to answer:
- Who ran it? (user account — service account or human?)
- What asset? (DC / Server / Workstation / VDI?)
- What time? (business hours or 2am Sunday?)
- Is this asset in the "PowerShell baseline" list? (see: assets/ps_baseline.csv)
## STEP 2: Command Analysis (90 seconds)
# Pull the full command from SIEM:
index=main EventCode=4104 | search ScriptBlockText="*" | table _time, ComputerName, UserName, ScriptBlockText
# RED FLAGS — escalate immediately if any present:
RED = [
"DownloadString", # download cradle
"IEX", # invoke expression
"EncodedCommand", # -enc flag
"Invoke-Mimikatz", # credential dumping
"Bypass", # UAC/AMSI bypass
"-WindowStyle Hidden", # hidden window
"Net.WebClient", # web download
"FromBase64String", # encoded payload
]
# ORANGE FLAGS — investigate further:
ORANGE = [
"Set-MpPreference", # defender modification
"New-ScheduledTask", # persistence
"reg add", # registry modification
"net user", # user enumeration/creation
]
## STEP 3: Decision
IF RED flags present:
→ Escalate to P1, isolate host if Tier 1 or Tier 2
→ Notify incident commander
→ Preserve a memory image for forensics (e.g., winpmem or your EDR's acquisition feature)
IF ORANGE flags present:
→ Set alert to P2, investigate user's recent activity (last 24h)
→ Check if change ticket exists in ITSM
→ Assign to analyst for 30-minute investigation window
IF no flags, baseline user/asset:
→ Close as false positive
→ Add to suppression rule with 90-day expiry
→ Log suppression decision with justification
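If you want the keyword checks to run before an analyst ever opens the alert, the RED and ORANGE lists above translate directly into a first-pass classifier. A minimal sketch, assuming the 4104 script block text has already been pulled from the SIEM; treating flag-free but unbaselined alerts as P3 is an assumption, not part of the playbook:
# Sketch of the flag check from the playbook above; verdicts map to Step 3.
RED = ["DownloadString", "IEX", "EncodedCommand", "Invoke-Mimikatz",
       "Bypass", "-WindowStyle Hidden", "Net.WebClient", "FromBase64String"]
ORANGE = ["Set-MpPreference", "New-ScheduledTask", "reg add", "net user"]

def triage_powershell(script_block: str, is_baselined: bool = False) -> str:
    """Return a first-pass verdict for one EventCode 4104 script block."""
    text = script_block.lower()
    if any(flag.lower() in text for flag in RED):
        return "P1-escalate"            # red flag: isolate + notify IC
    if any(flag.lower() in text for flag in ORANGE):
        return "P2-investigate"         # orange flag: 30-minute investigation
    if is_baselined:
        return "close-false-positive"   # known-good user/asset combination
    return "P3-review"                  # no flags, but not baselined either

print(triage_powershell("IEX (New-Object Net.WebClient).DownloadString('http://...')"))
# -> P1-escalate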
Step 4: Risk Scoring — Prioritize with Math, Not Gut Feel
A risk score should combine multiple factors so that the “real” alerts rise to the top automatically. Here is a scoring formula that works well in practice:
# Risk Score Formula
# Score = (Base Severity * Asset Criticality) + Context Modifiers
# Base Severity (from rule definition): 1-10
# Asset Criticality: DC=5, Server=3, Workstation=1
#
# Context Modifiers (additive):
# +3 if alert involves an admin account
# +3 if same asset had another alert in last 2 hours (correlation)
# +2 if alert fires outside business hours
# +2 if asset has no EDR installed
# +2 if rule is in MITRE stages: Collection, Exfiltration, Command & Control
# -2 if user has an open change request in ITSM
# -3 if this exact CommandLine is in the suppression baseline
#
# Thresholds:
# >= 18: Critical — P1, auto-isolate, page on-call
# 12-17: High — P2, analyst must action within 15 min
# 7-11: Medium — P3, review within 2 hours
# <= 6: Low — P4, review at end of shift
# Splunk implementation (covers base score, criticality, time and admin modifiers):
index=notable
| eval base_score = case(urgency="critical",10, urgency="high",7, urgency="medium",4, true(),1)
| lookup asset_criticality.csv asset_name AS dest OUTPUTNEW criticality
| eval criticality_mult = case(criticality="dc",5, criticality="server",3, true(),1)
| eval time_mod = if(date_hour < 8 OR date_hour > 18, 2, 0)
| eval admin_mod = if(match(user, "(?i)admin|svc_|service"), 3, 0)
| eval risk_score = (base_score * criticality_mult) + time_mod + admin_mod
| sort -risk_score
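The Splunk search above only applies the time and admin modifiers; the full formula is easier to read in one place. A sketch in Python, for example inside a SOAR enrichment step, where every alert field name is illustrative rather than any particular SIEM's schema:
# Sketch of the full scoring formula; the alert dict fields are assumed names.
def risk_score(alert: dict) -> int:
    base = alert["base_severity"]                      # 1-10 from the rule definition
    crit = {"dc": 5, "server": 3}.get(alert.get("asset_type"), 1)
    score = base * crit
    if alert.get("is_admin_account"):
        score += 3
    if alert.get("related_alerts_2h", 0) > 0:          # correlation on the same asset
        score += 3
    if alert.get("outside_business_hours"):
        score += 2
    if not alert.get("edr_installed", True):
        score += 2
    if alert.get("mitre_tactic") in {"Collection", "Exfiltration", "Command and Control"}:
        score += 2
    if alert.get("open_change_request"):
        score -= 2
    if alert.get("commandline_baselined"):
        score -= 3
    return score

def priority(score: int) -> str:
    if score >= 18:
        return "P1"   # auto-isolate, page on-call
    if score >= 12:
        return "P2"   # analyst action within 15 min
    if score >= 7:
        return "P3"   # review within 2 hours
    return "P4"       # review at end of shift

example = {"base_severity": 7, "asset_type": "dc",
           "is_admin_account": True, "outside_business_hours": True}
print(risk_score(example), priority(risk_score(example)))   # 40 P1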
Step 5: Automation — Let the SIEM Do the Repetitive Work
For alerts with a false positive rate above 70% and a clear, automatable verification step, build automated triage actions. Here are three high-value automation examples:
# Automation 1: Auto-close "Failed Login" alerts when source is known scanner
# (Qualys, Tenable, Nessus scanner IPs produce hundreds of these per day)
IF alert.rule == "Failed_Login_Brute_Force"
AND alert.src_ip IN scanner_ip_list
AND alert.dest NOT IN critical_assets
THEN: close(reason="Known scanner activity", suppress_hours=24)
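Expressed as a SOAR playbook step, the same logic is a short function. A sketch, where close_alert() and the two lists are placeholders for your platform's close action and asset inventory:
# Sketch of Automation 1; the lists and close_alert() are placeholders.
SCANNER_IPS = {"10.0.5.20", "10.0.5.21"}                  # Qualys/Tenable/Nessus
CRITICAL_ASSETS = {"dc01.corp.local", "dc02.corp.local"}

def maybe_auto_close(alert: dict, close_alert) -> bool:
    """Auto-close brute-force alerts triggered by known vulnerability scanners."""
    if (alert["rule"] == "Failed_Login_Brute_Force"
            and alert["src_ip"] in SCANNER_IPS
            and alert["dest"] not in CRITICAL_ASSETS):
        close_alert(alert, reason="Known scanner activity", suppress_hours=24)
        return True
    return False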
# Automation 2: Auto-enrich and score before analyst sees it
# Uses SOAR (Splunk SOAR / Palo Alto XSOAR / TheHive/Cortex)
# The lookup helpers below stand in for your platform's integration actions
def auto_enrich(alert):
    # Geo-IP lookup
    alert.src_geo = geoip_lookup(alert.src_ip)
    # Threat intel check
    alert.ioc_score = virustotal_lookup(alert.src_ip)
    # Asset lookup from the CMDB
    alert.asset_owner = cmdb_lookup(alert.dest_ip)
    alert.asset_criticality = cmdb_get_criticality(alert.dest_ip)
    # Recent activity on the same asset (correlation signal)
    alert.similar_alerts_24h = siem_count_similar(alert.rule, alert.dest_ip, hours=24)
    return calculate_risk_score(alert)
# Automation 3: Slack notification template for P1 alerts
P1_SLACK_TEMPLATE = """
:rotating_light: *P1 ALERT — Immediate Action Required*
*Rule:* {rule_name}
*Asset:* {dest_ip} ({asset_name}) — Criticality: {criticality}
*User:* {user}
*Time:* {alert_time} (UTC)
*Risk Score:* {risk_score}
*MITRE:* {mitre_technique}
*Analyst Assigned:* {oncall_analyst}
*SIEM Link:* {siem_url}
*Playbook:* {playbook_link}
"""
Step 6: The Weekly Tuning Ritual
Hold a recurring 30-minute meeting every week dedicated to SIEM tuning. The agenda is always the same:
First, review the top 5 highest-volume rules from the past week and check their true positive rate. If a rule generated more than 100 alerts and zero were actioned, it needs either tuning or disabling. Second, review any alerts that were missed or triaged late — what were the blockers? Third, check for rules that have not fired in 90 days — they may be broken (misconfigured data source, changed log format) rather than genuinely quiet. Fourth, review suppression rules that are expiring — do they still make sense?
# Weekly tuning report query (Splunk)
# Run this before the weekly meeting
index=notable earliest=-7d
| stats count as total_alerts,
sum(eval(if(status="resolved" AND close_reason!="false positive",1,0))) as true_positives,
sum(eval(if(close_reason="false positive",1,0))) as false_positives
by rule_name
| eval fpr = round((false_positives/total_alerts)*100, 1)
| eval tpr = round((true_positives/total_alerts)*100, 1)
| where total_alerts > 20
| sort -fpr
| table rule_name, total_alerts, true_positives, false_positives, fpr, tpr
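The first agenda item can be made mechanical by feeding each row of the report into a small check. A sketch, assuming the row fields mirror the query output above:
# Sketch: flag rules for the tuning agenda based on last week's numbers.
def tuning_recommendation(row: dict) -> str:
    """row mirrors the report fields: total_alerts, true_positives, false_positives."""
    if row["total_alerts"] > 100 and row["true_positives"] == 0:
        return "tune or disable"            # a full week of pure noise
    if row["false_positives"] / row["total_alerts"] > 0.9:
        return "tune (FPR above 90%)"
    return "keep"

print(tuning_recommendation({"total_alerts": 430, "true_positives": 0,
                             "false_positives": 410}))   # tune or disable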
The Results: What Good Looks Like
A mature SOC with a well-tuned alert pipeline should target these metrics: actionable alert rate above 60% (meaning 60% of alerts that reach an analyst are real or require investigation), mean time to triage under 15 minutes for High severity alerts, false positive rate below 30% across the board, and less than 5% of rules generating more than 20% of alert volume (the “noisy rule” problem).
These numbers are achievable. They require disciplined tuning, documented playbooks, and a culture where closing a false positive with a suppression rule is seen as a win — not a failure to detect. The goal is not zero alerts. The goal is for every alert that reaches an analyst to be worth their time.