How Downtime Data Should Be Structured for CNC Shops

Matt Ulepic
3 days ago
9 min read

How Downtime Data Should Be Structured

Most “downtime reports” aren’t wrong because the shop lacks dashboards—they’re wrong because the underlying data can’t be audited. If your ERP says a job ran fine but the floor remembers repeated stops, the argument usually isn’t about effort. It’s about structure: missing state changes, inconsistent thresholds, and timestamps that don’t reconcile across shifts.

For CNC job shops running 10–50 machines across multiple shifts, the goal isn’t perfect categorization on day one. The goal is a minimal, enforceable downtime data model that produces comparable reports week over week—so morning meetings move from “what happened?” to “what are we doing about it?” If you need a high-level overview of why visibility matters, start with machine downtime tracking.

TL;DR — How Downtime Data Should Be Structured

Start with an auditable machine state timeline; downtime events are computed from it.
Use a small, consistent state set (RUN/STOP/FAULT/SETUP/OFF/UNKNOWN) to prevent interpretation drift.
Store non-negotiable fields: start/end timestamps, machine identity, derived classification, attribution, and audit trail.
Apply monotonic timestamp rules (no gaps/overlaps) and split events cleanly at shift boundaries for fair comparisons.
Make microstop and downtime thresholds explicit and versioned so reports stay comparable across weeks and machines.
Roll up fragmented stop patterns (tool change/load cycles) so Pareto analysis stays usable.
Treat network/controller dropouts as UNKNOWN with clear rules—never silently count them as RUN or blame operators.

Key takeaway: structured downtime data is a control system. A single, gap-free state timeline plus explicit rules (thresholds, shift splits, UNKNOWN handling) turns “ERP vs reality” debates into consistent, shift-comparable loss buckets. When short stops, gaps, and edits are traceable, you stop chasing anecdotes and start recovering hidden capacity before spending on new equipment.

Start with a single source of truth: the machine state timeline

Downtime reporting has to be derived from one thing: an auditable sequence of machine states over time. Manual methods—whiteboards, operator notes, end-of-shift estimates, or “we’ll fix it in the ERP later”—fail at scale because they reconstruct history after the fact. On a 20–50 machine floor with multiple shifts, that reconstruction becomes selective memory, and your “totals” become arguable.

A minimal state model for CNC job shops is usually enough:

RUN: cutting/cycling (or otherwise producing) per your shop’s definition.
STOP (or STARVED/BLOCKED): not running, no alarm present, production is possible but not happening.
FAULT/ALARM: stopped with an alarm condition that must be cleared.
SETUP: planned non-cycle work that you don’t want misclassified as “downtime” when quoting capacity.
OFF/NO_SCHEDULE: not planned to run (breaks, weekends, unscheduled time).
UNKNOWN: data not trustworthy (network drop, controller heartbeat lost, or ambiguous state).

A state is the machine’s classification at a given moment, and it holds until the next state change on the timeline. A downtime event is a computed segment (start/end) created by grouping one or more non-RUN states according to rules you define. This distinction matters because it keeps the raw timeline stable: operations can decide how to summarize it for daily management without rewriting the history that engineering follow-up depends on.
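As a minimal sketch of that distinction, assume state changes arrive as (timestamp, state) pairs. The names Segment, to_segments, and downtime_events are hypothetical, and treating only STOP and FAULT as downtime is an assumed policy choice, not something a particular monitoring product prescribes. The raw segments are never modified; events are computed from them.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Segment:
    state: str
    start: datetime
    end: datetime

    @property
    def duration_s(self) -> float:
        return (self.end - self.start).total_seconds()


def to_segments(changes, timeline_end):
    """Turn (timestamp, state) changes into contiguous segments.

    Each segment ends exactly where the next one starts; the last one
    ends at timeline_end (for example, the end of the report window).
    """
    changes = sorted(changes, key=lambda c: c[0])
    segments = []
    for i, (ts, state) in enumerate(changes):
        end = changes[i + 1][0] if i + 1 < len(changes) else timeline_end
        segments.append(Segment(state, ts, end))
    return segments


def downtime_events(segments, downtime_states=frozenset({"STOP", "FAULT"})):
    """Group adjacent downtime-state segments into computed events.

    Returns a list of (start, end, states) tuples. Which states count
    as downtime is a policy choice (assumed here, not prescribed).
    """
    events = []
    current = None
    for seg in segments:
        if seg.state in downtime_states:
            if current is None:
                current = [seg.start, seg.end, [seg.state]]
            else:
                current[1] = seg.end
                current[2].append(seg.state)
        elif current is not None:
            events.append(tuple(current))
            current = None
    if current is not None:
        events.append(tuple(current))
    return events
```

Because the events are derived, changing the grouping rule later only changes the summary, not the stored timeline.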

Every downstream metric—utilization leakage patterns, shift comparisons, stop counts, and “where did the time go?”—depends on clean transitions: no overlaps, no unaccounted time, and no casual mixing of planned non-production (OFF/NO_SCHEDULE, SETUP) into unplanned STOP/FAULT. If planned time is polluted, the shop often concludes it “needs more machines” when it really needs cleaner control of gaps and short stops. For broader context on the systems that collect these signals, see machine monitoring systems.

Required downtime data fields (the non-negotiables)

To make downtime reports comparable across machines, shifts, and weeks, you need a small set of fields that you treat as non-optional. If any of these are missing, you’ll feel it as “why don’t these totals match?” or “we can’t trust that shift comparison.”

Core timestamp fields

event_start_ts, event_end_ts
duration_s (computed, not manually typed)
time_zone and clock_source (controller clock vs server clock)

Identity fields

machine_id (stable, unique) and optionally cell/department
controller/source_id (which device generated the state signal)
job_id, operation_id, part_program when available (useful but not always present on legacy equipment)

Classification fields

state_type (derived from the timeline; don’t rely on free-text)
planned_flag (planned vs unplanned classification at the event level)
downtime_reason_code (optional at capture, required for closure)
reason_category (high-level bucket for consistency)

Attribution fields

operator_id (who was logged in or assigned)
shift_id (and shift schedule version if it changes)
supervisor/team (helps when accountability is team-based)

Data quality and audit trail

event_confidence (e.g., high/medium/low based on data completeness)
data_source (controller, gateway, manual entry)
edited_flag, edited_by, edit_ts (so changes are visible, not silent)
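One way to hold these fields together is a single event record. The sketch below is illustrative only: the class name DowntimeEvent, the defaults, and the can_close helper are assumptions, and optional identity fields such as job_id or supervisor/team are left out for brevity. The point it shows is that duration_s is computed from the two timestamps rather than stored as a typed-in value.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DowntimeEvent:
    # Core timestamps; duration is derived, never typed in.
    event_start_ts: datetime
    event_end_ts: datetime
    time_zone: str
    clock_source: str              # e.g. "controller" or "server"
    # Identity
    machine_id: str
    source_id: str
    # Classification
    state_type: str
    planned_flag: bool
    reason_category: Optional[str] = None
    downtime_reason_code: Optional[str] = None   # optional at capture
    # Attribution
    operator_id: Optional[str] = None
    shift_id: Optional[str] = None
    # Data quality and audit trail
    event_confidence: str = "high"
    data_source: str = "controller"
    edited_flag: bool = False
    edited_by: Optional[str] = None
    edit_ts: Optional[datetime] = None

    def __post_init__(self):
        if self.event_end_ts <= self.event_start_ts:
            raise ValueError("event_end_ts must be after event_start_ts")

    @property
    def duration_s(self) -> float:
        return (self.event_end_ts - self.event_start_ts).total_seconds()

    def can_close(self) -> bool:
        # A reason code is optional at capture but required for closure.
        return self.downtime_reason_code is not None
```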

This is where many shops feel the ERP vs actual machine behavior gap the most: the ERP can hold a job status, but it typically can’t provide an auditable state sequence with consistent thresholds, shift splits, and edit history. If your goal is recovering capacity with clean utilization data—not more meetings—these fields are the foundation. For how that structured time rolls into capacity views, see machine utilization tracking software.

Timestamp rules: preventing gaps, overlaps, and ‘argument time’

If your timestamps aren’t governed, you’ll get reports where the day doesn’t sum to the day. That’s when the team starts negotiating totals instead of acting on them. Three practical rules prevent most disputes.

1) Monotonic event logic

Use a single ordering rule: when a new state starts, the prior state ends at that exact timestamp. In other words, next_state_start_ts = prior_state_end_ts. If there is time you genuinely can’t account for (network drop, controller offline), record it explicitly as UNKNOWN rather than leaving a gap or guessing.
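A small check can enforce this rule on stored data. The sketch below assumes each machine’s timeline is available as (start, end, state) tuples; the function name find_timeline_problems is hypothetical.

```python
from datetime import datetime


def find_timeline_problems(segments):
    """Check one machine's segments for gaps and overlaps.

    segments: list of (start, end, state) tuples.
    Returns a list of (kind, prior_end, next_start) tuples, where kind
    is "gap" or "overlap". An empty list means the timeline is contiguous.
    """
    ordered = sorted(segments, key=lambda s: s[0])
    problems = []
    for prior, nxt in zip(ordered, ordered[1:]):
        prior_end, next_start = prior[1], nxt[0]
        if next_start > prior_end:
            problems.append(("gap", prior_end, next_start))
        elif next_start < prior_end:
            problems.append(("overlap", prior_end, next_start))
    return problems
```

Any reported gap is a candidate for an explicit UNKNOWN segment; any overlap points to a capture or clock problem worth fixing at the source.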

2) Cross-shift boundary handling

For fair shift reporting, split events at shift change. Preserve the relationship by keeping a parent_event_id (the original interruption) and creating child segments that align to each shift’s start/end. This prevents one shift from “owning” a stoppage just because it began one minute before the horn.
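The split itself is a clipping operation. The sketch below assumes shifts are supplied as non-overlapping (shift_id, shift_start, shift_end) tuples that fully cover the event; the function name and the dictionary keys other than parent_event_id and shift_id are hypothetical.

```python
from datetime import datetime


def split_at_shift_boundaries(parent_event_id, start, end, shifts):
    """Split one event into child segments, one per shift it touches.

    shifts: list of (shift_id, shift_start, shift_end), non-overlapping.
    Each child keeps the parent_event_id so totals can be reconciled.
    """
    children = []
    for shift_id, shift_start, shift_end in sorted(shifts, key=lambda s: s[1]):
        seg_start = max(start, shift_start)   # clip to the shift window
        seg_end = min(end, shift_end)
        if seg_start < seg_end:               # event overlaps this shift
            children.append({
                "parent_event_id": parent_event_id,
                "shift_id": shift_id,
                "start": seg_start,
                "end": seg_end,
                "duration_s": (seg_end - seg_start).total_seconds(),
            })
    return children
```

When the shifts cover the whole event, the child durations sum to the parent duration, which is the reconciliation check to run before publishing shift totals.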

3) Clock synchronization and late-arriving data

Mixed fleets introduce mixed clocks. If possible, store both controller-reported time and server-received time, and record the clock source used for reporting. For late-arriving data (buffered messages, temporary gateway outage), avoid silently rewriting finalized shift totals. Mark affected segments with a low confidence flag, and keep an audit trail of what changed and when.

Consider the network-drop case: if a network drop or controller heartbeat loss creates “unknown” gaps, you need rules so those minutes aren’t quietly counted as RUN (false capacity) or automatically blamed on operators (false accountability). UNKNOWN is not a failure; it’s an honest classification that protects the integrity of the rest of your report.

Event thresholds: capturing real loss without drowning in microstops

Without thresholds, CNC downtime data turns into noise: dozens of tiny stops that inflate counts, fragment categories, and make Pareto charts useless. With overly aggressive thresholds, you hide utilization leakage—the small interruptions that quietly consume capacity across a shift.

Define (at minimum) two thresholds tied to how your processes behave:

microstop_threshold: interruptions you want to count separately for engineering (often 10–60s) but not let dominate daily downtime totals.
downtime_threshold: interruptions you treat as “real downtime” for management reporting (often 60–180s).
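The two thresholds can be read in more than one way, so whichever reading you pick should be written down. The sketch below assumes one possible interpretation: stops shorter than microstop_threshold are microstops, stops at or above downtime_threshold are downtime, and anything in between is a "short_stop" bucket. The bucket names, the function name, and the default values are all hypothetical.

```python
def classify_stop(duration_s, microstop_threshold_s=30, downtime_threshold_s=120):
    """Bucket a stop by duration under one assumed reading of the thresholds.

    < microstop_threshold_s            -> "microstop" (counted for engineering)
    < downtime_threshold_s             -> "short_stop" (in-between bucket)
    >= downtime_threshold_s            -> "downtime" (management reporting)
    """
    if microstop_threshold_s >= downtime_threshold_s:
        raise ValueError("microstop threshold must be below downtime threshold")
    if duration_s < microstop_threshold_s:
        return "microstop"
    if duration_s < downtime_threshold_s:
        return "short_stop"
    return "downtime"
```

Whatever boundaries you choose, keeping them in one function (or one config record) is what makes two shifts’ numbers comparable.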

Then add roll-up logic that reflects actual work patterns. Example: consecutive short STOP segments separated by brief RUN can be merged into one operational interruption using a merge_window (for instance, merge if RUN between stops is less than a defined window). This is critical for machines that bounce between RUN and STOP repeatedly during tool changes and part loading. Without roll-up, you get 20–60 tiny events that mask the true “setup/adjust” or “material handling” bucket.

Store both views:

raw_stop_time (what the controller timeline shows)
reporting_classified_time (after thresholds/merges)

Make thresholds explicit fields and version them. This directly addresses a common multi-shift scenario: Shift B “looks worse” because it logs every 20-second interruption while Shift A ignores them (or the supervisor closes them differently). If thresholding and merging rules aren’t consistent—and recorded—shift comparisons become misleading, and the shop chases people instead of patterns.
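A versioned threshold record can be as simple as the sketch below. It assumes each version carries an effective_from timestamp and that a report looks up the version in effect when the event started; the class name, the function name, and the effective_from field are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ThresholdVersion:
    threshold_version_id: str
    effective_from: datetime
    microstop_threshold_s: int
    downtime_threshold_s: int
    merge_window_s: int


def version_in_effect(versions, ts):
    """Return the most recent version whose effective_from is <= ts."""
    candidates = [v for v in versions if v.effective_from <= ts]
    if not candidates:
        raise LookupError(f"no threshold version in effect at {ts}")
    return max(candidates, key=lambda v: v.effective_from)
```

Storing the resulting threshold_version_id on each event lets you segment or re-run a comparison when the rules change, instead of silently mixing two definitions of downtime in one chart.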

Reason coding structure: design for speed, not perfection

Reason coding should reduce decision time, not create operator paperwork. The structure should be simple enough that supervisors will actually use it, while still giving you stable categories for weekly reviews.

Use a two-level taxonomy, plus an explicit interim value:

reason_category: high-level bucket (e.g., Setup/Adjustment, Material, Quality, Program, Maintenance, Operator Not Present)
reason_code: the specific reason (kept short and controlled)
unknown/unassigned: a valid interim state, not a failure

Operationally, a closure workflow works best: allow the system to capture the stop event immediately, even if no reason is provided in the moment, but require reason assignment within a defined time window (for example, by end of shift or within 24 hours). This prevents “everything becomes Unknown” while avoiding the trap of forcing operators to pick from a long list mid-crisis.

When alarms are involved, consider separating symptom vs root cause fields. The alarm might be “spindle overload,” but the true cause could be a dull tool, wrong feed, or chip evacuation issue. Keeping these distinct reduces mislabeling without turning your downtime system into an investigation database.

Finally, apply governance: reason codes change over time. Store effective and retired dates so historical reports don’t drift when you rename or restructure codes. (Deep taxonomy design is its own topic; this article stays focused on the storage and governance mechanics.)

Data quality checks that keep reports trustworthy

Once the schema exists, Ops leaders need a short list of checks that catch bad structure early. These checks keep the system actionable across a mixed fleet and multiple shifts, and they prevent slow erosion of trust.

Completeness checks

% time in UNKNOWN by machine and by shift (spikes often indicate connectivity or mapping issues).
% events without reason after X hours (prevents permanent “unassigned”).
Missing operator_id / shift_id where attribution is expected.
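The first completeness check is a straightforward aggregation. The sketch below assumes segments are available as dictionaries with machine_id, shift_id, state, and duration_s keys; the function name unknown_share is hypothetical.

```python
from collections import defaultdict


def unknown_share(segments):
    """Percent of recorded time in UNKNOWN, per (machine_id, shift_id).

    segments: iterable of dicts with machine_id, shift_id, state, duration_s.
    """
    total = defaultdict(float)
    unknown = defaultdict(float)
    for seg in segments:
        key = (seg["machine_id"], seg["shift_id"])
        total[key] += seg["duration_s"]
        if seg["state"] == "UNKNOWN":
            unknown[key] += seg["duration_s"]
    return {k: 100.0 * unknown[k] / total[k] for k in total if total[k] > 0}
```

What counts as a spike is a local decision; the useful part is tracking the same number per machine and per shift every week.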

Consistency checks

Overlapping events on the same machine.
Negative or zero durations where they shouldn’t exist.
Excessive event counts per hour (a sign of fragmentation from poor roll-up rules).

Comparability checks

Threshold version drift across machines (microstop/downtime thresholds must be consistent or explicitly segmented).
Different planned-time rules by department (OFF/NO_SCHEDULE vs STOP definitions).

Auditability check

Pick any downtime Pareto bar or “top loss” bucket and ensure you can trace it back to exact state transitions and timestamps. If you can’t, your team will eventually stop trusting the report—especially when it conflicts with what supervisors saw on the floor.

A quick diagnostic: take one week of data and ask, “Can we fairly compare Shift A vs Shift B on the same cell without caveats?” If the answer depends on who closed events, or on which threshold rules were applied, you’ve found the exact friction that keeps downtime tracking from becoming a capacity recovery tool.

Worked examples: structuring three common stoppages (and what breaks)

Example 1: tool change + part load cycles (fragmentation vs roll-up)

Scenario: a machine cycles between RUN and STOP repeatedly during tool changes and part loading. Raw state changes (controller-derived) over roughly 8 minutes look like this:

Timestamp	State	Note
09:12:10	RUN	Cutting
09:14:02	STOP	Tool change begins
09:14:25	RUN	Short cycle
09:14:41	STOP	Part load / adjust
09:15:05	RUN	Short cycle
09:15:21	STOP	Load/adjust continues
09:20:30	RUN	Back in production

If you naively create a downtime event for every STOP segment, your shift report shows many short events, making it hard to rank losses. Instead, apply thresholding and roll-up:

microstop_threshold (hypothetical): 30 seconds
downtime_threshold (hypothetical): 120 seconds
merge_window (hypothetical): merge STOP segments if RUN between them is < 45 seconds

Resulting computed event (what you store as an event record) can be a single interruption:

event_start_ts: 09:14:02
event_end_ts: 09:20:30
state_type (derived): STOP
planned_flag: depends on whether this is categorized as SETUP vs unplanned STOP (policy choice, but must be consistent)
shift_id, operator_id captured for each segment; edited_flag available if a supervisor reclassifies later
threshold_version_id stored so the report remains comparable next month

In a shift summary, this becomes: fewer events, clearer minutes, and a Pareto that surfaces the true bucket (setup/adjust vs load/handling) instead of burying it under dozens of fragments.
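The roll-up in this example can be reproduced with a short merge routine. The sketch below uses the STOP segments from the table and the hypothetical 45-second merge_window; the function name merge_stops is an assumption, and it presumes the only state between the stops is RUN.

```python
from datetime import datetime


def merge_stops(stops, merge_window_s):
    """Merge STOP segments separated by less than merge_window_s of RUN.

    stops: list of (start, end) tuples. Returns merged (start, end) tuples.
    """
    merged = []
    for start, end in sorted(stops):
        if merged and (start - merged[-1][1]).total_seconds() < merge_window_s:
            # Short RUN between stops: extend the previous interruption.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def t(hms):
    return datetime.fromisoformat("2024-01-01T" + hms)  # date is arbitrary


stops = [
    (t("09:14:02"), t("09:14:25")),   # tool change begins
    (t("09:14:41"), t("09:15:05")),   # part load / adjust
    (t("09:15:21"), t("09:20:30")),   # load/adjust continues
]
events = merge_stops(stops, merge_window_s=45)

raw_stop_time_s = sum((e - s).total_seconds() for s, e in stops)        # 356.0
reporting_time_s = sum((e - s).total_seconds() for s, e in events)      # 388.0
```

The two totals differ by the 32 seconds of short RUN cycles absorbed into the merged interruption, which is why both the raw and the reporting views are worth storing.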

Example 2: one stoppage across a shift change (fair attribution without changing totals)

Scenario: a stoppage begins near the end of Shift A and continues into Shift B. Raw states:

Timestamp	State	Note
14:56:40	RUN	Normal production
14:58:10	FAULT/ALARM	Alarm occurs
15:00:00	—	(Shift Boundary) Shift A ends, Shift B starts
15:07:20	RUN	Alarm cleared, machine running

Correct structuring: create one parent downtime event (14:58:10–15:07:20), then split into two child segments:

Segment A (Shift A): 14:58:10–15:00:00, shift_id=A
Segment B (Shift B): 15:00:00–15:07:20, shift_id=B

Totals remain consistent (parent duration equals sum of segments), but shift reporting is fair. This also prevents a common multi-shift CNC cell problem: Shift B shows higher downtime “on paper,” and the team treats it as a performance issue when the real cause is inconsistent handling of boundaries, thresholds, or closure habits.

Example 3: data gap from a network drop (UNKNOWN vs silent miscount)

Scenario: the gateway loses connectivity and the controller heartbeat stops reporting. Raw ingest shows missing messages between two known states:

Timestamp	Observed State	Risk if Mishandled
10:22:05	RUN	Baseline known state (no risk, accurate capture).
10:24:10–10:28:40	No data (heartbeat lost)	High Risk: Silently counted as RUN or STOP by the system, artificially inflating utilization metrics or masking true downtime.
10:28:40	STOP	State resumes (previous untracked gap must be audited or marked as "unknown").

Correct rule: create an UNKNOWN segment for 10:24:10–10:28:40 with low confidence and a data_source note (network/heartbeat loss). Do not automatically attribute it to an operator, and do not let it inflate RUN time. In the shift report, UNKNOWN should be visible so the team fixes the data path and doesn’t draw conclusions from missing minutes.
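One way to apply this rule is to scan the heartbeat stream and label any silence longer than a timeout as UNKNOWN. In the sketch below, the intermediate heartbeats, the 90-second timeout, and the function name are assumptions added for illustration; only the 10:22:05, 10:24:10, and 10:28:40 timestamps come from the table.

```python
from datetime import datetime


def segments_with_unknown(heartbeats, timeout_s):
    """Build segments from (ts, state) heartbeats, marking long silences UNKNOWN.

    heartbeats: list of (timestamp, state) sorted by timestamp.
    Adjacent intervals with the same state are joined into one segment.
    """
    segments = []

    def extend(state, start, end):
        last = segments[-1] if segments else None
        if last and last["state"] == state and last["end"] == start:
            last["end"] = end
        else:
            segments.append({
                "state": state,
                "start": start,
                "end": end,
                "event_confidence": "low" if state == "UNKNOWN" else "high",
            })

    for (ts, state), (next_ts, _) in zip(heartbeats, heartbeats[1:]):
        if (next_ts - ts).total_seconds() > timeout_s:
            extend("UNKNOWN", ts, next_ts)   # silence: do not guess a state
        else:
            extend(state, ts, next_ts)
    return segments


def t(hms):
    return datetime.fromisoformat("2024-01-01T" + hms)  # date is arbitrary


heartbeats = [
    (t("10:22:05"), "RUN"),
    (t("10:23:05"), "RUN"),    # hypothetical intermediate heartbeat
    (t("10:24:10"), "RUN"),    # last message before the drop
    (t("10:28:40"), "STOP"),   # data resumes
    (t("10:29:40"), "STOP"),   # hypothetical intermediate heartbeat
]
timeline = segments_with_unknown(heartbeats, timeout_s=90)
```

The 270-second silence lands in its own low-confidence UNKNOWN segment, so it neither extends RUN nor gets charged to the STOP that follows.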

If you’re implementing or tightening these rules, keep cost framing practical: you’re not buying “charts,” you’re buying consistency—state capture, threshold governance, audit trails, and the ability to reconcile ERP assumptions with machine behavior. For implementation expectations and packaging, see pricing. For teams that want help interpreting patterns across shifts and machines once the data is structured, an AI Production Assistant can be used to turn clean events into consistent follow-up questions without turning the morning meeting into a debate club.

When you’re ready, the fastest way to validate your current data structure is to walk through one cell’s last full shift and check: (1) can every minute be accounted for, (2) do thresholds match across shifts, and (3) can any event be traced back to state transitions. If you want to pressure-test your schema against a mixed fleet and multi-shift reality, schedule a demo and bring a real shift report—your team will get more value from a 20-minute structure review than from another round of manual downtime notes.
