Health Monitor - Cascade rollups

When the cascade fires

Two scenarios trip the rollup:

Scenario A — the integration itself went down

The integration is reported as unreachable or has bad credentials. The monitor opens an integration-level ticket as usual. From that moment on, any device-offline event for a device behind the same integration cascades: instead of opening a fresh per-device ticket, the monitor adds the device to the parent's recovery checklist.
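The routing rule in Scenario A can be sketched as follows. This is a minimal illustration, not the monitor's actual code: the names Monitor, ParentTicket, on_integration_down and on_device_offline are hypothetical, and state is held in memory purely for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ParentTicket:
    """Integration-level ticket; `checklist` holds the devices rolled up into it."""
    integration_id: str
    checklist: list[str] = field(default_factory=list)


@dataclass
class Monitor:
    # Hypothetical in-memory state: open integration-level tickets by integration ID.
    open_parents: dict[str, ParentTicket] = field(default_factory=dict)
    # Standalone per-device tickets as (integration_id, device_id) pairs.
    standalone_tickets: list[tuple[str, str]] = field(default_factory=list)

    def on_integration_down(self, integration_id: str) -> ParentTicket:
        # Open the integration-level ticket, or reuse the one already open.
        return self.open_parents.setdefault(
            integration_id, ParentTicket(integration_id)
        )

    def on_device_offline(self, integration_id: str, device_id: str) -> str:
        parent = self.open_parents.get(integration_id)
        if parent is None:
            # Healthy integration: open a normal per-device ticket.
            self.standalone_tickets.append((integration_id, device_id))
            return "standalone"
        # Integration ticket is open: add the device to its checklist instead.
        if device_id not in parent.checklist:
            parent.checklist.append(device_id)
        return "cascaded"
```

In this sketch a repeated offline event for the same device leaves the checklist unchanged, which is an assumption about deduplication rather than documented behaviour.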

Scenario B — many devices fall off a healthy integration

A subset of devices on an otherwise-healthy integration drops off (a VLAN went down, a firmware push went bad, a customer's local controller crashed). The first few device-offline events open standalone tickets — the monitor can't predict the fanout from one or two events.

When the count of recent device-offline tickets on the same integration crosses the fanout threshold inside the fanout window, the monitor promotes them retroactively: it opens an integration-level parent ticket, resolves the standalone device tickets with a "Merged into cascade " note on each, and adds the devices to the parent's checklist.
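One way to picture the threshold check is a sliding window of recent device-offline tickets per integration. The sketch below is an assumption-laden illustration: FanoutDetector and record_offline are hypothetical names, "crosses the threshold" is read as "reaches it", and each device is assumed to contribute at most one ticket inside the window.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

FANOUT_WINDOW = timedelta(minutes=10)  # documented platform default
FANOUT_THRESHOLD = 10                  # documented platform default


class FanoutDetector:
    def __init__(self, window=FANOUT_WINDOW, threshold=FANOUT_THRESHOLD):
        self.window = window
        self.threshold = threshold
        # integration_id -> deque of (timestamp, device_id), oldest first
        self._recent = defaultdict(deque)

    def record_offline(self, integration_id, device_id, now):
        """Record a device-offline ticket.

        Returns the device IDs to promote into a cascade parent,
        or an empty list while the count is below the threshold.
        """
        recent = self._recent[integration_id]
        recent.append((now, device_id))
        # Forget tickets that have aged out of the fanout window.
        while recent and now - recent[0][0] > self.window:
            recent.popleft()
        # Assumption: promotion happens when the count reaches the threshold.
        if len(recent) < self.threshold:
            return []
        promoted = [device for _, device in recent]
        recent.clear()  # these tickets are now merged into the parent
        return promoted
```

The caller would then open the parent ticket, resolve the returned devices' standalone tickets, and add them to the checklist.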

What the user sees

Tickets list — A small "Multiple devices offline" chip appears next to the title of any ticket that is a cascade parent. The chip is the only visual signal; there are no member tickets to expand.
Ticket detail page — When the ticket is a cascade parent, the description body carries the recovery checklist as a GFM task list. A short status banner above it reads "Cascade status: X of Y devices recovered". The checkboxes update automatically as devices recover.
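To make the detail-page layout concrete, here is a small sketch that renders the status banner followed by a GFM task list. The function name and the device names are hypothetical, and joining the banner and checklist into one string is only for illustration; the real page may render them separately.

```python
def render_cascade_description(devices: dict[str, bool]) -> str:
    """Build the banner and checklist text.

    `devices` maps a device name to True once that device has recovered.
    """
    recovered = sum(devices.values())
    lines = [
        f"Cascade status: {recovered} of {len(devices)} devices recovered",
        "",
    ]
    for name, is_recovered in devices.items():
        # GFM task list item: "- [x]" is checked, "- [ ]" is unchecked.
        box = "x" if is_recovered else " "
        lines.append(f"- [{box}] {name}")
    return "\n".join(lines)


print(render_cascade_description({"sensor-1": True, "sensor-2": False}))
```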

End-of-incident behaviour

All members recover — the parent's cascade status reaches "Y of Y devices recovered" and the monitor auto-resolves the parent with an "All affected devices recovered" note.
The integration itself recovers (Scenario A) — the monitor ends the active cascade regardless of outstanding members. Any devices still offline re-enter normal observation and may open standalone tickets after their own debounce.
The fanout window expires before recovery — no auto-action. Stragglers will re-fire on their next observation cycle and open standalone tickets — at which point Scenario B may re-promote them if the threshold trips again.
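The three endings above can be summarised as a single decision function. This is a sketch only: the Outcome names are hypothetical, and the order in which the conditions are checked is an assumption, since the behaviour when several apply at once is not specified here. For a Scenario B cascade the caller would pass integration_recovered=False.

```python
from enum import Enum


class Outcome(Enum):
    AUTO_RESOLVE_PARENT = "auto-resolve parent"
    END_CASCADE = "end cascade, stragglers return to normal observation"
    NO_ACTION = "no automatic action"
    KEEP_OPEN = "cascade still active"


def end_of_incident(
    recovered: int,
    total: int,
    integration_recovered: bool,
    window_expired: bool,
) -> Outcome:
    # All members recovered: resolve the parent automatically.
    if total > 0 and recovered == total:
        return Outcome.AUTO_RESOLVE_PARENT
    # Scenario A only: the integration came back, so the cascade ends
    # even if some devices are still offline.
    if integration_recovered:
        return Outcome.END_CASCADE
    # Window ran out first: nothing happens automatically; stragglers
    # re-fire on their next observation cycle.
    if window_expired:
        return Outcome.NO_ACTION
    return Outcome.KEEP_OPEN
```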

Why "cascade" and not "outage"

The word "outage" is reserved for a future Neowit-declared platform-outage feature where operators can declare "Microsoft is down right now, expected ticket noise" before any single ticket fires. The current rollup is automatic — there's no operator declaration involved — so it's called a cascade in the UI and KB.

Configuration

The fanout window and threshold are platform-tuned (not per-org). Defaults:

Window: 10 minutes
Threshold: 10 devices

Both defaults are deliberately conservative; the window covers a typical reauth-then-reconnect cycle, and the threshold ignores small clusters where the noise floor is already low. If your environment routinely produces fanouts below threshold, talk to support about tuning — the values aren't customer-exposed yet.