Health Monitor - Cascade rollups
When many devices on the same integration go offline at once, the health monitor rolls them into one parent ticket instead of opening a ticket per device. This keeps the queue readable, and the recovery checklist lives in the parent's description. This article is for operators triaging fanout events.
When the cascade fires
Two scenarios trip the rollup:
Scenario A — the integration itself went down
The integration reports unreachable or has bad credentials. The monitor opens an integration-level ticket as usual. From that moment on, any device-offline event for a device behind the same integration cascades: instead of opening a fresh per-device ticket, the device is added to the parent's recovery checklist.
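The Scenario A routing decision can be sketched as a single branch on "is there an open integration-level ticket?". This is a minimal in-memory illustration under assumptions, not the monitor's actual code: `ParentTicket`, `Monitor`, `integration_down` and `device_offline` are hypothetical names, and opening a standalone ticket is reduced to appending to a list. The sketch also does not fold device tickets opened before the integration ticket into the parent, since the text only describes events from that moment on.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ParentTicket:
    """Hypothetical integration-level ticket carrying the recovery checklist."""
    integration_id: str
    checklist: list[str] = field(default_factory=list)  # device IDs to recover


@dataclass
class Monitor:
    """Hypothetical in-memory model of the Scenario A routing decision."""
    open_parents: dict[str, ParentTicket] = field(default_factory=dict)
    standalone: list[tuple[str, str]] = field(default_factory=list)  # (integration, device)

    def integration_down(self, integration_id: str) -> ParentTicket:
        # Unreachable or bad credentials: open (or reuse) the integration-level ticket.
        return self.open_parents.setdefault(integration_id, ParentTicket(integration_id))

    def device_offline(self, integration_id: str, device_id: str) -> str:
        parent = self.open_parents.get(integration_id)
        if parent is not None:
            # Cascade: no new ticket, only a checklist entry on the parent.
            if device_id not in parent.checklist:
                parent.checklist.append(device_id)
            return "cascaded"
        # No parent for this integration: open a per-device ticket as usual.
        self.standalone.append((integration_id, device_id))
        return "standalone"
```

For example, a device that drops before `integration_down("int-1")` is called gets a standalone ticket, while one that drops afterwards only appears on the parent's checklist.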
Scenario B — many devices fall off a healthy integration
A subset of devices on an otherwise-healthy integration drops off (a VLAN went down, a firmware push went bad, a customer's local controller crashed). The first few device-offline events open standalone tickets — the monitor can't predict the fanout from one or two events.
When the count of recent device-offline tickets on the same integration crosses the fanout threshold inside the fanout window, the monitor retroactively promotes the cluster to a cascade: it opens an integration-level parent ticket, resolves the standalone device tickets with a "Merged into cascade " note on each, and adds the devices to the parent's checklist.
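A sliding-window counter per integration is one way to implement the Scenario B promotion. The sketch below is hypothetical (`FanoutTracker` and its fields are illustrative names) and assumes two things the article doesn't spell out: that "crosses the threshold" means the count reaches it, and that devices going offline after promotion join the open parent as in Scenario A. The defaults mirror the documented 10-minute window and 10-device threshold; timestamps are plain seconds.

```python
from __future__ import annotations

from collections import defaultdict, deque
from dataclasses import dataclass, field

FANOUT_WINDOW_S = 10 * 60  # documented default window: 10 minutes
FANOUT_THRESHOLD = 10      # documented default threshold: 10 devices


@dataclass
class FanoutTracker:
    """Hypothetical sliding-window counter for Scenario B promotion."""
    window_s: float = FANOUT_WINDOW_S
    threshold: int = FANOUT_THRESHOLD
    # Per integration: (timestamp, device_id) of recent standalone offline tickets.
    recent: dict[str, deque] = field(default_factory=lambda: defaultdict(deque))
    parents: dict[str, list[str]] = field(default_factory=dict)  # integration -> checklist

    def device_offline(self, integration_id: str, device_id: str, now: float) -> str:
        if integration_id in self.parents:
            # Already promoted: later offline devices join the existing checklist.
            self.parents[integration_id].append(device_id)
            return "cascaded"
        events = self.recent[integration_id]
        events.append((now, device_id))
        # Drop tickets that have aged out of the fanout window.
        while events and now - events[0][0] > self.window_s:
            events.popleft()
        if len(events) < self.threshold:
            return "standalone"
        # Threshold reached: open the parent and move every recent device onto it.
        # (The real monitor also resolves each standalone ticket with a merge note.)
        self.parents[integration_id] = [dev for _, dev in events]
        events.clear()
        return "promoted"
```

A deque keeps the window check cheap: old entries are only ever removed from the left, so each event is appended and popped at most once.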

What the user sees
- Tickets list: A small "Multiple devices offline" chip appears next to the title of any ticket that is a cascade parent. The chip is the only visual signal; there are no member tickets to expand.
- Ticket detail page: When the ticket is a cascade parent, the description body carries the recovery checklist as a GFM task list. A short status banner above it reads "Cascade status: X of Y devices recovered." The checkboxes update automatically as devices recover.
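A hypothetical renderer for the banner and the GFM task list, assuming the monitor tracks a recovered/not-recovered flag per device ID. The article doesn't say whether the banner is stored in the description or drawn by the UI, so this sketch returns the two pieces separately; `render_cascade` is an illustrative name.

```python
from __future__ import annotations


def render_cascade(members: dict[str, bool]) -> tuple[str, str]:
    """Hypothetical renderer: device ID -> recovered? becomes (banner, checklist)."""
    recovered = sum(members.values())  # True counts as 1
    banner = f"Cascade status: {recovered} of {len(members)} devices recovered."
    # GFM task-list items: "- [x]" when checked, "- [ ]" when still open.
    checklist = "\n".join(
        f"- [{'x' if ok else ' '}] {device_id}" for device_id, ok in members.items()
    )
    return banner, checklist
```

With `cam-01` and `cam-03` recovered and `cam-02` still offline, the banner reads "Cascade status: 2 of 3 devices recovered." and the checklist has one unchecked `- [ ] cam-02` line.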
End-of-incident behaviour
- All members recover: the parent's description reaches "Y of Y recovered", and the monitor auto-resolves the parent with an "All affected devices recovered" note.
- The integration itself recovers (Scenario A): the monitor ends the active cascade regardless of outstanding members. Any devices still offline re-enter normal observation and may open standalone tickets after their own debounce.
- The fanout window expires before recovery: the monitor takes no automatic action. Stragglers re-fire on their next observation cycle and open standalone tickets, at which point Scenario B may re-promote them if the threshold trips again.
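The end-of-incident rules reduce to a small decision function. This is a sketch under assumptions: `evaluate_cascade` and `Action` are hypothetical names, each cascade is assumed to remember the scenario that opened it, and the all-recovered check is assumed to run before the integration-recovered check when both hold at once. Fanout-window expiry needs no branch of its own because it results in no action.

```python
from __future__ import annotations

from enum import Enum


class Action(Enum):
    KEEP_OPEN = "keep_open"            # nothing to do yet (also the window-expiry case)
    RESOLVE_PARENT = "resolve_parent"  # every member has recovered
    END_CASCADE = "end_cascade"        # Scenario A: the integration itself recovered


def evaluate_cascade(members: dict[str, bool], scenario: str,
                     integration_recovered: bool) -> tuple[Action, list[str]]:
    """Hypothetical end-of-incident check for one cascade parent.

    members maps device ID -> recovered?; scenario is "A" or "B".
    Returns the action plus the devices handed back to normal observation.
    """
    offline = [dev for dev, ok in members.items() if not ok]
    if not offline:
        # "Y of Y recovered": auto-resolve with an "All affected devices recovered" note.
        return Action.RESOLVE_PARENT, []
    if scenario == "A" and integration_recovered:
        # End the cascade regardless of stragglers; they re-enter normal
        # observation and may open standalone tickets after their own debounce.
        return Action.END_CASCADE, offline
    # Includes fanout-window expiry: no automatic action.
    return Action.KEEP_OPEN, []
```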
Why "cascade" and not "outage"
The word "outage" is reserved for a future Neowit-declared platform-outage feature, in which operators can declare "Microsoft is down right now, expect ticket noise" before any single ticket fires. The current rollup is automatic and involves no operator declaration, so the UI and KB call it a cascade.
Configuration
The fanout window and threshold are platform-tuned (not per-org). Defaults:
- Window: 10 minutes
- Threshold: 10 devices
Both defaults are deliberately conservative: the window covers a typical reauth-then-reconnect cycle, and the threshold ignores small clusters, where the noise floor is already low. If your environment routinely produces fanouts below the threshold, talk to support about tuning; the values aren't exposed to customers yet.