Failure and Recovery
Use this page when the stack is partially healthy but one of the core control-plane pieces has stalled, restarted, or drifted out of sync.
Read This In Order
- Check /health, /docs, /ui/, /ui-react/, /guide/, and /guide/blog to separate API failure from docs or frontend-only issues.
- Check /jobs and the relevant campaign aggregate endpoint before re-running long HySpex or point-cloud work.
- Check Redis, Celery worker, and MapCache service health in the swarm stack.
- Reuse persisted manifests and next_start_position values instead of starting bulk workflows from scratch.
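The first check can be scripted. The following is a minimal sketch, assuming the API is reachable at a base URL you supply and that /health and /docs are answered by the API itself while the remaining routes are static mounts; that grouping and the function names are illustrative, not part of the stack.

```python
from urllib import error, request

# Assumed grouping: routes answered by the API itself vs. static mounts.
API_ROUTES = ("/health", "/docs")
STATIC_ROUTES = ("/ui/", "/ui-react/", "/guide/", "/guide/blog")


def probe(base_url, paths=API_ROUTES + STATIC_ROUTES, timeout=5.0):
    """Return {path: HTTP status}, with None when the request never completed."""
    statuses = {}
    for path in paths:
        try:
            with request.urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
                statuses[path] = resp.status
        except error.HTTPError as exc:
            # The server answered, but with an error status such as 404 or 500.
            statuses[path] = exc.code
        except (error.URLError, OSError):
            # Connection refused, DNS failure, or timeout.
            statuses[path] = None
    return statuses


def classify(statuses):
    """Label a probe result as 'api', 'static', or 'ok'."""
    def ok(path):
        code = statuses.get(path)
        return code is not None and 200 <= code < 400

    if not all(ok(p) for p in API_ROUTES):
        return "api"
    if not all(ok(p) for p in STATIC_ROUTES):
        return "static"
    return "ok"
```

Calling classify(probe(base_url)) with your own base URL returns "api" when the API itself is down and "static" when only docs or frontend routes are missing.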
Celery Worker Stalled or Restarting
Symptoms:
- jobs stay queued or started for too long
- long HySpex or terrain work stops making progress
- the worker container restarts under memory pressure
Recovery:
- Inspect /jobs first to see whether the task already reached a terminal state.
- If a HySpex campaign is involved, query /jobs/hyspex-csw-campaigns/{campaign_id} and continue from the reported next_start_position.
- If a task is still live but unhealthy, cancel it and requeue only that batch.
- Lower concurrency or CELERY_MAX_TASKS_PER_CHILD before retrying large point-cloud jobs.
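To decide which tasks are "still live but unhealthy", it can help to filter the /jobs listing by state and age. This is only a sketch: the field names id, status, and updated_at, the ISO 8601 timestamp format, and the 30-minute threshold are assumptions, so adjust them to the actual /jobs payload.

```python
from datetime import datetime, timedelta, timezone

# States the symptoms above describe as "live"; anything else is treated as terminal.
LIVE_STATES = {"queued", "started"}


def find_stalled(jobs, now, max_age=timedelta(minutes=30)):
    """Return ids of jobs that are still live but have not updated within max_age."""
    stalled = []
    for job in jobs:
        if job["status"] not in LIVE_STATES:
            continue  # already terminal, so there is nothing to cancel
        # Hypothetical field: an ISO 8601 timestamp with an explicit UTC offset.
        last_update = datetime.fromisoformat(job["updated_at"])
        if now - last_update > max_age:
            stalled.append(job["id"])
    return stalled
```

Pass datetime.now(timezone.utc) as now. Only the returned ids are candidates for cancel-and-requeue; everything else is either finished or still making progress.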
See Request Flows for queue handoff and Workflow Diagrams for campaign sequencing.
Redis Unavailable or Flapping
Symptoms:
- new jobs do not start
- job state becomes stale or stops updating
- API routes that depend on worker state slow down or fail
Recovery:
- Restore Redis first; Celery depends on it for broker and result state.
- After Redis is healthy, refresh /jobs before deciding whether work must be requeued.
- Prefer reusing persisted dataset and job manifests under /data rather than manually reconstructing state.
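When Redis is flapping, a single successful ping is not a reliable sign that it is healthy. One way to gate the /jobs refresh is to require several consecutive successful checks. The helper below is a generic sketch, not part of the stack; the check callable is supplied by you.

```python
import time


def wait_until_healthy(check, attempts=5, delay=1.0, required_successes=3,
                       sleep=time.sleep):
    """Return True once check() succeeds required_successes times in a row."""
    streak = 0
    for attempt in range(attempts):
        try:
            healthy = bool(check())
        except Exception:
            # Any exception from the check counts as an unhealthy result.
            healthy = False
        # A single failure resets the streak, which filters out flapping.
        streak = streak + 1 if healthy else 0
        if streak >= required_successes:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

If the redis-py client is available in your environment, check could be as simple as lambda: client.ping(); refresh /jobs only after the function returns True.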
Redis is a control-plane dependency. Dataset files, terrain outputs, thumbnails, and config remain on the shared volumes described in Storage and Manifests.
MapCache Serving Errors or Empty Tiles
Symptoms:
- cached WMS requests fail while direct WMS still works
- tile cleanup or seed jobs complete but cached output is missing
- only some zoom levels or grids return tiles
Recovery:
- Confirm whether wms_use_mapcache is enabled.
- Verify the generated mapcache.xml and the cache directory under /var/sig/tiles.
- If needed, fall back to direct MapServer rendering while you repair cache config or reseed.
- Rerun cache seeding only for the affected datasets instead of clearing the whole cache.
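To find which caches actually need reseeding, a quick file count per top-level directory under the cache root is often enough. The sketch below makes no assumption about the tile layout below that first level; whether each top-level directory corresponds to one dataset depends on your mapcache.xml, so treat that mapping as an assumption to verify.

```python
from pathlib import Path


def tile_counts(cache_root):
    """Map each immediate subdirectory of cache_root to the number of files beneath it."""
    root = Path(cache_root)
    counts = {}
    for child in sorted(root.iterdir()):
        if child.is_dir():
            counts[child.name] = sum(1 for p in child.rglob("*") if p.is_file())
    return counts


def empty_caches(cache_root):
    """Return the subdirectories that contain no files at all."""
    return [name for name, count in tile_counts(cache_root).items() if count == 0]
```

Running empty_caches("/var/sig/tiles") lists the directories with no tiles, which narrows reseeding to the affected datasets.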
See MapCache and Caching for cache controls and Storage and Manifests for cache directory layout.
HySpex Campaign Interrupted Mid-Run
Symptoms:
- a parent batch completed but children did not finish cleanly
- frontend monitoring stopped while backend jobs still exist
- duplicate or partial child coverage appears across retries
Recovery:
- Query /jobs/hyspex-csw-campaigns/{campaign_id}.
- Reuse the same campaign_id and continue from next_start_position.
- Review duplicate and child-state summaries before requeuing another batch.
- Only repair missing variants or metadata if the aggregate endpoint shows gaps after the resumed run.
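The resume step can be sketched as a small helper that turns the aggregate response into the parameters for the next batch. Only campaign_id and next_start_position come from this page; the output keys start_position and max_records, and the rule that a null next_start_position means the campaign is complete, are assumptions for illustration.

```python
def next_batch(aggregate, batch_size):
    """Build parameters for the next batch, or None when nothing remains."""
    start = aggregate.get("next_start_position")
    if start is None:
        # Assumption: a missing or null value means the campaign is complete.
        return None
    return {
        # Reuse the existing campaign id rather than creating a new campaign.
        "campaign_id": aggregate["campaign_id"],
        "start_position": start,    # hypothetical request field
        "max_records": batch_size,  # hypothetical request field
    }
```

Keeping the same campaign_id lets the aggregate endpoint continue to report duplicates and child state across the original and resumed runs.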
See HySpex Campaigns for operator flow and C4 Architecture for the orchestration view.
Docs or Frontend Missing After Deploy
Symptoms:
- /docs works but /guide, /ui, or /ui-react returns 404
- content is stale after a rebuild
Recovery:
- Rebuild the API image; /guide, /ui, and /ui-react are all served from the API runtime image.
- Redeploy the stack with the same path used by this branch, typically ./reload_stack.sh and local-stack.yml.
- Verify the mounted static routes from the API service before debugging Traefik.
See Deployment and Hosting for the hosting model and System Diagrams for the routing path.