
Failure and Recovery

Use this page when the stack is partially healthy but one of the core control-plane pieces has stalled, restarted, or drifted out of sync.

Read This In Order

  1. Check /health, /docs, /ui/, /ui-react/, /guide/, and /guide/blog to separate API failure from docs or frontend-only issues.
  2. Check /jobs and the relevant campaign aggregate endpoint before re-running long HySpex or point-cloud work.
  3. Check Redis, Celery worker, and MapCache service health in the swarm stack.
  4. Reuse persisted manifests and next_start_position values instead of starting bulk workflows from scratch.
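
Step 1 can be scripted. The following is a minimal sketch, not part of the stack: the base URL, the function names, and the split of routes into "API" and "static" groups are assumptions made for illustration, and a 2xx response is taken to mean the route is being served.

```python
import urllib.error
import urllib.request

# Routes named in step 1. Treating /health and /docs as "the API" and the
# rest as static docs/frontend mounts is an assumption of this sketch.
API_ROUTES = ["/health", "/docs"]
STATIC_ROUTES = ["/ui/", "/ui-react/", "/guide/", "/guide/blog"]


def probe(base_url, path, timeout=5.0):
    """Return the HTTP status for base_url + path, or None if unreachable."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def classify(statuses):
    """Turn a {path: status-or-None} mapping into a coarse diagnosis."""
    def ok(path):
        code = statuses.get(path)
        return code is not None and 200 <= code < 300

    if not all(ok(p) for p in API_ROUTES):
        return "api-failure"
    if not all(ok(p) for p in STATIC_ROUTES):
        return "docs-or-frontend-only"
    return "healthy"


def check_stack(base_url):
    """Probe every route and return (diagnosis, raw statuses)."""
    statuses = {p: probe(base_url, p) for p in API_ROUTES + STATIC_ROUTES}
    return classify(statuses), statuses
```

For example, `check_stack("http://localhost:8000")` (a hypothetical address) returns the diagnosis together with the raw status per route, so a 404 on /guide/ is visible alongside a healthy /health.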

Celery Worker Stalled or Restarting

Symptoms:

  • jobs stay in the queued or started state for too long
  • long HySpex or terrain work stops making progress
  • the worker container restarts under memory pressure

Recovery:

  • Inspect /jobs first to see whether the task already reached a terminal state.
  • If a HySpex campaign is involved, query /jobs/hyspex-csw-campaigns/{campaign_id} and continue from the reported next_start_position.
  • If a task is still live but unhealthy, cancel it and requeue only that batch.
  • Lower worker concurrency or CELERY_MAX_TASKS_PER_CHILD before retrying large point-cloud jobs, so fewer tasks run at once and worker processes are recycled sooner.
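
The last recovery step can be sketched as configuration. worker_concurrency and worker_max_tasks_per_child are standard Celery settings; how this stack maps CELERY_MAX_TASKS_PER_CHILD onto them, the CELERY_CONCURRENCY variable name, and the default values below are assumptions.

```python
import os


def worker_settings(env=None):
    """Build Celery worker settings from environment variables.

    Assumes (not confirmed by this page) that the stack reads
    CELERY_CONCURRENCY and CELERY_MAX_TASKS_PER_CHILD. The defaults
    are illustrative only.
    """
    env = os.environ if env is None else env
    concurrency = int(env.get("CELERY_CONCURRENCY", "2"))
    max_tasks = int(env.get("CELERY_MAX_TASKS_PER_CHILD", "10"))
    if concurrency < 1 or max_tasks < 1:
        raise ValueError("concurrency and max tasks per child must be >= 1")
    return {
        # Fewer pool processes means fewer large jobs in memory at once.
        "worker_concurrency": concurrency,
        # Each pool process is replaced after this many tasks, which
        # releases memory accumulated by earlier tasks.
        "worker_max_tasks_per_child": max_tasks,
    }
```

With a Celery app object, the result could be applied with `app.conf.update(worker_settings())` before the worker starts.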

See Request Flows for queue handoff and Workflow Diagrams for campaign sequencing.

Redis Unavailable or Flapping

Symptoms:

  • new jobs do not start
  • job state becomes stale or stops updating
  • API routes that depend on worker state slow down or fail

Recovery:

  • Restore Redis first; Celery depends on it for broker and result state.
  • After Redis is healthy, refresh /jobs before deciding whether work must be requeued.
  • Prefer reusing persisted dataset and job manifests under /data rather than manually reconstructing state.
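
The "restore Redis first, then refresh /jobs" ordering can be sketched as a small wait loop. This is an illustration under assumptions: the host and port are placeholders, the Redis instance is assumed to accept unauthenticated plain-TCP connections, and the function names are hypothetical.

```python
import socket
import time


def redis_ping(host="localhost", port=6379, timeout=2.0):
    """Return True if a Redis server at host:port answers PING with PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"PING\r\n")  # inline-command form of PING
            return sock.recv(64).startswith(b"+PONG")
    except OSError:
        return False


def wait_until_healthy(check, attempts=10, delay=3.0, sleep=time.sleep):
    """Call check() until it returns True.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. No sleep happens after the final attempt.
    """
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    return None
```

A call such as `wait_until_healthy(redis_ping)` blocks until Redis answers or the attempts run out; only after it succeeds is it worth re-reading /jobs to decide what to requeue.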

Redis is a control-plane dependency, not the store for outputs: dataset files, terrain outputs, thumbnails, and config remain on the shared volumes described in Storage and Manifests.

MapCache Serving Errors or Empty Tiles

Symptoms:

  • cached WMS requests fail while direct WMS still works
  • tile cleanup or seed jobs complete but cached output is missing
  • only some zoom levels or grids return tiles

Recovery:

  • Confirm whether wms_use_mapcache is enabled.
  • Verify the generated mapcache.xml and the cache directory under /var/sig/tiles.
  • If needed, fall back to direct MapServer rendering while you repair cache config or reseed.
  • Rerun cache seeding only for the affected datasets instead of clearing the whole cache.
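
When verifying the generated mapcache.xml, a quick structural check can narrow down which tilesets are affected. The sketch below assumes the file follows the usual MapCache layout, with `<cache name="...">` and `<tileset name="...">` elements directly under the root and each tileset naming its cache and grids; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def tileset_summary(xml_text):
    """Summarise tilesets in a MapCache config and flag undefined caches."""
    root = ET.fromstring(xml_text)
    defined_caches = {c.get("name") for c in root.findall("cache")}
    summary = []
    for ts in root.findall("tileset"):
        cache = (ts.findtext("cache") or "").strip()
        grids = [(g.text or "").strip() for g in ts.findall("grid")]
        summary.append({
            "tileset": ts.get("name"),
            "cache": cache,
            # False means the tileset points at a cache the file never defines.
            "cache_defined": cache in defined_caches,
            "grids": grids,
        })
    return summary
```

Reading the file and passing its text to `tileset_summary` lists each tileset with its grids, which helps when only some grids return tiles. Grid names are reported but not validated, because MapCache also ships predefined grids that need no declaration in the file.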

See MapCache and Caching for cache controls and Storage and Manifests for cache directory layout.

HySpex Campaign Interrupted Mid-Run

Symptoms:

  • a parent batch completed but children did not finish cleanly
  • frontend monitoring stopped while backend jobs still exist
  • duplicate or partial child coverage appears across retries

Recovery:

  • Query /jobs/hyspex-csw-campaigns/{campaign_id}.
  • Reuse the same campaign_id and continue from next_start_position.
  • Review duplicate and child-state summaries before requeuing another batch.
  • Only repair missing variants or metadata if the aggregate endpoint shows gaps after the resumed run.
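
The resume decision above can be sketched as follows. Only the endpoint path, campaign_id, and next_start_position come from this page; the response is assumed to be JSON, and the "active_children" and "duplicates" field names are hypothetical stand-ins for whatever the aggregate endpoint actually reports.

```python
import json
import urllib.parse
import urllib.request


def fetch_campaign(base_url, campaign_id, timeout=10.0):
    """GET the campaign aggregate and decode it as JSON (assumed format)."""
    cid = urllib.parse.quote(str(campaign_id), safe="")
    url = f"{base_url.rstrip('/')}/jobs/hyspex-csw-campaigns/{cid}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def plan_resume(aggregate):
    """Decide whether to queue another batch from an aggregate response.

    "active_children" and "duplicates" are hypothetical field names
    used for illustration; adjust them to the real response shape.
    """
    if aggregate.get("active_children", 0) > 0:
        return {"action": "wait", "reason": "children still running"}
    if aggregate.get("duplicates", 0) > 0:
        return {"action": "review", "reason": "duplicate children reported"}
    start = aggregate.get("next_start_position")
    if start is None:
        return {"action": "inspect", "reason": "no next_start_position reported"}
    # Safe to continue the same campaign from where the last batch stopped.
    return {"action": "resume", "start_position": start}
```

The ordering of the checks mirrors the list: look at child state and duplicates before requeuing, and resume from the reported position instead of restarting the campaign.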

See HySpex Campaigns for operator flow and C4 Architecture for the orchestration view.

Docs or Frontend Missing After Deploy

Symptoms:

  • /docs works but /guide, /ui, or /ui-react returns 404
  • content is stale after a rebuild

Recovery:

  • Rebuild the API image; /guide, /ui, and /ui-react are all served from the API runtime image.
  • Redeploy the stack the same way this branch is normally deployed, typically with ./reload_stack.sh and local-stack.yml.
  • Verify the mounted static routes from the API service before debugging Traefik.

See Deployment and Hosting for the hosting model and System Diagrams for the routing path.