
System Administration Guide

Aafiya Managed Care Platform — v1.2.0

Overview

This guide covers the system administration responsibilities for the Aafiya Managed Care Platform. It is intended for system administrators, DevOps engineers, and platform operations staff responsible for deploying, monitoring, and maintaining the platform. For day-to-day operational procedures, see the Staff Portal Guide and docs/operations/RUNBOOK.md.

Architecture Overview

Service Domain Clusters

| Domain | Responsibility | Services |
|---|---|---|
| Clinical | All clinical workflows | Rules engine, pre-authorisation, hospital benefit, bill review, medical advisory, case management, serious injuries, future/past medical expense, EMS data, pharmacy benefit, HPCSA tribunal |
| Engagement | Member & provider portals | Provider portal API, claimant portal API, communication, reporting/BI |
| Network | Fraud & provider networks | Fraud detection, specialties & networks, geo-mapping |
| Platform | Cross-cutting infrastructure | Identity, FHIR interop, Home Affairs, audit |
| AI | AI-powered features | Inference, prompt template, clinical briefing, chat, document extraction |

Deployment

Prerequisites

  • Google Cloud Platform project with billing enabled
  • GKE cluster (Google Kubernetes Engine)
  • Cloud SQL instance (PostgreSQL 15+)
  • Cloud Firestore database
  • Cloud Storage buckets
  • Artifact Registry for container images

Deployment Pipeline

The platform uses Cloud Deploy with a canary deployment strategy:

  • Cloud Build builds container images and pushes them to Artifact Registry.
  • Cloud Deploy rolls out to GKE in canary phases (10% → 50% → 100%).
  • Automated health checks run after each phase.
  • Automatic rollback is triggered if the error rate exceeds 1% or latency exceeds thresholds.
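The promote-or-rollback decision described above can be sketched as a small shell helper. This is an illustration only, not the Cloud Deploy configuration itself: the function name canary_gate is hypothetical, and because the guide does not state the latency threshold, it is passed in as a parameter.

```shell
# Hypothetical helper: decide the outcome of one canary phase.
# Usage: canary_gate <error_rate_percent> <p95_latency_seconds> <latency_limit_seconds>
canary_gate() {
  awk -v err="$1" -v lat="$2" -v limit="$3" 'BEGIN {
    # Roll back when the error rate exceeds 1% or latency exceeds the limit.
    if (err + 0 > 1.0 || lat + 0 > limit + 0) print "ROLLBACK"
    else print "PROMOTE"
  }'
}

# Walk the three canary phases with sample readings (made-up numbers).
for phase in 10 50 100; do
  echo "phase ${phase}%: $(canary_gate 0.4 2.1 5)"
done
```

In a real pipeline the readings would come from the automated health checks after each phase rather than from fixed sample values.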

Manual Deploy Commands

# Build and push an individual service
mvn install -pl services/clinical/rules-engine -am -DskipTests
gcloud builds submit --tag us-docker.pkg.dev/imms-project/imms-registry/rules-engine:latest

# Deploy via kubectl
kubectl set image deployment/rules-engine -n imms-prod \
  rules-engine=us-docker.pkg.dev/imms-project/imms-registry/rules-engine:latest

# Run locally
docker compose up -d
mvn install -DskipTests
mvn spring-boot:run -pl services/clinical/rules-engine

Monitoring & Alerting

The platform has 25+ monitoring rules configured. Critical alerts (PagerDuty page) trigger on high error rate (>1% for 5 min), service down, database down, and Pub/Sub backlog (>10 min). High alerts (Slack notification) trigger on high latency (p95 >5s for 10 min), memory pressure, and disk space.

Key Metrics

| Metric | Warning | Critical |
|---|---|---|
| p95 API Latency | >3s | >5s |
| Error Rate | >0.5% | >1% |
| GKE CPU Usage | >70% | >85% |
| Cloud SQL Connections | >70% of max | >85% of max |
| Pub/Sub Lag | >5 min | >10 min |
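As a rough illustration of how the warning and critical thresholds in this table relate, the sketch below classifies a single reading. The function name classify is hypothetical, and the readings in the usage lines are sample values, not platform data.

```shell
# Hypothetical helper: map a reading to OK, WARNING or CRITICAL.
# Usage: classify <value> <warning_threshold> <critical_threshold>
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    v += 0; w += 0; c += 0          # force numeric comparison
    if (v > c) print "CRITICAL"
    else if (v > w) print "WARNING"
    else print "OK"
  }'
}

classify 4.2 3 5     # p95 API latency in seconds: above 3s, not above 5s
classify 0.3 0.5 1   # error rate in percent: below both thresholds
```

All thresholds in the table are upper bounds, so a single "greater than" comparison per level is enough.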

Security Administration

Authentication & Authorisation: Firebase Authentication handles login and issues JWTs, which each service validates. Role-based access control enforces permissions at the controller level with the following roles: STAFF_ADMIN, STAFF_L3, STAFF_L2, STAFF_L1, PROVIDER, and CLAIMANT.
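When troubleshooting access problems it can help to look at what a token actually carries. The sketch below decodes the payload segment of a JWT for inspection only. It does not verify the signature, so it must never be used to make authorisation decisions. The function name jwt_payload and the "role" claim in the sample token are assumptions, since the guide does not say how roles are encoded in the token.

```shell
# Hypothetical helper: print the decoded payload of a JWT (no signature check).
# Usage: jwt_payload <token>
jwt_payload() {
  local payload
  # Take the middle segment and convert base64url to standard base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that base64url encoding strips.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 -d
}

# Build a sample unsigned token with an assumed "role" claim and decode it.
sample_body=$(printf '%s' '{"role":"STAFF_L2"}' | base64 | tr '/+' '_-' | tr -d '=\n')
jwt_payload "header.${sample_body}.signature"
echo
```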

Encryption: Data at rest uses Cloud SQL and Firestore default encryption. All traffic uses TLS 1.3. PII fields are encrypted with AES-256-GCM field-level encryption. Protected Health Information (PHI) is tokenised before sending to external AI providers.
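The idea behind tokenisation can be sketched as follows: a sensitive value is replaced by a stable, opaque token, so the same input always maps to the same token while the original value never leaves the platform. This is a simplified stand-in, not the platform's tokenisation service. The function name tokenise, the "tok_" prefix, and the key handling are hypothetical, and a keyed SHA-256 digest is used here where a production system would more likely use HMAC or a token vault.

```shell
# Hypothetical helper: derive a deterministic token from a secret and a value.
# Usage: tokenise <secret_key> <sensitive_value>
tokenise() {
  # Hash "key:value" and keep the first 16 hex characters as the token body.
  printf 'tok_%s\n' "$(printf '%s:%s' "$1" "$2" | sha256sum | cut -c1-16)"
}

# The same value always yields the same token, so references stay consistent.
tokenise "example-key" "sample-member-id-001"
tokenise "example-key" "sample-member-id-001"
```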

Audit Trail: Every operational action is logged to Cloud Firestore with actor, action type, immutable timestamp, before/after values, and reason. Logs are append-only with a 7-year retention period for POPIA compliance.

Data Administration

Backups: Cloud SQL has automated daily backups with 7-day retention. Firestore has scheduled exports to Cloud Storage. Cloud Storage has object versioning enabled.

# Connect via Cloud SQL proxy
cloud_sql_proxy -instances=imms-prod:southamerica-west1:imms-db=tcp:5432

-- Check active connections (run inside a psql session)
SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;

-- Run manual vacuum (run inside a psql session)
VACUUM ANALYZE;

Transactional data is exported to BigQuery for analytics via hourly ETL. Check imms_prod.etl_execution_log for ETL status.

Service Administration

Each Spring Boot service exposes a health endpoint at GET /actuator/health returning {"status": "UP"} when healthy.
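A scripted check of the health response might look like the sketch below. The function name is_up is hypothetical, and the match assumes that "status" is the first key of the response body, so that a nested component reporting UP is not mistaken for the overall status. The port in the usage comment is Spring Boot's default and may differ per service.

```shell
# Hypothetical helper: succeed only when the top-level status is UP.
# Usage: is_up <json_body>
is_up() {
  # Join lines so pretty-printed JSON is handled, then match the first key.
  printf '%s' "$1" | tr -d '\n' |
    grep -Eq '^[[:space:]]*\{[[:space:]]*"status"[[:space:]]*:[[:space:]]*"UP"'
}

# Example with a canned response. Against a live service the body could come
# from: curl -fsS http://localhost:8080/actuator/health
if is_up '{"status": "UP"}'; then
  echo "healthy"
else
  echo "unhealthy"
fi
```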

# Manual scaling
kubectl scale deployment/rules-engine --replicas=5 -n imms-prod

# View logs
kubectl logs -n imms-prod deployment/rules-engine --tail=100

# Search via Cloud Logging
gcloud logging read "resource.type=k8s_container AND severity=ERROR" --limit 50

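# Inspect the Pub/Sub subscription configuration (backlog age itself is reported
# by the Cloud Monitoring metric subscription/oldest_unacked_message_age)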
gcloud pubsub subscriptions describe imms-bill-review-subscription

AI Services Administration

The AI platform supports two providers: Vertex AI (Gemini) as the primary and Anthropic (Claude) as the fallback. Configuration is set via environment variables or Spring Cloud Config.
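The primary-then-fallback behaviour can be illustrated with a generic shell pattern. This is only a sketch of the control flow, not how the platform's services implement it: call_with_fallback, vertex_stub, and claude_stub are hypothetical names, and the stubs stand in for real provider calls.

```shell
# Hypothetical helper: run the primary command and use the fallback if it fails.
# Usage: call_with_fallback <primary_cmd> <fallback_cmd> [args...]
call_with_fallback() {
  local primary="$1" fallback="$2" out
  shift 2
  # Capture primary output so a partial response is discarded on failure.
  if out=$("$primary" "$@" 2>/dev/null); then
    printf '%s\n' "$out"
  else
    "$fallback" "$@"
  fi
}

# Stub providers for demonstration: the primary fails, the fallback answers.
vertex_stub() { echo "vertex unavailable" >&2; return 1; }
claude_stub() { echo "claude: $1"; }

call_with_fallback vertex_stub claude_stub "summarise claim"
```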

Prompt templates are stored in Cloud Firestore and are versioned. Templates go through an approval workflow before becoming active. Task types include CLINICAL_BRIEFING, DOCUMENT_EXTRACTION, CHAT, and FRAUD_ANALYSIS.

Key AI Metrics

| Metric | Description | Warning |
|---|---|---|
| Inference latency | Time to generate AI response | >5 seconds |
| Error rate | AI provider errors | >2% |
| Confidence score | AI response confidence | <0.7 |
| Extraction accuracy | Document extraction quality | <85% |

Disaster Recovery

Recovery Time Objective (RTO): ≤8 hours  |  Recovery Point Objective (RPO): ≤8 hours

# Cloud SQL restore
gcloud sql backups list --instance=imms-db
gcloud sql backups restore <backup-id> \
  --backup-instance=imms-db --restore-instance=imms-db

# Firestore restore
gcloud firestore import gs://imms-backups-prod/firestore/2026-06-04/

# Service rollback
kubectl rollout undo deployment/rules-engine -n imms-prod

Full system restore: Restore Cloud SQL from latest backup, restore Firestore from latest export, redeploy all services from known-good Artifact Registry tags, verify data integrity, and run smoke tests.

Capacity Planning

| Resource | Current | Usage | Headroom |
|---|---|---|---|
| GKE Nodes (n2-standard-4) | 6 | ~55% | ~45% |
| Cloud SQL (n2-standard-4, 100GB) | 1 primary + 1 replica | ~40% | ~60% |
| Firestore | 10GB | ~2GB | ~80% |
| Cloud Storage | 500GB | ~50GB | ~90% |
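The headroom figures for the storage rows follow directly from capacity and usage. As a worked example, the sketch below recomputes them. The function name headroom_pct is hypothetical.

```shell
# Hypothetical helper: headroom as a whole-number percentage of capacity.
# Usage: headroom_pct <capacity> <used>   (same unit for both)
headroom_pct() {
  awk -v cap="$1" -v used="$2" 'BEGIN {
    if (cap + 0 <= 0) exit 1        # avoid dividing by zero
    printf "%.0f\n", (cap - used) / cap * 100
  }'
}

headroom_pct 10 2     # Firestore: 10GB with ~2GB used, prints 80
headroom_pct 500 50   # Cloud Storage: 500GB with ~50GB used, prints 90
```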

Monthly review items: storage growth rate, GKE cluster utilisation trends, Pub/Sub throughput trends, API latency trends, AI provider token usage and cost, and BigQuery query costs.

Health Check List

Daily

  • All services GREEN in Cloud Monitoring dashboard
  • No Pub/Sub dead-letter queue messages
  • Cloud SQL backup completed within 24 hours
  • BigQuery ETL completed successfully
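The 24-hour backup check in the daily list can be automated with simple epoch arithmetic. The sketch below assumes the completion time of the latest backup is already available as a Unix timestamp. How that timestamp is obtained is left out, and the function name backup_is_fresh is hypothetical.

```shell
# Hypothetical helper: succeed when the last backup is at most 24 hours old.
# Usage: backup_is_fresh <last_backup_epoch> <now_epoch>
backup_is_fresh() {
  local last="$1" now="$2"
  [ $(( now - last )) -le $(( 24 * 3600 )) ]
}

# Example with sample timestamps: a backup that finished 6 hours ago.
now=$(date +%s)
if backup_is_fresh $(( now - 6 * 3600 )) "$now"; then
  echo "backup OK"
else
  echo "backup STALE"
fi
```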

Weekly

  • Review incident log for unresolved items
  • Check performance trends (latency, error rate)
  • Review capacity utilisation
  • Verify backup integrity (sample restore test)

Monthly

  • Full capacity planning review
  • Disaster recovery drill
  • User access review
  • AI provider usage and cost analysis

Quarterly

  • Compliance audit (POPIA, HIPAA, ISO 27001)
  • Penetration test
  • Full disaster recovery test