System Administration Guide

Aafiya Managed Care Platform — v1.2.0

Overview

This guide covers the system administration responsibilities for the Aafiya Managed Care Platform. It is intended for system administrators, DevOps engineers, and platform operations staff responsible for deploying, monitoring, and maintaining the platform. For day-to-day operational procedures, see the Staff Portal Guide and docs/operations/RUNBOOK.md.

Architecture Overview

Service Domain Clusters

Domain	Responsibility	Services
Clinical	All clinical workflows	Rules engine, pre-authorisation, hospital benefit, bill review, medical advisory, case management, serious injuries, future/past medical expense, EMS data, pharmacy benefit, HPCSA tribunal
Engagement	Member & provider portals	Provider portal API, claimant portal API, communication, reporting/BI
Network	Fraud & provider networks	Fraud detection, specialties & networks, geo-mapping
Platform	Cross-cutting infrastructure	Identity, FHIR interop, Home Affairs, audit
AI	AI-powered features	Inference, prompt template, clinical briefing, chat, document extraction

Deployment

Prerequisites

Google Cloud Platform project with billing enabled
GKE cluster (Google Kubernetes Engine)
Cloud SQL instance (PostgreSQL 15+)
Cloud Firestore database
Cloud Storage buckets
Artifact Registry for container images

Deployment Pipeline

The platform uses Cloud Deploy with a canary deployment strategy. Cloud Build builds the container images and pushes them to Artifact Registry. Cloud Deploy then rolls out to GKE in canary phases (10% → 50% → 100%), with automated health checks after each phase. A rollback is triggered automatically if the error rate exceeds 1% or latency exceeds the configured thresholds.
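
The phase-by-phase gating described above can be illustrated with a small shell sketch. This is not how Cloud Deploy is implemented or configured; it only shows the decision logic. The function names and the probe callback are hypothetical, and only the error-rate criterion (1%) is modelled because the latency thresholds are not stated here.

```shell
# Sketch of the canary gate: walk 10% -> 50% -> 100% and stop at the
# first phase whose error rate exceeds 1%.
# All names here are hypothetical; this is not a Cloud Deploy command.

# Succeeds when the error rate (percent, e.g. 0.4) is at or below 1%.
error_rate_ok() {
  awk -v rate="$1" 'BEGIN { exit !(rate + 0 <= 1.0) }'
}

# $1 = name of a command that prints the current error rate (percent)
#      for the phase passed as its first argument.
run_canary() {
  canary_probe="$1"
  for canary_phase in 10 50 100; do
    canary_rate=$("$canary_probe" "$canary_phase")
    if error_rate_ok "$canary_rate"; then
      echo "phase ${canary_phase}%: healthy (error rate ${canary_rate}%)"
    else
      echo "phase ${canary_phase}%: rollback (error rate ${canary_rate}%)"
      return 1
    fi
  done
  echo "rollout complete"
}
```

To try it, define a probe function that prints an error rate for a given phase and pass its name to run_canary. A non-zero exit status means a phase failed its check and later phases were skipped.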

Manual Deploy Commands

# Build and push an individual service
mvn install -pl services/clinical/rules-engine -am -DskipTests
gcloud builds submit --tag us-docker.pkg.dev/imms-project/imms-registry/rules-engine:latest

# Deploy via kubectl
kubectl set image deployment/rules-engine -n imms-prod \
  rules-engine=us-docker.pkg.dev/imms-project/imms-registry/rules-engine:latest

# Run locally
docker compose up -d
mvn install -DskipTests
mvn spring-boot:run -pl services/clinical/rules-engine

Monitoring & Alerting

The platform has 25+ monitoring rules configured. Critical alerts (PagerDuty page) trigger on high error rate (>1% for 5 min), service down, database down, and Pub/Sub backlog (>10 min). High alerts (Slack notification) trigger on high latency (p95 >5s for 10 min), memory pressure, and disk space.

Key Metrics

Metric	Warning	Critical
p95 API Latency	>3s	>5s
Error Rate	>0.5%	>1%
GKE CPU Usage	>70%	>85%
Cloud SQL Connections	>70% of max	>85% of max
Pub/Sub Lag	>5 min	>10 min
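
As a quick way to reason about the table, the sketch below classifies a single reading against a warning and a critical threshold. The function name is hypothetical, and it assumes "higher is worse", which holds for every metric in this table.

```shell
# classify VALUE WARN CRIT
# Prints ok, warning, or critical for a metric where higher is worse.
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)      print "critical"
    else if (v + 0 > w + 0) print "warning"
    else                    print "ok"
  }'
}
```

For example, classify 4.2 3 5 prints warning for a p95 latency of 4.2s, and classify 90 70 85 prints critical for GKE CPU usage of 90%.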

Security Administration

Authentication & Authorisation: Firebase Authentication handles login and issues JWTs, which each service validates. Role-based access control (RBAC) enforces permissions at the controller level using the roles STAFF_ADMIN, STAFF_L3, STAFF_L2, STAFF_L1, PROVIDER, and CLAIMANT.
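
When troubleshooting an access problem it can help to look at the claims inside a token. The sketch below decodes the payload segment of a JWT for inspection only; it does not verify the signature and must never replace the validation the services perform. The function name is hypothetical, and the claim that carries the role is an assumption, since the claim layout is not described in this guide. It relies on base64 -d as provided by GNU coreutils.

```shell
# Print the decoded payload (claims) of a JWT. Inspection only:
# the signature is NOT verified.
jwt_payload() {
  # The payload is the second dot-separated segment, base64url-encoded.
  jwt_seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url encoding strips.
  case $(( ${#jwt_seg} % 4 )) in
    2) jwt_seg="${jwt_seg}==" ;;
    3) jwt_seg="${jwt_seg}=" ;;
  esac
  printf '%s' "$jwt_seg" | base64 -d
}
```

Call it as jwt_payload "$TOKEN" and read the resulting JSON. Treat tokens as secrets and avoid pasting them into shared terminals or logs.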

Encryption: Data at rest uses Cloud SQL and Firestore default encryption. All traffic uses TLS 1.3. PII fields are encrypted with AES-256-GCM field-level encryption. Protected Health Information (PHI) is tokenised before sending to external AI providers.

Audit Trail: Every operational action is logged to Cloud Firestore with actor, action type, immutable timestamp, before/after values, and reason. Logs are append-only with a 7-year retention period for POPIA compliance.

Data Administration

Backups: Cloud SQL has automated daily backups with 7-day retention. Firestore has scheduled exports to Cloud Storage. Cloud Storage has object versioning enabled.

# Connect via Cloud SQL proxy (run in a shell)
cloud_sql_proxy -instances=imms-prod:southamerica-west1:imms-db=tcp:5432

-- Check active connections per database (run in psql)
SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;

-- Run a manual vacuum on the current database (run in psql)
VACUUM ANALYZE;

Transactional data is exported to BigQuery for analytics by an hourly ETL job. Check imms_prod.etl_execution_log for ETL status.

Service Administration

Each Spring Boot service exposes a health endpoint at GET /actuator/health returning {"status": "UP"} when healthy.
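
A minimal sketch of a scripted health probe follows. The function name is hypothetical. It assumes the response is compact JSON with the top-level status as the first field, so that a nested component reporting UP cannot mask an overall DOWN.

```shell
# Succeeds when an Actuator health payload reports a top-level status of UP.
# $1 = response body from GET /actuator/health
is_up() {
  printf '%s' "$1" |
    grep -Eq '^[[:space:]]*\{[[:space:]]*"status"[[:space:]]*:[[:space:]]*"UP"'
}
```

A typical call would be is_up "$(curl -fsS http://localhost:8080/actuator/health)", where the host and port are assumptions that depend on how the service is exposed.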

# Manual scaling
kubectl scale deployment/rules-engine --replicas=5 -n imms-prod

# View logs
kubectl logs -n imms-prod deployment/rules-engine --tail=100

# Search via Cloud Logging
gcloud logging read "resource.type=k8s_container AND severity=ERROR" --limit 50

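# Inspect a Pub/Sub subscription (configuration only; backlog age is reported
# by the Cloud Monitoring metric pubsub.googleapis.com/subscription/oldest_unacked_message_age)
gcloud pubsub subscriptions describe imms-bill-review-subscription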

AI Services Administration

The AI platform supports two providers: Vertex AI (Gemini) as the primary and Anthropic (Claude) as the fallback. Configuration is set via environment variables or Spring Cloud Config.
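
The primary and fallback arrangement follows a common pattern, sketched below in shell purely as an illustration. This is not the platform's implementation, which lives in the services themselves; the function name is hypothetical, and the two provider calls are supplied by the caller.

```shell
# with_fallback PRIMARY FALLBACK [ARGS...]
# Runs PRIMARY with ARGS; if it exits non-zero, runs FALLBACK with the
# same ARGS. PRIMARY and FALLBACK are names of commands or functions.
with_fallback() {
  wf_primary="$1"
  wf_fallback="$2"
  shift 2
  if "$wf_primary" "$@"; then
    return 0
  fi
  echo "primary provider failed, using fallback" >&2
  "$wf_fallback" "$@"
}
```

One design point to note: if the primary writes partial output before failing, that output is still emitted ahead of the fallback's. A production implementation would typically buffer the primary response before deciding.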

Prompt templates are stored in Cloud Firestore and are versioned. Templates go through an approval workflow before becoming active. Task types include CLINICAL_BRIEFING, DOCUMENT_EXTRACTION, CHAT, and FRAUD_ANALYSIS.

Key AI Metrics

Metric	Description	Warning
Inference latency	Time to generate AI response	>5 seconds
Error rate	AI provider errors	>2%
Confidence score	AI response confidence	<0.7
Extraction accuracy	Document extraction quality	<85%

Disaster Recovery

RTO: ≤8 hours | RPO: ≤8 hours

# Cloud SQL restore
gcloud sql backups list --instance=imms-db
gcloud sql backups restore <backup-id> \
  --backup-instance=imms-db --restore-instance=imms-db

# Firestore restore
gcloud firestore import gs://imms-backups-prod/firestore/2026-06-04/

# Service rollback
kubectl rollout undo deployment/rules-engine -n imms-prod

Full system restore: Restore Cloud SQL from the latest backup, restore Firestore from the latest export, redeploy all services from known-good Artifact Registry tags, verify data integrity, and run smoke tests.
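
The restore sequence above could be scripted along the following lines. This is a sketch under stated assumptions, not a tested runbook: run_step, full_restore, verify_data_integrity, and run_smoke_tests are hypothetical names, only rules-engine is redeployed as an example, and the default is a dry run that prints each command instead of executing it.

```shell
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# run_step DESCRIPTION COMMAND [ARGS...]
run_step() {
  echo "==> $1"
  shift
  if [ "$DRY_RUN" = "1" ]; then
    echo "    (dry run) $*"
  else
    "$@"
  fi
}

# $1 = Cloud SQL backup ID
# $2 = Firestore export path (gs://...)
# $3 = known-good image tag
full_restore() {
  run_step "Restore Cloud SQL" gcloud sql backups restore "$1" \
    --backup-instance=imms-db --restore-instance=imms-db || return 1
  run_step "Restore Firestore" gcloud firestore import "$2" || return 1
  run_step "Redeploy rules-engine" kubectl set image deployment/rules-engine \
    -n imms-prod \
    "rules-engine=us-docker.pkg.dev/imms-project/imms-registry/rules-engine:$3" || return 1
  # The two hooks below are placeholders to be supplied by the operator.
  run_step "Verify data integrity" verify_data_integrity || return 1
  run_step "Run smoke tests" run_smoke_tests || return 1
}
```

Review the dry-run output first. To execute for real, define the two hook functions and set DRY_RUN=0; each step stops the sequence if it fails.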

Capacity Planning

Resource	Current	Usage	Headroom
GKE Nodes (n2-standard-4)	6	~55%	~45%
Cloud SQL (n2-standard-4, 100GB)	1 primary + 1 replica	~40%	~60%
Firestore	10GB	~2GB	~80%
Cloud Storage	500GB	~50GB	~90%
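
The Headroom column for the storage rows is simply the unused share of capacity. A small helper, with a hypothetical name, makes the arithmetic explicit:

```shell
# headroom_pct USED CAPACITY
# Prints the percentage of capacity still free, rounded to a whole number.
# USED and CAPACITY must be in the same unit (for example GB).
headroom_pct() {
  awk -v used="$1" -v cap="$2" 'BEGIN {
    if (cap + 0 <= 0) exit 1
    printf "%.0f\n", 100 - (used * 100) / cap
  }'
}
```

For the figures in the table, headroom_pct 2 10 prints 80 (Firestore) and headroom_pct 50 500 prints 90 (Cloud Storage).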

Monthly review items: storage growth rate, GKE cluster utilisation trends, Pub/Sub throughput trends, API latency trends, AI provider token usage and cost, and BigQuery query costs.

Health Check List

Daily

All services GREEN in Cloud Monitoring dashboard
No Pub/Sub dead-letter queue messages
Cloud SQL backup completed within the last 24 hours
BigQuery ETL completed successfully
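
The backup item in the daily list can be checked mechanically once the completion time of the most recent backup is known, for example from the output of gcloud sql backups list --instance=imms-db. The helper below is a sketch with a hypothetical name; it takes both times as Unix epoch seconds so that it does not depend on any particular date parsing tool.

```shell
# backup_is_fresh BACKUP_END_EPOCH NOW_EPOCH
# Succeeds when the backup finished no more than 24 hours (86400 s) ago.
backup_is_fresh() {
  [ $(( $2 - $1 )) -le 86400 ]
}
```

With GNU date, a completion timestamp held in a variable such as end_time (an assumed name) could be converted with date -d "$end_time" +%s, and the current time obtained with date +%s.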

Weekly

Review incident log for unresolved items
Check performance trends (latency, error rate)
Review capacity utilisation
Verify backup integrity (sample restore test)

Monthly

Full capacity planning review
Disaster recovery drill
User access review
AI provider usage and cost analysis

Quarterly

Compliance audit (POPIA, HIPAA, ISO 27001)
Penetration test
Full disaster recovery test