Overview
Designed and operated the complete AWS infrastructure for MediSync — a B2B healthtech SaaS that provides real-time clinical data synchronization and compliance reporting for 80+ healthcare facilities. The platform runs 5 microservices (3 Node.js/TypeScript, 2 Python/FastAPI) on ECS Fargate (fully serverless — zero EC2 instances to manage), a React SPA on CloudFront, and an async event pipeline processing ~500K clinical events per day.
The critical constraint: HIPAA compliance. Every architectural decision — from VPC design to log retention to encryption key management — was made through the lens of PHI (Protected Health Information) security requirements. The platform passed three independent HIPAA security audits with zero findings.
As the sole DevOps engineer, I designed, built, and maintained every layer — from Terraform modules to blue-green deployments to 3 AM PagerDuty rotations.
Architecture
Terraform Layered State Architecture
networking/ → VPC, Subnets, NAT Gateways, VPC Endpoints, Transit Gateway, Flow Logs
↓ remote_state
security/ → KMS Keys, IAM Roles, WAF, Security Hub, GuardDuty, CloudTrail
↓ remote_state
data/ → Aurora Serverless v2, ElastiCache, S3, DynamoDB, SQS, SNS
↓ remote_state
compute/ → ECS Cluster, Task Definitions, ALB, Target Groups, Service Discovery
↓ remote_state
pipeline/ → CodePipeline, CodeBuild, CodeDeploy, ECR, Artifact Buckets
↓ remote_state
observability/→ CloudWatch Dashboards, X-Ray, Alarms, Log Groups, Metric Filters
System Architecture
Healthcare Facility Systems (HL7/FHIR)
↓
API Gateway (Regional) + WAF
↓
ALB (Internal, HTTPS only)
├── ingestion-service (Fargate, 4 tasks) → SQS FIFO Queue
├── sync-engine (Fargate, 6 tasks) → Aurora Serverless v2
├── compliance-service (Fargate, 3 tasks) → DynamoDB (audit trail)
├── notification-service (Fargate, 2 tasks)→ SNS + SES
└── analytics-worker (Fargate, 3 tasks) → S3 Data Lake
↓
EventBridge (Event Bus)
↓
CloudWatch Logs + X-Ray Traces + Metric Filters
↓
CloudFront CDN → React SPA (S3 Origin)
↓
Healthcare Admins (Dashboard)
Blue-Green Deployment Flow
Developer Merge → CodePipeline Trigger
↓
CodeBuild (Build + Test + SAST + Container Scan)
↓
ECR (Immutable image tag: main-<build-number>)
↓
CodeDeploy (Blue-Green)
├── Creates Green Task Set (new version)
├── Shifts 10% traffic to Green (canary)
├── Runs health checks for 5 minutes
├── Shifts remaining 90% traffic
├── Waits 15 minutes (bake time)
└── Terminates Blue Task Set
↓
If any check fails:
└── Automatic rollback to Blue (< 30 seconds)
Infrastructure (Terraform)
14 Custom Terraform Modules
| Module | Lines | Description |
|---|---|---|
vpc | 310 | HIPAA-compliant VPC with 3 AZs, public/private/isolated subnets, NAT Gateways, VPC Flow Logs (encrypted, 365-day retention), VPC endpoints for all AWS services |
ecs-cluster | 180 | ECS Fargate cluster with Container Insights, execute-command logging (audit trail), capacity providers (FARGATE + FARGATE_SPOT) |
ecs-service | 420 | Reusable service module: task definition, service, ALB target group, CodeDeploy blue-green config, auto-scaling policies, CloudWatch log groups |
alb | 195 | Internal ALB with HTTPS listeners, ACM certificates, security groups, access logging to S3, WAF association, idle timeout tuning |
aurora | 340 | Aurora Serverless v2 (PostgreSQL 16), Multi-AZ, 0.5-64 ACU auto-scaling, IAM authentication, Performance Insights, automated snapshots (35-day retention), cross-region read replica |
elasticache | 145 | ElastiCache Redis 7 (Serverless), encryption at rest (KMS) + in transit (TLS), automatic failover, auth token from Secrets Manager |
s3 | 155 | HIPAA-compliant buckets: SSE-KMS, versioning, MFA delete on PHI buckets, lifecycle policies, access logging, object lock (compliance mode for audit trails) |
kms | 110 | Customer-managed KMS keys: per-service key isolation, automatic annual rotation, cross-account key policies for DR, CloudTrail key usage logging |
iam | 450 | Per-service task roles + execution roles, least-privilege policies, permission boundaries, SCP guardrails, IAM Access Analyzer integration |
codepipeline | 380 | CodePipeline + CodeBuild + CodeDeploy for blue-green ECS deployments, artifact encryption, cross-account deploy capability |
waf | 190 | WAF WebACL: AWSManagedRulesCommonRuleSet, AWSManagedRulesKnownBadInputsRuleSet, rate limiting (1000/5min per IP), custom rules for HL7/FHIR endpoint protection |
cloudfront-spa | 200 | CloudFront + S3 with OAC, security headers, custom error responses, geo-restriction (US + Canada only — data residency requirement) |
eventbridge | 135 | EventBridge event bus with schema registry, DLQ for failed deliveries, archive (90-day replay window) |
observability | 280 | CloudWatch dashboards, X-Ray tracing groups, composite alarms, anomaly detection, SNS fan-out for multi-channel alerting |
ECS Task Roles vs. Execution Roles
A critical distinction that most teams get wrong. Every Fargate task has two separate IAM roles with completely different trust policies:
# EXECUTION ROLE — used by the ECS Agent to pull images and write logs
# This role is assumed by ecs-tasks.amazonaws.com at task START
resource "aws_iam_role" "execution_role" {
name = "medisync-${var.service_name}-execution"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Service = "ecs-tasks.amazonaws.com"
}
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_role_policy_attachment" "execution_ecr" {
role = aws_iam_role.execution_role.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
# Secrets injection at startup — ECS Agent reads these BEFORE container starts
resource "aws_iam_policy" "execution_secrets" {
name = "medisync-${var.service_name}-execution-secrets"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "ReadSecrets"
Effect = "Allow"
Action = [
"secretsmanager:GetSecretValue"
]
Resource = var.secret_arns # Only the specific secrets this service needs
},
{
Sid = "DecryptSecrets"
Effect = "Allow"
Action = ["kms:Decrypt"]
Resource = [var.secrets_kms_key_arn]
}
]
})
}
# TASK ROLE — used by the APPLICATION CODE at runtime
# This role is what the SDK uses when the container calls AWS APIs
resource "aws_iam_role" "task_role" {
name = "medisync-${var.service_name}-task"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Service = "ecs-tasks.amazonaws.com"
}
Action = "sts:AssumeRole"
Condition = {
ArnLike = {
"aws:SourceArn" = "arn:aws:ecs:${var.region}:${var.account_id}:*"
}
StringEquals = {
"aws:SourceAccount" = var.account_id
}
}
}]
})
}
Per-Service IAM Policies — Least Privilege
Each service gets only the AWS permissions it needs. No shared “application role”:
# Ingestion Service — writes to SQS FIFO, reads HL7 from S3 staging
resource "aws_iam_policy" "ingestion_task_policy" {
name = "medisync-ingestion-task"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "WriteSQS"
Effect = "Allow"
Action = ["sqs:SendMessage", "sqs:GetQueueAttributes"]
Resource = [module.sqs_clinical_events.queue_arn]
},
{
Sid = "ReadStagingBucket"
Effect = "Allow"
Action = ["s3:GetObject", "s3:ListBucket"]
Resource = [
module.s3_staging.bucket_arn,
"${module.s3_staging.bucket_arn}/hl7/*"
]
},
{
Sid = "XRayTracing"
Effect = "Allow"
Action = ["xray:PutTraceSegments", "xray:PutTelemetryRecords"]
Resource = ["*"]
},
{
Sid = "KMSDecrypt"
Effect = "Allow"
Action = ["kms:Decrypt", "kms:GenerateDataKey"]
Resource = [module.kms.staging_key_arn]
}
]
})
}
# Sync Engine — reads SQS, writes Aurora (IAM auth), writes EventBridge
resource "aws_iam_policy" "sync_engine_task_policy" {
name = "medisync-sync-engine-task"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "ConsumeSQS"
Effect = "Allow"
Action = [
"sqs:ReceiveMessage", "sqs:DeleteMessage",
"sqs:GetQueueAttributes", "sqs:ChangeMessageVisibility"
]
Resource = [module.sqs_clinical_events.queue_arn]
},
{
Sid = "RDSIAMAuth"
Effect = "Allow"
Action = "rds-db:connect"
Resource = "arn:aws:rds-db:${var.region}:${var.account_id}:dbuser:${module.aurora.cluster_resource_id}/sync_engine"
},
{
Sid = "PutEvents"
Effect = "Allow"
Action = "events:PutEvents"
Resource = [module.eventbridge.bus_arn]
Condition = {
StringEquals = {
"events:source" = "medisync.sync-engine"
}
}
},
{
Sid = "CacheAccess"
Effect = "Allow"
Action = ["elasticache:Connect"]
Resource = [module.elasticache.cluster_arn]
}
]
})
}
# Compliance Service — reads DynamoDB audit trail, writes S3 compliance reports
resource "aws_iam_policy" "compliance_task_policy" {
name = "medisync-compliance-task"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "DynamoDBAccess"
Effect = "Allow"
Action = [
"dynamodb:Query", "dynamodb:GetItem", "dynamodb:PutItem",
"dynamodb:BatchWriteItem"
]
Resource = [
module.dynamodb_audit.table_arn,
"${module.dynamodb_audit.table_arn}/index/*"
]
},
{
Sid = "ComplianceReportsBucket"
Effect = "Allow"
Action = ["s3:PutObject", "s3:GetObject"]
Resource = [
"${module.s3_compliance.bucket_arn}/reports/*"
]
Condition = {
StringEquals = {
"s3:x-amz-server-side-encryption" = "aws:kms"
}
}
},
{
Sid = "KMSForReports"
Effect = "Allow"
Action = ["kms:GenerateDataKey", "kms:Decrypt"]
Resource = [module.kms.compliance_key_arn]
}
]
})
}
VPC Architecture — HIPAA Compliant
┌──────────────────────────── VPC 10.10.0.0/16 ───────────────────────────┐
│ │
│ ┌─── AZ us-east-1a ───┐ ┌─── AZ us-east-1b ───┐ ┌─── AZ us-east-1c ┐│
│ │ │ │ │ │ ││
│ │ Public 10.10.1.0/24 │ │ Public 10.10.2.0/24 │ │ Public 10.10.3 ││
│ │ └── NAT Gateway │ │ └── NAT Gateway │ │ └── NAT GW ││
│ │ └── ALB ENIs │ │ └── ALB ENIs │ │ └── ALB ENIs ││
│ │ │ │ │ │ ││
│ │ Private 10.10.11/24 │ │ Private 10.10.12/24 │ │ Private 10.10.13 ││
│ │ └── Fargate Tasks │ │ └── Fargate Tasks │ │ └── Fargate ││
│ │ └── No public IPs │ │ └── No public IPs │ │ └── Tasks ││
│ │ │ │ │ │ ││
│ │ Isolated 10.10.21/24 │ │ Isolated 10.10.22/24 │ │ Isolated 10.10.23││
│ │ └── Aurora Primary │ │ └── Aurora Replica │ │ └── ElastiCache││
│ │ └── No NAT route │ │ └── No NAT route │ │ └── DynamoDB VE││
│ └──────────────────────┘ └──────────────────────┘ └──────────────────┘│
│ │
│ VPC Endpoints (Interface): ECR, Secrets Manager, KMS, STS, CloudWatch, │
│ X-Ray, SQS, SNS, EventBridge, DynamoDB (Gateway), S3 (Gateway) │
│ │
│ VPC Flow Logs → CloudWatch Logs (encrypted, 365-day retention) │
│ No internet gateway access from private/isolated subnets │
└─────────────────────────────────────────────────────────────────────────┘
- Fargate tasks run in private subnets only — no public IP assignment, egress via NAT Gateway
- Interface VPC Endpoints for all AWS API calls — PHI never traverses the public internet
- VPC Flow Logs retained for 365 days (HIPAA requirement) with KMS encryption
- Security Groups: per-service SGs — ingestion service can talk to SQS endpoint, sync engine to Aurora, etc. No “allow all within VPC”
ECS Fargate Platform
Task Definition — Production Grade
resource "aws_ecs_task_definition" "service" {
family = "medisync-${var.service_name}"
requires_compatibilities = ["FARGATE"]
network_mode = "awsvpc"
cpu = var.cpu # e.g., 1024 (1 vCPU)
memory = var.memory # e.g., 2048 (2 GB)
execution_role_arn = aws_iam_role.execution_role.arn
task_role_arn = aws_iam_role.task_role.arn
runtime_platform {
operating_system_family = "LINUX"
cpu_architecture = "ARM64" # Graviton — 20% cheaper, 40% better perf/$
}
container_definitions = jsonencode([
{
name = var.service_name
image = "${var.ecr_repo_url}:${var.image_tag}"
essential = true
portMappings = [{
containerPort = var.container_port
protocol = "tcp"
}]
# Secrets injected by ECS Agent via Execution Role
secrets = [
{
name = "DATABASE_URL"
valueFrom = "${var.db_secret_arn}:connection_string::"
},
{
name = "REDIS_URL"
valueFrom = "${var.redis_secret_arn}:url::"
},
{
name = "API_KEY"
valueFrom = "${var.api_key_secret_arn}"
}
]
environment = [
{ name = "NODE_ENV", value = "production" },
{ name = "AWS_REGION", value = var.region },
{ name = "SERVICE_NAME", value = var.service_name },
{ name = "OTEL_EXPORTER_OTLP_ENDPOINT", value = "http://localhost:2000" }
]
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:${var.container_port}/health || exit 1"]
interval = 15
timeout = 5
retries = 3
startPeriod = 60
}
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = "/ecs/medisync/${var.service_name}"
"awslogs-region" = var.region
"awslogs-stream-prefix" = "ecs"
}
}
linuxParameters = {
initProcessEnabled = true # Proper PID 1 signal handling
}
},
# X-Ray sidecar for distributed tracing
{
name = "xray-daemon"
image = "public.ecr.aws/xray/aws-xray-daemon:3.x"
essential = false
cpu = 32
memory = 64
portMappings = [{
containerPort = 2000
protocol = "udp"
}]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = "/ecs/medisync/${var.service_name}/xray"
"awslogs-region" = var.region
"awslogs-stream-prefix" = "xray"
}
}
}
])
}
ECS Service with Blue-Green Deployment
resource "aws_ecs_service" "service" {
name = var.service_name
cluster = var.cluster_id
task_definition = aws_ecs_task_definition.service.arn
desired_count = var.desired_count
launch_type = "FARGATE"
deployment_controller {
type = "CODE_DEPLOY" # Blue-green via CodeDeploy
}
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.service.id]
assign_public_ip = false # NEVER — Fargate tasks stay private
}
load_balancer {
target_group_arn = aws_lb_target_group.blue.arn
container_name = var.service_name
container_port = var.container_port
}
service_registries {
registry_arn = aws_service_discovery_service.service.arn
}
enable_execute_command = true # For debugging — all sessions logged to CloudTrail
lifecycle {
ignore_changes = [task_definition, load_balancer] # Managed by CodeDeploy
}
}
Auto-Scaling — Target Tracking + Step Scaling
# Target tracking: maintain 65% average CPU
resource "aws_appautoscaling_policy" "cpu" {
name = "medisync-${var.service_name}-cpu-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.service.resource_id
scalable_dimension = aws_appautoscaling_target.service.scalable_dimension
service_namespace = aws_appautoscaling_target.service.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
target_value = 65.0
scale_in_cooldown = 300
scale_out_cooldown = 60
}
}
# Target tracking: maintain ~500 requests/target
resource "aws_appautoscaling_policy" "request_count" {
name = "medisync-${var.service_name}-request-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.service.resource_id
scalable_dimension = aws_appautoscaling_target.service.scalable_dimension
service_namespace = aws_appautoscaling_target.service.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ALBRequestCountPerTarget"
resource_label = "${aws_lb.internal.arn_suffix}/${aws_lb_target_group.blue.arn_suffix}"
}
target_value = 500.0
scale_in_cooldown = 300
scale_out_cooldown = 60
}
}
# Step scaling for SQS queue depth (sync-engine)
resource "aws_appautoscaling_policy" "queue_depth" {
name = "medisync-sync-engine-queue-scaling"
policy_type = "StepScaling"
resource_id = aws_appautoscaling_target.sync_engine.resource_id
scalable_dimension = aws_appautoscaling_target.sync_engine.scalable_dimension
service_namespace = aws_appautoscaling_target.sync_engine.service_namespace
step_scaling_policy_configuration {
adjustment_type = "ChangeInCapacity"
cooldown = 120
metric_aggregation_type = "Average"
step_adjustment {
metric_interval_lower_bound = 0
metric_interval_upper_bound = 5000
scaling_adjustment = 2
}
step_adjustment {
metric_interval_lower_bound = 5000
scaling_adjustment = 5
}
}
}
CI/CD — CodePipeline with Blue-Green Deployments
Pipeline Architecture
Each service has its own CodePipeline with 4 stages:
resource "aws_codepipeline" "service" {
name = "medisync-${var.service_name}"
role_arn = aws_iam_role.codepipeline.arn
artifact_store {
location = aws_s3_bucket.artifacts.bucket
type = "S3"
encryption_key {
id = module.kms.pipeline_key_arn
type = "KMS"
}
}
stage {
name = "Source"
action {
name = "GitHub"
category = "Source"
owner = "AWS"
provider = "CodeStarSourceConnection"
version = "1"
output_artifacts = ["source_output"]
configuration = {
ConnectionArn = var.codestar_connection_arn
FullRepositoryId = "medisync/${var.service_name}"
BranchName = "main"
}
}
}
stage {
name = "Build"
action {
name = "BuildAndTest"
category = "Build"
owner = "AWS"
provider = "CodeBuild"
version = "1"
input_artifacts = ["source_output"]
output_artifacts = ["build_output"]
configuration = {
ProjectName = aws_codebuild_project.service.name
}
}
}
stage {
name = "Deploy"
action {
name = "BlueGreenDeploy"
category = "Deploy"
owner = "AWS"
provider = "CodeDeployToECS"
version = "1"
input_artifacts = ["build_output"]
configuration = {
ApplicationName = aws_codedeploy_app.service.name
DeploymentGroupName = aws_codedeploy_deployment_group.service.deployment_group_name
TaskDefinitionTemplateArtifact = "build_output"
TaskDefinitionTemplatePath = "taskdef.json"
AppSpecTemplateArtifact = "build_output"
AppSpecTemplatePath = "appspec.yaml"
}
}
}
stage {
name = "PostDeploy"
action {
name = "IntegrationTests"
category = "Build"
owner = "AWS"
provider = "CodeBuild"
version = "1"
input_artifacts = ["source_output"]
configuration = {
ProjectName = aws_codebuild_project.integration_tests.name
}
}
}
}
CodeDeploy Blue-Green Configuration
resource "aws_codedeploy_deployment_group" "service" {
app_name = aws_codedeploy_app.service.name
deployment_group_name = "medisync-${var.service_name}"
deployment_config_name = "CodeDeployDefault.ECSCanary10Percent5Minutes"
service_role_arn = aws_iam_role.codedeploy.arn
auto_rollback_configuration {
enabled = true
events = ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
}
alarm_configuration {
alarms = [
aws_cloudwatch_metric_alarm.service_5xx.alarm_name,
aws_cloudwatch_metric_alarm.service_latency_p99.alarm_name,
aws_cloudwatch_metric_alarm.service_error_rate.alarm_name,
]
enabled = true
}
blue_green_deployment_config {
deployment_ready_option {
action_on_timeout = "CONTINUE_DEPLOYMENT"
}
terminate_blue_instances_on_deployment_success {
action = "TERMINATE"
termination_wait_time_in_minutes = 15
}
}
deployment_style {
deployment_option = "WITH_TRAFFIC_CONTROL"
deployment_type = "BLUE_GREEN"
}
ecs_service {
cluster_name = var.cluster_name
service_name = aws_ecs_service.service.name
}
load_balancer_info {
target_group_pair_info {
prod_traffic_route {
listener_arns = [aws_lb_listener.https.arn]
}
test_traffic_route {
listener_arns = [aws_lb_listener.test.arn]
}
target_group {
name = aws_lb_target_group.blue.name
}
target_group {
name = aws_lb_target_group.green.name
}
}
}
}
AppSpec for ECS Blue-Green
# appspec.yaml — generated by CodeBuild
version: 0.0
Resources:
- TargetService:
Type: AWS::ECS::Service
Properties:
TaskDefinition: <TASK_DEFINITION>
LoadBalancerInfo:
ContainerName: "sync-engine"
ContainerPort: 3000
PlatformVersion: "1.4.0"
CapacityProviderStrategy:
- Base: 2
CapacityProvider: "FARGATE"
Weight: 3
- Base: 0
CapacityProvider: "FARGATE_SPOT"
Weight: 7
Hooks:
- BeforeInstall: "hooks/before-install.sh"
- AfterInstall: "hooks/after-install.sh"
- AfterAllowTestTraffic: "hooks/run-smoke-tests.sh"
- AfterAllowTraffic: "hooks/validate-production.sh"
CodeBuild — Build + Test + Scan
# buildspec.yml
version: 0.2
env:
variables:
ECR_REPO: "medisync"
secrets-manager:
NPM_TOKEN: "medisync/ci/npm-token:token"
phases:
pre_build:
commands:
- echo "Logging in to ECR..."
- aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_URI
- export IMAGE_TAG="main-${CODEBUILD_BUILD_NUMBER}"
build:
commands:
- echo "Running tests..."
- npm ci && npm run lint && npm run test:coverage
- echo "Running SAST scan..."
- trivy fs --severity CRITICAL,HIGH --exit-code 1 .
- echo "Building Docker image..."
- docker build -t $ECR_URI/$SERVICE_NAME:$IMAGE_TAG --platform linux/arm64 .
- echo "Scanning container image..."
- trivy image --severity CRITICAL --exit-code 1 $ECR_URI/$SERVICE_NAME:$IMAGE_TAG
- echo "Pushing to ECR..."
- docker push $ECR_URI/$SERVICE_NAME:$IMAGE_TAG
post_build:
commands:
- echo "Generating deployment artifacts..."
- envsubst < taskdef-template.json > taskdef.json
- cp appspec.yaml appspec.yaml
artifacts:
files:
- taskdef.json
- appspec.yaml
- hooks/*
Observability
CloudWatch — Dashboards + Alarms
- 6 custom CloudWatch dashboards — per-service metrics, cross-service overview, ALB health, Aurora performance, SQS queue depth trends, cost analysis
- 40+ CloudWatch Alarms with composite alarm aggregation:
# Composite alarm: service health = HTTP errors + latency + task health
resource "aws_cloudwatch_composite_alarm" "service_health" {
alarm_name = "medisync-${var.service_name}-composite-health"
alarm_rule = <<-RULE
ALARM("${aws_cloudwatch_metric_alarm.http_5xx.alarm_name}") OR
ALARM("${aws_cloudwatch_metric_alarm.p99_latency.alarm_name}") OR
ALARM("${aws_cloudwatch_metric_alarm.running_task_count.alarm_name}")
RULE
alarm_actions = [aws_sns_topic.critical.arn]
ok_actions = [aws_sns_topic.recovery.arn]
}
# Individual alarm: 5xx error rate exceeds 1% for 3 consecutive periods
resource "aws_cloudwatch_metric_alarm" "http_5xx" {
alarm_name = "medisync-${var.service_name}-5xx-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
threshold = 1.0
treat_missing_data = "notBreaching"
metric_query {
id = "error_rate"
expression = "(errors / total) * 100"
label = "5xx Error Rate %"
return_data = true
}
metric_query {
id = "errors"
metric {
metric_name = "HTTPCode_Target_5XX_Count"
namespace = "AWS/ApplicationELB"
period = 60
stat = "Sum"
dimensions = { TargetGroup = var.target_group_arn_suffix }
}
}
metric_query {
id = "total"
metric {
metric_name = "RequestCount"
namespace = "AWS/ApplicationELB"
period = 60
stat = "Sum"
dimensions = { TargetGroup = var.target_group_arn_suffix }
}
}
}
X-Ray Distributed Tracing
- X-Ray daemon sidecar in every Fargate task definition
- Trace maps showing cross-service call flows — ingestion → SQS → sync engine → Aurora
- Service lens dashboards correlating traces with CloudWatch metrics
- Sampling rules: 5% baseline, 100% for errors, 100% for high-latency requests (>2s)
{
"SamplingRule": {
"RuleName": "medisync-errors",
"Priority": 1,
"FixedRate": 1.0,
"ReservoirSize": 10,
"ServiceName": "medisync-*",
"ServiceType": "AWS::ECS::Container",
"Host": "*",
"HTTPMethod": "*",
"URLPath": "*",
"ResourceARN": "*",
"Attributes": {
"http.status_code": "5*"
}
}
}
HIPAA Audit Logging
- CloudTrail — all API calls logged, encrypted (KMS), 7-year retention in S3 with Object Lock (compliance mode)
- VPC Flow Logs — 365-day retention, used for network forensics
- ECS Execute Command logs — every
execsession recorded in CloudWatch Logs - DynamoDB audit table — application-level audit trail for all PHI access (who, what, when, from where)
- Athena queries over CloudTrail + VPC Flow Logs for security investigations
Data Tier
Aurora Serverless v2
resource "aws_rds_cluster" "aurora" {
cluster_identifier = "medisync-production"
engine = "aurora-postgresql"
engine_mode = "provisioned"
engine_version = "16.1"
database_name = "medisync"
serverlessv2_scaling_configuration {
min_capacity = 0.5 # Scale to near-zero during off-peak
max_capacity = 64.0 # Burst to 64 ACUs during peak processing
}
storage_encrypted = true
kms_key_id = module.kms.aurora_key_arn
deletion_protection = true
skip_final_snapshot = false
iam_database_authentication_enabled = true # IAM auth — no passwords
vpc_security_group_ids = [aws_security_group.aurora.id]
db_subnet_group_name = aws_db_subnet_group.isolated.name
backup_retention_period = 35
preferred_backup_window = "03:00-04:00"
enabled_cloudwatch_logs_exports = ["postgresql"]
# Cross-region replica for DR (us-west-2)
# global_cluster_identifier = aws_rds_global_cluster.medisync.id
}
resource "aws_rds_cluster_instance" "writer" {
cluster_identifier = aws_rds_cluster.aurora.id
instance_class = "db.serverless"
engine = aws_rds_cluster.aurora.engine
engine_version = aws_rds_cluster.aurora.engine_version
publicly_accessible = false
performance_insights_enabled = true
performance_insights_kms_key_id = module.kms.aurora_key_arn
monitoring_interval = 15
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
}
- IAM Database Authentication — no database passwords, token-based auth via task role
- Auto-scaling from 0.5 to 64 ACUs — handles burst clinical data ingestion without over-provisioning
- Performance Insights with KMS encryption — query-level analysis for optimization
- Enhanced Monitoring at 15-second granularity
Key Achievements
- Zero HIPAA audit findings across three independent security assessments
- 99.95% uptime over 12 months — Fargate eliminated all node/OS-level incidents
- ~500K clinical events/day processed with P99 latency under 400ms
- Blue-green deployments with automatic rollback — zero failed deployments in production
- 45% cost reduction from EC2 to Fargate + Aurora Serverless v2 (pay-per-use scaling)
- Mean deployment time: 6 minutes including canary bake time and post-deploy validation
- 14 Terraform modules, all HIPAA-compliant, reusable across dev/staging/prod
- Zero static credentials anywhere — all auth via IAM roles (task roles, execution roles, CodePipeline service roles)
- Full audit trail: every API call, network flow, container exec, and PHI access logged and retained per HIPAA requirements
Need something similar?
Let's discuss how I can build this kind of infrastructure for your team.