B2B Fintech SaaS

AWS EKS Production Platform — FinTrack

Designed and built a production-grade EKS platform for a B2B fintech SaaS serving 200+ enterprise clients — 12 Terraform modules, IRSA-based zero-trust identity, ArgoCD GitOps, and a full observability stack on AWS.

Terraform AWS EKS Kubernetes ArgoCD Helm GitHub Actions ECR IRSA Karpenter ALB AWS WAF CloudFront RDS PostgreSQL ElastiCache Redis Secrets Manager KMS CloudWatch Prometheus Grafana Loki Fluent Bit cert-manager External Secrets Operator Trivy OPA Gatekeeper Node.js Python React

Overview

Designed, built, and operated the entire AWS cloud infrastructure and Kubernetes platform for FinTrack — a B2B fintech SaaS that provides real-time financial reconciliation and reporting to 200+ enterprise clients. The platform runs 6 microservices (4 Node.js/TypeScript, 2 Python/FastAPI), a React SPA, and a suite of async workers processing ~2M financial transactions per day.

As the sole DevOps and platform engineer, I owned every layer — from VPC design and IAM policy authoring to Kubernetes pod security and incident response alerting. The platform achieved 99.97% uptime over 18 months with zero data breaches and full SOC 2 audit trail compliance.


Architecture

Terraform Layered State Architecture

Modeled after the same layered pattern I use across engagements — each layer has its own state file in S3, with cross-layer data sharing via terraform_remote_state:

networking/   → VPC, Subnets, NAT Gateways, Transit Gateway, Route Tables, NACLs
  ↓ remote_state
security/     → KMS Keys, IAM Roles, OIDC Provider, GuardDuty, Security Hub
  ↓ remote_state
data/         → RDS PostgreSQL, ElastiCache Redis, S3 Buckets, DynamoDB Tables
  ↓ remote_state
compute/      → EKS Cluster, ECR, Node Groups, Karpenter Provisioners
  ↓ remote_state
platform/     → ArgoCD Bootstrap, External Secrets, cert-manager, Ingress
  ↓ remote_state
observability/→ CloudWatch, Prometheus Stack, Loki, Grafana, Alertmanager
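Cross-layer wiring is one data source per upstream layer — a minimal sketch, assuming an S3 state bucket named `fintrack-terraform-state` (the real bucket and key names may differ):

```hcl
# compute/ layer reading outputs published by networking/
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "fintrack-terraform-state"      # illustrative bucket name
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

module "eks" {
  source = "../modules/eks"

  vpc_id     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

Because each layer only consumes the previous layer's declared outputs, layers can be planned and applied independently with a small blast radius.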

CI/CD & GitOps Flow

Developer Push → GitHub Actions (Lint + Test + SAST + Build)

   ECR (Container Registry — immutable tags)

   ArgoCD Image Updater (polls ECR for new tags)

   ArgoCD Application Controller (syncs desired state)

   EKS Cluster (us-east-1)
   ├── reconciliation-service (3 replicas)
   ├── reporting-service (2 replicas)
   ├── notification-service (2 replicas)
   ├── auth-service (2 replicas)
   ├── transaction-worker (4 replicas, Karpenter spot)
   ├── ml-anomaly-detector (2 replicas, GPU spot)
   ├── AWS ALB Ingress (HTTPS + WAF)
   ├── External Secrets (← Secrets Manager)
   ├── cert-manager (ACM + Let's Encrypt)
   ├── OPA Gatekeeper (policy enforcement)
   ├── Loki + Fluent Bit (Logging)
   └── Prometheus + Grafana (Monitoring)

   CloudFront CDN → React SPA (S3 Origin)

   Enterprise Clients (Web + API)

Infrastructure (Terraform)

12 Custom Terraform Modules

Every AWS resource is provisioned through reusable, versioned Terraform modules stored in a dedicated terraform-modules mono-repo. Each module ships with variable validation, documented outputs, and a README.md with usage examples.
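As an illustration of the validation style — a sketch, with variable names that are assumptions rather than the modules' actual interface:

```hcl
# modules/vpc/variables.tf — illustrative validation blocks
variable "az_count" {
  type        = number
  description = "Number of availability zones to span"
  default     = 3

  validation {
    condition     = var.az_count >= 2 && var.az_count <= 4
    error_message = "az_count must be between 2 and 4 for Multi-AZ resilience."
  }
}

variable "vpc_cidr" {
  type        = string
  description = "IPv4 CIDR block for the VPC"

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "vpc_cidr must be a valid IPv4 CIDR block."
  }
}
```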

Module | Lines | Description
vpc | 340 | Multi-AZ VPC with public, private, and isolated subnets; NAT Gateways (one per AZ); VPC Flow Logs to S3; custom NACLs for data-tier isolation
eks | 520 | EKS cluster with OIDC provider, envelope encryption (KMS), managed + Karpenter node groups, cluster add-ons (CoreDNS, kube-proxy, VPC CNI, EBS CSI), IRSA trust policies
ecr | 85 | ECR repositories with immutable tags, image scanning on push, lifecycle policies (keep last 20 tagged, delete untagged after 7d)
rds | 290 | RDS PostgreSQL 16 Multi-AZ with automated backups (35-day retention), Performance Insights, IAM authentication, custom parameter group, private subnet group
elasticache | 165 | ElastiCache Redis 7 cluster-mode with 3 shards, 2 replicas per shard, encryption at rest (KMS) + in transit (TLS), automatic failover
kms | 95 | KMS keys with automatic rotation, key policies for EKS envelope encryption, RDS encryption, S3 SSE, Secrets Manager encryption
iam-irsa | 380 | IRSA role factory — generates per-service IAM roles with OIDC trust policies, scoped to specific Kubernetes service accounts and namespaces
secrets-manager | 120 | Secrets with automatic rotation via Lambda, resource policies, cross-account sharing for DR
cloudfront-spa | 210 | CloudFront distribution with S3 origin, OAC, custom error responses (SPA routing), WAF WebACL association, Response Security Headers policy
waf | 175 | WAF WebACL with AWS managed rules (AWSManagedRulesCommonRuleSet, SQLi, XSS, Bot Control), custom rate limiting (2000 req/5min per IP), geo-restriction
s3 | 130 | Bucket with versioning, SSE-KMS, bucket policies (deny non-TLS), lifecycle rules, replication to DR region
observability | 240 | CloudWatch log groups, metric filters, composite alarms, SNS topics, Chatbot integration for Slack

Zero Static Credentials — IRSA Throughout

Every pod runs with a dedicated IAM role. No AWS access keys exist anywhere — not in Secrets Manager, not in environment variables, nowhere.

# IRSA Role Factory — per-service IAM role with OIDC trust
module "irsa_reconciliation_service" {
  source = "../../modules/iam-irsa"

  role_name   = "fintrack-reconciliation-service"
  namespace   = "fintrack"
  sa_name     = "reconciliation-service"
  oidc_issuer = data.terraform_remote_state.compute.outputs.oidc_issuer
  oidc_arn    = data.terraform_remote_state.compute.outputs.oidc_provider_arn

  policy_arns = [
    aws_iam_policy.rds_connect.arn,
    aws_iam_policy.s3_reports_bucket.arn,
    aws_iam_policy.sqs_transactions_queue.arn,
    aws_iam_policy.secrets_read.arn,
  ]
}

# Trust policy generated by the module:
# {
#   "Effect": "Allow",
#   "Principal": {
#     "Federated": "arn:aws:iam::oidc-provider/oidc.eks.us-east-1..."
#   },
#   "Action": "sts:AssumeRoleWithWebIdentity",
#   "Condition": {
#     "StringEquals": {
#       "oidc.eks....:sub": "system:serviceaccount:fintrack:reconciliation-service",
#       "oidc.eks....:aud": "sts.amazonaws.com"
#     }
#   }
# }

IAM Policy Scoping

Every IAM policy follows least-privilege with resource-level and condition-based restrictions:

# S3 policy — scoped to specific bucket + prefix + encryption requirement
resource "aws_iam_policy" "s3_reports_bucket" {
  name = "fintrack-s3-reports-access"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "AllowReportsBucketAccess"
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
        Resource = [
          module.s3_reports.bucket_arn,
          "${module.s3_reports.bucket_arn}/*"
        ]
        Condition = {
          StringEquals = {
            "s3:x-amz-server-side-encryption" = "aws:kms"
            "s3:x-amz-server-side-encryption-aws-kms-key-id" = module.kms.reports_key_arn
          }
        }
      },
      {
        Sid      = "AllowKMSDecrypt"
        Effect   = "Allow"
        Action   = ["kms:Decrypt", "kms:GenerateDataKey"]
        Resource = [module.kms.reports_key_arn]
      }
    ]
  })
}

# RDS IAM Authentication — no password, token-based auth
resource "aws_iam_policy" "rds_connect" {
  name = "fintrack-rds-iam-connect"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "rds-db:connect"
      Resource = "arn:aws:rds-db:${var.region}:${data.aws_caller_identity.current.account_id}:dbuser:${module.rds.cluster_resource_id}/fintrack_app"
    }]
  })
}

VPC Architecture

┌──────────────────────────── VPC 10.0.0.0/16 ────────────────────────────┐
│                                                                         │
│  ┌─── AZ us-east-1a ───┐  ┌─── AZ us-east-1b ───┐  ┌─── AZ us-east-1c ┐│
│  │                      │  │                      │  │                   ││
│  │ Public  10.0.1.0/24  │  │ Public  10.0.2.0/24  │  │ Public 10.0.3.0  ││
│  │   └── NAT Gateway    │  │   └── NAT Gateway    │  │   └── NAT GW     ││
│  │   └── ALB ENIs       │  │   └── ALB ENIs       │  │   └── ALB ENIs   ││
│  │                      │  │                      │  │                   ││
│  │ Private 10.0.11.0/24 │  │ Private 10.0.12.0/24 │  │ Private 10.0.13  ││
│  │   └── EKS Nodes      │  │   └── EKS Nodes      │  │   └── EKS Nodes  ││
│  │   └── Pods (VPC CNI) │  │   └── Pods (VPC CNI) │  │   └── Pods       ││
│  │                      │  │                      │  │                   ││
│  │ Isolated 10.0.21/24  │  │ Isolated 10.0.22/24  │  │ Isolated 10.0.23 ││
│  │   └── RDS Primary    │  │   └── RDS Standby    │  │   └── ElastiCache││
│  │   └── No internet    │  │   └── No internet    │  │   └── No internet││
│  └──────────────────────┘  └──────────────────────┘  └──────────────────┘│
│                                                                         │
│  VPC Endpoints: S3 (Gateway), ECR, STS, Secrets Manager, KMS, Logs     │
└─────────────────────────────────────────────────────────────────────────┘
  • 3 AZs, 9 subnets — public (ALB, NAT), private (EKS nodes, pods), isolated (data tier — no internet route)
  • VPC Flow Logs → S3 with Athena queries for security forensics
  • VPC Endpoints for all AWS API calls — pods never leave the VPC to talk to AWS services
  • Custom NACLs on isolated subnets — only allow ingress from private subnet CIDRs on database ports
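The NACL rule described above could be expressed like this inside the vpc module — a hedged sketch; the variable and resource names are illustrative:

```hcl
# Isolated-subnet NACL — allow PostgreSQL ingress only from private subnets
resource "aws_network_acl_rule" "db_ingress_private" {
  count = length(var.private_subnet_cidrs)   # illustrative variable

  network_acl_id = aws_network_acl.isolated.id
  rule_number    = 100 + count.index
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = var.private_subnet_cidrs[count.index]
  from_port      = 5432
  to_port        = 5432
}
```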

Kubernetes Platform

EKS Cluster Configuration

  • EKS 1.29 with envelope encryption (KMS), OIDC provider for IRSA, private API endpoint + public with CIDR allow-list
  • VPC CNI in custom networking mode — pods get IPs from dedicated pod CIDRs, not node subnets
  • EBS CSI Driver with IRSA for dynamic PVC provisioning (gp3, encrypted)
  • CoreDNS with NodeLocal DNSCache for low-latency service discovery
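The cluster settings above map to a few key arguments on `aws_eks_cluster` — a sketch of the shape, not the module's exact code:

```hcl
resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = "1.29"

  # Envelope encryption: Kubernetes Secrets are encrypted with a KMS CMK
  encryption_config {
    provider {
      key_arn = var.kms_key_arn
    }
    resources = ["secrets"]
  }

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
    public_access_cidrs     = var.allowed_cidrs   # CIDR allow-list for the public endpoint
  }
}
```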

Node Strategy — Karpenter

# Karpenter NodePool — general workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.xlarge", "m6i.2xlarge", "m7i.xlarge", "m7i.2xlarge",
                   "m6a.xlarge", "m6a.2xlarge", "c6i.xlarge", "c6i.2xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
      expireAfter: 720h
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 60s
  limits:
    cpu: "200"
    memory: 800Gi
  weight: 50

---
# Karpenter NodePool — GPU for ML anomaly detection (spot only)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge", "g4dn.xlarge"]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  limits:
    cpu: "32"
    memory: 128Gi

Security Hardening

  • OPA Gatekeeper with custom constraint templates:
    • Deny containers running as root
    • Require resource limits on all pods
    • Deny hostPath volumes and hostNetwork
    • Enforce approved container registries (ECR only)
    • Require pod security labels
# OPA Constraint: Deny containers without resource limits
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: require-resource-limits
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
    namespaces: ["fintrack"]
  parameters:
    requiredResources:
      - limits
      - requests
  • Network Policies isolating namespaces — fintrack pods can only talk to fintrack pods + kube-dns + ingress
  • Trivy Operator scanning images on admission, blocking Critical/High CVEs
  • Pod Security Standards enforced at namespace level (restricted profile)
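The namespace isolation can be sketched as a default-deny NetworkPolicy with explicit allows — a minimal sketch; ALB target-group health checks and the exact label selectors would need adjusting to the real setup:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fintrack-default-isolation
  namespace: fintrack
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector: {}      # allow fintrack -> fintrack only
  egress:
    - to:
        - podSelector: {}      # same-namespace traffic
    - to:                      # kube-dns in kube-system
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```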

Custom Helm Chart

Single reusable Helm chart (fintrack-service-chart) deployed 6 times with per-service value overrides:

# values-reconciliation-service.yaml
replicaCount: 3

image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/reconciliation-service
  tag: "main-487"

serviceAccount:
  create: true
  name: reconciliation-service
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::role/fintrack-reconciliation-service

env:
  - name: DB_HOST
    valueFrom:
      secretKeyRef:
        name: rds-credentials
        key: host
  - name: AWS_REGION
    value: us-east-1
  - name: SQS_QUEUE_URL
    valueFrom:
      secretKeyRef:
        name: sqs-config
        key: transactions-queue-url

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: External
      external:
        metric:
          name: sqs_queue_depth
          selector:
            matchLabels:
              queue: transactions
        target:
          type: AverageValue
          averageValue: "100"

podDisruptionBudget:
  minAvailable: 2

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule

External Secrets Operator

Syncing secrets from AWS Secrets Manager into Kubernetes using IRSA:

# ClusterSecretStore — authenticates to Secrets Manager via IRSA
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
            namespace: external-secrets

---
# ExternalSecret — syncs RDS credentials
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: rds-credentials
  namespace: fintrack
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: rds-credentials
    creationPolicy: Owner
  data:
    - secretKey: host
      remoteRef:
        key: fintrack/production/rds
        property: host
    - secretKey: port
      remoteRef:
        key: fintrack/production/rds
        property: port
    - secretKey: username
      remoteRef:
        key: fintrack/production/rds
        property: username
    - secretKey: password
      remoteRef:
        key: fintrack/production/rds
        property: password

GitOps (ArgoCD)

App-of-Apps Pattern

ArgoCD is bootstrapped using the app-of-apps pattern — a root Application pointing to a directory that contains Application manifests for every service and cluster add-on:

gitops-repo/
├── apps/                          # Root app-of-apps directory
│   ├── reconciliation-service.yaml
│   ├── reporting-service.yaml
│   ├── notification-service.yaml
│   ├── auth-service.yaml
│   ├── transaction-worker.yaml
│   ├── ml-anomaly-detector.yaml
│   └── addons/
│       ├── external-secrets.yaml
│       ├── cert-manager.yaml
│       ├── karpenter.yaml
│       ├── aws-lb-controller.yaml
│       ├── opa-gatekeeper.yaml
│       ├── trivy-operator.yaml
│       ├── metrics-server.yaml
│       ├── prometheus-stack.yaml
│       ├── loki-stack.yaml
│       └── fluent-bit.yaml
├── services/                      # Helm value overrides per service
│   ├── reconciliation-service/
│   │   ├── Chart.yaml
│   │   └── values.yaml
│   └── ...
└── addons/                        # Add-on Helm value overrides
    ├── external-secrets/
    ├── cert-manager/
    └── ...
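The root Application that bootstraps this tree might look like the following — a sketch; it mirrors the repo and project names used for the service Applications:

```yaml
# Root "app of apps" — points ArgoCD at the apps/ directory
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: fintrack
  source:
    repoURL: git@github.com:fintrack/gitops.git
    targetRevision: main
    path: apps
    directory:
      recurse: true            # picks up apps/addons/ as well
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```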

ArgoCD Image Updater

Automatically detects new images in ECR and updates the GitOps repo:

# ArgoCD Application with Image Updater annotations
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: reconciliation-service
  namespace: argocd
  annotations:
    argocd-image-updater.argoproj.io/image-list: >-
      main=123456789012.dkr.ecr.us-east-1.amazonaws.com/reconciliation-service
    argocd-image-updater.argoproj.io/main.update-strategy: newest-build
    argocd-image-updater.argoproj.io/main.allow-tags: "regexp:^main-\\d+$"
    argocd-image-updater.argoproj.io/write-back-method: git
    argocd-image-updater.argoproj.io/git-branch: main
spec:
  project: fintrack
  source:
    repoURL: git@github.com:fintrack/gitops.git
    targetRevision: main
    path: services/reconciliation-service
    helm:
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: fintrack
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    retry:
      limit: 3
      backoff:
        duration: 30s
        maxDuration: 3m0s
        factor: 2

CI/CD Pipelines (GitHub Actions)

Backend Pipeline — 5 Stages

A 480-line reusable workflow called per service via matrix strategy:

# .github/workflows/backend-ci.yml (simplified)
name: Backend CI/CD

on:
  push:
    branches: [main]
    paths: ["services/**"]

permissions:
  id-token: write    # OIDC for AWS auth
  contents: read
  security-events: write

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      services: ${{ steps.filter.outputs.changes }}
    steps:
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            reconciliation-service: services/reconciliation-service/**
            reporting-service: services/reporting-service/**
            notification-service: services/notification-service/**
            auth-service: services/auth-service/**

  build-and-push:
    needs: detect-changes
    if: needs.detect-changes.outputs.services != '[]'
    strategy:
      matrix:
        service: ${{ fromJson(needs.detect-changes.outputs.services) }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::role/github-actions-ci
          aws-region: us-east-1

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Run tests
        run: |
          cd services/${{ matrix.service }}
          npm ci && npm run lint && npm run test:coverage

      - name: Trivy vulnerability scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          scan-ref: services/${{ matrix.service }}
          severity: CRITICAL,HIGH
          exit-code: 1

      - name: Build and push to ECR
        uses: docker/build-push-action@v6
        with:
          context: services/${{ matrix.service }}
          push: true
          # Single immutable tag per build — ECR tag immutability rejects re-pushed mutable tags like :latest
          tags: ${{ env.ECR_REGISTRY }}/${{ matrix.service }}:main-${{ github.run_number }}

  db-migration:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::role/github-actions-migration
          aws-region: us-east-1

      - name: Run Prisma migrations
        run: npx prisma migrate deploy

      - name: Rollback on failure
        if: failure()
        run: |
          npx prisma migrate resolve --rolled-back \
            $(npx prisma migrate status --json | jq -r '.migrations[-1].name')

  verify-deployment:
    needs: db-migration
    runs-on: ubuntu-latest
    steps:
      - name: Wait for ArgoCD sync
        run: |
          for i in $(seq 1 30); do
            HEALTH=$(argocd app get $SERVICE -o json | jq -r '.status.health.status')
            if [[ "$HEALTH" == "Healthy" ]]; then
              echo "✅ $SERVICE is synced and healthy"
              exit 0
            fi
            echo "⏳ Waiting... ($HEALTH)"
            sleep 10
          done
          echo "❌ Timeout" && exit 1

      - name: Smoke test
        run: |
          HTTP=$(curl -s -o /dev/null -w "%{http_code}" https://api.fintrack.io/health)
          [[ "$HTTP" == "200" ]] && echo "✅ Healthy" || (echo "❌ HTTP $HTTP" && exit 1)

Frontend Pipeline

React SPA build → S3 deployment → CloudFront cache invalidation:

- name: Deploy to S3
  run: |
    aws s3 sync dist/ s3://fintrack-frontend-prod \
      --delete \
      --cache-control "public, max-age=31536000, immutable" \
      --exclude "index.html" --exclude "*.json"

    aws s3 cp dist/index.html s3://fintrack-frontend-prod/index.html \
      --cache-control "no-cache, no-store, must-revalidate"

- name: Invalidate CloudFront cache
  run: |
    aws cloudfront create-invalidation \
      --distribution-id ${{ vars.CF_DISTRIBUTION_ID }} \
      --paths "/index.html" "/asset-manifest.json"

GitHub Actions → AWS — Zero Credentials

GitHub Actions authenticates via OIDC federation. No AWS access keys stored in GitHub Secrets:

# Terraform: GitHub OIDC provider + CI role
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.github.certificates[0].sha1_fingerprint]
}

resource "aws_iam_role" "github_actions_ci" {
  name = "github-actions-ci"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.github.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:fintrack/*:ref:refs/heads/main"
        }
      }
    }]
  })
}

Observability

Metrics — Prometheus + Grafana

  • kube-prometheus-stack deployed via ArgoCD — 15-day retention, 50GB PVC (gp3)
  • ServiceMonitor CRDs for every fintrack service
  • 12 custom Grafana dashboards — SLA overview, per-service latency, error rates, SQS queue depth, pod resource usage, node cost breakdown
  • 35+ PrometheusRule alert conditions
# Custom PrometheusRule — transaction processing SLA
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fintrack-sla-alerts
  namespace: monitoring
spec:
  groups:
    - name: fintrack.sla
      interval: 30s
      rules:
        - alert: HighTransactionLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="reconciliation-service",
                route="/api/reconcile"
              }[5m])) by (le)
            ) > 2.0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "P99 reconciliation latency exceeds 2s SLA"

        - alert: TransactionQueueBacklog
          expr: |
            aws_sqs_approximate_number_of_messages_visible{
              queue_name="fintrack-transactions"
            } > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Transaction queue backlog exceeds 10k messages"

        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="fintrack"
            }[1h]) > 5
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.pod }} restarted {{ $value }} times in 1h"
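A ServiceMonitor for one of the services might look like this — a sketch; the named port and label selector are assumptions:

```yaml
# ServiceMonitor — tells Prometheus Operator how to scrape one service
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: reconciliation-service
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames: [fintrack]
  selector:
    matchLabels:
      app.kubernetes.io/name: reconciliation-service
  endpoints:
    - port: http             # named port on the Service
      path: /metrics
      interval: 30s
```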

Logging — Loki + Fluent Bit

  • Fluent Bit DaemonSet collecting container logs from all nodes, enriching with Kubernetes metadata, shipping to Loki
  • Loki with S3 backend for long-term storage (90-day retention), compactor for cost optimization
  • Grafana unified view — logs + metrics on the same dashboard, correlated by trace ID
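The enrichment-and-ship path can be sketched in Fluent Bit's classic config syntax — illustrative host and label values, not the production config:

```ini
# fluent-bit.conf excerpt — Kubernetes metadata enrichment + Loki output
[FILTER]
    Name       kubernetes
    Match      kube.*
    Merge_Log  On
    Keep_Log   Off

[OUTPUT]
    Name       loki
    Match      kube.*
    Host       loki-gateway.monitoring.svc   # illustrative service address
    Port       3100
    Labels     job=fluent-bit, namespace=$kubernetes['namespace_name']
```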

Alerting

  • Alertmanager routing tree: critical → PagerDuty + Slack #incidents, warning → Slack #platform-alerts, info → silenced outside business hours
  • CloudWatch Alarms for infrastructure-level alerts — RDS CPU, ElastiCache memory, NAT Gateway error count
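The routing tree described above might be configured like this — a simplified sketch; receiver names, channels, and the business-hours interval are assumptions:

```yaml
# alertmanager.yml — routing tree sketch
route:
  receiver: slack-platform-alerts          # default for warning and unmatched
  group_by: [alertname, namespace]
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-incidents        # pages on-call and posts to #incidents
    - matchers:
        - severity = "info"
      receiver: slack-platform-alerts
      mute_time_intervals: [outside-business-hours]

receivers:
  - name: pagerduty-incidents
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # delivered via External Secrets
    slack_configs:
      - channel: "#incidents"
  - name: slack-platform-alerts
    slack_configs:
      - channel: "#platform-alerts"

time_intervals:
  - name: outside-business-hours
    time_intervals:
      - weekdays: ["saturday", "sunday"]   # simplified; real config also covers nights
```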

Key Achievements

  • 99.97% uptime over 18 months — single P1 incident (RDS failover, resolved in 8 minutes)
  • Zero static AWS credentials — full IRSA + OIDC federation for pods and CI/CD
  • ~2M transactions/day processed with P99 latency under 800ms
  • 60% compute cost reduction via Karpenter spot instances + consolidation policies
  • SOC 2 compliant — full audit trail, encryption at rest + in transit, least-privilege IAM
  • Mean deployment time: 4 minutes from merge to production (GitHub Actions → ArgoCD Image Updater)
  • 12 Terraform modules, all versioned, tested, and reusable across dev/staging/prod environments

Need something similar?

Let's discuss how I can build this kind of infrastructure for your team.