B2B Fintech SaaS

AWS EKS Production Platform — FinTrack

Designed and built a production-grade EKS platform for a B2B fintech SaaS serving 200+ enterprise clients — 12 Terraform modules, IRSA-based zero-trust identity, ArgoCD GitOps, and a full observability stack on AWS.

Terraform AWS EKS Kubernetes ArgoCD Helm GitHub Actions ECR IRSA Karpenter ALB AWS WAF CloudFront RDS PostgreSQL ElastiCache Redis Secrets Manager KMS CloudWatch Prometheus Grafana Loki Fluent Bit cert-manager External Secrets Operator Trivy OPA Gatekeeper Node.js Python React

Overview

Designed, built, and operated the entire AWS cloud infrastructure and Kubernetes platform for FinTrack — a B2B fintech SaaS that provides real-time financial reconciliation and reporting to 200+ enterprise clients. The platform runs 6 microservices (4 Node.js/TypeScript, 2 Python/FastAPI), a React SPA, and a suite of async workers processing ~2M financial transactions per day.

As the sole DevOps and platform engineer, I owned every layer — from VPC design and IAM policy authoring to Kubernetes pod security and incident response alerting. The platform achieved 99.97% uptime over 18 months with zero data breaches and full SOC 2 audit trail compliance.


Architecture

Terraform Layered State Architecture

Modeled after the same layered pattern I use across engagements — each layer has its own state file in S3, with cross-layer data sharing via terraform_remote_state:

networking/   → VPC, Subnets, NAT Gateways, Transit Gateway, Route Tables, NACLs
  ↓ remote_state
security/     → KMS Keys, IAM Roles, OIDC Provider, GuardDuty, Security Hub
  ↓ remote_state
data/         → RDS PostgreSQL, ElastiCache Redis, S3 Buckets, DynamoDB Tables
  ↓ remote_state
compute/      → EKS Cluster, ECR, Node Groups, Karpenter Provisioners
  ↓ remote_state
platform/     → ArgoCD Bootstrap, External Secrets, cert-manager, Ingress
  ↓ remote_state
observability/→ CloudWatch, Prometheus Stack, Loki, Grafana, Alertmanager
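Cross-layer wiring is one data source per upstream layer — a minimal sketch, assuming an S3 state bucket named `fintrack-terraform-state` (the real bucket and key names may differ):

```hcl
# compute/ layer reading outputs published by networking/
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "fintrack-terraform-state"      # illustrative bucket name
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

module "eks" {
  source = "../modules/eks"

  vpc_id     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

Because each layer only consumes the previous layer's declared outputs, layers can be planned and applied independently with a small blast radius.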

CI/CD & GitOps Flow

Developer Push → GitHub Actions (Lint + Test + SAST + Build)

   ECR (Container Registry — immutable tags)

   ArgoCD Image Updater (polls ECR for new tags)

   ArgoCD Application Controller (syncs desired state)

   EKS Cluster (us-east-1)
   ├── reconciliation-service (3 replicas)
   ├── reporting-service (2 replicas)
   ├── notification-service (2 replicas)
   ├── auth-service (2 replicas)
   ├── transaction-worker (4 replicas, Karpenter spot)
   ├── ml-anomaly-detector (2 replicas, GPU spot)
   ├── AWS ALB Ingress (HTTPS + WAF)
   ├── External Secrets (← Secrets Manager)
   ├── cert-manager (ACM + Let's Encrypt)
   ├── OPA Gatekeeper (policy enforcement)
   ├── Loki + Fluent Bit (Logging)
   └── Prometheus + Grafana (Monitoring)

   CloudFront CDN → React SPA (S3 Origin)

   Enterprise Clients (Web + API)

Infrastructure (Terraform)

12 Custom Terraform Modules

Every AWS resource is provisioned through reusable, versioned Terraform modules stored in a dedicated terraform-modules mono-repo. Each module ships with variable validation, documented outputs, and a README.md with usage examples.
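As an illustration of the validation style — a sketch, with variable names that are assumptions rather than the modules' actual interface:

```hcl
# modules/vpc/variables.tf — illustrative validation blocks
variable "az_count" {
  type        = number
  description = "Number of availability zones to span"
  default     = 3

  validation {
    condition     = var.az_count >= 2 && var.az_count <= 4
    error_message = "az_count must be between 2 and 4 for Multi-AZ resilience."
  }
}

variable "vpc_cidr" {
  type        = string
  description = "IPv4 CIDR block for the VPC"

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "vpc_cidr must be a valid IPv4 CIDR block."
  }
}
```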

Module | Lines | Description
vpc | 340 | Multi-AZ VPC with public, private, and isolated subnets; NAT Gateways (one per AZ); VPC Flow Logs to S3; custom NACLs for data-tier isolation
eks | 520 | EKS cluster with OIDC provider, envelope encryption (KMS), managed + Karpenter node groups, cluster add-ons (CoreDNS, kube-proxy, VPC CNI, EBS CSI), IRSA trust policies
ecr | 85 | ECR repositories with immutable tags, image scanning on push, lifecycle policies (keep last 20 tagged, delete untagged after 7d)
rds | 290 | RDS PostgreSQL 16 Multi-AZ with automated backups (35-day retention), Performance Insights, IAM authentication, custom parameter group, private subnet group
elasticache | 165 | ElastiCache Redis 7 cluster-mode with 3 shards, 2 replicas per shard, encryption at rest (KMS) + in transit (TLS), automatic failover
kms | 95 | KMS keys with automatic rotation, key policies for EKS envelope encryption, RDS encryption, S3 SSE, Secrets Manager encryption
iam-irsa | 380 | IRSA role factory — generates per-service IAM roles with OIDC trust policies, scoped to specific Kubernetes service accounts and namespaces
secrets-manager | 120 | Secrets with automatic rotation via Lambda, resource policies, cross-account sharing for DR
cloudfront-spa | 210 | CloudFront distribution with S3 origin, OAC, custom error responses (SPA routing), WAF WebACL association, Response Security Headers policy
waf | 175 | WAF WebACL with AWS managed rules (AWSManagedRulesCommonRuleSet, SQLi, XSS, Bot Control), custom rate limiting (2000 req/5min per IP), geo-restriction
s3 | 130 | Bucket with versioning, SSE-KMS, bucket policies (deny non-TLS), lifecycle rules, replication to DR region
observability | 240 | CloudWatch log groups, metric filters, composite alarms, SNS topics, Chatbot integration for Slack

Zero Static Credentials — IRSA Throughout

Every pod runs with a dedicated IAM role. No AWS access keys exist anywhere — not in Secrets Manager, not in environment variables, nowhere.

# IRSA Role Factory — per-service IAM role with OIDC trust
module "irsa_reconciliation_service" {
  source = "../../modules/iam-irsa"

  role_name   = "fintrack-reconciliation-service"
  namespace   = "fintrack"
  sa_name     = "reconciliation-service"
  oidc_issuer = data.terraform_remote_state.compute.outputs.oidc_issuer
  oidc_arn    = data.terraform_remote_state.compute.outputs.oidc_provider_arn

  policy_arns = [
    aws_iam_policy.rds_connect.arn,
    aws_iam_policy.s3_reports_bucket.arn,
    aws_iam_policy.sqs_transactions_queue.arn,
    aws_iam_policy.secrets_read.arn,
  ]
}

# Trust policy generated by the module:
# {
#   "Effect": "Allow",
#   "Principal": {
#     "Federated": "arn:aws:iam::oidc-provider/oidc.eks.us-east-1..."
#   },
#   "Action": "sts:AssumeRoleWithWebIdentity",
#   "Condition": {
#     "StringEquals": {
#       "oidc.eks....:sub": "system:serviceaccount:fintrack:reconciliation-service",
#       "oidc.eks....:aud": "sts.amazonaws.com"
#     }
#   }
# }

IAM Policy Scoping

Every IAM policy follows least-privilege with resource-level and condition-based restrictions:

# S3 policy — scoped to specific bucket + prefix + encryption requirement
resource "aws_iam_policy" "s3_reports_bucket" {
  name = "fintrack-s3-reports-access"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "AllowReportsBucketAccess"
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
        Resource = [
          module.s3_reports.bucket_arn,
          "${module.s3_reports.bucket_arn}/*"
        ]
        Condition = {
          StringEquals = {
            "s3:x-amz-server-side-encryption" = "aws:kms"
            "s3:x-amz-server-side-encryption-aws-kms-key-id" = module.kms.reports_key_arn
          }
        }
      },
      {
        Sid      = "AllowKMSDecrypt"
        Effect   = "Allow"
        Action   = ["kms:Decrypt", "kms:GenerateDataKey"]
        Resource = [module.kms.reports_key_arn]
      }
    ]
  })
}

# RDS IAM Authentication — no password, token-based auth
resource "aws_iam_policy" "rds_connect" {
  name = "fintrack-rds-iam-connect"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "rds-db:connect"
      Resource = "arn:aws:rds-db:${var.region}:${data.aws_caller_identity.current.account_id}:dbuser:${module.rds.cluster_resource_id}/fintrack_app"
    }]
  })
}

VPC Architecture

┌──────────────────────────── VPC 10.0.0.0/16 ────────────────────────────┐
│                                                                         │
│  ┌─── AZ us-east-1a ───┐  ┌─── AZ us-east-1b ───┐  ┌─── AZ us-east-1c ┐│
│  │                      │  │                      │  │                   ││
│  │ Public  10.0.1.0/24  │  │ Public  10.0.2.0/24  │  │ Public 10.0.3.0  ││
│  │   └── NAT Gateway    │  │   └── NAT Gateway    │  │   └── NAT GW     ││
│  │   └── ALB ENIs       │  │   └── ALB ENIs       │  │   └── ALB ENIs   ││
│  │                      │  │                      │  │                   ││
│  │ Private 10.0.11.0/24 │  │ Private 10.0.12.0/24 │  │ Private 10.0.13  ││
│  │   └── EKS Nodes      │  │   └── EKS Nodes      │  │   └── EKS Nodes  ││
│  │   └── Pods (VPC CNI) │  │   └── Pods (VPC CNI) │  │   └── Pods       ││
│  │                      │  │                      │  │                   ││
│  │ Isolated 10.0.21/24  │  │ Isolated 10.0.22/24  │  │ Isolated 10.0.23 ││
│  │   └── RDS Primary    │  │   └── RDS Standby    │  │   └── ElastiCache││
│  │   └── No internet    │  │   └── No internet    │  │   └── No internet││
│  └──────────────────────┘  └──────────────────────┘  └──────────────────┘│
│                                                                         │
│  VPC Endpoints: S3 (Gateway), ECR, STS, Secrets Manager, KMS, Logs     │
└─────────────────────────────────────────────────────────────────────────┘
  • 3 AZs, 9 subnets — public (ALB, NAT), private (EKS nodes, pods), isolated (data tier — no internet route)
  • VPC Flow Logs → S3 with Athena queries for security forensics
  • VPC Endpoints for all AWS API calls — pods never leave the VPC to talk to AWS services
  • Custom NACLs on isolated subnets — only allow ingress from private subnet CIDRs on database ports
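The NACL rule described above could be expressed like this inside the vpc module — a hedged sketch; the variable and resource names are illustrative:

```hcl
# Isolated-subnet NACL — allow PostgreSQL ingress only from private subnets
resource "aws_network_acl_rule" "db_ingress_private" {
  count = length(var.private_subnet_cidrs)   # illustrative variable

  network_acl_id = aws_network_acl.isolated.id
  rule_number    = 100 + count.index
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = var.private_subnet_cidrs[count.index]
  from_port      = 5432
  to_port        = 5432
}
```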

Kubernetes Platform

EKS Cluster Configuration

  • EKS 1.29 with envelope encryption (KMS), OIDC provider for IRSA, private API endpoint + public with CIDR allow-list
  • VPC CNI in custom networking mode — pods get IPs from dedicated pod CIDRs, not node subnets
  • EBS CSI Driver with IRSA for dynamic PVC provisioning (gp3, encrypted)
  • CoreDNS with NodeLocal DNSCache for low-latency service discovery
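The cluster settings above map to a few key arguments on `aws_eks_cluster` — a sketch of the shape, not the module's exact code:

```hcl
resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  role_arn = aws_iam_role.cluster.arn
  version  = "1.29"

  # Envelope encryption: Kubernetes Secrets are encrypted with a KMS CMK
  encryption_config {
    provider {
      key_arn = var.kms_key_arn
    }
    resources = ["secrets"]
  }

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = true
    public_access_cidrs     = var.allowed_cidrs   # CIDR allow-list for the public endpoint
  }
}
```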

Node Strategy — Karpenter

# Karpenter NodePool — general workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.xlarge", "m6i.2xlarge", "m7i.xlarge", "m7i.2xlarge",
                   "m6a.xlarge", "m6a.2xlarge", "c6i.xlarge", "c6i.2xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
      expireAfter: 720h
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 60s
  limits:
    cpu: "200"
    memory: 800Gi
  weight: 50

---
# Karpenter NodePool — GPU for ML anomaly detection (spot only)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge", "g4dn.xlarge"]
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  limits:
    cpu: "32"
    memory: 128Gi

Security Hardening

  • OPA Gatekeeper with custom constraint templates:
    • Deny containers running as root
    • Require resource limits on all pods
    • Deny hostPath volumes and hostNetwork
    • Enforce approved container registries (ECR only)
    • Require pod security labels
# OPA Constraint: Deny containers without resource limits
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: require-resource-limits
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
    namespaces: ["fintrack"]
  parameters:
    requiredResources:
      - limits
      - requests
  • Network Policies isolating namespaces — fintrack pods can only talk to fintrack pods + kube-dns + ingress
  • Trivy Operator scanning images on admission, blocking Critical/High CVEs
  • Pod Security Standards enforced at namespace level (restricted profile)
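The namespace isolation can be sketched as a default-deny NetworkPolicy with explicit allows — a minimal sketch; ALB target-group health checks and the exact label selectors would need adjusting to the real setup:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fintrack-default-isolation
  namespace: fintrack
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector: {}      # allow fintrack -> fintrack only
  egress:
    - to:
        - podSelector: {}      # same-namespace traffic
    - to:                      # kube-dns in kube-system
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```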

Custom Helm Chart

Single reusable Helm chart (fintrack-service-chart) deployed 6 times with per-service value overrides:

# values-reconciliation-service.yaml
replicaCount: 3

image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/reconciliation-service
  tag: "main-487"

serviceAccount:
  create: true
  name: reconciliation-service
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::role/fintrack-reconciliation-service

env:
  - name: DB_HOST
    valueFrom:
      secretKeyRef:
        name: rds-credentials
        key: host
  - name: AWS_REGION
    value: us-east-1
  - name: SQS_QUEUE_URL
    valueFrom:
      secretKeyRef:
        name: sqs-config
        key: transactions-queue-url

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
    - type: External
      external:
        metric:
          name: sqs_queue_depth
          selector:
            matchLabels:
              queue: transactions
        target:
          type: AverageValue
          averageValue: "100"

podDisruptionBudget:
  minAvailable: 2

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule

External Secrets Operator

Syncing secrets from AWS Secrets Manager into Kubernetes using IRSA:

# ClusterSecretStore — authenticates to Secrets Manager via IRSA
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
            namespace: external-secrets

---
# ExternalSecret — syncs RDS credentials
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: rds-credentials
  namespace: fintrack
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: rds-credentials
    creationPolicy: Owner
  data:
    - secretKey: host
      remoteRef:
        key: fintrack/production/rds
        property: host
    - secretKey: port
      remoteRef:
        key: fintrack/production/rds
        property: port
    - secretKey: username
      remoteRef:
        key: fintrack/production/rds
        property: username
    - secretKey: password
      remoteRef:
        key: fintrack/production/rds
        property: password

GitOps (ArgoCD)

App-of-Apps Pattern

ArgoCD is bootstrapped using the app-of-apps pattern — a root Application pointing to a directory that contains Application manifests for every service and cluster add-on:

gitops-repo/
├── apps/                          # Root app-of-apps directory
│   ├── reconciliation-service.yaml
│   ├── reporting-service.yaml
│   ├── notification-service.yaml
│   ├── auth-service.yaml
│   ├── transaction-worker.yaml
│   ├── ml-anomaly-detector.yaml
│   └── addons/
│       ├── external-secrets.yaml
│       ├── cert-manager.yaml
│       ├── karpenter.yaml
│       ├── aws-lb-controller.yaml
│       ├── opa-gatekeeper.yaml
│       ├── trivy-operator.yaml
│       ├── metrics-server.yaml
│       ├── prometheus-stack.yaml
│       ├── loki-stack.yaml
│       └── fluent-bit.yaml
├── services/                      # Helm value overrides per service
│   ├── reconciliation-service/
│   │   ├── Chart.yaml
│   │   └── values.yaml
│   └── ...
└── addons/                        # Add-on Helm value overrides
    ├── external-secrets/
    ├── cert-manager/
    └── ...
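The root Application that bootstraps this tree might look like the following — a sketch; it mirrors the repo and project names used for the service Applications:

```yaml
# Root "app of apps" — points ArgoCD at the apps/ directory
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: fintrack
  source:
    repoURL: git@github.com:fintrack/gitops.git
    targetRevision: main
    path: apps
    directory:
      recurse: true            # picks up apps/addons/ as well
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```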

ArgoCD Image Updater

Automatically detects new images in ECR and updates the GitOps repo:

# ArgoCD Application with Image Updater annotations
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: reconciliation-service
  namespace: argocd
  annotations:
    argocd-image-updater.argoproj.io/image-list: >-
      main=123456789012.dkr.ecr.us-east-1.amazonaws.com/reconciliation-service
    argocd-image-updater.argoproj.io/main.update-strategy: newest-build
    argocd-image-updater.argoproj.io/main.allow-tags: "regexp:^main-\\d+$"
    argocd-image-updater.argoproj.io/write-back-method: git
    argocd-image-updater.argoproj.io/git-branch: main
spec:
  project: fintrack
  source:
    repoURL: git@github.com:fintrack/gitops.git
    targetRevision: main
    path: services/reconciliation-service
    helm:
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: fintrack
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    retry:
      limit: 3
      backoff:
        duration: 30s
        maxDuration: 3m0s
        factor: 2

CI/CD Pipelines (GitHub Actions)

Backend Pipeline — 5 Stages

A 480-line reusable workflow called per service via matrix strategy:

# .github/workflows/backend-ci.yml (simplified)
name: Backend CI/CD

on:
  push:
    branches: [main]
    paths: ["services/**"]

permissions:
  id-token: write    # OIDC for AWS auth
  contents: read
  security-events: write

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      services: ${{ steps.filter.outputs.changes }}
    steps:
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            reconciliation-service: services/reconciliation-service/**
            reporting-service: services/reporting-service/**
            notification-service: services/notification-service/**
            auth-service: services/auth-service/**

  build-and-push:
    needs: detect-changes
    if: needs.detect-changes.outputs.services != '[]'
    strategy:
      matrix:
        service: ${{ fromJson(needs.detect-changes.outputs.services) }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::role/github-actions-ci
          aws-region: us-east-1

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Run tests
        run: |
          cd services/${{ matrix.service }}
          npm ci && npm run lint && npm run test:coverage

      - name: Trivy vulnerability scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs
          scan-ref: services/${{ matrix.service }}
          severity: CRITICAL,HIGH
          exit-code: 1

      - name: Build and push to ECR
        uses: docker/build-push-action@v6
        with:
          context: services/${{ matrix.service }}
          push: true
          # Single immutable tag per build — ECR tag immutability rejects re-pushed mutable tags like :latest
          tags: ${{ env.ECR_REGISTRY }}/${{ matrix.service }}:main-${{ github.run_number }}

  db-migration:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::role/github-actions-migration
          aws-region: us-east-1

      - name: Run Prisma migrations
        run: npx prisma migrate deploy

      - name: Rollback on failure
        if: failure()
        run: |
          npx prisma migrate resolve --rolled-back \
            $(npx prisma migrate status --json | jq -r '.migrations[-1].name')

  verify-deployment:
    needs: db-migration
    runs-on: ubuntu-latest
    steps:
      - name: Wait for ArgoCD sync
        run: |
          for i in $(seq 1 30); do
            HEALTH=$(argocd app get $SERVICE -o json | jq -r '.status.health.status')
            if [[ "$HEALTH" == "Healthy" ]]; then
              echo "✅ $SERVICE is synced and healthy"
              exit 0
            fi
            echo "⏳ Waiting... ($HEALTH)"
            sleep 10
          done
          echo "❌ Timeout" && exit 1

      - name: Smoke test
        run: |
          HTTP=$(curl -s -o /dev/null -w "%{http_code}" https://api.fintrack.io/health)
          [[ "$HTTP" == "200" ]] && echo "✅ Healthy" || (echo "❌ HTTP $HTTP" && exit 1)

Frontend Pipeline

React SPA build → S3 deployment → CloudFront cache invalidation:

- name: Deploy to S3
  run: |
    aws s3 sync dist/ s3://fintrack-frontend-prod \
      --delete \
      --cache-control "public, max-age=31536000, immutable" \
      --exclude "index.html" --exclude "*.json"

    aws s3 cp dist/index.html s3://fintrack-frontend-prod/index.html \
      --cache-control "no-cache, no-store, must-revalidate"

- name: Invalidate CloudFront cache
  run: |
    aws cloudfront create-invalidation \
      --distribution-id ${{ vars.CF_DISTRIBUTION_ID }} \
      --paths "/index.html" "/asset-manifest.json"

GitHub Actions → AWS — Zero Credentials

GitHub Actions authenticates via OIDC federation. No AWS access keys stored in GitHub Secrets:

# Terraform: GitHub OIDC provider + CI role
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.github.certificates[0].sha1_fingerprint]
}

resource "aws_iam_role" "github_actions_ci" {
  name = "github-actions-ci"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.github.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:fintrack/*:ref:refs/heads/main"
        }
      }
    }]
  })
}

Observability

Metrics — Prometheus + Grafana

  • kube-prometheus-stack deployed via ArgoCD — 15-day retention, 50GB PVC (gp3)
  • ServiceMonitor CRDs for every fintrack service
  • 12 custom Grafana dashboards — SLA overview, per-service latency, error rates, SQS queue depth, pod resource usage, node cost breakdown
  • 35+ PrometheusRule alert conditions
# Custom PrometheusRule — transaction processing SLA
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fintrack-sla-alerts
  namespace: monitoring
spec:
  groups:
    - name: fintrack.sla
      interval: 30s
      rules:
        - alert: HighTransactionLatency
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="reconciliation-service",
                route="/api/reconcile"
              }[5m])) by (le)
            ) > 2.0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "P99 reconciliation latency exceeds 2s SLA"

        - alert: TransactionQueueBacklog
          expr: |
            aws_sqs_approximate_number_of_messages_visible{
              queue_name="fintrack-transactions"
            } > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Transaction queue backlog exceeds 10k messages"

        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="fintrack"
            }[1h]) > 5
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.pod }} restarted {{ $value }} times in 1h"
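A ServiceMonitor for one of the services might look like this — a sketch; the named port and label selector are assumptions:

```yaml
# ServiceMonitor — tells Prometheus Operator how to scrape one service
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: reconciliation-service
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames: [fintrack]
  selector:
    matchLabels:
      app.kubernetes.io/name: reconciliation-service
  endpoints:
    - port: http             # named port on the Service
      path: /metrics
      interval: 30s
```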

Logging — Loki + Fluent Bit

  • Fluent Bit DaemonSet collecting container logs from all nodes, enriching with Kubernetes metadata, shipping to Loki
  • Loki with S3 backend for long-term storage (90-day retention), compactor for cost optimization
  • Grafana unified view — logs + metrics on the same dashboard, correlated by trace ID
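The enrichment-and-ship path can be sketched in Fluent Bit's classic config syntax — illustrative host and label values, not the production config:

```ini
# fluent-bit.conf excerpt — Kubernetes metadata enrichment + Loki output
[FILTER]
    Name       kubernetes
    Match      kube.*
    Merge_Log  On
    Keep_Log   Off

[OUTPUT]
    Name       loki
    Match      kube.*
    Host       loki-gateway.monitoring.svc   # illustrative service address
    Port       3100
    Labels     job=fluent-bit, namespace=$kubernetes['namespace_name']
```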

Alerting

  • Alertmanager routing tree: critical → PagerDuty + Slack #incidents, warning → Slack #platform-alerts, info → silenced outside business hours
  • CloudWatch Alarms for infrastructure-level alerts — RDS CPU, ElastiCache memory, NAT Gateway error count
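The routing tree described above might be configured like this — a simplified sketch; receiver names, channels, and the business-hours interval are assumptions:

```yaml
# alertmanager.yml — routing tree sketch
route:
  receiver: slack-platform-alerts          # default for warning and unmatched
  group_by: [alertname, namespace]
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-incidents        # pages on-call and posts to #incidents
    - matchers:
        - severity = "info"
      receiver: slack-platform-alerts
      mute_time_intervals: [outside-business-hours]

receivers:
  - name: pagerduty-incidents
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # delivered via External Secrets
    slack_configs:
      - channel: "#incidents"
  - name: slack-platform-alerts
    slack_configs:
      - channel: "#platform-alerts"

time_intervals:
  - name: outside-business-hours
    time_intervals:
      - weekdays: ["saturday", "sunday"]   # simplified; real config also covers nights
```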

Key Achievements

  • 99.97% uptime over 18 months — single P1 incident (RDS failover, resolved in 8 minutes)
  • Zero static AWS credentials — full IRSA + OIDC federation for pods and CI/CD
  • ~2M transactions/day processed with P99 latency under 800ms
  • 60% compute cost reduction via Karpenter spot instances + consolidation policies
  • SOC 2 compliant — full audit trail, encryption at rest + in transit, least-privilege IAM
  • Mean deployment time: 4 minutes from merge to production (GitHub Actions → ArgoCD Image Updater)
  • 12 Terraform modules, all versioned, tested, and reusable across dev/staging/prod environments

Need something similar?

Let's discuss how I can build this kind of infrastructure for your team.