Overview
Designed, built, and operated the entire AWS cloud infrastructure and Kubernetes platform for FinTrack — a B2B fintech SaaS that provides real-time financial reconciliation and reporting to 200+ enterprise clients. The platform runs 6 microservices (4 Node.js/TypeScript, 2 Python/FastAPI), a React SPA, and a suite of async workers processing ~2M financial transactions per day.
As the sole DevOps and platform engineer, I owned every layer — from VPC design and IAM policy authoring to Kubernetes pod security and incident response alerting. The platform achieved 99.97% uptime over 18 months with zero data breaches and full SOC 2 audit trail compliance.
Architecture
Terraform Layered State Architecture
Modeled after the same layered pattern I use across engagements — each layer has its own state file in S3, with cross-layer data sharing via terraform_remote_state:
networking/ → VPC, Subnets, NAT Gateways, Transit Gateway, Route Tables, NACLs
↓ remote_state
security/ → KMS Keys, IAM Roles, OIDC Provider, GuardDuty, Security Hub
↓ remote_state
data/ → RDS PostgreSQL, ElastiCache Redis, S3 Buckets, DynamoDB Tables
↓ remote_state
compute/ → EKS Cluster, ECR, Node Groups, Karpenter Provisioners
↓ remote_state
platform/ → ArgoCD Bootstrap, External Secrets, cert-manager, Ingress
↓ remote_state
observability/→ CloudWatch, Prometheus Stack, Loki, Grafana, Alertmanager
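The layers are wired together with `terraform_remote_state` data sources. A minimal sketch of the compute layer reading the networking layer's outputs — the state bucket and key names here are assumptions, since the real backend config isn't shown:

```hcl
# compute/ layer — read the networking layer's state from S3.
# Bucket and key names are illustrative.
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "fintrack-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# Downstream resources consume outputs like any other data source
locals {
  private_subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```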
CI/CD & GitOps Flow
Developer Push → GitHub Actions (Lint + Test + SAST + Build)
↓
ECR (Container Registry — immutable tags)
↓
ArgoCD Image Updater (polls ECR for new tags)
↓
ArgoCD Application Controller (syncs desired state)
↓
EKS Cluster (us-east-1)
├── reconciliation-service (3 replicas)
├── reporting-service (2 replicas)
├── notification-service (2 replicas)
├── auth-service (2 replicas)
├── transaction-worker (4 replicas, Karpenter spot)
├── ml-anomaly-detector (2 replicas, GPU spot)
├── AWS ALB Ingress (HTTPS + WAF)
├── External Secrets (← Secrets Manager)
├── cert-manager (ACM + Let's Encrypt)
├── OPA Gatekeeper (policy enforcement)
├── Loki + Fluent Bit (Logging)
└── Prometheus + Grafana (Monitoring)
↓
CloudFront CDN → React SPA (S3 Origin)
↓
Enterprise Clients (Web + API)
Infrastructure (Terraform)
12 Custom Terraform Modules
Every AWS resource is provisioned through reusable, versioned Terraform modules stored in a dedicated terraform-modules mono-repo. Each module has full variable validation, documented outputs, and a README.md with usage examples.
| Module | Lines | Description |
|---|---|---|
| vpc | 340 | Multi-AZ VPC with public, private, and isolated subnets; NAT Gateways (one per AZ); VPC Flow Logs to S3; custom NACLs for data-tier isolation |
| eks | 520 | EKS cluster with OIDC provider, envelope encryption (KMS), managed + Karpenter node groups, cluster add-ons (CoreDNS, kube-proxy, VPC CNI, EBS CSI), IRSA trust policies |
| ecr | 85 | ECR repositories with immutable tags, image scanning on push, lifecycle policies (keep last 20 tagged, delete untagged after 7d) |
| rds | 290 | RDS PostgreSQL 16 Multi-AZ with automated backups (35-day retention), Performance Insights, IAM authentication, custom parameter group, private subnet group |
| elasticache | 165 | ElastiCache Redis 7 cluster-mode with 3 shards, 2 replicas per shard, encryption at rest (KMS) + in transit (TLS), automatic failover |
| kms | 95 | KMS keys with automatic rotation, key policies for EKS envelope encryption, RDS encryption, S3 SSE, Secrets Manager encryption |
| iam-irsa | 380 | IRSA role factory — generates per-service IAM roles with OIDC trust policies, scoped to specific Kubernetes service accounts and namespaces |
| secrets-manager | 120 | Secrets with automatic rotation via Lambda, resource policies, cross-account sharing for DR |
| cloudfront-spa | 210 | CloudFront distribution with S3 origin, OAC, custom error responses (SPA routing), WAF WebACL association, Response Security Headers policy |
| waf | 175 | WAF WebACL with AWS managed rules (AWSManagedRulesCommonRuleSet, SQLi, XSS, Bot Control), custom rate limiting (2000 req/5min per IP), geo-restriction |
| s3 | 130 | Bucket with versioning, SSE-KMS, bucket policies (deny non-TLS), lifecycle rules, replication to DR region |
| observability | 240 | CloudWatch log groups, metric filters, composite alarms, SNS topics, Chatbot integration for Slack |
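A typical module call pins a git tag so environments upgrade deliberately. The repo URL, ref, and variable names below are illustrative, not the module's actual interface:

```hcl
# Consuming the vpc module from the terraform-modules mono-repo,
# pinned to a release tag (URL and variables are assumptions)
module "vpc" {
  source = "git::git@github.com:fintrack/terraform-modules.git//vpc?ref=v1.4.0"

  name       = "fintrack-prod"
  cidr_block = "10.0.0.0/16"
  azs        = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
```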
Zero Static Credentials — IRSA Throughout
Every pod runs with a dedicated IAM role. No AWS access keys exist anywhere — not in Secrets Manager, not in environment variables, nowhere.
# IRSA Role Factory — per-service IAM role with OIDC trust
module "irsa_reconciliation_service" {
source = "../../modules/iam-irsa"
role_name = "fintrack-reconciliation-service"
namespace = "fintrack"
sa_name = "reconciliation-service"
oidc_issuer = data.terraform_remote_state.compute.outputs.oidc_issuer
oidc_arn = data.terraform_remote_state.compute.outputs.oidc_provider_arn
policy_arns = [
aws_iam_policy.rds_connect.arn,
aws_iam_policy.s3_reports_bucket.arn,
aws_iam_policy.sqs_transactions_queue.arn,
aws_iam_policy.secrets_read.arn,
]
}
# Trust policy generated by the module:
# {
# "Effect": "Allow",
# "Principal": {
# "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1..."
# },
# "Action": "sts:AssumeRoleWithWebIdentity",
# "Condition": {
# "StringEquals": {
# "oidc.eks....:sub": "system:serviceaccount:fintrack:reconciliation-service",
# "oidc.eks....:aud": "sts.amazonaws.com"
# }
# }
# }
IAM Policy Scoping
Every IAM policy follows least-privilege with resource-level and condition-based restrictions:
# S3 policy — scoped to specific bucket + prefix + encryption requirement
resource "aws_iam_policy" "s3_reports_bucket" {
name = "fintrack-s3-reports-access"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowReportsBucketAccess"
Effect = "Allow"
Action = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
Resource = [
module.s3_reports.bucket_arn,
"${module.s3_reports.bucket_arn}/*"
]
Condition = {
StringEquals = {
"s3:x-amz-server-side-encryption" = "aws:kms"
"s3:x-amz-server-side-encryption-aws-kms-key-id" = module.kms.reports_key_arn
}
}
},
{
Sid = "AllowKMSDecrypt"
Effect = "Allow"
Action = ["kms:Decrypt", "kms:GenerateDataKey"]
Resource = [module.kms.reports_key_arn]
}
]
})
}
# RDS IAM Authentication — no password, token-based auth
resource "aws_iam_policy" "rds_connect" {
name = "fintrack-rds-iam-connect"
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = "rds-db:connect"
Resource = "arn:aws:rds-db:${var.region}:${data.aws_caller_identity.current.account_id}:dbuser:${module.rds.cluster_resource_id}/fintrack_app"
}]
})
}
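On the client side, `rds-db:connect` lets a service trade its IRSA credentials for a short-lived (15-minute) auth token in place of a password. A sketch — the hostname and database name are placeholders:

```shell
# Generate a 15-minute IAM auth token using the pod's IRSA credentials
# (hostname is a placeholder)
TOKEN=$(aws rds generate-db-auth-token \
  --hostname fintrack-prod.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com \
  --port 5432 \
  --username fintrack_app \
  --region us-east-1)

# IAM auth requires TLS; the token goes where the password would
PGPASSWORD="$TOKEN" psql \
  "host=fintrack-prod.xxxxxxxxxxxx.us-east-1.rds.amazonaws.com \
   port=5432 user=fintrack_app dbname=fintrack sslmode=verify-full"
```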
VPC Architecture
┌──────────────────────────── VPC 10.0.0.0/16 ────────────────────────────┐
│ │
│ ┌─── AZ us-east-1a ───┐ ┌─── AZ us-east-1b ───┐ ┌─── AZ us-east-1c ┐│
│ │ │ │ │ │ ││
│ │ Public 10.0.1.0/24 │ │ Public 10.0.2.0/24 │ │ Public 10.0.3.0 ││
│ │ └── NAT Gateway │ │ └── NAT Gateway │ │ └── NAT GW ││
│ │ └── ALB ENIs │ │ └── ALB ENIs │ │ └── ALB ENIs ││
│ │ │ │ │ │ ││
│ │ Private 10.0.11.0/24 │ │ Private 10.0.12.0/24 │ │ Private 10.0.13 ││
│ │ └── EKS Nodes │ │ └── EKS Nodes │ │ └── EKS Nodes ││
│ │ └── Pods (VPC CNI) │ │ └── Pods (VPC CNI) │ │ └── Pods ││
│ │ │ │ │ │ ││
│ │ Isolated 10.0.21/24 │ │ Isolated 10.0.22/24 │ │ Isolated 10.0.23 ││
│ │ └── RDS Primary │ │ └── RDS Standby │ │ └── ElastiCache││
│ │ └── No internet │ │ └── No internet │ │ └── No internet││
│ └──────────────────────┘ └──────────────────────┘ └──────────────────┘│
│ │
│ VPC Endpoints: S3 (Gateway), ECR, STS, Secrets Manager, KMS, Logs │
└─────────────────────────────────────────────────────────────────────────┘
- 3 AZs, 9 subnets — public (ALB, NAT), private (EKS nodes, pods), isolated (data tier — no internet route)
- VPC Flow Logs → S3 with Athena queries for security forensics
- VPC Endpoints for all AWS API calls — pods never leave the VPC to talk to AWS services
- Custom NACLs on isolated subnets — only allow ingress from private subnet CIDRs on database ports
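One of the isolated-subnet NACL rules can be sketched as follows — CIDRs mirror the diagram above, while the resource names and rule number are arbitrary:

```hcl
# Isolated-subnet NACL: allow PostgreSQL only from the private (EKS)
# subnet in us-east-1a. One rule per private-subnet CIDR per port.
resource "aws_network_acl_rule" "postgres_from_private_1a" {
  network_acl_id = aws_network_acl.isolated.id
  rule_number    = 100
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "10.0.11.0/24"  # private subnet, us-east-1a (see diagram)
  from_port      = 5432
  to_port        = 5432
}
```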
Kubernetes Platform
EKS Cluster Configuration
- EKS 1.29 with envelope encryption (KMS), OIDC provider for IRSA, private API endpoint + public with CIDR allow-list
- VPC CNI in custom networking mode — pods get IPs from dedicated pod CIDRs, not node subnets
- EBS CSI Driver with IRSA for dynamic PVC provisioning (gp3, encrypted)
- CoreDNS with NodeLocal DNSCache for low-latency service discovery
Node Strategy — Karpenter
# Karpenter NodePool — general workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: general
spec:
template:
metadata:
labels:
workload-type: general
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["m6i.xlarge", "m6i.2xlarge", "m7i.xlarge", "m7i.2xlarge",
"m6a.xlarge", "m6a.2xlarge", "c6i.xlarge", "c6i.2xlarge"]
- key: topology.kubernetes.io/zone
operator: In
values: ["us-east-1a", "us-east-1b", "us-east-1c"]
expireAfter: 720h
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 60s
limits:
cpu: "200"
memory: 800Gi
weight: 50
---
# Karpenter NodePool — GPU for ML anomaly detection (spot only)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: gpu-spot
spec:
template:
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: gpu
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["g5.xlarge", "g5.2xlarge", "g4dn.xlarge"]
taints:
- key: nvidia.com/gpu
effect: NoSchedule
limits:
cpu: "32"
memory: 128Gi
Security Hardening
- OPA Gatekeeper with custom constraint templates:
- Deny containers running as root
- Require resource limits on all pods
- Deny hostPath volumes and hostNetwork
- Enforce approved container registries (ECR only)
- Require pod security labels
# OPA Constraint: Deny containers without resource limits
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
name: require-resource-limits
spec:
match:
kinds:
- apiGroups: ["apps"]
kinds: ["Deployment", "StatefulSet"]
namespaces: ["fintrack"]
parameters:
requiredResources:
- limits
- requests
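The K8sRequiredResources constraint above is backed by a custom ConstraintTemplate. The original isn't shown, but a simplified version of the Rego might look like:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredresources
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredResources
      validation:
        openAPIV3Schema:
          type: object
          properties:
            requiredResources:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredresources

        # Flag any container missing a required resources key (limits/requests)
        violation[{"msg": msg}] {
          container := input.review.object.spec.template.spec.containers[_]
          required := input.parameters.requiredResources[_]
          not container.resources[required]
          msg := sprintf("container %v is missing resources.%v", [container.name, required])
        }
```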
- Network Policies isolating namespaces — fintrack pods can only talk to fintrack pods + kube-dns + ingress
- Trivy Operator scanning images on admission, blocking Critical/High CVEs
- Pod Security Standards enforced at namespace level (restricted profile)
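The namespace isolation described above can be sketched as a default-deny policy with same-namespace and DNS allowances; the ingress-controller exception is elided, and the selectors are assumptions:

```yaml
# Default-deny with explicit allowances for the fintrack namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: fintrack-default-isolation
  namespace: fintrack
spec:
  podSelector: {}                      # applies to every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector: {}              # only same-namespace pods
  egress:
    - to:
        - podSelector: {}              # same-namespace pods
    - to:                              # kube-dns in kube-system
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```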
Custom Helm Chart
Single reusable Helm chart (fintrack-service-chart) deployed 6 times with per-service value overrides:
# values-reconciliation-service.yaml
replicaCount: 3
image:
repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/reconciliation-service
tag: "main-487"
serviceAccount:
create: true
name: reconciliation-service
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/fintrack-reconciliation-service
env:
- name: DB_HOST
valueFrom:
secretKeyRef:
name: rds-credentials
key: host
- name: AWS_REGION
value: us-east-1
- name: SQS_QUEUE_URL
valueFrom:
secretKeyRef:
name: sqs-config
key: transactions-queue-url
resources:
requests:
cpu: 250m
memory: 512Mi
limits:
cpu: "1"
memory: 1Gi
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 15
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 65
- type: External
external:
metric:
name: sqs_queue_depth
selector:
matchLabels:
queue: transactions
target:
type: AverageValue
averageValue: "100"
podDisruptionBudget:
minAvailable: 2
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
External Secrets Operator
Syncing secrets from AWS Secrets Manager into Kubernetes using IRSA:
# ClusterSecretStore — authenticates to Secrets Manager via IRSA
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
name: aws-secrets-manager
spec:
provider:
aws:
service: SecretsManager
region: us-east-1
auth:
jwt:
serviceAccountRef:
name: external-secrets-sa
namespace: external-secrets
---
# ExternalSecret — syncs RDS credentials
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: rds-credentials
namespace: fintrack
spec:
refreshInterval: 1h
secretStoreRef:
kind: ClusterSecretStore
name: aws-secrets-manager
target:
name: rds-credentials
creationPolicy: Owner
data:
- secretKey: host
remoteRef:
key: fintrack/production/rds
property: host
- secretKey: port
remoteRef:
key: fintrack/production/rds
property: port
- secretKey: username
remoteRef:
key: fintrack/production/rds
property: username
- secretKey: password
remoteRef:
key: fintrack/production/rds
property: password
GitOps (ArgoCD)
App-of-Apps Pattern
ArgoCD is bootstrapped using the app-of-apps pattern — a root Application pointing to a directory that contains Application manifests for every service and cluster add-on:
gitops-repo/
├── apps/ # Root app-of-apps directory
│ ├── reconciliation-service.yaml
│ ├── reporting-service.yaml
│ ├── notification-service.yaml
│ ├── auth-service.yaml
│ ├── transaction-worker.yaml
│ ├── ml-anomaly-detector.yaml
│ └── addons/
│ ├── external-secrets.yaml
│ ├── cert-manager.yaml
│ ├── karpenter.yaml
│ ├── aws-lb-controller.yaml
│ ├── opa-gatekeeper.yaml
│ ├── trivy-operator.yaml
│ ├── metrics-server.yaml
│ ├── prometheus-stack.yaml
│ ├── loki-stack.yaml
│ └── fluent-bit.yaml
├── services/ # Helm value overrides per service
│ ├── reconciliation-service/
│ │ ├── Chart.yaml
│ │ └── values.yaml
│ └── ...
└── addons/ # Add-on Helm value overrides
├── external-secrets/
├── cert-manager/
└── ...
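The root Application that bootstraps this tree might look like the following — the repoURL matches the one used by the service Applications, the rest is a sketch:

```yaml
# Root app-of-apps: syncs every Application manifest under apps/
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: fintrack
  source:
    repoURL: git@github.com:fintrack/gitops.git
    targetRevision: main
    path: apps
    directory:
      recurse: true        # picks up apps/ and apps/addons/
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```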
ArgoCD Image Updater
Automatically detects new images in ECR and updates the GitOps repo:
# ArgoCD Application with Image Updater annotations
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: reconciliation-service
namespace: argocd
annotations:
argocd-image-updater.argoproj.io/image-list: >-
main=123456789012.dkr.ecr.us-east-1.amazonaws.com/reconciliation-service
argocd-image-updater.argoproj.io/main.update-strategy: newest-build
argocd-image-updater.argoproj.io/main.allow-tags: "regexp:^main-\\d+$"
argocd-image-updater.argoproj.io/write-back-method: git
argocd-image-updater.argoproj.io/git-branch: main
spec:
project: fintrack
source:
repoURL: git@github.com:fintrack/gitops.git
targetRevision: main
path: services/reconciliation-service
helm:
valueFiles:
- values.yaml
destination:
server: https://kubernetes.default.svc
namespace: fintrack
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
- ServerSideApply=true
retry:
limit: 3
backoff:
duration: 30s
maxDuration: 3m0s
factor: 2
CI/CD Pipelines (GitHub Actions)
Backend Pipeline — 5 Stages
A 480-line reusable workflow called per service via matrix strategy:
# .github/workflows/backend-ci.yml (simplified)
name: Backend CI/CD
on:
push:
branches: [main]
paths: ["services/**"]
permissions:
id-token: write # OIDC for AWS auth
contents: read
security-events: write
jobs:
detect-changes:
runs-on: ubuntu-latest
outputs:
services: ${{ steps.filter.outputs.changes }}
steps:
- uses: dorny/paths-filter@v3
id: filter
with:
filters: |
reconciliation-service: services/reconciliation-service/**
reporting-service: services/reporting-service/**
notification-service: services/notification-service/**
auth-service: services/auth-service/**
build-and-push:
needs: detect-changes
if: needs.detect-changes.outputs.services != '[]'
strategy:
matrix:
service: ${{ fromJson(needs.detect-changes.outputs.services) }}
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Configure AWS credentials (OIDC)
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-actions-ci
aws-region: us-east-1
- name: Login to ECR
uses: aws-actions/amazon-ecr-login@v2
- name: Run tests
run: |
cd services/${{ matrix.service }}
npm ci && npm run lint && npm run test:coverage
- name: Trivy vulnerability scan
uses: aquasecurity/trivy-action@master
with:
scan-type: fs
scan-ref: services/${{ matrix.service }}
severity: CRITICAL,HIGH
exit-code: 1
- name: Build and push to ECR
uses: docker/build-push-action@v6
with:
context: services/${{ matrix.service }}
push: true
# ECR tags are immutable, so only the unique build tag is pushed (no floating "latest")
tags: ${{ env.ECR_REGISTRY }}/${{ matrix.service }}:main-${{ github.run_number }}
db-migration:
needs: build-and-push
runs-on: ubuntu-latest
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-actions-migration
aws-region: us-east-1
- name: Run Prisma migrations
run: npx prisma migrate deploy
- name: Rollback on failure
if: failure()
run: |
npx prisma migrate resolve --rolled-back \
$(npx prisma migrate status --json | jq -r '.migrations[-1].name')
verify-deployment:
needs: db-migration
runs-on: ubuntu-latest
steps:
- name: Wait for ArgoCD sync
run: |
for i in $(seq 1 30); do
HEALTH=$(argocd app get $SERVICE -o json | jq -r '.status.health.status')
if [[ "$HEALTH" == "Healthy" ]]; then
echo "✅ $SERVICE is synced and healthy"
exit 0
fi
echo "⏳ Waiting... ($HEALTH)"
sleep 10
done
echo "❌ Timeout" && exit 1
- name: Smoke test
run: |
HTTP=$(curl -s -o /dev/null -w "%{http_code}" https://api.fintrack.io/health)
[[ "$HTTP" == "200" ]] && echo "✅ Healthy" || (echo "❌ HTTP $HTTP" && exit 1)
Frontend Pipeline
React SPA build → S3 deployment → CloudFront cache invalidation:
- name: Deploy to S3
run: |
aws s3 sync dist/ s3://fintrack-frontend-prod \
--delete \
--cache-control "public, max-age=31536000, immutable" \
--exclude "index.html" --exclude "*.json"
aws s3 cp dist/index.html s3://fintrack-frontend-prod/index.html \
--cache-control "no-cache, no-store, must-revalidate"
- name: Invalidate CloudFront cache
run: |
aws cloudfront create-invalidation \
--distribution-id ${{ vars.CF_DISTRIBUTION_ID }} \
--paths "/index.html" "/asset-manifest.json"
GitHub Actions → AWS — Zero Credentials
GitHub Actions authenticates via OIDC federation. No AWS access keys stored in GitHub Secrets:
# Terraform: GitHub OIDC provider + CI role
resource "aws_iam_openid_connect_provider" "github" {
url = "https://token.actions.githubusercontent.com"
client_id_list = ["sts.amazonaws.com"]
thumbprint_list = [data.tls_certificate.github.certificates[0].sha1_fingerprint]
}
resource "aws_iam_role" "github_actions_ci" {
name = "github-actions-ci"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Federated = aws_iam_openid_connect_provider.github.arn
}
Action = "sts:AssumeRoleWithWebIdentity"
Condition = {
StringEquals = {
"token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
}
StringLike = {
"token.actions.githubusercontent.com:sub" = "repo:fintrack/*:ref:refs/heads/main"
}
}
}]
})
}
Observability
Metrics — Prometheus + Grafana
- kube-prometheus-stack deployed via ArgoCD — 15-day retention, 50GB PVC (gp3)
- ServiceMonitor CRDs for every fintrack service
- 12 custom Grafana dashboards — SLA overview, per-service latency, error rates, SQS queue depth, pod resource usage, node cost breakdown
- 35+ PrometheusRule alert conditions
# Custom PrometheusRule — transaction processing SLA
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: fintrack-sla-alerts
namespace: monitoring
spec:
groups:
- name: fintrack.sla
interval: 30s
rules:
- alert: HighTransactionLatency
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{
service="reconciliation-service",
route="/api/reconcile"
}[5m])) by (le)
) > 2.0
for: 5m
labels:
severity: critical
annotations:
summary: "P99 reconciliation latency exceeds 2s SLA"
- alert: TransactionQueueBacklog
expr: |
aws_sqs_approximate_number_of_messages_visible{
queue_name="fintrack-transactions"
} > 10000
for: 10m
labels:
severity: warning
annotations:
summary: "Transaction queue backlog exceeds 10k messages"
- alert: PodCrashLooping
expr: |
increase(kube_pod_container_status_restarts_total{
namespace="fintrack"
}[1h]) > 5
labels:
severity: critical
annotations:
summary: "{{ $labels.pod }} restarted {{ $value }} times in 1h"
Logging — Loki + Fluent Bit
- Fluent Bit DaemonSet collecting container logs from all nodes, enriching with Kubernetes metadata, shipping to Loki
- Loki with S3 backend for long-term storage (90-day retention), compactor for cost optimization
- Grafana unified view — logs + metrics on the same dashboard, correlated by trace ID
Alerting
- Alertmanager routing tree: critical → PagerDuty + Slack #incidents; warning → Slack #platform-alerts; info → silenced outside business hours
- CloudWatch Alarms for infrastructure-level alerts: RDS CPU, ElastiCache memory, NAT Gateway error count
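The Alertmanager routing tree above translates to a config along these lines — receiver names, the mute-interval name, and the integration key are placeholders:

```yaml
route:
  receiver: slack-platform-alerts
  group_by: ["alertname", "namespace"]
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-critical
    - matchers: ['severity="warning"']
      receiver: slack-platform-alerts
    - matchers: ['severity="info"']
      receiver: slack-platform-alerts
      mute_time_intervals: ["outside-business-hours"]  # assumed interval name
receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"     # placeholder
    slack_configs:
      - channel: "#incidents"
  - name: slack-platform-alerts
    slack_configs:
      - channel: "#platform-alerts"
```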
Key Achievements
- 99.97% uptime over 18 months — single P1 incident (RDS failover, resolved in 8 minutes)
- Zero static AWS credentials — full IRSA + OIDC federation for pods and CI/CD
- ~2M transactions/day processed with P99 latency under 800ms
- 60% compute cost reduction via Karpenter spot instances + consolidation policies
- SOC 2 compliant — full audit trail, encryption at rest + in transit, least-privilege IAM
- Mean deployment time: 4 minutes from merge to production (GitHub Actions → ArgoCD Image Updater)
- 12 Terraform modules, all versioned, tested, and reusable across dev/staging/prod environments
Need something similar?
Let's discuss how I can build this kind of infrastructure for your team.