Building a Production-Grade Microservices Platform on Kubernetes
There is a difference between building a microservices demo and building a microservices platform. A demo splits a monolith into a handful of HTTP services. A platform is designed around bounded contexts, enforces API contracts between teams, gates every deployment behind an automated test pyramid, instruments every request end-to-end, and treats the network as untrusted from day one.
This post is an architectural deep dive into ShopEase — a full-stack, cloud-native eCommerce platform I built to production standards. The platform consists of four backend microservices, a Next.js frontend, three independent database engines, a five-layer test pyramid with 330+ tests, automated CI/CD, and a full observability stack — all running on Kubernetes.
Live at: shop.kunlecreates.org
GitHub: kunlecreates/shop-ease-enterprise-app
The goal of this post is not to walk through code line-by-line. It is to articulate the architectural thinking — the decisions that distinguish platform engineering from application development.
Pillar 1: Domain-Driven Microservices
The first design decision was the hardest: how to draw the service boundaries.
Most beginners split services by technical function — “authentication service”, “database service”, “API service”. That produces chatty, tightly coupled systems that are harder to operate than the monolith they replaced.
Domain-Driven Design (DDD) provides a better lens: split by bounded context, where each context owns a coherent piece of business behaviour and its own data model.
The Five Bounded Contexts
┌───────────────────────────────────────────────────────────────────────────────┐
│                               ShopEase Platform                               │
├──────────────┬──────────────┬───────────────┬──────────────┬──────────────────┤
│ frontend     │ user-service │product-service│ order-service│ notification-svc │
│              │              │               │              │                  │
│ Next.js 15   │ Identity &   │ Catalogue &   │ Carts &      │ Email delivery   │
│ App Router   │ Auth         │ Inventory     │ Fulfilment   │ (stateless)      │
│              │              │               │              │                  │
│ React 19     │ Oracle DB    │ PostgreSQL    │ MS SQL Svr   │ FastAPI          │
└──────────────┴──────────────┴───────────────┴──────────────┴──────────────────┘
Why this split?
- user-service owns Identity: registration, login, JWT issuance, email verification, password reset, RBAC, and the complete login audit trail. No other service handles authentication logic. The bounded context is enforced at the data layer: user_ref values in other services are opaque strings, not foreign keys.
- product-service owns the Catalogue: SKU definitions, pricing, categories, inventory levels, and stock movement audit. It is completely independent of orders or users. This allows the catalogue to scale independently during traffic spikes and be deployed without touching the checkout flow.
- order-service owns Fulfilment: shopping carts, order lifecycle, payment metadata, and shipping address snapshots. The key design rule here is that the shipping address is snapshotted at checkout time. A user updating their profile later never retroactively corrupts historical order data, a real-world requirement that most demos miss entirely.
- notification-service owns Delivery: it consumes order events and dispatches confirmation emails via Jinja2-templated SMTP. It is deliberately stateless: no database, no state machine, just a FastAPI service that renders a template and sends an email. Stateless services are trivially scalable and testable.
- frontend is not just a UI; it is also a thin API proxy layer. Every backend call is routed through Next.js server-side API routes, which forward to the appropriate ClusterIP service. This means the browser never talks directly to backend services, and JWT tokens are stored in HttpOnly cookies, not localStorage.
The Rule: No Cross-Service Foreign Keys
Cross-service data references use opaque string keys (user_ref, product_ref). There are no database-level foreign key constraints crossing service boundaries. Services communicate via APIs. This is what makes each service independently deployable.
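The two data rules above (opaque references and the checkout-time address snapshot) can be sketched in a few lines. This is a minimal illustration in Python rather than the platform's Java code, and every class, field, and value here is hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ShippingAddress:
    """Immutable address value, stored with the order at checkout."""
    line1: str
    city: str
    postcode: str


@dataclass
class UserProfile:
    """Hypothetical stand-in for data owned by user-service."""
    user_ref: str
    address: ShippingAddress


@dataclass(frozen=True)
class Order:
    """Hypothetical order-service record."""
    order_id: str
    user_ref: str                      # opaque string, not a foreign key
    product_refs: tuple[str, ...]      # opaque strings as well
    shipping_address: ShippingAddress  # snapshot, not a live lookup


def checkout(order_id, profile, product_refs):
    # Store a copy of the address on the order instead of a profile reference.
    snapshot = replace(profile.address)
    return Order(order_id, profile.user_ref, tuple(product_refs), snapshot)


profile = UserProfile("usr_42", ShippingAddress("1 Old Street", "Leeds", "LS1 1AA"))
order = checkout("ord_1", profile, ["sku_100", "sku_200"])

# The user moves house after the order was placed.
profile.address = ShippingAddress("9 New Road", "York", "YO1 9ZZ")

print(order.shipping_address.line1)  # 1 Old Street
```

The order keeps the address it was placed with because it never reads the profile again; the only thing linking the two records is the user_ref string.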
Pillar 2: Polyglot Service Architecture
Once bounded contexts were defined, the next question was: which technology is right for each domain?
Choosing the same technology for every service is operationally simpler, but it means accepting a poor-fit language for some problems. A platform engineer chooses the best tool for the bounded context.
| Service | Language & Framework | Why |
|---|---|---|
| user-service | Java 21 + Spring Boot 3.3 | Stateful auth domain with complex lifecycle logic (verification tokens, password reset, login audit). Java’s type system and Spring Security’s maturity make RBAC and JWT/RS256 handling robust and well-tested. |
| order-service | Java 21 + Spring Boot 3.3 | Cart and order state machines benefit from Java’s strong typing. Spring Data JPA + Testcontainers (MSSQL) gives production-identical integration testing. |
| product-service | Node.js 20 + NestJS 10 | The catalogue is primarily CRUD + full-text search. NestJS’s decorator-driven module system matches the domain model cleanly. PostgreSQL’s TSVECTOR full-text search integrates naturally with a TypeScript layer. |
| notification-service | Python 3.12 + FastAPI | A stateless email dispatcher. Python’s Jinja2 ecosystem produces expressive email templates. FastAPI gives async handling with minimal boilerplate. For a fire-and-forget concern, Python is the right weight. |
| frontend | Next.js 15 + React 19 | App Router enables server-side rendering for SEO-critical product pages and server-side JWT verification before rendering protected routes. |
Each service is independently buildable, testable, and deployable. The CI pipelines are parallel — all five services build simultaneously on push.
Pillar 3: Contract-Driven API Integration
In microservices systems, the biggest category of integration bugs is contract drift: Service A changes a response field name; Service B breaks silently. Traditional unit tests catch nothing because they mock the other service. Manual testing catches it eventually, but only after a deployment.
Contract testing closes this gap by making the expected API shape an explicit, versioned artefact — tested on every push.
How the API Test Layer Works
api-tests/
├── contracts/    ← Schema validation tests (what shape does this service expose?)
│   ├── user-product.contract.test.ts     # product-service API matches user expectations
│   ├── order-product.contract.test.ts    # order-service expectations of product-service
│   └── ...
└── flows/        ← Multi-service business flow tests (do services work together?)
    ├── checkout.flow.test.ts             # login → browse → cart → checkout → confirmation
    ├── user-registration.flow.test.ts
    └── ...
Contract tests (contracts/) run against live deployed services. They validate:
- Required fields are present in every response
- Data types match the expected schema
- Error codes are consistent (e.g., 401 for expired JWT, 404 for missing resource)
- Pagination envelope format is stable
Flow tests (flows/) are end-to-end business workflow validators. They exercise real multi-service paths: a checkout.flow.test.ts run logs in as a real user, searches for a real product, adds it to a cart, checks out, and validates that an order record appears in the order service.
This layer runs in the CI pipeline after deployment to the staging cluster — it tests the real services, not mocks. A contract violation fails the pipeline before production promotion.
The practical result: when the product-service team changes a response envelope for performance reasons, the downstream contract tests fail immediately, not three deployments later.
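To make the idea of a contract check concrete, here is a minimal sketch of the kind of shape validation described above. It is written in plain Python for illustration (the real suite is TypeScript), and the field names in both contracts are assumptions, not the actual product-service schema.

```python
# Hypothetical contract for a paginated product list; the field names are
# assumptions for this sketch, not the real product-service schema.
PAGE_CONTRACT = {"items": list, "page": int, "pageSize": int, "total": int}
ITEM_CONTRACT = {"sku": str, "name": str, "price": (int, float)}


def violations(payload, contract, path="$"):
    """List every field that is missing or has the wrong type."""
    problems = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"{path}.{field}: missing")
            continue
        value = payload[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{path}.{field}: unexpected type {type(value).__name__}")
    return problems


def check_product_page(payload):
    """Validate the envelope first, then each item inside it."""
    problems = violations(payload, PAGE_CONTRACT)
    items = payload.get("items")
    if isinstance(items, list):
        for index, item in enumerate(items):
            if isinstance(item, dict):
                problems += violations(item, ITEM_CONTRACT, f"$.items[{index}]")
            else:
                problems.append(f"$.items[{index}]: not an object")
    return problems


current = {
    "items": [{"sku": "sku_100", "name": "Mug", "price": 9.5}],
    "page": 1, "pageSize": 20, "total": 1,
}
# Simulated drift: pageSize renamed, price serialised as a string.
drifted = {
    "items": [{"sku": "sku_100", "name": "Mug", "price": "9.50"}],
    "page": 1, "page_size": 20, "total": 1,
}

print(check_product_page(current))  # []
print(check_product_page(drifted))
# ['$.pageSize: missing', '$.items[0].price: unexpected type str']
```

In a pipeline, a non-empty result would fail the job, which is what turns a silent rename into a visible build failure.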
Pillar 4: Production CI/CD Strategy
The CI/CD pipeline is designed around two principles: fail fast and deploy with confidence.
Pipeline Architecture
Developer Push (any service file)
│
▼
┌────────────────────────────────────────────────────────────────────┐
│ GitHub Actions CI Pipeline │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ Unit Tests │ → │ Integration │ → │ Docker Build & Push │ │
│ │ (fast, <1m) │ │ Tests │ │ → GHCR │ │
│ │ │ │ (real DB │ │ │ │
│ │ JUnit/Jest/ │ │ via │ │ Reuses image across │ │
│ │ pytest │ │ Testcont.) │ │ environments │ │
│ └──────────────┘ └──────────────┘ └────────────────────────┘ │
│ │ │ │ │
│ │ FAIL │ FAIL │ SUCCESS │
│ ▼ ▼ ▼ │
│ Block PR Block PR Helm Deploy → K8s │
└────────────────────────────────────────────────────────────────────┘
│ │
▼ (post-deploy) ▼
┌─────────────────────────┐ ┌──────────────────────────┐
│ API Contract Tests │ │ Coverage Authority │
│ (live services) │ │ (aggregate from all CIs)│
│ Blocks promotion │ │ Updates badge │
└─────────────────────────┘ └──────────────────────────┘
│
▼
┌─────────────────────────┐
│ Playwright E2E Tests │
│ (browser, post-deploy) │
│ Does not block merge │
└─────────────────────────┘
Five Key Governance Decisions
- Integration tests block PRs using real databases via Testcontainers. This eliminates the “works on my machine” class of failures. The Testcontainers pattern spins up a real PostgreSQL/MSSQL/Oracle container per test run, runs migrations against it, and tears it down, giving production-identical test fidelity in under five minutes.
- Parallel service pipelines. All five services have independent CI pipelines that run concurrently on push. A change to product-service does not wait for user-service tests to pass. Total CI time: 3–5 minutes.
- Docker images are built once and reused. The image built in CI is the exact image deployed to staging and promoted to production. There is no rebuild on deploy.
- Coverage Authority workflow aggregates test coverage from all five service CI runs after they complete, computes a weighted aggregate, and auto-updates the coverage badge in the README. Coverage tracking is automated; it is not a manual step.
- Self-hosted ARC runner. GitHub Actions Runner Controller (ARC) is deployed on the cluster. CI jobs run on the same infrastructure as production, eliminating environment-specific surprises.
Most teams track test coverage per-service and call it done. Coverage Authority operates at the platform level. After all five independent service CI pipelines complete, a dedicated workflow collects each service's coverage report, computes a weighted aggregate across the entire platform, and automatically updates the README badge. A failing aggregate gate — where any service drops below its configured threshold — blocks the composite status from going green. This is quality governance engineered as infrastructure: a single authoritative view of platform-wide test health, updated on every push with zero manual steps.
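The aggregation step is simple arithmetic, and a short sketch makes the gate explicit. The post does not say how the aggregate is weighted, so this example assumes weighting by line count; the service figures and thresholds are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceCoverage:
    name: str
    covered_lines: int
    total_lines: int
    threshold: float  # minimum acceptable coverage, as a percentage

    @property
    def percent(self):
        return 100.0 * self.covered_lines / self.total_lines


def aggregate(reports):
    """Return line-weighted platform coverage and services below threshold."""
    covered = sum(r.covered_lines for r in reports)
    total = sum(r.total_lines for r in reports)
    failing = [r.name for r in reports if r.percent < r.threshold]
    return 100.0 * covered / total, failing


# Illustrative numbers only, not real ShopEase coverage data.
reports = [
    ServiceCoverage("user-service", 900, 1000, 85.0),
    ServiceCoverage("product-service", 400, 500, 80.0),
    ServiceCoverage("notification-service", 150, 200, 80.0),
]

percent, failing = aggregate(reports)
gate_passed = not failing
print(f"{percent:.1f}% aggregate, below threshold: {failing}")
# 85.3% aggregate, below threshold: ['notification-service']
```

Note that the aggregate alone (85.3%) would look healthy here. It is the per-service threshold check that catches the one service dragging behind.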
Pillar 5: Observability-Native Services
Observability is not something you bolt on after a production incident. It is designed in from the beginning. The ShopEase platform uses OpenTelemetry auto-instrumentation via the Kubernetes Operator pattern — every service emits distributed traces, metrics, and logs with zero manual SDK instrumentation code.
How OpenTelemetry Auto-Instrumentation Works
The OpenTelemetry Operator is deployed to the opentelemetry-system namespace. It watches for Instrumentation Custom Resources and, for every pod that opts in via annotation, injects an init container that copies the appropriate OTel agent into the pod and configures the application container to load it at startup.
# Java Instrumentation CR: applies to user-service and order-service
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
  namespace: shopease-system
spec:
  exporter:
    endpoint: http://otel-collector.opentelemetry-system:4317
  propagators: [tracecontext, baggage, b3]
  sampler:
    # The Operator expects snake_case sampler names (always_on, not AlwaysOn)
    type: always_on
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
    # Tuned to reduce agent overhead by 40-60% vs the default configuration
    env:
      - name: OTEL_INSTRUMENTATION_COMMON_EXPERIMENTAL_CONTROLLER_TELEMETRY_ENABLED
        value: "false"
When the user-service Pod is created, the Operator detects the instrumentation.opentelemetry.io/inject-java: "true" pod annotation and injects the init container that delivers the Java agent. The application code never changes: no SDK imports, no configuration code.
The Full Signal Path
Auto-instrumented Pod
┌──────────────────────────┐
│ Application Code │
│ (unchanged) │
│ │
│ OTel Java Agent │ ← injected by Operator
│ (init container) │
└───────────┬──────────────┘
│ OTLP/gRPC
▼
┌──────────────────────────┐
│ OpenTelemetry Collector │
│ (gateway mode) │
└──┬───────────┬───────────┘
│ │ │
Traces Metrics Logs
▼ ▼ ▼
Jaeger Prometheus ECK
(trace (metrics + (Elasticsearch
UI) Grafana) + Kibana)
What This Gives You
- Distributed traces in Jaeger: every inbound HTTP request to user-service generates a trace that propagates through order-service and notification-service. Root cause analysis of a latency spike requires correlating one trace ID, not grepping logs across five pods.
- JVM metrics in Prometheus: heap usage, GC pause durations, thread pool saturation, and DB connection pool utilisation are auto-collected. Grafana dashboards alert before OOM errors occur.
- Structured logs in Kibana: logs from all five services are centrally aggregated, indexed, and searchable. A distributed transaction ID can find every log line across every service in one query.
The overhead optimisation (40–60% reduction vs default) was achieved by disabling experimental features and tuning sampler configuration — production observability must not noticeably degrade the service it observes.
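The tracecontext propagator listed in the Instrumentation CR is what lets one trace ID follow a request across services. The injected agents handle this automatically; the sketch below only illustrates the W3C traceparent header format they exchange, with hypothetical helper names.

```python
import re
import secrets

# W3C Trace Context: version "00", 32-hex trace ID, 16-hex parent span ID,
# 2-hex flags (01 means sampled).
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace: random trace ID and span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming):
    """Keep the caller's trace ID and mint a new span ID for the next hop."""
    match = TRACEPARENT.match(incoming)
    if match is None:
        return new_traceparent()  # malformed header: start a fresh trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Example header value taken from the W3C Trace Context specification.
incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)

print(outgoing.split("-")[1])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

Every hop gets its own span ID, but the trace ID in the second field stays constant, which is why a single ID is enough to pull up the whole request path in Jaeger.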
Pillar 6: Zero-Trust Service Networking
In a default Kubernetes cluster, every pod can talk to every other pod. This violates the principle of least privilege and creates a large blast radius if any single service is compromised.
ShopEase implements zero-trust NetworkPolicies: every service is isolated by default, with explicit ingress and egress rules for each permitted communication path.
The Policy Model
Default posture: deny all ingress and egress for all pods
│
▼ Explicit exceptions:
┌─────────────────────────────────────────────────────────────────┐
│ Allowed Communication Paths │
│ │
│ NGINX Ingress → frontend (port 3000) │
│ NGINX Ingress → user-service (port 8080) │
│ NGINX Ingress → product-service (port 3001) │
│ NGINX Ingress → order-service (port 8081) │
│ │
│ frontend → user-service (JWT validation proxy) │
│ frontend → product-service (catalogue proxy) │
│ frontend → order-service (cart/order proxy) │
│ │
│ order-service → notification-svc (order confirmation) │
│ order-service → product-service (inventory reservation) │
│ │
│ All services → Oracle/PG/MSSQL (own database only) │
│ All services → OTEL Collector (telemetry egress) │
└─────────────────────────────────────────────────────────────────┘
What Zero-Trust Prevents
If the notification-service were compromised (say, through a dependency vulnerability), it could not move laterally to query the user-service Oracle database or place orders on behalf of users. Its NetworkPolicy limits egress to the SMTP endpoint and the OTel collector only. East-west movement is blocked at the network layer by the CNI plugin that enforces the policies, so the traffic never reaches the target pod.
This is a security control that operates independently of application-layer JWT validation. Even if a JWT were forged or stolen, a compromised pod still cannot route directly to a database it has no NetworkPolicy permission to reach.
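As a rough sketch of what the default-deny posture and the notification-service exception could look like, the snippet below builds the two NetworkPolicy objects as Python dictionaries that mirror the manifest structure. The namespace, the app label, the SMTP relay address, and both ports are assumptions made for this example, not values from the ShopEase repository.

```python
import json

# Assumptions: namespace, pod label, SMTP relay address and both ports are
# placeholders chosen for this sketch.
NAMESPACE = "shopease-system"


def default_deny():
    """Select every pod in the namespace and allow no ingress or egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": NAMESPACE},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_egress(name, app_label, destinations):
    """Allow pods labelled app=<app_label> to reach only the given peers."""
    rules = [
        {"to": [peer], "ports": [{"protocol": "TCP", "port": port}]}
        for peer, port in destinations
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "spec": {
            "podSelector": {"matchLabels": {"app": app_label}},
            "policyTypes": ["Egress"],
            "egress": rules,
        },
    }


smtp_relay = {"ipBlock": {"cidr": "203.0.113.10/32"}}
otel_collector = {
    "namespaceSelector": {
        "matchLabels": {"kubernetes.io/metadata.name": "opentelemetry-system"}
    }
}

notification_policy = allow_egress(
    "notification-service-egress",
    "notification-service",
    [(smtp_relay, 587), (otel_collector, 4317)],
)

print(json.dumps(notification_policy, indent=2))
```

One practical caveat: a default-deny egress policy also blocks DNS lookups, so a real policy set usually needs an additional rule allowing port 53 to the cluster DNS service.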
The System as a Whole
These six pillars are not independent concerns — they reinforce each other:
- DDD boundaries make contract tests meaningful (each service has a clear, testable API surface)
- Polyglot persistence is possible because each service owns its database (no shared schema to coordinate)
- Contract tests catch drift before deployment (CI governance)
- Zero-trust networking enforces the bounded context at the infrastructure level
- Auto-instrumentation gives full-stack visibility without requiring each service team to maintain SDK boilerplate
Platform engineering is the discipline of building systems that are independently deployable, jointly observable, and collectively trustworthy. Each technical decision in ShopEase exists to serve one of those three properties.
Platform Metrics
| Dimension | Value |
|---|---|
| Services | 4 backend + 1 frontend |
| Test layers | 5 (unit → frontend unit → integration → API contracts → E2E) |
| Total tests | 330+ across all layers |
| Code coverage | 85%+ aggregate (target: 90%) |
| Integration test approach | Testcontainers (real Oracle, PostgreSQL, MSSQL in CI) |
| Deployment time | ~5 minutes end-to-end (GitHub push → live on Kubernetes) |
| Observability signals | Traces (Jaeger), Metrics (Prometheus/Grafana), Logs (ECK) |
| Database engines | Oracle DB 23c, PostgreSQL 17, MS SQL Server 2022 |
What I Would Do Differently
Async event bus over synchronous order→notification calls. The current order-service → notification-service path is synchronous HTTP. For production, this should be a message queue (Kafka or RabbitMQ) with at-least-once delivery guarantees. A slow SMTP server should not add latency to the checkout response.
Saga pattern for distributed transactions. The checkout flow spans order creation and inventory reservation. Currently, a failure between these steps requires manual reconciliation. A choreography-based saga with compensating transactions would make this self-healing.
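The core of any saga is the compensating transaction: each completed step registers an undo action that runs if a later step fails. The sketch below shows that mechanism in a single process, which is closer to an orchestrated saga than the choreography-based design proposed above; the step functions and data are hypothetical.

```python
class SagaFailed(Exception):
    """Raised after all completed steps have been compensated."""


def run_saga(steps):
    """Run (action, compensation) pairs; undo completed steps on failure."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception as exc:
        # Undo in reverse order so later steps are rolled back first.
        for compensation in reversed(completed):
            compensation()
        raise SagaFailed(str(exc)) from exc


# Hypothetical in-memory state standing in for two services' databases.
orders = {}
stock = {"sku_100": 1}


def create_order():
    orders["ord_1"] = "PENDING"


def cancel_order():
    orders["ord_1"] = "CANCELLED"


def reserve_inventory():
    if stock["sku_100"] < 2:
        raise RuntimeError("insufficient stock for sku_100")
    stock["sku_100"] -= 2


def release_inventory():
    stock["sku_100"] += 2


try:
    run_saga([(create_order, cancel_order), (reserve_inventory, release_inventory)])
except SagaFailed as err:
    print(err, orders)  # insufficient stock for sku_100 {'ord_1': 'CANCELLED'}
```

In a choreographed version, each service would react to the other's events and publish its own compensation event instead of being driven by one function, but the rollback ordering is the same.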
HashiCorp Vault for secret management. Kubernetes Secrets are only base64-encoded, and by default they are stored unencrypted in etcd. Vault with dynamic credentials would give short-lived, auto-rotated database passwords without any secret ever sitting in a manifest file.
Resources
- ShopEase GitHub Repository
- Live Platform
- DB-as-a-Service Kubernetes Deployments
- Observability Stack Deep Dive
- CI/CD Helm Deployment
Questions about any of the architectural decisions, or want to discuss platform engineering patterns? Email me or connect on GitHub.