Overview
At Engeem SAS, I architected a multi-tenant SaaS control plane that automates the full tenant provisioning lifecycle, from organization creation through environment setup, secrets management, and RBAC policy enforcement. The platform manages 15+ microservices and uses domain-driven design to maintain strict service boundaries.
Problem
Enterprise customers needed isolated environments with:
- Zero-touch onboarding: A new tenant should go from signup to fully provisioned in under 60 seconds
- Secrets management: No hardcoded credentials anywhere in the stack, with every credential rotated automatically
- Fine-grained RBAC: Tenants can define custom roles scoped to their organization
- Audit compliance: Every provisioning action must be traceable for SOC 2 readiness
The legacy approach was a collection of scripts and manual Jira tickets that took 2–3 days per new tenant.
Architecture
The control plane is decomposed into three domain services (Organization, Environment, and Cluster), each owning its aggregate root.
Data Flow
Tenant Provisioning Sequence
Each step is orchestrated via a saga pattern with compensating transactions. If Vault fails, the Keycloak realm is rolled back.
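To make the rollback behavior concrete, here is a minimal sketch of an orchestrated saga in plain Java. It is an illustration under assumptions, not the production orchestrator: `SagaStep`, `ProvisioningSaga`, and `FakeStep` are hypothetical names, and the steps are in-memory stand-ins for the real Keycloak and Vault calls.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** One saga step: a forward action plus the compensation that undoes it. */
interface SagaStep {
    String name();
    void execute() throws Exception;
    void compensate();
}

/** Runs steps in order; on failure, compensates completed steps in reverse order. */
final class ProvisioningSaga {
    private final List<SagaStep> steps;
    private final List<String> log = new ArrayList<>();

    ProvisioningSaga(List<SagaStep> steps) {
        this.steps = steps;
    }

    /** Returns true if every step succeeded, false if the saga was rolled back. */
    boolean run() {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.execute();
                completed.push(step);
                log.add("done:" + step.name());
            } catch (Exception e) {
                log.add("failed:" + step.name());
                // Undo in reverse order so the most recent resource is removed first.
                while (!completed.isEmpty()) {
                    SagaStep done = completed.pop();
                    done.compensate();
                    log.add("compensated:" + done.name());
                }
                return false;
            }
        }
        return true;
    }

    List<String> log() {
        return log;
    }
}

/** Hypothetical in-memory step used only for illustration. */
final class FakeStep implements SagaStep {
    private final String name;
    private final boolean fails;

    FakeStep(String name, boolean fails) {
        this.name = name;
        this.fails = fails;
    }

    public String name() {
        return name;
    }

    public void execute() throws Exception {
        if (fails) {
            throw new Exception(name + " unavailable");
        }
    }

    public void compensate() {
        // A real step would delete whatever execute() created.
    }
}
```

With a Keycloak step followed by a failing Vault step, the log records the realm being created, the Vault failure, and then the realm compensation, which is the case described above.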
API Design
The control plane exposes a RESTful API following resource-oriented design:
POST /api/v1/tenants # Create tenant
GET /api/v1/tenants/:id # Get tenant details
POST /api/v1/tenants/:id/environments # Provision environment
DELETE /api/v1/tenants/:id/environments/:env # Deprovision
POST /api/v1/tenants/:id/roles # Create custom role
GET /api/v1/tenants/:id/audit-log # Audit trail
All endpoints are secured with OIDC bearer tokens issued by Keycloak, scoped per tenant.
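As an illustration of what "scoped per tenant" can mean at the API layer, the sketch below compares the tenant in the request path with a tenant claim from an already-verified token. The claim name `tenant_id` and the class `TenantScopeCheck` are assumptions made for the example; signature and expiry validation are out of scope here.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TenantScopeCheck {
    // Matches /api/v1/tenants/{id} and any sub-resource below it.
    private static final Pattern TENANT_PATH =
            Pattern.compile("^/api/v1/tenants/([^/]+)(/.*)?$");

    /** Extracts the tenant ID from a tenant-scoped path, if there is one. */
    static Optional<String> tenantFromPath(String path) {
        Matcher m = TENANT_PATH.matcher(path);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /**
     * Returns true only when the token's tenant claim equals the tenant in the path.
     * The claims map is assumed to come from a token that was already verified;
     * "tenant_id" is a hypothetical claim name.
     */
    static boolean isAllowed(Map<String, String> claims, String path) {
        Optional<String> pathTenant = tenantFromPath(path);
        if (!pathTenant.isPresent()) {
            // Not a tenant-scoped resource (e.g. tenant creation); needs its own rule.
            return false;
        }
        return pathTenant.get().equals(claims.get("tenant_id"));
    }
}
```

A token issued for one tenant is rejected on another tenant's resources, and a token without the claim is rejected everywhere.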
Scaling Strategy
- Stateless services: Each microservice is horizontally scalable behind Kubernetes Ingress
- Event-driven provisioning: Long-running provisioning tasks use async events (RabbitMQ) to avoid HTTP timeout issues
- Schema per tenant: Each tenant gets an isolated PostgreSQL schema, managed by Flyway migrations triggered at provisioning time
- Connection pooling: HikariCP with tenant-aware routing via AbstractRoutingDataSource
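The tenant-aware routing idea can be sketched without Spring on the classpath. In Spring, a subclass of AbstractRoutingDataSource overrides `determineCurrentLookupKey()` to return a key for the current request. The plain-Java stand-in below mimics that lookup with a thread-bound tenant ID. `TenantContext`, `TenantRouter`, and the `tenant_acme` schema name are hypothetical, and the real service would resolve a pooled DataSource rather than a schema string.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Holds the tenant ID for the current request thread. */
final class TenantContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static void set(String tenantId) {
        CURRENT.set(tenantId);
    }

    static String get() {
        return CURRENT.get();
    }

    /** Must be called when the request ends so pooled threads do not leak tenants. */
    static void clear() {
        CURRENT.remove();
    }
}

/** Stand-in for the routing lookup; targets are schema names in this sketch. */
final class TenantRouter {
    private final Map<String, String> schemaByTenant = new ConcurrentHashMap<>();

    /** Called at provisioning time, once the tenant schema exists. */
    void register(String tenantId, String schema) {
        schemaByTenant.put(tenantId, schema);
    }

    /** Plays the role of the lookup-key resolution plus target selection. */
    String resolveSchema() {
        String tenantId = TenantContext.get();
        if (tenantId == null) {
            throw new IllegalStateException("no tenant bound to this thread");
        }
        String schema = schemaByTenant.get(tenantId);
        if (schema == null) {
            throw new IllegalStateException("unknown tenant: " + tenantId);
        }
        return schema;
    }
}
```

Failing loudly when no tenant is bound is a deliberate choice in this sketch: falling back to a default schema would risk cross-tenant reads.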
Reliability
- Saga orchestration: Multi-step provisioning with compensating rollbacks prevents partial states
- Health checks: Each service exposes /actuator/health with dependency-aware status (Keycloak reachable, Vault unsealed, DB connected)
- Circuit breakers: Resilience4j circuit breakers on all external calls (Keycloak, Vault, K8s API) with fallback strategies
- Provisioning SLA: < 60 seconds from API call to fully operational tenant environment
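For readers unfamiliar with the circuit breaker pattern, here is a deliberately simplified, hand-rolled version. It is not Resilience4j's implementation (which tracks a failure rate over a sliding window); it opens after a fixed number of consecutive failures and uses an injectable clock so the behavior is easy to reason about. `SimpleCircuitBreaker` and its parameters are hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Runs the action, or returns the fallback when the call fails or the breaker is open. */
    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) {
                return fallback.get(); // fail fast without touching the dependency
            }
            state = State.HALF_OPEN; // wait time elapsed: allow one trial call
        }
        try {
            T result = action.get();
            state = State.CLOSED;
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            // A failed trial call, or too many failures in a row, opens the breaker.
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

While the breaker is open, calls return the fallback immediately, which keeps a slow or unavailable dependency from consuming the provisioning time budget.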
Security
- OIDC everywhere: Service-to-service authentication via Keycloak client credentials, no shared secrets
- Vault-managed secrets: Database credentials, API keys, and TLS certificates are all dynamic secrets with automatic rotation
- Namespace isolation: Kubernetes NetworkPolicies enforce strict tenant-to-tenant isolation
- Zero hardcoded credentials: Eliminated all static secrets from codebase and CI/CD pipelines via Vault integration
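On the consuming side, dynamic secrets mean a service never holds a credential longer than its lease. The sketch below shows one way a client could cache a leased credential and fetch a fresh one shortly before expiry. It assumes a generic fetcher function in place of a real Vault client; `LeasedSecret`, `RotatingSecretHolder`, and the refresh margin are hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** A credential plus the time at which its lease expires. */
final class LeasedSecret {
    final String value;
    final long expiresAtMillis;

    LeasedSecret(String value, long expiresAtMillis) {
        this.value = value;
        this.expiresAtMillis = expiresAtMillis;
    }
}

/** Caches a leased secret and requests a new one when the lease is close to expiry. */
final class RotatingSecretHolder {
    private final Supplier<LeasedSecret> fetcher; // stand-in for a call to the secrets backend
    private final long refreshMarginMillis;
    private final LongSupplier clock;
    private LeasedSecret current;

    RotatingSecretHolder(Supplier<LeasedSecret> fetcher,
                         long refreshMarginMillis,
                         LongSupplier clock) {
        this.fetcher = fetcher;
        this.refreshMarginMillis = refreshMarginMillis;
        this.clock = clock;
    }

    /** Returns a credential whose lease has more than the refresh margin remaining. */
    synchronized String get() {
        long now = clock.getAsLong();
        if (current == null || current.expiresAtMillis - now <= refreshMarginMillis) {
            current = fetcher.get();
        }
        return current.value;
    }
}
```

Refreshing ahead of expiry rather than on failure avoids a window in which in-flight requests use a credential that has just been revoked.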
Trade-offs
| Decision | Benefit | Cost |
|----------|---------|------|
| Schema-per-tenant | Strong isolation, easy compliance | More complex migrations, higher DB overhead |
| Keycloak as IdP | Standards-compliant OIDC, rich RBAC | Operational complexity of running Keycloak |
| Saga over 2PC | Loose coupling, individual service resilience | Eventually consistent, complex error handling |
| Vault for all secrets | Dynamic rotation, audit trail | Additional infrastructure to operate |
Lessons Learned
- Domain boundaries pay off: Strict DDD boundaries between the Organization, Environment, and Cluster services meant teams could work in parallel and deploy independently.
- Automate day-2 operations early: Tenant deprovisioning and secret rotation were scoped from the start, not bolted on later. This saved weeks of rework.
- Keycloak Admin API is powerful but underdocumented: We contributed back several documentation PRs after learning the hard way about realm-level client scope propagation.
- Observability-driven development: Every provisioning step emits structured logs and metrics. When provisioning fails, we can trace the exact step that failed and the compensating action that ran.
- Infrastructure as code for tenant resources: Using Terraform modules for Vault secrets engines and K8s namespaces ensured reproducibility across staging and production.