Platform Architecture
Sparbz Cloud is built on a distributed microservices architecture running on Kubernetes. This document describes the system components, data flow, and operational model.
System Overview
The platform consists of 7 specialized microservices that work together to provide managed infrastructure:
┌─────────────────────────────────────────────────────────────┐
│                       User Interface                        │
│                (API, Console Dashboard, CLI)                │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────┐
│                 API Server (REST Endpoint)                  │
│  - Authentication & Authorization                           │
│  - Resource Management (CRUD)                               │
│  - Webhook Management                                       │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────┐
│             Real-Time Events (Kafka/RTE Stream)             │
│  - Control Plane Events (creation, updates, deletion)       │
│  - Data Plane Events (status, readiness)                    │
│  - Usage Events (per-second recordings)                     │
└─────────────┬─────────────────┬──────────────┬──────────────┘
              │                 │              │
      ┌───────▼────┐     ┌──────▼─────┐     ┌──▼────────────┐
      │   Event    │     │   Status   │     │     Usage     │
      │   Bridge   │     │   Worker   │     │   Collector   │
      │(WebSocket) │     │ (Monitor)  │     │  (Recording)  │
      └────────────┘     └────────────┘     └───┬───────────┘
                                                │
                     ┌──────────────────────────┘
                     │
       ┌─────────────▼──────────────┐
       │  Hourly at :05 past hour   │
       │  Usage Aggregator CronJob  │
       │  (Aggregates to summaries) │
       └──────────┬─────────────────┘
                  │
       ┌──────────▼─────────────────────────────┐
       │  Daily at 2 AM UTC - Meter Sync        │
       │  (Syncs to Stripe Billing API)         │
       └──────────┬──────────────┬──────────────┘
                  │              │
          ┌───────▼──┐    ┌──────▼───────┐
          │  Stripe  │    │  RTE Audit   │
          │ Billing  │    │    Events    │
          └──────────┘    └──────────────┘
                  │
       ┌──────────▼──────────────────────────────┐
       │  Daily at 3 AM UTC - Garbage Collector  │
       │  (Cleans orphaned resources)            │
       └─────────────────────────────────────────┘
Core Microservices
1. API Server (cmd/api/)
Purpose: Main REST endpoint for all resource operations
Responsibilities:
- User authentication (JWT tokens, API keys)
- Resource CRUD operations (create, read, update, delete)
- User management and authorization
- API documentation (Swagger)
- Health checks and metrics
Deployment: Kubernetes Deployment, 2 replicas
Port: 8080
Environment: Production
2. Event Bridge (cmd/event-bridge/)
Purpose: Real-time event streaming to clients
Responsibilities:
- Subscribes to all Kafka topics
- Maintains WebSocket connections
- Streams events to connected clients
- Connection state management via Redis
- HTTP long-polling fallback
Deployment: Kubernetes Deployment, 2 replicas
Port: 8082
Use Case: Real-time UI updates (status, progress indicators)
3. Status Worker (cmd/status-worker/)
Purpose: Monitor resource health and status
Responsibilities:
- Polls resource status from Kubernetes
- Checks database connectivity
- Verifies pod readiness
- Records status changes
- Publishes status events to RTE
Deployment: Kubernetes Deployment, 1 replica
Schedule: Continuous polling
Updates: Status database records
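The poll, compare, record, publish cycle described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: `StatusEvent`, `detect_status_changes`, and `poll_once` are hypothetical names, the in-memory dict stands in for the status database, and the `publish` callback stands in for the RTE producer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class StatusEvent:
    """Hypothetical event shape; the real RTE schema is not shown in this document."""
    resource_id: str
    old_status: Optional[str]
    new_status: str


def detect_status_changes(known: Dict[str, str], observed: Dict[str, str]) -> List[StatusEvent]:
    """Return one event per resource whose polled status differs from the recorded one."""
    return [
        StatusEvent(resource_id, known.get(resource_id), status)
        for resource_id, status in observed.items()
        if known.get(resource_id) != status
    ]


def poll_once(
    known: Dict[str, str],
    fetch_statuses: Callable[[], Dict[str, str]],
    publish: Callable[[StatusEvent], None],
) -> int:
    """Run one polling cycle: fetch, diff, record the change, publish the event."""
    changes = detect_status_changes(known, fetch_statuses())
    for event in changes:
        known[event.resource_id] = event.new_status  # stands in for the status database write
        publish(event)                               # stands in for publishing to RTE
    return len(changes)


# Usage: two cycles against a fake poller; the second cycle sees no change.
known_statuses = {"db-1": "provisioning"}
published: List[StatusEvent] = []
fake_poll = lambda: {"db-1": "ready", "db-2": "provisioning"}
first = poll_once(known_statuses, fake_poll, published.append)
second = poll_once(known_statuses, fake_poll, published.append)
```

Publishing only on change keeps the event stream quiet while polling stays continuous.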
4. Usage Collector (cmd/usage-collector/)
Purpose: Record per-second resource usage
Responsibilities:
- Subscribes to control plane events
- Records resource creation/deletion
- Stores usage in the usage_records table
- Publishes usage events to RTE
- Handles teardown cleanup
Deployment: Kubernetes StatefulSet, 1 instance
Trigger: Event-driven (Kafka events)
Storage: PostgreSQL usage_records table
5. Usage Aggregator (cmd/usage-aggregator/)
Purpose: Aggregate per-second usage into hourly summaries
Responsibilities:
- Runs hourly at :05 past each hour
- Queries usage_records from the previous hour
- Groups usage by resource and organization
- Calculates exact duration in hours
- Stores summaries in the usage_summaries table
Deployment: Kubernetes CronJob
Schedule: 5 * * * * (every hour at :05)
Formula: cost = (duration_seconds / 3600) × hourly_rate
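As a rough sketch of the aggregation step, the function below groups one hour of per-second records by organization and resource and derives the duration in hours. It assumes one record per second of usage, represented here as a hypothetical `(organization_id, resource_id, recorded_at)` tuple; the real `usage_records` and `usage_summaries` schemas are not shown in this document, so the field names are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def aggregate_hour(records, hour_start):
    """Summarize the records that fall inside [hour_start, hour_start + 1h)."""
    hour_end = hour_start + timedelta(hours=1)
    seconds = defaultdict(int)
    for org_id, resource_id, recorded_at in records:
        if hour_start <= recorded_at < hour_end:
            seconds[(org_id, resource_id)] += 1  # assumption: one record == one second
    return [
        {
            "organization_id": org_id,
            "resource_id": resource_id,
            "hour_start": hour_start,
            "duration_seconds": secs,
            "duration_hours": secs / 3600,
        }
        for (org_id, resource_id), secs in sorted(seconds.items())
    ]


# Usage: 300 seconds of usage inside the hour, plus one record from the next hour.
hour = datetime(2024, 1, 1, 10, tzinfo=timezone.utc)
records = [("org-1", "db-1", hour + timedelta(seconds=i)) for i in range(300)]
records.append(("org-1", "db-1", hour + timedelta(hours=1)))
summaries = aggregate_hour(records, hour)
```

The half-open window keeps a record that lands exactly on the hour boundary from being counted in two consecutive runs.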
6. Meter Sync (cmd/meter-sync/)
Purpose: Sync aggregated usage to Stripe for billing
Responsibilities:
- Runs daily at 2 AM UTC
- Queries unbilled summaries from past 24 hours
- Posts to Stripe Billing Meter Events API
- Publishes audit events to RTE
- Tracks billing status and errors
Deployment: Kubernetes CronJob
Schedule: 0 2 * * * (daily at 2 AM UTC)
Integration: Stripe Billing API
Audit: RTE events published to usage.metered topic
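One way to picture the sync step is a pure function that selects unbilled summaries from the past 24 hours and builds one meter event per summary. The sketch below makes several assumptions: the summary fields (`id`, `is_billed`, `hour_start`, `stripe_customer_id`, `duration_hours`) and the meter name `resource_usage_hours` are hypothetical, and the event layout follows the general shape of a Stripe meter event (`event_name`, `identifier`, `timestamp`, `payload`), which should be checked against Stripe's API reference. No Stripe call is made here.

```python
from datetime import datetime, timedelta, timezone


def build_meter_events(summaries, now, event_name="resource_usage_hours"):
    """Build one meter-event dict per unbilled summary from the past 24 hours."""
    window_start = now - timedelta(hours=24)
    events = []
    for summary in summaries:
        if summary["is_billed"]:
            continue  # already synced
        if not (window_start <= summary["hour_start"] < now):
            continue  # outside the past-24-hours window
        events.append({
            "event_name": event_name,
            # A stable identifier derived from the summary lets retries be deduplicated.
            "identifier": f"summary-{summary['id']}",
            "timestamp": int(summary["hour_start"].timestamp()),
            "payload": {
                "stripe_customer_id": summary["stripe_customer_id"],
                "value": str(summary["duration_hours"]),
            },
        })
    return events


# Usage: only the unbilled summary inside the window produces an event.
now = datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc)
sample = [
    {"id": 1, "is_billed": True, "hour_start": now - timedelta(hours=3),
     "stripe_customer_id": "cus_example", "duration_hours": 1.0},
    {"id": 2, "is_billed": False, "hour_start": now - timedelta(hours=2),
     "stripe_customer_id": "cus_example", "duration_hours": 0.5},
    {"id": 3, "is_billed": False, "hour_start": now - timedelta(hours=30),
     "stripe_customer_id": "cus_example", "duration_hours": 1.0},
]
meter_events = build_meter_events(sample, now)
```

Deriving the identifier from the summary ID rather than generating a random one means a job retry produces the same identifier for the same summary.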
7. Garbage Collector (cmd/garbage-collector/)
Purpose: Clean up orphaned and deleted resources
Responsibilities:
- Runs daily at 3 AM UTC
- Detects orphaned PVCs without DB records
- Cleans orphaned Kubernetes namespaces
- Permanently purges soft-deleted records (after retention period)
- Cleans old backups (>90 days)
Deployment: Kubernetes CronJob
Schedule: 0 3 * * * (daily at 3 AM UTC)
Safety: Dry-run mode enabled by default
Retention: 30-day default for soft-deleted records
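The retention and dry-run behavior can be illustrated with a small sketch. It assumes each record is a dict with hypothetical `id` and `deleted_at` fields, and that `delete` is whatever callable performs the hard delete; none of these names come from the actual service.

```python
from datetime import datetime, timedelta, timezone


def purge_soft_deleted(records, now, retention_days=30, dry_run=True, delete=None):
    """Return IDs past the retention period; hard-delete them only when dry_run is False."""
    cutoff = now - timedelta(days=retention_days)
    expired = [
        record["id"]
        for record in records
        if record["deleted_at"] is not None and record["deleted_at"] <= cutoff
    ]
    if not dry_run:
        for record_id in expired:
            delete(record_id)  # hypothetical hard-delete callback
    return expired


# Usage: one record is past retention, one is recent, one was never deleted.
now = datetime(2024, 3, 1, 3, 0, tzinfo=timezone.utc)
rows = [
    {"id": "a", "deleted_at": now - timedelta(days=45)},
    {"id": "b", "deleted_at": now - timedelta(days=5)},
    {"id": "c", "deleted_at": None},
]
removed = []
reported = purge_soft_deleted(rows, now)                                   # dry run: report only
purged = purge_soft_deleted(rows, now, dry_run=False, delete=removed.append)
```

Defaulting `dry_run` to `True` mirrors the safety setting above: a run reports what it would remove, and deletion needs an explicit opt-in.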
Data Flow
Resource Creation
User Request
↓
API Server (/api/v1/resources)
↓
Validate & Authorize
↓
Create in Database
↓
Provision Infrastructure (K8s, networking, etc.)
↓
Publish CONTROL_RESOURCE_CREATED Event → RTE/Kafka
↓
┌───────────────────────────────────────────┐
│ Event Bridge (streams to WebSocket)       │
│ Status Worker (monitors health)           │
│ Usage Collector (starts recording)        │
└───────────────────────────────────────────┘
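The fan-out at the end of this flow, where one published event reaches three independent consumers, can be sketched with an in-memory stand-in for the event stream. `InMemoryBus` and the event fields are hypothetical; in Kafka the same effect typically comes from each service reading the topic with its own consumer group.

```python
from collections import defaultdict


class InMemoryBus:
    """Toy publish/subscribe bus used only to illustrate event fan-out."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event):
        # Every subscriber of this event type receives its own copy of the event.
        for handler in self._subscribers.get(event["type"], []):
            handler(dict(event))


# Usage: the three consumers from the diagram each react to the same event.
received = []
bus = InMemoryBus()
bus.subscribe("CONTROL_RESOURCE_CREATED", lambda e: received.append(("event-bridge", e["resource_id"])))
bus.subscribe("CONTROL_RESOURCE_CREATED", lambda e: received.append(("status-worker", e["resource_id"])))
bus.subscribe("CONTROL_RESOURCE_CREATED", lambda e: received.append(("usage-collector", e["resource_id"])))
bus.publish({"type": "CONTROL_RESOURCE_CREATED", "resource_id": "db-1"})
bus.publish({"type": "CONTROL_RESOURCE_DELETED", "resource_id": "db-1"})  # no subscribers yet
```

Because the API Server only publishes the event, a new consumer can be added without changing the creation path.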
Usage Recording & Billing
Resource Created
↓
Usage Collector (subscribes to events)
↓
Record per-second usage → usage_records table
↓
Hourly at :05 (Usage Aggregator CronJob)
↓
Aggregate summaries → usage_summaries table
↓
Daily at 2 AM (Meter Sync CronJob)
↓
POST to Stripe Billing API
↓
├─ Update usage_summaries: is_billed=true
├─ Record stripe_event_id for deduplication
└─ Publish RTE audit event
↓
Stripe Invoice (end of billing period)
↓
Customer Charged
Resource Cleanup
Resource Deleted
↓
API Server: soft delete (set deleted_at timestamp)
↓
Event published: CONTROL_RESOURCE_DELETED
↓
Usage Collector: record deletion event
↓
Record remains in DB (soft delete)
↓
Daily at 3 AM (Garbage Collector CronJob)
↓
├─ 30 days after deletion: Hard delete from DB
├─ Detect orphaned K8s resources: Clean up
└─ Old backups (>90 days): Remove
Billing Model
Per-Second Charging
Sparbz Cloud uses per-second billing with no hourly minimum:
Formula: cost = (duration_seconds / 3600) × hourly_rate
Example - Database (Pro Tier = $29/month):
- Hourly rate: $29 ÷ 730 = $0.0397/hour
- 5-minute test: (300 seconds / 3600) × $0.0397 = $0.0033
- 1 day: (86400 seconds / 3600) × $0.0397 = $0.953
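The arithmetic in this example can be checked with a few lines of Python. The helper names are illustrative, and the 730 hours-per-month divisor is taken from the example above; `Decimal` is used so the rounding is explicit.

```python
from decimal import Decimal

HOURS_PER_MONTH = Decimal(730)  # divisor used in the example above


def hourly_rate(monthly_price):
    """Convert a monthly price into an hourly rate."""
    return Decimal(monthly_price) / HOURS_PER_MONTH


def usage_cost(duration_seconds, rate):
    """cost = (duration_seconds / 3600) x hourly_rate"""
    return Decimal(duration_seconds) / Decimal(3600) * rate


rate = hourly_rate("29")
five_minutes = usage_cost(300, rate)
one_day = usage_cost(86400, rate)

print(round(rate, 4))          # 0.0397
print(round(five_minutes, 4))  # 0.0033
print(round(one_day, 3))       # 0.953 (24 hours at the hourly rate)
```

Rounding only at display time avoids compounding the error from a pre-rounded hourly rate over long durations.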
Billing Timeline
- Second 0: Resource created
- Per-second: Usage recorded (1 record per second)
- Hourly: Records aggregated into summary
- Daily: Summaries synced to Stripe
- Monthly: Invoice generated by Stripe
- Payment: Charged on billing date
No Minimum Charges
Unlike providers that enforce a minimum billing increment (for example, the 60-second minimum on AWS EC2 and Google Compute Engine instances), Sparbz Cloud charges for exact usage. Short tests, experiments, and development work cost pennies, not dollars.
Infrastructure Layers
Kubernetes Services
- API Server: ClusterIP service with ingress
- Event Bridge: ClusterIP service for WebSocket/HTTP
- Status Worker: Internal communication only
- Usage Collector: StatefulSet for state consistency
- CronJobs: Three separate scheduled jobs
Persistence
- PostgreSQL: Primary data store
  - User accounts, organizations, resources
  - Usage records and summaries
  - Billing status tracking
  - Soft-deleted records for recovery
- Kafka/RTE: Event streaming
  - Control plane events (resource lifecycle)
  - Data plane events (status, readiness)
  - Usage events (aggregation, billing)
  - Audit trail (billing events)
- Redis/Valkey: Session and connection state
  - Event Bridge connections registry
  - User sessions
  - Caching
External Services
- Stripe: Billing and invoicing
- Kubernetes API: Resource provisioning
- Harbor Registry: Container images
- S3/MinIO: Object storage
- HashiCorp Vault: Secrets management
Deployment Model
Production Kubernetes
Deployments:
- szc-api (2 replicas)
- szc-event-bridge (2 replicas)
- szc-status-worker (1 replica)
StatefulSets:
- szc-usage-collector (1 instance)
CronJobs:
- szc-usage-aggregator (hourly)
- szc-meter-sync (daily 2 AM)
- szc-garbage-collector (daily 3 AM)
Resource Allocation
| Service | Memory | CPU | Replicas |
|---|---|---|---|
| API | 512Mi | 500m | 2 |
| Event Bridge | 256Mi | 200m | 2 |
| Status Worker | 256Mi | 250m | 1 |
| Usage Collector | 256Mi | 200m | 1 |
| Usage Aggregator | 512Mi | 500m | - (CronJob) |
| Meter Sync | 512Mi | 500m | - (CronJob) |
| Garbage Collector | 512Mi | 500m | - (CronJob) |
Scaling
Horizontal Scaling
- API Server: Scale up replicas for load
- Event Bridge: Scale for concurrent WebSocket connections
- Status Worker: Usually 1 (no scaling needed)
- Usage Collector: Usually 1 (StatefulSet for consistency)
Vertical Scaling
- Increase memory/CPU for CronJobs if processing large datasets
- Database connection pooling for increased API load
- Redis cluster for session state at scale
High Availability
Multi-Replica Services
- API and Event Bridge run 2+ replicas for zero-downtime deployments
- Pod Disruption Budget keeps at least 1 replica available during voluntary disruptions
- LoadBalancer distributes traffic across replicas
CronJob Reliability
- Automated retries on failure
- Job history for debugging
- Status tracking in database
Data Durability
- PostgreSQL automated backups
- Kafka topic replication
- Event audit trail for reconstruction
Monitoring & Observability
Logs
- Structured logging from all services
- Centralized log aggregation (Loki/ELK recommended)
- Service-specific filtering via labels
Metrics
- Prometheus metrics on the /metrics endpoint
- Custom business metrics (billable hours, error rates)
- Kubernetes metrics (pod CPU/memory)
Alerts
- Job failure alerts (CronJobs)
- API error rate spikes
- Database connection pool exhaustion
- Stripe API failures
Security
- TLS encryption for all external communication
- JWT tokens for API authentication
- RBAC in Kubernetes
- Network policies for pod communication
- Secrets stored in Kubernetes Secrets (not in code)
- Vault for additional secrets management
Summary
Sparbz Cloud's architecture is designed for:
✅ Accuracy: Per-second billing without minimum charges
✅ Reliability: Multi-replica services with automatic failover
✅ Scalability: Horizontal scaling for all services
✅ Observability: Comprehensive logging and metrics
✅ Security: Defense-in-depth with encryption and RBAC
✅ Simplicity: Well-defined service boundaries and data flow
See the Operations Guide for deployment procedures.