Platform Architecture

Sparbz Cloud is built on a distributed microservices architecture running on Kubernetes. This document describes the system components, data flow, and operational model.

System Overview

The platform consists of 7 specialized microservices that work together to provide managed infrastructure:

┌─────────────────────────────────────────────────────────────┐
│                       User Interface                        │
│                (API, Console Dashboard, CLI)                │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────┐
│                 API Server (REST Endpoint)                  │
│  - Authentication & Authorization                           │
│  - Resource Management (CRUD)                               │
│  - Webhook Management                                       │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────┐
│             Real-Time Events (Kafka/RTE Stream)             │
│  - Control Plane Events (creation, updates, deletion)       │
│  - Data Plane Events (status, readiness)                    │
│  - Usage Events (per-second recordings)                     │
└─────────────┬─────────────────┬──────────────┬──────────────┘
              │                 │              │
      ┌───────▼────┐     ┌──────▼─────┐     ┌──▼────────────┐
      │   Event    │     │   Status   │     │     Usage     │
      │   Bridge   │     │   Worker   │     │   Collector   │
      │(WebSocket) │     │ (Monitor)  │     │  (Recording)  │
      └────────────┘     └────────────┘     └───┬───────────┘
                                                │
                     ┌──────────────────────────┘
                     │
       ┌─────────────▼──────────────┐
       │  Hourly at :05 past hour   │
       │  Usage Aggregator CronJob  │
       │ (Aggregates to summaries)  │
       └──────────┬─────────────────┘
                  │
       ┌──────────▼─────────────────────────────┐
       │  Daily at 2 AM UTC - Meter Sync        │
       │  (Syncs to Stripe Billing API)         │
       └──────────┬──────────────┬──────────────┘
                  │              │
          ┌───────▼──┐    ┌──────▼───────┐
          │  Stripe  │    │  RTE Audit   │
          │  Billing │    │    Events    │
          └──────────┘    └──────────────┘

       ┌──────────▼──────────────────────────────┐
       │  Daily at 3 AM UTC - Garbage Collector  │
       │  (Cleans orphaned resources)            │
       └─────────────────────────────────────────┘

Core Microservices

1. API Server (cmd/api/)

Purpose: Main REST endpoint for all resource operations

Responsibilities:

  • User authentication (JWT tokens, API keys)
  • Resource CRUD operations (create, read, update, delete)
  • User management and authorization
  • API documentation (Swagger)
  • Health checks and metrics

Deployment: Kubernetes Deployment, 2 replicas

Port: 8080

Environment: Production

2. Event Bridge (cmd/event-bridge/)

Purpose: Real-time event streaming to clients

Responsibilities:

  • Subscribes to all Kafka topics
  • Maintains WebSocket connections
  • Streams events to connected clients
  • Connection state management via Redis
  • HTTP long-polling fallback

Deployment: Kubernetes Deployment, 2 replicas

Port: 8082

Use Case: Real-time UI updates (status, progress indicators)

3. Status Worker (cmd/status-worker/)

Purpose: Monitor resource health and status

Responsibilities:

  • Polls resource status from Kubernetes
  • Checks database connectivity
  • Verifies pod readiness
  • Records status changes
  • Publishes status events to RTE

Deployment: Kubernetes Deployment, 1 replica

Schedule: Continuous polling

Updates: Status database records
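One polling pass of this kind might look like the sketch below. It is an illustration, not the Status Worker's actual code: `poll_once`, the `fetch_status` and `publish` callables, and the event fields are hypothetical stand-ins for the Kubernetes and RTE clients, and the sketch assumes an event is published only when a status differs from the last one seen.

```python
from typing import Callable, Dict, Iterable, List


def poll_once(
    resource_ids: Iterable[str],
    fetch_status: Callable[[str], str],
    last_seen: Dict[str, str],
    publish: Callable[[dict], None],
) -> int:
    """Check each resource once; publish an event only when its status changed."""
    changes = 0
    for rid in resource_ids:
        status = fetch_status(rid)
        previous = last_seen.get(rid)
        if previous != status:
            publish({"resource_id": rid, "previous": previous, "status": status})
            last_seen[rid] = status  # remember the new status for the next pass
            changes += 1
    return changes


# Usage with fakes in place of Kubernetes (a dict) and the event stream (a list).
statuses = {"db-1": "provisioning", "db-2": "ready"}
events: List[dict] = []
seen: Dict[str, str] = {}
first = poll_once(statuses, statuses.__getitem__, seen, events.append)   # both are new
statuses["db-1"] = "ready"
second = poll_once(statuses, statuses.__getitem__, seen, events.append)  # only db-1 changed
```

Publishing only on change keeps the event stream quiet while resources are stable, at the cost of holding the last known status in memory or in the database.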

4. Usage Collector (cmd/usage-collector/)

Purpose: Record per-second resource usage

Responsibilities:

  • Subscribes to control plane events
  • Records resource creation/deletion
  • Stores usage in usage_records table
  • Publishes usage events to RTE
  • Handles teardown cleanup

Deployment: Kubernetes StatefulSet, 1 instance

Trigger: Event-driven (Kafka events)

Storage: PostgreSQL usage_records table
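The event-driven recording described here can be sketched as follows. This is a minimal illustration under stated assumptions: the event dictionary shape, the `UsageRecord` fields, and the in-memory collections standing in for the usage_records table are hypothetical. Only the event names CONTROL_RESOURCE_CREATED and CONTROL_RESOURCE_DELETED come from this document.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class UsageRecord:
    """Hypothetical row shape for the usage_records table."""
    resource_id: str
    org_id: str
    started_at: datetime
    ended_at: Optional[datetime] = None


class UsageCollector:
    """Opens a usage record on creation and closes it on deletion."""

    def __init__(self) -> None:
        # In-memory stand-ins for PostgreSQL, keyed by resource ID.
        self.open_records: Dict[str, UsageRecord] = {}
        self.closed_records: List[UsageRecord] = []

    def handle(self, event: dict) -> None:
        rid = event["resource_id"]
        if event["type"] == "CONTROL_RESOURCE_CREATED":
            # A redelivered create event must not reset the start time.
            if rid not in self.open_records:
                self.open_records[rid] = UsageRecord(
                    rid, event["org_id"], started_at=event["timestamp"]
                )
        elif event["type"] == "CONTROL_RESOURCE_DELETED":
            record = self.open_records.pop(rid, None)
            if record is not None:  # ignore deletes for unknown resources
                record.ended_at = event["timestamp"]
                self.closed_records.append(record)


collector = UsageCollector()
t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 1, 12, 5, 0, tzinfo=timezone.utc)
collector.handle({"type": "CONTROL_RESOURCE_CREATED", "resource_id": "db-1",
                  "org_id": "org-1", "timestamp": t0})
collector.handle({"type": "CONTROL_RESOURCE_DELETED", "resource_id": "db-1",
                  "timestamp": t1})
billable = collector.closed_records[0]
print((billable.ended_at - billable.started_at).total_seconds())  # 300.0
```

Storing a start and end timestamp per resource is one way to get per-second precision without writing a row every second. Whether the real collector stores intervals or individual per-second rows is not stated here.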

5. Usage Aggregator (cmd/usage-aggregator/)

Purpose: Aggregate per-second usage into hourly summaries

Responsibilities:

  • Runs hourly at :05 past each hour
  • Queries usage_records from previous hour
  • Groups usage by resource and organization
  • Calculates exact duration in hours
  • Stores summaries in usage_summaries table

Deployment: Kubernetes CronJob

Schedule: 5 * * * * (every hour at :05)

Formula: cost = (duration_seconds / 3600) × hourly_rate
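The hourly aggregation can be illustrated with a short sketch. The record tuple layout and the function names are assumptions made for the example. The sketch also assumes that a run at :05 covers the full clock hour that just ended and that usage intervals are clipped to that window.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple


def previous_hour_window(run_at: datetime) -> Tuple[datetime, datetime]:
    """Return [start, end) of the full clock hour before the run time."""
    end = run_at.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=1), end


def aggregate(records, run_at: datetime) -> Dict[Tuple[str, str], float]:
    """Sum billable hours per (org_id, resource_id) inside the window.

    Each record is a hypothetical (org_id, resource_id, started_at, ended_at)
    tuple; ended_at is None while the resource is still running.
    """
    start, end = previous_hour_window(run_at)
    seconds = defaultdict(float)
    for org_id, resource_id, started_at, ended_at in records:
        # Clip the usage interval to the hour being aggregated.
        lo = max(started_at, start)
        hi = min(ended_at or end, end)
        if hi > lo:
            seconds[(org_id, resource_id)] += (hi - lo).total_seconds()
    return {key: secs / 3600 for key, secs in seconds.items()}


utc = timezone.utc
run_at = datetime(2024, 1, 1, 10, 5, tzinfo=utc)  # the 10:05 run covers 09:00-10:00
records = [
    ("org-1", "db-1", datetime(2024, 1, 1, 9, 30, tzinfo=utc), datetime(2024, 1, 1, 9, 45, tzinfo=utc)),
    ("org-1", "db-2", datetime(2024, 1, 1, 8, 0, tzinfo=utc), None),   # ran the whole hour
    ("org-2", "db-3", datetime(2024, 1, 1, 10, 1, tzinfo=utc), None),  # started after the window
]
summaries = aggregate(records, run_at)
print(summaries)  # {('org-1', 'db-1'): 0.25, ('org-1', 'db-2'): 1.0}
```

Multiplying each value by the resource's hourly rate gives the cost from the formula above.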

6. Meter Sync (cmd/meter-sync/)

Purpose: Sync aggregated usage to Stripe for billing

Responsibilities:

  • Runs daily at 2 AM UTC
  • Queries unbilled summaries from past 24 hours
  • Posts to Stripe Billing Meter Events API
  • Publishes audit events to RTE
  • Tracks billing status and errors

Deployment: Kubernetes CronJob

Schedule: 0 2 * * * (daily at 2 AM UTC)

Integration: Stripe Billing API

Audit: RTE events published to usage.metered topic
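A sync loop that tracks billing status and errors might be structured like the sketch below. It deliberately does not call the Stripe SDK: `post_meter_event` is a hypothetical callable standing in for the billing client, and the `UsageSummary` fields other than is_billed and stripe_event_id are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class UsageSummary:
    """Hypothetical row shape for the usage_summaries table."""
    summary_id: str
    org_id: str
    hours: float
    is_billed: bool = False
    stripe_event_id: Optional[str] = None
    last_error: Optional[str] = None


def sync_unbilled(
    summaries: List[UsageSummary],
    post_meter_event: Callable[[str, str, float], str],
) -> int:
    """Post each unbilled summary once and return how many were billed.

    post_meter_event(idempotency_key, org_id, hours) is a placeholder for the
    real billing client and is expected to return the provider's event ID.
    """
    billed = 0
    for summary in summaries:
        if summary.is_billed:
            continue  # already synced on an earlier run
        try:
            # The summary ID is passed as an idempotency key so the provider
            # can deduplicate a retry after a crash or timeout.
            event_id = post_meter_event(summary.summary_id, summary.org_id, summary.hours)
        except Exception as exc:
            summary.last_error = str(exc)  # leave unbilled; the next run retries
            continue
        summary.is_billed = True
        summary.stripe_event_id = event_id
        summary.last_error = None
        billed += 1
    return billed
```

Marking a summary as billed only after the call succeeds means a failed post is retried on the next daily run rather than silently dropped.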

7. Garbage Collector (cmd/garbage-collector/)

Purpose: Clean up orphaned and deleted resources

Responsibilities:

  • Runs daily at 3 AM UTC
  • Detects orphaned PVCs without DB records
  • Cleans orphaned Kubernetes namespaces
  • Permanently purges soft-deleted records (after retention period)
  • Cleans old backups (>90 days)

Deployment: Kubernetes CronJob

Schedule: 0 3 * * * (daily at 3 AM UTC)

Safety: Dry-run mode enabled by default

Retention: 30-day default for soft-deleted records
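The retention and dry-run behaviour can be sketched in a few lines. The function names and the dictionary standing in for the database are hypothetical. The 30-day retention and the dry-run default mirror the values stated above.

```python
from datetime import datetime, timedelta, timezone


def purge_soft_deleted(rows, now, retention_days=30, dry_run=True):
    """Return IDs past retention; remove them only when dry_run is False.

    rows maps a record ID to its deleted_at timestamp (None if still live).
    """
    cutoff = now - timedelta(days=retention_days)
    expired = sorted(
        rid for rid, deleted_at in rows.items()
        if deleted_at is not None and deleted_at <= cutoff
    )
    if not dry_run:
        for rid in expired:
            del rows[rid]
    return expired


def find_orphans(cluster_names, db_names):
    """Names that exist in the cluster but have no database record."""
    return sorted(set(cluster_names) - set(db_names))


utc = timezone.utc
now = datetime(2024, 3, 1, tzinfo=utc)
rows = {
    "res-old": datetime(2024, 1, 1, tzinfo=utc),     # deleted 60 days ago
    "res-recent": datetime(2024, 2, 20, tzinfo=utc), # deleted 10 days ago
    "res-live": None,
}
print(purge_soft_deleted(rows, now))               # ['res-old'], nothing removed
print(find_orphans(["pvc-a", "pvc-b"], ["pvc-a"])) # ['pvc-b']
```

Running in dry-run mode first makes it possible to review the candidate list before anything is permanently removed.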

Data Flow

Resource Creation

User Request
      ↓
API Server (/api/v1/resources)
      ↓
Validate & Authorize
      ↓
Create in Database
      ↓
Provision Infrastructure (K8s, networking, etc.)
      ↓
Publish CONTROL_RESOURCE_CREATED Event → RTE/Kafka
      ↓
┌───────────────────────────────────────────┐
│ Event Bridge (streams to WebSocket)       │
│ Status Worker (monitors health)           │
│ Usage Collector (starts recording)        │
└───────────────────────────────────────────┘
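The creation flow, with one published event reaching three independent consumers, can be illustrated with an in-memory sketch. `InMemoryBus`, `create_resource`, the "control" topic name, and the `res-N` ID scheme are all hypothetical. A real deployment would use Kafka consumer groups rather than direct callbacks.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[dict], None]


class InMemoryBus:
    """Toy stand-in for Kafka/RTE: every subscriber of a topic sees every event."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._handlers[topic]:
            handler(event)


def create_resource(bus: InMemoryBus, db: Dict[str, dict], request: dict) -> str:
    """Validate, persist, then announce the new resource."""
    if not request.get("name"):
        raise ValueError("name is required")
    resource_id = f"res-{len(db) + 1}"
    db[resource_id] = {"name": request["name"], "deleted_at": None}
    # Infrastructure provisioning (Kubernetes, networking) would run here.
    bus.publish("control", {"type": "CONTROL_RESOURCE_CREATED", "resource_id": resource_id})
    return resource_id


bus = InMemoryBus()
inboxes: Dict[str, List[dict]] = {"event-bridge": [], "status-worker": [], "usage-collector": []}
for inbox in inboxes.values():
    bus.subscribe("control", inbox.append)

db: Dict[str, dict] = {}
rid = create_resource(bus, db, {"name": "orders-db"})
print(rid, {name: len(inbox) for name, inbox in inboxes.items()})
# res-1 {'event-bridge': 1, 'status-worker': 1, 'usage-collector': 1}
```

The event is published only after the database write, so a consumer that reacts to the event can expect to find the record.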

Usage Recording & Billing

Resource Created
      ↓
Usage Collector (subscribes to events)
      ↓
Record per-second usage → usage_records table
      ↓
Hourly at :05 (Usage Aggregator CronJob)
      ↓
Aggregate summaries → usage_summaries table
      ↓
Daily at 2 AM (Meter Sync CronJob)
      ↓
POST to Stripe Billing API
      ├─ Update usage_summaries: is_billed=true
      ├─ Record stripe_event_id for deduplication
      └─ Publish RTE audit event
      ↓
Stripe Invoice (end of billing period)
      ↓
Customer Charged

Resource Cleanup

Resource Deleted
      ↓
API Server: soft delete (set deleted_at timestamp)
      ↓
Event published: CONTROL_RESOURCE_DELETED
      ↓
Usage Collector: record deletion event
      ↓
Record remains in DB (soft delete)
      ↓
Daily at 3 AM (Garbage Collector CronJob)
      ├─ 30 days after deletion: Hard delete from DB
      ├─ Detect orphaned K8s resources: Clean up
      └─ Old backups (>90 days): Remove

Billing Model

Per-Second Charging

Sparbz Cloud bills per second and never rounds usage up to a full hour:

Formula: cost = (duration_seconds / 3600) × hourly_rate

Example - Database (Pro Tier = $29/month):

  • Hourly rate: $29 ÷ 730 hours per month = $0.0397/hour
  • 5-minute test: (300 seconds / 3600) × $0.0397 = $0.0033
  • 1 day: (86,400 seconds / 3600) × $0.0397 = $0.953
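The arithmetic in this example can be reproduced with a few lines of code. The function and constant names are illustrative only. The 730-hours-per-month divisor and the $29 price are taken from the example above, and amounts are computed from the unrounded hourly rate.

```python
HOURS_PER_MONTH = 730  # divisor used in the example to turn a monthly price into an hourly rate


def hourly_rate(monthly_price: float) -> float:
    """Convert a monthly tier price into an hourly rate."""
    return monthly_price / HOURS_PER_MONTH


def cost(duration_seconds: float, rate_per_hour: float) -> float:
    """cost = (duration_seconds / 3600) x hourly_rate"""
    return (duration_seconds / 3600) * rate_per_hour


rate = hourly_rate(29)
print(f"hourly rate: ${rate:.4f}")                # hourly rate: $0.0397
print(f"5 minutes:   ${cost(300, rate):.4f}")     # 5 minutes:   $0.0033
print(f"1 day:       ${cost(86_400, rate):.3f}")  # 1 day:       $0.953
```

A production billing system would more likely use decimal or integer minor-unit arithmetic than floats. Floats are used here only to keep the sketch short.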

Billing Timeline

  1. Second 0: Resource created
  2. Per-second: Usage recorded (1 record per second)
  3. Hourly: Records aggregated into summary
  4. Daily: Summaries synced to Stripe
  5. Monthly: Invoice generated by Stripe
  6. Payment: Charged on billing date

No Minimum Charges

Some providers bill per second but still apply a minimum duration per instance (for example, a 1-minute minimum on AWS EC2 and Google Compute Engine). Sparbz Cloud charges for exact usage, so short tests, experiments, and development work cost pennies, not dollars.

Infrastructure Layers

Kubernetes Services

  • API Server: ClusterIP service with ingress
  • Event Bridge: ClusterIP service for WebSocket/HTTP
  • Status Worker: Internal communication only
  • Usage Collector: StatefulSet for state consistency
  • CronJobs: Three separate scheduled jobs

Persistence

  • PostgreSQL: Primary data store

    • User accounts, organizations, resources
    • Usage records and summaries
    • Billing status tracking
    • Soft-deleted records for recovery
  • Kafka/RTE: Event streaming

    • Control plane events (resource lifecycle)
    • Data plane events (status, readiness)
    • Usage events (aggregation, billing)
    • Audit trail (billing events)
  • Redis/Valkey: Session and connection state

    • Event Bridge connections registry
    • User sessions
    • Caching

External Services

  • Stripe: Billing and invoicing
  • Kubernetes API: Resource provisioning
  • Harbor Registry: Container images
  • S3/MinIO: Object storage
  • HashiCorp Vault: Secrets management

Deployment Model

Production Kubernetes

Deployments:
- szc-api (2 replicas)
- szc-event-bridge (2 replicas)
- szc-status-worker (1 replica)

StatefulSets:
- szc-usage-collector (1 instance)

CronJobs:
- szc-usage-aggregator (hourly)
- szc-meter-sync (daily 2 AM)
- szc-garbage-collector (daily 3 AM)

Resource Allocation

Service             Memory   CPU     Replicas
API                 512Mi    500m    2
Event Bridge        256Mi    200m    2
Status Worker       256Mi    250m    1
Usage Collector     256Mi    200m    1
Usage Aggregator    512Mi    500m    - (CronJob)
Meter Sync          512Mi    500m    - (CronJob)
Garbage Collector   512Mi    500m    - (CronJob)
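For capacity planning, the steady-state footprint of the always-on services can be totalled from the table. The sketch below treats the figures as per-replica values, which is an assumption, since the table does not say whether they are requests or limits. The parser handles only the Mi, Gi, and millicore forms that appear here, not the full Kubernetes quantity syntax.

```python
def parse_memory_mi(quantity: str) -> int:
    """Parse a memory quantity given in Mi or Gi into mebibytes."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported memory unit: {quantity}")


def parse_cpu_millicores(quantity: str) -> int:
    """Parse a CPU quantity such as '500m' or '2' into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


# (memory, cpu, replicas) for the always-on services; CronJobs run only briefly.
ALWAYS_ON = {
    "API": ("512Mi", "500m", 2),
    "Event Bridge": ("256Mi", "200m", 2),
    "Status Worker": ("256Mi", "250m", 1),
    "Usage Collector": ("256Mi", "200m", 1),
}

total_memory = sum(parse_memory_mi(mem) * n for mem, _, n in ALWAYS_ON.values())
total_cpu = sum(parse_cpu_millicores(cpu) * n for _, cpu, n in ALWAYS_ON.values())
print(f"{total_memory}Mi memory, {total_cpu}m CPU")  # 2048Mi memory, 1850m CPU
```

Each CronJob adds up to 512Mi and 500m on top of this while it runs.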

Scaling

Horizontal Scaling

  • API Server: Scale up replicas for load
  • Event Bridge: Scale for concurrent WebSocket connections
  • Status Worker: Usually 1 (no scaling needed)
  • Usage Collector: Usually 1 (StatefulSet for consistency)

Vertical Scaling

  • Increase memory/CPU for CronJobs if processing large datasets
  • Database connection pooling for increased API load
  • Redis cluster for session state at scale

High Availability

Multi-Replica Services

  • API and Event Bridge run 2+ replicas for zero-downtime deployments
  • A PodDisruptionBudget keeps at least 1 replica available during voluntary disruptions such as node drains
  • The Kubernetes Service and ingress distribute traffic across replicas

CronJob Reliability

  • Automated retries on failure
  • Job history for debugging
  • Status tracking in database

Data Durability

  • PostgreSQL automated backups
  • Kafka topic replication
  • Event audit trail for reconstruction

Monitoring & Observability

Logs

  • Structured logging from all services
  • Centralized log aggregation (Loki/ELK recommended)
  • Service-specific filtering via labels

Metrics

  • Prometheus metrics on /metrics endpoint
  • Custom business metrics (billable hours, error rates)
  • Kubernetes metrics (pod CPU/memory)

Alerts

  • Job failure alerts (CronJobs)
  • API error rate spikes
  • Database connection pool exhaustion
  • Stripe API failures

Security

  • TLS encryption for all external communication
  • JWT tokens for API authentication
  • RBAC in Kubernetes
  • Network policies for pod communication
  • Secrets stored in Kubernetes Secrets, never in code
  • Vault for additional secrets management

Summary

Sparbz Cloud's architecture is designed for:

✅ Accuracy: Per-second billing without minimum charges
✅ Reliability: Multi-replica services with automatic failover
✅ Scalability: Horizontal scaling for the API Server and Event Bridge
✅ Observability: Comprehensive logging and metrics
✅ Security: Defense-in-depth with encryption and RBAC
✅ Simplicity: Well-defined service boundaries and data flow

See the Operations Guide for deployment procedures.