Operations & Monitoring

Running TestBase Cloud in production involves more than spinning up containers. Use the practices below to keep environments reliable.

Infrastructure baseline

  • Compute: Deploy the Cloud API to Cloud Run or GCE. The reference deployment uses a GCE VM with Docker (testbase-agent-vm) and exposes ports 8080-8180 for container traffic.
  • Image: Build from testbase-cloud/docker/Dockerfile. It installs Node.js 20, Codex CLI, and the workspace sync utilities.
  • Storage: Provision a dedicated GCS bucket per environment (e.g., gs://computer-agents-firechatbot-a9654). Bucket objects are namespaced by container ID, simplifying cleanup.

Health checks

| Endpoint | Purpose | Action |
| --- | --- | --- |
| GET /health | Cloud API liveness/readiness | Feed into Cloud Run/GCE load balancer health checks. |
| GET /api/v1/containers/:id/health | Container-level status | Trigger retries or replacements if healthy === false. |
| cloud.getHealth(containerId) | SDK helper | Integrate into automation or dashboards. |

When a container reports healthy: false, fetch recent logs (cloud.getLogs(containerId, 200)) and decide whether to recreate it or escalate.
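That triage step can be automated. The sketch below shows one possible heuristic; the health shape, the log patterns, and the triageContainer helper are illustrative assumptions, not part of the SDK:

```typescript
// Hypothetical shape of a container health report; the real SDK response may differ.
interface ContainerHealth {
  healthy: boolean;
}

type Decision = "ok" | "recreate" | "escalate";

// Decide what to do with a container given its health report and recent logs.
// Heuristic: failures with a known-recoverable signature (crash exits,
// dropped connections, OOM kills) warrant recreation; anything else goes to a human.
function triageContainer(health: ContainerHealth, logLines: string[]): Decision {
  if (health.healthy) return "ok";
  const recoverable = /OOMKilled|ECONNRESET|exited with code/;
  return logLines.some((line) => recoverable.test(line)) ? "recreate" : "escalate";
}
```

In practice you would feed cloud.getHealth(containerId) and cloud.getLogs(containerId, 200) into this function and act on the returned decision.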

Logging

  • Container logs: GET /api/v1/containers/:id/logs?lines=100 or cloud.getLogs. Pipe into your log aggregation system for long-term retention.
  • Cloud API logs: The Express server uses Winston for logging. Configure sinks (Stackdriver, Datadog, etc.) by adjusting packages/cloud-api/src/app.ts.
  • Session artefacts: GCS stores diff summaries and transcripts—treat them as compliance records and ensure bucket retention policies match business requirements.
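Before shipping container logs to an aggregator, it helps to enrich each line with metadata so traces stay searchable. A minimal sketch of that enrichment step (the TaggedLog shape is an assumption; the aggregator client itself is omitted):

```typescript
// Tag raw container log lines with metadata before forwarding them to a log
// aggregation system. The transport (Datadog, Stackdriver, ...) is out of scope;
// this only shows the enrichment step.
interface TaggedLog {
  containerId: string;
  line: string;
  ingestedAt: string; // ISO-8601 timestamp
}

function tagLogs(containerId: string, lines: string[]): TaggedLog[] {
  const ingestedAt = new Date().toISOString();
  return lines.map((line) => ({ containerId, line, ingestedAt }));
}
```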

Scaling strategies

  • Start with ephemeral containers (persistContainer: false) to conserve compute. Storage is cheap; compute is not.
  • For high-throughput workloads, pre-create a small pool of persistent containers and reuse them to avoid cold starts.
  • Autoscale Cloud Run based on request concurrency, or size your GCE VM to handle peak container counts (each container consumes one port in 8080-8180).
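The warm-pool idea above can be sketched as a small acquire/release structure. Container creation is stubbed out here; in a real deployment the pool would be seeded with IDs of persistent containers created through the Cloud API:

```typescript
// Minimal warm-pool sketch for persistent containers (persistContainer: true).
// Reusing containers from the pool avoids cold-start latency on each run.
class ContainerPool {
  private idle: string[];
  private busy = new Set<string>();

  constructor(containerIds: string[]) {
    this.idle = [...containerIds];
  }

  // Hand out a warm container, or undefined if the pool is exhausted.
  acquire(): string | undefined {
    const id = this.idle.pop();
    if (id !== undefined) this.busy.add(id);
    return id;
  }

  // Return a container to the pool so the next caller reuses it.
  release(id: string): void {
    if (this.busy.delete(id)) this.idle.push(id);
  }
}
```

When acquire() returns undefined, fall back to creating an ephemeral container or queueing the request, depending on your latency budget.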

Security

  • Restrict API keys by environment (tb_dev_*, tb_test_*, tb_prod_*). Rotate keys regularly and audit usage.
  • Store secrets (OpenAI keys, MCP tokens) in Secret Manager or Vault, then inject them via environment variables when creating containers.
  • Configure IAM on the GCS bucket so only the Cloud API service account can read/write container workspaces.
  • Enable HTTPS on the Cloud API endpoint; Cloud Run does this by default, while GCE requires a load balancer or reverse proxy.
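A cheap guardrail for the environment-scoped key scheme is to reject keys whose prefix does not match the deployment environment before any request is processed. A minimal sketch (the helper name and Env type are illustrative, not part of the Cloud API):

```typescript
// Environments recognized by the key-prefix convention tb_dev_*, tb_test_*, tb_prod_*.
type Env = "dev" | "test" | "prod";

// Returns true only when the API key's prefix matches the current environment,
// preventing e.g. a dev key from being accepted by the prod deployment.
function keyMatchesEnv(apiKey: string, env: Env): boolean {
  return apiKey.startsWith(`tb_${env}_`);
}
```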

Disaster recovery

  • Workspace backups: Since workspaces live in GCS, enable bucket versioning or scheduled exports if you need point-in-time recovery.
  • Container recreation: Keep Docker image digests in CI outputs. Rebuilds are deterministic when you pin versions in docker/Dockerfile.
  • API redundancies: Deploy multiple Cloud API instances behind a load balancer. Clients can retry on 5xx errors because operations are idempotent (container IDs act as natural keys).
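Because operations are idempotent, client-side retries are safe to implement generically. A sketch of a retry wrapper with exponential backoff (the helper is an assumption, not an SDK feature):

```typescript
// Retry an idempotent Cloud API call on transient (5xx-style) failures.
// Since container IDs act as natural keys, repeating a request is safe.
async function retryIdempotent<T>(
  op: () => Promise<T>,
  attempts = 3,
  delayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Exponential backoff between attempts: delayMs, 2*delayMs, 4*delayMs, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs * 2 ** i));
    }
  }
  throw lastError;
}
```

In practice you would only retry on status codes that indicate a transient server problem, and surface 4xx errors immediately.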

Observability integrations

  • Emit metrics around container creation latency, execution duration, MCP failures, and file upload size. The Cloud SDK exposes timestamps you can forward to OpenTelemetry.
  • Tag containers with business metadata by storing labels in an external system keyed by container.id.
  • Combine session IDs (returned by run) with log traces to follow requests end-to-end.
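Deriving a creation-latency metric from the SDK's timestamps might look like the following. The timestamp field names and the metric shape are illustrative assumptions; adapt them to whatever your SDK version and OpenTelemetry exporter expect:

```typescript
// Hypothetical timestamps reported for a container's lifecycle.
interface ContainerTimestamps {
  requestedAt: string; // ISO-8601, when creation was requested
  readyAt: string;     // ISO-8601, when the container became ready
}

// A simple metric record shaped for an OpenTelemetry-style exporter.
interface Metric {
  name: string;
  value: number;
  attributes: Record<string, string>;
}

// Compute container creation latency in milliseconds from the two timestamps.
function creationLatencyMetric(containerId: string, t: ContainerTimestamps): Metric {
  return {
    name: "container.creation.latency_ms",
    value: Date.parse(t.readyAt) - Date.parse(t.requestedAt),
    attributes: { containerId },
  };
}
```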

Following these practices keeps TestBase Cloud predictable while your team scales agent workloads.
