# Operations & Monitoring
Running TestBase Cloud in production involves more than spinning up containers. Use the practices below to keep environments reliable.
## Infrastructure baseline

- Compute: Deploy the Cloud API to Cloud Run or GCE. The reference deployment uses a GCE VM with Docker (`testbase-agent-vm`) and exposes ports `8080-8180` for container traffic.
- Image: Build from `testbase-cloud/docker/Dockerfile`. It installs Node.js 20, the Codex CLI, and the workspace sync utilities.
- Storage: Provision a dedicated GCS bucket per environment (e.g., `gs://computer-agents-firechatbot-a9654`). Bucket objects are namespaced by container ID, which simplifies cleanup.
## Health checks

| Endpoint | Purpose | Action |
|---|---|---|
| `GET /health` | Cloud API liveness/readiness | Feed into Cloud Run/GCE load balancer health checks. |
| `GET /api/v1/containers/:id/health` | Container-level status | Trigger retries or replacements if `healthy === false`. |
| `cloud.getHealth(containerId)` | SDK helper | Integrate into automation or dashboards. |

When a container reports `healthy: false`, fetch recent logs (`cloud.getLogs(containerId, 200)`) and decide whether to recreate it or escalate.
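The check-then-fetch-logs flow above can be sketched as a small watchdog. `TestBaseClient` here is a hypothetical interface standing in for the real SDK client; only `getHealth` and `getLogs` mirror calls documented in this guide, and the stub client exists purely for illustration.

```typescript
// Hypothetical slice of the SDK surface used by the watchdog.
interface TestBaseClient {
  getHealth(containerId: string): Promise<{ healthy: boolean }>;
  getLogs(containerId: string, lines: number): Promise<string[]>;
}

type Verdict = { containerId: string; healthy: boolean; recentLogs: string[] };

// Check one container; if unhealthy, pull the last 200 log lines so an
// operator (or automation) can decide between recreate and escalate.
async function checkContainer(
  cloud: TestBaseClient,
  containerId: string,
): Promise<Verdict> {
  const { healthy } = await cloud.getHealth(containerId);
  const recentLogs = healthy ? [] : await cloud.getLogs(containerId, 200);
  return { containerId, healthy, recentLogs };
}

// Stub client for demonstration; the real SDK would call the Cloud API.
const stub: TestBaseClient = {
  async getHealth(id) {
    return { healthy: id !== "ctr-broken" };
  },
  async getLogs() {
    return ["error: workspace sync failed", "exit code 1"];
  },
};
```

In production you would call `checkContainer` on a timer per tracked container and feed the verdicts into your alerting pipeline.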
## Logging

- Container logs: `GET /api/v1/containers/:id/logs?lines=100` or `cloud.getLogs`. Pipe them into your log aggregation system for long-term retention.
- Cloud API logs: The Express server uses Winston logging. Configure sinks (Stackdriver, Datadog, etc.) by adjusting `packages/cloud-api/src/app.ts`.
- Session artefacts: GCS stores diff summaries and transcripts. Treat them as compliance records and ensure bucket retention policies match business requirements.
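A minimal forwarder for the logs endpoint might look like the sketch below. The URL shape matches the endpoint documented above, but the JSON response shape (`{ logs: string[] }`), the `baseUrl`, and the sink callback are all assumptions made for illustration.

```typescript
// Narrow fetch shape so the function is easy to test without a network.
type MinimalFetch = (url: string) => Promise<{ json(): Promise<unknown> }>;

// Build the documented logs endpoint URL for a container.
function containerLogsUrl(baseUrl: string, containerId: string, lines = 100): string {
  return `${baseUrl}/api/v1/containers/${encodeURIComponent(containerId)}/logs?lines=${lines}`;
}

// Pull recent logs and hand each line to an aggregation sink.
// NOTE: the `{ logs: string[] }` response body is an assumed shape.
async function forwardLogs(
  fetchImpl: MinimalFetch,
  baseUrl: string,
  containerId: string,
  sink: (line: string) => void,
): Promise<number> {
  const res = await fetchImpl(containerLogsUrl(baseUrl, containerId, 200));
  const { logs } = (await res.json()) as { logs: string[] };
  logs.forEach(sink);
  return logs.length;
}
```

Running this on a schedule gives you long-term retention outside the container's lifetime.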
## Scaling strategies

- Start with ephemeral containers (`persistContainer: false`) to conserve compute. Storage is cheap; compute is not.
- For high-throughput workloads, pre-create a small pool of persistent containers and reuse them to avoid cold starts.
- Autoscale Cloud Run based on request concurrency, or size your GCE VM to handle peak container counts (each container consumes one port in `8080-8180`).
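The pre-created pool mentioned above can be as simple as an in-memory checkout list. Everything below is an illustrative sketch, not part of the SDK: container IDs are opaque strings, and an exhausted pool signals the caller to fall back to an ephemeral container.

```typescript
// Tiny reuse pool for persistent containers (illustrative, not SDK API).
class ContainerPool {
  private idle: string[];
  private busy = new Set<string>();

  constructor(containerIds: string[]) {
    this.idle = [...containerIds];
  }

  // Hand out an idle container, or undefined if the pool is exhausted
  // (caller should then create an ephemeral container instead).
  acquire(): string | undefined {
    const id = this.idle.pop();
    if (id !== undefined) this.busy.add(id);
    return id;
  }

  // Return a container to the pool once its work is done.
  release(id: string): void {
    if (this.busy.delete(id)) this.idle.push(id);
  }

  get available(): number {
    return this.idle.length;
  }
}
```

Sizing the pool to your steady-state concurrency keeps cold starts off the hot path while the ephemeral fallback absorbs bursts.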
## Security

- Restrict API keys by environment (`tb_dev_*`, `tb_test_*`, `tb_prod_*`). Rotate keys regularly and audit usage.
- Store secrets (OpenAI keys, MCP tokens) in Secret Manager or Vault, then inject them via environment variables when creating containers.
- Configure IAM on the GCS bucket so only the Cloud API service account can read/write container workspaces.
- Enable HTTPS on the Cloud API endpoint; Cloud Run does this by default, while GCE requires a load balancer or reverse proxy.
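The secret-injection step can be sketched as a strict pass-through from the process environment (populated by Secret Manager or Vault at deploy time) into the container-creation request. The variable names here are examples; the only invariant is failing fast when a required secret is missing rather than starting a half-configured container.

```typescript
// Copy only the required secrets from a source environment into the env map
// passed at container creation. Throws if any required secret is absent.
function buildContainerEnv(
  source: Record<string, string | undefined>,
  required: string[],
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const key of required) {
    const value = source[key];
    if (value === undefined) {
      throw new Error(`missing required secret: ${key}`);
    }
    env[key] = value;
  }
  return env;
}
```

Note that unrelated variables are deliberately not copied, which keeps the container's environment minimal and auditable.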
## Disaster recovery

- Workspace backups: Since workspaces live in GCS, enable bucket versioning or scheduled exports if you need point-in-time recovery.
- Container recreation: Keep Docker image digests in CI outputs. Rebuilds are deterministic when you pin versions in `docker/Dockerfile`.
- API redundancy: Deploy multiple Cloud API instances behind a load balancer. Clients can retry on `5xx` errors because operations are idempotent (container IDs act as natural keys).
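Because container IDs act as natural keys, replaying a failed request is safe, so a client-side retry loop like the sketch below is sound. The backoff values and the minimal response shape are illustrative choices, not prescribed by the API.

```typescript
// Minimal response shape: only the status code matters for retry decisions.
type MinimalResponse = { status: number };

// Retry a request on 5xx responses with exponential backoff. Safe here
// because Cloud API operations are idempotent (container IDs are natural
// keys), so a replayed request cannot create duplicates.
async function retryOn5xx(
  doRequest: () => Promise<MinimalResponse>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<MinimalResponse> {
  let last: MinimalResponse = { status: 500 };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await doRequest();
    if (last.status < 500) return last; // success or non-retryable 4xx
    if (attempt < maxAttempts) await sleep(100 * 2 ** (attempt - 1)); // 100ms, 200ms, ...
  }
  return last;
}
```

Note that 4xx responses return immediately: retrying a bad request wastes capacity and will not change the outcome.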
## Observability integrations

- Emit metrics for container creation latency, execution duration, MCP failures, and file upload size. The Cloud SDK exposes timestamps you can forward to OpenTelemetry.
- Tag containers with business metadata by storing labels in an external system keyed by `container.id`.
- Combine session IDs (returned by `run`) with log traces to follow requests end-to-end.
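Turning SDK timestamps into the latency metrics listed above might look like this sketch. The timestamp field names (`createdAt`, `startedAt`, `finishedAt`) and the recorder interface are assumptions; swap in the actual session fields and your OpenTelemetry meter.

```typescript
// Assumed session timestamps (ms since epoch); adjust to the real SDK fields.
interface SessionTimestamps {
  createdAt: number;  // container creation requested
  startedAt: number;  // execution began
  finishedAt: number; // execution completed
}

// Stand-in for an OpenTelemetry-style histogram/metric recorder.
interface MetricRecorder {
  record(name: string, valueMs: number, attrs: Record<string, string>): void;
}

// Derive creation latency and execution duration, tagged with the session ID
// so metrics can be joined with log traces end-to-end.
function emitSessionMetrics(
  sessionId: string,
  ts: SessionTimestamps,
  recorder: MetricRecorder,
): void {
  const attrs = { sessionId };
  recorder.record("container.creation.latency", ts.startedAt - ts.createdAt, attrs);
  recorder.record("execution.duration", ts.finishedAt - ts.startedAt, attrs);
}
```

Tagging every metric with the session ID is what makes the end-to-end correlation with log traces possible.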
Following these practices keeps TestBase Cloud predictable while your team scales agent workloads.