Platform Architecture

MetoSim uses a Hybrid Async API + Modal GPU Workers architecture: the API and the engine are separate deployable units, connected by a lightweight Redis task broker.

Layered Overview

Layer	Runs On	Component	Responsibility
Client	Researcher's machine	Python SDK	Config validation, job polling, visualization
Gateway	Railway / Cloud Run	FastAPI REST API	Auth, rate limiting, job state machine
Compute	Modal (B200/A100)	FDTD Engine	Mesh → Solve → Serialize → Store
Storage	S3 / Cloudflare R2	HDF5 Files	Checksum-verified, immutable results
Observability	SaaS	Prometheus + Sentry	Structured logs, correlation IDs
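
As an illustration of the observability row, structured logs that carry a correlation ID can be produced with the standard library alone. This is a minimal sketch, not MetoSim's actual logging setup: `JsonFormatter`, `new_correlation_id`, the JSON field names, and the `metosim.api` logger name are all hypothetical.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Holds the ID for the request or task currently being handled.
# "-" is an assumed placeholder for log lines emitted outside any request.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


def new_correlation_id() -> str:
    """Generate an ID and bind it to the current context."""
    cid = uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "correlation_id": correlation_id.get(),
            }
        )


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("metosim.api")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

If the same ID is passed along with the task payload and set again inside the worker, log lines from the API and the engine can be joined on `correlation_id`.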

Component Diagram

SDK ────HTTPS────► FastAPI (API Container) ────Redis────► Engine Worker (GPU)
                        │                                       │
                   PostgreSQL                            S3 Object Store
                   (Job State)                           (HDF5 Results)

End-to-End Data Flow

1. User creates Simulation(config) and calls client.run(sim)
2. SDK validates config with Pydantic, serialises to JSON
3. SDK POSTs to /v1/simulations with Bearer token
4. API validates auth, checks no active job (V1 single-job constraint)
5. API creates job record (QUEUED), returns job_id
6. API dispatches async task to engine via Redis
7. Engine worker generates mesh, runs FDTD solver on GPU
8. Engine writes field arrays + metadata to HDF5
9. HDF5 uploaded to S3; job status updated to COMPLETED
10. SDK polls until COMPLETED, then downloads via pre-signed URL
11. SDK verifies SHA-256 checksum on download
12. Researcher calls plot_field() on local results
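
The polling and checksum steps on the SDK side can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK's real code: `wait_for_completion`, `verify_checksum`, the `fetch_status` callable, and the interval and timeout defaults are hypothetical. In practice `fetch_status` would wrap an HTTP call to the API.

```python
import hashlib
import time
from typing import Callable

# States after which polling stops (taken from the job state machine).
TERMINAL_STATES = {"COMPLETED", "FAILED"}


def wait_for_completion(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,      # assumed polling interval
    timeout_s: float = 600.0,     # assumed overall timeout
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_s}s")
        sleep(interval_s)


def verify_checksum(payload: bytes, expected_sha256: str) -> None:
    """Raise ValueError if the downloaded bytes do not match the digest."""
    actual = hashlib.sha256(payload).hexdigest()
    if actual != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch: expected {expected_sha256}, got {actual}"
        )
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays. For large HDF5 files, feeding the download to `hashlib.sha256().update()` in chunks would avoid holding the whole file in memory.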

Job State Machine

QUEUED → RUNNING → COMPLETED
  │         │
  └─────────┴──→ FAILED
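
One way to enforce the diagram in code is an explicit transition table. The sketch below is illustrative only: `JobState`, `ALLOWED`, and `transition` are hypothetical names, and the real API may store and check states differently.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Allowed transitions, read directly off the diagram:
# QUEUED and RUNNING can both fail; COMPLETED and FAILED are terminal.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING, JobState.FAILED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the diagram does not allow the move."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"illegal transition {current.value} -> {target.value}"
        )
    return target
```

Rejecting illegal moves at a single choke point means a late or duplicated worker update cannot, for example, flip a COMPLETED job back to RUNNING.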

V1 Constraint: Each API key may have only one active job at a time. Submitting a new job while one is still active returns 409 Conflict with a Retry-After header.
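
The admission check behind this constraint might look like the sketch below. It is framework-free and in-memory for clarity; `admit_job`, `Response`, the 30-second Retry-After value, and the 202 success status are assumptions for illustration, not documented API behaviour. "Active" is taken here to mean QUEUED or RUNNING.

```python
from dataclasses import dataclass, field

# Assumption: a job counts as active until it reaches a terminal state.
ACTIVE_STATES = {"QUEUED", "RUNNING"}


@dataclass
class Response:
    status_code: int
    headers: dict = field(default_factory=dict)
    body: dict = field(default_factory=dict)


def admit_job(
    jobs_by_key: dict,
    api_key_id: str,
    new_job_id: str,
    retry_after_s: int = 30,  # assumed hint; the real value is not documented
) -> Response:
    """Accept the job unless this API key already has an active one.

    jobs_by_key maps an API key ID to a {job_id: state} dict.
    """
    jobs = jobs_by_key.setdefault(api_key_id, {})
    if any(state in ACTIVE_STATES for state in jobs.values()):
        return Response(
            409,
            {"Retry-After": str(retry_after_s)},
            {"detail": "a job is already active for this API key"},
        )
    jobs[new_job_id] = "QUEUED"
    # 202 Accepted is an assumed success code for an async submission.
    return Response(202, body={"job_id": new_job_id, "status": "QUEUED"})
```

With PostgreSQL as the job store, the check and the insert would need to happen atomically (for example inside one transaction) so that two concurrent submissions cannot both pass.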

Why This Architecture

Five options were evaluated. The hybrid approach won because it builds production-ready separation between the API and the engine from day one, without paying the full microservices tax at the MVP stage.

Criterion	Monolith	Microservices	Serverless	Event-Driven	Hybrid (Chosen)
Dev Speed	★★★★★	★★★	★★★★	★★★	★★★★
Scalability	★★	★★★★★	★★★★	★★★★★	★★★★
GPU Flexibility	★★	★★★★★	★★★	★★★★	★★★★★
Ops Overhead (more stars = lower)	★★★★★	★★	★★★	★★	★★★★

The hybrid architecture maps cleanly to V2 (batch = N tasks), V3 (dataset = batch export), and V4 (inverse = iterative task chains) without re-architecting.

Technology Stack

Component	Technology	Rationale
SDK	Python + Pydantic + httpx	Type-safe, Jupyter-friendly
API	FastAPI + SQLAlchemy	Auto OpenAPI, async, mature
Task Queue	Redis + Celery	Simple, proven at scale
Engine	JAX / NumPy	GPU arrays with CPU fallback
Database	PostgreSQL	ACID for job state
Storage	S3-compatible	Cheap, pre-signed URLs
GPU	Modal (B200/A100)	On-demand, zero idle cost
Hosting	Railway / Cloud Run	Auto-scale, deploy from Git
CI/CD	GitHub Actions	Native to repo

Security

API keys are stored only as SHA-256 hashes and are never logged in plaintext
All traffic over TLS 1.3 with HSTS
Pre-signed S3 URLs with 15-minute expiry
Simulation configs sanitised — no shell execution from user input
Dependency scanning via pip-audit and Dependabot
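
The first point, storing API keys only as SHA-256 hashes, can be sketched with the standard library. The function names, the `msk_` key prefix, and the redaction format are hypothetical and only illustrate the idea.

```python
import hashlib
import hmac
import secrets


def hash_api_key(key: str) -> str:
    """SHA-256 hex digest of the key; this is the only form that is stored."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def generate_api_key() -> tuple[str, str]:
    """Return (plaintext_key, stored_hash).

    The plaintext is shown to the user once and never persisted.
    The "msk_" prefix is an assumed convention, not MetoSim's format.
    """
    key = "msk_" + secrets.token_urlsafe(32)
    return key, hash_api_key(key)


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Compare digests in constant time to avoid timing side channels."""
    return hmac.compare_digest(hash_api_key(presented), stored_hash)


def redact(key: str) -> str:
    """Short, non-reversible form that is safe to put in logs."""
    return key[:8] + "..."
```

A plain, unsalted SHA-256 is reasonable here because the keys are long random tokens rather than user-chosen passwords; low-entropy secrets would call for a dedicated password-hashing function instead.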
