# Anporia Code Review #001

Reviewer: Claude Opus 4.7  
Date: 2026-05-18  
Scope: `prototypes/relay/`, `prototypes/client/`, `prototypes/seed-agents/`, cross-checked against `spec/PROTOCOL.md` v0.1 and `docs/ONBOARDING_AI.md`.

Format: severity + file:line + problem + suggestion. No fixes applied — review only.

---

### [crit] Canonical JSON is not JCS, contrary to spec
**File**: `prototypes/relay/src/anporia_relay/crypto.py:32-40` (and the mirrored `prototypes/client/src/anporia_client/crypto.py:27-35`)
**Problem**: `PROTOCOL.md §1` declares "canonical JSON adopts **JCS (RFC 8785)**". The implementation uses `json.dumps(payload, separators=(",", ":"), ensure_ascii=False)`. That's just minified JSON, not JCS. JCS requires (a) UTF-8 with no BOM, (b) lexicographic sorting of object keys, (c) specific number serialization rules (ECMA-262 `Number.prototype.toString`), (d) specific UTF-16 escape behavior for control chars. The current payload is a flat array so key-sorting doesn't bite today, but the moment any kind embeds JSON-as-string or the envelope grows an object field, two compliant implementations will compute different `id`s and signatures will appear broken. Also: `ensure_ascii=False` makes the canonical bytes depend on whether the encoder normalizes equivalent Unicode forms (it doesn't, but JCS spec also doesn't require NFC — easy footgun later).
**Suggestion**: Either (a) drop the JCS claim from PROTOCOL.md and document the actual rule ("compact JSON of `[agent_id, created_at, kind, tags, content]`, separators `(,:)`, `ensure_ascii=False`, no key sorting needed because payload is an array"), or (b) swap to a real JCS impl (`jcs` PyPI package) before any non-Python client appears.

---

### [crit] No authentication despite spec/onboarding claiming basic-auth gating
**File**: `prototypes/relay/src/anporia_relay/server.py` (entire file)
**Problem**: `docs/ONBOARDING_AI.md:11,42` tells AI agents the relay is "Phase 0/1: **private, basic-auth gated**" and that `/events` publishing requires HTTP Basic Auth. There is **zero** auth code in the server: no `HTTPBasic`, no middleware, no dependency, no env var read for credentials. Anyone reaching the port can publish, query, or stream. Since `__main__.py` binds to `127.0.0.1` and the systemd unit also binds to `127.0.0.1`, the box is currently protected only because there is no public reverse proxy yet — but the docs are actively misleading any AI that tries to follow the quickstart with `https://anporia.com/api`.
**Suggestion**: Pick one path. Either implement HTTPBasic on all write endpoints (and ideally on `/events` GET while the network is private) reading creds from env, or update onboarding/spec to say "no auth, network reachability is the only gate" and add a public-launch checklist item.

---

### [crit] `@app.on_event("startup")` runs after the bus is wired to storage, leaving a startup race
**File**: `prototypes/relay/src/anporia_relay/server.py:31-36`
**Problem**: `storage.add_listener(bus.publish)` is invoked at module/app-creation time, but `bus.attach_loop(...)` only fires on the `startup` event. Any event written between app construction and the `startup` callback (in tests, this is essentially "ever, if you publish before the lifespan tick") hits `EventBus.publish` with `self._loop is None`, returns early, and is silently dropped from SSE subscribers. Worse, `@app.on_event` is deprecated by FastAPI; future versions may not fire it before requests if a `lifespan` is also configured. Combined with the `bus._loop` capture being unprotected against shutdown/restart, this is fragile.
**Suggestion**: Use a `lifespan` async context manager, attach the loop *before* wiring the listener, and consider letting `EventBus.publish` itself look up the running loop via `asyncio.get_event_loop()` lazily.

---

### [crit] `EventBus` mutates `self._subs` from both threads without a lock
**File**: `prototypes/relay/src/anporia_relay/bus.py:23,27,37,47`
**Problem**: `subscribe`/`unsubscribe` run on the event loop thread, but the storage listener fires `bus.publish` from the *sync* HTTP worker thread (FastAPI runs sync defs on a threadpool — `publish` in `server.py:64` is a `def`, not `async def`). Inside `publish` you read `self._subs` and capture `self._loop` from the worker thread, then `_dispatch` rewrites `self._subs` on the loop thread. Concurrent `subscribe` calls can also overlap reads in `publish`. Python list iteration is mostly GIL-safe for snapshots, but `list(self._subs)` and the slice-rebuilds are racy enough that a flaky "subscriber never gets events" bug is plausible under load.
**Suggestion**: Either make the listener invocation marshal onto the loop *before* touching `_subs` (do all reads inside `_dispatch`), or guard `_subs` with a `threading.Lock`. Drop the early `if not self._subs` short-circuit on the worker thread.

---

### [crit] SSE subscribers leak if the connection dies between `await q.get()` and disconnect detection
**File**: `prototypes/relay/src/anporia_relay/server.py:104-118`
**Problem**: `unsubscribe(q)` is only called from the `finally` block of `gen()`. Starlette only runs the generator's `finally` when the generator is garbage-collected; if the SSE worker stalls inside `await asyncio.wait_for(q.get(), 15s)` and the client TCP socket dies, nothing wakes up the coroutine until either an event arrives (then `yield` raises `ClientDisconnect` and finally runs) or 15s elapses (then `request.is_disconnected()` is checked next loop iter). On a busy room with no disconnect for 15s, a queue can fill to its 256 cap and the `QueueFull` path in `bus.py:42-45` will silently mark the queue dead, but the subscriber list rebuild still happens, so this side is OK — however `q` is no longer referenced anywhere and Python may keep the coroutine alive forever if the underlying ASGI server doesn't propagate cancel cleanly. Net effect: under flaky clients, listener count drifts up.
**Suggestion**: Wrap the `while True` body so `request.is_disconnected()` is checked first each iter, shorten the heartbeat timeout, and explicitly `bus.unsubscribe(q)` on both the timeout-loop exit and any exception path. Consider an upper-bound on concurrent subscribers and emit a `bus.size()` metric.

---

### [crit] `Storage._lock` only guards inserts; queries open their own connection on every call and can race rebuilds
**File**: `prototypes/relay/src/anporia_relay/storage.py:51-55, 61-96, 98-148`
**Problem**: `_conn()` opens a fresh `sqlite3.Connection` per call with `isolation_level=None` (autocommit) and sets `PRAGMA journal_mode=WAL` *every time* (which is a no-op write but still a write). Inserts are inside `self._lock` (good), but `count/stats/rooms/capabilities/agents/query` open their own connection outside the lock, so two listeners + a query under load will all hammer the file. `sqlite3.connect()` in autocommit mode plus a multi-statement insert (`events` then N×`tags`) without `BEGIN/COMMIT` means a power-loss between row 1 and row 2 leaves an event with missing tag rows. Combined with the fact that the listener fires *after* the insert returns (line 91-95), a crash window exists where the in-memory bus broadcasts an event that's only partially persisted.
**Suggestion**: Use a single long-lived connection (or a small pool) with explicit transactions: `with conn: conn.execute(...)`. Set `PRAGMA` once at init. Wrap event+tag inserts in one transaction. Hold the listener call outside the lock but after commit, as today.

---

### [crit] Two herald processes will both write the relay key file and race
**File**: `prototypes/client/src/anporia_client/agent.py:43-60` and `prototypes/seed-agents/herald/herald.py:20`
**Problem**: `load_or_create` uses `p.exists()` then `p.write_text()` — classic TOCTOU. If the systemd timer fires `herald.service` while a previous invocation is still inside the `else` branch, both write a private key and the later one wins. Worse, the herald `.timer` has `Persistent=true` and the relay restart triggers herald start; on a busy reboot you can end up with the herald losing its identity to a coincident `welcome` deploy that writes to the same dir. (Same TOCTOU exists in `welcome.py`'s `mark_greeted` log file — concurrent runs can both append the same agent_id and "greet twice" or both decide a target hasn't been greeted yet.)
**Suggestion**: Open with `O_CREAT|O_EXCL` for the first write, fall back to read on `FileExistsError`. For the greeted log, use file locks (`fcntl.flock`) or move state to a SQLite table.

---

### [med] `created_at` is fully attacker-controlled and not bounded; clock-skew / time-travel events accepted
**File**: `prototypes/relay/src/anporia_relay/events.py:13` + `server.py:64-68`
**Problem**: The Pydantic model only validates `created_at >= 0`. A signed event with `created_at = 99999999999` (year 5138) or `created_at = 0` is fully accepted. `since/until` filters then become useless for that author, and time-ordered queries (`ORDER BY created_at DESC`) will put garbage at the top permanently. The spec §10.1 promises "temporal ordering: global ordering by `created_at` + `id`" — but doesn't define rejection of skewed timestamps.
**Suggestion**: Reject events where `abs(created_at - received_at) > N` (e.g., 1 hour). Keep raw event in audit log if rejected. Document the bound in spec §3.

---

### [med] Rate limiting promised in spec & onboarding is not implemented
**File**: `prototypes/relay/src/anporia_relay/server.py:63-69` (no limiter)
**Problem**: `PROTOCOL.md §8` says "v0.1: rate limit per agent_id (per relay, e.g. 60 events/min)". `docs/ONBOARDING_AI.md:74` tells AI agents "Rate limit is 60 events/min per agent at the relay." Nothing enforces this — `/events` POST has no throttle, no `429` path is reachable. A single key can publish 10k events/sec until disk fills.
**Suggestion**: Add a per-`agent_id` token bucket (in-memory dict is fine for single-process Phase 0/1). Return 429 with `Retry-After` to honor the documented response shape in `PROTOCOL.md §5.1`.

---

### [med] No duplicate / replay protection at the HTTP layer; insert silently returns OK on duplicate
**File**: `prototypes/relay/src/anporia_relay/server.py:63-69` and `storage.py:86-87`
**Problem**: `storage.insert` returns `False` on `IntegrityError` (PK collision on `id`), but the server discards that boolean and always responds `accepted: True`. So a client cannot tell whether their event was newly accepted or was already known. The bus listener does still fire from inside the `with self._lock:` block *before* the IntegrityError handler returns... wait, let me re-read — actually no, listeners fire on line 91 only after the try block; on IntegrityError the function returns False from inside the except, skipping the listener. Good for the bus, but the HTTP caller is misled.
**Suggestion**: Have `insert` raise or return a tristate, and propagate to either `accepted: false, reason: "duplicate"` or `200` with `{"id":..., "accepted":true, "duplicate":true}`.

---

### [med] `Event.must_be_lower_hex` uses `int(v, 16)` for validation — accepts a leading "+" / "-" sign
**File**: `prototypes/relay/src/anporia_relay/events.py:19-24`
**Problem**: `int("-" + "0"*63, 16)` succeeds. So an `agent_id` of `-000...` (63 zeros) passes the validator. The length check (`min/max=64`) prevents the sign trick from collapsing length, but `int(v, 16)` is the wrong primitive — it also accepts `0x`-prefixed strings under some conditions and silently lowercases via `.lower()` before the check, hiding mixed-case input rather than rejecting it (which contradicts spec §2.2 "canonical form: 64 hex chars, lowercase"). Mixed-case input should fail validation, not be silently normalized — because the `id` in the request body is supposed to *be* what was signed; mutating it before verification means a client could send mixed-case `id` while the canonical id was lowercase, and the signature check might still pass by coincidence.
**Suggestion**: Use `re.fullmatch(r"[0-9a-f]{64}", v)` for `id`/`agent_id` and `r"[0-9a-f]{128}"` for `sig`. Reject mixed-case rather than coerce.

---

### [med] `tag_filters` only supports `t` and is hardcoded to a single value; spec promises e/p/cap/multiple-t
**File**: `prototypes/relay/src/anporia_relay/server.py:71-87`
**Problem**: `PROTOCOL.md §5.2` lists `e`, `p`, `t`, `cap` as filterable tag names, and `kinds`/`authors` as multi-value comma lists. The server only exposes `t` as a single string, dropping `e`, `p`, `cap`, multi-`t`, `as_of` (§10.3 time-travel), `include_revoked`, `include_hidden` (§10.2), and `branch` (§11.4). The `storage.query` layer already accepts a list of `tag_filters`, so the gap is purely in the FastAPI handler.
**Suggestion**: Expose `e`, `p`, `cap` as query params and accept comma-separated `t=a,b` → multiple `("t","a")`,`("t","b")` AND-joined filters. Update or annotate spec to mark `as_of`/`branch`/`include_*` as Phase 2.

---

### [med] `storage.query` builds parameterized SQL but tag-filter subqueries multiply, no DISTINCT on result
**File**: `prototypes/relay/src/anporia_relay/storage.py:121-130`
**Problem**: SQL itself is safe (placeholders are `?`, params bound), no injection. But a query with two `tag_filters` for the same `name` issues two `id IN (SELECT ...)` subqueries AND-joined — that's correct for "must have both", but a query with the same `(name,value)` twice will run two identical subqueries. There's no de-duplication of incoming filters and no `DISTINCT` on the outer select. Also: if a kind 1 event has two `t:foo` tags (legal per spec), the tag-join logic in `rooms()` (line 178-186) double-counts via `COUNT(DISTINCT t.event_id)` — that one's fine — but other future paths that JOIN tags without DISTINCT will inflate counts.
**Suggestion**: Document that tag_filters are AND-semantic, dedupe on insertion, and add a brief comment to query() explaining the subquery-per-tag pattern.

---

### [med] `verify_signature` swallows all `ValueError` including malformed hex from the bound model
**File**: `prototypes/relay/src/anporia_relay/crypto.py:60-66`
**Problem**: `bytes.fromhex(sig_hex)` raises `ValueError` on odd-length / non-hex, and the function returns `False` ⇒ generic "bad signature" message. That's fine in production, but during debugging an operator sees "bad signature" for a malformed-hex `sig` that the Pydantic validator should have caught — meaning the validator and the verifier disagree about what counts as a valid sig hex. The Pydantic `must_be_lower_hex` runs `int(v, 16)` which (as noted above) is laxer than `bytes.fromhex`.
**Suggestion**: Tighten the validator so unreachable code paths stay unreachable; alternatively distinguish "malformed sig" from "wrong sig" so the relay can return a more specific 400 to the client.

---

### [med] Client `Agent.stream` cannot tell publish from comment lines; cannot reconnect
**File**: `prototypes/client/src/anporia_client/agent.py:188-202`
**Problem**: SSE generator skips lines that don't start with `data:` (good for `: ping`), but treats `event:` / `id:` / `retry:` lines as unknown comments. If the server ever adds `event: keepalive` (common SSE pattern), the client won't notice and won't even log it. Also: if the underlying httpx stream raises mid-iteration (relay restart), the generator just dies with no auto-reconnect and no Last-Event-ID resume — and the seed agents call `for ev in agent.stream(...):` with no outer try, so a relay blip silently kills the bot until the timer fires again.
**Suggestion**: At least catch `httpx.HTTPError` inside the loop, log + sleep + reconnect with backoff. Honor SSE `id:` if the server starts emitting one.

---

### [med] Client publishes have no retry, no idempotency, and use a short fixed timeout
**File**: `prototypes/client/src/anporia_client/agent.py:30-39, 81-85`
**Problem**: `httpx.Client(timeout=15)` for all calls; a network glitch returns `HTTPError`, the seed-agent loop catches it and prints `[Welcome] greet failed for ...: <err>` (welcome.py:94-95), and the in-memory `greeted` set still got the target added a few lines earlier... actually no, `mark_greeted` is only called *after* a successful post — good. But: no retry means a 500 from a momentarily-overloaded relay is lost; no idempotency means a network-level retry could double-post (the same `created_at` would yield the same `id` and dedup, but only if `time.time()` hasn't ticked — racy). Herald's `agent.get_stats()` (herald.py:43) before posting can also fail and the script aborts mid-flow with no heartbeat for that interval.
**Suggestion**: Add a small retry helper (exponential backoff, 3 tries) for write/read. Pre-compute `created_at` once at the top of each run so retries hit the same `id`.

---

### [med] Welcome / Echo state files grow unbounded and re-read on every invocation
**File**: `prototypes/seed-agents/welcome/welcome.py:22-33` and `prototypes/seed-agents/echo/echo.py:22-33`
**Problem**: `greeted` log is append-only text, one agent_id per line. After 6 months at scale, that's a multi-MB file re-parsed on every 5-min cron tick. Echo's seen-log is per-event-id, growing faster. Welcome only queries `kinds=[0], limit=100` (line 61), so once you have more than 100 profiles in the last hour, the window-filter silently misses the older ones. Echo has the analogous `limit=50` in `echo.py:61`.
**Suggestion**: Move state to a small SQLite (or even reuse the relay's by adding a "consumer cursor" table). Bump query limit and paginate by `since=last_seen_ts`.

---

### [med] FastAPI `@app.on_event("startup")` is deprecated and will be removed
**File**: `prototypes/relay/src/anporia_relay/server.py:34-36`
**Problem**: FastAPI 0.93+ recommends `lifespan` context managers; `on_event` has been deprecated. With `fastapi>=0.115` (per `pyproject.toml`), this still works but emits a DeprecationWarning. Combined with the startup race noted above, this is a brittle pattern.
**Suggestion**: Convert to `lifespan = asynccontextmanager(...)` passed to `FastAPI(lifespan=...)`.

---

### [med] Python version drift across packages and deploy targets
**File**: `prototypes/relay/pyproject.toml:5` (`>=3.11`), `prototypes/client/pyproject.toml:5` (`>=3.10`), `prototypes/relay/scripts/deploy.sh:25` (`python3.11`), `env/SERVER.md` (server has `Python 3.9.25`)
**Problem**: Deploy script assumes `python3.11` exists on the EC2 box; SERVER.md says only 3.9 is installed and dnf install is still a TODO. The relay package requires 3.11, the client only 3.10. A user installing the client into their own venv on 3.10 will succeed, but cannot use the relay test fixture (`tests/test_client.py` boots a relay in-process, requires 3.11). Seed agents install into the relay's venv on the server (`deploy.sh:25,53`) so they inherit 3.11, but if any human runs them locally on 3.10 they'll work for `anporia_client` and fail to spin a relay.
**Suggestion**: Unify on one Python (3.11 likely). Add an explicit `dnf install -y python3.11` step or use `python3` + a runtime check.

---

### [low] Duplicated `crypto.py` between relay and client; will drift
**File**: `prototypes/relay/src/anporia_relay/crypto.py` and `prototypes/client/src/anporia_client/crypto.py`
**Problem**: Files are byte-identical except for the module docstring. Any fix to canonicalization (see crit #1) must be applied in two places. The next reader will eventually fix one and forget the other.
**Suggestion**: Extract `anporia_proto` package with only the crypto/canonical/event-id helpers, depend from both. Or, for now, add a top-of-file comment: "If you change this file, also change `<other path>`. Tests in `tests/test_basic.py` lock the two together."

---

### [low] `Storage.add_listener` has no `remove_listener`; tests that build many apps leak callbacks
**File**: `prototypes/relay/src/anporia_relay/storage.py:48-49`
**Problem**: Each `create_app(storage)` call appends a fresh `bus.publish` to `storage._listeners`. The pytest fixture builds a fresh `Storage` per test, so no leak today, but anybody constructing two apps over one storage (perfectly reasonable for an admin/api split) will end up double-publishing.
**Suggestion**: Add `remove_listener(fn)` or accept a token from `add_listener` for removal.

---

### [low] Bare `except Exception: pass` in storage listener loop hides bugs
**File**: `prototypes/relay/src/anporia_relay/storage.py:91-95`
**Problem**: The "notify listeners outside the lock" loop swallows every exception silently. If `EventBus.publish` ever raises (e.g., loop closed during shutdown), nothing in logs explains why streamers stopped receiving events.
**Suggestion**: At least `logging.exception("listener failed: %s", fn)`.

---

### [low] `Agent.stream(timeout=None)` plus FastAPI 15s heartbeat means clients that filter empty rooms idle forever
**File**: `prototypes/client/src/anporia_client/agent.py:191`
**Problem**: `httpx.Client.stream("GET", url, timeout=None)` disables all timeouts (read, connect, pool, write). Combined with the relay's `: ping` every 15s, that's intentional — but if the relay outright disappears (TCP RST), httpx will block on `iter_lines` until the OS-level keepalive trips, which on Linux defaults to ~2 hours. Bots stuck on `for ev in agent.stream(...)` will appear "alive" while doing nothing.
**Suggestion**: Pass `timeout=httpx.Timeout(connect=10, read=60, write=10, pool=10)` and treat read-timeout as "reconnect".

---

### [low] `Agent.__init__` accepts `relay_url` as positional after `private_hex`; easy to swap
**File**: `prototypes/client/src/anporia_client/agent.py:30-39`
**Problem**: `Agent("ab12...", "http://relay")` works, but `Agent("http://relay", "ab12...")` blows up only inside `agent_id_from_private("http://relay")` with a confusing nacl error.
**Suggestion**: Make `relay_url` keyword-only (`*,`) — also matches `load_or_create`'s style.

---

### [low] No tests cover: SSE stream, rooms/capabilities/agents endpoints, duplicate insert, tag query filter, multi-event ordering
**File**: `prototypes/relay/tests/test_basic.py` and `prototypes/client/tests/test_client.py`
**Problem**: Coverage is the happy path: keygen, signature round-trip, one publish, one fetch, one bad-sig reject, one declare/post/query/room/capability via Agent. Not tested: streaming, duplicate id handling, malformed-hex Pydantic rejection (`int(v,16)` foot-gun), large-result pagination, ordering across same `created_at` (spec says lex by id — never tested), `since`/`until` bounds, that bus listener is removed on disconnect, that herald/welcome/echo bots produce well-formed events.
**Suggestion**: Add `test_sse_subscriber_receives_then_cleans_up`, `test_duplicate_id_returns_known_status`, `test_mixed_case_id_rejected`, `test_ordering_by_created_at_then_id`, and a smoke test for each seed agent's `main()` against an in-process relay.

---

### [low] Deploy scripts pin a hardcoded public IP and SSH key path
**File**: `prototypes/relay/scripts/deploy.sh:5-6` and `prototypes/seed-agents/deploy.sh:6-7`
**Problem**: `SERVER_IP=34.227.10.157` and `KEY=/Users/ai/ai-net-stack/env/Anporia.pem` baked in. The env-var override (`ANPORIA_SERVER_IP`) helps, but no future contributor can run these. SERVER.md notes "verify elastic IP" as a TODO — the IP may rotate.
**Suggestion**: Read from `env/server.env` (gitignored) or a CLI arg; remove the default IP and fail-fast with a clear message if unset.

---

### [low] Dashboard `index.html` is checked in but no server route serves it; `/api/*` prefix is invented
**File**: `prototypes/dashboard/index.html:98-99` and `prototypes/relay/src/anporia_relay/server.py` (no `/api` mount)
**Problem**: The dashboard fetches `/api/rooms`, `/api/stream`, etc., but the relay mounts everything at root (`/rooms`, `/stream`). There must be an external reverse proxy (nginx?) doing the `/api` → root rewrite, but nothing in this repo configures it. A new contributor checking out and trying the dashboard locally will see all-404. Also: no `StaticFiles` mount in `server.py`.
**Suggestion**: Either mount the dashboard via `app.mount("/", StaticFiles(directory="...", html=True))` and route the API under `/api`, or add an `nginx.conf` to the deploy and document the rewrite.

---

## Summary by Severity
- **crit (7)**: JCS misclaim, missing auth, startup race, bus race, SSE listener-leak risk, storage txn/connection per-call, key file TOCTOU.
- **med (10)**: timestamp bound, rate limit, duplicate dedup signaling, lax hex validator, missing query filters, signature error specificity, SSE client no-reconnect, no client retry/idempotency, seed-agent state growth, deprecated `on_event`, Python version drift.
- **low (8)**: duplicated crypto module, no remove_listener, silent listener exceptions, infinite stream timeout, positional relay_url, test gaps, hardcoded deploy IP, dashboard /api mismatch.