Roadmap

Per-release delivery plan and cross-cutting capability tracks for Aegis-KMS through v1.0.

This page renders the project-root ROADMAP.md. The canonical source is at github.com/sharma-bhaskar/aegis-kms/blob/main/ROADMAP.md.

Aegis-KMS Roadmap

This document is the single source of truth for what Aegis-KMS is building toward and which release each capability lands in. The README sells the design; this file tracks the execution.

If you want to follow along or contribute, the live work happens in GitHub Issues, grouped by Milestones and visualized on the Project board. Each row in the tables below corresponds to one or more issues with `area/*` and `kind/*` labels.

Status legend:

✅ Shipped — works in the latest release.
⚠️ MVP — partially shipped; functional but minimal.
🚧 WIP — being worked on now.
🔜 Designed — SPI/skeleton in place, implementation queued.
💡 Opportunity — not yet in design; on the table for community input.

Vision

Aegis exists to make AI-agent access to keys safe by default. The wedge is four checks on every request, regardless of which wire it came in on:

Identity & Context — who is this, on whose behalf, in what scope?
Risk Scoring — does this request match the actor's behavioral baseline?
Anomaly Detection — is this part of a pattern we should escalate?
Real-time Response — allow / step-up / deny / rotate / revoke / alert, automatically, before the next request lands.

Every release moves the substrate closer to that loop closing inside Aegis without a human in the path for the cases that don't need one.
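
As a rough illustration of how the four checks could compose, here is a minimal Scala sketch. Every name in it (`FourChecksSketch`, `Request`, `riskScore`, `decide`) and the 0.5 threshold are assumptions made for this example, not the Aegis API. Anomaly detection and response are collapsed into a single threshold decision.

```scala
// Hypothetical sketch only: names and thresholds are illustrative, not Aegis APIs.
object FourChecksSketch {
  sealed trait Decision
  case object Allow  extends Decision
  case object StepUp extends Decision
  case object Deny   extends Decision

  // Identity & Context: who is asking, on whose behalf, and within which key scope.
  final case class Request(actor: String, onBehalfOf: String, scope: Set[String], keyId: String)

  // Risk Scoring: 0.0 means the request matches the actor's baseline, 1.0 means far off it.
  def riskScore(req: Request, baselineKeys: Set[String]): Double =
    if (baselineKeys.contains(req.keyId)) 0.1 else 0.6

  // Response: out-of-scope requests are denied outright; in-scope ones are gated by risk.
  def decide(req: Request, baselineKeys: Set[String]): Decision =
    if (!req.scope.contains(req.keyId)) Deny
    else if (riskScore(req, baselineKeys) < 0.5) Allow
    else StepUp
}
```

In this sketch an in-scope key that the actor has never used before yields `StepUp` rather than `Deny`, which is one way to keep a human out of the path for routine requests.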

See POSITIONING.md for the full framing and ARCHITECTURE.md for the system design. This roadmap is the delivery plan that turns those documents into shipped capability.

Release milestones

The trade-off in each release is roughly balanced across three axes — make it a real KMS (cryptographic operations + lifecycle), ship the wedge (the four pillars that differentiate Aegis), and build the platform (integrations, observability, deployment). Each release moves the needle on all three.

v0.1.0 — Substrate ✅ shipped

The foundation that everything else lands on.

Real KMS	Wedge	Platform
`KeyService` algebra (create/get/locate/activate/revoke/destroy)	`Principal.Human` / `Principal.Agent` with parent linkage	Postgres event journal (Doobie/HikariCP)
In-memory + Pekko-actor backed lifecycle state	`BaselineDetector` MVP (scope + rate-spike)	AWS KMS root-of-trust adapter
REST endpoints `/v1/keys/*`	`RoleBasedPolicyEngine` (allowlist + recursive parent-check)	JWT bearer auth (HS256 issue + verify)
`aegis` admin CLI (version/login/keys CRUD)	`AuditingKeyService` decorator + stdout sink	Docker image (GHCR) + library jars

Non-goals: cryptographic operations, multi-cloud RoT, KMIP, MCP, OIDC. All deferred to later milestones.
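
The "recursive parent-check" in the Wedge column can be pictured as follows. This is a minimal sketch under assumptions: the fields on `Human` and `Agent` and the `permitted` helper are invented for illustration and may not match the real `Principal` types or `RoleBasedPolicyEngine`.

```scala
// Hypothetical sketch: field names and the permitted helper are assumptions.
object ParentCheckSketch {
  sealed trait Principal
  final case class Human(id: String, allowed: Set[String]) extends Principal
  final case class Agent(id: String, allowed: Set[String], parent: Principal) extends Principal

  // An agent may perform an operation only if it and every ancestor,
  // up to the human at the root of the chain, allow that operation.
  def permitted(p: Principal, op: String): Boolean = p match {
    case Human(_, allowed)         => allowed.contains(op)
    case Agent(_, allowed, parent) => allowed.contains(op) && permitted(parent, op)
  }
}
```

With this shape an agent can never hold more effective permission than the human it acts for, even if its own allowlist is broader.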

v0.1.1 — Make it a real KMS ✅ shipped

Theme: turn Aegis from a key registration service into a key management service. Every claim about layered-mode AWS KMS in the README should run end-to-end after this release.

#	Capability	Area	Status
1.1.a	`sign(id, message, alg)` in `KeyService` algebra + REST + AWS adapter	Real KMS	✅
1.1.b	`verify(id, message, signature, alg)`	Real KMS	✅
1.1.c	`encrypt(id, plaintext, ctx)` / `decrypt(id, ciphertext, ctx)`	Real KMS	✅
1.1.d	`wrap(id, dek)` / `unwrap(id, wrappedDek)`	Real KMS	✅
1.1.e	`rotate(id, policy)` — new active version, old → Deactivated	Real KMS	✅
1.1.f	`compromise(id, reason)` operator override	Real KMS	✅
1.1.g	Prometheus metrics endpoint (`/metrics`) — request rate, latency, error rate, audit lag, journal append latency	Platform	✅
1.1.h	OpenTelemetry tracing (auto-instrument REST + JDBC + AWS SDK)	Platform	✅
1.1.i	`Resource[IO, Unit]` boot scope — clean Postgres pool / actor system shutdown	Platform	✅
1.1.j	Anomaly detector: time-of-day baseline, source-IP set, op-type histogram	Wedge	✅
1.1.k	Maven Central publish (Sonatype OSSRH + GPG + workflow secrets)	Platform	⚠️ workflow ready; blocked on Sonatype + GPG secrets

Demo target: "Boot Aegis against an existing AWS KMS CMK, sign a payload, see the audit record, watch metrics in Grafana."
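
Row 1.1.e describes rotation as "new active version, old → Deactivated". A minimal sketch of that transition, assuming a simple in-memory version list, could look like this. `KeyVersion`, `State`, and this `rotate` signature are hypothetical and do not mirror the real `KeyService` algebra, which also takes an `id` and a `policy`.

```scala
// Hypothetical sketch of the rotation transition; not the KeyService signature.
object RotationSketch {
  sealed trait State
  case object Active      extends State
  case object Deactivated extends State

  final case class KeyVersion(version: Int, state: State)

  // Rotation marks any currently Active version Deactivated and appends
  // a new Active version numbered one above the highest existing version.
  def rotate(versions: List[KeyVersion]): List[KeyVersion] = {
    val next    = versions.map(_.version).foldLeft(0)((a, b) => math.max(a, b)) + 1
    val retired = versions.map(v => if (v.state == Active) v.copy(state = Deactivated) else v)
    retired :+ KeyVersion(next, Active)
  }
}
```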

v0.2.0 — Ship the wedge ✅ shipped

Theme: the first release where the README's "Claude goes rogue" example actually runs end-to-end. Anomaly + risk + auto-response close the loop.

#	Capability	Area	Status
2.0.a	Risk scorer (W2) — multi-factor numeric score; reasoning recorded in audit context	Wedge	✅
2.0.b	Decision adapter (`allow` / `step-up` / `deny`) wired into IAM	Wedge	✅
2.0.c	Auto-responder (W3) — configurable rules from `AgentRecommendation` → action (revoke / deactivate / freeze / alert)	Wedge	✅
2.0.d	Agent-token issuance HTTP endpoint (`POST /v1/agents/issue`)	Wedge	✅
2.0.e	`aegis agent issue` CLI — wire the existing stub to the new endpoint	Wedge	✅
2.0.f	Postgres audit table (queryable, indexed on actor + occurredAt + key)	Platform	✅
2.0.g	`GET /v1/audit?since=&actor=&key=&op=` audit-read API	Platform	✅
2.0.h	`aegis audit tail` CLI — wire the existing stub to the new audit-read API	Wedge	✅
2.0.i	Generic SIEM webhook audit sink (HTTPS POST per event, batched + retried)	Platform	✅
2.0.j	Kafka audit fan-out (Pekko-Connectors-Kafka)	Platform	✅
2.0.k	Redis-backed JWT revocation list (jti blacklist) — required for instant agent revoke	Platform	✅
2.0.l	Honey keys / canary keys — fake keys with auto-alert if any agent touches them	Wedge	✅
2.0.m	OIDC verifier + JWKS rotation + RS256 / ES256 signature support	Wedge	✅
2.0.n	OPA (Open Policy Agent) integration — externalize policy evaluation (Rego) via sidecar	Wedge	💡 deferred to v0.3.0 (#27)
2.0.o	NATS / NATS JetStream audit fan-out	Platform	✅
2.0.p	MySQL + SQLite event journal adapters	Platform	✅
2.0.q	`RoleBasedPolicyEngine` wired in `Server.boot` (RBAC for humans)	Platform	✅
2.0.r	`source.ip` plumbed from the HTTP layer into `AuditRecord.context`	Platform	✅

Demo target: "Issue an agent token, watch it sign 49 invoices fine, watch it try one off-scope key, watch the auto-revoke fire before the next attempt, all in `aegis audit tail`."
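
The demo flow can be approximated with a toy in-memory gate. This is a sketch under stated assumptions: `Gate`, `Token`, and the `Outcome` values are hypothetical names, the revocation set stands in for the Redis-backed jti list, and the class is not thread-safe.

```scala
// Hypothetical sketch: an in-memory stand-in for scope checks plus jti revocation.
object AutoRevokeSketch {
  sealed trait Outcome
  case object Allowed          extends Outcome
  case object DeniedAndRevoked extends Outcome // off-scope request; token revoked on the spot
  case object DeniedRevoked    extends Outcome // token was already revoked

  final case class Token(jti: String, scope: Set[String])

  final class Gate {
    private var revoked = Set.empty[String]

    def request(token: Token, keyId: String): Outcome =
      if (revoked.contains(token.jti)) DeniedRevoked
      else if (token.scope.contains(keyId)) Allowed
      else { revoked += token.jti; DeniedAndRevoked }
  }
}
```

Because the revocation is recorded before the off-scope request returns, the very next request on the same token is refused, which is the "before the next attempt" property the demo target describes.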

v0.2.1 — LLM advisor ✅ shipped

Theme: read-only AI assistant that explains the audit log to operators.

#	Capability	Area	Status
2.1.a	`aegis advisor scan` — finds unused keys, scope-creep, anomalies; deterministic (LLM narration via 2.1.c later)	Wedge	✅
2.1.b	`aegis advisor explain <agent-id>` — human-readable timeline of why a recommendation fired	Wedge	✅
2.1.c	Pluggable LLM provider — `LlmClient` SPI + Anthropic + OpenAI + Ollama shipped (Bedrock fast-follow)	Wedge	✅
2.1.d	Prompt safety: read-only system prompt, bounded structured input, graceful LLM-failure fallback	Wedge	✅

Demo target: "`aegis advisor scan` returns 'these 3 keys haven't been used in 60 days, these 2 agents have unusually broad scopes, there are no active anomalies.'"

v0.2.2 — Hardening fix pass

Theme: remove the things a first-time evaluator hits in their first ten minutes.

#	Capability	Area	Status
2.2.a	Real SDKs — `aegis-sdk-scala` full REST coverage + working Java facade, replacing the `NotImplementedError` stubs (#98)	Platform	🚧
2.2.b	Production preflight — warn/enforce when dev-grade settings bind a non-loopback address (#99)	Platform	🚧
2.2.c	Issue hygiene — tracking issues for journal snapshotting (#100), agent registry (#101), kill-switch (#102), v1.0.0 rows (#103–#107)	Platform	🚧

v0.3.0 — Multi-cloud + production deployment

Theme: make Aegis credible for the "we run multiple clouds and a few HSMs" enterprise.

#	Capability	Area	Status
3.0.a	GCP KMS root-of-trust adapter (`google-cloud-kms`)	Real KMS	🔜
3.0.b	Azure Key Vault root-of-trust adapter (`azure-security-keyvault-keys`)	Real KMS	🔜
3.0.c	HashiCorp Vault Transit root-of-trust adapter	Real KMS	🔜
3.0.d	Software root-of-trust (JCE-backed; dev/test only)	Real KMS	🔜
3.0.e	Per-key RoT routing — different keys can live in different backends	Real KMS	💡
3.0.f	Helm chart (`deploy/helm/aegis-kms`) — production-ready with Postgres dependency	Platform	🔜
3.0.g	Kubernetes operator (CRDs for `AegisKey`, `AegisAgent`)	Platform	💡
3.0.h	OpenTelemetry log export (Loki / Datadog / Honeycomb)	Platform	💡
3.0.i	Multi-tenancy: per-tenant key/agent/audit isolation	Platform	💡
3.0.j	Time-windowed access policies (e.g. business-hours-only keys)	Wedge	🔜
3.0.k	Just-In-Time (JIT) access — agent requests scoped permission on-demand	Wedge	💡
3.0.l	Approval workflows — Slack / PagerDuty / OpsGenie integration for step-up	Wedge	💡

Demo target: "Helm-install Aegis on a fresh cluster, register an existing AWS CMK, register a GCP CryptoKey, watch a single audit feed cover both backends."

v0.4.0 — KMIP + MCP

Theme: open the wires that bring storage vendors and AI hosts to the same control plane.

#	Capability	Area	Status
4.0.a	KMIP TTLV codec (1.4 / 2.0 / 2.1 / 3.0 with version negotiation)	Real KMS	🔜
4.0.b	KMIP TLS 1.3 server with mTLS	Real KMS	🔜
4.0.c	KMIP operations: `Create`, `Get`, `Activate`, `Destroy`, `Register` (BYOK), `Encrypt`, `Decrypt`, `Sign`, `SignatureVerify`	Real KMS	🔜
4.0.d	Tested integrations: NetApp ONTAP, Veeam, Oracle TDE	Real KMS	💡
4.0.e	MCP server (`aegis-mcp-server`) with curated tool surface	Wedge	🔜
4.0.f	MCP tool annotations + host-side approval flow	Wedge	🔜
4.0.g	Per-prompt accountability — record originating LLM prompt with each MCP-driven key op	Wedge	💡
4.0.h	Model identifier in audit (`actor.model = "claude-3.5-sonnet"`)	Wedge	💡
4.0.i	LangChain / LlamaIndex tool integration package	Wedge	💡
4.0.j	OpenAI function-calling surface (`aegis-agent-ai`) for non-MCP frameworks	Wedge	🔜

Demo target: "A NetApp filer, a Claude agent, and a Java application all use the same Aegis instance, with one audit trail, one identity model, one policy engine."

v0.5.0 — HSM-backed + Standalone hardening

Theme: the FIPS / regulated-industry release.

#	Capability	Area	Status
5.0.a	PKCS#11 root-of-trust adapter (Thales Luna, Entrust nShield, YubiHSM, AWS CloudHSM, SoftHSM for dev)	Real KMS	🔜
5.0.b	SoftHSM Testcontainer for CI integration testing	Real KMS	🔜
5.0.c	Hardware attestation docs (FIPS 140-2 Level 3 attestation chain)	Platform	💡
5.0.d	Standalone-mode hardening: AEAD wrapping, key-derivation hierarchy	Real KMS	🔜
5.0.e	Master-key rotation tooling	Real KMS	💡
5.0.f	Audit log immutability proofs (Merkle hash chain or signed batches)	Wedge	💡

Demo target: "Boot Aegis against a SoftHSM container, generate a key inside the HSM, prove the bytes never leave the device — all in CI."
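
Row 5.0.f is still at the opportunity stage, so no design exists yet. Purely to make the idea concrete, here is a minimal sketch of a linear hash chain (simpler than a Merkle tree and without signatures) over audit entries. `Entry`, `append`, and `verify` are hypothetical names, and the string payload is a stand-in for a real audit record.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.security.MessageDigest

// Hypothetical sketch: a linear SHA-256 hash chain, not a proposed Aegis design.
object AuditChainSketch {
  final case class Entry(payload: String, prevHash: String, hash: String)

  // Fixed-width starting value so that prevHash + payload is unambiguous.
  val Genesis: String = "0" * 64

  private def sha256Hex(s: String): String =
    MessageDigest.getInstance("SHA-256").digest(s.getBytes(UTF_8)).map(b => f"$b%02x").mkString

  // Each entry commits to the previous entry's hash and its own payload.
  def append(chain: Vector[Entry], payload: String): Vector[Entry] = {
    val prev = chain.lastOption.map(_.hash).getOrElse(Genesis)
    chain :+ Entry(payload, prev, sha256Hex(prev + payload))
  }

  // Recomputes every link; any edited payload or broken link fails the check.
  def verify(chain: Vector[Entry]): Boolean = {
    val (_, ok) = chain.foldLeft((Genesis, true)) { case ((prev, okSoFar), e) =>
      (e.hash, okSoFar && e.prevHash == prev && e.hash == sha256Hex(prev + e.payload))
    }
    ok
  }
}
```

A chain like this only detects edits if the most recent hash is anchored somewhere the editor cannot rewrite, which is where the signed batches mentioned in the row would come in.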

v0.6.0 — Compliance + multi-tenancy

Theme: the SaaS-readiness release.

#	Capability	Area	Status
6.0.a	Multi-tenant Aegis: tenants / projects / per-tenant isolation across keys, agents, audit, policies	Platform	💡
6.0.b	Compliance reports: "list every key any AI agent touched in Q2" exportable as CSV / PDF	Wedge	💡
6.0.c	NIST AI RMF / EU AI Act mapping doc — how Aegis features map to compliance requirements	Wedge	💡
6.0.d	GDPR / data-residency labels on keys + region-based deny rules	Platform	💡
6.0.e	Audit log retention policies + cold-storage tiering (S3 → Glacier)	Platform	💡
6.0.f	SOC2-friendly access logs (separate from operational logs)	Platform	💡

v1.0.0 — API stability + production-hardened

Theme: the "we promise the algebra won't change under you" release.

#	Capability	Area	Status
1.0.a	All `KeyService` operations have full coverage on every shipped RoT	Real KMS	💡
1.0.b	Backward-compatibility guarantees documented for the algebra + REST surface	Platform	💡
1.0.c	Performance targets: 1000 sign ops/sec at p99 < 50ms (with AWS KMS RoT, single pod)	Platform	💡
1.0.d	Chaos/fault testing: journal partition tolerance, RoT outage handling, audit sink backpressure	Platform	💡
1.0.e	Production deployment docs: HA topology, capacity planning, runbook	Platform	💡

Cross-cutting capability tracks

The release tables above slice the work by milestone. The tables below slice by capability area — useful when you're contributing in a specific domain.

Database support

Database	Status	Use-case	Tracking
Postgres 14+	✅ v0.1.0	event journal, audit table	—
CockroachDB	✅ works as Postgres	drop-in HA replacement	document only
MySQL 8+	✅ v0.2.0	event journal	—
MariaDB	✅ wire-compatible with MySQL adapter	event journal	document only
SQLite	✅ v0.2.0	embedded dev / CI / single-node demo	—
DynamoDB	💡 opportunity	AWS-native event store	v0.3.0+
MongoDB	💡 not pursued	document journal model	unlikely — Postgres covers the use

Audit / event fan-out (downstream sinks)

Sink	Status	Tracking
Stdout JSON	✅ v0.1.0	—
Postgres audit table	✅ v0.2.0	—
Generic SIEM webhook	✅ v0.2.0	—
Kafka	✅ v0.2.0	—
NATS / NATS JetStream	✅ v0.2.0	—
AWS SQS / SNS	💡 v0.3.0	issue-tagged `area/audit area/integration/aws`
GCP Pub/Sub	💡 v0.3.0	issue-tagged `area/audit area/integration/gcp`
Azure Event Hubs / Service Bus	💡 v0.3.0	issue-tagged `area/audit area/integration/azure`
Splunk HEC	💡 v0.3.0	issue-tagged `area/audit area/integration/splunk`
OpenTelemetry log export	💡 v0.3.0	issue-tagged `area/observability`
S3 object-store fan-out	🔜 v0.3.0	issue-tagged `area/audit`
WebSocket live audit feed	💡 v0.4.0	issue-tagged `area/audit kind/feature`

Root of Trust adapters (key bytes)

Backend	Status	Tracking
AWS KMS	✅ v0.1.0	—
GCP KMS	🔜 v0.3.0	issue-tagged `area/integration/gcp area/crypto`
Azure Key Vault	🔜 v0.3.0	issue-tagged `area/integration/azure area/crypto`
HashiCorp Vault Transit	🔜 v0.3.0	issue-tagged `area/integration/vault area/crypto`
Software RoT (JCE)	🔜 v0.3.0	issue-tagged `area/crypto`
PKCS#11 (Luna / nShield / YubiHSM / CloudHSM / SoftHSM)	🔜 v0.5.0	issue-tagged `area/crypto kind/security`

Wire planes

Plane	Status	Tracking
REST (`aegis-http`)	✅ v0.1.0 (basic)	extended in v0.1.1 with crypto ops
OpenAPI advertising + Swagger UI	✅ v0.1.1 (`/docs/`)	—
KMIP (`aegis-kmip`) — TTLV / TLS / multi-version	🔜 v0.4.0	issue-tagged `area/wire/kmip`
MCP (`aegis-mcp-server`)	🔜 v0.4.0	issue-tagged `area/wire/mcp area/ai-governance`
Agent-AI (`aegis-agent-ai`) — function-call surface	🔜 v0.4.0	issue-tagged `area/wire/agent-ai`

Policy management

Capability	Status	Tracking
Role/scope allowlist + parent-check	✅ v0.1.0	—
Risk-scored decisions (allow / step-up / deny)	✅ v0.2.0	—
Time-windowed access (key X usable Mon-Fri 9-18 UTC)	🔜 v0.3.0	`area/policy`
Just-In-Time (JIT) access	💡 v0.3.0+	`area/policy area/ai-governance`
Approval workflows (Slack / PagerDuty / OpsGenie)	💡 v0.3.0+	`area/policy area/integration/*`
OPA (Rego) externalized policy	💡 v0.3.0	`area/policy area/integration/opa` (#27)
AWS Cedar policy language	💡 v0.4.0+	`area/policy area/integration/aws`
Policy-as-code with Git (hot-reload on commit)	💡 v0.4.0+	`area/policy`
Policy simulation / preview (dry-run on past 24h)	💡 v0.4.0+	`area/policy kind/feature`
Policy explainer (denied requests return the firing rule)	💡 v0.3.0	`area/policy kind/feature`
Policy versioning & rollback	💡 v0.4.0+	`area/policy`
Tenant isolation in policy evaluation	🔜 v0.6.0	`area/policy area/multi-tenancy`
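
To make the time-windowed row concrete ("key X usable Mon-Fri 9-18 UTC"), here is a minimal sketch using `java.time`. The `Window` shape and the `allows` helper are assumptions for illustration, and the end hour is treated as exclusive, which the roadmap does not specify.

```scala
import java.time.{DayOfWeek, Instant, ZoneOffset}

// Hypothetical sketch: rule shape and helper names are assumptions.
object TimeWindowSketch {
  // Hours are interpreted in UTC as the half-open range [startHour, endHour).
  final case class Window(days: Set[DayOfWeek], startHour: Int, endHour: Int)

  val businessHours: Window = Window(
    Set(DayOfWeek.MONDAY, DayOfWeek.TUESDAY, DayOfWeek.WEDNESDAY, DayOfWeek.THURSDAY, DayOfWeek.FRIDAY),
    startHour = 9,
    endHour = 18
  )

  // True when the instant falls on an allowed day and inside the hour range.
  def allows(w: Window, at: Instant): Boolean = {
    val t = at.atZone(ZoneOffset.UTC)
    w.days.contains(t.getDayOfWeek) && t.getHour >= w.startHour && t.getHour < w.endHour
  }
}
```

Evaluating in UTC keeps the rule independent of where the server or the caller happens to run.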

AI governance (beyond the four pillars)

Capability	Status	Tracking
Agent registry (list all live agents, parents, scopes, last activity)	🔜 v0.3.0	#101
Agent kill-switch ("revoke all agents under alice@org issued in 24h")	🔜 v0.3.0	#102
Per-prompt accountability (record originating LLM prompt)	💡 v0.4.0	`area/ai-governance area/wire/mcp`
Model identifier in audit (`actor.model = "..."`)	💡 v0.4.0	`area/ai-governance area/audit`
Token cost / op-rate tracking per agent	💡 v0.4.0+	`area/ai-governance`
Honey keys / canary keys with auto-alert	✅ v0.2.0	—
Scope-creep detection (effective scope vs. baseline)	💡 v0.3.0	`area/ai-governance area/risk`
Compliance reports (SOC2 / PCI / HIPAA exportable)	💡 v0.6.0	`area/ai-governance area/compliance`
LLM advisor with RAG over audit log	🔜 v0.2.1	`area/ai-governance area/wedge/llm-advisor`
Bring-your-own-LLM provider for advisor	💡 v0.2.1+	`area/ai-governance`
NIST AI RMF / EU AI Act compliance mapping	💡 v0.6.0	`area/ai-governance area/compliance kind/docs`

Observability + ops

Capability	Status	Tracking
Stdout JSON logs	✅ v0.1.0	—
Prometheus metrics (`/metrics`)	✅ v0.1.1	`area/observability`
OpenTelemetry tracing (REST + JDBC + AWS SDK)	✅ v0.1.1	`area/observability`
OpenTelemetry log export	💡 v0.3.0	`area/observability`
`request-id` MDC propagation (logs ↔ audit ↔ response header)	✅ v0.1.1	`area/observability`
`Resource[IO, Unit]` boot scope (graceful shutdown)	✅ v0.1.1	`area/server-tier`
Helm chart	🔜 v0.3.0	`area/deployment`
Kubernetes operator (CRDs)	💡 v0.3.0+	`area/deployment kind/feature`
Docker Compose hardening (no default passwords)	✅ v0.1.1	`area/deployment kind/security`

Authentication + identity

Capability	Status	Tracking
HMAC JWT (HS256) issue + verify	✅ v0.1.0	—
Dev-mode `X-Aegis-User` header	✅ v0.1.0	—
OIDC discovery + JWKS rotation	✅ v0.2.0	—
RS256 / ES256 verifier (EdDSA deferred)	✅ v0.2.0	—
OIDC providers tested: Keycloak, Authentik, Dex, Auth0, Okta	💡 v0.3.0	`area/iam area/integration/oidc`
Agent-token issuance HTTP endpoint	✅ v0.2.0	—
Redis-backed JWT revocation list (jti blacklist)	✅ v0.2.0	—
mTLS for KMIP plane	🔜 v0.4.0	`area/iam area/wire/kmip`
Hardware-bound credentials (WebAuthn for human auth)	💡 v0.4.0+	`area/iam kind/security`

SDK + client surface

SDK	Status	Tracking
Scala SDK (`aegis-sdk-scala`) — full REST coverage	🚧 v0.2.2 (#98)	—
Java SDK (`aegis-sdk-java`) — key ops + agent issuance facade	🚧 v0.2.2 (#98)	full audit/advisor coverage later
Kotlin coroutines wrapper	💡	community welcome
TypeScript / Node SDK	💡	community welcome
Python SDK	💡	community welcome
Go SDK	💡	community welcome
Rust SDK	💡	community welcome
`langchain-aegis` Python package	💡 v0.4.0	`area/sdk area/ai-governance`

Contributing

If you want to pick up something from this roadmap:

1. Find the capability in the tables above.
2. Look at the corresponding GitHub issue (every actionable row has one; search the issues page by the area label, e.g. `area/integration/nats`).
3. Comment on the issue to claim it, ask design questions, or propose a different approach.
4. Open a PR that closes the issue with `Closes #N`.

Items marked 💡 Opportunity don't have detailed designs yet — they're the best candidates for proposing your own approach. Items marked 🔜 Designed have an SPI or skeleton in place; the work is mostly mechanical implementation.

Most issues will be tagged `good first issue` once their parent area has its first contributor; if you're new to the codebase, the `good first issue` filter is the place to start.

How this roadmap is maintained

This file is edited as part of the same PR that ships a feature. When v0.1.1 ships, the v0.1.1 table moves from "next" to "shipped" and v0.1.2's table appears. The status markers in the cross-cutting tables are updated in lockstep.
Major changes in scope (a new milestone, a major capability moving between releases) get their own PR with discussion in the description.
Anything marked 💡 Opportunity can move to 🔜 Designed when somebody writes the design doc (typically a `docs/proposals/<topic>.md` PR) and gets sign-off.

The goal of this file is to be honest about the gap between what we ship and what we plan to ship, to help users decide whether Aegis is right for their use today, and to give contributors a clear picture of where the project is heading.

Open an issue if anything here looks wrong or out of date.