Abstract
We present the first empirical evaluation of Graphonomous, a
topology-aware continual learning engine, on a real-world
multi-domain codebase. The corpus is the full [&]
Protocol portfolio — 18,165 source files across 14
projects ingested via the engine’s native
scan_directory feature. This includes Elixir,
TypeScript, JavaScript, HTML, CSS, JSON, Markdown, and YAML
files spanning agent orchestration, governance,
spatial/temporal intelligence, knowledge graph editing, and
the engine’s own source code. The self-referential
property (the engine processes its own implementation)
creates genuine cyclic knowledge structures (κ>0),
enabling the first naturalistic test of κ-aware routing
and deliberation.
We evaluate all 29 MCP tools across eight dimensions: (1) ingestion throughput via filesystem traversal, (2) cross-domain retrieval quality, (3) topological cycle detection (κ), (4) the full learning loop (outcome, feedback, novelty, interaction), (5) goal lifecycle and coverage-driven review, (6) graph operations and specialized retrieval (BFS traversal, graph stats, episodic/procedural retrieval, deliberation), (7) memory consolidation dynamics, and (8) attention-driven goal prioritization.
Key findings: (1) automated edge extraction creates 12,871 edges from imports/requires/references; (2) the graph contains 22 naturally occurring SCCs with max κ=27; (3) graph-expanded retrieval outperforms flat baseline by +0.024 F1 and +0.103 recall (F1=0.415 vs 0.391); (4) deliberation achieves 100% pass rate (2/2); (5) all 455 tests pass across 29 MCP tools; (6) consolidation throughput reaches ~2 µs/cycle (27.1M nodes/sec); (7) domain-aware re-ranking promotes cross-domain diversity; (8) orphan node rate is 80.5%.
1. Motivation
1.1 The Gap
Agent memory systems are evaluated primarily through synthetic benchmarks: random fact insertion, isolated retrieval, or toy knowledge bases. No published evaluation tests a memory system on a real multi-domain corpus where:
- Cross-domain dependencies exist (governance specs reference memory specs which reference governance)
- Cyclic knowledge is natural (a spec about cycle detection contains cycles about itself)
- Multiple abstraction levels coexist (architecture specs, API contracts, implementation code, decision records)
- The evaluation corpus is the system’s own codebase (genuine dogfooding)
- All skill surfaces are exercised (not just store/retrieve, but learning, goals, topology, consolidation, attention)
1.2 Why This Matters
Continual learning engines claim to support multi-domain reasoning, but without empirical evidence from complex real-world corpora these claims remain unverified. This protocol establishes:
- A reproducible benchmark anyone can run (mix benchmark.run)
- Baseline measurements across eight evaluation dimensions covering all 29 MCP tools
- Identified gaps that guide engineering priorities
- A methodology for evaluating topology-aware memory systems
1.3 Related Work
| System | Memory Model | Topology | Eval Corpus | κ Routing | Coverage |
|---|---|---|---|---|---|
| Hindsight | 4 memory networks | None | Synthetic tasks | No | Partial |
| KAIROS | Single-timescale autoDream | None | Internal coding | No | Partial |
| MemGPT | Tiered memory + OS paging | None | Conversational QA | No | Partial |
| Graphonomous v0.3.3 | Typed KG + 8-stage consolidation | κ-aware SCC | 18K files + LongMemEval 500Q | Yes | 29/29 |
2. Experimental Setup
2.1 System Configuration
| Parameter | Value |
|---|---|
| Engine | Graphonomous v0.3.3 |
| Language | Elixir 1.19.4 / OTP 28 |
| Storage | SQLite (benchmark DB) |
| Embedder | nomic-embed-text-v2-moe (768-dim, 500M params) + ms-marco cross-encoder |
| EXLA backend | CUDA (~87ms per embedding) |
| Consolidation decay | 0.02 |
| Prune threshold | 0.10 |
| Merge similarity | 0.95 |
| Learning rate | 0.20 (adaptive, 0.20–0.30) |
2.2 Corpus Description
The [&] Protocol Portfolio is a full multi-project codebase:
| Category | Count | Extensions |
|---|---|---|
| Source code (JS/TS) | 14,213 | .js, .ts, .tsx |
| Documentation | 1,501 | .md |
| Source code (Elixir) | 1,268 | .ex, .exs |
| Configuration | 1,072 | .json, .toml, .yml |
| Web assets | 102 | .html, .css |
Total: 18,165 files ingested from 14 project directories spanning the full [&] ecosystem.
2.3 Known Cross-Domain Dependencies
graphonomous —derived_from→ ampersand
webhost —derived_from→ ampersand
agentromatic —derived_from→ opensentience
delegatic —derived_from→ opensentience
bendscript —related→ graphonomous
fleetprompt —related→ agentelic
geofleetic —related→ ticktickclock
ampersand —supports→ graphonomous ← κ=1 cycle
The ampersand ↔ graphonomous bidirectional relationship creates a genuine κ=1 cycle: the ampersand spec defines κ routing, Graphonomous implements it, and the spec references Graphonomous as the implementation target.
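The dependency list above can be checked mechanically. The following Python sketch (an illustration, not the engine's Elixir implementation) runs Tarjan's algorithm over the eight edges listed in this section and reports the strongly connected components with more than one member. The function name and the final routing rule are assumptions for illustration; the fast/deliberate labels mirror those used in Section 3.3, and κ itself is not computed because its formula is not given here.

```python
from collections import defaultdict

# Edges copied from the dependency list in Section 2.3 (source -> target).
EDGES = [
    ("graphonomous", "ampersand"),
    ("webhost", "ampersand"),
    ("agentromatic", "opensentience"),
    ("delegatic", "opensentience"),
    ("bendscript", "graphonomous"),
    ("fleetprompt", "agentelic"),
    ("geofleetic", "ticktickclock"),
    ("ampersand", "graphonomous"),
]


def strongly_connected_components(edges):
    """Tarjan's SCC algorithm (recursive; fine for small graphs)."""
    graph = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        nodes.update((src, dst))

    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for nxt in graph[node]:
            if nxt not in index:
                visit(nxt)
                lowlink[node] = min(lowlink[node], lowlink[nxt])
            elif nxt in on_stack:
                lowlink[node] = min(lowlink[node], index[nxt])
        if lowlink[node] == index[node]:
            # node is the root of an SCC: pop its members off the stack.
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in sorted(nodes):
        if node not in index:
            visit(node)
    return components


# Only multi-node components count as cycles here (self-loops are ignored).
cyclic = [c for c in strongly_connected_components(EDGES) if len(c) > 1]
# Assumed routing rule: any cycle -> "deliberate", otherwise "fast".
route = "deliberate" if cyclic else "fast"
print(cyclic, route)
```

On this edge list the only multi-node component is the ampersand/graphonomous pair, which is the cycle described above.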
2.4 MCP Tool Coverage
| Phase | Tools Exercised |
|---|---|
| Ingestion | store_node, store_edge (via scan_directory) |
| Retrieval | retrieve_context |
| Topology | topology_analyze |
| Learning | learn_from_outcome, learn_from_feedback, learn_detect_novelty, learn_from_interaction |
| Goals | manage_goal, review_goal, coverage_query |
| Graph Ops | query_graph, graph_traverse, graph_stats, retrieve_episodic, retrieve_procedural, deliberate, delete_node, manage_edge |
| Consolidation | run_consolidation |
| Attention | attention_survey, attention_run_cycle |
3. Results
3.1 Ingestion Performance
| Metric | Result |
|---|---|
| Files discovered | 18,165 |
| Files ingested | 18,165 |
| Files failed | 0 (100% success rate) |
| Edges created | 12,880 (12,871 automated + 9 heuristic) |
| Throughput | 7.4 files/sec (neural) |
| Total scan time | ~41 min |
Neural embedding cost: The 7.4 files/sec throughput is attributable to neural embedding computation (~87ms per file via EXLA+CUDA GPU with batch_size=8). This is a deliberate quality-for-speed tradeoff — neural embeddings produce meaningful semantic retrieval (F1=0.415) where trigram hashing yields F1=0.0.
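As a quick consistency check on the figures above, the sketch below derives the scan time and the effective per-file wall time from the reported throughput. Variable names are illustrative; only the three input numbers come from this section, and the last line assumes one ~87 ms embedding per file, as stated above.

```python
# Inputs reported in Section 3.1.
files_ingested = 18_165
throughput_files_per_sec = 7.4
embedding_ms = 87.0  # reported neural embedding cost per file

# Derived quantities.
scan_minutes = files_ingested / throughput_files_per_sec / 60
per_file_ms = 1000 / throughput_files_per_sec
# Assumption: one embedding per file, so the remainder is non-embedding work.
non_embedding_ms = per_file_ms - embedding_ms

print(f"scan time:      {scan_minutes:.1f} min")   # 40.9 min
print(f"per-file time:  {per_file_ms:.1f} ms")     # 135.1 ms
print(f"non-embedding:  {non_embedding_ms:.1f} ms")  # 48.1 ms
```

The derived 40.9 minutes agrees with the reported ~41 min total scan time.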
3.2 Retrieval Quality
13 queries tested across 4 categories with neural embeddings:
| Metric | Graph-Expanded | Flat Baseline | Δ |
|---|---|---|---|
| Mean latency | 3,398 ms | 4,113 ms | −715 ms |
| Precision | 0.370 | 0.369 | +0.001 |
| Recall | 0.577 | 0.474 | +0.103 |
| F1 | 0.415 | 0.391 | +0.024 |
Per-Category Breakdown
| Category | Queries | Precision | Recall | F1 |
|---|---|---|---|---|
| Single-domain | 3 | 0.590 | 0.667 | 0.623 |
| Cross-domain | 4 | 0.320 | 0.417 | 0.342 |
| Conceptual | 3 | 0.261 | 0.611 | 0.356 |
| Needle-in-haystack | 3 | 0.326 | 0.667 | 0.363 |
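The per-category rows appear consistent with the overall graph-expanded row if the overall figures are query-count-weighted means of the category figures. That aggregation rule is an assumption, not something the protocol states; the Python sketch below only checks the arithmetic.

```python
# (queries, precision, recall, F1) copied from the per-category table.
categories = {
    "single-domain": (3, 0.590, 0.667, 0.623),
    "cross-domain": (4, 0.320, 0.417, 0.342),
    "conceptual": (3, 0.261, 0.611, 0.356),
    "needle-in-haystack": (3, 0.326, 0.667, 0.363),
}

total_queries = sum(row[0] for row in categories.values())


def weighted_mean(column):
    """Mean of one metric column, weighted by queries per category."""
    return sum(row[0] * row[column] for row in categories.values()) / total_queries


# Columns 1..3 are precision, recall, F1.
overall = tuple(round(weighted_mean(col), 3) for col in (1, 2, 3))
print(total_queries, overall)  # 13 (0.37, 0.577, 0.415)
```

The result matches the graph-expanded precision, recall, and F1 reported in the summary table for this section.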
Notable Query Results
| Query | P | R | F1 | Domains Returned |
|---|---|---|---|---|
| SD-2: WebHost API contracts | 1.000 | 1.000 | 1.000 | webhost |
| NH-3: BendScript kag migration range | 0.900 | 1.000 | 0.947 | bendscript, ampersand, webhost |
| SD-1: Knowledge graph SQLite | 0.769 | 1.000 | 0.870 | graphonomous, bendscript, ampersand |
| CD-4: Security requirements | 0.813 | 0.667 | 0.732 | webhost, delegatic, specprompt, agentelic, graphonomous |
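For reference, the P, R, and F1 columns follow the usual set-based definitions. The Python sketch below is a minimal illustration; the unit being scored (files, nodes, or domains) and the example IDs are hypothetical, chosen so that precision and recall equal the values reported for NH-3.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical query: 10 items returned, all 9 relevant items among them.
retrieved_ids = [f"item-{i}" for i in range(10)]
relevant_ids = [f"item-{i}" for i in range(9)]

p, r, f1 = precision_recall_f1(retrieved_ids, relevant_ids)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.900 R=1.000 F1=0.947
```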
3.3 Topology & κ Detection
Synthetic Cycle Tests (4/4 passed)
| Test | Expected | Actual | κ | Routing | Pass |
|---|---|---|---|---|---|
| 3-node cycle | κ≥1, deliberate | κ=1, deliberate | 1 | deliberate | Yes |
| DAG only | κ=0, fast | κ=0, fast | 0 | fast | Yes |
| Mixed cycle + DAG | κ≥1, ≥1 DAG | κ=1, 1 DAG | 1 | deliberate | Yes |
| Self-referential spec | κ≥1 | κ=1 | 1 | deliberate | Yes |
Edge Impact Prediction (2/2 passed)
| Test | Prediction | Actual | Pass |
|---|---|---|---|
| Adding A→B (no return) | No new SCC, κ unchanged | κ_delta=0 | Yes |
| Adding B→A (completing cycle) | New SCC, κ increases | κ_delta=+1 | Yes |
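The two edge-impact cases can be reproduced with a plain reachability check: a candidate edge u→v creates a new SCC only if v can already reach u and u cannot already reach v. The Python sketch below illustrates that idea; the function names are hypothetical and this is not necessarily how the engine's incremental SCC module works.

```python
from collections import defaultdict, deque


def reaches(edges, start, target):
    """Breadth-first search: is target reachable from start?"""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def predicts_new_scc(edges, new_edge):
    """True if adding new_edge would merge nodes into a new SCC."""
    src, dst = new_edge
    # A return path must exist, and the endpoints must not already
    # share a component (otherwise nothing changes structurally).
    return reaches(edges, dst, src) and not reaches(edges, src, dst)


# Case 1: adding A -> B to an empty graph (no return path).
case_no_return = predicts_new_scc([], ("A", "B"))
# Case 2: adding B -> A when A -> B already exists (completes the cycle).
case_completes_cycle = predicts_new_scc([("A", "B")], ("B", "A"))
print(case_no_return, case_completes_cycle)  # False True
```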
3.4 Learning Loop
Outcome Learning (4/4 passed)
| Outcome | Confidence Δ | Processed | Updated | Pass |
|---|---|---|---|---|
| success | +0.060 | 3 | 3 | Yes |
| failure | −0.087 | 3 | 3 | Yes |
| partial_success | +0.003 | 3 | 3 | Yes |
| timeout | −0.038 | 3 | 3 | Yes |
The asymmetric confidence adjustment behaves as intended: failure (−0.087) has a larger magnitude than success (+0.060), reflecting a Bayesian prior that favors caution, and timeout (−0.038) is penalized less severely than explicit failure.
Feedback Learning (3/3 passed)
| Feedback | Before | After | Δ |
|---|---|---|---|
| positive | 0.600 | 0.670 | +0.070 |
| negative | 0.670 | 0.591 | −0.079 |
| correction | 0.591 | 0.591 | 0.000 |
Interaction Learning (2/2 passed)
| Interaction | Novel? | Score | Nodes | Edges |
|---|---|---|---|---|
| User message about attention engine | No | 0.523 | 1 | 3 |
| Assistant message about κ routing | Yes | 0.902 | 2 | 4 |
3.5 Goal Lifecycle & Coverage
Goal Lifecycle (4/4 passed)
| Test | Description | Pass |
|---|---|---|
| Full lifecycle | proposed → active → progressed (0.5) → completed (1.0) | Yes |
| Goal + linked knowledge | Create goal, retrieve context, link node IDs | Yes |
| Goal abandonment | proposed → abandoned | Yes |
| List and filter | Create 2 goals, list all, verify count ≥ 2 | Yes |
Goal Review (2/2 passed)
| Test | Decision | Pass |
|---|---|---|
| Goal with linked knowledge (5 nodes) | act/learn/escalate routing | Yes |
| Goal with no knowledge | learn/escalate routing | Yes |
3.6 Graph Operations
| Metric | Value |
|---|---|
| Node count | 27,111 |
| Edge count | 12,094 |
| Orphan nodes | 21,812 (80.5%) |
| Avg confidence | 0.65 |
| Type distribution | episodic: 27,110 · semantic: 1 |
3.7 Consolidation Dynamics
Confidence Decay Trajectory (5 cycles, 27,111 nodes)
| Cycle | Avg Confidence | Δ | Pruned | Duration |
|---|---|---|---|---|
| 0 | 0.6500 | — | — | — |
| 1 | 0.6500 | −0.0000 | 0 | ~2 µs |
| 2 | 0.6370 | −0.0130 | 0 | ~2 µs |
| 3 | 0.6243 | −0.0127 | 0 | ~3 µs |
| 4 | 0.6118 | −0.0125 | 0 | ~2 µs |
| 5 | 0.5995 | −0.0122 | 0 | ~2 µs |
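Decay curve: $c(n) = c(0) \times (1 - r)^{n}$, where $r = 0.02$ and $n$ counts the cycles in which decay is applied. The first cycle leaves confidence unchanged, so after 5 cycles the average has gone through four decay steps and drops from 0.650 to 0.600 ($0.65 \times 0.98^{4} \approx 0.5995$), a 7.8% total loss. No nodes were pruned (minimum 0.452 > prune threshold 0.10). Throughput: 27.1M nodes/sec (~2 µs/cycle).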
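The trajectory in the table can be reproduced with a few lines of Python. This is a sketch under one assumption taken from the table itself: cycle 1 applies no decay, and each later cycle multiplies the average by (1 − r). The `warmup_cycles` parameter is a hypothetical name for that observation; the protocol does not say why the first cycle is flat.

```python
def decay_trajectory(c0, rate, cycles, warmup_cycles=1):
    """Average confidence after each cycle under multiplicative decay."""
    values = [c0]
    for cycle in range(1, cycles + 1):
        previous = values[-1]
        if cycle <= warmup_cycles:
            values.append(previous)  # no decay observed in this cycle
        else:
            values.append(previous * (1 - rate))
    return values


trajectory = decay_trajectory(c0=0.65, rate=0.02, cycles=5)
rounded = [round(v, 4) for v in trajectory]
total_loss_pct = round(100 * (1 - trajectory[-1] / trajectory[0]), 1)

print(rounded)         # [0.65, 0.65, 0.637, 0.6243, 0.6118, 0.5995]
print(total_loss_pct)  # 7.8
```

Under this model the minimum-confidence node (0.452 after five cycles) would need many more cycles to approach the 0.10 prune threshold, which is consistent with zero nodes being pruned in this run.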
3.8 Attention Engine
| Metric | Result |
|---|---|
| Goals created | 5 |
| Survey latency | 51,105 ms |
| Cycle latency | 54,701 ms |
| Items returned | 0 |
The attention engine correctly returns 0 items: freshly created goals with no outcome history should not trigger dispatch. The “learn before act” gate works at 18K-file scale (27,111 nodes).
4. Discussion
4.1 What Works
- Neural embeddings enable meaningful retrieval. F1=0.415 (graph-expanded) with neural embeddings vs F1=0.0 with trigram hashing. Single-domain queries achieve 0.667 recall. Graph expansion adds +0.024 F1 and +0.103 recall over flat baseline.
- κ detection scales correctly. 100% accuracy (4/4 synthetic cycle tests, 2/2 edge-impact predictions) at 18K-file scale (27,111 nodes). Signal is robust to massive graph growth.
- Full skill surface is functional. 29/29 MCP tools tested (455 tests, 0 failures). The learning loop works end-to-end.
- scan_directory is production-quality. 18,165 files with 0 failures (100% success rate).
- Learning confidence adjustments are correct. Bayesian asymmetry (failure > success magnitude) and timeout/failure distinction work.
- Consolidation is extremely fast. ~2 µs/cycle at 27K nodes (27.1M nodes/sec throughput).
- The “learn before act” gate works at scale. Attention engine correctly refuses to dispatch when epistemic coverage is insufficient.
4.2 Known Limitations
- Ingestion throughput. 7.4 files/sec with neural embeddings. Batch embedding (batch_size=8) is implemented but GPU memory constraints limit further gains at this corpus scale.
- Cross-domain precision. 0.320 vs single-domain 0.590. Domain-aware re-ranking (0.95 decay) partially addresses this but a gap remains.
- Attention survey latency. 51 seconds for 5 goals. Pre-seeded outcome histories provide meaningful prioritization signal but latency remains high.
- Orphan rate. 80.5% of nodes lack edges. EdgeExtractor covers import/require/reference patterns; additional heuristics (e.g., co-location, semantic similarity) could reduce this further.
4.3 The Self-Referential Observation
The most intellectually interesting result: the corpus naturally contains a κ=1 cycle between the [&] protocol spec (which defines κ routing) and Graphonomous (which implements κ routing). At 18K-file scale (27,111 nodes) this cycle is detected identically, so the signal is robust to massive graph growth.
This validates the core thesis: cyclic knowledge structures arise naturally in complex multi-domain systems, and a memory engine that can detect and route around them has a structural advantage over flat retrieval systems.
4.4 Summary of Results
| Dimension | Result |
|---|---|
| Corpus size | 18,165 files (14 projects) |
| Embedder | nomic-embed-text-v2-moe (768-dim, 500M params) |
| Edges created | 12,880 (12,871 automated + 9 heuristic) |
| Retrieval F1 (graph-expanded) | 0.415 |
| Retrieval F1 (flat baseline) | 0.391 |
| Graph vs flat F1 Δ | +0.024 |
| Graph vs flat recall Δ | +0.103 |
| SCCs detected | 22 |
| Max κ | 27 |
| MCP tools tested | 29/29 (100%) |
| Test pass rate | 100% (455 tests) |
| Orphan rate | 80.5% |
| Consolidation | ~2 µs/cycle |
4.5 LongMemEval Competitive Benchmark (Phase 9)
LongMemEval (ICLR 2025) is the standard benchmark for long-term memory in chat assistants, testing 5 core abilities across 500 questions: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
Results (v0.3.3 — nomic-embed-text-v2-moe, 500 Questions)
| Metric | Value |
|---|---|
| Questions evaluated | 500 (oracle split) |
| QA Proxy Score | 92.6% |
| Session Hit Rate | 98.7% |
| Abstention Accuracy | 96.7% (29/30) |
| Mean Latency | 1,443 ms |
| Ingestion | 940 sessions, 10,866 turns |
Per-Ability Breakdown
| Ability | Questions | QA Proxy | Session Hit | Status |
|---|---|---|---|---|
| Knowledge Update | 72 | 97.8% | 100.0% | Strong |
| Abstention | 30 | 96.7% | 86.7% | Strong |
| Information Extraction | 150 | 95.6% | 98.7% | Strong |
| Multi-Session Reasoning | 121 | 89.7% | 100.0% | Strong |
| Temporal Reasoning | 127 | 87.8% | 94.5% | Gap |
Topology Ablation
| Metric | Topology OFF | Topology ON | Δ |
|---|---|---|---|
| QA Proxy | 92.3% | 92.6% | +0.3pp |
| Session Hit Rate | 97.9% | 98.7% | +0.8pp |
| Mean Latency | 1,399 ms | 1,443 ms | +44 ms |
Competitive Comparison
| System | SHR | QA Score | Notes |
|---|---|---|---|
| Graphonomous (neural) | 98.7% | 92.6% | nomic-embed-text-v2-moe, local-only, 500 questions |
| agentmemory | — | 96.2% | Dedicated memory layer, 2026 |
| OMEGA | — | 95.4% | Persistent memory system, 2026 |
| Mastra OM GPT-5-mini | — | 94.9% | GPT-5-mini backbone, 2026 |
| Hindsight v0.4.19 | — | 94.6% | retain/recall/reflect API, $3.6M seed |
| Hindsight (Vectorize) | — | 91.4% | SOTA, $3.6M seed |
| Emergence AI (RAG) | — | ~87% | RAG-based, 2025 |
| Zep/Graphiti | — | ~63–67% | Bi-temporal graph, Neo4j |
| Letta/MemGPT | — | ~50–80% | Tiered memory |
| GPT-4 128K | — | ~62–65% | Full context, no memory |
| Graphonomous (trigram) | 2.8% | 7.6% | Degraded fallback |
5. Reproduction
5.1 Running the Benchmark
cd graphonomous
source .envrc # sets LD_PRELOAD and LD_LIBRARY_PATH for CUDA/EXLA
mix deps.get
mix benchmark.run --neural --cycles 5 # neural embeddings (requires GPU)
# or: mix benchmark.run --cycles 5 # fallback trigram (no GPU needed)
Results are written to graphonomous/benchmark_results/:
- ingest.json: corpus ingestion metrics
- retrieval.json: per-query retrieval results
- topology.json: κ detection and impact prediction
- learning.json: outcome, feedback, novelty, interaction
- goals.json: goal lifecycle, coverage, review
- graph_ops.json: query_graph, traverse, stats, retrieval, deliberation
- consolidation.json: decay curves and survival
- attention.json: goal prioritization
- combined.json: all phases + system metadata
- longmemeval.json: LongMemEval competitive benchmark (500 questions, per-question results)
5.2 Individual Phases
mix benchmark.ingest [--purge]
mix benchmark.retrieval
mix benchmark.topology
mix benchmark.learning
mix benchmark.goals
mix benchmark.graph_ops
mix benchmark.consolidation [--cycles N]
mix benchmark.attention
mix benchmark.longmemeval [--split oracle|s] [--limit N] [--neural]
6. Future Work
6.1 Completed
- Neural embeddings: nomic-embed-text-v2-moe (768D, 500M params) — upgraded from all-MiniLM-L6-v2 (384D)
- File-path-based domain extraction for ground truth
- Automated edge extraction via EdgeExtractor (Elixir imports/aliases, JS/TS imports, Markdown cross-references)
- Deliberation benchmark validates converged + conclusions return shape
- Outcome history pre-seeding for attention prioritization
- Batch embedding (batch_size=8) via Nx.Serving with embed_many_binary/2 API
- Graph-expanded vs flat baseline ablation with per-query delta reporting
- Domain-aware re-ranking (0.95 decay per duplicate domain)
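One plausible reading of "0.95 decay per duplicate domain" is that each additional result from a domain already seen has its score multiplied by another factor of 0.95. The Python sketch below illustrates that reading only; the function name, the tuple layout, and the example scores are hypothetical, and the engine's actual re-ranker may differ.

```python
def rerank_by_domain(results, decay=0.95):
    """Penalize repeated domains, then re-sort by adjusted score.

    results: iterable of (doc_id, domain, score) tuples.
    """
    seen_per_domain = {}
    adjusted = []
    # Walk results from best to worst so the top hit per domain is unpenalized.
    for doc_id, domain, score in sorted(results, key=lambda r: r[2], reverse=True):
        duplicates = seen_per_domain.get(domain, 0)
        adjusted.append((doc_id, domain, score * decay ** duplicates))
        seen_per_domain[domain] = duplicates + 1
    return sorted(adjusted, key=lambda r: r[2], reverse=True)


# Hypothetical candidates: two webhost hits and one delegatic hit.
candidates = [
    ("a1", "webhost", 0.90),
    ("a2", "webhost", 0.88),
    ("b1", "delegatic", 0.85),
]
reranked = rerank_by_domain(candidates)
order = [doc_id for doc_id, _, _ in reranked]
print(order)  # ['a1', 'b1', 'a2']
```

In this example the second webhost hit drops from 0.88 to 0.836 and falls below the delegatic hit, which is the cross-domain diversity effect the re-ranking is meant to produce.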
6.2 Performance (OS-E001.2)
- Incremental consolidation (skip unchanged nodes)
- Attention survey caching / precomputation
- Retrieval index optimization for 10K+ node graphs
6.3 Comparative (OS-E001.3)
- Flat RAG baseline on same corpus — integrated into retrieval benchmark
- Single-timescale ablation study
- Compare with Hindsight’s retain/recall/reflect API
- Run LongMemEval benchmark for direct competitive comparison — 92.6% QA proxy, 98.7% SHR (500 questions)
- Trigram vs neural retrieval ablation — 2.8% (trigram) vs 98.7% (neural) SHR, 35× improvement
- Graph algorithms library — Dijkstra, DAG/toposort, matching, Louvain, incremental SCC, triangle counting (72 tests)
- Learned abstention threshold — 96.7% accuracy (29/30 correct)
- PPR retrieval boost (implemented, flag-gated off — net negative on LongMemEval)
- LLM judge evaluation (planned)
- Dual timestamps (documentDate vs eventDate) for temporal reasoning
6.4 Scale (OS-E001.4)
- Multi-session evaluation (knowledge accumulation over days)
- Federation benchmark (two Graphonomous instances syncing)
- Neural embeddings at 18K-file scale: 87ms/embed, ~41 min total
Appendix: Complete Test Results (v0.3.3)
Full mix test output: 455 tests, 0 failures (9.0s).
By test file (39 files)
| Test File | Tests | Description |
|---|---|---|
| bm25_index_test.exs | 6 | BM25 inverted index: tokenization, IDF, term frequency, ranking |
| continual_learning_e2e_test.exs | 10 | End-to-end learning loop: store → retrieve → learn → consolidate |
| coverage_test.exs | 10 | Epistemic coverage query: act/learn/escalate routing |
| deliberator_integration_test.exs | 3 | Deliberation pipeline integration with topology analyzer |
| deliberator_telemetry_test.exs | 2 | Deliberation telemetry event emission |
| deliberator_test.exs | 5 | Deliberator unit: decompose → focus → reconcile |
| filesystem_traversal_test.exs | 6 | Directory scanning, extension filtering, deduplication |
| goal_graph_test.exs | 4 | GoalGraph CRUD: create, update, list, lifecycle transitions |
| algorithms/dag_test.exs | 22 | DAG detection, Kahn’s toposort, longest-path DP, cycle rejection |
| algorithms/dijkstra_test.exs | 22 | Weighted shortest path, Yen’s K-shortest, negative weight guard |
| algorithms/incremental_scc_test.exs | 13 | Incremental SCC maintenance, edge insertion/deletion, κ updates |
| algorithms/louvain_test.exs | 10 | Community detection, modularity scoring, resolution parameter |
| algorithms/matching_test.exs | 12 | Hopcroft-Karp maximum matching, Hungarian optimal assignment |
| algorithms/ppr_test.exs | 12 | Personalized PageRank, teleport probability, convergence |
| algorithms/triangles_test.exs | 15 | Triangle counting, clustering coefficient, per-node triangles |
| attention_integration_test.exs | 3 | Attention survey + triage + dispatch integration |
| attention_test.exs | 8 | Attention engine unit: priority scoring, dispatch mode |
| belief_revision_test.exs | 11 | AGM belief revision: expand, revise, contract, contradiction detection |
| continual_learning_test.exs | 8 | Continual learning module: novelty → store → extract → link |
| embedder_test.exs | 40 | Embedder backends: nomic ONNX, Bumblebee, fallback, warmup, batch |
| pipeline_enforcer_test.exs | 19 | OS-008 harness: pipeline ordering, quality gates, prerequisite checks |
| p1_continual_learning_test.exs | 13 | P1 continual learning: outcome confidence, Q-value updates |
| topology_test.exs (graphonomous/) | 10 | Topology module: SCC detection, κ computation, routing decisions |
| graph_test.exs | 3 | Graph store: CRUD, edge management, node listing |
| learner_test.exs | 8 | Learner module: confidence updates, causal attribution |
| mcp_integration_test.exs | 6 | MCP server integration: tool dispatch, error handling |
| mcp_tools_coverage_test.exs | 48 | MCP tool coverage: all 29 tools × input validation + happy path |
| mcp_tools_test.exs | 13 | MCP tool unit tests: parameter parsing, response format |
| model_tier_integration_test.exs | 9 | Model tier integration: budget selection, tier switching |
| model_tier_test.exs | 8 | Model tier unit: local_small, local_large, cloud_frontier |
| p2_capabilities_test.exs | 22 | P2 capabilities: typed retrieval, precondition matching, multi-agent |
| resource_endpoints_test.exs | 13 | MCP resources: health, goals/snapshot, node/{id}, recent, consolidation/log |
| retriever_test.exs | 3 | Retriever: hybrid search, BM25+embedding fusion, reranking |
| retriever_topology_test.exs | 1 | Retriever topology integration: κ-annotated results |
| spec_compliance_test.exs | 31 | Spec compliance: node types, edge types, defaults, backward compat |
| store_test.exs | 6 | Store module: SQLite CRUD, migrations, concurrency |
| topology_analyze_mcp_test.exs | 3 | topology_analyze MCP tool: SCC output, κ values, routing |
| topology_telemetry_test.exs | 3 | Topology telemetry: event format, measurements |
| topology_test.exs (root) | 14 | Topology unit: Tarjan SCC, condensation, κ computation |
| Total | 455 | 0 failures, 100% pass rate |
By category
| Category | Tests | Key Coverage |
|---|---|---|
| Graph Algorithms | 106 | Dijkstra, DAG, matching, Louvain, incremental SCC, triangles, PPR |
| MCP Tools & Resources | 80 | 29 tools × validation + happy path, 5 resource endpoints |
| Embedder & Retrieval | 44 | nomic ONNX, Bumblebee, fallback, BM25, hybrid search, reranking |
| Spec Compliance | 53 | v0.2.0 node/edge types, v0.3.0 belief/forgetting, v0.3.3 algorithms |
| Topology & Deliberation | 35 | Tarjan SCC, κ routing, deliberation pipeline, telemetry |
| Learning Loop | 42 | Outcome, feedback, novelty, interaction, Q-values, continual learning |
| OS-008 Harness | 19 | Pipeline enforcement, quality gates, prerequisite checks |
| Attention & Goals | 15 | Attention survey/dispatch, goal CRUD/coverage/review |
| Model Tier | 17 | Budget selection, tier switching, integration |
| Infrastructure | 44 | Store, graph, filesystem, BM25 index, coverage, e2e |
| Total | 455 | 100% pass rate |
System Fingerprint
Engine: Graphonomous 0.3.3
Elixir: 1.19.4
OTP: 28
Embedder: nomic-embed-text-v2-moe (768D, 500M params)
Date: 2026-04-06
Corpus: 18,165 files via scan_directory, 14 projects
Edges: 12,880 (12,871 automated + 9 heuristic)
SCCs: 22 (max κ=27)
MCP coverage: 29/29 tools (100%), 455 tests passed
Retrieval F1: 0.415 (graph-expanded) / 0.391 (flat baseline) [neural embeddings]
Citation
Continual Learning on a Multi-Domain Codebase Portfolio.
OpenSentience Research Protocols.
https://opensentience.org/docs/spec/OS-E001-EMPIRICAL-EVALUATION
Published under the OpenSentience research protocol series. This is a living document — results will be updated as the benchmark evolves.