# Matsya - spec_0.1a (rigorous RAG for theory + translation)

matsya #rag #correctness #translation #theory #verification
This document specifies the first correctness-oriented architecture for Matsya as a Bellman calculus research and translation assistant.
It is distinct from the earlier deployment note `development/dolo-plus-impl/spec_0.1k-rag-assistant.md`:

- `0.1k` is mainly about standing up the assistant in the existing hosting stack.
- `0.1a` is about how Matsya should be built if we want high-confidence theory answers, high-confidence YAML <-> formal MDP translation, and a clean path toward future training on verified examples.
The motivating use cases are already visible in the current docs:
- theory and research Q&A over Bellman calculus, DDSL, category-theoretic foundations, and dynamic programming
- strict dolo-plus YAML <-> formal MDP translation
- refinement / round-tripping for higher confidence
- runtime de-sugaring / interpretation support in development workflows
See:

- `AI/matsya/matsya.md`
- `AI/context/matsya-context.md`
- `development/research-notes/matsya-desugar-whisperer.md`
## 0. Implementation location
This spec lives in bellman-ddsl because that repository contains the canonical theory, spec, examples, and curated context that Matsya must read.
However, Matsya itself is implemented as part of the HARK_ask-your-project stack, not as a runtime subsystem inside bellman-ddsl.
Current local implementation target:
/Users/akshayshanker/Desktop/econark/HARK_ask-your-project/
Accordingly, the boundary is:
`bellman-ddsl`:

- canonical docs/specs/examples
- curated RAG context
- development notes describing desired Matsya behavior

`HARK_ask-your-project`:

- retrieval/index build pipeline
- vector store / metadata store
- system prompts
- UI / CLI / deployment
- runtime answer assembly and tool orchestration
Implementation-facing files currently visible in HARK_ask-your-project include:

- `config/repositories.yml`
- `config/system_prompts/Bellman-DDSL.md`
- `scripts/build/build_final_index.py`
- `scripts/build/repository_discovery.py`
- `scripts/rag-cli/rag`
- `scripts/rag-cli/matsya-local`
- `app.py`
So this document should be read as:

- a requirements/specification document stored in `bellman-ddsl`, but
- a document whose runtime/build obligations are primarily implemented in `HARK_ask-your-project`.
## 1. Goal
Matsya should support two primary tasks:
- Theory mode: help users understand the mathematical and semantic structure of BE-DDSL, including theory that is still under development.
- Translation mode: translate between dolo-plus YAML and formal mathematical MDP writeups with strong guarantees of structural correctness.
The core requirement is correctness before fluency. If Matsya lacks evidence, or if the retrieved sources disagree, it should expose the uncertainty rather than synthesize a smooth but unsupported answer.
## 2. Modes and correctness targets
| Mode | Primary output | Correctness target |
|---|---|---|
| Theory / research Q&A | Explanation with equations and citations | Every substantive claim is grounded in retrieved sources; unresolved tensions are surfaced explicitly |
| Translation: YAML -> formal MDP | Structured mathematical writeup | Output matches retrieved syntax/semantic rules and the input YAML; no invented fields or operators |
| Translation: formal MDP -> YAML | Valid dolo-plus stage YAML | Output passes parser/schema/consistency checks |
| Refine / repair | Improved YAML or writeup plus diff | Each iteration reduces inconsistency; fixed-point or explicit failure reason |
| Development-theory mode | Explanation of provisional or dev-spec semantics | Clearly separate canonical sources from provisional dev notes |
The key asymmetry is:
- In translation mode, correctness can often be checked mechanically.
- In theory mode, correctness is mostly evidence-grounded rather than fully formal, unless and until we attach formal proof artifacts.
## 3. Design principles

### 3.1 Canonical-first retrieval
Matsya should prefer sources in this order:
- canonical docs/spec/reference pages
- canonical example YAMLs and grammar/code sources
- development specs and working notes
- curated external theory and literature
This avoids a common failure mode where draft notes or prompt mirrors out-rank the actual project canon.
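As a minimal sketch of this ordering, assuming retrieved candidates carry hypothetical `tier` and `score` fields (neither is an existing Matsya data structure), canon-first ranking is a compound sort key:

```python
def canonical_first(candidates: list[dict]) -> list[dict]:
    """Order retrieved candidates canon-first: lower trust tier wins,
    ties broken by similarity score. 'tier' and 'score' are assumed keys."""
    return sorted(candidates, key=lambda c: (c["tier"], -c["score"]))
```

A high-scoring draft note then never outranks a canonical spec page, which is exactly the failure mode this principle guards against.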
### 3.2 Structure fidelity
Do not flatten math-heavy documents into generic text windows if a stronger document structure is available.
For Matsya, the native structure is often the meaning-bearing structure:
- a section / subsection in a theory page
- a theorem / definition / proposition block in TeX
- a YAML block such as `symbols`, `equations`, `cntn_to_dcsn_mover`
- a grammar rule or parser normalization note
Recent academic-RAG work argues for preserving this hierarchy rather than destroying it during chunking. This is especially important for mathematical and specification-heavy corpora.
### 3.3 Evidence before synthesis
Answer generation should happen only after Matsya has assembled an evidence packet:
- retrieved chunks
- reranked support
- source identities
- trust tier
- local contradictions or missing pieces
This is more important than raw model size.
### 3.4 Machine checking whenever possible
If a claim can be checked by a parser, schema validator, round-trip procedure, or compiler-like consistency pass, Matsya should use that route rather than relying on free-form generation.
### 3.5 Fail safely
For this project, "I cannot verify that from the current sources" is a better answer than a polished hallucination.
## 4. Corpus policy
The corpus should be organized into trust tiers. AI/context/matsya-context.md is the curation policy for what to ingest.
### Tier 1: project canon

- `docs/`
- `packages/dolo/examples/models/doloplus/`
- `packages/dolang/dolang/`
- selected reference pages such as symbol conventions, schema, and syntax-semantic rules
### Tier 2: active development sources

- `AI/working/docs/`
- `development/dolo-plus-impl/`
- selected research notes that explain active design decisions
### Tier 3: curated external theory

- `AI/context/literature/`
- curated modular-MDP theory sources
- selected Dyn-X / StageClass theory notes
### Tier 4: optional large or noisy sources
- PDFs
- exploratory notebooks
- large external codebases
Tier 4 should be used sparingly because it often hurts retrieval precision.
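The tier policy above can be sketched as a path-prefix lookup. This is an illustrative assumption, not an existing Matsya function; the prefixes mirror the tier listings in this section.

```python
# Hypothetical mapping from source path to trust tier (section 4 tiers).
TIER_PREFIXES = [
    (1, ("docs/", "packages/dolo/examples/models/doloplus/",
         "packages/dolang/dolang/")),
    (2, ("AI/working/docs/", "development/dolo-plus-impl/")),
    (3, ("AI/context/literature/",)),
]

def assign_tier(path: str) -> int:
    """Return the trust tier for a source path; anything unmatched falls
    into tier 4 (optional large or noisy sources)."""
    for tier, prefixes in TIER_PREFIXES:
        if path.startswith(prefixes):
            return tier
    return 4
```

Defaulting unknown paths to tier 4 keeps the sparingly-used bucket as the catch-all, so nothing silently inherits canonical status.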
## 5. Document preparation and chunking

### 5.1 Chunk by semantic unit, not only by token count
Recommended chunking policy:
- Markdown: split by heading; keep equations and nearby explanatory prose together
- YAML: split by semantic block (`symbols`, `equations`, sub-blocks such as `V[>]`, `dcsn_to_cntn_transition`, `Bellman`)
- TeX: split by `section`, `subsection`, `definition`, `theorem`, `proposition`, and equation neighborhoods
- Code / grammar: split by parser rule, transformation function, or validation routine
Fixed-size windows may still be used inside very large sections, but only after the semantic boundary has been respected.
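For the markdown case, the policy can be sketched as a heading-first splitter with a fixed-size fallback applied only inside oversized sections. This is a generic illustration, not Matsya's actual chunker.

```python
import re

def chunk_markdown(text: str, max_chars: int = 4000) -> list[dict]:
    """Split markdown at headings so equations stay with their nearby
    prose; apply fixed windows only within an oversized section."""
    sections, current, heading = [], [], "(preamble)"
    for line in text.splitlines():
        if re.match(r"^#{1,6}\s", line):           # a heading starts a new section
            if current:
                sections.append({"heading": heading, "body": "\n".join(current)})
            heading, current = line.lstrip("# ").strip(), []
        else:
            current.append(line)
    if current:
        sections.append({"heading": heading, "body": "\n".join(current)})
    chunks = []
    for sec in sections:                            # secondary size-based split
        body = sec["body"]
        for i in range(0, max(len(body), 1), max_chars):
            chunks.append({"heading": sec["heading"],
                           "body": body[i:i + max_chars]})
    return chunks
```

The key property is ordering: the semantic boundary is respected first, and token/character budgets only subdivide within it.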
### 5.2 Preserve metadata
Each chunk should carry at least:
- source path
- heading / section path
- document type (`spec`, `theory`, `yaml-example`, `grammar`, `working-note`, `literature`)
- trust tier
- key symbols / operators mentioned
- model family / example family where relevant
### 5.3 Prefer source text over PDFs
For mathematics, native .tex and .md files are strongly preferred over PDF extraction. If both exist, Matsya should rank the source files above the PDF text.
### 5.4 De-duplicate mirrors
Prompt-context mirrors, repeated exports, and copied external folders should not all be indexed equally. Duplicate content lowers retrieval quality and inflates spurious confidence.
## 6. Retrieval architecture

### 6.1 One corpus, multiple retrieval profiles
Matsya can share one underlying index, but retrieval should be profile-specific.
#### Theory profile
Prioritize:
- theory docs
- syntax-semantic rules
- symbol conventions
- curated literature and foundational notes
#### Translation profile
Prioritize:
- schema/spec pages
- canonical example YAMLs
- Dolang grammar / parser behavior
- working notes that record active syntax decisions
#### Development-theory profile
Prioritize:
- dev specs
- research notes
- active working notes
But the final answer must label claims as:
- canonical
- development / provisional
- unresolved / ambiguous
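One lightweight way to realize profile-specific retrieval over a shared index is per-profile doc-type boosts. The weights and labels below are assumptions for illustration (doc types follow section 5.2); the real values belong in configuration.

```python
# Hypothetical doc-type boost tables, one per retrieval profile.
PROFILES = {
    "theory":      {"theory": 2.0, "spec": 1.5, "literature": 1.2, "working-note": 0.8},
    "translation": {"spec": 2.0, "yaml-example": 2.0, "grammar": 1.5, "working-note": 1.0},
    "dev-theory":  {"working-note": 2.0, "spec": 1.2, "theory": 1.2},
}

def profile_score(base_score: float, doc_type: str, profile: str) -> float:
    """Rescale a raw retrieval score by the active profile's boost;
    unlisted doc types keep their base score."""
    return base_score * PROFILES[profile].get(doc_type, 1.0)
```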
### 6.2 Hybrid retrieval
Matsya should not rely on vector search alone.
Recommended stack:
- dense retrieval for semantic recall
- lexical / BM25 retrieval for exact notation, symbol names, and file-local terminology
- fused candidate set
- reranking step before answer generation
This is particularly important for syntax questions such as `V[>][label]`, `argmax_{c}`, or `dcsn_to_cntn_transition`.
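One standard way to build the fused candidate set is reciprocal rank fusion (RRF), which needs only the two rank orderings, not comparable scores. A minimal sketch:

```python
def rrf_fuse(dense: list[str], lexical: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion of two ranked candidate lists:
    score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in (dense, lexical):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF ignores raw scores, it is robust to the incomparable scales of cosine similarity and BM25, which is why it is a common default fusion step.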
### 6.3 Reranking
Use a reranker to improve precision after recall:
- cross-encoder reranker if available
- otherwise an LLM scoring pass over candidate chunks
The reranker should favor:
- definition-bearing chunks
- local equation neighborhoods
- canonical sources over drafts when both are relevant
### 6.4 Query expansion and adaptive retrieval
For hard questions, retrieval should be allowed to adapt:
- reformulate the query
- retrieve by symbols and by meaning
- widen from exact operator names to surrounding semantic concepts
- retrieve additional support if the first pass is thin or contradictory
This is closer to scientific-RAG and proof-search systems than to a static one-shot retriever.
### 6.5 Retrieve examples, not just definitions
For translation and proof-like tasks, Matsya should retrieve:
- relevant definitions
- similar YAML examples
- similar formal writeups
- nearby or analogous solved cases
The closest analog from theorem-proving systems is retrieving both premises and similar proofs from the same project.
## 7. Answer synthesis

### 7.1 Theory mode
Theory answers should be:
- grounded in retrieved sources
- citation-rich
- explicit about what is canonical vs provisional
- conservative about unstated claims
Long answers should use an outline-guided synthesis pass so that the final answer is coherent rather than a bag of retrieved snippets.
### 7.2 Translation mode
Translation should be treated as a constrained transduction task:
- retrieval seeds conventions and examples
- the generator proposes a candidate output
- validators check the output
- refinement repairs the output if needed
This is different from ordinary prose QA. The output must satisfy structural rules, not merely sound plausible.
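The propose-validate-refine loop above can be sketched as a small control skeleton. The two callables stand in for Matsya's generator and validator stack (both hypothetical here); the loop stops at a fixed point (no errors) or after a bounded number of repairs.

```python
from typing import Callable

def refine_loop(propose: Callable[[str], str],
                validate: Callable[[str], list[str]],
                candidate: str,
                max_iters: int = 3) -> tuple[str, list[str]]:
    """Constrained transduction: validators gate the candidate output and
    their error messages are fed back to the proposer for repair."""
    errors = validate(candidate)
    for _ in range(max_iters):
        if not errors:                      # fixed point: output is valid
            break
        candidate = propose("; ".join(errors))
        errors = validate(candidate)
    return candidate, errors                # non-empty errors = explicit failure
```

Returning the residual error list (rather than raising) matches the spec's requirement of a "fixed-point or explicit failure reason".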
### 7.3 Development-theory mode
When the question touches active design work, Matsya should surface disagreement directly. For example:
- what the canonical spec says
- what the current dev-spec says
- whether the point is still marked as ambiguous
This is essential for a project where the semantics are still evolving.
## 8. Verification stack

### 8.1 Translation verification
For generated YAML, Matsya should aim for the following checks:
- YAML parse success
- Dolang parse success for embedded equations
- schema / symbol-group validation
- internal consistency checks:
- declared symbols vs used symbols
- perch-tag consistency
- branch-label consistency
- operator/binder consistency
- optional round-trip refinement:
- YAML -> formal MDP -> YAML
- formal MDP -> YAML -> formal MDP
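The declared-vs-used symbol check is the simplest of these passes to illustrate. The sketch below extracts identifiers from equation strings with a bare regex, which is a deliberate simplification of real Dolang parsing; the function whitelist is also an assumption.

```python
import re

def check_symbols(declared: set[str], equations: list[str]) -> dict[str, set[str]]:
    """Declared-vs-used consistency pass over equation strings.
    Regex tokenization is a stand-in for proper Dolang parsing."""
    used: set[str] = set()
    for eq in equations:
        used |= set(re.findall(r"[A-Za-z_][A-Za-z_0-9]*", eq))
    builtins = {"max", "min", "exp", "log"}      # assumed function whitelist
    return {
        "undeclared": used - declared - builtins,  # used but never declared
        "unused": declared - used,                 # declared but never used
    }
```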
### 8.2 Theory verification
Theory answers cannot usually be machine-proved at v0.1a, but they can still be checked for evidential rigor:
- every nontrivial claim must be supportable by retrieved chunks
- equations should be copied or paraphrased only when the source clearly supports them
- if sources conflict, Matsya must say so
- if evidence is weak, Matsya must lower its confidence
### 8.3 Correctness ladder
Matsya outputs should be thought of in levels:
- Retrieved: relevant sources were found
- Grounded: answer explicitly relies on those sources
- Parser-valid: output syntax is accepted
- Schema-valid: output satisfies structural constraints
- Round-trip stable: output survives refinement / round-trip
- Human-reviewed: output accepted as canonical example
Translation mode should push as high on this ladder as possible.
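Because the ladder is ordered, it can be encoded directly as an ordered enum, with the achieved level being the highest contiguous rung whose check passed. The check names below are illustrative.

```python
from enum import IntEnum

class Level(IntEnum):
    """The correctness ladder of section 8.3 as an ordered scale."""
    RETRIEVED = 1
    GROUNDED = 2
    PARSER_VALID = 3
    SCHEMA_VALID = 4
    ROUND_TRIP_STABLE = 5
    HUMAN_REVIEWED = 6

def achieved_level(checks: dict[str, bool]) -> Level:
    """Highest contiguous rung whose check passed; a failed rung blocks
    all rungs above it, even if they would pass in isolation."""
    level = Level.RETRIEVED
    for rung in list(Level):
        if checks.get(rung.name.lower(), False):
            level = rung
        else:
            break
    return level
```

The contiguity rule reflects the ladder's intent: a schema-valid output that was never grounded should not be reported as schema-valid.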
## 9. Example bank and future training
Over time, Matsya should accumulate a verified example bank.
Each example should carry metadata such as:
- source model family
- translation direction
- parser-valid
- schema-valid
- round-trip converged
- human-reviewed
- retrieved supporting sources
These examples serve two purposes:
- better retrieval at inference time
- future fine-tuning data
Training should happen only on examples that have passed validation. Retrieval improvement should generally come before model fine-tuning.
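The validation gate can be expressed as a simple filter over the example metadata listed above; the flag names are assumptions mirroring that list.

```python
def trainable(example: dict) -> bool:
    """Gate for the future fine-tuning set: only examples that passed
    every mechanical check and human review qualify."""
    required = ("parser_valid", "schema_valid",
                "round_trip_converged", "human_reviewed")
    return all(example.get(flag, False) for flag in required)
```

Missing flags count as failures, so partially-annotated examples are excluded by default rather than accidentally trained on.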
## 10. Evaluation
Matsya should be evaluated as a technical assistant with correctness constraints, not as a generic chatbot.
### 10.1 Theory benchmark
Build a benchmark of questions about:
- Bellman calculus foundations
- perch semantics
- syntax and notation
- branching and composition
- development-time ambiguities
Metrics:
- citation precision
- canonical-source hit rate
- unsupported-claim rate
- contradiction detection rate
- human judgment on correctness and clarity
### 10.2 Translation benchmark
Build a benchmark of:
- YAML -> formal MDP tasks
- formal model -> YAML tasks
- repair / refine tasks
Metrics:
- parse pass rate
- schema pass rate
- round-trip convergence
- human acceptance
- failure mode classification
### 10.3 Retrieval benchmark
Measure retrieval directly:
- does the top-k contain the canonical source?
- does the retrieved set contain the needed example?
- are noisy prompt-draft chunks excluded?
Poor retrieval should not be confused with poor generation.
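The first retrieval question above reduces to a hit-rate metric, sketched here under the assumption of one gold canonical source per query:

```python
def canonical_hit_rate(results: list[list[str]],
                       gold: list[str], k: int = 5) -> float:
    """Fraction of queries whose top-k retrieved list contains the
    expected canonical source (one gold path per query)."""
    hits = sum(1 for ranked, g in zip(results, gold) if g in ranked[:k])
    return hits / len(gold)
```

Measuring this directly, separately from answer quality, is what keeps retrieval failures from being misdiagnosed as generation failures.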
## 11. Immediate implementation implications

All implementation tasks in this section are primarily tasks for the HARK_ask-your-project codebase and deployment stack. The corresponding role of bellman-ddsl is to provide the corpus, canon, examples, and evaluation cases that Matsya indexes and reasons over.
### 11.1 Now

- curate the corpus aggressively using `AI/context/matsya-context.md`
- add trust-tier metadata
- introduce profile-specific retrieval
- add reranking
- keep structure-preserving chunks for specs, TeX, and YAML
Concrete implementation targets in HARK_ask-your-project will likely include:
- `config/repositories.yml` for repository groups, weights, and indexing options
- `scripts/build/build_final_index.py` and related build utilities for chunking / metadata / retrieval profiles
- `config/system_prompts/Bellman-DDSL.md` for mode behavior and answer discipline
- `app.py` and CLI entry points for profile selection, citations, and verification-facing UX
### 11.2 Next
- add machine validators to translation mode
- formalize round-trip refinement as a correctness pass
- build a benchmark set of theory and translation tasks
### 11.3 Later
- collect verified examples and repair traces
- train retrieval components on project-local examples
- consider fine-tuning generation only after validators and evals are stable
- explore stronger proof / theorem-verification hooks for theory mode
## 12. Relation to existing Matsya notes

- `development/dolo-plus-impl/spec_0.1k-rag-assistant.md` - deployment and hosting note
- `development/research-notes/matsya-desugar-whisperer.md` - Matsya as a de-sugaring / interpretation layer in the solver pipeline
- `AI/context/matsya-context.md` - corpus inclusion / exclusion policy
- `AI/matsya/matsya.md` - user-facing modes and operational instructions
## 13. External references informing this spec
- PT-RAG: structure-fidelity retrieval for academic papers
- https://arxiv.org/abs/2602.13647v1
- SciRAG: adaptive, citation-aware, outline-guided scientific synthesis
- https://arxiv.org/abs/2511.14362
- PaperQA2: high-accuracy scientific-document RAG with evidence gathering and reranking
- https://github.com/Future-House/paper-qa
- Rango: retrieval-augmented proving with similar proofs + premises
- https://arxiv.org/abs/2412.14063
- RAG-Verus: repository-level retrieval for verification tasks
- https://arxiv.org/abs/2502.05344
- AWS guidance on context engineering / writing for RAG
- https://docs.aws.amazon.com/prescriptive-guidance/latest/writing-best-practices-rag/introduction.html
## 14. Bottom line
Matsya should be built as a retrieval + verification system, not just a prompt over a vector store.
For this project:
- theory mode needs evidence discipline
- translation mode needs compiler-like validation
- future training should be based on verified examples, not raw generated text
That is the minimal path to a research assistant that is actually useful for BE-DDSL development.