If the system is built primarily for agents, the canonical source of truth should not be YAML, one large JSON document, or SQLite. It should be atomic JSONL records, validated by schema, tested by evals, and compiled into fast runtime targets.
I recently changed my mind about the canonical data layer for agent-first systems. If humans are the primary authors, Markdown or YAML can be a good starting point. If agents are the primary readers, writers, retrievers, and auditors, the source of truth should optimize for their operating model instead.
For SAP Agent Context, that means the canonical single source of truth (SSOT) should be JSONL files plus JSON Schema. SAP Agent Context is not one configuration object; it is a knowledge corpus and agent context graph with apps, tables, fields, workflows, roles, claims, sources, relations, and eval fixtures.
That distinction matters. For one contract object, a single JSON file can be the right canonical form. For many atomic knowledge records, JSONL is stronger. For query/runtime/index speed, SQLite or DuckDB should be generated. YAML can remain a human authoring format, but it should not be the hard agent contract.
Make JSONL records the canonical source for the knowledge corpus, keep JSON Schema as the contract, generate SQLite/DuckDB/Turso/vector artifacts from it, and treat evals as part of the quality contract.
JSONL is boring in the best possible way: one JSON object per line. That gives an agent a stream of small, typed, addressable records instead of a document it has to interpret.
| Property | Why agents care |
|---|---|
| Atomic records | Each line can be read, validated, diffed, embedded, cited, replaced, or rejected independently. |
| Stable Git diffs | One changed fact can become one changed line when records are sorted and IDs are stable. |
| Streaming friendly | Agents and pipelines can process large files line by line without loading the whole corpus. |
| Schema native | JSON Schema gives a crisp machine contract for required fields, enums, relations, freshness, and constraints. |
| Toolchain compatible | Python, Node, embeddings pipelines, vector stores, SQLite, DuckDB, Turso, and Postgres all ingest it easily. |
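To make the "atomic records" and "streaming friendly" rows concrete, here is a minimal sketch that reads JSONL one line at a time and accepts or rejects each record on its own. It uses only the standard library; the `REQUIRED_FIELDS` table is a hypothetical stand-in for a real JSON Schema file, and the second sample record (`sap.app.example.broken`) is invented to show a rejection.

```python
import io
import json
from typing import Iterator, TextIO

# Hypothetical minimal contract; a real setup would load a JSON Schema file.
REQUIRED_FIELDS = {"id": str, "kind": str, "title": str}


def iter_records(stream: TextIO) -> Iterator[tuple[int, dict]]:
    """Yield (line_number, record) pairs without loading the whole file."""
    for line_number, line in enumerate(stream, start=1):
        if line.strip():  # tolerate blank lines
            yield line_number, json.loads(line)


def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


def validate_stream(stream: TextIO) -> tuple[list[dict], list[tuple[int, list[str]]]]:
    """Split a JSONL stream into accepted records and rejected line numbers."""
    accepted, rejected = [], []
    for line_number, record in iter_records(stream):
        problems = validate(record)
        if problems:
            rejected.append((line_number, problems))
        else:
            accepted.append(record)
    return accepted, rejected


sample = io.StringIO(
    '{"id":"sap.app.eam.pm.ie03","kind":"sap_app","title":"IE03 - Display Equipment"}\n'
    '{"id":"sap.app.example.broken","kind":"sap_app"}\n'  # invented, missing title
)
accepted, rejected = validate_stream(sample)
print(len(accepted), rejected)
```

One bad line is rejected with its line number while the rest of the file still loads, which is hard to get from a single large JSON or YAML document.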
Most importantly: JSONL pushes you away from documents and toward records. That matters when the product is not a human-readable knowledge base, but a context operating system for agents.
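The "stable Git diffs" property only holds if the file is written deterministically. A sketch of one way to do that, assuming records are sorted by `id` and keys are sorted on output; the `kind` and `title` values on the source record are invented for illustration.

```python
import json


def canonical_jsonl(records: list[dict]) -> str:
    """Serialize records one per line, sorted by id, with sorted keys."""
    ordered = sorted(records, key=lambda record: record["id"])
    lines = [
        json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        for record in ordered
    ]
    return "\n".join(lines) + "\n"


before = [
    {"id": "sap.source.tc-ie03-public", "kind": "source"},
    {"id": "sap.app.eam.pm.ie03", "kind": "sap_app", "title": "IE03 - Display Equipment"},
]
# Same records in a different input order and key order, with one field added.
after = [
    {"title": "IE03 - Display Equipment", "kind": "sap_app", "id": "sap.app.eam.pm.ie03"},
    {"kind": "source", "id": "sap.source.tc-ie03-public", "title": "hypothetical title"},
]

old_lines = canonical_jsonl(before).splitlines()
new_lines = canonical_jsonl(after).splitlines()
# Indexes of lines that differ between the two canonical outputs.
changed = [i for i, (a, b) in enumerate(zip(old_lines, new_lines)) if a != b]
print(changed)
```

Reordering the input or the keys produces no diff at all; only the record whose content changed shows up as a changed line.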
The answer is not “JSONL for everything.” The answer is: choose the canonical format based on the shape of the truth.
| Source shape | Best canonical form |
|---|---|
| One configuration or schema contract | JSON. A single vault-schema.json-style contract can be the right source of truth. |
| Many atomic knowledge records | JSONL. Records stay small, reviewable, streamable, citeable, and independently validatable. |
| Query, runtime, search, analytics | Generated SQLite, DuckDB, Turso, vector indexes, and bundles. Fast to use, not canonical to edit. |
| Human drafting | Optional YAML or Markdown authoring layer. Useful for people, but compiled into the stricter record contract. |
I would also split curated retrieval data from generated indexes. Keywords, query patterns, and negative retrieval hints can be canonical because they encode intent. Embeddings, FTS indexes, ANN indexes, and database tables are generated artifacts.
SQLite is excellent. I want it in the system. I just do not want it as the canonical source.
SQLite, DuckDB, libSQL, Turso, vector indexes, and JSON bundles should be generated runtime artifacts. They make agents fast at query time. They should not be the place where the public repository stores the canonical facts.
Use SQLite for local querying and runtime speed. Use JSONL for reviewable, mergeable, schema-validated truth.
As a canonical repo format, SQLite has the wrong failure modes: opaque diffs, heavy migrations, awkward merge conflicts, and poor review ergonomics. Agents can read SQLite, but contributors and reviewers cannot easily inspect what changed. For an agent-first repo, that is backwards: the source should be transparent, and the runtime should be generated.
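A sketch of the "generated runtime" direction, using Python's built-in `sqlite3` module. The table layout (`items`, `item_topics`) is my assumption, not a fixed design, and the database is in-memory to keep the example self-contained; a real build would write a file under `dist/`.

```python
import json
import sqlite3

JSONL = """\
{"id":"sap.app.eam.pm.ie03","kind":"sap_app","title":"IE03 - Display Equipment","topics":["sap-gui","eam","plant-maintenance","equipment"]}
"""


def build_db(jsonl_text: str) -> sqlite3.Connection:
    """Compile JSONL records into a throwaway SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items ("
        "id TEXT PRIMARY KEY, kind TEXT NOT NULL, title TEXT, body TEXT NOT NULL)"
    )
    conn.execute("CREATE TABLE item_topics (item_id TEXT NOT NULL, topic TEXT NOT NULL)")
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Keep the original line in `body` so the runtime can return the full record.
        conn.execute(
            "INSERT INTO items VALUES (?, ?, ?, ?)",
            (record["id"], record["kind"], record.get("title"), line),
        )
        conn.executemany(
            "INSERT INTO item_topics VALUES (?, ?)",
            [(record["id"], topic) for topic in record.get("topics", [])],
        )
    conn.commit()
    return conn


conn = build_db(JSONL)
rows = conn.execute(
    "SELECT i.id FROM items i JOIN item_topics t ON t.item_id = i.id WHERE t.topic = ?",
    ("equipment",),
).fetchall()
print(rows)
```

Because the database can be rebuilt from the JSONL at any time, it never needs migrations or merge handling; deleting and regenerating it is the upgrade path.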
An agent-first record should not be a loose summary. It needs stable identity, evidence, relations, retrieval hints, constraints, and freshness. If claims carry evidence, confidence, and freshness, they should often become first-class records instead of always being embedded inside an item.
{"id":"sap.app.eam.pm.ie03","kind":"sap_app","title":"IE03 - Display Equipment","summary":"SAP GUI transaction context for displaying equipment.","topics":["sap-gui","eam","plant-maintenance","equipment"],"claim_ids":["sap.claim.ie03.display-equipment"],"source_ids":["sap.source.tc-ie03-public"]}
{"id":"sap.claim.ie03.display-equipment","subject_id":"sap.app.eam.pm.ie03","statement":"IE03 is used to display equipment master data.","confidence":"high","evidence_ids":["sap.source.tc-ie03-public"],"freshness":{"reviewed_at":"2026-06-24","review_after":"2026-12-24"}}
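The point is not that every record must be large. The point is that the important semantics are explicit. A good agent record can say what it is, which evidence supports it, how it relates to other records, and when it was last reviewed.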
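Two of those explicit semantics, references and freshness, can be checked mechanically. The sketch below uses trimmed copies of the two example records. It assumes a claim counts as stale once today's date is past `review_after`; that interpretation is mine, as are the function names.

```python
from datetime import date

ITEM = {
    "id": "sap.app.eam.pm.ie03",
    "kind": "sap_app",
    "claim_ids": ["sap.claim.ie03.display-equipment"],
    "source_ids": ["sap.source.tc-ie03-public"],
}
CLAIM = {
    "id": "sap.claim.ie03.display-equipment",
    "subject_id": "sap.app.eam.pm.ie03",
    "evidence_ids": ["sap.source.tc-ie03-public"],
    "freshness": {"reviewed_at": "2026-06-24", "review_after": "2026-12-24"},
}


def dangling_references(item: dict, claims_by_id: dict, source_ids: set) -> list[str]:
    """Return every referenced claim or source ID that cannot be resolved."""
    missing = [c for c in item.get("claim_ids", []) if c not in claims_by_id]
    missing += [s for s in item.get("source_ids", []) if s not in source_ids]
    for claim_id in item.get("claim_ids", []):
        claim = claims_by_id.get(claim_id)
        if claim is not None:
            missing += [e for e in claim.get("evidence_ids", []) if e not in source_ids]
    return sorted(set(missing))


def is_stale(claim: dict, today: date) -> bool:
    """Assumption: a claim is stale once today is past its review_after date."""
    return today > date.fromisoformat(claim["freshness"]["review_after"])


claims_by_id = {CLAIM["id"]: CLAIM}
print(dangling_references(ITEM, claims_by_id, {"sap.source.tc-ie03-public"}))
print(is_stale(CLAIM, date(2027, 1, 1)))
```

Promoting claims to first-class records is what makes the staleness check possible per claim rather than per item.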
The repo should not be organized around documents. It should be organized around typed, testable, buildable records.
    records/
        items/
            sap_app.jsonl
            sap_table.jsonl
            sap_field.jsonl
            workflow.jsonl
            role.jsonl
        claims.jsonl
        relations.jsonl
        sources.jsonl
    schema/
        item.schema.json
        claim.schema.json
        relation.schema.json
        source.schema.json
        bundle.schema.json
        eval.schema.json
    retrieval/
        hints.jsonl
        query_patterns.jsonl
    evals/
        retrieval.jsonl
        bundle_quality.jsonl
        adversarial.jsonl
    dist/
        sap-agent-context.sqlite
        sap-agent-context.duckdb
        indexes/
        bundles/
        embeddings/
The CLI then becomes the agent interface:
sap-agent-context validate
sap-agent-context query "IE03 equipment display"
sap-agent-context bundle --goal "FO for SAP GUI equipment display"
sap-agent-context build-db --target sqlite
sap-agent-context build-db --target duckdb
sap-agent-context eval retrieval
sap-agent-context explain sap.app.eam.pm.ie03
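What a command like `eval retrieval` might compute can be sketched in a few lines. Everything here is an assumption for illustration: the eval fixture shape (`query`, `expected_ids`), the toy token-overlap retriever, and the second record, which is invented as a distractor. A real retriever would use the generated indexes.

```python
import re

RECORDS = [
    {
        "id": "sap.app.eam.pm.ie03",
        "title": "IE03 - Display Equipment",
        "topics": ["sap-gui", "eam", "plant-maintenance", "equipment"],
    },
    # Invented distractor record.
    {"id": "sap.app.example.other", "title": "Hypothetical unrelated app", "topics": ["finance"]},
]
# Hypothetical fixture shape for evals/retrieval.jsonl.
EVALS = [{"query": "IE03 equipment display", "expected_ids": ["sap.app.eam.pm.ie03"]}]


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, records: list[dict], k: int = 1) -> list[str]:
    """Toy retriever: rank records by token overlap with the query."""
    query_tokens = tokens(query)
    scored = []
    for record in records:
        doc_tokens = tokens(record["title"] + " " + " ".join(record["topics"]))
        scored.append((len(query_tokens & doc_tokens), record["id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, then id
    return [record_id for score, record_id in scored[:k] if score > 0]


def recall_at_k(evals: list[dict], records: list[dict], k: int = 1) -> float:
    """Fraction of expected IDs that appear in the top-k results."""
    hits = total = 0
    for case in evals:
        retrieved = set(retrieve(case["query"], records, k))
        hits += sum(1 for expected in case["expected_ids"] if expected in retrieved)
        total += len(case["expected_ids"])
    return hits / total if total else 0.0


print(recall_at_k(EVALS, RECORDS, k=1))
```

Keeping the fixtures in JSONL next to the records means a change to the corpus and the proof that retrieval still works can be reviewed in the same diff.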
The product is not the repository. The product is the bounded context bundle an agent can trust for a specific goal: relevant records, must-cite evidence, warnings, gaps, confidence, and forbidden assumptions.
{"bundle_id":"bundle.eam.equipment-display","goal":"write FO for EAM equipment display flow","items":["sap.app.eam.pm.ie03","sap.object.eam-equipment"],"must_cite":["sap.source.tc-ie03-public"],"warnings":[],"gaps":["customer-specific authorization model not supplied"],"forbidden_assumptions":["do not infer customer authorization design"],"confidence":"medium"}
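A bundle like this can itself be checked before an agent trusts it. The rules in this sketch are my assumptions, not a fixed contract: every listed item must resolve, every `must_cite` source must be backed by an included item, and a bundle with open gaps should not claim high confidence. Only the IE03 item is defined here, so the second item in the bundle is reported as unresolved.

```python
BUNDLE = {
    "bundle_id": "bundle.eam.equipment-display",
    "goal": "write FO for EAM equipment display flow",
    "items": ["sap.app.eam.pm.ie03", "sap.object.eam-equipment"],
    "must_cite": ["sap.source.tc-ie03-public"],
    "warnings": [],
    "gaps": ["customer-specific authorization model not supplied"],
    "forbidden_assumptions": ["do not infer customer authorization design"],
    "confidence": "medium",
}
# Only one item is available in this sketch.
ITEMS_BY_ID = {
    "sap.app.eam.pm.ie03": {"source_ids": ["sap.source.tc-ie03-public"]},
}


def check_bundle(bundle: dict, items_by_id: dict) -> list[str]:
    """Return a list of problems; an empty list means the bundle passes."""
    problems = []
    available_sources = set()
    for item_id in bundle["items"]:
        item = items_by_id.get(item_id)
        if item is None:
            problems.append(f"unresolved item: {item_id}")
        else:
            available_sources.update(item.get("source_ids", []))
    for source_id in bundle["must_cite"]:
        if source_id not in available_sources:
            problems.append(f"must_cite not backed by any item: {source_id}")
    # Assumed rule: open gaps and high confidence should not coexist.
    if bundle["confidence"] == "high" and bundle["gaps"]:
        problems.append("high confidence despite open gaps")
    return problems


print(check_bundle(BUNDLE, ITEMS_BY_ID))
```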
If humans rarely touch the data directly, stop designing the canonical layer as if it were a document library. Design it for what agents actually need to do: read, write, retrieve, and audit records.
That is why my current answer is slightly more precise: JSONL is the best canonical SSOT for agent-first knowledge corpora. A single schema contract can still be JSON. SQLite, DuckDB, Turso, bundles, embeddings, and generated indexes are accelerators. Curated retrieval hints and evals are part of the quality contract. JSON Schema is the guardrail. The knowledge source is atomic JSONL.
This is the direction I want to push further in my own stack: typed JSONL records as the source, generated databases as runtime, and retrieval evals as proof that agents are getting the right context.