JSONL Is the Agent-First Data Structure

1. The thesis

I recently changed my mind about the canonical data layer for agent-first systems. If humans are the primary authors, Markdown or YAML can be a good starting point. If agents are the primary readers, writers, retrievers, and auditors, the source of truth should optimize for their operating model instead.

For SAP Agent Context, that means the single source of truth (SSOT) should be JSONL files plus JSON Schema. SAP Agent Context is not one configuration object; it is a knowledge corpus and agent context graph with apps, tables, fields, workflows, roles, claims, sources, relations, and eval fixtures.

That distinction matters. For one contract object, a single JSON file can be the right canonical form. For many atomic knowledge records, JSONL is stronger. For query, runtime, and index speed, SQLite or DuckDB databases should be generated from those records. YAML can remain a human authoring format, but it should not be the hard agent contract.

Strong recommendation

Make JSONL records the canonical source for the knowledge corpus, keep JSON Schema as the contract, generate SQLite/DuckDB/Turso/vector artifacts from it, and treat evals as part of the quality contract.

records/*.jsonl → JSON Schema → evals/*.jsonl → dist/* runtime artifacts

2. Why JSONL wins

JSONL is boring in the best possible way: one JSON object per line. That gives an agent a stream of small, typed, addressable records instead of a document it has to interpret.

Property	Why agents care
Atomic records	Each line can be read, validated, diffed, embedded, cited, replaced, or rejected independently.
Stable Git diffs	One changed fact can become one changed line when records are sorted and IDs are stable.
Streaming friendly	Agents and pipelines can process large files line by line without loading the whole corpus.
Schema native	JSON Schema gives a crisp machine contract for required fields, enums, relations, freshness, and constraints.
Toolchain compatible	Python, Node, embeddings pipelines, vector stores, SQLite, DuckDB, Turso, and Postgres all ingest it easily.

Most importantly: JSONL pushes you away from documents and toward records. That matters when the product is not a human-readable knowledge base, but a context operating system for agents.
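A minimal sketch of what record orientation buys in practice, assuming every record is a JSON object with a stable `id` field. The helper names `dump_jsonl` and `iter_jsonl` are hypothetical and not part of any existing tool. Sorting by ID and serializing keys in a fixed order is what keeps one changed fact to one changed line, and reading line by line means one malformed record can be rejected without blocking the rest.

```python
import io
import json
from typing import Iterable, Iterator, Optional, TextIO, Tuple


def dump_jsonl(records: Iterable[dict], out: TextIO) -> None:
    """Write one record per line, sorted by id, with a fixed key order."""
    for record in sorted(records, key=lambda r: r["id"]):
        # sort_keys plus compact separators makes the output byte-stable,
        # so an unchanged record always serializes to the same line.
        line = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        out.write(line + "\n")


def iter_jsonl(lines: Iterable[str]) -> Iterator[Tuple[int, Optional[dict], Optional[str]]]:
    """Yield (line_number, record, error) for each non-empty line."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            yield number, json.loads(line), None
        except json.JSONDecodeError as exc:
            # Report the bad line instead of aborting the whole stream.
            yield number, None, str(exc)


# Illustrative records; the IDs are made up for this example.
records = [
    {"id": "sap.app.b", "kind": "sap_app", "title": "B"},
    {"id": "sap.app.a", "kind": "sap_app", "title": "A"},
]

buffer = io.StringIO()
dump_jsonl(records, buffer)
before = buffer.getvalue().splitlines()

records[0]["title"] = "B (updated)"
buffer = io.StringIO()
dump_jsonl(records, buffer)
after = buffer.getvalue().splitlines()

# Compare the two serializations line by line.
changed = [n for n, (old, new) in enumerate(zip(before, after), start=1) if old != new]
print(changed)  # only the line holding sap.app.b differs
```

The same iterator works on an open file handle, so a large corpus never has to be loaded into memory at once.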

3. The important nuance

The answer is not “JSONL for everything.” The answer is: choose the canonical format based on the shape of the truth.

Source shape	Best canonical form
One configuration or schema contract	JSON. A single `vault-schema.json`-style contract can be the right source of truth.
Many atomic knowledge records	JSONL. Records stay small, reviewable, streamable, citeable, and independently validatable.
Query, runtime, search, analytics	Generated SQLite, DuckDB, Turso, vector indexes, and bundles. Fast to use, not canonical to edit.
Human drafting	Optional YAML or Markdown authoring layer. Useful for people, but compiled into the stricter record contract.

I would also split curated retrieval data from generated indexes. Keywords, query patterns, and negative retrieval hints can be canonical because they encode intent. Embeddings, FTS indexes, ANN indexes, and database tables are generated artifacts.
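To illustrate the split, here is a rough sketch in which curated hints are the canonical input and a keyword index is derived from them on every build. The hint record shape (`record_id`, `keywords`, `negative_keywords`), the function names, and the second record ID are assumptions made for this example; the article does not define them.

```python
import json
from collections import defaultdict

# Curated, canonical input: these lines encode intent and are reviewed by people.
HINT_LINES = [
    '{"record_id":"sap.app.eam.pm.ie03","keywords":["ie03","display equipment"],'
    '"negative_keywords":["create equipment"]}',
    '{"record_id":"sap.app.eam.pm.ie01","keywords":["ie01","create equipment"],'
    '"negative_keywords":[]}',
]


def build_keyword_index(hint_lines):
    """Derive lookup tables from hints. The result is a disposable artifact."""
    index = defaultdict(set)    # keyword -> record ids to retrieve
    blocked = defaultdict(set)  # keyword -> record ids to suppress
    for line in hint_lines:
        if not line.strip():
            continue
        hint = json.loads(line)
        for keyword in hint["keywords"]:
            index[keyword.lower()].add(hint["record_id"])
        for keyword in hint.get("negative_keywords", []):
            blocked[keyword.lower()].add(hint["record_id"])
    return index, blocked


def lookup(query, index, blocked):
    """Return record ids whose keywords occur in the query, minus suppressed ones."""
    text = query.lower()
    hits = set()
    for keyword, record_ids in index.items():
        if keyword in text:
            hits |= record_ids
    for keyword, record_ids in blocked.items():
        if keyword in text:
            hits -= record_ids
    return sorted(hits)


index, blocked = build_keyword_index(HINT_LINES)
print(lookup("IE03 display equipment", index, blocked))
print(lookup("IE03 or create equipment?", index, blocked))
```

Deleting `index` and `blocked` loses nothing, because both can be rebuilt from the hint lines. That is the test for whether something belongs in the canonical layer or in `dist/`.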

4. Why SQLite is not the SSOT

SQLite is excellent. I want it in the system. I just do not want it as the canonical source.

SQLite, DuckDB, libSQL, Turso, vector indexes, and JSON bundles should be generated runtime artifacts. They make agents fast at query time. They should not be the place where the public repository stores the canonical facts.

Database as artifact

Use SQLite for local querying and runtime speed. Use JSONL for reviewable, mergeable, schema-validated truth.

As a canonical repo format, SQLite has the wrong failure modes: opaque diffs, heavy migrations, awkward merge conflicts, and poor review ergonomics. Agents can read SQLite, but contributors and reviewers cannot easily inspect what changed. For an agent-first repo, that is backwards: the source should be transparent, and the runtime should be generated.
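A sketch of the "database as artifact" idea, using only Python's standard `sqlite3` module. The table layout and the `build_db` name are assumptions for illustration, and the input line is a trimmed-down item record. The database is dropped and rebuilt from the JSONL on every run, so nothing in it needs to be reviewed or merged.

```python
import json
import sqlite3

# Canonical input: one JSON object per line (trimmed for this example).
ITEM_LINES = [
    '{"id":"sap.app.eam.pm.ie03","kind":"sap_app","title":"IE03 - Display Equipment"}',
]


def build_db(item_lines, conn):
    """Rebuild the items table from scratch and return the number of rows loaded."""
    conn.execute("DROP TABLE IF EXISTS items")
    conn.execute(
        "CREATE TABLE items ("
        "id TEXT PRIMARY KEY, kind TEXT NOT NULL, title TEXT NOT NULL, body TEXT NOT NULL)"
    )
    rows = []
    for line in item_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Keep the original line in `body` so the table stays a lossless copy.
        rows.append((record["id"], record["kind"], record["title"], line.strip()))
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = build_db(ITEM_LINES, conn)
row = conn.execute(
    "SELECT id, title FROM items WHERE kind = ?", ("sap_app",)
).fetchone()
print(loaded, row)
```

Because the build is idempotent, the generated file can be excluded from version control and regenerated wherever it is needed.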

5. The agent-first record shape

An agent-first record should not be a loose summary. It needs stable identity, evidence, relations, retrieval hints, constraints, and freshness. If claims carry evidence, confidence, and freshness, they should often become first-class records instead of always being embedded inside an item.

records/items/sap_app.jsonl

{"id":"sap.app.eam.pm.ie03","kind":"sap_app","title":"IE03 - Display Equipment","summary":"SAP GUI transaction context for displaying equipment.","topics":["sap-gui","eam","plant-maintenance","equipment"],"claim_ids":["sap.claim.ie03.display-equipment"],"source_ids":["sap.source.tc-ie03-public"]}

records/claims.jsonl

{"id":"sap.claim.ie03.display-equipment","subject_id":"sap.app.eam.pm.ie03","statement":"IE03 is used to display equipment master data.","confidence":"high","evidence_ids":["sap.source.tc-ie03-public"],"freshness":{"reviewed_at":"2026-06-24","review_after":"2026-12-24"}}

The point is not that every record must be large. The point is that the important semantics are explicit. A good agent record can answer:

What is this?
Which claims does it make?
What evidence supports those claims?
How fresh is it?
How does it relate to other records?
When should it be retrieved?
When should it not be retrieved?
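Several of these questions can be checked mechanically. The sketch below uses the item and claim fields from the two records above and assumes a hypothetical `check_item` helper; it reports claims that do not resolve, evidence that points at unknown sources, and claims past their `review_after` date.

```python
from datetime import date

ITEM = {
    "id": "sap.app.eam.pm.ie03",
    "kind": "sap_app",
    "claim_ids": ["sap.claim.ie03.display-equipment"],
    "source_ids": ["sap.source.tc-ie03-public"],
}
CLAIM = {
    "id": "sap.claim.ie03.display-equipment",
    "subject_id": "sap.app.eam.pm.ie03",
    "evidence_ids": ["sap.source.tc-ie03-public"],
    "freshness": {"reviewed_at": "2026-06-24", "review_after": "2026-12-24"},
}


def check_item(item, claims_by_id, known_source_ids, today):
    """Return a list of human-readable problems; an empty list means the item is clean."""
    problems = []
    for claim_id in item.get("claim_ids", []):
        claim = claims_by_id.get(claim_id)
        if claim is None:
            problems.append(f"{item['id']}: missing claim {claim_id}")
            continue
        if claim["subject_id"] != item["id"]:
            problems.append(f"{claim_id}: subject is {claim['subject_id']}, not {item['id']}")
        for evidence_id in claim.get("evidence_ids", []):
            if evidence_id not in known_source_ids:
                problems.append(f"{claim_id}: missing source {evidence_id}")
        review_after = date.fromisoformat(claim["freshness"]["review_after"])
        if today > review_after:
            problems.append(f"{claim_id}: stale since {review_after.isoformat()}")
    return problems


claims_by_id = {CLAIM["id"]: CLAIM}
sources = {"sap.source.tc-ie03-public"}

print(check_item(ITEM, claims_by_id, sources, today=date(2026, 7, 1)))
print(check_item(ITEM, claims_by_id, sources, today=date(2027, 1, 1)))
```

Passing `today` in explicitly keeps the check deterministic, which matters if it runs in CI alongside schema validation.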

6. The architecture I would build

The repo should not be organized around documents. It should be organized around typed, testable, buildable records.

repository shape

records/
  items/
    sap_app.jsonl
    sap_table.jsonl
    sap_field.jsonl
    workflow.jsonl
    role.jsonl
  claims.jsonl
  relations.jsonl
  sources.jsonl

schema/
  item.schema.json
  claim.schema.json
  relation.schema.json
  source.schema.json
  bundle.schema.json
  eval.schema.json

retrieval/
  hints.jsonl
  query_patterns.jsonl

evals/
  retrieval.jsonl
  bundle_quality.jsonl
  adversarial.jsonl

dist/
  sap-agent-context.sqlite
  sap-agent-context.duckdb
  indexes/
  bundles/
  embeddings/

The CLI then becomes the agent interface:

agent interface

sap-agent-context validate
sap-agent-context query "IE03 equipment display"
sap-agent-context bundle --goal "FO for SAP GUI equipment display"
sap-agent-context build-db --target sqlite
sap-agent-context build-db --target duckdb
sap-agent-context eval retrieval
sap-agent-context explain sap.app.eam.pm.ie03

The product is not the repository. The product is the bounded context bundle an agent can trust for a specific goal: relevant records, must-cite evidence, warnings, gaps, confidence, and forbidden assumptions.

dist/bundles/eam-equipment-display.json

{"bundle_id":"bundle.eam.equipment-display","goal":"write FO for EAM equipment display flow","items":["sap.app.eam.pm.ie03","sap.object.eam-equipment"],"must_cite":["sap.source.tc-ie03-public"],"warnings":[],"gaps":["customer-specific authorization model not supplied"],"forbidden_assumptions":["do not infer customer authorization design"],"confidence":"medium"}
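One way such a bundle could be assembled, as a sketch only: the `assemble_bundle` function and its rules are assumptions, not the behavior of the `bundle` command. In this sketch, requested items without a record become gaps rather than being silently dropped, `must_cite` is collected from the items' `source_ids`, and confidence is lowered whenever a gap exists.

```python
import json


def assemble_bundle(bundle_id, goal, item_ids, items_by_id):
    """Build a bounded context bundle from known records, recording what is missing."""
    found, must_cite, gaps = [], [], []
    for item_id in item_ids:
        item = items_by_id.get(item_id)
        if item is None:
            # Surface the hole explicitly so the agent does not guess.
            gaps.append(f"no record found for {item_id}")
            continue
        found.append(item_id)
        for source_id in item.get("source_ids", []):
            if source_id not in must_cite:
                must_cite.append(source_id)
    return {
        "bundle_id": bundle_id,
        "goal": goal,
        "items": found,
        "must_cite": must_cite,
        "warnings": [],
        "gaps": gaps,
        "forbidden_assumptions": [],
        # Assumed rule for this sketch: any gap caps confidence at "medium".
        "confidence": "medium" if gaps else "high",
    }


items_by_id = {
    "sap.app.eam.pm.ie03": {
        "id": "sap.app.eam.pm.ie03",
        "source_ids": ["sap.source.tc-ie03-public"],
    }
}

bundle = assemble_bundle(
    "bundle.eam.equipment-display",
    "write FO for EAM equipment display flow",
    ["sap.app.eam.pm.ie03", "sap.object.eam-equipment"],
    items_by_id,
)
print(json.dumps(bundle, indent=2))
```

A real builder would also need a source for `warnings` and `forbidden_assumptions`; they are left empty here because the article does not say how they are derived.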

7. The north star

If humans rarely touch the data directly, stop designing the canonical layer as if it were a document library. Design it for what agents actually need to do:

Retrieve the right bounded context.
Reason over explicit claims and relations.
Cite stable IDs and evidence.
Detect gaps instead of hallucinating missing customer-specific proof.

That is why my current answer is slightly more precise: JSONL is the best canonical SSOT for agent-first knowledge corpora. A single schema contract can still be JSON. SQLite, DuckDB, Turso, bundles, embeddings, and generated indexes are accelerators. Curated retrieval hints and evals are part of the quality contract. JSON Schema is the guardrail. The knowledge source is atomic JSONL.
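To make "evals are part of the quality contract" concrete, here is a small sketch of a retrieval eval. The fixture shape (`query`, `expected_ids`), the `recall_at_k` helper, and the toy retriever are all assumptions for illustration; the metric is plain recall over the top `k` results, averaged across cases.

```python
import json

# Hypothetical eval fixtures, one JSON object per line.
EVAL_LINES = [
    '{"query":"IE03 equipment display","expected_ids":["sap.app.eam.pm.ie03"]}',
    '{"query":"equipment master data table","expected_ids":["sap.table.example"]}',
]


def recall_at_k(cases, retrieve, k=5):
    """Mean fraction of expected ids found in the top-k results of `retrieve`."""
    scores = []
    for case in cases:
        expected = set(case["expected_ids"])
        if not expected:
            continue  # nothing to measure for this case
        retrieved = set(retrieve(case["query"])[:k])
        scores.append(len(expected & retrieved) / len(expected))
    return sum(scores) / len(scores) if scores else 0.0


def toy_retrieve(query):
    """Stand-in retriever: only knows one record, matched on the transaction code."""
    return ["sap.app.eam.pm.ie03"] if "ie03" in query.lower() else []


cases = [json.loads(line) for line in EVAL_LINES]
score = recall_at_k(cases, toy_retrieve, k=5)
print(f"recall@5 = {score:.2f}")
```

A build could fail when this score drops below an agreed threshold, which is what turns the eval files from documentation into a contract.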

Stack implication

This is the direction I want to take my own stack: typed JSONL records as the source, generated databases as runtime, and retrieval evals as proof that agents are getting the right context.