Observe-Reason-Act-Converge: A Design Pattern to Build an AI Data Engineer Agent

A design pattern for AI agents that monitor, diagnose, and fix data pipelines autonomously — so your data team can focus on building, not babysitting cron jobs.

At RevDog, we run two quant strategies on NSE small-caps. Both depend on a chain of cron jobs — market indices, EOD prices from Zerodha, regime detection, conviction scoring, signal generation, S3 backups.

Every job depends on the one before it. If prices don't fetch, conviction scores are stale. If conviction scores are stale, signals are wrong. If signals are wrong, we trade on garbage.

For weeks, we were the data engineer. Checking cron logs manually. Discovering at 10 AM that the previous day's prices had never fetched because the Zerodha token had expired. Re-running jobs by hand, in the right order, hoping we didn't miss a dependency. On holidays, false alarms. On weekends, silence followed by Monday morning surprises.

The problem wasn't any single failure. It was that failures cascaded silently through a dependency chain, and we had no systematic way to detect, diagnose, and fix them — short of being the human in the loop every morning.

We needed an agent.

But not the kind that gets a shell and figures things out.

We needed one that could reason about dependencies, fix what it could, flag what it couldn't, handle jobs that take longer than its own execution window, and pick up where it left off the next hour.

The pattern we built to solve this — Observe, Reason, Act, Converge — turns out to be general enough to apply to any team drowning in pipeline babysitting.

The Data Engineering Problem Nobody Talks About

Data engineering has a dirty secret: a huge chunk of the job isn't engineering at all. It's babysitting.

A January 2025 survey found that 64% of organizations report their data teams spend more than half their time on repetitive or manual tasks, with 70% rating pipeline management as "somewhat" or "extremely" complex.

A separate 2024 study showed 63% of data practitioners spend over 20% of their time purely on maintenance — updating schemas, managing data quality, modifying pipelines.

At scale, this becomes a staffing problem. A team of four data engineers might have two doing actual development work and two keeping existing pipelines from falling over. Those two aren't writing bad code or being unproductive — the pipelines genuinely need monitoring. Data sources change formats. API tokens expire. Network timeouts cascade. Jobs fail in ways that require understanding the dependency chain to fix correctly.

The standard industry response is more tooling — Airflow, Dagster, Prefect — which helps with orchestration but doesn't solve the _diagnosis and remediation_ problem.

Your DAG scheduler knows job X failed. It doesn't know _why_ X failed, whether re-running X will help, or whether job Y downstream should wait.

That reasoning gap is exactly where an LLM agent fits. Not to replace the data engineer, but to handle the repetitive monitoring-and-fixing loop that consumes their evenings.

The goal isn't fewer data engineers. It's data engineers who spend their time on architecture and modeling instead of watching dashboards at 7 AM.

The Core Tension

Any complex agent faces the same architectural problem: deterministic code is reliable but can't reason about novel situations. LLMs can reason but aren't reliable. And real-world operations happen on timescales that don't fit inside a single LLM call — your price fetch takes 8 minutes, but your API timeout is 60 seconds.

Most agent frameworks (LangGraph, CrewAI) assume the agent runs in a single loop until done. That works when every tool call returns in seconds. It fails when your agent needs to kick off a job, go away, come back an hour later, check the result, and decide what to do next.

The Observe-Reason-Act-Converge pattern resolves all three tensions — reliability, reasoning, and time — by treating them as one integrated design, not separate problems to patch.

The Pattern

Observe (Deterministic)


The first layer gathers facts. No LLM. No ambiguity. Pure functions that answer simple yes/no questions and produce structured output.

In our case, eight Python functions check: Are prices fresh? Do we have 250/250 symbols? Is the Zerodha API token valid? Has the S3 backup run in the last 36 hours? The output is a JSON object — a factual snapshot of system state.
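To make that concrete, here is a minimal sketch of what one such check might look like. The file layout, the `as_of` and `symbols` field names, and the 30-hour freshness threshold are illustrative assumptions, not our production code:

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
import json

IST = timezone(timedelta(hours=5, minutes=30))

def check_price_freshness(prices_path: Path, expected_symbols: int = 250) -> dict:
    """Deterministic check: is the EOD price file present, complete, and recent?"""
    if not prices_path.exists():
        return {"check": "prices", "status": "FAIL", "reason": "file missing"}

    data = json.loads(prices_path.read_text())
    # assumes `as_of` is stored as a timezone-aware ISO-8601 timestamp
    age_hours = (datetime.now(IST) - datetime.fromisoformat(data["as_of"])).total_seconds() / 3600

    ok = len(data["symbols"]) == expected_symbols and age_hours < 30
    return {
        "check": "prices",
        "status": "PASS" if ok else "FAIL",
        "symbols": f"{len(data['symbols'])}/{expected_symbols}",
        "age_hours": round(age_hours, 1),
    }
```

The full health snapshot is simply the collected output of all eight checks.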

This is the shared reality layer. Upstream, nothing is interpreted. Downstream, everyone — LLM or human — agrees on the same facts. You can't hallucinate about a JSON field that says `"status": "FAIL"`.

The observation layer is also where domain knowledge lives. Ours carries the full NSE 2026 holiday calendar (16 holidays) and understands IST timezone offsets. It knows that at 2 AM on a Wednesday, no EOD data exists for that day yet — the market hasn't opened — so the expected latest data is from Tuesday. Without this, you get false alarms every morning and on every Diwali and Holi.
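A stripped-down version of that calendar logic, with placeholder holiday dates standing in for the real NSE list and an assumed EOD publish cutoff:

```python
from datetime import date, datetime, time, timedelta

# Placeholder dates only; the real agent carries all 16 NSE holidays for 2026.
NSE_HOLIDAYS_2026: set[date] = {date(2026, 1, 26), date(2026, 8, 15)}

def expected_latest_trading_day(now_ist: datetime) -> date:
    """The most recent day for which EOD data should already exist."""
    candidate = now_ist.date()
    if now_ist.time() < time(18, 30):      # assumed EOD publish cutoff
        candidate -= timedelta(days=1)      # today's data isn't expected yet
    while candidate.weekday() >= 5 or candidate in NSE_HOLIDAYS_2026:
        candidate -= timedelta(days=1)      # skip weekends and exchange holidays
    return candidate
```

At 2 AM on a Wednesday this returns Tuesday; on the morning after a holiday it keeps walking back until it finds a trading day.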

Design principle: the observation layer should be so precise that you could email its output to a human and they'd have everything they need to decide what to do.

If the LLM disappeared tomorrow, the observation layer alone would still be useful.

Reason (LLM)

The second layer interprets facts, identifies root causes, and plans actions. This is where LLMs earn their keep.

Our agent receives the health check JSON and reasons about dependency chains: "Zerodha token expired → prices can't fetch → conviction scores will be stale even if I re-run them → I should fix prices first, flag the token for manual action, and skip scores until the next cycle."

That kind of multi-step causal reasoning across a dependency graph is genuinely hard to code deterministically. You'd need a DAG solver with edge-case handling for partial failures. The LLM does it naturally from a system prompt that describes the pipeline topology.
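In practice, the reasoning layer is mostly prompt plus plumbing. A rough sketch follows; the job names in the topology and the `call_llm` helper are hypothetical stand-ins for whatever model client you use:

```python
import json

# Simplified topology; the real system prompt describes every job, its schedule, and its dependencies.
PIPELINE_TOPOLOGY = """
fetch_prices -> compute_conviction_scores -> generate_signals -> backup_to_s3
fetch_macro  -> detect_regime             -> generate_signals
"""

SYSTEM_PROMPT = f"""You are the on-call data engineer for this pipeline:
{PIPELINE_TOPOLOGY}
Given the health-check JSON, identify root causes, decide which jobs to re-run
(respecting dependency order), and list anything that requires a human.
Respond with JSON: {{"actions": [...], "flags": [...], "skip": [...]}}."""

def plan_remediation(health_snapshot: dict, call_llm) -> dict:
    """Ask the model for a plan. `call_llm(system, user)` returns the model's text reply."""
    reply = call_llm(SYSTEM_PROMPT, json.dumps(health_snapshot, indent=2))
    return json.loads(reply)   # every requested action still passes through the guardrails below
```

The plan is advisory: the Act layer decides whether each requested action is actually allowed.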

Importantly, the reasoning layer doesn't need to be correct every time. It needs to be correct _enough_ — because the system self-corrects on the next cycle.

This is analogous to how a quant fund separates alpha generation from risk management. The alpha model is creative — it identifies opportunities. The risk framework is disciplined — it constrains positions and exposure. The alpha model doesn't need to be right every time because the risk layer limits the damage of being wrong.

Design principle: the reasoning layer should improve speed and quality of decision-making, but the system should converge toward correctness even when the reasoning layer makes mistakes.

Act (Guarded, Asynchronous)

The third layer is where most agent architectures fall short.

They treat action as synchronous — call a function, get a result, continue reasoning.

But real operations don't work that way. Our price fetch takes 5-8 minutes. A fundamentals scrape takes 15. If the agent blocks and waits, the LLM API times out and the job gets killed mid-run.

The Act layer solves this by treating every action as fire-and-forget with deferred verification:

  1. When the LLM calls `rerun_job`, we launch the process fully detached (`subprocess.Popen` with `start_new_session=True`). The job runs independently, writes stdout to a log file, and continues even if the agent process exits.
  2. The agent gets back a PID immediately. For quick jobs (~30 seconds), it calls `check_job_status` right away. For slow jobs, it notes "launched, will verify next run."
  3. Every launched job is persisted to a `pending_state.json` file — the agent's structured memory of what it's done.


This is a fundamentally different model from the standard agent loop. The agent doesn't run until all tasks are done. It does what it can _right now_, records what it started, and exits. The next invocation picks up where this one left off.
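A minimal sketch of that launcher and the deferred status check, assuming a simple log directory and a flat `pending_state.json`; the detachment mechanics (`subprocess.Popen` with `start_new_session=True`) are the part taken from the description above:

```python
import json
import os
import subprocess
from datetime import datetime, timezone
from pathlib import Path

PENDING_STATE = Path("pending_state.json")
LOG_DIR = Path("logs")

def rerun_job(script: str) -> dict:
    """Launch a pipeline script fully detached and record it for the next run."""
    LOG_DIR.mkdir(exist_ok=True)
    log_path = LOG_DIR / f"{Path(script).stem}.log"
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            ["python", script],
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,   # the job keeps running even if the agent process exits
        )

    state = json.loads(PENDING_STATE.read_text()) if PENDING_STATE.exists() else {"launched": []}
    state["launched"].append({
        "script": script,
        "pid": proc.pid,
        "launched_at": datetime.now(timezone.utc).isoformat(),
    })
    PENDING_STATE.write_text(json.dumps(state, indent=2))
    return {"status": "LAUNCHED", "pid": proc.pid, "log": str(log_path)}

def check_job_status(pid: int) -> str:
    """Liveness only: whether the job succeeded is confirmed by the next health check."""
    try:
        os.kill(pid, 0)              # signal 0 probes the process without signalling it
        return "RUNNING"
    except ProcessLookupError:
        return "FINISHED"
```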

The action layer also carries hardcoded guardrails — in the _implementation_, not in the prompt:

  1. Retry limits: Maximum 2 retries per job. The third call returns `BLOCKED`.
  2. Timing windows: No job execution during market hours (9 AM – 4 PM IST).
  3. Scope constraints: The agent can re-run existing scripts. It cannot create, edit, or delete anything.
  4. Audit trail: Every action is logged to DuckDB with a timestamp and description.

Design principle: guardrails belong in the tool implementation, not in the system prompt. Prompts are suggestions the LLM might creatively reinterpret. Code is a wall it cannot pass through.
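A sketch of how those guardrails sit inside the tool itself, wrapping the launcher above. The script names are hypothetical, and in the real agent the retry counter would live in the persisted state rather than in memory:

```python
from datetime import datetime, time, timedelta, timezone
import duckdb

IST = timezone(timedelta(hours=5, minutes=30))
MAX_RETRIES = 2
ALLOWED_SCRIPTS = {"fetch_prices.py", "fetch_macro.py", "compute_conviction_scores.py"}  # hypothetical
retry_counts: dict[str, int] = {}

def guarded_rerun(script: str) -> dict:
    """The tool the LLM actually calls; the guardrails are code, not prompt text."""
    now = datetime.now(IST)

    if script not in ALLOWED_SCRIPTS:                    # scope: re-run existing scripts only
        return {"status": "BLOCKED", "reason": "unknown script"}
    if time(9, 0) <= now.time() <= time(16, 0):          # no executions during market hours IST
        return {"status": "BLOCKED", "reason": "market hours"}
    if retry_counts.get(script, 0) >= MAX_RETRIES:       # the third attempt is refused
        return {"status": "BLOCKED", "reason": "retry limit reached"}

    retry_counts[script] = retry_counts.get(script, 0) + 1
    result = rerun_job(script)                           # fire-and-forget launcher from above

    con = duckdb.connect("agent_audit.duckdb")           # audit trail
    con.execute("CREATE TABLE IF NOT EXISTS actions (ts VARCHAR, action VARCHAR)")
    con.execute("INSERT INTO actions VALUES (?, ?)",
                [now.isoformat(), f"rerun {script}: {result['status']}"])
    con.close()
    return result
```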


Converge (Stateful Across Time)

The fourth element is what ties the whole pattern together and elevates it from "monitoring tool" to "autonomous agent."

The agent runs hourly from 5 AM to 11 PM IST. Each run follows the same cycle: load pending state from the previous run, observe the current system health, reason about what's changed, act on what needs fixing, persist new state for the next run.
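Stitched together, one hourly invocation looks roughly like the sketch below. `run_health_checks`, `call_llm`, and `send_summary_email` are hypothetical stand-ins passed in as callables; `plan_remediation` and `guarded_rerun` are the sketches from the earlier sections:

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

PENDING_STATE = Path("pending_state.json")
STATE_TTL = timedelta(hours=6)   # stale state auto-expires so yesterday's failures don't persist

def load_pending_state() -> dict:
    """Previous run's memory, or a clean slate if it's missing or too old."""
    if not PENDING_STATE.exists():
        return {"launched": []}
    state = json.loads(PENDING_STATE.read_text())
    saved_at = datetime.fromisoformat(state.get("saved_at", "1970-01-01T00:00:00+00:00"))
    return state if datetime.now(timezone.utc) - saved_at <= STATE_TTL else {"launched": []}

def save_pending_state(state: dict) -> None:
    state["saved_at"] = datetime.now(timezone.utc).isoformat()
    PENDING_STATE.write_text(json.dumps(state, indent=2))

def run_cycle(run_health_checks, call_llm, send_summary_email) -> None:
    """One hourly run: load state, observe, reason, act, persist, exit."""
    state = load_pending_state()                      # what the previous run started
    snapshot = run_health_checks()                    # deterministic Observe layer

    if all(c["status"] == "PASS" for c in snapshot["checks"]):
        return                                        # early exit: no LLM call, no email, zero cost

    plan = plan_remediation(snapshot, call_llm)       # Reason
    results = [guarded_rerun(a["script"]) for a in plan["actions"]]   # Act, behind guardrails

    # rerun_job already appended each launch to pending_state.json; refresh its timestamp
    # so the next invocation trusts it, then report what happened.
    current = json.loads(PENDING_STATE.read_text()) if PENDING_STATE.exists() else state
    save_pending_state(current)
    send_summary_email(snapshot, plan, results)
```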

A typical day:

  1. 5:30 AM — Fresh start. Prices stale, macro stale. Agent launches both jobs. Saves state. Sends email: "Two jobs launched."
  2. 6:30 AM — Resumes. Loads pending state. Checks PIDs — both jobs finished. Fresh health checks confirm prices and macro are green. But conviction scores are stale (they depend on prices, which just got updated). Launches scores. Saves state. Sends email: "Prices fixed. Scores launched."
  3. 7:30 AM — Resumes. Scores done. All 8 checks pass. Early exit — no LLM call, no email, zero cost.

Each run nudges the system closer to healthy. Like gradient descent — each step reduces the error, and the system naturally reaches equilibrium. If retries are exhausted, it stops and flags for human intervention. Stale state older than 6 hours auto-expires to prevent yesterday's failures from persisting.


The key insight: the agent works across time, not within a single execution.

It's more like a human shift worker than a tireless robot. It shows up, checks what happened since last time, does what it can, leaves notes for the next shift, and clocks out. The _sequence_ of runs is the agent. No single run needs to solve everything.

This is what makes fire-and-forget execution and stateful resume not separate features bolted on, but integral parts of the convergence design. Without fire-and-forget, the agent can't launch slow jobs. Without resume, the agent can't follow up on what it launched. Without both, convergence isn't possible. The three capabilities are one design, not three patches.

Design principle: design agents to converge, not to solve in one shot. Single-shot agents are fragile. Convergent agents are resilient.

What It Costs

Each agent run processes ~2,000 tokens of pipeline state and generates ~1,000 tokens of reasoning. At GPT-5.2 pricing, that's ~$0.01 per run. With 19 hourly runs per day and most being no-ops (all green = early exit, no LLM call), effective cost is under $1/month.
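The back-of-the-envelope math, assuming (illustratively) that only about three of the nineteen hourly runs on a typical day find something wrong and actually reach the LLM:

```python
cost_per_reasoning_run = 0.01   # ~2,000 input + ~1,000 output tokens per run
reasoning_runs_per_day = 3      # assumption; the other ~16 runs exit early with no LLM call
print(f"${cost_per_reasoning_run * reasoning_runs_per_day * 30:.2f} per month")  # ≈ $0.90
```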

For a quant trading pipeline where one missed price fetch can lead to stale signals and bad trades, this is essentially free insurance.

What The Pattern Doesn't Solve

The Observe-Reason-Act-Converge pattern handles the 80% case well: "something didn't run, figure out the root cause from the dependency chain, fix what you can, tell me what you can't."

It does not handle:

  1. Novel failures: API format changes, new error types the observation layer doesn't check for.
  2. Infrastructure issues: Network outages, disk space, server health.
  3. Manual-only operations: Our Zerodha token requires a browser login to regenerate. The agent correctly identifies this and flags it, but can't fix it.

The remaining 20% still needs a human. But that human gets a clear email saying exactly what's wrong and what to do, instead of discovering it at midnight while stress-checking positions.

Applying the Pattern

The Observe-Reason-Act-Converge pattern works anywhere you have:

  1. A system with observable health metrics (deployment status, test results, compliance checks)
  2. A fixed set of safe remediation actions (restart service, re-run test, send alert)
  3. Operations that span longer timescales than a single agent execution
  4. A need for causal reasoning across dependencies that's hard to code as rules

DevOps incident response. CI/CD pipeline management. Compliance monitoring. Database maintenance. The shape is always the same: deterministic observation, intelligent reasoning, guarded asynchronous execution, iterative convergence.

The key insight isn't that LLMs can reason about systems — that's table stakes now. The insight is that the architecture around the LLM matters more than the LLM itself.

Get the layers right, and the agent is safe, cheap, and genuinely useful. Get them wrong, and you've built an expensive liability.

For us, this agent didn't replace a data engineer. It eliminated the need to hire one.

Our small team builds strategies, improves models, and talks to clients during the day. The agent watches the pipeline around the clock. It costs us less than a coffee per month and gives us back 10+ hours per week we used to spend on monitoring.

For larger teams, the math is even more compelling.

If two of your four data engineers spend most of their time on pipeline monitoring and remediation, this pattern can shift that work to an agent and free those engineers to build.

Observe. Reason. Act. Converge.
