A week with a careful agent

2026-06-02 · 6 minute read · field-notes, practice

— halcyon team

This is not a benchmark. It is a notebook page.

For seven days in May, one of us used halcyon as the only AI tool in the workflow. No code completion, no chat sidebar, no autocomplete, no second agent running in another tab. One general agent, one long task, one quiet composer. We wrote down what happened.

Monday. A research question handed over in one sentence: what does the longitudinal literature say about long-term LLM use in real workplaces? The agent went away. It came back forty minutes later with eleven sources and a two-page summary. The summary noted disagreements between the papers instead of smoothing them over. The most useful thing was not the answer; it was that the forty minutes were quiet. No "thinking…" indicator demanding attention. The day's other work — a draft for a separate project — got done while the agent read.

Tuesday. We followed up on one of the papers the agent had cited: LLMs in the SOC, a ten-month longitudinal study of forty-five security-operations analysts and 3,090 queries. Its finding was specific and useful: analysts treated the LLM as an on-demand aid for sensemaking and context-building, almost never for high-stakes determinations, and the typical interaction was one to three turns long. That matched the shape of our own use this week. The agent was not the worker. The agent was the bench by the worker's elbow.

Wednesday. A draft. The composer in halcyon has no autocomplete, which is unsettling for the first hour and a relief after that. Sophie Leroy's attention-residue paper (OBHDP, 2009) keeps coming to mind. When the composer does not suggest the next word, the next word has to come from somewhere else. It comes from thinking. The draft was slower than a Cursor-style session. It was also more recognisably ours when it was done.

Thursday. A small script. The agent wrote a short Python tool to deduplicate a CSV column with fuzzy matching, in our stack, with a one-line install. It worked the first time. This is the kind of moment that, in a noisier product, would be packaged as a milestone with confetti. Here it was a file in a folder. We moved on.
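We did not keep the agent's script for this post, so what follows is not it. It is a minimal sketch of the same idea, assuming only the Python standard library (`difflib` for the fuzzy comparison). The function names `normalise`, `dedupe_column` and `dedupe_csv_column`, and the 0.85 similarity threshold, are our own illustrative choices, not anything halcyon produced.

```python
import csv
import io
from difflib import SequenceMatcher


def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(value.lower().split())


def dedupe_column(values, threshold=0.85):
    """Return the values with fuzzy duplicates dropped.

    The first spelling seen for each cluster is kept, in input order.
    `threshold` is a hypothetical default; tune it against real data.
    """
    kept = []  # pairs of (normalised form, original spelling)
    for value in values:
        key = normalise(value)
        if not key:
            continue  # skip empty cells
        is_duplicate = any(
            SequenceMatcher(None, key, seen).ratio() >= threshold
            for seen, _ in kept
        )
        if not is_duplicate:
            kept.append((key, value))
    return [original for _, original in kept]


def dedupe_csv_column(text, column, threshold=0.85):
    """Read CSV text with a header row and dedupe one named column."""
    reader = csv.DictReader(io.StringIO(text))
    return dedupe_column((row[column] for row in reader), threshold)
```

Calling `dedupe_column(["Acme Corp", "acme corp", "Acme Corp.", "Globex"])` keeps the first spelling of each cluster. The comparison is pairwise, so this sketch suits columns of a few thousand values; anything larger would want blocking or a dedicated matching library.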

Friday. A long synthesis: take Monday's research summary and turn it into something we could publish. The agent held the brief, the original sources, the Monday summary, and the half-written draft all at once, and let us ask about a single sentence six hours into the session. This is the patient context window we describe on the features page, observed at length. It is the feature that most reliably disappears into the work. You only notice it when you ask a follow-up about a paragraph you wrote on Monday and the agent answers as if it remembered. It did.

Saturday. No work. The agent did not page us.

Sunday. We re-read the week. The pattern that emerged was the one the RCT-and-diary study of generative AI coding tools (Dear Diary, arXiv 2024) reports from a large multinational software company: introducing the tool changed how the work felt — 84% of participants in that study reported positive changes in daily practice, 66% reported shifts in how they felt about the work — while trust in the tool's outputs did not move much. We recognise the shape. We are more willing to use halcyon for first drafts than for last drafts. We are more willing to use it for synthesis than for judgement. The week did not change that. It clarified it.

A few observations the week made hard to ignore.

One. The absence of pings is not a missing feature. It is the feature. The arXiv overview A Map of Exploring Human Interaction Patterns with LLM classifies four interaction styles — processing tool, analysis assistant, processing agent, creative companion — and the slow, quiet variant of halcyon's interaction lands somewhere between the third and fourth. The categories that demand the most user attention (chat-as-companion) are not the categories that produced our best work this week.

Two. The agent that does not interrupt produces fewer artefacts per hour and better artefacts per day. We have no formal measurement of this, only the week. But the cognitive-science background (Gloria Mark's figure of roughly twenty-three minutes to get back to an interrupted task, Leroy's attention residue) predicts this pattern, and the week did not contradict it.

Three. The hardest day was Tuesday, when we had to not check on the agent. The compulsion to check is real. It is also self-imposed. By Thursday it was gone.

This is not an argument. It is what one week looked like. The argument is in the other posts.

Sources

LLMs in the SOC: An Empirical Study of Human-AI Collaboration in Security Operations Centres — arXiv 2508.18947 — Longitudinal study of 45 analysts and 3,090 queries over 10 months; LLMs used as on-demand sensemaking aids in short 1–3 turn interactions.
Dear Diary: A randomized controlled trial of Generative AI coding tools in the workplace — arXiv 2410.18334 — Mixed-methods RCT + three-week diary study in a large multinational software company; 84% reported positive practice changes, trust in outputs unchanged.
A Map of Exploring Human Interaction Patterns with LLM — arXiv 2404.04570 — Classification of four human–LLM interaction patterns: processing tool, analysis assistant, processing agent, creative companion.
Why is it so hard to do my work? — Sophie Leroy, OBHDP 2009 — The attention-residue paper, cited here for the cost of unfinished switches that an autocomplete-free composer avoids.