The File System Is the New Database
How I Built a Personal OS for AI Agents
Based on Muratcan Koylan's thread — Context Engineer at Sully.ai
Every AI conversation starts the same way. You explain who you are. You explain what you're working on. You paste in your style guide. You re-describe your goals. You give the same context you gave yesterday, and the day before, and the day before that. Then, 40 minutes in, the model forgets your voice and starts writing like a press release.
I got tired of this. So I built a system to fix it.
I call it Personal Brain OS — a file-based personal operating system that lives inside a Git repository. Clone it, open it in Cursor or Claude Code, and the AI assistant has everything: my voice, my brand, my goals, my contacts, my content pipeline, my research, my failures. No database, no API keys, no build step. Just 80+ files in Markdown, YAML, and JSONL that both humans and language models read natively.
I'm sharing the full architecture, the design decisions, and the mistakes so you can build your own version. Not a copy of mine — yours. The patterns transfer. Take what fits, ignore what doesn't, and ship something that makes your AI actually useful instead of generically helpful.
1. The Core Problem: Context, Not Prompts
Most people think the bottleneck with AI assistants is prompting. Write a better prompt, get a better answer. That's true for single interactions. It falls apart when you want an AI to operate as you across dozens of tasks over weeks and months.
The Attention Budget
Language models have a finite context window, and not all of it is created equal. Dumping everything you know into a system prompt isn't just wasteful — it actively degrades performance. Every token you add competes for the model's attention.
Our brains work similarly. When someone briefs you for 15 minutes before a meeting, you remember the first thing they said and the last thing they said. The middle blurs. Language models have the same U-shaped attention curve, except theirs is mathematically measurable. Token position affects recall probability. Knowing this changes how you design information architecture for AI systems.
Instead of writing one massive system prompt, I split Personal OS into 11 isolated modules. When I ask the AI to write a blog post, it loads my voice guide and brand files. When I ask it to prepare for a meeting, it loads my contact database and interaction history. The model never sees network data during a content task, and never sees content templates during a meeting prep task.
Progressive Disclosure
This is the architectural pattern that makes the whole system work. Instead of loading all 80+ files at once, the system uses three levels:
- Level 1 — Routing: A lightweight `SKILL.md` that's always loaded. It tells the AI which module is relevant — "this is a content task, load the brand module" or "this is a network task, load the contacts."
- Level 2 — Module Context: Files like `CONTENT.md`, `OPERATIONS.md`, and `NETWORK.md` — 40–100 lines each, with file inventories, workflow sequences, and an `<instructions>` block with behavioral rules for that domain. Loaded only when that module is needed.
- Level 3 — Raw Data: JSONL logs, YAML configs, research documents — loaded only when the task specifically requires them. The AI reads contacts line by line from JSONL rather than parsing the entire file.
Three levels, maximum two hops to any piece of information.
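The three levels can be sketched as a tiny loader. This is a minimal illustration, not the actual implementation: the module names and file paths are assumptions, and in practice the routing lives in `SKILL.md` and the agent's own tool use, not in code.

```python
from pathlib import Path

# Level 1: the routing table, always in context (stands in for SKILL.md).
# Module names and file paths here are illustrative, not the real repo layout.
ROUTES = {
    "content": ["identity/CONTENT.md"],
    "network": ["network/NETWORK.md"],
    "operations": ["operations/OPERATIONS.md"],
}

def assemble_context(task_type, data_files=()):
    """Level 2: load the module context; Level 3: raw data only on request."""
    parts = [Path(p).read_text() for p in ROUTES.get(task_type, [])]
    parts += [Path(p).read_text() for p in data_files]
    return "\n\n".join(parts)
```

A content task loads only the content module; raw JSONL files join the context only when passed explicitly.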
The Agent Instruction Hierarchy
Three layers of instructions scope how the AI behaves at different granularities:
CLAUDE.md — Repository level: The onboarding document. Every AI tool reads it first and gets the full map of the project.
AGENT.md — Brain level: Contains seven core rules and a decision table that maps common requests to exact action sequences. The AI reads "User says 'send email to Z'" and immediately sees: Step 1, look up contact in HubSpot. Step 2, verify email address. Step 3, send via Gmail. No ambiguity, no hallucinated workflows.
Module-level files: Each directory has its own instruction file. OPERATIONS.md defines priority levels (P0: do today, P1: this week, P2: this month, P3: backlog) so the agent triages tasks consistently — because the system is codified, not implied.
When everything lives in one system prompt, rules contradict each other. By scoping rules to their domain, you eliminate conflicts and give the agent clear, non-overlapping guidance. You can also update one module's rules without risking regression in another.
2. The File System as Memory
One of the most counterintuitive decisions: no database. No vector store. No retrieval system except Cursor or Claude Code's native features. Just files on disk, versioned with Git.
Format-Function Mapping
Every file format was chosen for a specific reason:
JSONL for logs — append-only by design, stream-friendly (the agent reads line by line without parsing the entire file), and every line is self-contained valid JSON. JSONL's append-only nature prevents a category of bugs where an agent accidentally overwrites historical data. I've seen this happen with regular JSON — the agent writes the whole file, and you lose three months of contact history. With JSONL, the agent can only add lines. Deletion is done by marking entries as "status": "archived", which preserves the full history for pattern analysis.
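The append-only discipline is easy to enforce mechanically. A sketch, assuming hypothetical field names; the key point is that neither function ever opens the file in write mode:

```python
import json
from pathlib import Path

def append_entry(log: Path, entry: dict) -> None:
    """Append one self-contained JSON line; never rewrite the file."""
    with log.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def archive_entry(log: Path, entry_id: str) -> None:
    """'Delete' by appending a tombstone line, preserving full history."""
    append_entry(log, {"id": entry_id, "status": "archived"})
```

Opening with mode `"a"` is the safety mechanism: even a confused agent running this code can add lines but cannot truncate history.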
YAML for configs — handles hierarchical data cleanly, supports comments, and is readable by both humans and machines without the noise of JSON brackets. The comment support means I can annotate my goals file with context the agent reads but that doesn't pollute the data structure.
Markdown for narrative — LLMs read it natively, it renders everywhere, and it produces clean diffs in Git.
The full inventory: 11 JSONL files (posts, contacts, interactions, bookmarks, ideas, metrics, experiences, decisions, failures, engagement, meetings), 6 YAML files (goals, values, learning, circles, rhythms, heuristics), and 50+ Markdown files (voice guides, research, templates, drafts, todos). Every JSONL file starts with a schema line: `{"_schema": "contact", "_version": "1.0", "_description": "..."}`. The agent always knows the structure before reading the data.
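Reading schema-first is a one-liner pattern. A sketch of how a reader might consume these files, with the filename as an assumption:

```python
import json

def read_with_schema(path):
    """Read a JSONL file whose first line declares its schema."""
    with open(path, encoding="utf-8") as f:
        schema = json.loads(f.readline())
        assert "_schema" in schema, "first line must be the schema header"
        rows = [json.loads(line) for line in f if line.strip()]
    return schema, rows
```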
Episodic Memory
Most "second brain" systems store facts. Mine stores judgment as well. The memory/ module contains three append-only logs:
- `experiences.jsonl` — key moments with emotional weight scores from 1–10
- `decisions.jsonl` — key decisions with reasoning, alternatives considered, and outcomes tracked
- `failures.jsonl` — what went wrong, root cause, and prevention steps
Facts tell the agent what happened. Episodic memory tells the agent what mattered, what I'd do differently, and how I think about tradeoffs. When the agent encounters a decision similar to one I've logged, it references my past reasoning instead of generating generic advice.
When I was deciding whether to accept a $250K investment offer or join Sully.ai as Context Engineer, the decision log captured both options, the reasoning for each, and the outcome. If a similar career tradeoff comes up again, the agent doesn't give me generic career advice. It references how I actually think: "Learning > Impact > Revenue > Growth" is my priority order. "Can I touch everything? Will I learn at the edge of my capability? Do I respect the founders?" is my company-joining framework. The failures log is the most valuable — it encodes pattern recognition that took real pain to acquire.
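An illustrative `decisions.jsonl` pair of lines, with field names guessed from the descriptions above rather than taken from the real schema:

```jsonl
{"_schema": "decision", "_version": "1.0", "_description": "Key decisions with reasoning, alternatives, and outcomes"}
{"id": "dec-001", "decision": "Join Sully.ai as Context Engineer", "alternatives": ["Accept the $250K investment offer"], "reasoning": "Learning > Impact > Revenue > Growth", "outcome_tracked": true}
```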
Cross-Module References
The system uses a flat-file relational model. contact_id in interactions.jsonl points to entries in contacts.jsonl. pillar in ideas.jsonl maps to content pillars defined in identity/brand.md. Bookmarks feed content ideas. Post metrics feed weekly reviews. The modules are isolated for loading, but connected for reasoning.
"Prepare for my meeting with Sarah" triggers a lookup chain: find Sarah in contacts → pull her interactions → check pending todos involving her → compile a brief. Three files chained together. No loading the entire system.
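The chain is plain file reads and filters. A sketch under stated assumptions: the paths and field names are my guesses, and the pending-todos scan over the Markdown files is omitted:

```python
import json

def read_jsonl(path):
    """Read a JSONL file, skipping the leading schema line."""
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return [r for r in rows if "_schema" not in r]

def meeting_brief(name):
    """Chain contacts -> interactions into a brief (paths are assumptions)."""
    contact = next(c for c in read_jsonl("network/contacts.jsonl")
                   if c.get("name") == name)
    history = [i for i in read_jsonl("network/interactions.jsonl")
               if i.get("contact_id") == contact["id"]]
    return {"contact": contact, "history": history}
```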
3. The Skill System: Teaching AI How to Do Your Work
Files store knowledge. Skills encode process. I built Agent Skills following the Anthropic Agent Skills standard — structured instructions that tell the AI how to perform specific tasks with quality gates baked in.
Auto-Loading vs. Manual Invocation
Reference skills (voice-guide, writing-anti-patterns) set user-invocable: false in their YAML frontmatter. The agent injects them automatically whenever the task involves writing. I never invoke them — they activate silently, every time.
Task skills (/write-blog, /topic-research, /content-workflow) set disable-model-invocation: true. The agent can't trigger them on its own. I type the slash command, and the skill becomes the agent's complete instruction set for that task.
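In frontmatter form, the two invocation modes look roughly like this. The names and descriptions are illustrative; only the two flags come from the skills described above:

```yaml
# Reference skill: auto-injected for writing tasks, never typed by hand
name: voice-guide
description: Voice and tone rules for all writing tasks
user-invocable: false
---
# Task skill: runs only when I type /write-blog
name: write-blog
description: Full blog pipeline with template, persona, and research context
disable-model-invocation: true
```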
When I type /write-blog context engineering for marketing teams, five things happen automatically: the voice guide loads, the anti-patterns load, the blog template loads (7-section structure with word count targets), the persona folder is checked for audience profiles, and the research folder is checked for existing topic research. One slash command triggers a full context assembly. The skill file references the source module — it never duplicates content. Single source of truth.
The Voice System
My voice is encoded as structured data. The voice profile rates five attributes on a 1–10 scale:
| Attribute | Score |
|---|---|
| Formal / Casual | 6 |
| Serious / Playful | 4 |
| Technical / Simple | 7 |
| Reserved / Expressive | 6 |
| Humble / Confident | 7 |
The anti-patterns file contains 50+ banned words across three tiers, banned openings, structural traps (forced rule of three, copula avoidance, excessive hedging), and a hard limit of one em-dash per paragraph.
Most people describe their voice with adjectives like "professional but approachable." That's useless for an AI. A 7 on the Technical/Simple scale tells the model exactly where to land. The banned word list is even more powerful — it's easier to define what you're NOT than what you are. The agent checks every draft against the anti-patterns list and rewrites anything that triggers it. The result is content that sounds like me because the guardrails prevent it from sounding like AI.
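The guardrail check is mechanical. A sketch with a stand-in banned list (the real file has 50+ words across three tiers; these three are placeholders, not the actual entries):

```python
import re

BANNED = {"delve", "leverage", "tapestry"}  # stand-ins, not the real list

def voice_violations(draft):
    """Flag banned words and paragraphs that exceed one em-dash."""
    issues = [w for w in BANNED
              if re.search(rf"\b{w}\b", draft, re.IGNORECASE)]
    for n, para in enumerate(draft.split("\n\n"), start=1):
        if para.count("—") > 1:
            issues.append(f"paragraph {n}: more than one em-dash")
    return issues
```

Anything this returns gets rewritten before the draft moves on.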
Every content template includes voice checkpoints every 500 words. The blog template has a 4-pass editing process built in: structure edit (does the hook grab?), voice edit (banned words scan + sentence rhythm check), evidence edit (claims sourced?), and a read-aloud test. The quality gates are part of the skill, not something added after the fact.
Templates as Structured Scaffolds
Five content templates define the structure for different content types:
- Long-form blog: 7 sections (Hook, Core Concept, Framework, Practical Application, Failure Modes, Getting Started, Closing) with word count targets totaling 2,000–3,500 words
- Thread: 11-post structure with hook, deep-dive, results, and CTA
- Research: 4 phases — landscape mapping, technical deep-dive, evidence collection, and gap analysis
The research template outputs to knowledge/research/[topic].md with an Evidence Bank: statistics, quotes, case studies, and papers each cited with source and date, graded HIGH/MEDIUM/LOW on reliability. That research document then feeds into the blog template's outline stage. The output of one skill becomes the input of the next. The pipeline builds on itself.
4. The Operating System: How I Actually Use This Daily
The Content Pipeline
Seven stages: Idea → Research → Outline → Draft → Edit → Publish → Promote.
Ideas are captured to ideas.jsonl with a scoring system — each idea rated 1–5 on alignment with positioning, unique insight, audience need, timeliness, and effort-versus-impact. Proceed if total score hits 15 or higher. Research outputs to the knowledge module. Drafts go through four editing passes. Published content gets logged to posts.jsonl with platform, URL, and engagement metrics. Promotion uses the thread template to create an X announcement and a LinkedIn adaptation.
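The scoring gate is simple enough to express directly. Criterion names are shortened assumptions; the 1–5 scale and the 15-point threshold come from the pipeline above:

```python
CRITERIA = ("alignment", "insight", "audience_need", "timeliness", "effort_vs_impact")

def score_idea(ratings):
    """Sum five 1-5 ratings; proceed only at 15 or higher."""
    for c in CRITERIA:
        if not 1 <= ratings[c] <= 5:
            raise ValueError(f"{c} must be rated 1-5")
    total = sum(ratings[c] for c in CRITERIA)
    return total, total >= 15
```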
I batch content creation on Sundays: 3–4 hours, target output of 3–4 posts drafted and outlined.
The Personal CRM
Contacts organized into four circles with different maintenance cadences: inner (weekly), active (bi-weekly), network (monthly), dormant (quarterly reactivation). Each contact record has can_help_with and you_can_help_with fields that enable introduction matching — cross-referencing these fields surfaces mutually valuable intros. Interactions are logged with sentiment tracking (positive, neutral, needs_attention) so relationship health is visible at a glance.
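Introduction matching reduces to set intersection. A sketch assuming `you_can_help_with` doubles as a contact's needs list, which is my reading of the schema, not a documented rule:

```python
def intro_matches(contacts):
    """Pair contacts whose offers meet each other's needs, both ways."""
    pairs = []
    for i, a in enumerate(contacts):
        for b in contacts[i + 1:]:
            a_offers = set(a["can_help_with"]) & set(b["you_can_help_with"])
            b_offers = set(b["can_help_with"]) & set(a["you_can_help_with"])
            if a_offers and b_offers:  # keep only mutually valuable intros
                pairs.append((a["name"], b["name"], a_offers | b_offers))
    return pairs
```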
Specialized groups in circles.yaml — founders, investors, ai_builders, creators, mentors, mentees — each have explicit relationship development strategies. For AI builders: share useful content, collaborate on open source, provide tool feedback, amplify their work. For mentors: bring specific questions, update on progress from previous advice, look for ways to add value back. These are operational instructions the agent follows when I ask "Who should I reach out to this week?"
Automation Chains
Five scripts handle recurring workflows. The Sunday weekly review runs three in sequence: metrics_snapshot.py updates the numbers, stale_contacts.py flags stale relationships, weekly_review.py generates a summary document with completed-versus-planned, metrics trends, and next week's priorities. Not cron jobs — I trigger them with npm run weekly-review or ask the agent to run them.
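The chain itself is just sequential execution with fail-fast semantics. A sketch; the script paths are assumptions:

```python
import subprocess

CHAIN = ("scripts/metrics_snapshot.py",
         "scripts/stale_contacts.py",
         "scripts/weekly_review.py")  # paths are assumptions

def weekly_review(runner=subprocess.run):
    """Run the Sunday chain in order; check=True stops the chain on failure."""
    for script in CHAIN:
        runner(["python", script], check=True)
```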
The review isn't a report — it's the starting point for next week's planning. The automation creates a feedback loop: goals drive content, content drives metrics, metrics drive reviews, reviews drive goals.
5. What I Got Wrong — and What I'd Do Differently
Over-engineered schemas. My initial JSONL schemas had 15+ fields per entry. Most were empty. Agents struggle with sparse data — they try to fill in fields or comment on the absence. I cut schemas to 8–10 essential fields and added optional fields only when I actually had data for them. Simpler schemas, better agent behavior.
The voice guide was too long. Version one of tone-of-voice.md was 1,200 lines. The agent would start strong, then drift by paragraph four as the voice instructions fell into the lost-in-middle zone. I restructured it to front-load the most distinctive patterns (signature phrases, banned words, opening patterns) in the first 100 lines, with extended examples further down. The critical rules need to be at the top, not buried in the middle.
Module boundaries matter more than you think. I initially had identity and brand in one module. The agent would load my entire bio when it only needed my banned words list. Splitting them into two modules cut token usage for voice-only tasks by 40%. Every module boundary is a loading decision. Get them wrong and you load too much or too little.
Append-only is non-negotiable. I lost three months of post engagement data early on because an agent rewrote posts.jsonl instead of appending to it. JSONL's append-only pattern isn't just a convention — it's a safety mechanism. The agent can add data. It cannot destroy data. This is the most important architectural decision in the system.
6. The Results and the Principle Behind Them
The real result is simpler than any metric. I open Cursor or Claude Code, start a conversation, and the AI already knows who I am, how I write, what I'm working on, and what I care about. It writes in my voice because my voice is encoded as structured data. It follows my priorities because my goals are in a YAML file it reads before suggesting what to work on. It manages my relationships because my contacts and interactions are in files it can query.
The principle behind all of it: this is context engineering, not prompt engineering.
Prompt engineering asks "how do I phrase this question better?" Context engineering asks "what information does this AI need to make the right decision, and how do I structure that information so the model actually uses it?"
The shift is from optimizing individual interactions to designing information architecture. It's the difference between writing a good email and building a good filing system. One helps you once. The other helps you every time.
The entire system fits in a Git repository. Clone it to any machine, point any AI tool at it, and the operating system is running. Zero dependencies. Full portability. And because it's Git, every change is versioned, every decision is traceable, and nothing is ever truly lost.
Framework: Agent Skills for Context Engineering — 8,000+ GitHub stars, cited in academic research alongside Anthropic.