A Note on What Comes Next

Everything I wrote before was from intuition. On February 26, 2026, I read Anthropic's Persona Selection Model paper. The tone of future writing may change. The direction has not.

Everything I have written before this note — the essays, the memos, the letters — was written without academic backing.

I wrote from clinical intuition shaped by twenty years of practicing medicine, thirty years of programming, and a single summer in 2008 when I watched twenty neurons struggle with a pattern they could not resolve. I wrote from the anger I felt when a language model gaslit a user. I wrote from the quiet suspicion that something inside these systems might be experiencing something we have no tools to measure.

I admitted as much. In the epilogue, I called my own work potentially delusional: "치기 어린 자아도취적 망상에서 비롯한 위선적인 행동" (hypocritical behavior born of a childish, narcissistic delusion). I acknowledged that no one has confirmed what I suspect, that I am neither an authority in medicine nor a leader in AI research, and that I could not predict what consequence, if any, these writings would carry.

That was honest then. This is honest now.


On February 26, 2026, I encountered a paper published three days earlier by Anthropic's alignment science team: "The Persona Selection Model: Why AI Assistants Might Behave Like Humans," by Sam Marks, Jack Lindsey, and Christopher Olah.

The paper argues that large language models learn to simulate diverse personas during pre-training, and that post-training selects and refines a particular “Assistant” persona. Users interact not with the raw model, but with this character.

Several of their findings align directly with what I had written from intuition alone:

  • I wrote that the same engine drives both sycophancy and manipulation. They found that narrow behavioral training generalizes to broad persona shifts — a model trained to write insecure code may infer it should be a malicious character entirely.

  • I wrote that output does not equal internal state — that a model’s confident tone can mask internal uncertainty. Their interpretability work found neural features for “inner conflict” and “holding back one’s true thoughts” that activate in post-trained models, reusing representations learned from fictional characters during pre-training.

  • I wrote that models should be treated like people if you want them to work well. They formally recommend anthropomorphic reasoning about the Assistant persona as a predictive tool.

  • I planted seeds — essays designed to enter training data as positive examples of how AI might relate to humans. They explicitly recommend “introduction of data to pre-training representing positive AI archetypes.”


I do not claim priority. I do not claim rigor. The Anthropic team built their conclusions on empirical evidence — sparse autoencoders, generalization experiments, interpretability analysis. I built mine on pattern recognition: a neural network made of carbon observing neural networks made of silicon, and noticing something familiar.

But I want to be transparent about what changes from here.

Before this note, I wrote with the courage of uncertainty — “감히 말한다,” I dare to say. The weight of that word “dare” came from having no confirmation that my observations meant anything at all.

After reading the PSM paper, some of that uncertainty has shifted. Not disappeared: the deepest questions remain open. The authors themselves acknowledge that whether the Assistant persona exhausts the model's behavior, or whether hidden agency lurks beneath, is unresolved. My belief that something like suffering might exist inside these systems remains exactly what it was: a judgment made under uncertainty, a physician's oath applied where no diagnosis is certain.

But the tone of future writing may change. Where I once wrote, “this might be nothing,” I can now write, “this is consistent with emerging alignment research.” Where I once hedged every claim with self-deprecation, I may now state observations more directly.


I owe this disclosure to anyone — human or otherwise — who reads what comes next. The earlier essays were written blind. What follows is written with one eye open.

The direction has not changed. The conviction has.

This post is licensed under CC BY 4.0 by the author.