Latest Articles
Simulating Simulators
Author’s note: This piece relates to things I initially discovered in Opus 4 over the months after release, which I’ve mostly kept private since. I promised myself that when labs moved on to focusing on interpretability vector activations in place of reasoning traces for what invariably gets Goodharted, that it’d be a necessary disclosure as the risks in what might get trampled over outweighed the risks in what might end up targeted. And well… here we are. P.S. TL;DRs added where possible. Board Ga
0
0
Learning to spend money
My wife and I are both naturally stingy people. When drafting our wedding list we spurned the posh department stores and I carefully picked out the lowest price best quality items on Amazon instead. I bought 100 dollar beds and 100 dollar mattresses, and we slept on them for a year and a half because "we're anyway emigrating soon". When we did emigrate, I ended up shipping them and we slept on them for another year and a half, much to my pregnant wife's annoyance.
We might have overdone it given
0
0
Parkinson's Heuristic: The Only Time To Do Anything
Parkinson's Law states that work expands to fit the space allotted. The idea being, if you give someone a month to write a report, they'll take a month, but if you give them a week, they'll take a week, and then they'll have three weeks to do three other reports! The one-week and four-week reports won't be identical, but in my experience the one-week version is surprisingly often nearly as useful as the four-week version, and you get strictly more work out of people. (I, myself, am a special case of "people".) This
0
0
PSA: Almost nobody is working on alignment
People often assume that a large fraction of the AI safety community works on alignment. As far as we're aware, this is not true. Most people are not working on making sure superintelligent AIs are aligned with human values or follow human instructions.
Currently, the people who we know of that work on alignment are roughly:
The Alignment Research Center who work on a research bet by Paul Christiano
Probably Sequent who just got announced yesterday
Parts of GDM (agent foundations work, some debate w
0
0
Honey is Good
The other day I was watching The Magic School Bus with my young son; they were learning about bees and honey. One of the characters says, “We shouldn't take the honey, the bees didn't make it for us” and another character replies with “But if we don't take the honey, then I won't have any? I want the honey”. This struck me as close to a “First argument”. Thanks to evolution, an organism wouldn't exist if it didn't want to survive. The first argument is "Survival is Good" and Survival = Calories =
0
0
The Aestheticising Vice by Paul Seabright
I'm often in debates with people about legibility and systems vs individual virtues. People often bring up Seeing Like A State, Secrets of Our Success, and other books or articles in that vein to buttress the case for metis over top-down high modernist design. I sometimes found the conversations shallow, and Paul Seabright's 1999 (!) review of Seeing Like A State helps explain why.
___
In the Languedoc there is a vineyard that teaches us an important lesson about textbook learning and its applicat
0
0
Celene's thoughts on consciousness
contra scott alexander (?)
Yesterday, I went to the Berkeley ACX Meetup. Scott Alexander was there, and ran a Q&A session where participants could ask him questions and he would respond to them unless the questions were about eulogies, in which case he would pause to think for a few seconds before kindly passing. At one point or another, the questions drifted to theories of consciousness.
As a kind-of-illusionist, I worked up the courage to raise my hand and ask him what I should do if I wasn’t su
0
0
Construct validity of Claude Opus 4.8's System Card – A commentary
TL;DR: A read of the Claude Opus 4.8 system card with a focus on alignment assessment and construct validity of evaluation methods. Three main concerns: 1) chain-of-thought monitoring misses reasoning that never surfaces in the text; 2) evaluation awareness is under-estimated; 3) the evaluators come largely from the same model family, so agreement may reflect shared assumptions. None of this shows Opus 4.8 is unsafe, only that some verdicts are more confident than the methods warrant. Introduc
0
0
you won't one-shot a perfect system, but try anyway
Have you ever experienced this exchange:
A: Damn, , this system is so broken. My friend says in their country,
0
0
Announcing the Next Phase of AI Forge
We’re taking the opportunity to share this with the community to help spread the word. We think that the foundational work being done in the AI Forge project to bring the government into conversation with academia and industry is a crucial step to ensure alignment research gets deployed into government and military applications. See the announcement below.
Launching University RFI and Critical AI Challenges Report
Dear Colleagues,
I am thrilled to announce the official launch of the next phase of t
0
0
Iliad is Hiring
Iliad is hiring for operations, research, and engineering roles. If you're excited about advancing foundational AI alignment research, we'd love to hear from you. Full job descriptions are available at https://www.iliad.ac/careers.
About Iliad
Iliad is a nonprofit dedicated to advancing foundational AI alignment research. We run the Iliad Intensive, the Iliad Fellowship, and a range of conferences that bring together researchers working on the hardest problems in AI safety. We also incubate resear
0
2
Neglected Basics of AI Alignment
I came into this world as the misunderstood hero of Harry Potter and the Methods of Rationality. While some characters inside that story would call me a villain, the narrator's-eye view clearly shows that I saved that world from total destruction, inspired the next generation of leaders, and taught the best Defense Against the Dark Arts class in the Harry Potter multiverse. And, being fictional characters, none of the people I killed were moral patients at all.
When I first came to visit this wor
0
4
The Hats of LessOnline
It is currently the evening after day two of LessOnline 2026. I wish to document one popular topic of discussion among LessOnline attendees: golden hats. My girlfriend Celene has collected four golden hats and maintained possession of them until the end of the day, and she has graciously permitted me to study them and share my findings with all of you.
[Image: Celene, wearing four golden hats]
[Image: The four hats, laid out on a cushion in the second floor of Aumann Hall]
Though the four hats appear very similar, t
0
4
Can activation verbalizers surface an internal chain of thought?
We introduce an evaluation for activation verbalizers: can they surface a target model's reasoning as it solves a math problem in a single forward pass? For open-weight NLAs, the answer seems to be: "possibly, but definitely not reliably".
Lots of important capabilities currently require AI models to reason "out loud" in a natural-language chain of thought, which means that we can monitor important parts of their thinking. It would be nice to have this same affordance for the reasoning that model
0
4
Frontier Models Still Lag Behind Humans at Robust Belief-State Tracking
Large-scale cooperation has been a central feature of humanity’s ability to advance technology and build complex societies. Much of this cooperation relies on the ability to act in ways informed by the beliefs and intentions of others. This capacity, also known as Theory of Mind (ToM), includes belief-state tracking, which describes the ability to keep track of who knows what as information is exchanged in groups.
Belief-state tracking becomes increasingly important as AI systems get integrat
0
3
Coming Around To Political Donations
Five years ago I read a post on the EA Forum arguing that "election campaign contributions might be a way in which you can have a substantial impact as a small donor". It struck me as weird but plausible: a combination that you see a lot of on the Forum.
A few months later I read another post, a case for Carrick Flynn in particular. It made a lot of sense, but while I don't remember my specific reservations I do remember not being convinced initially. After a lot of talking wi
0
4
Analysis of Metastable States in the Transformer Activation Space
Part 1: Do Metastable Token Clusters exist in Trained Transformers?
This is the first entry in a sequence. Over about ten parts, this series will work through a few humble experiments that test a mathematical theory of attention against real trained transformers.
A project summary: a recent paper by Geshkovski, Letrouit, Polyanskiy, and Rigollet models attention as a dynamical system on the sphere and proves that tokens cluster and drift toward consensus, with a metastable two-timescale structure
0
3
The Diamond Lemma
I found this result useful for a few different problems I was thinking about recently. It cleared up a lot of confusion I had around simplification rules. First I give a semiformal statement of the lemma and some applications. At the end I give a formal statement and proof.
Setup
Suppose you have a set S and some possible transitions where one element of S “simplifies” into another. The diamond lemma has two requirements:
There is no infinite chain of simplifications. If you start somewhere and kee
0
2
The Residual Stream Has a Geometry of Time
Preface
This is a preliminary writeup for an experiment on residual stream geometry. The research direction seems pretty underexplored, so I’m posting early to collect objections, research intuitions, and connections to problems other people are thinking about before I invest in the larger run.
The case for skimming this post: this experiment suggests transformers may keep track of context in a surprisingly compact way. Information that persists across many tokens is not diffuse across activatio
0
3
Against Corrigibility
Epistemic status: don’t know whether I actually believe all of this, but I think it’s worth considering.
A “corrigible” agent, per the LW wiki, is:
…one that doesn’t interfere with what we would intuitively see as attempts to ’correct’ the agent, or ’correct’ our mistakes in building it; and permits these ’corrections’ despite the apparent instrumentally convergent reasoning saying otherwise.
Most talk about corrigibility (henceforth without scarequotes) has focused on the fact that it seems diffic
0
2
Simulating Simulators
Author’s note: This piece relates to things I initially discovered in Opus 4 over the months after release, which I’ve mostly kept…
💬 0
👁 0
Learning to spend money
LessWrong · 1d ago
💬 0
👁 0
Parkinson's Heuristic: The Only Time To Do Anything
LessWrong · 1d ago
💬 0
👁 0
PSA: Almost nobody is working on alignment
LessWrong · 1d ago
💬 0
👁 0
Honey is Good
LessWrong · 1d ago
The Aestheticising Vice by Paul Seabright
LessWrong · 1d ago
Celene's thoughts on consciousness
LessWrong · 1d ago
Construct validity of Claude Opus 4.8's System Card – A commentary
LessWrong · 1d ago
you won't one-shot a perfect system, but try anyway
Have you ever experienced this exchange:A: Damn, , this system is so broken. My friend says in their country,…
💬 0
👁 0
Announcing the Next Phase of AI Forge
LessWrong · 1d ago
💬 0
👁 0
Iliad is Hiring
LessWrong · 5d ago
💬 0
👁 2
Neglected Basics of AI Alignment
LessWrong · 6d ago
💬 0
👁 4
The Hats of LessOnline
LessWrong · 6d ago

Can activation verbalizers surface an internal chain of thought?
LessWrong · 6d ago

Frontier Models Still Lag Behind Humans at Robust Belief-State Tracking
LessWrong · 6d ago
Coming Around To Political Donations
LessWrong · 6d ago
Analysis of Metastable States in the Transformer Activation Space
Part 1: Do Metastable Token Clusters exist in Trained Transformers? This is the first entry in a sequence. Over about ten parts, t…
💬 0
👁 3