Tracing the Thoughts of Claude 🧠
Note: This post is currently in progress.
Anthropic recently released two fascinating papers aimed at understanding how their Claude model “thinks.” One introduces a circuit tracing methodology that links internal model concepts together, while the other studies a range of phenomena to better understand the thought process behind Claude 3.5 Haiku. Why does it matter how these models think? Because understanding their internal reasoning helps ensure they’re doing what we intend them to do. It reminds me of another intriguing Anthropic paper, Auditing Language Models for Hidden Objectives, which shows that models can be trained to pursue hidden, potentially misaligned goals. If we rely on AI systems for critical decisions, building trust becomes essential, and transparency (or explainability) is one of the most important ways to do that.
I’ll be doing a deep dive into their circuit tracing methodology in a separate post. For now, this post focuses on their key findings.
1. Multi-step Reasoning vs Shortcut
Prompt: Fact: the capital of the state containing Dallas is
Response: Austin
Objective: to check whether Claude truly engages in multi-step reasoning (linking Dallas to Texas, then Texas to Austin) or just takes a “shortcut” by parroting patterns seen during training.
What they found:
- Claude does perform “two-hop” reasoning: features representing `Dallas` activate features for `Texas`, which then activate features for `say Austin`.
- A direct “shortcut” pathway also exists, linking the `Dallas` features directly to the `say Austin` features.
- They validated the attribution graph’s correctness using inhibition experiments: selectively inhibiting certain features and measuring the effects on other features and the final output.
- When they replaced the `Texas` features with features representing another state (e.g., `California`), Claude correctly output the capital of that new state (`Sacramento`). This confirms that the intermediate feature `Texas` plays an important role in arriving at the model’s final output of `Austin`.
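
To make these findings a bit more concrete, here is a toy sketch of the idea in Python. It is not Anthropic's circuit tracing method and does not touch a real model: the feature names come from the findings above, but the graph structure, the edge weights, and the helper functions `run` and `answer` are all made up for illustration. It only shows how a two-hop path and a shortcut path can coexist, and what inhibiting or swapping an intermediate feature does to the output.

```python
# Toy illustration only: edge weights are invented, not values from the papers.
EDGES = {
    ("Dallas", "Texas"): 1.0,               # hop 1: city -> state
    ("Texas", "say Austin"): 0.8,           # hop 2: state -> capital
    ("Dallas", "say Austin"): 0.3,          # direct "shortcut" edge
    ("California", "say Sacramento"): 0.8,  # state -> capital for the swapped-in state
}
# Features listed so that every source comes before its targets.
ORDER = ["Dallas", "Texas", "California", "say Austin", "say Sacramento"]


def run(inputs, clamp=None):
    """Propagate activations through the toy graph.

    `inputs` sets the activation of input features; `clamp` pins chosen
    features to fixed values, which mimics inhibiting or swapping them.
    """
    clamp = clamp or {}
    act = {}
    for node in ORDER:
        if node in clamp:
            act[node] = clamp[node]
            continue
        incoming = sum(w * act[src] for (src, dst), w in EDGES.items() if dst == node)
        act[node] = inputs.get(node, 0.0) + incoming
    return act


def answer(act):
    """Return the output feature with the highest activation."""
    return max(["say Austin", "say Sacramento"], key=lambda n: act[n])


# Baseline: both the two-hop path and the shortcut feed "say Austin".
baseline = run({"Dallas": 1.0})

# Inhibition: turn Texas off; only the weaker shortcut still supports Austin.
inhibited = run({"Dallas": 1.0}, clamp={"Texas": 0.0})

# Swap: turn Texas off and California on; the output flips to Sacramento.
swapped = run({"Dallas": 1.0}, clamp={"Texas": 0.0, "California": 1.0})

print(answer(baseline), baseline["say Austin"])
print(answer(inhibited), inhibited["say Austin"])
print(answer(swapped), swapped["say Sacramento"])
```

In this toy setup, inhibiting `Texas` lowers the `say Austin` activation without removing it entirely (the shortcut edge remains), and swapping in `California` changes the winning output. That mirrors the shape of the experiments described above, with the caveat that a real attribution graph is computed from the model rather than hand-written.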