Note: This post is currently in progress.

Anthropic recently released two fascinating papers aimed at understanding how their Claude model “thinks.” One introduces a circuit tracing methodology that links internal model concepts together, while the other applies it to a range of phenomena to better understand the thought process behind Claude 3.5 Haiku. Why does it matter how these models think? Because understanding their internal reasoning helps ensure they’re doing what we intend them to do. It reminds me of another intriguing Anthropic paper, Auditing Language Models for Hidden Objectives, which shows that models can be trained to pursue hidden, potentially misaligned goals. If we rely on AI systems for critical decisions, building trust becomes essential, and transparency (or explainability) is one of the most important ways to earn it.

I’ll be doing a deep dive into their circuit tracing methodology in a separate post. For now, this post focuses on what they discovered—their key findings.

Contents:

🌟 1. Multi-step Reasoning vs Shortcut


Prompt: Fact: the capital of the state containing Dallas is
Response: Austin

Objective: to check whether Claude truly engages in multi-step reasoning (linking Dallas to Texas, then Texas to Austin) or just takes a “shortcut” by parroting patterns seen during training.


What they found:

  • Claude does perform “two-hop” reasoning: features representing Dallas activate features for Texas, which in turn activate “say Austin” features.
  • A direct “shortcut” pathway also exists, linking the Dallas features directly to the “say Austin” features.
  • They validated the attribution graph’s correctness using inhibition experiments: selectively inhibiting certain features and measuring the effects on downstream features and the final output.
  • When they replaced the Texas features with features representing another state (e.g., California), Claude correctly output the capital of that new state (Sacramento). This confirms that the intermediate Texas features play a causal role in the model’s final output of Austin.
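To build intuition for these interventions, here is a toy sketch (not Anthropic’s actual tooling; all feature names, edge weights, and the propagation rule are invented for illustration). It models the attribution graph as weighted edges between named features, with a strong two-hop path (Dallas → Texas → “say Austin”) and a weaker direct shortcut (Dallas → “say Austin”), then simulates inhibition and substitution:

```python
# Hypothetical attribution graph: (source, target) -> pathway strength.
EDGES = {
    ("Dallas", "Texas"): 1.0,            # hop 1: city activates state
    ("Texas", "say Austin"): 1.0,        # hop 2: state activates capital output
    ("Dallas", "say Austin"): 0.3,       # weaker direct "shortcut" pathway
    ("California", "say Sacramento"): 1.0,
}

def propagate(active, inhibit=frozenset()):
    """Spread activation along edges until nothing changes.

    `active` seeds features at strength 1.0; `inhibit` clamps features off,
    simulating the inhibition experiments.
    """
    acts = {f: 1.0 for f in active if f not in inhibit}
    changed = True
    while changed:
        changed = False
        for (src, dst), w in EDGES.items():
            if src in acts and dst not in inhibit:
                strength = acts[src] * w
                if acts.get(dst, 0.0) < strength:
                    acts[dst] = strength
                    changed = True
    return acts

# Baseline: both the two-hop path and the shortcut feed "say Austin".
baseline = propagate({"Dallas"})
print(baseline["say Austin"])                      # 1.0 (two-hop dominates)

# Inhibition: knocking out Texas leaves only the weaker shortcut signal.
inhibited = propagate({"Dallas"}, inhibit={"Texas"})
print(inhibited["say Austin"])                     # 0.3 (shortcut only)

# Substitution: swap in California features; Sacramento now wins.
swapped = propagate({"Dallas", "California"}, inhibit={"Texas"})
print(max(swapped, key=swapped.get))
```

The point of the sketch is the logic of the experiments, not the mechanics: if inhibiting an intermediate feature noticeably weakens the output, and substituting a different intermediate changes the answer accordingly, the intermediate step is causally involved rather than decorative.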

📚 References