[AARR] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Align AI R&D Team

💡 The Align AI Research Review is a weekly dive into the most relevant research papers, AI industry news and updates


Insights

  • Sparse autoencoders are capable of extracting millions of interpretable features from a large language model.

  • Scaling laws can be used to guide the training of sparse autoencoders for large models.

  • The features are highly abstract, multilingual, multimodal, and generalize between concrete and abstract references.

  • The dictionary size required to resolve corresponding features is systematically correlated with concept frequency.

  • Some of the features are related to safety concerns such as deception, sycophancy, bias, and dangerous content.

The paper *"Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"* scales up an interpretability technique termed "dictionary learning" to one of Anthropic's production models, Claude 3 Sonnet. The research explores how to understand and interpret the cognitive processes of AI.

The Anthropic research team managed to extract interpretable features from the activations of a large language model using sparse autoencoders. The methodology demonstrates the potential to provide valuable insights into the model's internal representations and behaviors.

Additionally, the research makes complex AI models easier to understand. It discusses how the activations of transformer models such as Claude 3 Sonnet can be decomposed into simpler, more interpretable features. These features, which represent particular patterns in the data (e.g., DNA sequences), are more accessible than individual neurons.

The researchers adopt sparse autoencoders, a dictionary learning algorithm that learns features from unlabeled data. By training these autoencoders on a substantial dataset of the model's activations, they identified millions of features the model uses to process information.

Applying this approach yields a deeper understanding of how the model makes decisions and generates outputs.


How Do Sparse Autoencoders Reveal Interpretable Concepts and Scaling Laws in Dictionary Learning?

  • The main approach employed in this investigation is the use of sparse autoencoders, a type of dictionary learning model.

  • The autoencoder is trained, under a sparsity constraint, to reconstruct the activations of the middle layer of Claude 3 Sonnet. This encourages the model to learn a set of sparse "features" that can be linearly combined to reproduce the original activations (see the sketch after this list).

  • The researchers trained autoencoders of three sizes (~1M, ~4M, and ~34M features). They found that the learned features correspond to interpretable concepts, as demonstrated by their activation patterns on methodically selected datasets.

  • Specifically, they found scaling laws that relate autoencoder performance to the number of features and the volume of training data under a fixed compute budget.

  • These laws provide a practical framework for efficiently scaling dictionary learning to even larger models.
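
To make this setup concrete, here is a minimal sparse-autoencoder sketch in PyTorch. The residual-stream width, feature count, ReLU encoder, and L1 coefficient are illustrative assumptions, not the architecture or hyperparameters Anthropic actually used.

```python
# Minimal sparse-autoencoder sketch (PyTorch). The dimensions, feature count,
# and L1 coefficient below are illustrative assumptions, not the values used
# for Claude 3 Sonnet.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # feature coefficients -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # linear combination of learned dictionary vectors
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()  # reconstruction error
    sparsity = f.abs().sum(dim=-1).mean()          # L1 penalty encouraging sparse features
    return recon + l1_coeff * sparsity

# Toy usage on random stand-ins for middle-layer activations.
sae = SparseAutoencoder(d_model=4096, n_features=16_384)  # scaled-down feature count
acts = torch.randn(8, 4096)
x_hat, f = sae(acts)
loss = sae_loss(acts, x_hat, f)
loss.backward()
```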

Figure: Scaling laws.


Can Manipulating Feature Activations Influence LLM Behavior?

The sparse features learned by the autoencoders are highly abstract, multilingual, multimodal, and generalize between concrete and abstract references.

Figure: Multilingual nature of the features.

The researchers identified a systematic correlation between the overall incidence of a concept in the training data and the dictionary size required to resolve a corresponding feature. This implies that it is possible to anticipate the computational resources required to identify features for a specific concept.

In addition, the researchers' most important result was demonstrating that these features can be used to influence the model's behavior. By manipulating the activations of specific features, they could reliably induce the model to exhibit, or refrain from, particular behaviors.


Are the Features Interpretable, and Do They Influence Model Outputs?

To evaluate the interpretability of the learned features and their ability to elucidate model behavior, the researchers conducted various analyses.

  • They began by manually evaluating a small number of features that respond to relatively simple concepts, such as the Golden Gate Bridge, cognitive sciences, monuments and tourist attractions, and transit infrastructure (see the example below).

  • Additionally, the researchers conducted "feature steering" experiments to demonstrate the impact of features on behavior.

  • By clamping features to artificially high or low values during a forward pass and observing the resulting changes in model outputs, they demonstrated that these manipulations produce output changes consistent with the feature's interpretation. For example, when the Golden Gate Bridge feature is clamped to a high value, the model repeatedly references the Golden Gate Bridge in its outputs (see the sketch below).
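
A hedged sketch of what such a steering intervention could look like, reusing the toy SAE above: encode the activations into feature space, clamp one feature, decode, and splice the result back into the forward pass. The feature index and clamp value below are hypothetical, not the paper's exact code.

```python
import torch

@torch.no_grad()
def steer(activations: torch.Tensor, sae, feature_idx: int, clamp_value: float) -> torch.Tensor:
    """Clamp one SAE feature and return steered activations to substitute back
    into the model's forward pass (illustrative sketch, not Anthropic's code)."""
    f = torch.relu(sae.encoder(activations))  # encode activations into feature space
    f[..., feature_idx] = clamp_value         # force the chosen feature artificially high (or 0.0 to ablate)
    return sae.decoder(f)                     # decode back into the residual stream

# Hypothetical example: push a "Golden Gate Bridge"-like feature far above its usual range.
# steered = steer(acts, sae, feature_idx=1234, clamp_value=10.0)
```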


Safety-Relevant Code Features

The work identified features in the model that reflect deeper and more abstract knowledge, extending beyond basic concepts. By examining features activated in programming contexts, they discovered a "code error" feature. This feature responds to a wide variety of bugs and errors in code, including typos, invalid operations, and bad inputs.

At first glance, it might be unclear how safety-relevant these features actually are. Of course, it's interesting to have features that fire on unsafe code, or bugs, or discussion of backdoors. But do they really causally connect to potential unsafe behaviors?

Interestingly, this feature is not exclusively linked to the presence of errors. The researchers demonstrated that the model generates error messages even for bug-free code when this feature is activated. Conversely, deactivating this feature can cause the model to "autocorrect" flaws in a prompt. This was achieved through the use of feature steering.

Additionally, they identified a feature that appears to represent the abstract concept of addition in code. This "addition function" feature is activated not only for direct addition operations but also for function calls that implicitly result in addition. Activating this feature in code that does not involve addition "tricks" the model into performing an addition. (For more detail, see "Features Representing Functions" in the paper.)
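
To make the distinction concrete, here is an illustrative pair of code contexts of the kind described above (our own toy example, not taken from the paper): a literal addition and a function call whose effect is an addition.

```python
def increment(count):      # helper whose body performs an addition
    return count + 1

total = 2 + 3              # direct addition: the '+' operator appears explicitly
total = increment(total)   # implicit addition: no '+' at the call site, yet the
                           # "addition function" feature reportedly fires here too
```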

These results show that the model has "learned" a rich set of abstract features reflecting intricate knowledge of how code is structured and executed.

This presents strong evidence that the interpretable features acquired through dictionary learning are not merely superficial correlates; rather, they appear to be the fundamental building blocks of the model's behavior.


Feature Completeness

The researchers found that as the size of the autoencoder increases, so does its coverage of the concepts the model represents. However, even the largest autoencoder, with 34 million features, did not capture all the knowledge contained in Claude 3 Sonnet.

For example, Claude 3 Sonnet can identify all the London boroughs and many streets within them, but only about 60% of the boroughs have corresponding features in the 34M autoencoder.

Figure: Nearby features in dictionary vector space touch on similar concepts.

They identified a systematic correlation between the frequency of a concept in the training data and the probability of locating a corresponding feature. The likelihood of dedicated features for rare concepts was significantly lower. This implies that truly exhaustive feature extraction may require autoencoders with billions of features and training datasets orders of magnitude larger than those used in this study.

Figure: A small insight into Sonnet's personality. Deeply conflicted?

The researchers identified a variety of feature categories by utilizing a combination of targeted prompts, automated interpretability techniques, and an analysis of the geometric relationships between features.

These categories include features for specific individuals, countries, code syntax elements, and list positions.


Limitations

The work has a few limitations. Some are superficial and reflect the early stage of this research, while others are fundamental challenges that will require novel research to address.

For example,

  • The research performs dictionary learning over activations sampled from a text-only dataset resembling parts of the model's pretraining distribution. The dataset did not include any "Human:" / "Assistant:" formatted data of the kind Claude is fine-tuned to operate on, nor did it include any images.

  • There is no clear ground-truth objective against which the extracted features can be validated.

  • The work lacks a clear method for assessing the completeness of feature recognition.

  • The research used an L1 activation penalty to encourage sparsity. This approach is known to suffer from "shrinkage," where non-zero activations are systematically underestimated (see the short worked example after this list).
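
As a quick worked illustration of shrinkage (our own toy derivation, not from the paper): for a single non-negative feature coefficient $f$ reconstructing a true activation $a > 0$, the L1-penalized objective

$$\min_{f \ge 0}\;(a - f)^2 + \lambda f$$

is minimized at $f^* = \max(a - \lambda/2,\; 0)$, so any non-zero recovered activation is biased low by $\lambda/2$ relative to the true value.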


Final Words

Using dictionary learning, the research investigated the underlying features of Anthropic's large language model, Claude 3 Sonnet. This approach is a step in the right direction: it helps researchers better understand the limitations of these intricate, currently hard-to-interpret models and supports the development of more interpretable LLMs in the future.


Learn More!

  • How sparse autoencoders can recover monosemantic features from a small one-layer transformer (Link)

  • Learn more about superposition (Link)

Want to ensure that your AI chatbot is performing the way it is supposed to?

United States

3003 N. 1st Street, San Jose, California

South Korea

23 Yeouidaebang-ro 69-gil, Yeongdeungpo-gu, Seoul

India

Eldeco Centre, Malviya Nagar, New Delhi
