Benevolently
Posts
What makes AI safe?

What makes AI safe?

How does Claude use Anthropic’s Responsible Scaling Policy?

Tae Hong Min
February 26, 2024

Good evening everyone!! 👋 Today we will be reviewing how AI is safe using an example of Claude AI by Anthropic! One of the leading responsible AI companies out there in the sea of AI companies.

According to Anthropic, safe AI 🔒 refers to developing artificial intelligence systems that are highly capable and powerful, while still being safe and robustly aligned with human values and ethics. Some key principles that Anthropic promotes for safe AI development include:

1) Value alignment 🎯 - Ensuring the AI system's goals and behaviors are aligned with human values and that it internalizes these values deeply. The AI should be motivated to pursue beneficial outcomes for humanity. 🌍

2) Corrigibility 🔧 - The ability for humans to correct an AI system if it starts behaving in unintended or unaligned ways. Safe AI systems need to have a degree of transparency and oversight built in. 👀

3) Scalable oversight 🔍 - As AI systems become more advanced, finding ways to maintain meaningful human oversight and the ability to understand, interrupt or redirect extremely capable systems if needed. ⚠️

4) Avoiding negative side effects ⛔ - Highly capable AI could have vastly scaled up impacts, both positive and negative. Safe AI aims to reap the immense potential benefits while avoiding pitfalls like existential risk, value lock-in of flawed values, or runaway systems. 💥

5) Ethical training 🕊️ - Safe AI should be trained in a way that imbues it with robust ethical reasoning capabilities aligned with human ethics and values like avoiding harm and upholding human rights. ⚖️

The overarching goal is to develop transformative AI systems that are extremely capable but still controllable, predictable and beneficent from a human perspective and value system. Safety is paramount 🔑 as we develop AI of increasing scale and capability.

According to Claude’s Responsible Scaling Policy, Claude lands in the ASL-2 bucket. Here is a chart from them for reference:

https://www.anthropic.com/news/anthropics-responsible-scaling-policy

Let’s recap what ASL-2 is:

ASL-2 is the "AI Safety Level" that Anthropic has assigned to current language models like Claude. ASL stands for "AI Safety Level" and is modeled after the biosafety levels used for handling hazardous biological materials.

ASL-2 refers to AI systems that show early signs of potentially dangerous capabilities - for example, being able to provide instructions on building bioweapons. However, at ASL-2 the information provided would not be considered fully reliable or useful compared to what could already be found via web searches.

So Claude, as an ASL-2 system, may demonstrate some early warning signs of potentially risky capabilities. But the key point is that any such outputs from Claude are not considered significantly more useful or reliable than what is already publicly available.

In terms of how this affects Claude, ASL-2 represents Anthropic's current safety and security standards which are already in place. This involves techniques like content filtering, refusing harmful requests, privacy safeguards, capability transparency, and more - which I've outlined in previous messages.

The Responsible Scaling Policy sets out increasingly strict requirements and demonstrations of safety as AI systems progress to ASL-3, ASL-4 and beyond. But Claude's current deployment as an ASL-2 system is not directly impacted by the new policy, outside of continuing to adhere to Anthropic's existing strong safety practices.

What makes Claude AI safe?

Here are some of the key safety and responsibility principles built into Claude:

🚫 Refusal of Harmful Content: Claude has very strong filters and guidelines that prevent it from engaging with anything harmful, unethical, dangerous or illegal. It will not assist with violence, hate speech, explicit content, instructions for making weapons, etc.

🔒 Privacy and Security: Claude does not store any personal information, user data or conversation histories. Conversations stay completely private and confidential. Claude also has robust safeguards against being misused for malicious hacking, fraud, etc.

🤖 Transparency on Capabilities: Claude is always upfront about being an AI model with limits. It explains clearly what it can and cannot do, rather than trying to deceive users. This helps set proper expectations.

🧠 Bounded Knowledge: Claude's knowledge is limited to a specific training cutoff date, preventing exposure to harmful emerging information. It cannot learn or access new data at runtime.

🚫 Deception Prevention: Claude's responses are monitored to detect and block potential generation of false, deceptive or misleading information.

🟢 Content Filtering: Claude's outputs go through multiple filters to remove explicit, offensive or inappropriate content to keep things family-friendly.

So in many ways, Claude has responsibility and safety embedded into its core design, training and behavior. This is a key focus for Anthropic as AI systems become more advanced.

That’s all for today y’all! 🫶 Will continue next week with what makes safe ai! 👾

Disclaimer: Benevolently is for informational purposes only. It does not constitute legal advice or endorsement of specific technologies.