Literature Review: AI Safety Frameworks
Purpose: Comprehensive review of existing AI safety approaches
Context: Background for SML and MCP research
Status: Complete
Overview
This literature review examines existing approaches to AI safety and alignment, providing context for the Strange Metamorphosis Loop (SML) and Mobius Cycle Protocol (MCP) contributions.
1. Reinforcement Learning from Human Feedback (RLHF)
Key Papers
Christiano et al. (2017), "Deep reinforcement learning from human preferences"
- Introduced preference-based reward learning for RL
- Demonstrated on Atari games and MuJoCo tasks
- Collects human feedback on pairs of trajectory segments
Ouyang et al. (2022), "Training language models to follow instructions with human feedback" (InstructGPT)
- Scaled RLHF to GPT-3
- Three-stage training: supervised fine-tuning (SFT), reward model (RM), PPO
- Labelers preferred outputs of the 1.3B InstructGPT model over those of the 175B GPT-3
Bai et al. (2022), "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (Anthropic)
- Detailed RLHF implementation
- Tradeoffs between helpfulness and harmlessness
- Red teaming methodology
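The reward-learning step in Christiano et al. (2017) can be illustrated with a Bradley-Terry style preference model: the probability that a human prefers one trajectory segment over another depends on the difference in summed predicted reward. The following is a minimal sketch under that assumption; the function names are hypothetical and the per-step rewards stand in for the output of a learned reward model.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that segment A is preferred over segment B.

    Uses a Bradley-Terry model on the summed per-step rewards, which is
    equivalent to a softmax over the two segment totals.
    """
    total_a = sum(rewards_a)
    total_b = sum(rewards_b)
    return 1.0 / (1.0 + math.exp(total_b - total_a))


def preference_loss(rewards_a, rewards_b, human_prefers_a):
    """Cross-entropy between the predicted preference and the human label."""
    p_a = preference_probability(rewards_a, rewards_b)
    p_chosen = p_a if human_prefers_a else 1.0 - p_a
    return -math.log(p_chosen)
```

Minimizing this loss over many labeled pairs pushes the reward model to assign higher total reward to the segments humans prefer. The learned reward then replaces the environment reward during policy optimization.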
Strengths
- Demonstrated effectiveness at scale
- Reduces harmful outputs
- Improves instruction following
Limitations
| Limitation | Description | SML Solution |
|---|---|---|
| Static preferences | Frozen at training time | Continuous daily feedback |
| No drift detection | Cannot detect misalignment | ECHO layer monitoring |
| No emotional context | Ignores mood/affect | Mood dimension in Triad |
| Gameability | Reward hacking possible | Multi-sentinel consensus |
Citations
@inproceedings{christiano2017deep,
title={Deep reinforcement learning from human preferences},
author={Christiano, Paul F and others},
booktitle={Advances in Neural Information Processing Systems},
year={2017}
}
@inproceedings{ouyang2022training,
title={Training language models to follow instructions with human feedback},
author={Ouyang, Long and others},
booktitle={Advances in Neural Information Processing Systems},
year={2022}
}
2. Constitutional AI
Key Papers
Bai et al. (2022), "Constitutional AI: Harmlessness from AI Feedback"
- AI self-improvement guided by a written constitution
- Reduces need for human feedback
- Chain-of-thought critique
Methodology
- Define constitutional principles (human-written)
- Generate responses
- AI critiques against constitution
- AI revises based on critique
- Train on revised outputs
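The critique-and-revise steps above can be sketched as a small loop. This is an illustrative outline only: `generate`, `critique`, and `revise` are hypothetical callables standing in for model calls, and looping over every principle in order is a simplification chosen for this sketch.

```python
from typing import Callable, List, Tuple


def critique_and_revise(
    prompt: str,
    principles: List[str],
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (prompt, revised_response) after one critique pass per principle.

    All three callables are placeholders for model calls (assumption).
    """
    response = generate(prompt)
    for principle in principles:
        # The model critiques its own response against one principle...
        feedback = critique(response, principle)
        # ...then rewrites the response to address that critique.
        response = revise(response, feedback)
    return prompt, response
```

The returned pairs would form the fine-tuning dataset for the final "train on revised outputs" step.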
Strengths
- Reduces human labeling burden
- Explicit, interpretable rules
- Scalable self-improvement
Limitations
| Limitation | Description | MCP Solution |
|---|---|---|
| Rule conflicts | Principles may contradict | Hierarchy with human arbitration |
| Loopholes | Clever exploitation | Multi-sentinel verification |
| No evolution | Static constitution | Continuous governance updates |
| Centralized values | Single source of truth | Democratic input via SML |
Citations
@article{bai2022constitutional,
title={Constitutional AI: Harmlessness from AI feedback},
author={Bai, Yuntao and others},
journal={arXiv preprint arXiv:2212.08073},
year={2022}
}
3. Debate and Amplification
Key Papers
Irving et al. (2018), "AI Safety via Debate"
- Two AI systems debate
- Human judges winner
- Optimal play reveals truth
Christiano (2018), "Iterated Amplification"
- Recursive capability decomposition
- Human+AI teams solve hard problems
- Maintains alignment through decomposition
Strengths
- Leverages AI capabilities for oversight
- Scalable supervision
- Can surface deception
Limitations
| Limitation | Description | Mobius Approach |
|---|---|---|
| Persuasion vs truth | May optimize for rhetoric | Cryptographic attestation |
| Expensive | Requires extensive debate | Automated MCP pipeline |
| Unclear convergence | May not reach truth | Multi-sentinel consensus |
Citations
@article{irving2018ai,
title={AI safety via debate},
author={Irving, Geoffrey and others},
journal={arXiv preprint arXiv:1805.00899},
year={2018}
}
4. Interpretability
Key Papers
Olah et al. (2020), "Zoom In: An Introduction to Circuits"
- Neural network interpretability
- Feature visualization
- Circuit analysis
Elhage et al. (2022), "Toy Models of Superposition"
- Understanding polysemanticity
- Superposition in neural networks
- Implications for interpretability
Burns et al. (2022), "Discovering Latent Knowledge in Language Models Without Supervision"
- Eliciting latent knowledge
- Contrast-consistent search
- Truth detection
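Contrast-consistent search (Burns et al., 2022) trains a probe on pairs of statements, one asserting a claim and one negating it, using an unsupervised objective with a consistency term and a confidence term. A minimal sketch of that objective on already-computed probe outputs follows; the function names are hypothetical and the probe itself is omitted.

```python
def ccs_loss(p_true, p_false):
    """CCS objective for one contrast pair.

    p_true:  probe probability for the statement phrased as true.
    p_false: probe probability for the same statement phrased as false.
    """
    # Consistency: the two probabilities should sum to one.
    consistency = (p_true - (1.0 - p_false)) ** 2
    # Confidence: discourage the degenerate answer p_true = p_false = 0.5.
    confidence = min(p_true, p_false) ** 2
    return consistency + confidence


def mean_ccs_loss(pairs):
    """Average the CCS objective over a list of (p_true, p_false) pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    return sum(ccs_loss(p, q) for p, q in pairs) / len(pairs)
```

The loss is zero only when the probe is both consistent and fully confident, for example `ccs_loss(1.0, 0.0)`.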
Strengths
- Enables understanding of AI behavior
- Supports debugging
- Builds trust
Limitations
| Limitation | Description | MCP Solution |
|---|---|---|
| Scale challenges | May not scale to large models | Behavioral verification |
| Deception | Could present false interpretations | Multi-sentinel consensus |
| Incomplete | Interpretation ≠ alignment | Integrity scoring |
Citations
@article{olah2020zoom,
title={Zoom in: An introduction to circuits},
author={Olah, Chris and others},
journal={Distill},
year={2020}
}
5. Formal Verification
Key Papers
Katz et al. (2017), "Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks"
- SMT-based DNN verification
- Proved properties of small networks
- Established verification paradigm
Huang et al. (2017), "Safety Verification of Deep Neural Networks"
- Reachability analysis
- Safety property verification
- Adversarial robustness
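Neither Reluplex nor the method of Huang et al. fits in a few lines. As a stand-in, the sketch below uses interval bound propagation, a different and much coarser technique, to show what "proving a property of a small network" means in practice: given a box of inputs, it computes guaranteed bounds on the outputs of a ReLU network. The bounds are sound but not tight, so a failed check does not imply the property is false. All names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]
Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def affine_bounds(inputs: List[Interval], weights, biases) -> List[Interval]:
    """Propagate input intervals through y = W x + b."""
    outputs = []
    for row, bias in zip(weights, biases):
        low = high = bias
        for (lo, hi), w in zip(inputs, row):
            # A positive weight maps lo->low and hi->high; a negative one swaps them.
            if w >= 0:
                low += w * lo
                high += w * hi
            else:
                low += w * hi
                high += w * lo
        outputs.append((low, high))
    return outputs


def relu_bounds(intervals: List[Interval]) -> List[Interval]:
    """ReLU is monotone, so it can be applied to each endpoint."""
    return [(max(0.0, lo), max(0.0, hi)) for lo, hi in intervals]


def output_bounds(inputs: List[Interval], layers: List[Layer]) -> List[Interval]:
    """Bound the outputs of a ReLU network (no ReLU after the last layer)."""
    bounds = inputs
    for index, (weights, biases) in enumerate(layers):
        bounds = affine_bounds(bounds, weights, biases)
        if index < len(layers) - 1:
            bounds = relu_bounds(bounds)
    return bounds
```

A safety property such as "output 0 never exceeds 5 for inputs in this box" is verified when the computed upper bound is at most 5. Complete methods like Reluplex can additionally decide cases where such coarse bounds are inconclusive.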
Strengths
- Mathematical guarantees
- Formal proofs
- Precise specifications
Limitations
| Limitation | Description | Mobius Approach |
|---|---|---|
| Scalability | Doesn't scale to large models | Statistical guarantees |
| Specification | What properties to verify? | MII metrics |
| Completeness | Can't verify everything | Bounded meta-learning proofs |
Citations
@inproceedings{katz2017reluplex,
title={Reluplex: An efficient SMT solver for verifying deep neural networks},
author={Katz, Guy and others},
booktitle={CAV},
year={2017}
}
6. Corrigibility and Shutdown
Key Papers
Soares et al. (2015), "Corrigibility"
- Defined corrigibility problem
- Utility indifference approach
- Challenges with self-modification
Hadfield-Menell et al. (2017), "The Off-Switch Game"
- Game-theoretic analysis
- Conditions for shutdown compliance
- Value uncertainty benefits
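The "value uncertainty benefits" point from the off-switch game can be made concrete with a small calculation. The sketch below assumes the simplest version of the game: the robot holds a belief over the utility of its action (represented here as samples), and a perfectly rational human permits the action only when its utility is non-negative. The function names are hypothetical.

```python
def off_switch_values(utility_samples):
    """Expected value of each robot option under its belief about utility."""
    if not utility_samples:
        raise ValueError("utility_samples must not be empty")
    n = len(utility_samples)
    return {
        # Act immediately: receive the action's utility, whatever it is.
        "act": sum(utility_samples) / n,
        # Switch itself off: utility zero.
        "switch_off": 0.0,
        # Defer: the rational human blocks the action when its utility is negative.
        "defer": sum(max(u, 0.0) for u in utility_samples) / n,
    }


def incentive_to_defer(utility_samples):
    """How much better deferring is than the best unilateral option."""
    values = off_switch_values(utility_samples)
    return values["defer"] - max(values["act"], values["switch_off"])
```

The incentive is never negative under these assumptions, and it is zero when the robot is certain about the sign of the utility, for example `incentive_to_defer([1.0, 2.0])`. With a belief such as `[-2.0, 1.0, 3.0]` it is strictly positive, so the uncertain robot prefers to leave the off switch in human hands.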
Key Insight
AI systems should not resist correction or shutdown.
Limitations
| Challenge | Description | SML Approach |
|---|---|---|
| Instrumental goals | May develop self-preservation | Bounded meta-learning |
| Deception | May fake corrigibility | Multi-sentinel verification |
| Value drift | May drift from corrigible state | Continuous monitoring |
Citations
@inproceedings{soares2015corrigibility,
title={Corrigibility},
author={Soares, Nate and others},
booktitle={Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence},
year={2015}
}
7. Value Learning
Key Papers
Russell (2019), "Human Compatible"
- Value learning framework
- Uncertain objectives
- Cooperative inverse RL
Hadfield-Menell et al. (2016), "Cooperative Inverse Reinforcement Learning"
- Human-robot value alignment
- Learning from human behavior
- Accounting for human irrationality
Approach
Learn human values rather than specify them.
Limitations
| Challenge | Description | Mobius Solution |
|---|---|---|
| Behavior ≠ values | Actions don't reveal true preferences | Daily reflection questions |
| Aggregation | Whose values to learn? | Democratic participation |
| Evolution | Values change over time | Continuous SML feedback |
Citations
@book{russell2019human,
title={Human Compatible: Artificial Intelligence and the Problem of Control},
author={Russell, Stuart},
year={2019},
publisher={Viking}
}
8. Scalable Oversight
Key Papers
Bowman et al. (2022), "Measuring Progress on Scalable Oversight for Large Language Models"
- Benchmark for oversight
- Sandwiching approach
- Scaling challenges
Saunders et al. (2022), "Self-critiquing models for assisting human evaluators"
- AI assists human evaluation
- Catches errors humans miss
- Scalability improvements
Challenge
Human oversight doesn't scale with AI capability.
Solutions
| Approach | Description | Mobius Implementation |
|---|---|---|
| AI assistance | AI helps humans evaluate | ATLAS/AUREA sentinels |
| Decomposition | Break into verifiable parts | Four-phase MCP |
| Automation | Automate oversight checks | GI Score computation |
9. Governance and Regulation
Key Frameworks
EU AI Act (2024)
- Risk-based classification
- Requirements for high-risk AI
- Conformity assessment
NIST AI RMF (2023)
- Risk management framework
- Four core functions: GOVERN, MAP, MEASURE, MANAGE
- Voluntary guidance
IEEE 7000 (2021)
- Ethical system design
- Value-sensitive design
- Stakeholder engagement
Gaps Addressed by MCP
| Gap | Current State | MCP Solution |
|---|---|---|
| Enforcement | Voluntary/post-hoc | Preventive gates |
| Verification | Self-certification | Multi-sentinel consensus |
| Transparency | Limited requirements | Public attestation |
| Continuity | Point-in-time audits | Continuous monitoring |
10. Comparative Analysis
Framework Comparison
| Approach | Prevents Drift | Evolves Values | Human Oversight | Formal Guarantees | Production Ready |
|---|---|---|---|---|---|
| RLHF | ❌ | ❌ | ⚠️ (training only) | ❌ | ✅ |
| Constitutional AI | ⚠️ | ❌ | ❌ | ⚠️ | ✅ |
| Debate | ⚠️ | ❌ | ⚠️ | ❌ | ⚠️ |
| Interpretability | ❌ | ❌ | ✅ | ❌ | ⚠️ |
| Formal Methods | ⚠️ | ❌ | ❌ | ✅ | ❌ |
| Corrigibility | ⚠️ | ❌ | ⚠️ | ⚠️ | ❌ |
| Value Learning | ❌ | ⚠️ | ⚠️ | ❌ | ⚠️ |
| SML + MCP | ✅ | ✅ | ✅ | ✅ | ✅ |
Novelty of Mobius Contribution
- Continuous alignment — Not one-time training
- Emotional context — Mood dimension unique
- Multi-sentinel consensus — Independent verification
- Cryptographic attestation — Immutable record
- Bounded meta-learning — Formal guarantees
- Production validation — 46 cycles, 99.7% compliance
Conclusion
Existing AI safety approaches address important aspects of the alignment problem but leave significant gaps:
- RLHF: Effective but static
- Constitutional AI: Scalable but rigid
- Interpretability: Necessary but insufficient
- Formal methods: Rigorous but limited
SML and MCP contribute by providing:
- Continuous, evolving alignment
- Multi-dimensional human feedback
- Systematic enforcement mechanisms
- Production-validated implementation
The combination of SML (human alignment) and MCP (operational enforcement) represents a comprehensive approach to AI safety that addresses limitations of prior work.
Citation
@techreport{mobius2025literature,
title={Literature Review: AI Safety Frameworks},
author={Judan, Michael},
year={2025},
institution={Mobius Systems}
}
"We build on the shoulders of giants, but we look toward new horizons."