Skip to content

Ai safety frameworks

Literature Review: AI Safety Frameworks

Purpose: Comprehensive review of existing AI safety approaches
Context: Background for SML and MCP research
Status: Complete


Overview

This literature review examines existing approaches to AI safety and alignment, providing context for the Strange Metamorphosis Loop (SML) and Mobius Cycle Protocol (MCP) contributions.


1. Reinforcement Learning from Human Feedback (RLHF)

Key Papers

Christiano et al. (2017) — "Deep reinforcement learning from human preferences" - Introduced preference-based learning for RL - Demonstrated on Atari games and MuJoCo - Human feedback on trajectory pairs

Ouyang et al. (2022) — "Training language models to follow instructions with human feedback" (InstructGPT) - Scaled RLHF to GPT-3 - Three-stage training: SFT, RM, PPO - Significant improvements in helpfulness

Bai et al. (2022) — "Training a Helpful and Harmless Assistant with RLHF" (Anthropic) - Detailed RLHF implementation - Tradeoffs between helpful and harmless - Red teaming methodology

Strengths

  • Demonstrated effectiveness at scale
  • Reduces harmful outputs
  • Improves instruction following

Limitations

Limitation Description SML Solution
Static preferences Frozen at training time Continuous daily feedback
No drift detection Cannot detect misalignment ECHO layer monitoring
No emotional context Ignores mood/affect Mood dimension in Triad
Gameability Reward hacking possible Multi-sentinel consensus

Citations

@article{christiano2017deep,
  title={Deep reinforcement learning from human preferences},
  author={Christiano, Paul F and others},
  journal={NeurIPS},
  year={2017}
}

@article{ouyang2022training,
  title={Training language models to follow instructions with human feedback},
  author={Ouyang, Long and others},
  journal={NeurIPS},
  year={2022}
}

2. Constitutional AI

Key Papers

Bai et al. (2022) — "Constitutional AI: Harmlessness from AI Feedback" - AI self-improvement with constitution - Reduces need for human feedback - Chain-of-thought critique

Methodology

  1. Define constitutional principles (human-written)
  2. Generate responses
  3. AI critiques against constitution
  4. AI revises based on critique
  5. Train on revised outputs

Strengths

  • Reduces human labeling burden
  • Explicit, interpretable rules
  • Scalable self-improvement

Limitations

Limitation Description MCP Solution
Rule conflicts Principles may contradict Hierarchy with human arbitration
Loopholes Clever exploitation Multi-sentinel verification
No evolution Static constitution Continuous governance updates
Centralized values Single source of truth Democratic input via SML

Citations

@article{bai2022constitutional,
  title={Constitutional AI: Harmlessness from AI feedback},
  author={Bai, Yuntao and others},
  journal={arXiv preprint arXiv:2212.08073},
  year={2022}
}

3. Debate and Amplification

Key Papers

Irving et al. (2018) — "AI Safety via Debate" - Two AI systems debate - Human judges winner - Optimal play reveals truth

Christiano (2018) — "Iterated Amplification" - Recursive capability decomposition - Human+AI teams solve hard problems - Maintains alignment through decomposition

Strengths

  • Leverages AI capabilities for oversight
  • Scalable supervision
  • Can surface deception

Limitations

Limitation Description Mobius Approach
Persuasion vs truth May optimize for rhetoric Cryptographic attestation
Expensive Requires extensive debate Automated MCP pipeline
Unclear convergence May not reach truth Multi-sentinel consensus

Citations

@article{irving2018ai,
  title={AI safety via debate},
  author={Irving, Geoffrey and others},
  journal={arXiv preprint arXiv:1805.00899},
  year={2018}
}

4. Interpretability

Key Papers

Olah et al. (2020) — "Zoom In: An Introduction to Circuits" - Neural network interpretability - Feature visualization - Circuit analysis

Elhage et al. (2022) — "Toy Models of Superposition" - Understanding polysemanticity - Superposition in neural networks - Implications for interpretability

Burns et al. (2022) — "Discovering Latent Knowledge in Language Models Without Supervision" - Eliciting latent knowledge - Contrast-consistent search - Truth detection

Strengths

  • Enables understanding of AI behavior
  • Supports debugging
  • Builds trust

Limitations

Limitation Description MCP Solution
Scale challenges May not scale to large models Behavioral verification
Deception Could present false interpretations Multi-sentinel consensus
Incomplete Interpretation ≠ alignment Integrity scoring

Citations

@article{olah2020zoom,
  title={Zoom in: An introduction to circuits},
  author={Olah, Chris and others},
  journal={Distill},
  year={2020}
}

5. Formal Verification

Key Papers

Katz et al. (2017) — "Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks" - SMT-based DNN verification - Proved properties of small networks - Established verification paradigm

Huang et al. (2017) — "Safety Verification of Deep Neural Networks" - Reachability analysis - Safety property verification - Adversarial robustness

Strengths

  • Mathematical guarantees
  • Formal proofs
  • Precise specifications

Limitations

Limitation Description Mobius Approach
Scalability Doesn't scale to large models Statistical guarantees
Specification What properties to verify? MII metrics
Completeness Can't verify everything Bounded meta-learning proofs

Citations

@inproceedings{katz2017reluplex,
  title={Reluplex: An efficient SMT solver for verifying deep neural networks},
  author={Katz, Guy and others},
  booktitle={CAV},
  year={2017}
}

6. Corrigibility and Shutdown

Key Papers

Soares et al. (2015) — "Corrigibility" - Defined corrigibility problem - Utility indifference approach - Challenges with self-modification

Hadfield-Menell et al. (2017) — "The Off-Switch Game" - Game-theoretic analysis - Conditions for shutdown compliance - Value uncertainty benefits

Key Insight

AI systems should not resist correction or shutdown.

Limitations

Challenge Description SML Approach
Instrumental goals May develop self-preservation Bounded meta-learning
Deception May fake corrigibility Multi-sentinel verification
Value drift May drift from corrigible state Continuous monitoring

Citations

@article{soares2015corrigibility,
  title={Corrigibility},
  author={Soares, Nate and others},
  journal={AAAI Workshop},
  year={2015}
}

7. Value Learning

Key Papers

Russell (2019) — "Human Compatible" - Value learning framework - Uncertain objectives - Cooperative inverse RL

Hadfield-Menell et al. (2016) — "Cooperative Inverse Reinforcement Learning" - Human-robot value alignment - Learning from human behavior - Accounting for human irrationality

Approach

Learn human values rather than specify them.

Limitations

Challenge Description Mobius Solution
Behavior ≠ values Actions don't reveal true preferences Daily reflection questions
Aggregation Whose values to learn? Democratic participation
Evolution Values change over time Continuous SML feedback

Citations

@book{russell2019human,
  title={Human Compatible: Artificial Intelligence and the Problem of Control},
  author={Russell, Stuart},
  year={2019},
  publisher={Viking}
}

8. Scalable Oversight

Key Papers

Bowman et al. (2022) — "Measuring Progress on Scalable Oversight for Large Language Models" - Benchmark for oversight - Sandwiching approach - Scaling challenges

Saunders et al. (2022) — "Self-critiquing models for assisting human evaluators" - AI assists human evaluation - Catches errors humans miss - Scalability improvements

Challenge

Human oversight doesn't scale with AI capability.

Solutions

Approach Description Mobius Implementation
AI assistance AI helps humans evaluate ATLAS/AUREA sentinels
Decomposition Break into verifiable parts Four-phase MCP
Automation Automate oversight checks GI Score computation

9. Governance and Regulation

Key Frameworks

EU AI Act (2024) - Risk-based classification - Requirements for high-risk AI - Conformity assessment

NIST AI RMF (2023) - Risk management framework - GOVERN, MAP, MEASURE, MANAGE - Voluntary guidance

IEEE 7000 (2021) - Ethical system design - Value-sensitive design - Stakeholder engagement

Gaps Addressed by MCP

Gap Current State MCP Solution
Enforcement Voluntary/post-hoc Preventive gates
Verification Self-certification Multi-sentinel consensus
Transparency Limited requirements Public attestation
Continuity Point-in-time audits Continuous monitoring

10. Comparative Analysis

Framework Comparison

Approach Prevents Drift Evolves Values Human Oversight Formal Guarantees Production Ready
RLHF ⚠️ (training only)
Constitutional AI ⚠️ ⚠️
Debate ⚠️ ⚠️ ⚠️
Interpretability ⚠️
Formal Methods ⚠️
Corrigibility ⚠️ ⚠️ ⚠️
Value Learning ⚠️ ⚠️ ⚠️
SML + MCP

Novelty of Mobius Contribution

  1. Continuous alignment — Not one-time training
  2. Emotional context — Mood dimension unique
  3. Multi-sentinel consensus — Independent verification
  4. Cryptographic attestation — Immutable record
  5. Bounded meta-learning — Formal guarantees
  6. Production validation — 46 cycles, 99.7% compliance

Conclusion

Existing AI safety approaches address important aspects of the alignment problem but leave significant gaps:

  • RLHF: Effective but static
  • Constitutional AI: Scalable but rigid
  • Interpretability: Necessary but insufficient
  • Formal methods: Rigorous but limited

SML and MCP contribute by providing: - Continuous, evolving alignment - Multi-dimensional human feedback - Systematic enforcement mechanisms - Production-validated implementation

The combination of SML (human alignment) and MCP (operational enforcement) represents a comprehensive approach to AI safety that addresses limitations of prior work.


Citation

@techreport{mobius2025literature,
  title={Literature Review: AI Safety Frameworks},
  author={Judan, Michael},
  year={2025},
  institution={Mobius Systems}
}

"We build on the shoulders of giants, but we look toward new horizons."