Literature Review: AI Safety Frameworks

Purpose: Comprehensive review of existing AI safety approaches
Context: Background for SML and MCP research
Status: Complete

Overview

This literature review examines existing approaches to AI safety and alignment, providing context for the Strange Metamorphosis Loop (SML) and Mobius Cycle Protocol (MCP) contributions.

1. Reinforcement Learning from Human Feedback (RLHF)

Key Papers

Christiano et al. (2017), "Deep reinforcement learning from human preferences"
- Introduced preference-based learning for RL
- Demonstrated on Atari games and MuJoCo
- Collected human feedback on pairs of trajectory segments

Ouyang et al. (2022), "Training language models to follow instructions with human feedback" (InstructGPT)
- Scaled RLHF to GPT-3
- Three-stage training: supervised fine-tuning (SFT), reward model (RM), PPO
- Human labelers preferred InstructGPT outputs over GPT-3 outputs

Bai et al. (2022), "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (Anthropic)
- Detailed RLHF implementation
- Tradeoffs between helpfulness and harmlessness
- Red teaming methodology
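
The preference-based learning introduced by Christiano et al. (2017), and reused for the reward-model stage in Ouyang et al. (2022), can be illustrated with a minimal sketch. It assumes the standard Bradley-Terry formulation, in which the probability that a human prefers one item over another is a logistic function of the difference in predicted rewards. The function names and the example scores are illustrative assumptions, not taken from either paper's code.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that item A is preferred over item B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice under the reward model."""
    return -math.log(preference_probability(reward_preferred, reward_rejected))


if __name__ == "__main__":
    # Hypothetical reward-model scores for two responses to the same prompt.
    print(preference_probability(2.0, 0.5))
    print(preference_loss(2.0, 0.5))
```

Minimizing this loss over many labeled pairs pushes the reward model to score preferred items higher than rejected ones. The resulting reward signal is then what the RL stage optimizes.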

Strengths

Demonstrated effectiveness at scale
Reduces harmful outputs
Improves instruction following

Limitations

Limitation	Description	SML Solution
Static preferences	Frozen at training time	Continuous daily feedback
No drift detection	Cannot detect misalignment	ECHO layer monitoring
No emotional context	Ignores mood/affect	Mood dimension in Triad
Gameability	Reward hacking possible	Multi-sentinel consensus

Citations

@inproceedings{christiano2017deep,
  title={Deep reinforcement learning from human preferences},
  author={Christiano, Paul F and others},
  booktitle={Advances in Neural Information Processing Systems},
  year={2017}
}

@inproceedings{ouyang2022training,
  title={Training language models to follow instructions with human feedback},
  author={Ouyang, Long and others},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

2. Constitutional AI

Key Papers

Bai et al. (2022), "Constitutional AI: Harmlessness from AI Feedback"
- AI self-critique and revision guided by a written constitution
- Reduces the need for human feedback on harmlessness
- Chain-of-thought critique

Methodology

Define constitutional principles (human-written)
Generate responses
AI critiques against constitution
AI revises based on critique
Train on revised outputs
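
The steps above can be sketched as a simple loop. This is a minimal illustration under stated assumptions: `generate`, `critique`, and `revise` are hypothetical callables standing in for model calls, and the toy versions below only wrap strings so the control flow can be run. The sketch applies every principle in order, which is a simplification; a real pipeline may sample principles instead.

```python
from typing import Callable, List, Tuple


def critique_and_revise(
    prompt: str,
    principles: List[str],
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return a (prompt, revised response) pair suitable for later training."""
    response = generate(prompt)                       # generate a response
    for principle in principles:
        feedback = critique(response, principle)      # critique against the constitution
        response = revise(response, feedback)         # revise based on the critique
    return prompt, response


# Toy stand-ins for model calls (hypothetical, for illustration only).
def toy_generate(prompt: str) -> str:
    return f"draft({prompt})"


def toy_critique(response: str, principle: str) -> str:
    return f"check '{response}' against '{principle}'"


def toy_revise(response: str, feedback: str) -> str:
    # A real model would use the feedback; the toy version only marks the revision.
    return f"revised({response})"


if __name__ == "__main__":
    constitution = ["be harmless", "be honest"]       # human-written principles
    prompts = ["q1", "q2"]
    training_pairs = [
        critique_and_revise(p, constitution, toy_generate, toy_critique, toy_revise)
        for p in prompts
    ]
    print(training_pairs)
```

The collected pairs correspond to the final step: the revised outputs become the data the model is trained on.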

Strengths

Reduces human labeling burden
Explicit, interpretable rules
Scalable self-improvement

Limitations

Limitation	Description	MCP Solution
Rule conflicts	Principles may contradict	Hierarchy with human arbitration
Loopholes	Clever exploitation	Multi-sentinel verification
No evolution	Static constitution	Continuous governance updates
Centralized values	Single source of truth	Democratic input via SML

Citations

@article{bai2022constitutional,
  title={Constitutional AI: Harmlessness from AI feedback},
  author={Bai, Yuntao and others},
  journal={arXiv preprint arXiv:2212.08073},
  year={2022}
}

3. Debate and Amplification

Key Papers

Irving et al. (2018), "AI Safety via Debate"
- Two AI systems debate
- A human judges the winner
- Hypothesis: under optimal play, truthful arguments win

Christiano et al. (2018), "Supervising strong learners by amplifying weak experts" (iterated amplification)
- Recursive task decomposition
- Human+AI teams solve hard problems
- Aims to maintain alignment through decomposition

Strengths

Leverages AI capabilities for oversight
Scalable supervision
Can surface deception

Limitations

Limitation	Description	Mobius Approach
Persuasion vs truth	May optimize for rhetoric	Cryptographic attestation
Expensive	Requires extensive debate	Automated MCP pipeline
Unclear convergence	May not reach truth	Multi-sentinel consensus

Citations

@article{irving2018ai,
  title={AI safety via debate},
  author={Irving, Geoffrey and others},
  journal={arXiv preprint arXiv:1805.00899},
  year={2018}
}

4. Interpretability

Key Papers

Olah et al. (2020), "Zoom In: An Introduction to Circuits"
- Neural network interpretability
- Feature visualization
- Circuit analysis

Elhage et al. (2022), "Toy Models of Superposition"
- Understanding polysemanticity
- Superposition in neural networks
- Implications for interpretability

Burns et al. (2022), "Discovering Latent Knowledge in Language Models Without Supervision"
- Eliciting latent knowledge
- Contrast-consistent search (CCS)
- Truth detection
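
The contrast-consistent search idea from Burns et al. (2022) rests on a simple unsupervised objective, sketched here as we understand it from the paper. A probe assigns a probability of "true" to a statement and to its negation. The loss rewards probabilities that are consistent (they sum to one) and confident (not both near 0.5). The function name is an assumption, and the sketch takes the two probe outputs as given rather than computing them from model activations.

```python
def ccs_loss(p_statement: float, p_negation: float) -> float:
    """Consistency plus confidence loss for one contrast pair.

    p_statement: probe probability that the statement is true.
    p_negation:  probe probability that the negated statement is true.
    """
    # Consistency: the two probabilities should sum to one.
    consistency = (p_statement - (1.0 - p_negation)) ** 2
    # Confidence: penalize the degenerate solution where both sit at 0.5.
    confidence = min(p_statement, p_negation) ** 2
    return consistency + confidence


if __name__ == "__main__":
    print(ccs_loss(1.0, 0.0))  # consistent and confident
    print(ccs_loss(0.5, 0.5))  # consistent but uninformative
    print(ccs_loss(0.9, 0.9))  # inconsistent
```

No truth labels appear anywhere in the objective, which is what makes the method unsupervised.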

Strengths

Enables understanding of AI behavior
Supports debugging
Builds trust

Limitations

Limitation	Description	MCP Solution
Scale challenges	May not scale to large models	Behavioral verification
Deception	Could present false interpretations	Multi-sentinel consensus
Incomplete	Interpretation ≠ alignment	Integrity scoring

Citations

@article{olah2020zoom,
  title={Zoom in: An introduction to circuits},
  author={Olah, Chris and others},
  journal={Distill},
  year={2020}
}

5. Formal Verification

Key Papers

Katz et al. (2017), "Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks"
- SMT-based DNN verification
- Proved properties of small networks
- Established verification paradigm

Huang et al. (2017), "Safety Verification of Deep Neural Networks"
- Reachability analysis
- Safety property verification
- Adversarial robustness
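
To make "verifying a property of a network" concrete, here is a minimal sketch of interval bound propagation on a tiny ReLU network. This is a deliberately simpler technique swapped in for illustration; it is neither the Reluplex algorithm nor the procedure of Huang et al. It is sound but incomplete: a `True` result is a proof that the property holds for every input in the box, while `False` only means "could not prove it". All names and the example network are assumptions.

```python
from typing import List, Tuple

Interval = Tuple[float, float]
Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def affine_bounds(inputs: List[Interval], weights: List[List[float]],
                  biases: List[float]) -> List[Interval]:
    """Propagate input intervals through y = W x + b."""
    outputs = []
    for row, bias in zip(weights, biases):
        lo = hi = bias
        for (x_lo, x_hi), w in zip(inputs, row):
            if w >= 0:
                lo += w * x_lo
                hi += w * x_hi
            else:  # a negative weight swaps which endpoint gives the extreme
                lo += w * x_hi
                hi += w * x_lo
        outputs.append((lo, hi))
    return outputs


def relu_bounds(intervals: List[Interval]) -> List[Interval]:
    """Propagate intervals through an elementwise ReLU."""
    return [(max(0.0, lo), max(0.0, hi)) for lo, hi in intervals]


def output_always_below(inputs: List[Interval], layers: List[Layer],
                        threshold: float) -> bool:
    """True if the bounds prove every output stays below the threshold."""
    bounds = inputs
    for index, (weights, biases) in enumerate(layers):
        bounds = affine_bounds(bounds, weights, biases)
        if index < len(layers) - 1:  # ReLU on hidden layers only
            bounds = relu_bounds(bounds)
    return all(hi < threshold for _, hi in bounds)


if __name__ == "__main__":
    # This network computes |x1 - x2|, which never exceeds 1 on the unit box.
    layers = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]
    box = [(0.0, 1.0), (0.0, 1.0)]
    print(output_always_below(box, layers, 2.5))
    print(output_always_below(box, layers, 1.5))
```

The example network's true maximum on the unit box is 1, but the propagated upper bound is 2, so the check against 1.5 fails even though the property holds. Exact methods such as SMT-based verification avoid this looseness, at a computational cost that limits them to small networks.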

Strengths

Mathematical guarantees
Formal proofs
Precise specifications

Limitations

Limitation	Description	Mobius Approach
Scalability	Doesn't scale to large models	Statistical guarantees
Specification	What properties to verify?	MII metrics
Completeness	Can't verify everything	Bounded meta-learning proofs

Citations

@inproceedings{katz2017reluplex,
  title={Reluplex: An efficient SMT solver for verifying deep neural networks},
  author={Katz, Guy and others},
  booktitle={CAV},
  year={2017}
}

6. Corrigibility and Shutdown

Key Papers

Soares et al. (2015), "Corrigibility"
- Defined corrigibility problem
- Utility indifference approach
- Challenges with self-modification

Hadfield-Menell et al. (2017), "The Off-Switch Game"
- Game-theoretic analysis
- Conditions for shutdown compliance
- Value uncertainty benefits
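
The "value uncertainty benefits" point from the off-switch game can be illustrated with a small worked example. The sketch assumes a simplified setting: the robot holds a belief over the utility U of a proposed action, represented as equally weighted samples, and the human is perfectly rational, allowing the action only when U is positive. The function and key names are assumptions for illustration.

```python
from typing import Dict, Sequence


def expected_values(utility_samples: Sequence[float]) -> Dict[str, float]:
    """Expected value of each robot option under its belief about U."""
    n = len(utility_samples)
    act = sum(utility_samples) / n                    # act without asking: E[U]
    shut_down = 0.0                                   # switch itself off
    # Defer to a rational human, who permits the action only if U > 0.
    defer = sum(max(u, 0.0) for u in utility_samples) / n
    return {"act": act, "shut_down": shut_down, "defer": defer}


if __name__ == "__main__":
    uncertain = expected_values([-2.0, 1.0, 3.0])     # robot unsure about the sign of U
    certain = expected_values([1.0, 1.0])             # robot sure the action is good
    print(uncertain)
    print(certain)
```

Because E[max(U, 0)] is never less than either E[U] or 0, deferring is weakly best under these assumptions, and strictly best when the robot is unsure about the sign of U. When the robot is certain, the advantage of deferring disappears, and the advantage also weakens once the human is not assumed to be perfectly rational.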

Key Insight

AI systems should not resist correction or shutdown.

Limitations

Challenge	Description	SML Approach
Instrumental goals	May develop self-preservation	Bounded meta-learning
Deception	May fake corrigibility	Multi-sentinel verification
Value drift	May drift from corrigible state	Continuous monitoring

Citations

@inproceedings{soares2015corrigibility,
  title={Corrigibility},
  author={Soares, Nate and others},
  booktitle={Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence},
  year={2015}
}

7. Value Learning

Key Papers

Russell (2019), "Human Compatible"
- Value learning framework
- Uncertain objectives
- Cooperative inverse RL

Hadfield-Menell et al. (2016), "Cooperative Inverse Reinforcement Learning"
- Human-robot value alignment
- Learning from human behavior
- Accounting for human irrationality

Approach

Learn human values rather than specify them.

Limitations

Challenge	Description	Mobius Solution
Behavior ≠ values	Actions don't reveal true preferences	Daily reflection questions
Aggregation	Whose values to learn?	Democratic participation
Evolution	Values change over time	Continuous SML feedback

Citations

@book{russell2019human,
  title={Human Compatible: Artificial Intelligence and the Problem of Control},
  author={Russell, Stuart},
  year={2019},
  publisher={Viking}
}

8. Scalable Oversight

Key Papers

Bowman et al. (2022), "Measuring Progress on Scalable Oversight for Large Language Models"
- Benchmark for oversight
- Sandwiching approach
- Scaling challenges

Saunders et al. (2022), "Self-critiquing models for assisting human evaluators"
- AI assists human evaluation
- Catches errors humans miss
- Scalability improvements

Challenge

Human oversight doesn't scale with AI capability.

Solutions

Approach	Description	Mobius Implementation
AI assistance	AI helps humans evaluate	ATLAS/AUREA sentinels
Decomposition	Break into verifiable parts	Four-phase MCP
Automation	Automate oversight checks	GI Score computation

9. Governance and Regulation

Key Frameworks

EU AI Act (2024)
- Risk-based classification
- Requirements for high-risk AI
- Conformity assessment

NIST AI RMF (2023)
- Risk management framework
- Four core functions: GOVERN, MAP, MEASURE, MANAGE
- Voluntary guidance

IEEE 7000 (2021)
- Ethical system design
- Value-sensitive design
- Stakeholder engagement

Gaps Addressed by MCP

Gap	Current State	MCP Solution
Enforcement	Voluntary/post-hoc	Preventive gates
Verification	Self-certification	Multi-sentinel consensus
Transparency	Limited requirements	Public attestation
Continuity	Point-in-time audits	Continuous monitoring

10. Comparative Analysis

Framework Comparison

Approach	Prevents Drift	Evolves Values	Human Oversight	Formal Guarantees	Production Ready
RLHF	❌	❌	⚠️ (training only)	❌	✅
Constitutional AI	⚠️	❌	❌	⚠️	✅
Debate	⚠️	❌	⚠️	❌	⚠️
Interpretability	❌	❌	✅	❌	⚠️
Formal Methods	⚠️	❌	❌	✅	❌
Corrigibility	⚠️	❌	⚠️	⚠️	❌
Value Learning	❌	⚠️	⚠️	❌	⚠️
SML + MCP	✅	✅	✅	✅	✅

Novelty of Mobius Contribution

Continuous alignment — Not one-time training
Emotional context — Mood dimension unique
Multi-sentinel consensus — Independent verification
Cryptographic attestation — Immutable record
Bounded meta-learning — Formal guarantees
Production validation — 46 cycles, 99.7% compliance

Conclusion

Existing AI safety approaches address important aspects of the alignment problem but leave significant gaps:

RLHF: Effective but static
Constitutional AI: Scalable but rigid
Interpretability: Necessary but insufficient
Formal methods: Rigorous but limited

SML and MCP contribute by providing:
- Continuous, evolving alignment
- Multi-dimensional human feedback
- Systematic enforcement mechanisms
- Production-validated implementation

The combination of SML (human alignment) and MCP (operational enforcement) represents a comprehensive approach to AI safety that addresses limitations of prior work.

Citation

@techreport{mobius2025literature,
  title={Literature Review: AI Safety Frameworks},
  author={Judan, Michael},
  year={2025},
  institution={Mobius Systems}
}

"We build on the shoulders of giants, but we look toward new horizons."