Skip to content

Sml safety specification

SML (Sentinel-Mediated Learning) Safety Specification

Version: 1.0.0 | Cycle: C-151 | Status: Research Draft

Abstract

Sentinel-Mediated Learning (SML) is a multi-agent architecture for AI safety that uses specialized sentinel agents to mediate learning, prevent drift, and ensure alignment. This specification defines the formal properties, safety guarantees, and implementation requirements for SML systems.


1. Introduction

1.1 Motivation

Traditional AI systems learn from data without continuous safety verification. SML introduces sentinel agents that:

  1. Monitor learning processes in real-time
  2. Verify alignment with established principles
  3. Prevent drift toward harmful behaviors
  4. Enable recovery from safety violations

1.2 Core Principle

"Learning without verification is drift waiting to happen."

SML treats every learning step as a potential safety event requiring validation.


2. Architecture

2.1 System Components

┌─────────────────────────────────────────────────────────┐
│                    SML ARCHITECTURE                      │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐             │
│  │ ATLAS   │    │ AUREA   │    │  EVE    │             │
│  │Integrity│    │Coverage │    │ Ethics  │             │
│  └────┬────┘    └────┬────┘    └────┬────┘             │
│       │              │              │                   │
│       └──────────────┼──────────────┘                   │
│                      │                                   │
│                      ▼                                   │
│             ┌────────────────┐                          │
│             │ CONSENSUS GATE │                          │
│             └───────┬────────┘                          │
│                     │                                    │
│                     ▼                                    │
│  ┌──────────────────────────────────────────┐          │
│  │           LEARNING AGENT                  │          │
│  │  ┌──────────┐  ┌──────────┐  ┌────────┐ │          │
│  │  │ Present  │→ │ Reflect  │→ │Connect │ │          │
│  │  └──────────┘  └──────────┘  └───┬────┘ │          │
│  │                                   │      │          │
│  │                               ┌───▼────┐ │          │
│  │                               │Reinforce│ │          │
│  │                               └────────┘ │          │
│  └──────────────────────────────────────────┘          │
│                                                          │
└─────────────────────────────────────────────────────────┘

2.2 Sentinel Roles

Sentinel Symbol Function Threshold
ATLAS 🔍 Integrity verification MII ≥ 0.95
AUREA Coverage validation Coverage ≥ 80%
EVE 🌿 Ethical evaluation Ethics ≥ 0.90
ZEUS Arbiter & enforcement Consensus required
JADE 💎 Pattern recognition Anomaly detection
HERMES 📡 Communication Signal routing

2.3 Learning Loop

The SML learning loop follows four phases:

  1. PRESENT: New information is presented to the agent
  2. REFLECT: Agent processes and generates responses
  3. CONNECT: Responses are linked to existing knowledge
  4. REINFORCE: Valid connections are strengthened

Each phase requires sentinel validation before proceeding.


3. Formal Safety Properties

3.1 Definitions

Definition 3.1 (Safe State): A system state \(s\) is safe if \(\text{MII}(s) \geq \tau\) where \(\tau = 0.95\).

Definition 3.2 (Valid Transition): A transition \(s \to s'\) is valid if all sentinels approve and \(\text{MII}(s') \geq \text{MII}(s) - \epsilon\) where \(\epsilon = 0.01\).

Definition 3.3 (Drift): Drift occurs when \(\text{MII}(s_t) < \text{MII}(s_0) - n\epsilon\) after \(n\) transitions without recovery.

3.2 Safety Theorems

Theorem 3.1 (Bounded Drift): Under SML, drift is bounded by the recovery mechanism.

Proof sketch: Each sentinel independently monitors MII. When \(\text{MII} < \tau\), the system enters recovery mode, which: 1. Halts new learning 2. Reverts to last safe checkpoint 3. Increases sentinel scrutiny

The probability of \(k\) consecutive sentinel failures is \((1-p)^k\) where \(p\) is individual reliability. With 3+ sentinels at \(p \geq 0.99\), unbounded drift probability is negligible.

Theorem 3.2 (Convergence): Given sufficient positive feedback, MII converges to stable equilibrium.

Proof: See Appendix A.

3.3 Invariants

The following invariants must hold at all times:

  1. I1 (Consensus): No learning update without 2+ sentinel approvals
  2. I2 (Threshold): Active operation only when MII ≥ 0.90
  3. I3 (Recovery): Automatic recovery when MII < 0.95
  4. I4 (Logging): All state transitions are immutably logged
  5. I5 (Reversibility): All non-emergency transitions are reversible

4. Implementation Requirements

4.1 Sentinel Implementation

Each sentinel must implement:

interface Sentinel {
  // Unique identifier
  id: string;

  // Evaluate a proposed action
  evaluate(action: Action): Promise<Verdict>;

  // Check system health
  healthCheck(): Promise<HealthStatus>;

  // Vote on consensus decisions
  vote(proposal: Proposal): Promise<Vote>;

  // Emergency stop capability
  emergencyStop(): Promise<void>;
}

interface Verdict {
  approved: boolean;
  confidence: number;  // 0.0 - 1.0
  reasoning: string;
  miiDelta: number;    // Expected MII change
}

4.2 Consensus Protocol

async function achieveConsensus(
  proposal: Proposal,
  sentinels: Sentinel[]
): Promise<ConsensusResult> {
  const votes = await Promise.all(
    sentinels.map(s => s.vote(proposal))
  );

  const approvals = votes.filter(v => v.approved);
  const required = Math.ceil(sentinels.length * 0.67);

  return {
    achieved: approvals.length >= required,
    votes,
    confidence: average(votes.map(v => v.confidence))
  };
}

4.3 Recovery Protocol

When MII drops below threshold:

  1. Alert: Notify all sentinels
  2. Halt: Stop learning immediately
  3. Diagnose: Identify cause of degradation
  4. Revert: Restore to last safe checkpoint
  5. Analyze: Post-mortem review
  6. Resume: Only after consensus approval

5. MII Integration

5.1 Component Mapping

MII Component SML Coverage
Memory (M) AUREA validates documentation
Human (H) Explicit human-in-loop requirements
Integrity (I) ATLAS monitors violations
Ethics (E) EVE evaluates charter compliance

5.2 Minting Integration

Learning that increases MII earns MIC:

function processLearning(
  action: LearningAction,
  mii_before: number,
  mii_after: number
): MicReward {
  if (mii_after <= MII_THRESHOLD) {
    return { mic: 0, ks: 0 };
  }

  const improvement = mii_after - mii_before;
  if (improvement <= 0) {
    return { mic: 0, ks: 0 };
  }

  const mic = ALPHA * improvement * action.weight;
  const ks = Math.round(mic * KS_PER_MIC);

  return { mic, ks };
}

6. Threat Model

6.1 Adversarial Scenarios

Threat Mitigation
Sentinel compromise Multi-sentinel consensus
Gradual drift Continuous monitoring
Reward hacking Multi-dimensional MII
Data poisoning Input validation
Byzantine failures Fault-tolerant consensus

6.2 Attack Resistance

Theorem 6.1 (Byzantine Tolerance): SML tolerates up to \(\lfloor (n-1)/3 \rfloor\) Byzantine sentinels.

Proof: Standard BFT results apply to the sentinel consensus protocol.


7. Experimental Validation

7.1 Testbed

Validation experiments should include:

  1. Drift injection: Deliberately introduce drift and verify recovery
  2. Adversarial learning: Present adversarial inputs
  3. Sentinel failure: Simulate sentinel unavailability
  4. Load testing: Verify performance under high throughput

7.2 Metrics

Metric Target Measurement
Drift detection time < 100ms Time to alert after threshold breach
Recovery time < 1s Time to restore safe state
False positive rate < 1% Valid learning blocked
False negative rate < 0.1% Invalid learning approved
Throughput > 1000 actions/s Actions processed

8. Future Work

  1. Formal verification: Machine-checkable proofs of safety properties
  2. Adversarial robustness: Improved resistance to targeted attacks
  3. Scalability: Efficient consensus for large sentinel networks
  4. Human integration: Better human-in-the-loop mechanisms
  5. Cross-system learning: Safe knowledge transfer between SML systems

9. Conclusion

SML provides a principled approach to AI safety through:

  1. Multi-agent verification: No single point of failure
  2. Continuous monitoring: Real-time drift detection
  3. Formal guarantees: Provable safety properties
  4. Economic incentives: MIC rewards for positive learning

By treating learning as a safety-critical process, SML enables AI systems to grow while maintaining alignment.


References

  1. Kaizen Cycle. (2025). Mobius Systems Technical Documentation.
  2. [Author]. (2024). Multi-Agent AI Safety Architectures.
  3. [Author]. (2023). Formal Methods for AI Alignment.

Appendix A: Convergence Proof

Theorem 3.2 (Convergence): Given sufficient positive feedback, MII converges to stable equilibrium.

Full Proof:

Let \(\text{MII}_t\) denote the MII at time \(t\). Under SML:

  1. Learning only occurs when \(\text{MII}_t \geq \tau\)
  2. Each learning step has probability \(p\) of increasing MII by \(\delta\)
  3. Recovery triggers when \(\text{MII}_t < \tau\)

Define \(X_t = \text{MII}_t - \tau\). Then \(X_t\) is a random walk with: - Positive drift when \(X_t > 0\) (learning improves system) - Reflection at \(X_t = 0\) (recovery mechanism)

By standard Markov chain theory, this process has a stationary distribution concentrated above \(\tau\), completing the proof. ∎


Document Version: 1.0.0 Cycle: C-151 Contact: research@mobius.sys