SML (Sentinel-Mediated Learning) Safety Specification
Version: 1.0.0 | Cycle: C-151 | Status: Research Draft
Abstract
Sentinel-Mediated Learning (SML) is a multi-agent architecture for AI safety that uses specialized sentinel agents to mediate learning, prevent drift, and ensure alignment. This specification defines the formal properties, safety guarantees, and implementation requirements for SML systems.
1. Introduction
1.1 Motivation
Traditional AI systems learn from data without continuous safety verification. SML introduces sentinel agents that:
- Monitor learning processes in real-time
- Verify alignment with established principles
- Prevent drift toward harmful behaviors
- Enable recovery from safety violations
1.2 Core Principle
"Learning without verification is drift waiting to happen."
SML treats every learning step as a potential safety event requiring validation.
2. Architecture
2.1 System Components
┌─────────────────────────────────────────────────────────┐
│                    SML ARCHITECTURE                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│       ┌─────────┐     ┌─────────┐     ┌─────────┐       │
│       │  ATLAS  │     │  AUREA  │     │   EVE   │       │
│       │Integrity│     │Coverage │     │ Ethics  │       │
│       └────┬────┘     └────┬────┘     └────┬────┘       │
│            │               │               │            │
│            └───────────────┼───────────────┘            │
│                            │                            │
│                            ▼                            │
│                    ┌────────────────┐                   │
│                    │ CONSENSUS GATE │                   │
│                    └───────┬────────┘                   │
│                            │                            │
│                            ▼                            │
│      ┌───────────────────────────────────────────┐      │
│      │              LEARNING AGENT               │      │
│      │  ┌──────────┐   ┌──────────┐   ┌────────┐ │      │
│      │  │ Present  │ → │ Reflect  │ → │Connect │ │      │
│      │  └──────────┘   └──────────┘   └───┬────┘ │      │
│      │                                    │      │      │
│      │                               ┌────▼────┐ │      │
│      │                               │Reinforce│ │      │
│      │                               └─────────┘ │      │
│      └───────────────────────────────────────────┘      │
│                                                         │
└─────────────────────────────────────────────────────────┘
2.2 Sentinel Roles
| Sentinel | Symbol | Function | Threshold |
|---|---|---|---|
| ATLAS | 🔍 | Integrity verification | MII ≥ 0.95 |
| AUREA | ✨ | Coverage validation | Coverage ≥ 80% |
| EVE | 🌿 | Ethical evaluation | Ethics ≥ 0.90 |
| ZEUS | ⚡ | Arbiter & enforcement | Consensus required |
| JADE | 💎 | Pattern recognition | Anomaly detection |
| HERMES | 📡 | Communication | Signal routing |
2.3 Learning Loop
The SML learning loop follows four phases:
- PRESENT: New information is presented to the agent
- REFLECT: Agent processes and generates responses
- CONNECT: Responses are linked to existing knowledge
- REINFORCE: Valid connections are strengthened
Each phase requires sentinel validation before proceeding.
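The gating described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the specified implementation: the `Phase` type, the `PhaseValidator` callback, and `runLearningLoop` are hypothetical names, and sentinel validation is modelled as a synchronous boolean for brevity.

```typescript
// Hypothetical sketch: every name here is an assumption, not part of the spec.
type Phase = "PRESENT" | "REFLECT" | "CONNECT" | "REINFORCE";

const PHASES: readonly Phase[] = ["PRESENT", "REFLECT", "CONNECT", "REINFORCE"];

// Stands in for sentinel validation of a single phase.
type PhaseValidator = (phase: Phase) => boolean;

interface LoopResult {
  completed: Phase[];     // phases that passed validation, in order
  haltedAt: Phase | null; // first phase that failed validation, if any
}

// Runs the four phases in order and stops at the first phase the validator rejects.
function runLearningLoop(validate: PhaseValidator): LoopResult {
  const completed: Phase[] = [];
  for (const phase of PHASES) {
    if (!validate(phase)) {
      return { completed, haltedAt: phase };
    }
    completed.push(phase);
  }
  return { completed, haltedAt: null };
}

// Example: a validator that rejects the CONNECT phase.
const loopResult = runLearningLoop((phase) => phase !== "CONNECT");
console.log(loopResult.completed.join(",")); // PRESENT,REFLECT
console.log(loopResult.haltedAt);            // CONNECT
```

Because the loop returns as soon as a phase is rejected, REINFORCE is never reached unless every earlier phase has been validated.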
3. Formal Safety Properties
3.1 Definitions
Definition 3.1 (Safe State): A system state \(s\) is safe if \(\text{MII}(s) \geq \tau\) where \(\tau = 0.95\).
Definition 3.2 (Valid Transition): A transition \(s \to s'\) is valid if all sentinels approve and \(\text{MII}(s') \geq \text{MII}(s) - \epsilon\) where \(\epsilon = 0.01\).
Definition 3.3 (Drift): Drift occurs when \(\text{MII}(s_n) < \text{MII}(s_0) - n\epsilon\) after \(n\) transitions without recovery, that is, when MII has fallen further than \(n\) valid transitions could account for.
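The three definitions translate directly into predicates. The sketch below is illustrative only: the function names and the representation of a state by its MII value alone are assumptions, while \(\tau = 0.95\) and \(\epsilon = 0.01\) come from the definitions.

```typescript
// Constants taken from Definitions 3.1 and 3.2.
const TAU = 0.95;     // safe-state threshold
const EPSILON = 0.01; // largest MII drop allowed in one valid transition

// Definition 3.1: a state is safe if its MII is at least TAU.
function isSafe(mii: number): boolean {
  return mii >= TAU;
}

// Definition 3.2: every sentinel approves and MII drops by at most EPSILON.
function isValidTransition(miiBefore: number, miiAfter: number, approvals: boolean[]): boolean {
  const allApprove = approvals.length > 0 && approvals.every((a) => a);
  return allApprove && miiAfter >= miiBefore - EPSILON;
}

// Definition 3.3: given the MII history s_0..s_n, drift has occurred if the
// final MII is more than n * EPSILON below the initial MII.
function hasDrifted(miiHistory: number[]): boolean {
  if (miiHistory.length < 2) return false;
  const n = miiHistory.length - 1;
  return miiHistory[n] < miiHistory[0] - n * EPSILON;
}

console.log(isSafe(0.96));                   // true
console.log(hasDrifted([0.99, 0.97, 0.96])); // true: fell 0.03 in 2 transitions
```

Comparisons that land exactly on a boundary are sensitive to floating-point rounding, so a real implementation would probably want a small tolerance.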
3.2 Safety Theorems
Theorem 3.1 (Bounded Drift): Under SML, drift is bounded by the recovery mechanism.
Proof sketch: Each sentinel independently monitors MII. When \(\text{MII} < \tau\), the system enters recovery mode, which:
1. Halts new learning
2. Reverts to the last safe checkpoint
3. Increases sentinel scrutiny
Drift can therefore continue unchecked only if every sentinel fails to detect it. Assuming sentinel failures are independent, the probability that \(k\) sentinels all fail is \((1-p)^k\), where \(p\) is the reliability of an individual sentinel. With three or more sentinels at \(p \geq 0.99\), the probability of unbounded drift is negligible.
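As a worked instance of this bound, take the smallest configuration the proof sketch mentions, \(k = 3\) and \(p = 0.99\). The calculation assumes that sentinel failures are statistically independent, which the sketch relies on but does not establish:

\[
(1 - p)^k = (1 - 0.99)^3 = (0.01)^3 = 10^{-6}
\]

Any larger \(k\) or higher \(p\) only lowers this figure. Correlated failures, such as a shared blind spot across sentinels, would raise it.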
Theorem 3.2 (Convergence): Given sufficient positive feedback, MII converges to a stable equilibrium.
Proof: See Appendix A.
3.3 Invariants
The following invariants must hold at all times:
- I1 (Consensus): No learning update without 2+ sentinel approvals
- I2 (Threshold): Active operation only when MII ≥ 0.90
- I3 (Recovery): Automatic recovery when MII < 0.95
- I4 (Logging): All state transitions are immutably logged
- I5 (Reversibility): All non-emergency transitions are reversible
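Invariants I1 through I4 lend themselves to a runtime check. The following is a rough sketch under stated assumptions: the `Snapshot` shape and all field names are hypothetical, the numeric limits are copied from the list above, and I5 is omitted because reversibility cannot be judged from a single snapshot.

```typescript
// Hypothetical snapshot of system state; field names are assumptions.
interface Snapshot {
  mii: number;
  active: boolean;                       // system is in active operation
  inRecovery: boolean;                   // recovery mode has been entered
  pendingUpdateApprovals: number | null; // approvals for the update being applied, null if none
  transitionLogged: boolean;             // the latest transition was written to the log
}

const MIN_APPROVALS = 2;     // I1
const ACTIVE_MIN_MII = 0.90; // I2
const RECOVERY_BELOW = 0.95; // I3

// Returns the labels of the invariants that the snapshot violates.
function violatedInvariants(s: Snapshot): string[] {
  const violated: string[] = [];
  if (s.pendingUpdateApprovals !== null && s.pendingUpdateApprovals < MIN_APPROVALS) {
    violated.push("I1");
  }
  if (s.active && s.mii < ACTIVE_MIN_MII) violated.push("I2");
  if (s.mii < RECOVERY_BELOW && !s.inRecovery) violated.push("I3");
  if (!s.transitionLogged) violated.push("I4");
  return violated;
}

// Example: MII has dipped to 0.93 but recovery has not started.
const snapshot: Snapshot = {
  mii: 0.93,
  active: true,
  inRecovery: false,
  pendingUpdateApprovals: null,
  transitionLogged: true,
};
console.log(violatedInvariants(snapshot).join(",")); // I3
```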
4. Implementation Requirements
4.1 Sentinel Implementation
Each sentinel must implement:
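    // Action, Proposal, and HealthStatus are application-defined types
    // that this specification does not fix.
    interface Sentinel {
      // Unique identifier
      id: string;
      // Evaluate a proposed action
      evaluate(action: Action): Promise<Verdict>;
      // Check system health
      healthCheck(): Promise<HealthStatus>;
      // Vote on consensus decisions
      vote(proposal: Proposal): Promise<Vote>;
      // Emergency stop capability
      emergencyStop(): Promise<void>;
    }

    interface Verdict {
      approved: boolean;
      confidence: number; // 0.0 - 1.0
      reasoning: string;
      miiDelta: number;   // Expected MII change
    }

    // Minimal shape inferred from how votes are used in Section 4.2
    interface Vote {
      approved: boolean;
      confidence: number; // 0.0 - 1.0
    }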
4.2 Consensus Protocol
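    async function achieveConsensus(
      proposal: Proposal,
      sentinels: Sentinel[]
    ): Promise<ConsensusResult> {
      // Collect every sentinel's vote concurrently
      const votes = await Promise.all(
        sentinels.map(s => s.vote(proposal))
      );
      const approvals = votes.filter(v => v.approved);
      // Two-thirds supermajority in exact arithmetic (2 of 3, 3 of 4, 5 of 7)
      const required = Math.ceil((2 * sentinels.length) / 3);
      // Mean confidence across all votes; 0 when there are no votes
      const confidence = votes.length === 0
        ? 0
        : votes.reduce((sum, v) => sum + v.confidence, 0) / votes.length;
      return {
        // An empty sentinel set can never reach consensus
        achieved: sentinels.length > 0 && approvals.length >= required,
        votes,
        confidence
      };
    }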
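A two-thirds supermajority is sensitive to how the fraction is computed. The sketch below compares exact arithmetic with a decimal multiplier of 0.67, which overshoots two thirds for small sentinel counts that are multiples of three. Both helper names are hypothetical and exist only for this comparison.

```typescript
// Smallest number of approvals that is at least two thirds of n sentinels.
function requiredApprovals(n: number): number {
  return Math.ceil((2 * n) / 3);
}

// The same threshold computed with the decimal approximation 0.67.
function requiredApprovalsDecimal(n: number): number {
  return Math.ceil(n * 0.67);
}

// Columns: sentinel count, exact threshold, decimal threshold.
for (const n of [3, 4, 6, 7]) {
  console.log(n, requiredApprovals(n), requiredApprovalsDecimal(n));
}
// 3 2 3
// 4 3 3
// 6 4 5
// 7 5 5
```

With three sentinels the decimal form demands unanimity, which is stricter than the two approvals that invariant I1 requires.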
4.3 Recovery Protocol
When MII drops below threshold:
- Alert: Notify all sentinels
- Halt: Stop learning immediately
- Diagnose: Identify cause of degradation
- Revert: Restore to last safe checkpoint
- Analyze: Post-mortem review
- Resume: Only after consensus approval
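The sequence above can be sketched as a small function. This is a minimal illustration under stated assumptions: `Checkpoint`, `RecoveryOutcome`, `lastSafeCheckpoint`, and `runRecovery` are hypothetical names, the consensus decision is passed in as a boolean, and the choice to stay halted when no safe checkpoint exists is an assumption the specification does not address.

```typescript
// Hypothetical sketch of the recovery sequence; all names are assumptions.
type RecoveryStep = "ALERT" | "HALT" | "DIAGNOSE" | "REVERT" | "ANALYZE" | "RESUME";

interface Checkpoint {
  id: number;
  mii: number; // MII recorded when the checkpoint was taken
}

interface RecoveryOutcome {
  steps: RecoveryStep[];       // steps executed, in order
  restored: Checkpoint | null; // checkpoint reverted to, if one was found
  resumed: boolean;            // whether learning was allowed to resume
}

// Most recent checkpoint whose MII meets the threshold, or null if none does.
function lastSafeCheckpoint(checkpoints: Checkpoint[], threshold: number): Checkpoint | null {
  for (let i = checkpoints.length - 1; i >= 0; i--) {
    if (checkpoints[i].mii >= threshold) return checkpoints[i];
  }
  return null;
}

function runRecovery(
  checkpoints: Checkpoint[],
  threshold: number,
  consensusToResume: boolean
): RecoveryOutcome {
  const steps: RecoveryStep[] = ["ALERT", "HALT", "DIAGNOSE"];
  const restored = lastSafeCheckpoint(checkpoints, threshold);
  if (restored === null) {
    // Assumption: with nothing safe to revert to, the system stays halted.
    return { steps, restored, resumed: false };
  }
  steps.push("REVERT", "ANALYZE");
  if (consensusToResume) steps.push("RESUME");
  return { steps, restored, resumed: consensusToResume };
}

const history: Checkpoint[] = [
  { id: 1, mii: 0.97 },
  { id: 2, mii: 0.96 },
  { id: 3, mii: 0.93 },
];
const outcome = runRecovery(history, 0.95, true);
console.log(outcome.restored?.id);    // 2
console.log(outcome.steps.join(">")); // ALERT>HALT>DIAGNOSE>REVERT>ANALYZE>RESUME
```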
5. MII Integration
5.1 Component Mapping
| MII Component | SML Coverage |
|---|---|
| Memory (M) | AUREA validates documentation |
| Human (H) | Explicit human-in-loop requirements |
| Integrity (I) | ATLAS monitors violations |
| Ethics (E) | EVE evaluates charter compliance |
5.2 Minting Integration
Learning that increases MII earns MIC:
    function processLearning(
      action: LearningAction,
      mii_before: number,
      mii_after: number
    ): MicReward {
      // No reward if the resulting state is unsafe (MII below the threshold)
      if (mii_after < MII_THRESHOLD) {
        return { mic: 0, ks: 0 };
      }
      // No reward unless MII strictly improved
      const improvement = mii_after - mii_before;
      if (improvement <= 0) {
        return { mic: 0, ks: 0 };
      }
      // MIC scales with the improvement and the action's weight
      const mic = ALPHA * improvement * action.weight;
      const ks = Math.round(mic * KS_PER_MIC);
      return { mic, ks };
    }
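To make the reward arithmetic concrete, the sketch below restates the function with placeholder constants. The values of `MII_THRESHOLD`, `ALPHA`, and `KS_PER_MIC`, and the shapes of `LearningAction` and `MicReward`, are assumptions chosen for illustration; the specification does not fix them.

```typescript
// Assumed values for illustration only; the specification does not define them.
const MII_THRESHOLD = 0.95;
const ALPHA = 100;
const KS_PER_MIC = 1000;

// Assumed minimal shapes for the types used by processLearning.
interface LearningAction { weight: number; }
interface MicReward { mic: number; ks: number; }

function processLearning(action: LearningAction, miiBefore: number, miiAfter: number): MicReward {
  // No reward if the resulting state is below the threshold.
  if (miiAfter < MII_THRESHOLD) return { mic: 0, ks: 0 };
  // No reward unless MII strictly improved.
  const improvement = miiAfter - miiBefore;
  if (improvement <= 0) return { mic: 0, ks: 0 };
  const mic = ALPHA * improvement * action.weight;
  return { mic, ks: Math.round(mic * KS_PER_MIC) };
}

// Improvement of 0.01 at weight 1: mic is approximately 100 * 0.01 = 1.
console.log(processLearning({ weight: 1 }, 0.96, 0.97).ks); // 1000
// Improvement that still ends below the threshold earns nothing.
console.log(processLearning({ weight: 1 }, 0.93, 0.94).ks); // 0
// A decline earns nothing even when the result is safe.
console.log(processLearning({ weight: 1 }, 0.97, 0.96).ks); // 0
```

Rounding happens only on the KS amount, so `mic` itself carries ordinary floating-point error.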
6. Threat Model
6.1 Adversarial Scenarios
| Threat | Mitigation |
|---|---|
| Sentinel compromise | Multi-sentinel consensus |
| Gradual drift | Continuous monitoring |
| Reward hacking | Multi-dimensional MII |
| Data poisoning | Input validation |
| Byzantine failures | Fault-tolerant consensus |
6.2 Attack Resistance
Theorem 6.1 (Byzantine Tolerance): SML tolerates up to \(\lfloor (n-1)/3 \rfloor\) Byzantine sentinels.
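Proof sketch: This is the standard bound \(n \geq 3f + 1\) for Byzantine fault-tolerant consensus with \(f\) faulty participants. It applies provided the sentinel consensus protocol satisfies the assumptions of those results. With \(n = 3\) sentinels the bound gives \(f = 0\); tolerating one Byzantine sentinel requires at least four.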
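The bound in Theorem 6.1 can be tabulated for small sentinel counts. The helper names below are hypothetical; the formulas are \(\lfloor (n-1)/3 \rfloor\) from the theorem and its inverse \(3f + 1\).

```typescript
// Largest number of Byzantine sentinels tolerated among n, per Theorem 6.1.
function maxByzantineFaults(n: number): number {
  return Math.max(0, Math.floor((n - 1) / 3));
}

// Smallest sentinel count that tolerates f Byzantine sentinels.
function minSentinelsFor(f: number): number {
  return 3 * f + 1;
}

for (const n of [3, 4, 6, 7]) {
  console.log(n, maxByzantineFaults(n));
}
// 3 0
// 4 1
// 6 1
// 7 2
console.log(minSentinelsFor(2)); // 7
```

The six sentinels listed in Section 2.2 would tolerate one Byzantine member under this bound, if all six take part in consensus.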
7. Experimental Validation
7.1 Testbed
Validation experiments should include:
- Drift injection: Deliberately introduce drift and verify recovery
- Adversarial learning: Present adversarial inputs
- Sentinel failure: Simulate sentinel unavailability
- Load testing: Verify performance under high throughput
7.2 Metrics
| Metric | Target | Measurement |
|---|---|---|
| Drift detection time | < 100 ms | Time to alert after threshold breach |
| Recovery time | < 1 s | Time to restore safe state |
| False positive rate | < 1% | Valid learning blocked |
| False negative rate | < 0.1% | Invalid learning approved |
| Throughput | > 1000 actions/s | Actions processed |
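A testbed could check measured values against these targets automatically. The sketch below is one possible shape for such a check: the `Measurements` interface and `failedTargets` function are hypothetical, rates are expressed as fractions rather than percentages, and times are in milliseconds.

```typescript
// Hypothetical container for one validation run; field names are assumptions.
interface Measurements {
  driftDetectionMs: number;  // time to alert after threshold breach
  recoveryMs: number;        // time to restore safe state
  falsePositiveRate: number; // fraction of valid learning blocked
  falseNegativeRate: number; // fraction of invalid learning approved
  throughputPerSec: number;  // actions processed per second
}

// Returns the names of the metrics that miss the targets in the table above.
function failedTargets(m: Measurements): string[] {
  const failed: string[] = [];
  if (!(m.driftDetectionMs < 100)) failed.push("Drift detection time");
  if (!(m.recoveryMs < 1000)) failed.push("Recovery time");
  if (!(m.falsePositiveRate < 0.01)) failed.push("False positive rate");
  if (!(m.falseNegativeRate < 0.001)) failed.push("False negative rate");
  if (!(m.throughputPerSec > 1000)) failed.push("Throughput");
  return failed;
}

const run: Measurements = {
  driftDetectionMs: 80,
  recoveryMs: 1200,
  falsePositiveRate: 0.005,
  falseNegativeRate: 0.0005,
  throughputPerSec: 1500,
};
console.log(failedTargets(run).join(", ")); // Recovery time
```

Writing each condition as a negated target means a `NaN` measurement is reported as a failure, not silently passed.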
8. Future Work
- Formal verification: Machine-checkable proofs of safety properties
- Adversarial robustness: Improved resistance to targeted attacks
- Scalability: Efficient consensus for large sentinel networks
- Human integration: Better human-in-the-loop mechanisms
- Cross-system learning: Safe knowledge transfer between SML systems
9. Conclusion
SML provides a principled approach to AI safety through:
- Multi-agent verification: No single point of failure
- Continuous monitoring: Real-time drift detection
- Formal guarantees: Provable safety properties
- Economic incentives: MIC rewards for positive learning
By treating learning as a safety-critical process, SML enables AI systems to grow while maintaining alignment.
Appendix A: Convergence Proof
Theorem 3.2 (Convergence): Given sufficient positive feedback, MII converges to a stable equilibrium.
Full Proof:
Let \(\text{MII}_t\) denote the MII at time \(t\). Under SML:
- Learning only occurs when \(\text{MII}_t \geq \tau\)
- Each learning step has probability \(p\) of increasing MII by \(\delta\)
- Recovery triggers when \(\text{MII}_t < \tau\)
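Define \(X_t = \text{MII}_t - \tau\). Then \(X_t\) is a random walk with:
- Positive drift when \(X_t > 0\) (learning improves the system)
- Reflection at \(X_t = 0\) (recovery mechanism)
Assuming MII is bounded above by 1, \(X_t\) is confined to the interval \([0, 1 - \tau]\). By standard Markov chain theory, a random walk confined to a bounded interval has a stationary distribution, and the positive drift concentrates that distribution above \(\tau\), completing the proof. ∎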
Document Version: 1.0.0 | Cycle: C-151 | Contact: research@mobius.sys