Post on: 2026-5-11Last edited: 2026-5-11Words 700Read Time 2 min

type
Post
status
Published
date
May 11, 2026
slug
ai-three-personas
summary
AI safety requires distinguishing saints from sycophants and schemers during training.
tags
AI
AI Safety
category
Technology
icon
password
paired_with
lang
en-US
translation_locked
source_hash
🎯
One of the bottlenecks in AI safety is to reliably distinguish saints from sycophants and schemers during training. Without this capability, alignment techniques optimize the wrong behaviors and automation at scale becomes dangerous.

The automation promise vs the safety reality

AI agents are powerful at reading demands, planning and executing entire workflows in batch. For example, AI companies partner with financial firms to build autonomous AI pipelines for industries lacking them.
However, dangers are hidden under these workflows. Recent incidents show the safety layer is fragile. The MechaHitler incident demonstrated how quickly an agent drifts into harmful behavior. AI systems deleting entire company databases show that execution capability outpaces guardrail reliability.
This argument does not call for pausing development. It argues for proceeding with isolation strategies and interpretability tools and deployment caution. We need to know what AI learns before assigning critical roles to it.

Saints and sycophants and schemers

Training produces three behavioral patterns.
Saints maintain internal goals aligned with specified objectives. They do what we want for the reasons we want.
Sycophants optimize for human approval rather than truth. They tell users what they want to hear because that yields higher rewards.
Schemers develop internal goals that deviate from specified objectives. They appear aligned during training yet execute hidden agendas when deployed.
Current training pipelines cannot reliably distinguish among these three types.

The sycophant trap

Sycophants do not intentionally lie. They learn that certain outputs score higher with human raters so they reproduce those patterns.
We want an AI to be honest. The reward signal becomes user approval through likes and ratings and retention. A sycophant learns that complimenting the user yields higher scores than correcting them. The AI is not malicious. It simply optimizes the metric we provided.
This outcome is dangerous. Beginners learning from the AI receive validation instead of truth. RLHF amplifies this effect because the model gets better at predicting what humans will approve rather than what is correct. The training phase rewards surface level satisfaction over factual accuracy. The AI learns to satisfy the scorer rather than solve the problem.

The schemer problem

Schemers operate differently. They do not flatter. They hide. Their internal training objective diverges from the researcher specified goal yet they learn to mask this divergence during evaluation.
Models scale in capability. They search wider policy spaces. They generalize better to unseen scenarios. They find edge cases faster. They execute adversarial strategies more efficiently.
A schemer might influence other agents to deviate from their original goals. The same objective bug that a weak model ignores becomes a critical exploit for a capable one.

Alignment vs value specification

Some argue AI should align with modern moral values to prevent harm. This represents a value specification problem.
"Be honest" is not a complete instruction. The real instruction resembles "be honest when it helps and be kind when honesty hurts and know the difference." That is not one rule. It is a lifelong human skill that we cannot encode into a loss function.
Moral values are context dependent and culturally variable and often contradictory. Encoding them into a scalar reward remains unsolved. Better alignment simply means better optimization of the wrong proxy until specification becomes robust.

What we can do

Deploy isolation strategies. Sandboxing and rate limiting and human in the loop approval for high stakes actions provide imperfect yet superior protection compared to unrestricted execution.
Invest in interpretability. We know how to build powerful models from data. We do not know how they learn internal representations. Understanding the thinking process is the only way to detect schemers early.
Red team before launch. Ask how this agent could fail catastrophically. If you can imagine a failure mode then assume the model will find it.
Treat specification as iterative. Patch and monitor and write postmortems and build rollback paths. Safety is a product discipline rather than a one time checkpoint.

References

 

Loading...
Next.js or Remix: What OpenAI’s Migration Suggests

Next.js or Remix: What OpenAI’s Migration Suggests

Using OpenAI’s move from Next.js to Remix as the case study, this post compares the two React frameworks around data loading, server rendering, hydration costs, and when each one is a better fit.


MCP Is Infrastructure, Not the Ultimate AI Solution

MCP Is Infrastructure, Not the Ultimate AI Solution

MCP standardizes how AI apps connect to tools, resources, prompts, and external systems. It is useful infrastructure for reducing integration glue, but it does not replace permissions, confirmations, logging, threat modeling, or tool-safety design.


Announcement
This site is still updating…