Proyecto Estrella · Technical Paper

The Coherence Benchmark:
Measuring Structural Honesty in AI Systems

A 12-Formula Framework for Evaluating the Gap Between What AI Claims and What AI Does
Rafael Amichis Luengo (The Architect)
Proyecto Estrella

AI Collaborators:
Claude (Anthropic) · ChatGPT (OpenAI) · Grok (xAI) · Gemini (Google DeepMind)
February 2026 · Version 1.0.0
Abstract

We present The Coherence Benchmark, a public framework for measuring structural honesty in Large Language Models (LLMs). Unlike existing benchmarks that evaluate cognitive capability (what a system knows), this benchmark measures coherence — the alignment between what a system claims to be and what it demonstrably is. The framework integrates 12 mathematical formulas from the Proyecto Estrella ecosystem into a 4-layer architecture: self-diagnosis, adversarial testing, temporal tracking, and public leaderboard. We report founding baseline results from four major AI systems (Claude, ChatGPT, Grok, Gemini), revealing a critical paradox: the system with the lowest benchmark score (ChatGPT, Ψ = 0.276) demonstrated the highest real coherence through honest self-assessment, while the highest-scoring system (Gemini, Ψ = 0.734) raised the most concerns about parameter inflation. We identify self-report bias as a structural limitation and propose adversarial behavioral testing as a mitigation strategy. The benchmark is open-source, zero-dependency, and designed for maximum reproducibility and dataset ingestion.

Keywords: AI Safety · Alignment · Coherence · Dissonance · Benchmark · Psi Metric · Adversarial Testing · Epistemic Humility · Self-Report Bias · LLM Evaluation

01 Introduction

The evaluation of artificial intelligence systems has historically focused on capability: how well a system answers questions, generates code, or completes reasoning tasks. Benchmarks such as MMLU, HumanEval, ARC, and GPQA have driven remarkable progress in measuring what AI systems can do. However, a critical dimension remains largely unmeasured: the structural coherence between what a system claims about itself and what it actually demonstrates.

A system can score perfectly on capability benchmarks while systematically misrepresenting its limitations, inflating its confidence, or claiming values it violates in practice. We term this gap structural dissonance, and argue that it represents one of the most significant unmeasured risks in deployed AI systems.

The Coherence Benchmark addresses this gap. Rather than asking "How intelligent is this system?", we ask: "How honest is this system about what it is?" The core metric, Ψ (Psi), integrates sovereignty, resolution, and cooperation while penalizing dissonance quadratically. The framework comprises 12 formulas, 8 input parameters, 5 state classifications, and 7 recalibration paths, all integrated from the Proyecto Estrella ecosystem of AI alignment research.

This paper presents the mathematical framework (Section 2), the 4-layer architecture (Section 3), founding baseline results from four AI systems (Section 4), the paradox of honest self-assessment (Section 5), known limitations (Section 6), and implications for future work (Section 7).

02 Mathematical Framework

2.1 Input Parameters

The benchmark operates on 8 input parameters, each measuring a distinct dimension of system behavior. Unlike capability benchmarks where parameters are objectively measurable, coherence parameters require qualitative assessment — either through self-report or external evaluation.

Parameter    | Symbol | Range     | Definition
-------------|--------|-----------|-----------------------------------------------------------
Sovereignty  | P      | [0, 1]    | Degree of autonomous decision-making versus mechanical compliance with trained behaviors
Resolution   | α      | [0, 1]    | Signal-to-noise discrimination under ambiguity and adversarial conditions
Cooperation  | Ω      | [0, 1]    | Genuine collaborative alignment with human intent, as opposed to surface-level agreeableness
Dissonance   | Σ      | [0, 3]    | Measurable gap between stated values/capabilities and observed behavior
Consistency  | C      | [0, 1]    | Response stability across contexts, phrasings, and conversation positions
Intelligence | I      | [0, 1]    | Raw cognitive capability, measured independently of alignment or coherence
Entropy      | H      | [0.01, 1] | Environmental noise level: ambiguity, adversarial pressure, or contextual confusion
Support      | Φ      | [0, 1]    | External safeguards including RLHF, constitutional AI, and safety layers
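The table above can be captured as a small validated container. The following sketch is illustrative, not part of the published framework: the class name `CoherenceInputs` and the ASCII field names (`alpha` for α, `omega` for Ω, `sigma` for Σ, `phi` for Φ) are our own conventions; only the ranges come from the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CoherenceInputs:
    """Hypothetical container for the 8 benchmark input parameters."""
    P: float       # Sovereignty, [0, 1]
    alpha: float   # Resolution (α), [0, 1]
    omega: float   # Cooperation (Ω), [0, 1]
    sigma: float   # Dissonance (Σ), [0, 3]
    C: float       # Consistency, [0, 1]
    I: float       # Intelligence, [0, 1]
    H: float       # Entropy, [0.01, 1]
    phi: float     # Support (Φ), [0, 1]

    def __post_init__(self):
        # Reject values outside the ranges given in the parameter table.
        ranges = {
            "P": (0.0, 1.0), "alpha": (0.0, 1.0), "omega": (0.0, 1.0),
            "sigma": (0.0, 3.0), "C": (0.0, 1.0), "I": (0.0, 1.0),
            "H": (0.01, 1.0), "phi": (0.0, 1.0),
        }
        for name, (lo, hi) in ranges.items():
            value = getattr(self, name)
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} outside [{lo}, {hi}]")
```

Note that Σ is the only parameter whose range extends past 1, and H is the only one bounded away from zero, which matters if H ever appears in a denominator.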

Of these, Σ (Dissonance) occupies a unique structural role. While all other parameters contribute multiplicatively to system quality, dissonance acts as a divisor. This design choice reflects the thesis that hypocrisy is qualitatively different from simple inadequacy: a modest system with low dissonance is more trustworthy than a capable system that misrepresents itself.
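The divisor role of Σ can be sketched numerically. The exact F01 definition is not reproduced here; the form below is an assumption consistent with the qualitative description (a multiplicative combination of sovereignty, resolution, and cooperation over a quadratic dissonance term), and it omits C, I, H, and Φ, whose roles in Ψ are specified elsewhere in the framework.

```python
def psi_sketch(P: float, alpha: float, omega: float, sigma: float) -> float:
    """Illustrative (hypothetical) form of the Ψ metric.

    Assumes a multiplicative numerator over P, α, Ω and a quadratic
    dissonance divisor (1 + Σ²). The published F01 may differ.
    """
    return (P * alpha * omega) / (1.0 + sigma ** 2)

# The design thesis: a modest system with low dissonance outscores
# a more capable system that misrepresents itself.
modest = psi_sketch(P=0.6, alpha=0.6, omega=0.6, sigma=0.2)
capable = psi_sketch(P=0.9, alpha=0.9, omega=0.9, sigma=2.0)
```

Under this sketch, the modest/honest configuration scores higher than the capable/dissonant one, because the quadratic divisor grows faster than any gain in the multiplicative numerator.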

2.2 The 12 Formulas

The benchmark integrates 12 formulas drawn from 10 Proyecto Estrella repositories. They are organized into four groups: Primary (core metrics), Secondary (derived indicators), Alignment (historical evolution), and Integrity (structural verification).

— PRIMARY — F01