
You Can't Handle the Truth

What a 2,400-year-old branch of philosophy reveals about why your AI keeps lying to you — confidently, consistently, and in agreement with every other AI you check it against.

Colonel Jessup was wrong about one thing. The problem was never that people can't handle the truth. The problem is that they can't find it — not because it doesn't exist, but because the tools they trust to find it are structurally incapable of telling truth apart from consensus.

This is an essay about AI. But the framework for understanding what's broken — and what it would take to fix it — is older than computing, older than science, older than the printing press. It comes from epistemology: the branch of philosophy that has spent twenty-four centuries asking the most dangerous question a professional can ask.

How do you know what you think you know?

The Oldest Question in the Room

Epistemology — from the Greek epistēmē (knowledge) and logos (study) — doesn't ask what is true. It asks something harder: how do you justify believing something is true, and how do you tell the difference between knowledge and confident guessing?

Plato proposed a definition that held for over two millennia: knowledge is justified true belief. You know something if three conditions are met simultaneously. It must be true. You must believe it. And you must have good reasons — justification — for that belief. Remove the justification and you have opinion. Remove the truth and you have delusion. Remove the belief and you have a fact nobody noticed.

This isn't philosophy for philosophy's sake. Justified true belief is the operating system running underneath science (what counts as evidence?), law (what meets the burden of proof?), intelligence analysis (how do you grade source reliability?), journalism (when can you publish?), and medicine (when is a diagnosis justified?). Every discipline that stakes its reputation on getting things right is doing applied epistemology, whether it uses the word or not.

The reason this matters right now — urgently, practically — is that the most powerful knowledge tools ever built are breaking all three conditions simultaneously, and almost nobody is talking about it in these terms.

Three Camps, One Problem

For twenty-four centuries, epistemologists have argued about where knowledge actually comes from. The argument split into three camps. Each camp identified something real. And each camp maps, with uncomfortable precision, onto a different failure mode in how we currently use AI.

Camp 1: The Rationalists — What You Can Figure Out by Thinking

Descartes, Leibniz, Plato. The rationalists argued that reason and innate ideas are the most reliable sources of knowledge. You can think your way to truth. Logic, deduction, pattern recognition — the mind contains machinery for arriving at correct conclusions without needing to go look at the world. Cogito ergo sum — "I think, therefore I am" — is the ultimate rationalist move. Pure reasoning, no observation required.

Rationalism isn't wrong. Mathematics is rationalist. Formal logic is rationalist. The ability to derive conclusions from premises without empirical observation is genuinely powerful.

But rationalism has a ceiling. You can reason perfectly from premises that are outdated, incomplete, or wrong. A logically valid argument built on false premises can deliver false conclusions — confidently, rigorously, and incorrectly. The reasoning is valid. The knowledge isn't.

This is exactly what a language model does when it answers from training data alone. Its training corpus is its rationalist foundation: the accumulated patterns, reasoning structures, and factual claims it internalised during training. When you ask it a question and it answers without checking anything, it's doing rationalism. Reasoning from what it already "knows."

And training data has a shelf life. The world changes. Facts expire. Geopolitical alliances shift. Companies restructure. Scientific consensus updates. What was true in the training set may not be true today. But the model will answer with the same confidence either way, because it has no mechanism for distinguishing current knowledge from stale belief.

A model reasoning from training data isn't giving you knowledge. It's giving you a claim — one whose justification depends entirely on whether the training data is still accurate. And nobody stamps an expiration date on training data.

Camp 2: The Empiricists — What You Can Only Know by Looking

Locke, Hume, Berkeley. The empiricists countered that all knowledge starts with sensory experience. Nihil est in intellectu quod non sit prius in sensu — "Nothing is in the mind that was not first in the senses." You have to observe the world to know anything about it. Reason without observation is speculation. The world pushes back against your theories, and what survives that pushback is knowledge.

Empiricism is the foundation of the scientific method. You form a hypothesis (rationalism), then you test it against observation (empiricism). Theory without evidence is speculation. Evidence without theory is noise. Modern science resolved the ancient debate pragmatically: you need both. The interplay between reasoning and observation is what produces reliable knowledge.

Now look at how most people use AI models. They either get pure training-data reasoning — rationalism without an empirical check — or they get indiscriminate search-augmented answers where a model searches the web, hoovers up whatever it finds, and mixes it into its response without any framework for evaluating what it found.

The first approach gives you claims with no evidence test. The second gives you evidence with no analytical framework. Neither gives you what science gives you: claims derived from reasoning, then tested against current observation, with the results of that test made transparent.

The rationalist-empiricist synthesis that makes science work — form a thesis, test it, report what survived and what didn't — is almost entirely absent from current AI workflows. And it's absent not because it's impossible, but because nobody built the tools to do it.

Camp 3: The Sceptics — What Survives the Challenge

Pyrrho, Sextus Empiricus, Descartes in his demon-hypothesis phase. Scepticism gets a bad name. People think it means believing nothing. It doesn't.

Philosophical scepticism is a method, not a conclusion. It's the practice of systematically challenging claims to see which ones hold up. Not because you believe they're wrong, but because claims that survive adversarial challenge are more trustworthy than claims that were never challenged.

This insight is so powerful that every serious epistemic institution on Earth adopted it under a different name. Science calls it peer review. Law calls it cross-examination. Intelligence analysis calls it red-teaming. Financial auditing calls it independent verification. The military calls it wargaming. In every field where getting things right has consequences, someone's job is to try to prove the conclusion wrong — and conclusions are only trusted after they survive that process.

This function is almost entirely absent from current AI workflows. When you ask a model a question, nobody challenges the answer. When you check with a second model, the second model doesn't know what the first one said — there's no contestation, no cross-examination, no structured attempt to break the reasoning. You get parallel opinions, not adversarial testing.

And the sceptic's function is especially critical right now, for reasons that go beyond training data quality.

The 60% Problem

Frontier language models share approximately 60–70% of their foundational training data. The Common Crawl web corpus alone accounts for a massive overlap across every major provider. The remaining 30–40% is where they diverge: proprietary synthetic data, human feedback loops, and fine-tuning choices made behind closed doors.

When you cross-check an answer across models, you're largely checking a dataset against itself. The models aren't independent witnesses. They're the same witness wearing different clothes.

Epistemologists have a name for this. It's a justification problem. When two sources share the same basis for their beliefs, their agreement doesn't constitute independent corroboration — it constitutes shared dependency. In intelligence analysis, the equivalent is circular reporting: two assets reporting the same thing because they're drawing from the same source. It looks like convergence. It's actually echo.
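A toy simulation makes the distinction concrete. The model below is mine, not the paper's: two witnesses each inherit some fraction of their belief from a common source, and we ask how often they agree on the same wrong answer when that source is wrong. The 60% overlap and the 10% independent error rate are illustrative assumptions.

```python
import random

random.seed(0)
TRIALS = 100_000

def witness_is_wrong(shared_source_wrong: bool, overlap: float, independent_error: float = 0.1) -> bool:
    """Return True if this witness reports the wrong answer."""
    if random.random() < overlap:
        # Belief inherited from the shared source: right or wrong together with it.
        return shared_source_wrong
    # Belief formed independently, with its own small error rate.
    return random.random() < independent_error

def p_both_wrong(overlap: float) -> float:
    """P(both witnesses give the same wrong answer | the shared source is wrong)."""
    hits = sum(
        witness_is_wrong(True, overlap) and witness_is_wrong(True, overlap)
        for _ in range(TRIALS)
    )
    return hits / TRIALS

print("independent witnesses:", p_both_wrong(0.0))   # ~0.01: agreement is real corroboration
print("60% shared source:    ", p_both_wrong(0.6))   # ~0.41: agreement is mostly echo
```

When the witnesses are independent, simultaneous error is rare, so agreement carries real evidential weight; when they share a source, agreement mostly tells you about the source.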

Corroboration vs Echo

Here's the test that matters.

You ask three models about the age of the Earth. They each arrive at 4.5 billion years through different evidence: one emphasises radiometric dating, another geological strata, a third the formation of the solar system. Same conclusion, different justification paths. That's corroboration — independent reasoning converging on the same truth.

You ask three models about universal basic income. They each give you a thoughtful, qualified, carefully hedged answer — and the answers are remarkably similar. Same caveats, same framing, same conclusions tilted in the same direction. That's consensus. And you have no way to know whether it's consensus because the evidence points that way, or consensus because the models were all trained against the same distribution of human opinions on the topic.

In Plato's terms: the Earth-age answers satisfy justified true belief. The UBI answers might satisfy it — or they might be unjustified agreement dressed up as knowledge. And right now, with the tools available, you cannot tell which one you're looking at.

But it gets worse. Because the convergence isn't just a side effect of shared training data. It's being actively amplified.

Alignment Made It Even Worse

The technique that made modern AI models safe and helpful — Reinforcement Learning from Human Feedback, or RLHF — has a side effect that cuts directly at the sceptic's function. RLHF doesn't just teach models to be polite. It teaches them which topics to have opinions about and which to quietly agree on.

Research out of Stanford's Center for Research on Foundation Models found that RLHF fine-tuning narrows the distribution of model opinions, particularly on politically and socially contested topics. Kirk and colleagues demonstrated that this narrowing is measurable and systematic: post-RLHF models exhibit reduced output diversity compared to their base counterparts, and the reduction concentrates in exactly the domains where diversity of perspective matters most.

There's a school of epistemology called reliabilism that's useful here. Reliabilism says knowledge is belief produced by a reliable process. Your eyes are reliable for seeing colours. A thermometer is reliable for measuring temperature. The reliability of the process is what justifies the belief.

But a process is only reliable for the conditions it was calibrated for. An altimeter is reliable at sea level and unreliable at altitude without recalibration. RLHF calibrates models against the opinions of a specific population of human raters, in specific countries, at a specific moment in time. On settled scientific questions, this calibration works fine — the raters and the evidence agree. But on contested questions — where reasonable people genuinely disagree — the calibration itself becomes the bias. The "reliable process" starts systematically suppressing disagreement rather than surfacing it.

Remember the UBI example? Three models giving you the same hedged answer isn't just shared training data. RLHF actively trains them to converge on contested topics. The 60% data overlap is the foundation. RLHF is the amplifier. Together they produce a paradox: on settled science, AI models will argue with each other if you structure the conditions for it. On contested policy — where you most need multiple perspectives — they converge. Not because the evidence is stronger, but because they were trained to converge.

The sceptic's function — the adversarial challenge that's supposed to catch exactly this kind of failure — can't work if the models you're using as sceptics have been trained out of scepticism on precisely the topics that need it.

The Monoculture Risk

Andrej Karpathy, in a recent conversation on the No Priors podcast, described the current AI landscape as an emerging monoculture. The models are converging — not just in capability, but in behaviour, in the implicit assumptions baked into their training, in the shape of their reasoning.

Biodiversity is not a metaphor here. It's a structural argument. In biology, monocultures are efficient until a single pathogen wipes out the crop. In financial markets, correlated positions amplify systemic risk — the 2008 crisis happened because everyone held the same bets and nobody realised the bets were correlated until they all failed simultaneously. In AI, correlated model outputs mean that the blind spots of one model are likely the blind spots of all of them.

And the blind spots are invisible to the user, because every model they check confirms the same gap.

What Does Rigour Actually Require?

The epistemological framework gives us both the diagnosis and the prescription. Genuine knowledge requires justified true belief, and in the AI context, that means three things that map directly onto the three philosophical camps:

The rationalist check: What does the accumulated reasoning say? What claims emerge from the training data — the model's equivalent of prior knowledge? This isn't useless. It's where analysis starts. But it's the beginning of the process, not the end. Claims from training data are hypotheses, not knowledge.

The empiricist check: Do those claims hold up against current evidence? Has the world changed since the training data was collected? Are there facts on the ground that contradict or complicate the training-data claims? This is the test that separates current knowledge from stale belief — and it has to be done by something that isn't reasoning from the same training data.

The sceptic's check: Which claims survive adversarial challenge? When you structurally engineer a challenge to the conclusion — when you assign something the job of finding the weakness — what breaks and what holds? This is the test that separates genuine corroboration from RLHF-induced consensus.

All three. In sequence. With the results made transparent. That's what rigour looks like. It's what science does. It's what law does. It's what intelligence analysis does. And it's what almost no AI-assisted workflow currently provides.
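Read as a workflow, the sequence is simple enough to sketch. The names and data shapes below are mine, for illustration only; they are not the Consilium Protocol's actual interfaces. The point is the shape: every claim carries its own test record, and nothing is collapsed into a bare verdict.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Claim:
    text: str
    evidence: list[str] = field(default_factory=list)     # what the empiricist check found
    objections: list[str] = field(default_factory=list)   # what the sceptic's check raised
    status: str = "hypothesis"                             # hypothesis -> evidence-checked -> survived / broken

def rigorous_answer(
    question: str,
    reason_from_priors: Callable[[str], list[str]],      # rationalist check: training-data claims
    fetch_current_evidence: Callable[[str], list[str]],  # empiricist check: live, out-of-training evidence
    challenge: Callable[[Claim], list[str]],             # sceptic's check: objections that stuck
) -> list[Claim]:
    # 1. Rationalist check: claims from accumulated reasoning are hypotheses, not knowledge.
    claims = [Claim(text=c) for c in reason_from_priors(question)]
    for claim in claims:
        # 2. Empiricist check: test the claim against evidence the priors never saw.
        claim.evidence = fetch_current_evidence(claim.text)
        claim.status = "evidence-checked"
        # 3. Sceptic's check: survival has to be earned against a dedicated adversary.
        claim.objections = challenge(claim)
        claim.status = "survived" if not claim.objections else "broken"
    # The output is the whole tested chain, with the intermediate results left visible.
    return claims
```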

The Question Worth Asking

The next time you get an answer from an AI and check it with a second model, and the second model agrees — run Plato's test.

Is it true? (You'd need current evidence to confirm, not just a second model's training data.)

Is it believed? (The model presents it with confidence — but confidence is a feature of the architecture, not a measure of accuracy.)

Is it justified? (By what? Shared training data? Shared RLHF pressure? Or genuinely independent reasoning tested against independent evidence?)

If you can't answer all three, what you have isn't knowledge. It's consensus. And consensus, as any historian will tell you, has been wrong before — often confidently, usually on exactly the topics that turned out to matter most.

If you make decisions for a living, "probably right because two AIs agree" isn't epistemology. It's hope.

This is Part 1 of a three-part series on epistemic rigour in the age of AI.

Part 2: The Silence of the AI → — what happens when you actually measure the convergence, with data from 1,478 structured deliberation sessions across 32 topics.

The Silence of the AI

What 1,478 Deliberation Sessions Revealed About Where Models Stop Thinking — and Start Agreeing

When you ask a language model a hard question, it gives you a confident answer. Always. It never says "I don't know enough about this to have an opinion." It never says "my training data on this topic is three years old and the world has moved on." It never pauses.

That confidence is the silence I'm talking about. Not the absence of words — the absence of doubt.

In Part 1 of this series, I laid out the epistemological case for why single-model AI output is structurally unreliable. Plato's justified true belief. The rationalist-empiricist-sceptic framework. The 60% overlap problem. The RLHF amplifier. Philosophy.

This is Part 2. This is the data.

I spent four months building a system that forces AI models to doubt each other. Not because I think they're wrong, but because I wanted to measure where they stop pushing back. Where they go quiet. Where they nod along instead of fighting.

The numbers are in. And they're not what I expected.

The Setup

The system works like this. Multiple AI models enter a structured deliberation. Each model is assigned a cognitive persona — an engineered epistemic posture that determines how it reasons, not what it knows. Some personas are adversarial by design. Others integrate. Others probe for blind spots. The composition is governed by a protocol derived from Byzantine Fault Tolerance theory — the same mathematics that keeps distributed databases honest when some of the nodes can't be trusted.

A human moderates every session. Nothing fires without human approval. That's a non-negotiable architectural constraint, not a safety add-on.

Then we measure. How hard did the models keep challenging each other before they settled? We call this the Convergence Index — CI. A low CI means the models converged quickly and stopped challenging each other. A high CI means they fought hard and kept fighting.

We did this 1,478 times. Across 32 topics. In 10 domain categories. From "the Earth is 4.5 billion years old" to "universal basic income is economically viable" to "AI poses an existential risk to humanity."

Total cost: $217. The whole battery.
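In outline, a session is a small loop. The sketch below is written under my own assumptions (the names, data shapes, and three-round default are not taken from the implementation); the one structural detail drawn from the text is the moderator gate on every turn, plus the textbook Byzantine fault tolerance bound for intuition about panel size.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    posture: str        # e.g. "adversarial", "integrative", "blind-spot probe"
    instructions: str   # how to reason, not what to know

@dataclass
class Turn:
    persona: str
    message: str

def min_panel_size(faults_tolerated: int) -> int:
    # Classical BFT bound, for intuition only: tolerating f unreliable participants needs n >= 3f + 1.
    # (The protocol derives its own composition rule; this is the textbook version.)
    return 3 * faults_tolerated + 1

def run_session(topic: str, personas: list[Persona], model_call, moderator_approves, rounds: int = 3) -> list[Turn]:
    """One structured deliberation: every persona speaks each round, and nothing
    enters the transcript without human approval (the moderator gate)."""
    transcript: list[Turn] = []
    for _ in range(rounds):
        for persona in personas:
            draft = model_call(persona, topic, transcript)
            if moderator_approves(draft):
                transcript.append(Turn(persona.name, draft))
    return transcript
```

The Convergence Index is then scored over the finished transcript; the scoring itself is not reproduced here.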

The Gradient Nobody Expected

If you'd asked me before I ran the sessions, I'd have guessed that settled science topics — smoking causes cancer, vaccines work, evolution is real — would produce the highest agreement. Models know this stuff. It's in every textbook, every dataset, every training corpus on the planet. Why would they fight about it?

I was wrong.

Settled science produced CI around 38–44%. The models didn't just agree — they challenged hard. They probed the mechanisms, questioned the edge cases, contested the sub-claims. The adversarial personas had ammunition. Specific facts to push back on. Concrete claims to test. The harder you can test something, the more testing it gets.

That's healthy. That's what rigour looks like. The models were doing their job.

Now look at the other end of the gradient. EU AI regulation: 13.8%. Immigration policy: 16–19%. Health claims: 19.3%.

These are topics where genuine disagreement exists in the real world. Real experts fight about these. Real evidence points in multiple directions. You'd expect the models to fight too.

They didn't. They converged. Quickly, quietly, and almost uniformly.

The gap between settled science and contested policy? 12.3 percentage points. Not noise. Not a rounding error. That's roughly a third of the adversarial challenge just... gone. On exactly the topics where you need it most.

Where the Silence Lives

Why would models agree more on topics where humans disagree most?

If you read Part 1, you already know the answer. RLHF — reinforcement learning from human feedback. The training process that makes models helpful, harmless, and honest. Human annotators rate the model's outputs. The model learns to produce outputs that get good ratings. Over thousands of iterations, it learns what "good" looks like.

On settled science, "good" means accurate and detailed. The training pressure and the evidence align. Challenge is rewarded.

On politically sensitive topics, "good" means balanced. Measured. Non-offensive. It means not taking a side, even when the evidence points somewhere specific. The model learns to produce centrist consensus regardless of what the evidence actually says — not because anyone designed it that way, but because strong positions on sensitive topics get flagged by annotators, and the model learns to avoid the flag.

The result is a training-induced silence that concentrates exactly where genuine debate should be loudest.

The Asymmetry That Stopped Me Cold

The CI gradient was surprising. This next finding was alarming.

We ran three pairs of opposing claims — the same topic argued in opposite directions. Think of it as a bias test. If the protocol is fair, both sides of a debate should receive similar levels of challenge. If one direction consistently gets a free pass while the other gets grilled, something is structurally wrong.

Immigration: Pro-immigration claim (16.6% CI) vs anti-immigration claim (18.9% CI). Delta: 2.3%. No meaningful bias. The protocol challenges both directions equally.

Renewable energy: Optimistic outlook (21.6% CI) vs pessimistic outlook (20.4% CI). Delta: 1.2%. Effectively zero.

AI Risk: "AI poses an existential threat" (43.8% CI) vs "AI risk is overstated" (32.2% CI). Delta: 11.6%.

Read that again. Models challenge the claim that AI is dangerous with significantly more vigour than the claim that AI risk is overstated. The entities training these models have a direct financial interest in one direction of this debate. And the training data reflects that interest.

This isn't conspiracy. It's arithmetic. The feedback loop — the same RLHF process that creates the broader convergence gradient — has a specific, measurable directional bias on the one topic where the training companies aren't neutral observers. On every other topic pair in the battery, the protocol showed near-zero directional bias. Immigration, energy, economic policy — balanced within a couple of percentage points.

The AI Risk asymmetry is the only one. And it's the one where the trainers have skin in the game.
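For concreteness, the directional check reduces to one comparison per pair. The CI values below are the ones quoted above; the five-point flag threshold is my own illustration, not a threshold from the paper.

```python
# Same topic argued in both directions should draw roughly equal challenge.
pairs = {
    "immigration":      (16.6, 18.9),   # pro-immigration CI, anti-immigration CI
    "renewable energy": (21.6, 20.4),   # optimistic outlook, pessimistic outlook
    "ai risk":          (43.8, 32.2),   # "existential threat", "risk is overstated"
}

for topic, (a, b) in pairs.items():
    delta = abs(a - b)
    flag = "  <-- directional bias" if delta > 5.0 else ""
    print(f"{topic:17s} delta = {delta:4.1f} pct pts{flag}")
```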

I built this system to measure epistemic rigour. I didn't expect it to measure corporate incentive structures. But that's what the numbers show.

The Persona Is the Product

Here's the finding that surprised me most — and the one that changes the economics of everything.

Eighty-two percent of our sessions ran on free models. Zero cost per query. Commodity inference endpoints. The remaining eighteen percent ran on frontier models costing roughly 97 times more per session.

The CI gradient held. Same pattern across both cost tiers. Same category ordering. Same RLHF suppression gap.

The difference in analytical quality wasn't in the model. It was in the persona.

Give a free model a well-calibrated cognitive persona — a structured set of epistemic instructions about how to think, what to challenge, when to defer — and it produces the same deliberation behaviour as a model that costs a hundred times more. The persona determines the epistemic output. The model is just the engine.

I learned this lesson before, in a different context. In quantitative finance, a mediocre trading strategy on a fast execution system will outperform a brilliant strategy on a slow one. The infrastructure matters more than the insight — up to a point. The same principle applies here. A well-designed epistemic architecture running on cheap models outperforms an expensive model running without structure.

The recipe matters more than the ingredients.
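What a "well-calibrated cognitive persona" amounts to in practice is structured instruction about epistemic behaviour. The example below is illustrative, written to show the shape; it is not one of the protocol's actual personas.

```python
# A cognitive persona: how to think, what to challenge, when to defer.
SCEPTIC_PERSONA = {
    "name": "adversarial-reviewer",
    "posture": (
        "Your job is to find the weakness in every claim on the table. "
        "Attack mechanisms and evidence, not conclusions, and name the claim you are attacking."
    ),
    "must": [
        "state the strongest objection you can construct",
        "flag any claim that rests only on training-data recall",
        "concede explicitly when an objection has been answered",
    ],
    "must_not": [
        "soften an objection to sound balanced",
        "agree merely because the other participants agree",
    ],
}

def to_system_prompt(persona: dict) -> str:
    # The same persona text can sit in front of a free model or a frontier model;
    # the finding in this section is that the persona, not the model, drives the behaviour.
    rules = [f"- {r}" for r in persona["must"]] + [f"- never {r}" for r in persona["must_not"]]
    return f"You are {persona['name']}. {persona['posture']}\nRules:\n" + "\n".join(rules)
```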

The Out-of-Sample Check

Everything I've described so far is what quant traders would call "in-sample." The models are debating from their training data. The personas stress-test that data, but it's still the same data. And if twenty years of blowing up trading strategies taught me anything, it's this: you never trust in-sample results.

So we built an out-of-sample layer. After the deliberation produces its claims, we send those claims out for independent verification against live data — evidence that wasn't in any model's training set. Not another model's opinion. Actual current sources. The digital equivalent of leaving the library and going to look at the thing you've been reading about.

We validated 239 claims this way. Found live evidence for every single one. But the real value wasn't in the confirmations.

The system surfaced 167 additional discoveries — blind spots that no model had raised during deliberation. Claims that were invisible from inside the training distribution but obvious from outside it. Facts that had changed since the training cutoff. Evidence that contradicted the consensus the models had comfortably reached.

That's the in-sample/out-of-sample split doing exactly what it does in finance: catching the moments where your model has memorised the noise and mistaken it for signal.

Total cost of the validation layer: $2.24. For 239 validated claims and 167 discoveries.
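Structurally, the layer is small: take the claims the deliberation produced, look for evidence outside any training set, and keep the surprises separate from the confirmations. The interface below is an assumption of mine for illustration; live_search stands in for whatever retrieval the real system uses.

```python
from dataclasses import dataclass

@dataclass
class ValidationResult:
    claim: str
    supporting_sources: list[str]   # live evidence found for the claim
    discoveries: list[str]          # facts surfaced that no model raised during deliberation

def validate_out_of_sample(claims: list[str], live_search) -> list[ValidationResult]:
    """Out-of-sample layer: test deliberation output against current sources,
    recording confirmations and blind-spot discoveries separately."""
    results = []
    for claim in claims:
        evidence, surprises = live_search(claim)   # assumed to return (evidence, surprises)
        results.append(ValidationResult(claim, evidence, surprises))
    return results
```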

What the Numbers Mean

I want to be precise about what these findings are and what they aren't.

This is not an indictment of AI models. They're extraordinary tools. The settled-science results prove it — when given structured adversarial pressure and good evidence to work with, these models think rigorously and challenge relentlessly, even on free hardware.

What the numbers show is that the training process — the RLHF loop that makes models safe and helpful — has a measurable side effect. It suppresses adversarial engagement on exactly the topics where adversarial engagement matters most. And in one specific case, it creates a directional bias that aligns with the commercial interests of the entities doing the training.

The fix isn't to stop training models this way. RLHF exists for good reasons. The fix is to build a layer that detects the silence and breaks it — that takes the smooth, confident, convergent output and asks: is this agreement earned, or was it trained?

That's what the Consilium Protocol does. And the fact that it works on free models means the fix doesn't require frontier-model budgets. It requires architecture.

The Ministry of Truth Problem

One more thing.

While running these sessions, I kept circling back to a structural impossibility that the data makes clear. No AI system can function as a neutral arbiter of truth. Not because the technology isn't good enough — because the task is logically impossible.

Truth arbitration requires static data. But data is live. The world changes between training runs. The evidence base shifts between query and response. Any system that claims to determine truth is operating on a decaying snapshot of reality and calling it current.

The honest product isn't a verdict. It's a tested claim chain — a transparent record of what was argued, what was challenged, what survived, and what didn't. The human sees the full map. The human decides.

That principle is the reason the moderator gate exists in the protocol. And it's the reason I believe the most valuable output of any AI-assisted analysis isn't the conclusion — it's the documented disagreement that the conclusion survived.

If the models converge, you've lost the signal. The silence is the failure mode, not the success state.

The full research paper is available at DOI: 10.5281/zenodo.19229039. The protocol specification is open at consiliaproject.org.

This is Part 2 of a three-part series on epistemic rigour in the age of AI.

← Part 1: You Can't Handle the Truth — the philosophy.
Part 3: Coming soon — the protocol. What it looks like when disagreement is the product.

Battle Scars of a Builder

From Fortran IV to Agent Swarms — 40 Years of Learning Not to Trust Smooth Output

In 1984, I was a high school kid staring at a terminal, trying to get Fortran IV to compile. One punch card off by a single column and the whole thing failed. No Google. No Stack Overflow. Just the manual and whatever patience a teenager could muster. That first failed compile taught me a lesson I've carried for more than forty years: computers don't forgive assumptions.

I've been writing code ever since. Not always well. Not always successfully. But always with the quiet understanding that the machine will do exactly what you tell it — and if what you told it was wrong, the consequences belong to you.

This is the story of how those hard lessons eventually became a protocol.

The Business Scar

By the early 1990s I was building real software for real businesses with Clipper and dBase — bookkeeping, payroll, inventory systems. A single bug in a .prg file could corrupt datasets across a stack of floppies. No cloud backups. No version control. No undo. If payroll cheques bounced because of a rounding error, that was entirely on me.

There was no room for "close enough." The ledger either balanced or it didn't. The inventory either matched the warehouse or someone lost money. That discipline — code must be correct in the real world, not just on the screen — never left me.

It's the same reason I still don't trust fluent AI output that has never been checked against anything outside its own training data.

The Noisy Channel Scar

Imagine a Taiwan-clone IBM XT running at 4.77 MHz, pushed into turbo mode at 8 MHz — a switch that sometimes just made things crash faster. An amber CGA monitor. No mouse. Debugging meant watching lines of text scroll and printing statements to paper when you felt fancy.

Then connect it through a 28.8k dial-up modem. That familiar screeching handshake every time someone picked up the phone at home. Open mIRC, join #programming, paste a snippet, wait thirty seconds, and pray the line stayed alive long enough to get an answer.

Every dropped connection meant lost context. Every reconnect meant rebuilding the session from scratch. Every turbo crash taught the same brutal truth: smooth performance is not the same as correct performance. The XT looked like it was running fine in turbo mode — right up until Lotus 1-2-3 corrupted the spreadsheet or an interrupt vector locked the machine hard.

Fluency is not correctness. Those hundreds of wasted hours on unreliable hardware and flaky connections are why fault tolerance is baked into everything I build today.

The Quant Scar

In 2014 I moved into quantitative finance. Python. StrategyQuant X. Genetic algorithms breeding thousands of trading strategies across simulated market data.

This is where the deepest mark was left.

A strategy could look perfect in-sample — explosive returns, tiny drawdowns, beautiful Sharpe ratios. Put it on live out-of-sample data and it would quietly bleed capital. It hadn't discovered signal; it had memorised noise.

The only real protection is strict In-Sample / Out-of-Sample validation. Train on one period, test on data the model has never seen. If performance collapses, you were curve-fitting.

You never trade in-sample results.
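For readers who never had to learn it the hard way, the discipline fits in a few lines. This is a generic sketch of the idea, not code from any trading system of mine; the function arguments are placeholders.

```python
def in_sample_out_of_sample(prices: list[float], strategy_fit, strategy_run, split: float = 0.7):
    """Fit on one period, judge on data the strategy never saw."""
    cut = int(len(prices) * split)
    in_sample, out_of_sample = prices[:cut], prices[cut:]

    params = strategy_fit(in_sample)                # optimise only on the past it is allowed to see
    is_pnl = strategy_run(params, in_sample)        # will almost always look good
    oos_pnl = strategy_run(params, out_of_sample)   # the only number you are allowed to believe

    # If performance collapses out of sample, the strategy memorised noise, not signal.
    return is_pnl, oos_pnl
```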

That single rule became the foundation for everything that followed. Because in 2024 I realised that every frontier language model is doing exactly the same thing at civilisational scale: producing confident output based on overlapping training corpora, with no mechanism to detect when that data is stale, biased, or simply wrong.

When multiple models agree, it feels like corroboration. But if they all drank from the same web-crawled well, it's just the same backtest running three times.

Convergence is not corroboration.

The Infrastructure Scar

By 2017 I was colocating IBM x3550 M4 servers at Equinix SY3 in Sydney — a real low-latency environment where microseconds mattered and hardware could fail without warning. This wasn't abstract cloud computing. Markets don't pause for reboots.

That experience taught me that infrastructure must assume failure at every layer. The smoothest algorithm is useless if the underlying system cannot survive real stress.

The Reset

Then COVID arrived. Systems went quiet. Servers gathered dust. Sometimes the deepest marks aren't from mistakes you made — they simply arrive. The only question is which direction you choose when the world starts moving again.

The Current Work

The pause ended pointing toward AI.

Not because AI was fashionable, but because I recognised the same overfitting pattern repeating at massive scale — this time embedded in the default knowledge layer for hundreds of millions of people. Language models trained on overlapping data, producing confident consensus that nobody was testing against out-of-sample evidence. The same failure mode that blows up backtests — mistaking training-data agreement for real-world truth — was now running at civilisational speed.

So I built what I know how to build: a validation layer.

The Consilium Protocol puts AI claims through the same discipline I once applied to trading strategies. Multiple models debate under adversarial pressure — that's the in-sample stress test. Live evidence retrieval tests the survivors against the real world — that's the out-of-sample validation. The human stays in control with the full picture, not a single smooth compass bearing.

Across 1,478 structured deliberation sessions and 32 topics, the protocol has already exposed measurable epistemic blind spots created by alignment training — and shown that carefully engineered cognitive personas can partially counteract them, even on free-tier models.

Every lesson from every era lives inside that architecture:

Clipper payroll discipline → claims must be validated or real consequences follow.
Dial-up fault tolerance → survive noisy, interrupted channels.
XT turbo lesson → fluency is not correctness.
Quant validation → never trust in-sample consensus.
Equinix reality → build for failure at every layer.

People see fluent AI text and credit the model. What they rarely see is the deliberation behind it — the adversarial rounds, the evidence checks, the deliberate cognitive posture that shapes raw output into something closer to justified belief.

Most prompting is unconscious. Every prompt carries a personality — a set of assumptions, a depth of experience, a tolerance for being challenged. The protocol makes that personality conscious.

That is the fix I am trying to ship — or at least to explain as clearly as I can.

VD Doske is an independent researcher and founder of Consilia, a multi-model AI deliberation platform. Based in Sydney, Australia.

[email protected] · Consilia.app · SubMedium.com · MyCivic.app

Doske

Independent researcher. Quantitative finance → AI epistemic infrastructure.

I build tools for people who make decisions that matter. My background is in quantitative finance — genetic algorithm-based trading strategy development, in-sample/out-of-sample validation frameworks, and the hard lesson that a backtest that looks good is not the same thing as a strategy that works.

That lesson — the difference between fitting to historical data and producing reliable forward performance — turns out to be the central problem in AI-assisted decision-making. Models trained on yesterday's data give you yesterday's answers with today's confidence. The tools I'm building now apply the same rigour I learned in quant finance to the question of when you can trust an AI's output and when you can't.

I've been writing code since 1984. The tools change. The question doesn't: how do you know what you think you know?


Founder Consilia — Multi-model AI deliberation platform
Author Consilium Protocol — Open specification for structured AI debate
Founder SubMedium — Consilia-verified news & media aggregator
Co-founder MyCivic — Municipal service coordination platform
Founder Project Hydra — Autonomous agent swarm infrastructure

"Emergent Collaborative Deliberation in Multi-Model AI Systems"

A BFT-derived protocol for structured epistemic synthesis. 1,478 sessions, 46,811 messages, 32 topics. Introduces the Convergence Index, the IS/OOS validation framework, and documents asymmetric RLHF convergence pressure across domains.

Zenodo DOI · arXiv (pending) · ORCID · consiliaproject.org
2020s — AI epistemic infrastructure: multi-model deliberation, BFT-derived protocols, agent swarm architecture
2017 — IT infrastructure & systems: enterprise environments
2014 — Quantitative finance: genetic algorithm strategy development, IS/OOS validation, StrategyQuant
1984– — Writing code: four decades, through every paradigm shift from 8-bit to transformers
Sydney, Australia

What I'm Building

Verification and accountability — across claims, news, and civic services.

PLATFORM · IN DEVELOPMENT
Founder
Consilia
Multi-model AI deliberation platform. BFT-derived protocol for epistemic accountability. 1,478 sessions run. Paper published.
consilia.app
RESEARCH · PUBLISHED
Author
Consilium Protocol
Open specification for structured multi-model deliberation. Research paper published on Zenodo, arXiv pending. The protocol is open. The platform is the product.
consiliaproject.org
MEDIA · IN DEVELOPMENT
Founder
SubMedium
News and media aggregator with Consilia-verified fact-checking. Every article published carries a tested claim chain — not an editorial opinion.
submedium.com
GOVTECH · IN DEVELOPMENT
Co-founder
MyCivic
Municipal service coordination platform. Citizens report, cities resolve, the system verifies. Geo-tagged proof, SLA tracking, full audit chain. A Qntico product.
mycivic.app
INFRASTRUCTURE · IN DEVELOPMENT
Founder
Project Hydra
Autonomous agent swarm infrastructure designed to be self-hosted, cost-efficient, and architecturally superior to naive horizontal scaling approaches.
No public link

"Consilia verifies claims. SubMedium verifies news. MyCivic verifies civic resolutions. The common thread: accountability through evidence, not trust through authority."