Data Slug

The Boardroom: When AI Agents Need to Disagree
TL;DR: Skip to the Demo Results section to see AI agents actually debate whether to implement unlimited PTO.

I have always been curious about what happens inside corporate executive meetings. Not the boring status updates, but the real policy discussions—the ones where the CFO pushes back on the CEO’s ambitious plans, where HR raises concerns about employee morale, and where Legal flags compliance risks nobody thought about.

These meetings fascinate me because they represent something we rarely see: productive disagreement. Different perspectives, each valid in their domain, colliding and shaping each other until a decision emerges. As someone who’s spent years in platform engineering leadership, I’ve sat through plenty of these. The best outcomes came when people challenged each other, not when everyone nodded along.

So naturally, I wondered: could I simulate this with AI agents?

The Problem with “Yes-Men” Agents

My secondary motivation here is more technical. I wanted to understand how AI agents communicate with each other when tackling complex problems. Most multi-agent frameworks I’ve seen follow one of two patterns:

  1. Simultaneous independent thinking – Each agent forms a complete thought, then a “leader” synthesizes them
  2. Sequential delegation – One agent passes work to the next in a pipeline

Both approaches have a fatal flaw: agents don’t actually debate. They just contribute their piece and move on. There’s no back-and-forth, no challenging assumptions, no “wait, but what about…” moments that make real discussions productive.

Dev Note

Right as I was building this, Andrej Karpathy released a similar project on GitHub exploring simultaneous multi-agent systems. His approach was elegant, but I wanted something different—agents that actually respond to each other, not just to the problem.

What I Wanted to Build

Here was my vision:

  • Generate corporate policy through constructive discussion – Not just collecting viewpoints, but simulating the debate
  • Make the thought process visible – Users should see how the decision was reached, not just the conclusion
  • Agents with actual opinions – Biased perspectives that need to defend their position with data and experience
  • Fair speaking time – Like a good facilitator, prevent any agent from dominating
  • No immediate agreement – The worst meetings are where everyone says “sounds good to me” after the first suggestion

The challenge? Most AI agent frameworks aren’t designed for this. They optimize for task completion, not deliberation.

The Coding Journey (Or: How I Got This Wrong Three Times)

Attempt 1: Existing Agent-to-Agent Frameworks

My first instinct was to use an existing framework. I asked Claude (yes, I’m using AI to build AI systems—very meta) to suggest approaches. It recommended combining Agno with a custom data-bus network so the agents could run simultaneously.

This… didn’t work. The agents could communicate, but they were essentially broadcasting into the void. No real interaction.

Attempt 2: Google’s A2A Framework

I tried Google’s Agent-to-Agent framework next, thinking maybe they’d solved this problem. Nope. Still optimized for delegation and task-passing, not discussion.

Attempt 3: Building from Primitives

After more prompting (honestly, the AI coding assistant was helpful here, even if it kept suggesting the wrong frameworks), I realized I needed to drop down to LangGraph primitives and build the discussion architecture myself.

The breakthrough was treating it like a state machine with a speaking queue:

  1. All agents listen to whoever is currently speaking
  2. Each agent calculates a “relevance score” based on keywords and their domain
  3. The agent with highest score speaks next
  4. Repeat until consensus or max turns
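The queue above can be sketched in a few lines of Python. This is a toy version with illustrative names—the real system wraps LLM calls and a consensus check, both stubbed out here:

```python
class Agent:
    """Toy agent: scores relevance by keyword overlap (illustrative, not the real class)."""

    def __init__(self, name, keywords):
        self.name = name
        self.keywords = set(keywords)

    def relevance_score(self, message):
        # Step 2: how many of this agent's domain keywords appear in the message?
        return len(self.keywords & set(message.lower().split()))

    def respond(self, message):
        # A real agent would call an LLM here; we just echo a stance.
        return f"{self.name} weighs in on: {message}"


def run_discussion(agents, topic, max_turns=10):
    """Steps 1-4: everyone scores the current message, highest scorer speaks next."""
    transcript, last_message = [], topic
    for _ in range(max_turns):
        speaker = max(agents, key=lambda a: a.relevance_score(last_message))
        last_message = speaker.respond(last_message)
        transcript.append((speaker.name, last_message))
        # Step 4 would also break early on consensus (omitted in this sketch)
    return transcript
```

In the actual implementation this loop lives inside a LangGraph state machine rather than a plain `for` loop, but the control flow is the same.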

Here’s the core architecture that emerged:

# Pre-configured agent configs for common personas
from dataclasses import dataclass

@dataclass
class AgentConfig:
    # Definition inferred from the configs below; the original class may differ
    persona: str
    name: str
    role: str
    description: str
    keywords: list

FOUNDER_CONFIG = AgentConfig(
    persona="founder",
    name="Founder Agent",
    role="CEO pushing for ambitious, growth-oriented decisions",
    description="Challenges conservative thinking, emphasizes competitive advantage",
    keywords=["strategy", "vision", "growth", "competitive", "innovation", "founder", "ceo"],
)

HR_CONFIG = AgentConfig(
    persona="hr",
    name="HR Agent",
    role="Head of People advocating for employee well-being",
    description="Raises morale concerns, ensures fairness and practical implementation",
    keywords=["employee", "retention", "morale", "culture", "fairness", "hr", "people"],
)

FINANCE_CONFIG = AgentConfig(
    persona="finance",
    name="Finance Agent",
    role="CFO protecting company budget and financial health",
    description="Questions costs, demands ROI justification, challenges expensive ideas",
    keywords=["cost", "budget", "financial", "roi", "revenue", "finance"],
)

LEGAL_CONFIG = AgentConfig(
    persona="legal",
    name="Legal Agent",
    role="General Counsel minimizing legal and compliance risk",
    description="Flags regulatory issues, ensures legal defensibility, protects from liability",
    keywords=["compliance", "legal", "risk", "liability", "regulatory"],
)


The First Big Problem: Everyone Agreed Too Fast

I ran my first test discussion: “Should we implement unlimited PTO?”

All four agents spoke once. Then they all agreed. Total discussion: 4 turns.

This was hilariously unrealistic. Real executives don’t roll over that easily. I needed my agents to be more… opinionated.

The fix: I rewrote their system prompts to be skeptical and biased toward their domain. Finance became obsessed with ROI. Legal saw compliance landmines everywhere. HR worried about fairness and morale. The Founder pushed for competitive advantage.
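The prompt rewrite looked roughly like this—the template and wording below are illustrative, not the exact prompts I shipped:

```python
# Illustrative persona prompt template -- the production prompts differ
SKEPTICAL_PROMPT = """You are {name}, {role}.
Stay biased toward your domain: {description}.
Do NOT agree with a proposal until your core concerns are addressed.
Challenge at least one assumption in the previous speaker's statement,
and support your position with data or experience."""


def build_system_prompt(name, role, description):
    """Render one agent's skeptical system prompt from its config fields."""
    return SKEPTICAL_PROMPT.format(name=name, role=role, description=description)
```

The key change was the explicit instruction to withhold agreement and to challenge the previous speaker, which is what turned polite turn-taking into actual pushback.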

Better, but still not quite right.

The Second Big Problem: Token Explosion

As the discussion got longer and more realistic, I hit this error:

ERROR Rate limit error from OpenAI API: Error code: 429 - {'error':
{'message': 'Request too large for gpt-4o in organization
org-8WqgqXQayVA4AUqhmfRtQNK4 on tokens per min (TPM): Limit 30000,
Requested 49012. The input or output tokens must be reduced in order to
run successfully. Visit https://platform.openai.com/account/rate-limits
to learn more.', 'type': 'tokens', 'param': None, 'code':
'rate_limit_exceeded'}}


Each agent needed context from the entire conversation history to respond intelligently. But 6+ agents × 10+ turns = way too many tokens.

The fix: Context compression. Each agent now only stores the last response from the previous speaker, plus a running summary. It’s lossy, but it works.
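A minimal sketch of that compression, assuming a character budget as a stand-in for tokens (the summarizer here just truncates; the real system would use an LLM-generated abstractive summary):

```python
def compress_context(running_summary, last_response, max_chars=500):
    """Fold the latest turn into a lossy running summary (illustrative)."""
    combined = f"{running_summary} {last_response}".strip()
    # Keep only the tail so the prompt stays under budget; an LLM summary
    # would preserve the gist instead of literally chopping the front off.
    return combined[-max_chars:]


def build_agent_context(running_summary, previous_speaker_response):
    # Each agent sees only the compressed summary plus the most recent turn,
    # never the full transcript.
    return {"summary": running_summary, "last_turn": previous_speaker_response}
```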

Dev Note

This mirrors how real humans handle long meetings—we don’t remember every word, just the gist and the most recent point being debated.

The Third Big Problem: Logging Everything

When your discussion happens at AI speed, you need visibility into what’s happening. I added comprehensive logging to track:

  • Who spoke when
  • Their relevance scores
  • Why they were selected to speak
  • What tools they accessed

This turned out to be critical for debugging why certain agents were dominating or staying silent.
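A structured log line per turn covers all four of those fields. The schema below is my own naming, not the project's actual log format:

```python
import json
import logging
import time

logger = logging.getLogger("boardroom")


def log_turn(turn, speaker, scores, reason, tools_used):
    """Emit one JSON record per speaking turn (illustrative schema)."""
    record = {
        "ts": time.time(),
        "turn": turn,
        "speaker": speaker,               # who spoke when
        "relevance_scores": scores,       # every agent's score this turn
        "selection_reason": reason,       # why this agent was picked
        "tools_used": tools_used,         # which mock tools it called
    }
    logger.info(json.dumps(record))
    return record
```

Keeping the scores in every record is what makes "why is Legal silent?" answerable after the fact: you can grep the log and watch an agent's score get penalized turn after turn.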

Key Innovations: The “Fairness” System

The most interesting technical piece was the speaking fairness algorithm. Here’s how it works:

Relevance Score Calculation:

  1. Start with keyword matching (how relevant is this topic to this agent’s domain?)
  2. Penalize recent speakers – If you just spoke, your score drops significantly
  3. Boost quiet agents – Haven’t spoken in a while? Score goes up
  4. Add randomness to break ties
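The four steps combine into a single scoring function. The weights below are illustrative placeholders, not the tuned values:

```python
import random


def relevance_score(message, keywords, turns_since_spoke,
                    recency_penalty=2.0, quiet_boost=0.5):
    """Fairness-aware speaker score (weights are illustrative)."""
    words = set(message.lower().split())
    score = float(len(set(keywords) & words))      # 1. keyword relevance
    if turns_since_spoke == 0:
        score -= recency_penalty                   # 2. penalize the last speaker
    else:
        score += quiet_boost * turns_since_spoke   # 3. boost quiet agents
    score += random.random() * 0.1                 # 4. small jitter to break ties
    return score
```

With a setup like this, an agent that just spoke needs a strong keyword match to hold the floor, while a long-silent agent eventually outbids everyone even on a weak match—which is exactly the facilitator behavior I wanted.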

This created surprisingly natural-feeling discussions where agents would:

  • Jump in when their expertise was relevant
  • Let others take the lead when appropriate
  • Circle back with follow-up points

Giving Agents “Experience” Through Mock Tools

Here’s something that really elevated the quality: giving each agent access to domain-specific “tools” that simulate having data and experience.

When Finance speaks, it can query a mock API that returns budget impact estimates. When HR evaluates a policy, it checks fairness scores and employee sentiment data. Legal gets compliance risk assessments.

if self.persona == "finance":
    return {
        "estimated_cost": 50000,
        "roi_timeline": "18 months",
        "budget_impact": "moderate",
        "cost_risk": 0.3,
        "recommendations": [
            "Phased rollout to control costs",
            "Monitor budget closely",
            "Identify cost-saving opportunities",
        ],
    }


These aren’t real data calls (yet—that’s v2), but they give agents something concrete to reference rather than just vibes and opinions.

Demo Results: The Unlimited PTO Debate

Let me show you what this actually looks like in practice. I gave the system this scenario:

Company: RocketShip AI (Series A startup, 35 employees)

Question: Should we implement unlimited PTO?

Expected Difficulty: Easy (most startups do this nowadays)

Expected Turns: ~6

Here’s the actual discussion (condensed for readability):

================================================================================
💬 DISCUSSION TRANSCRIPT
================================================================================
[Turn 1] FOUNDER:
Implementing unlimited PTO is a strategic move that aligns with our commitment
to fostering a competitive workplace culture that attracts top talent. By
offering this benefit, we signal that we trust our employees to manage their
time responsibly while also prioritizing their well-being.
While some may argue that unlimited PTO could lead to abuse or decreased
productivity, it's essential to frame this policy within clear guidelines and
expectations. Rather than focusing on the potential negatives, we should
leverage this opportunity to cultivate a results-oriented culture where
performance is measured by outcomes rather than hours worked.
[Turn 2] HR:
I understand the strategic appeal, but from a well-being perspective, it's
crucial that we consider how this policy would be practically implemented.
In some organizations, employees might feel pressured NOT to take time off
unless there are clear guidelines and strong managerial encouragement,
leading to uneven usage and possible burnout.
To ensure fairness and support morale, we should establish a comprehensive
framework: setting expectations around minimum time off, having managers
lead by example, and promoting a culture where taking time off is both
encouraged and supported.
[Turn 3] FINANCE:
While I understand that unlimited PTO can be an attractive perk, we need to
assess the financial implications thoroughly. We need to ensure that our
operational efficiency and productivity levels do not suffer as a result.
I would like to see a clear ROI justification for this initiative. Could we
quantify the potential recruitment and retention benefits versus any possible
productivity losses? Perhaps conducting a pilot program or gathering data from
other companies could provide us with valuable insights.
[Turn 4] LEGAL:
I want to emphasize the necessity of addressing potential compliance and legal
implications. It's crucial that our policy is clearly defined and consistently
applied to avoid any perception of favoritism or discrimination, which could
lead to claims under employment laws.
We must ensure the policy complies with state and federal labor laws,
particularly those related to wage and hour regulations. Accurately tracking
time off may be necessary to avoid disputes over exempt versus non-exempt
employee classifications.
[Turn 5] HR:
I agree that unlimited PTO can be attractive, but we must ensure it's
implemented in a way that genuinely supports employee well-being. Regular
check-ins and active management support will be essential in monitoring PTO
usage and preventing disparities that could harm team morale.
✅ Discussion completed in 99.4s
Total turns: 5
Consensus reached: True


The final output? A nuanced policy recommendation that addresses concerns from all stakeholders—implement unlimited PTO, but with guardrails around minimum time off, manager training, and usage monitoring.

What I love about this: nobody just rolled over. Finance demanded ROI data. Legal flagged compliance risks. HR pushed for fairness metrics. The Founder kept the strategic vision front and center. This is how real policy discussions should work.

Lessons Learned

1. Fairness Metrics Matter

The speaking queue algorithm was more important than I expected. Without it, the Founder dominated every conversation (turns out CEOs do this in real life too). With fairness controls, quieter voices got their moment.

Key techniques:

  • Keyword relevance scoring
  • Penalty after speaking
  • Boost for agents who haven’t contributed recently

2. Tools Give Agents Credibility

Agents with access to “data” (even mock data) made more convincing arguments. “This will cost $50K and take 18 months to ROI” hits different than “this might be expensive.”

3. Context Compression Is Essential

You can’t keep the full conversation history for 10+ turns across multiple agents. Smart summarization and selective memory are necessary trade-offs.

4. Pre-defined Company Profiles Scale Testing

I used AI to create 7 company archetypes (early-stage startup, growth-stage, public company, bootstrapped, etc.) with different risk tolerances and budgets. This let me test the same policy question across different contexts quickly.

Available Company Scenarios:
• RocketShip AI - Series A, 35 employees, tight budget, aggressive risk
• DataFlow Analytics - Series C, 280 employees, moderate budget
• SecureCloud Corp - Public, 8500 employees, flexible budget, conservative
• Clarity Design Tools - Bootstrapped, 42 employees, profitable
• Heritage Financial - Mature, 3200 employees, transforming culture
• Global Education Alliance - Non-profit, 180 employees
• Anderson Manufacturing - Family business, 450 employees
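One way such profiles could be encoded—the fields below are my guesses from the scenario list above, not the project's exact schema:

```python
from dataclasses import dataclass


@dataclass
class CompanyProfile:
    """Fields inferred from the scenario list above (illustrative)."""
    name: str
    stage: str
    employees: int
    budget: str
    risk_tolerance: str


ROCKETSHIP = CompanyProfile("RocketShip AI", "Series A", 35, "tight", "aggressive")
SECURECLOUD = CompanyProfile("SecureCloud Corp", "Public", 8500, "flexible", "conservative")
```

Swapping the profile object passed into the discussion changes how every agent frames its arguments—Finance at RocketShip worries about runway, while Finance at SecureCloud worries about shareholder optics.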


What’s Next

This is very much v1. Some obvious improvements:

  • Real data integration – Connect agents to actual company metrics, budget data, compliance databases
  • Customized agents – Let users create custom agents to chime in, e.g., an Engineering Lead
  • Memory across sessions – Let agents learn from previous policy discussions
  • More sophisticated consensus detection – Right now it’s pretty naive
  • User steering – Let humans inject questions or concerns mid-discussion
  • Different discussion formats – Not just consensus-seeking, but also brainstorming, red-teaming, etc.

Closing Thoughts

Building this taught me that multi-agent systems need social dynamics, not just task delegation. The difference between agents that coordinate versus agents that debate is the difference between a workflow and a conversation.

Real decision-making involves tension, disagreement, and negotiation. If your AI agents just politely take turns agreeing with each other, you’re missing the entire point of having multiple perspectives.

The code for this is available on GitHub. I’m curious how others might use this framework—policy discussions are just one application. Could you use this for product prioritization? Architecture decisions? Ethical review boards?

If you’ve built something similar or have ideas for improving the discussion dynamics, I’d love to hear about it. Find me on LinkedIn.


Thanks to the AI Builders community for the early feedback on this concept, and kudos to Warp for being a surprisingly good pair programmer while I built a system designed to make AI argue with itself.

