Standup Buddy: What 10 Hours of Voice AI Taught Me About Engineer Communication

TL;DR: I built an async standup tool that detects when engineers are stuck from tone and conversational signals, not just the words they type. 8th place at AI Agents Waterloo. GitHub here. Skip to Demo Results if you want to see the personas in action.


The Problem Nobody Talks About Until It’s Too Late

Most async standups are text dumps. “Worked on X. Will do Y. No blockers.” Managers skim them, engineers phone them in, and somewhere around day four of that pattern, a project quietly falls off a cliff.

The real issue isn’t that engineers lie. It’s that they don’t know how to signal struggle in writing. “I think I’m fine” and “I’m completely underwater” look almost identical in a Slack message. By the time the words change, you’ve already missed two sprints.

But voice? Voice is different. Hesitation, vagueness, flat affect, the ratio of concrete details to abstract filler — these signals are present long before an engineer types “I’m blocked.” I wanted to know if you could actually detect them. So I signed up for AI Agents Waterloo, built a tool called Standup Buddy in 10 hours, and learned more than I expected about voice AI, async communication, and my own assumptions.


What I Thought Would Happen

My initial assumptions going in:

  • AI will do most of the coding, so this will be fast
  • Voice AI is mature enough to treat like any other API
  • I can test multiple approaches in parallel
  • Video demo will take maybe 30 minutes

All of these were wrong. Here’s how.


Attempt 1: The Feature Explosion

I had the core idea — async standup with emotional and conversational signal detection — and immediately started layering on features. Trend dashboards. Manager alerts. Team comparisons. Longitudinal scoring. Historical baselines.

Claude was happy to help me spec all of it.

That was the problem.

About two hours in I had a beautiful system design and zero working code. The cognitive load of holding every feature in my head while trying to ship something in 10 hours was genuinely disorienting. Part of me thought this was just the normal creative chaos of starting a project. But it wasn’t. It was scope creep disguised as ambition.

I had to stop, delete half my notes, and ask a different question: what is the smallest thing that demonstrates the concept?

The answer: one engineer, one standup session, one score. Everything else is v2.

Dev Note

This is the trap that hackathons expose brutally fast — the same trap that kills features in real engineering orgs. The difference between a hackathon and a sprint is just the timebox. The failure mode is identical: building the system you wish existed instead of the system you can actually ship.


Attempt 2: The AI-to-AI Testing Rabbit Hole

One thing I was genuinely proud of as an idea: instead of talking to my own tool all day, I’d use GPT-4o-mini-tts to generate simulated engineer personas. Feed it a character description, get back audio, pipe it into my pipeline. Clean, unbiased, repeatable.

This did not work.

The generation was too slow for live testing and way too slow for video recording. Each simulated response took long enough that any demo would just be… waiting. I burned about two hours on this before accepting that it was a dead end for the hackathon format.

What I kept from it, though, was the idea of the personas. I just moved them to pre-scripted voice recordings instead of dynamically generated ones. That decision saved the demo.


Attempt 3: The Video Demo Tax

Nobody warned me about this.

The hackathon required a two-minute demo video. Two minutes sounds easy. It is not easy when your product involves waiting for AI to process speech, generate follow-up questions, and return a score. The natural pace of the tool is just… slower than two minutes allows.

I spent three to four hours on video recording alone. Multiple takes. Trying to compress five simulated standup days into something that fit the format. This was the most demoralizing part of the day — not the technical problems, but the production problem I hadn’t anticipated.

If I do another hackathon: budget a hard four hours for video. It will take that long.


What Actually Worked: The Pipeline

Once I stopped adding features and focused on the core loop, the architecture came together quickly:

smallest.ai STT → GPT-4 dynamic follow-up → scoring pipeline → repeat (Day N)

Each standup session works like this: the engineer records their update, smallest.ai’s Pulse API transcribes it and returns emotional signal data (tone, sentiment, affect), then GPT-4 asks targeted follow-up questions based on what it heard. After the exchange, the pipeline produces a composite score and the cycle repeats the next day.

The multi-day loop is what makes this interesting. A single session is noise. Progression across five sessions is signal.
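The loop can be sketched in a few lines of Python. The three helpers here are hypothetical stubs standing in for the real smallest.ai Pulse and GPT-4 calls, with hardcoded return values so the shape of the loop stays visible:

```python
def transcribe_with_emotion(audio):
    """Stub for smallest.ai Pulse: returns (transcript, emotion signals)."""
    return audio, {"anxiety": 0.2}

def generate_followup(transcript, history):
    """Stub for the GPT-4 targeted follow-up question."""
    return "What is blocking you right now?"

def score_session(transcript, emotion):
    """Stub composite: 0.3 stands in for the conversational sub-score."""
    return round(0.7 * 0.3 + 0.3 * emotion["anxiety"], 2)

def run_standup_week(updates):
    """One score per session; the trajectory across days is the real signal."""
    history = []
    for day, audio in enumerate(updates, start=1):
        transcript, emotion = transcribe_with_emotion(audio)
        followup = generate_followup(transcript, history)  # would be played back to the engineer
        history.append({"day": day, "score": score_session(transcript, emotion)})
    return history

print(run_standup_week(["day one update", "day two update"]))
```

With real API calls swapped in, the structure stays the same: transcription and emotion extraction feed the follow-up, and the per-day history is what the multi-day analysis reads.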

Dev Note

This is the same reason a good manager doesn’t panic after one quiet 1:1. Context accumulates. What changes across days is more revealing than any single snapshot.


The Scoring System

The composite score is weighted 70% conversational, 30% emotional — intentionally.

# Conversational signals (70% weight), detected by GPT-4 analysis:
conversational_score = (
    vagueness * 0.3
    + (1 - specificity) * 0.3
    + (hedging / 20) * 0.2              # "um", "like", "I think", "kind of"
    + (0 if help_seeking else 1) * 0.2
)

# Emotional signals (30% weight), detected by smallest.ai Pulse API:
emotional_score = (
    (sadness + frustration) * 0.4
    + (1 - (happiness + excitement)) * 0.3
    + anxiety * 0.3
)

Emotion at 30% was a deliberate choice, not a default. Emotional signals alone are unreliable without a per-engineer baseline. Someone who naturally speaks in a flat, measured tone will score as “burnt out” on day one and every day after. That’s not signal, that’s bias baked into the model.

Conversational signals — vagueness, hedging, whether someone is actively seeking help — are more stable across personality types and more directly correlated with being stuck.
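The final blend is just the weighted sum of the two sub-scores above. A minimal sketch, where the example inputs are illustrative rather than real pipeline output:

```python
def composite_score(conversational_score: float, emotional_score: float) -> float:
    """Blend the two sub-scores with the 70/30 conversational-first weighting."""
    return round(0.7 * conversational_score + 0.3 * emotional_score, 4)

# A hedging, vague engineer who still sounds upbeat gets flagged anyway,
# because the conversational component carries most of the weight:
print(composite_score(0.8, 0.1))  # 0.59
```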

Dev Tangent

The 70/30 split is basically what a good engineering manager does instinctively in a 1:1. They’re not just listening to whether someone sounds sad. They’re listening for whether the answers are getting vaguer, whether next steps are getting murkier, whether the person stops asking questions. We’re just making that instinct legible.


The Personas: Where It Gets Interesting

Testing against my own voice all day would have introduced bias and also been exhausting. So I built five simulated engineer personas and ran each through multiple days of standup sessions:

Priya (Healthy) — concrete details, clear next steps, actively surfaces blockers early. She became the baseline: what “not stuck” actually sounds like, so the model has something to measure against.

Steve (Avoider) — confident-sounding on the surface, but never surfaces blockers. Conversational specificity is low. Help-seeking score: zero, every session.

Sarah (Overwhelmed) — vague, scattered, increasing anxiety markers across days. The progression here is the tell. Day 1 Sarah sounds fine. Day 4 Sarah sounds like a retention risk.

Marcus (Overconfident) — this one is the hardest case. High-confidence language, clear sentences, sounds great on first listen. But the specificity is hollow. Real blockers are getting masked by the framing. Pure sentiment analysis misses Marcus entirely. The conversational weighting catches him.

Alex (Burnt Out) — flat affect, low engagement, consistent across all five days. No trajectory, just steady low signal. Different from overwhelmed, which has a direction. Burnout is a plateau.

Marcus is why the 70/30 weighting matters. If you weight emotion too heavily, you’ll consistently miss the engineers who’ve learned to sound fine while quietly struggling. That’s not a hypothetical — that’s the pattern that precedes most surprise resignations.
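To make the contrast concrete, here is one way the five personas could be written down as expected signal profiles. The numbers are hand-picked illustrations of each failure mode, not measured pipeline output:

```python
# Illustrative expectations per persona (0-1 scales; higher = more risk).
PERSONAS = {
    "Priya":  {"vagueness": 0.1, "help_seeking": True,  "flat_affect": 0.1},  # healthy baseline
    "Steve":  {"vagueness": 0.5, "help_seeking": False, "flat_affect": 0.2},  # avoider
    "Sarah":  {"vagueness": 0.8, "help_seeking": True,  "flat_affect": 0.3},  # overwhelmed
    "Marcus": {"vagueness": 0.7, "help_seeking": False, "flat_affect": 0.1},  # overconfident
    "Alex":   {"vagueness": 0.4, "help_seeking": False, "flat_affect": 0.9},  # burnt out
}

# Marcus's trap in one line: emotionally he looks like Priya, conversationally he doesn't.
assert PERSONAS["Marcus"]["flat_affect"] == PERSONAS["Priya"]["flat_affect"]
assert PERSONAS["Marcus"]["vagueness"] > PERSONAS["Priya"]["vagueness"]
```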


Demo Results

Priya — Healthy Progress Engineer: AI-generated voice using GPT-4o TTS

Steve — Avoider: This is yours truly after editing videos around midnight


What Claude Actually Did (And Didn’t Do)

Something I want to be honest about: Claude’s default behavior when you give it a coding problem is to immediately start coding. No plan, no architecture discussion, just output.

That’s not how I work. And in a hackathon, that instinct costs you hours.

What I found useful instead: treat Claude as a prompt generator. Use it to iterate on the thinking before touching any code. Ask it to propose the plan. Then ask it to poke holes in the plan. Then generate the prompts you’ll actually use for implementation.

The other thing that saved me: I fed the hackathon rubric directly to Claude and asked it to evaluate my project against the scoring criteria. It flagged two gaps I would have missed. That’s not a hack, that’s just using the tool correctly.

Dev Note

I’ve started thinking of Claude as a thinking partner that happens to also write code — not a code generator that happens to also think. The order matters.


The Real Economics

For anyone evaluating whether to run something like this internally or encourage it on their team:

  • Warp (AI coding agent): ~$30 USD / ~2,000 credits for 10 hours of active development
  • Claude Pro: Didn’t hit the message limit across a full day of iteration
  • smallest.ai Pulse API: Hackathon sponsor access; production pricing worth evaluating separately
  • GPT-4 (analysis): Minimal cost at this scale — the calls are short

Total out-of-pocket for a 10-hour AI-augmented solo build: roughly $30. That’s a meaningful data point if you’re thinking about whether to give engineers a dedicated AI tooling budget for exploration days.


What I Actually Learned

Voice is a different modality, not just text with audio attached. VAD (voice activity detection) alone is a hard problem — your brain does it constantly without noticing. Voice AI models don’t necessarily listen and record simultaneously. Data chunking, silence detection, latency — all of these change the UX in ways that text-based AI doesn’t prepare you for.

Emotion signals need a baseline or they’re noise. Per-engineer calibration isn’t a nice-to-have. It’s the difference between a useful tool and a system that flags your quietest engineer as burnt out every single week.
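One plausible shape for that calibration, which the hackathon build does not do, is to score each session relative to the engineer's own rolling history rather than an absolute scale. A sketch, with baseline_adjusted as a hypothetical helper:

```python
from statistics import mean, stdev

def baseline_adjusted(score: float, history: list[float]) -> float:
    """Z-score today's composite against this engineer's own past sessions.

    A naturally flat speaker accumulates a high-scoring baseline, so their
    ordinary days stop looking like burnout; only deviations stand out.
    """
    if len(history) < 2:
        return 0.0  # not enough history to calibrate yet
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0  # perfectly consistent history: nothing unusual today
    return (score - mu) / sigma

# Alex's flat tone produces a steady 0.8; today's 0.8 is unremarkable:
print(baseline_adjusted(0.8, [0.8, 0.8, 0.8, 0.8]))  # 0.0
```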

Scope is the only real skill at a hackathon. Developers become product managers under time pressure. The ability to kill features fast — not reluctantly, but decisively — is what separates finished projects from beautiful unshipped system designs.

Read the rubric. Feed it to your AI. This sounds obvious. I almost didn’t do it. It changed my final submission.


What’s Next

Near term: the obvious gap is per-engineer baseline calibration. Right now Priya functions as a proxy baseline, which works for a demo and breaks in production. Real deployment means establishing what “normal” looks like for each person before the signals mean anything.

Further out: the same pipeline applied to live team discussions. Group dynamics have their own signal patterns — who dominates, who goes quiet, whose ideas get built on versus dropped. That’s a different problem than individual async standups, but the underlying architecture is the same.

This is very much v1. Some obvious improvements I haven’t touched: real-time feedback instead of batch scoring, manager-facing dashboards with trend alerts, integration with existing standup tooling rather than replacing it.

If you’ve built something in the async communication or team signal space — or if you’ve tried to tackle the “engineer doesn’t say they’re stuck” problem through any other means — I’d genuinely love to hear what you found. What signals actually predicted problems before they surfaced?


Built with: smallest.ai Pulse API, Warp, Claude, GPT-4. Thanks to AI Agents Waterloo for the forcing function.

GitHub: standup_buddy
