Cresta's AI Voice Maestro is Leaning Into Human Flaws to Craft a Better AI Speech Tool
Henry Zhang, Product Leader at Cresta, explains why building credible AI voices depends on breaking speech into its smallest elements and tuning them for emotion, context, and real conversations.

Key Points
Many AI voices fail because they sound technically correct but miss the subtle speech cues that signal confidence, emotion, and appropriateness in real conversations.
Henry Zhang, a product leader at Cresta, explains how breaking the human voice into timing, pitch, timbre, and context reveals why those cues matter.
By rebuilding the voice-cloning process from the ground up and refining voices through real use, the team designs AI voices that respond appropriately to emotional and situational context, not just scripts.
Humans are finely tuned to the sound of other humans. Even without pinpointing why, people can immediately sense when a voice just feels off. Emotion and intent in AI voice come from deliberate, incremental choices, and those choices determine whether credibility is earned when it matters.
To understand the ins and outs, we spoke with Henry Zhang, a Product Leader at Cresta at the forefront of building conversational AI agents. His perspective is shaped by a career path that includes co-founding an EdTech platform, technical roles at Google and Bloomberg, and strategy consulting as an Engagement Manager at McKinsey & Company. For him, before you can build a better voice, you first need to break it down.
"The first thing we set out to do was break down the components of voice and understand what actually makes a human sound unique. The key dimensions are how fast you speak, how you pause between ideas, and how you control pitch within a conversation," says Zhang. To answer the question of what, precisely, it means for an AI to sound "good," Zhang's team hunted for the specific vocal traits that build or erode trust. That principle meant engineering a voice that deliberately avoids the subconscious signals of nervousness that listeners are so sensitive to.
What's up, talk?: Where AI voices tend to break down is in subtle vocal habits that signal uncertainty rather than confidence. "You see bad speech show up in things like exaggerated uptalk, where the rising tone at the end of a statement makes it unclear whether something is a statement or a question," Zhang explains. He points to a related issue at the opposite end of the spectrum. "As pitch and volume drop, you start to hear vocal fry: that strained, creaking quality that humans associate with nervousness or uncertainty."
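The uptalk cue Zhang describes is, at its core, a rising pitch contour at the end of a statement. As a rough illustration of how such a cue could be flagged automatically, here is a minimal sketch in pure Python; the pitch values, thresholds, and function name are invented for illustration and are not Cresta's method.

```python
# Hedged sketch: flagging "uptalk" in a fundamental-frequency (F0) contour.
# All values and thresholds below are illustrative assumptions.

def ends_with_uptalk(f0_hz, tail_fraction=0.2, rise_threshold_hz=15.0):
    """Return True if pitch rises notably over the final stretch of an utterance.

    f0_hz: list of pitch samples in Hz (voiced frames only).
    tail_fraction: portion of the contour treated as the "ending".
    rise_threshold_hz: minimum rise across the tail to count as uptalk.
    """
    tail_len = max(2, int(len(f0_hz) * tail_fraction))
    tail = f0_hz[-tail_len:]
    return (tail[-1] - tail[0]) > rise_threshold_hz

# A declarative ending: pitch drifts downward toward the final word.
statement = [210, 205, 200, 198, 195, 190, 185, 180]
# An uptalk ending: pitch jumps upward over the last frames.
question_like = [210, 205, 200, 198, 195, 205, 220, 240]

print(ends_with_uptalk(statement))      # False
print(ends_with_uptalk(question_like))  # True
```

A production system would work from a real pitch tracker rather than hand-written contours, but the underlying signal, a rising terminal slope where a falling one is expected, is the same one listeners pick up on subconsciously.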
Voicing trust: That same sensitivity shapes how people respond to vocal tone. "What's known as a 'darker' voice is typically associated with executives who bring more trust, which is why we look for voices with a darker color and less brightness," he continues. In AI voice development, that association matters because tone becomes a proxy for authority and credibility the moment the system speaks. Small shifts in vocal color can change how seriously a customer takes the interaction, even before the content of the response fully lands.
When Zhang's team evaluated hundreds of off-the-shelf voices from top vendors, they had an insight: none of the available voices combined naturalness with the specific characteristics needed for reliable, "production-ready performance." As a result, they chose to build their own. The search for the right voice ended internally with Claire, a sales rep at Cresta, in a process that balanced capturing candid stories with the structured collection of data.
Story time: In the studio, Zhang explains, much of the time was spent talking with Claire about her life stories. "You want to capture the extreme happiness. You want to hear some sadness. You want to hear some empathy. In natural stories, people will have to recall things," he notes. "That helps you have the natural pauses."
Pacing the pattern: AI voices have a tendency to speak too quickly, especially when handling dense or sensitive information. To correct for that, the team had Claire read from tightly controlled scripts with anonymized data, designed to slow the voice down in specific moments. "You have to deliberately pronounce those numbers, email addresses, actual physical addresses, SSNs," Zhang says. "Otherwise, the listener is unable to process that sort of information."
Quantifying quality: Once a baseline voice leaves the studio, it's stress-tested in real use, where even small imperfections become obvious. "Our go-to-market team can flag a single sentence that sounds unnatural, like when the voice says 'hang on a second' without the right inflection, and that feedback tells us exactly what needs to be rerecorded," explains Zhang. That qualitative input is paired with rigorous measurement. "We listen to roughly 300 samples of the same voice and analyze them across about fifteen dimensions, including rhythm, pauses, and intonation, to understand precisely where the voice succeeds and where it falls short."
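To make the batch-evaluation idea concrete, here is a minimal sketch of scoring voice samples against target bands for a few prosodic dimensions. The article mentions rhythm, pauses, and intonation among roughly fifteen dimensions; the three used here, along with the target bands, sample data, and scoring function, are invented for illustration and are not Cresta's actual metrics.

```python
# Hedged sketch: per-dimension prosody scoring against assumed target bands.
from statistics import mean

def score_sample(words_per_sec, mean_pause_sec, pitch_range_hz):
    """Map raw measurements to 0-1 scores per dimension."""
    def band_score(value, lo, hi):
        # 1.0 inside the target band, decaying linearly outside it.
        if lo <= value <= hi:
            return 1.0
        dist = (lo - value) if value < lo else (value - hi)
        return max(0.0, 1.0 - dist / (hi - lo))
    return {
        "rhythm":     band_score(words_per_sec, 2.0, 3.0),   # speaking rate
        "pauses":     band_score(mean_pause_sec, 0.3, 0.8),  # pause length
        "intonation": band_score(pitch_range_hz, 40, 120),   # pitch movement
    }

samples = [
    (2.4, 0.5, 80),   # comfortably inside all target bands
    (4.1, 0.1, 15),   # too fast, clipped pauses, flat intonation
]

for rate, pause, prange in samples:
    scores = score_sample(rate, pause, prange)
    print({k: round(v, 2) for k, v in scores.items()},
          "avg:", round(mean(scores.values()), 2))
```

Run over hundreds of samples, per-dimension averages like these make it possible to say precisely where a voice succeeds and where it falls short, complementing the qualitative flags from the go-to-market team.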
As AI voices approach human-like realism, the focus is already shifting to what Zhang describes as "the next layer of effectiveness: context-aware responses and empathy." He illustrates this with an example where vocal delivery, informed by context, determines the entire meaning of the interaction.
Not the time or place: The challenge becomes even sharper in emotionally charged settings like healthcare, where delivery matters as much as content. An AI might generate a perfectly correct response on paper, but tone determines whether it lands appropriately. "If you strip away the context and just generate a sentence, it can sound completely wrong for the situation," Zhang says. In healthcare interactions, cues from the prior exchange shape how a response should sound, from volume to pacing to inflection. "That context from the previous turn is extremely informative, and without it, the voice risks sounding careless even when the words are technically correct."
The goal is not just realism, but appropriateness. "The parts where emotions are at play are where sounding human is actually crucial," Zhang concludes. As AI voices move closer to human range, their success will depend less on how convincingly they mimic speech and more on whether they rise to the emotional stakes of the moment.