When Language Becomes Sound: How Voices Connect Us

Speech, first and foremost, is a sonic form of communication. We express our thoughts and feelings through speech — and language is only one piece of the puzzle. Outside of the semantic content (language), speech contains an array of sonic attributes like pitch, rhythm, and timbre. The sound of our voice is a powerful communicator, and carries with it deep emotional content.

The way we speak plays a significant role in influencing our perceptions of others and the connections we form with them.

Scholars in psycholinguistics have found mountains of evidence suggesting that the way we speak influences how we feel about each other, and that we go to great cognitive lengths to make our speech more like our conversation partner’s. Giles’s “communication accommodation theory,” which has been borne out in hundreds of studies, papers, and books since its inception in the 1970s, describes the different non-semantic ways in which we adjust our mannerisms to accommodate our conversation partner:

Upon entering a communicative encounter, people immediately (and often unconsciously) begin to synchronize aspects of their verbal (e.g., accent, speech rate) and nonverbal behavior (e.g., gesture, posture). These adjustments are at the core of Communication Accommodation Theory (CAT).
(Giles 2016)

Sound Communicates Emotions

In a fascinating study from 2020, Lausen and Hammerschmidt found that acoustic parameters can effectively discriminate between emotional expressions. In a nutshell, we can tell a lot about how other people feel based purely on the sound they’re making as they speak; the semantic content isn’t required. This aspect of human communication is completely lost if we, for example, use a speech-to-text engine to turn speech into pure written language before we start to manipulate and learn from it with AI. You can use such an engine to turn the dialogue from a movie into text, but you would lose much of the emotional content and subtext that’s being communicated between the characters.
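To make "acoustic parameters" a little more concrete, here is a minimal sketch of two of them: intensity (as root-mean-square amplitude) and pitch (estimated from the autocorrelation peak). It is not the method used in the study cited above. The function names and the synthetic "calm" and "agitated" tones are assumptions made up for illustration, and real speech would need framing, voicing detection, and a far richer feature set.

```python
import math


def rms_intensity(samples):
    """Root-mean-square amplitude: a rough stand-in for loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_pitch(samples, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency in Hz from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)  # shortest period we consider
    max_lag = int(sample_rate / fmin)  # longest period we consider
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # How well does the signal line up with a copy of itself shifted by `lag`?
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def tone(freq, amplitude, sample_rate=8000, seconds=0.1):
    """Synthetic sine wave standing in for a short stretch of voiced speech."""
    n = round(sample_rate * seconds)
    return [amplitude * math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


# Hypothetical clips: neither contains a single word.
calm = tone(freq=100.0, amplitude=0.2)       # lower and quieter
agitated = tone(freq=200.0, amplitude=0.8)   # higher and louder

for name, clip in [("calm", calm), ("agitated", agitated)]:
    print(name, round(estimate_pitch(clip, 8000)), "Hz, RMS", round(rms_intensity(clip), 3))
```

Even these two numbers separate the clips without any words involved, which is the point: a transcript would throw both of them away.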

A person shouting into a public telephone

“This is me, telling you in words that I’m ANGRY!!”

– No person, ever

You can test this for yourself by watching a movie in a language you don’t understand. Even if you close your eyes, the sound of the film will give you quite a bit of information about what’s happening between characters.

In our work on Clio Music in the late ’00s, we found that our algorithm, which was trained to analyze audio containing music according to its musical and emotional content, was pretty darned good at clumping different sorts of speech-based audio tracks together based purely on their sound. For example, it knew the difference between stand-up comedy and film audio. This led us to some pretty promising business cases we never got around to.

Which leads to some questions…

All of this evidence leads me to a set of questions that could have massive business implications in a number of arenas, including dating apps and chat engines:

Are people attracted to the sound of a voice?

A white man with long dreadlocks, bright green sunglasses, and a loose-fitting paisley shirt. In a word, a hippie. — Chances are, you know something about how this person sounds.

Anecdotally, the answer would seem to be a resounding “yes.” We could all listen to Morgan Freeman read the phonebook, for instance. But on a much more personal and realistic level, if you think about the people you’ve been close with, you probably sound somewhat like them. You’re from the same place (think Atlanta vs. Boston), you belong to the same tribe (think valley girls vs. punks, hippies vs. yuppies), and you might even share things like general intensity level and speed of speech.

Given that we generally appreciate feeling that we can be ourselves around our friends and partners, it follows that we would prefer spending less of our mental energy accommodating them by changing how we talk, and so we’d tend to be attracted to people who talk like us.

Would people prefer interactions with AI assistants that display human-like mirroring capabilities?

Speech synthesis has been the subject of much research and investment, and machine learning has made huge leaps in this space; however, what if a machine speaker could alter its own speech in order to mirror a human companion in the same way that humans do? While today’s LLMs can certainly provide better semantic language mirroring than we’ve seen before, improving the emotional tone of a computer speaker requires further training from the perspective of sonic patterns and prosody. A machine learning system that uses intelligent sonic pattern processing could be significantly better at building trust and rapport, creating a sense of familiarity and a comfort that speaking with machines doesn’t provide today.
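As a thought experiment, the simplest version of machine mirroring might look like the sketch below: on each conversational turn, the machine nudges its own prosodic settings part of the way toward what it measured in the human's speech. Everything here is hypothetical, including the `Prosody` fields, the `accommodate` function, and the `rate` parameter. A real system would have to estimate these values from audio and feed them into a speech synthesizer.

```python
from dataclasses import dataclass


@dataclass
class Prosody:
    speech_rate: float  # syllables per second
    mean_pitch: float   # Hz
    intensity: float    # dB


def accommodate(agent, partner, rate=0.3):
    """Move each of the agent's settings a fraction `rate` toward the partner's."""
    def step(own, other):
        return own + rate * (other - own)

    return Prosody(
        speech_rate=step(agent.speech_rate, partner.speech_rate),
        mean_pitch=step(agent.mean_pitch, partner.mean_pitch),
        intensity=step(agent.intensity, partner.intensity),
    )


# Hypothetical starting points for a slow, low machine voice and a faster human.
agent = Prosody(speech_rate=4.0, mean_pitch=120.0, intensity=60.0)
human = Prosody(speech_rate=6.0, mean_pitch=180.0, intensity=70.0)

for turn in range(3):
    agent = accommodate(agent, human)
    print(turn, agent)
```

A `rate` of 0 means no mirroring at all, and a `rate` of 1 means outright mimicry. Where the comfortable middle sits is exactly the kind of thing that would need to be learned from data.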

Could the sound of your voice help make a love connection?

Ask anyone who’s used dating apps: it’s incredibly hard to find “the one.” Any identifiable trait that can lead to a meaningful connection (or simply disqualify someone you won’t connect with!) is valuable. We know that scent is one such trait, but so far, we don’t have a way to capture someone’s scent in our data centers. We can, however, capture a voice, and with the proper processing, voices could help connect people with each other before they even know it.

Imagine that you provide a recording of yourself to Dating App Y (X is obviously taken at this point). Y then analyzes your voice for a number of important characteristics: speed, intensity, accent, rhythm, etc. Y is, of course, doing the same thing for every other person on the app, and then uses that speech data alongside success data to train an AI system that learns what you like to hear in a partner.
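Here is a toy sketch of what that training loop could look like, under heavy assumptions. Each voice is reduced to three made-up numbers scaled to the range 0 to 1 (speed, intensity, pitch), the "success data" is a short invented list of past pairings, and the model is a tiny logistic regression on how far apart two voices are. It also learns one global preference rather than a per-person one. None of this describes a real app or dataset.

```python
import math


def pair_features(a, b):
    """Absolute differences between two voice profiles, feature by feature."""
    return [abs(x - y) for x, y in zip(a, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=2000, lr=0.5):
    """Fit a tiny logistic regression with plain stochastic gradient descent."""
    n = len(examples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for features, label in examples:
            pred = sigmoid(bias + sum(w * f for w, f in zip(weights, features)))
            error = pred - label
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def match_score(model, a, b):
    """Predicted probability that voices `a` and `b` belong to a successful pairing."""
    weights, bias = model
    return sigmoid(bias + sum(w * f for w, f in zip(weights, pair_features(a, b))))


# Invented history: (voice A, voice B, 1 if the pairing worked out else 0).
history = [
    ((0.5, 0.5, 0.5), (0.55, 0.45, 0.5), 1),
    ((0.2, 0.3, 0.4), (0.25, 0.3, 0.35), 1),
    ((0.8, 0.7, 0.9), (0.75, 0.8, 0.85), 1),
    ((0.1, 0.2, 0.1), (0.9, 0.8, 0.9), 0),
    ((0.9, 0.1, 0.5), (0.2, 0.9, 0.4), 0),
    ((0.3, 0.9, 0.2), (0.8, 0.2, 0.9), 0),
]

model = train([(pair_features(a, b), label) for a, b, label in history])

similar = match_score(model, (0.4, 0.5, 0.6), (0.45, 0.5, 0.55))
different = match_score(model, (0.1, 0.9, 0.2), (0.8, 0.2, 0.9))
print("similar voices:", round(similar, 2), "different voices:", round(different, 2))
```

Because the invented history only rewards similar-sounding pairs, the model learns "like attracts like." With real outcome data, it would be free to learn something else entirely, which is what makes the question interesting.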

While these questions would be difficult to address in a traditional research lab due to the overwhelming number of confounding variables, they’re exactly the sorts of questions that the right AI trained on the right dataset could provide quick and potentially valuable answers to.

Speech is so much more than language.

We’re well aware of the power of language. We’ve seen the incredible feats undertaken by the new and powerful large language models. By most expert estimations, language is the primary driver that took us from cave-dwelling primates to the sophisticated creatures we are today. Of course, written language has been a powerful force for moving ideas from one person to another — that’s what information does.

But well before writing, language was expressed orally as speech. If language is the primary carrier of symbolic information, speech is the set of wheels that takes it directly from one person to another. While this naturally limits speech’s viral efficiency when compared to written language, it does make it the perfect vehicle for human-to-human communication. And I think we can all agree that human connection is a thing that computers could facilitate better.

References/Further Reading

Danescu-Niculescu-Mizil, C., & Lee, L. (2011). Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs (arXiv:1106.3077). arXiv. http://arxiv.org/abs/1106.3077

Giles, H. (2016). Communication Accommodation Theory.

Lausen, A., & Hammerschmidt, K. (2020). Emotion recognition and confidence ratings predicted by vocal stimulus type and prosodic parameters. Humanities and Social Sciences Communications, 7(1), 2. https://doi.org/10.1057/s41599-020-0499-z

Pickering, M. J., & Garrod, S. (2004). Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27(2). https://doi.org/10.1017/S0140525X04000056

Weatherholtz, K., Campbell-Kibler, K., & Jaeger, T. F. (2014). Socially-mediated syntactic alignment. Language Variation and Change, 26(3), 387–420. https://doi.org/10.1017/S0954394514000155