Meta's Spirit LM generates more expressive voices that reflect anger, surprise, happiness and other emotions
Meta Platforms Inc.'s Fundamental AI Research team is going head-to-head with OpenAI yet again, unveiling a new open-source multimodal large language model called Spirit LM that can handle both text and speech as inputs and outputs.
These are the same capabilities that distinguish OpenAI's most powerful LLM, GPT-4o, as well as other multimodal models such as Hume AI Inc.'s EVI 2. Meta's artificial intelligence research team announced Spirit LM late Friday, saying it's designed to address some of the challenges around existing AI voice systems, which often sound somewhat robotic and emotionless.
The problem with traditional AI models is that they're unable to replicate the expressive qualities of human voices, such as tone and emotion. That's because they rely on automatic speech recognition to transcribe spoken inputs into text, feed that text to a language model to generate a response, and then convert the response back into audio with a text-to-speech model, a pipeline that discards how the words were originally spoken.
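The limitation is easiest to see in a minimal sketch of that cascaded pipeline. The code below is illustrative only, not Meta's or anyone else's implementation; the three stages are placeholder stubs standing in for real ASR, LLM and TTS components, and the point is that everything passed between them is plain text, so tone and emotion are stripped out at the transcription step and never reach the synthesizer.

```python
# Minimal sketch of a cascaded voice pipeline (placeholder stubs, not a real system).

def transcribe(audio: bytes) -> str:
    """ASR stage: audio in, plain text out; prosody and emotion are discarded here."""
    return "what time is it"          # placeholder transcript

def generate_reply(prompt: str) -> str:
    """LLM stage: operates purely on text, with no access to how the words were spoken."""
    return "It is three o'clock."     # placeholder response

def synthesize(text: str) -> bytes:
    """TTS stage: renders text in a default, typically flat, speaking style."""
    return text.encode("utf-8")       # placeholder waveform

def cascaded_voice_assistant(audio: bytes) -> bytes:
    # Each hop is text-only, which is why the output voice sounds robotic and emotionless.
    return synthesize(generate_reply(transcribe(audio)))

print(cascaded_voice_assistant(b"fake-audio"))
```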
Meta Spirit LM has an entirely different design, featuring tokens for phonetics, pitch and tone in order to add those expressive qualities to its speech outputs. At the same time, it's capable of learning new tasks across a range of modalities, including automatic speech recognition, text-to-speech and speech classification.
What that means is that it can learn and improve the way it converts spoken language into text, generates spoken language from text, and identifies and categorizes speech based on its content or emotional tone.
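A rough way to picture the design is as a single token stream shared by both modalities. The snippet below is an invented illustration, not Meta's implementation: it assumes speech is discretized into phonetic units that sit in the same vocabulary as ordinary text tokens, so one language model can switch between modalities mid-sequence. All token names are made up.

```python
# Invented illustration of a single-stream, mixed-modality token sequence.

text_tokens = ["[TEXT]", "the", "weather", "is"]
speech_tokens = ["[SPEECH]", "[PHONE_41]", "[PHONE_17]", "[PHONE_92]"]  # discretized speech units

# Training sequences interleave both modalities, so a single next-token objective
# covers speech-to-text, text-to-speech and mixed continuations.
interleaved = text_tokens + speech_tokens
print(interleaved)
```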
Meta said it's making two versions of Meta Spirit LM available to the research community under its FAIR Noncommercial Research License, which allows anyone to use, reproduce, modify and create derivative works for noncommercial purposes. Any distribution of these models or derivatives must also comply with the noncommercial restriction.
The models include Spirit LM Base, which uses phonetic tokens to process and generate speech, and Spirit LM Expressive, which is a more advanced version that includes tokens for pitch and tone. These allow it to understand and reproduce more nuanced emotions in voices, such as excitement and sadness, and reflect them in its own speech.
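Continuing the illustrative token-stream sketch above, the contrast between the two variants can be pictured as follows. Again, the token names are hypothetical: the idea is that Base models speech with phonetic units alone, while Expressive interleaves additional pitch and style tokens so cues like excitement or sadness survive into the generated audio.

```python
# Hypothetical contrast between the two variants' speech token streams (names invented).

base_stream = ["[PHONE_41]", "[PHONE_17]", "[PHONE_92]"]            # phonetic units only

expressive_stream = [
    "[PITCH_3]", "[STYLE_EXCITED]",                                  # prosody and style markers
    "[PHONE_41]", "[PHONE_17]", "[PHONE_92]",
]

print("Base:      ", base_stream)
print("Expressive:", expressive_stream)
```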
The models were trained on a wide range of text and speech datasets, allowing them to handle cross-modal tasks such as text-to-speech and speech-to-text with humanlike expressiveness in their outputs, Meta's researchers said.
According to the researchers, the Spirit LM Expressive model can also detect and reproduce emotional states such as anger, surprise and happiness in its speech outputs. They believe this will have huge implications for AI assistants such as customer service bots, where the ability to engage in more nuanced conversations can help to improve customer satisfaction.
Along with the two models, Meta is making all of the model weights, code and supporting documentation available to the research community, encouraging researchers to build on and experiment with them. The hope is that this will inspire others to explore new ways of integrating speech and text in multimodal AI systems.
In addition to Meta Spirit LM, Meta's research team also announced an update to its Segment Anything model, the image and video segmentation system revealed last year that's designed to power applications such as medical imaging and meteorology.
The company also published its latest research on boosting the efficiency of LLMs, as part of its broader goal to create advanced machine intelligence, or AMI.