After announcing its AI assistant Moshi in July, Kyutai has, as promised, released open-source models. The release includes several items: a technical report, weights for Moshi and its Mimi codec, and streaming inference code in PyTorch, Rust, and MLX.
According to the report, Moshi consists of three main components: Helium, a 7B language model; Mimi, a neural audio codec; and a new multi-stream architecture. The system can model real-time conversations with overlaps and gaps. Kyutai provides two Moshi models with artificially generated voices. More details can be found in the published paper and the GitHub repository.
OpenAI previously demonstrated a comparable real-time voice conversation feature for GPT-4o but has not yet released it.
During the July presentation, Kyutai CEO Patrick Perez explained that Moshi was developed by an eight-person team in just six months. What sets Moshi apart is its ability to speak and listen in real time. Kyutai claims that Moshi's theoretical latency is just 160 milliseconds, while in practice it ranges from 200 to 240 milliseconds.
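The 160-millisecond figure can be reproduced with simple arithmetic. The sketch below is only a back-of-the-envelope check, and its two inputs are assumptions that do not appear in this article: a codec frame rate of 12.5 frames per second and one extra frame of acoustic delay, as described in Kyutai's published material. The function name is hypothetical.

```python
# Assumed values (not stated in this article): the codec emits 12.5 frames
# per second, and the model adds one frame of acoustic delay.
FRAME_RATE_HZ = 12.5
ACOUSTIC_DELAY_FRAMES = 1


def theoretical_latency_ms(frame_rate_hz: float, delay_frames: int) -> float:
    """Latency = one frame to collect the audio + the extra delay frames."""
    frame_ms = 1000.0 / frame_rate_hz  # duration of a single codec frame
    return frame_ms * (1 + delay_frames)


latency = theoretical_latency_ms(FRAME_RATE_HZ, ACOUSTIC_DELAY_FRAMES)
print(f"Theoretical latency: {latency:.0f} ms")
```

Under these assumptions, each frame lasts 80 milliseconds, so one frame of buffering plus one frame of delay gives the quoted theoretical value. The gap to the 200 to 240 milliseconds seen in practice would then come from compute and transport overhead.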
Moshi's architecture is based on a new approach that Kyutai calls the "audio language model." Instead of first converting speech to text, as conventional voice assistants do, this model heavily compresses audio data and treats the result as pseudowords. This allows it to work directly with audio data and predict speech, making it a natively multimodal model, similar to GPT-4o.
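To make the "pseudowords" idea concrete, here is a toy illustration of turning audio into discrete tokens by vector quantization. It is not Mimi and not Kyutai's code: the codebook is hand-written and hypothetical, whereas a real neural codec learns its codebooks and compresses far more aggressively. All function names are made up for this sketch.

```python
# Toy vector quantizer: each fixed-size audio frame is replaced by the index
# of its nearest codebook entry. The indices play the role of "pseudowords".

def frame_signal(samples, frame_size):
    """Split samples into consecutive non-overlapping frames (drop the tail)."""
    last_start = len(samples) - frame_size
    return [samples[i:i + frame_size] for i in range(0, last_start + 1, frame_size)]


def nearest_code(frame, codebook):
    """Return the index of the codebook entry closest to the frame."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: squared_distance(frame, codebook[i]))


def tokenize(samples, codebook):
    """Audio samples -> list of integer tokens."""
    frame_size = len(codebook[0])
    return [nearest_code(f, codebook) for f in frame_signal(samples, frame_size)]


def detokenize(tokens, codebook):
    """Integer tokens -> approximate audio samples."""
    out = []
    for t in tokens:
        out.extend(codebook[t])
    return out


# Hypothetical hand-written codebook: 4 entries, 4 samples each.
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],      # near silence
    [0.5, 0.5, 0.5, 0.5],      # positive plateau
    [-0.5, -0.5, -0.5, -0.5],  # negative plateau
    [0.5, -0.5, 0.5, -0.5],    # fast oscillation
]

samples = [0.01, -0.02, 0.0, 0.03,
           0.4, 0.6, 0.5, 0.45,
           0.6, -0.4, 0.5, -0.6]

tokens = tokenize(samples, CODEBOOK)
print(tokens)  # [0, 1, 3]
print(detokenize(tokens, CODEBOOK))
```

Once audio is a sequence of integers like this, a language model can predict the next token in the same way it predicts the next word, which is the core of the approach described above.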
First, a pure text model called Helium was trained. Then, combined training was performed with text and audio data. Synthetic dialogues were used to fine-tune the conversation.
Since the underlying language model has only 7 billion parameters, it exhibits the typical limitations of small models in dialogue. Nevertheless, its language capabilities and speed are impressive and hint at what this technology could achieve with larger, more powerful models.
To give Moshi a consistent voice, Kyutai worked with a voice actress named Alice. She recorded monologues and dialogue in a variety of styles, which were then used to train the speech synthesis system.
Kyutai sees huge potential in Moshi to change the way we communicate with machines. The company points to promising applications, especially in accessibility for people with disabilities.