On December 15, Lei Jun, founder, chairman, and CEO of Xiaomi, announced that the second Audio Encoder Capability Challenge (AECC), jointly initiated by Xiaomi, the University of Surrey, Tsinghua University, and Haitai Ruisheng, will be held alongside the prestigious international speech conference Interspeech 2026 in September next year. Registration is now officially open.
Lei Jun stated that the competition aims to enhance the efficiency of audio encoders for large audio language models (LALMs) and encouraged interested teams to register.
Interspeech 2026, a top-tier global speech conference, will be held in Sydney, Australia, in September next year, and the second AECC will run concurrently with it.
Large audio language models are advancing rapidly, but most mainstream models rely heavily on a single front-end audio encoder, most often OpenAI's Whisper encoder. This dependence limits architectural diversity and hinders further improvements in LALMs' overall capabilities. To address growing demands for audio comprehension, the challenge will focus on evaluating audio encoders' understanding and feature-representation abilities in complex real-world scenarios.
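To make this dependence concrete, the sketch below shows the pattern most current LALM front-ends follow: audio is converted to Whisper's log-mel input features and passed through the encoder alone, whose hidden states are then projected into the language model. The use of `openai/whisper-small` and the Hugging Face `transformers` API here is purely illustrative and implies nothing about the challenge's requirements.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Illustrative only: any Whisper checkpoint would show the same pattern.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
whisper = WhisperModel.from_pretrained("openai/whisper-small")

# One second of 16 kHz audio; a zero tensor stands in for a real waveform.
waveform = torch.zeros(16000)

# Whisper consumes log-mel spectrograms ("input features") padded to 30 s.
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # Only the encoder is used; an LALM typically projects these hidden
    # states into the language model's embedding space.
    encoder_out = whisper.encoder(inputs.input_features)

print(encoder_out.last_hidden_state.shape)  # (1, 1500, 768) for whisper-small
```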
**1. Competition Overview**

**1.1 Evaluation Method**

The challenge adopts a unified end-to-end training and evaluation framework: participants need only submit pre-trained encoder models, while downstream training and evaluation are handled by the organizers. The open-source evaluation system, XARES-LLM (https://github.com/xiaomi-research/xares-llm), automatically downloads the training data, trains a typical LALM on top of each submitted encoder, evaluates it on the downstream tasks, and reports scores.
Participants are not required to run XARES-LLM themselves; they need only package their audio encoder according to the provided guidelines and email it to the organizers for large-model training and evaluation. That said, since XARES-LLM is open source and runs on a single RTX 4090, participants may also use it to gauge their encoder's performance before submission.
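The actual submission interface is defined by the organizers' packaging guidelines; the wrapper below is only a minimal sketch of the general shape such a package might take. The class name, the `encode` method, and the tensor layout are assumptions made for illustration, not XARES-LLM's API.

```python
# Hypothetical packaging sketch: the real interface is specified in the
# organizers' guidelines, and all names here are assumptions.
import torch
import torch.nn as nn


class SubmittedEncoder(nn.Module):
    """Wraps a pre-trained audio encoder behind a minimal, uniform API."""

    def __init__(self, hidden_dim: int = 768, sample_rate: int = 16000):
        super().__init__()
        self.sample_rate = sample_rate
        self.hidden_dim = hidden_dim
        # Stand-in for a real pre-trained model loaded from a checkpoint.
        self.backbone = nn.Sequential(
            nn.Conv1d(1, hidden_dim, kernel_size=400, stride=320),  # ~20 ms hop
            nn.GELU(),
        )

    @torch.no_grad()
    def encode(self, waveform: torch.Tensor) -> torch.Tensor:
        """Map a mono waveform (batch, samples) to frame-level features
        (batch, frames, hidden_dim) for the evaluator's LALM to consume."""
        feats = self.backbone(waveform.unsqueeze(1))  # (batch, hidden, frames)
        return feats.transpose(1, 2)


if __name__ == "__main__":
    encoder = SubmittedEncoder()
    audio = torch.randn(2, 16000)  # two 1 s clips at 16 kHz
    print(encoder.encode(audio).shape)  # torch.Size([2, 49, 768])
```

Whatever the real interface looks like, the useful property to aim for is a fixed, documented output shape, so the evaluation system can attach its LALM projection layer to the encoder automatically.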
**1.2 Training Data**

Unlike most competitions, this challenge emphasizes both model design and data utilization. No specific training dataset is mandated: participants may use any publicly accessible data, including web-scraped sources, but proprietary data is prohibited. Models may be initialized from open-source pre-trained weights or trained from scratch.
Haitai Ruisheng has provided a supplementary dataset, free for participants, derived from eight commercial datasets (e.g., King-ASR-457, King-ASR-958). It covers diverse environmental noises (e.g., bookstores, gyms, subways, restaurants), household background noises, non-speech interference (e.g., water flow, footsteps), and vehicle-related sounds (e.g., mechanical noise, wind noise). Details: https://dataoceanai.github.io/Interspeech2026-Audio-Encoder-Challenge/King_NonSpeech-Dataset_en_20h.html
**1.3 Tracks**

Two tracks are available:

- **Track A**: focuses on traditional classification tasks.
- **Track B**: evaluates comprehension and expressive capabilities.

Submissions will be assessed in both tracks, with independent rankings.
**2. Registration & Submission**

**2.1 Process**

- Register by January 25, 2026 (AoE): https://docs.google.com/forms/d/1oaTnhh0HVX8K2oRdHKXsnyZfBWb7F6Oj8xZ6yAiMI74/viewform
- Package the encoder code and model files as a zip archive and email them by February 12, 2026 (AoE).
- Submit a technical report (PDF) by February 25, 2026 (AoE); it may also be submitted as an Interspeech conference paper.
**2.2 Contact**

Email: 2026interspeech-aecc@dataoceanai.com
Official site: https://dataoceanai.github.io/Interspeech2026-Audio-Encoder-Challenge/