This AI Paper Proposes a Novel Dual-Branch Encoder-Decoder Architecture for Unsupervised Speech Enhancement (SE)

Can a speech enhancer trained only on real noisy recordings cleanly separate speech and noise—without ever seeing paired data? A team of researchers from Brno University of Technology and Johns Hopkins University proposes Unsupervised Speech Enhancement using Data-defined Priors (USE-DDP), a dual-stream encoder–decoder that separates any noisy input into two waveforms—estimated clean speech and residual noise—and learns both solely from unpaired datasets (clean-speech corpus and optional noise corpus). Training enforces that the sum of the two outputs reconstructs the input waveform, avoiding degenerate solutions and aligning the design with neural audio codec objectives.

Paper: https://arxiv.org/pdf/2509.22942
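
To make the decomposition concrete, here is a minimal sketch of the reconstruction constraint in PyTorch; the tensor names and the random stand-ins for the model outputs are our illustration, not code from the paper:

```python
import torch

# Toy illustration of the consistency constraint: the two generator outputs
# must sum back to the input mixture after least-squares rescaling.
x = torch.randn(16000)       # noisy input waveform (1 s at 16 kHz)
s_hat = torch.randn(16000)   # stand-in for the estimated clean speech
n_hat = torch.randn(16000)   # stand-in for the estimated residual noise

# Scalars alpha, beta minimizing ||x - (alpha*s_hat + beta*n_hat)||^2.
A = torch.stack([s_hat, n_hat], dim=1)                  # (samples, 2)
coeffs = torch.linalg.lstsq(A, x.unsqueeze(1)).solution
alpha, beta = coeffs[0, 0], coeffs[1, 0]

x_hat = alpha * s_hat + beta * n_hat                    # reconstruction fed to the losses
```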

Why this is important

Most learning-based speech enhancement pipelines depend on paired clean–noisy recordings, which are expensive or impossible to collect at scale in real-world conditions. Unsupervised routes such as MetricGAN-U remove the need for clean data but couple model performance to the external, non-intrusive metrics used during training. USE-DDP keeps the training signal data-only: priors are imposed by discriminators trained on independent clean-speech and noise corpora, and a reconstruction-consistency constraint ties both estimates back to the observed mixture.

How it works

  • Generator: A codec-style encoder compresses the input audio into a latent sequence, which is split into two parallel transformer branches (RoFormer) targeting clean speech and noise respectively; a shared decoder maps both latents back to waveforms. The input is reconstructed as the least-squares combination of the two outputs, with scalars α and β compensating for amplitude errors (computed as in the snippet above). Reconstruction uses multi-scale mel/STFT and SI-SDR losses, as in neural audio codecs; a generator sketch follows this list.
  • Priors via adversaries: Three discriminator ensembles (clean, noise, and noisy) impose distributional constraints: the clean branch must resemble the clean-speech corpus, the noise branch must resemble a noise corpus, and the reconstructed mixture must sound like natural noisy audio. LS-GAN and feature-matching losses are used; a loss sketch follows after the generator sketch below.
  • Initialization: Initializing encoder/decoder from a pretrained Descript Audio Codec improves convergence and final quality vs. training from scratch.
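
As a rough illustration of the generator, the following is a minimal PyTorch sketch. The toy codec modules, layer sizes, and the plain Transformer standing in for the RoFormer branches are all our assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for a codec encoder (the paper initializes from a pretrained DAC)."""
    def __init__(self, d_model: int = 512, hop: int = 320):
        super().__init__()
        self.conv = nn.Conv1d(1, d_model, kernel_size=hop, stride=hop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x).transpose(1, 2)    # (B, 1, T) -> (B, frames, D)

class ToyDecoder(nn.Module):
    """Stand-in for the decoder shared by both branches."""
    def __init__(self, d_model: int = 512, hop: int = 320):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(d_model, 1, kernel_size=hop, stride=hop)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.deconv(z.transpose(1, 2))  # (B, frames, D) -> (B, 1, T)

class DualBranchEnhancer(nn.Module):
    """Dual-branch generator: shared encoder/decoder, two parallel latent branches."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module,
                 d_model: int = 512, n_layers: int = 4):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder

        def branch() -> nn.Module:  # plain Transformer standing in for RoFormer
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=n_layers)

        self.speech_branch, self.noise_branch = branch(), branch()

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)                          # latent sequence from the mixture
        s_hat = self.decoder(self.speech_branch(z))  # estimated clean speech
        n_hat = self.decoder(self.noise_branch(z))   # estimated residual noise
        return s_hat, n_hat

# Smoke test: two 1-second utterances at 16 kHz in, two waveforms out.
model = DualBranchEnhancer(ToyEncoder(), ToyDecoder())
s_hat, n_hat = model(torch.randn(2, 1, 16000))
```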
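The training signals can then be sketched as below, assuming the standard SI-SDR and LS-GAN formulations; the discriminator architectures, feature-matching terms, and multi-scale mel/STFT losses used in the paper are omitted here:

```python
import torch

def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SDR (standard formulation), minimized during training."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    scale = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = scale * ref                    # projection of the estimate onto the reference
    residual = est - target
    si_sdr = 10 * torch.log10(target.pow(2).sum(-1) / (residual.pow(2).sum(-1) + eps) + eps)
    return -si_sdr.mean()

def lsgan_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """LS-GAN discriminator loss: push real outputs to 1, generated outputs to 0."""
    return ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()

def lsgan_g_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """LS-GAN generator loss: make the discriminator score generated audio as real."""
    return ((d_fake - 1) ** 2).mean()

# A generator step would combine (with weights as hyperparameters):
#   si_sdr_loss(x_hat, x)           # reconstruction: clean + noise must match the input
#   lsgan_g_loss(D_clean(s_hat))    # clean branch must resemble the clean-speech corpus
#   lsgan_g_loss(D_noise(n_hat))    # noise branch must resemble the noise corpus
#   lsgan_g_loss(D_noisy(x_hat))    # reconstructed mixture must sound like real noisy audio
```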

How it compares

On the standard VCTK+DEMAND simulated setup, USE-DDP reports parity with the strongest unsupervised baselines (e.g., unSE/unSE+ based on optimal transport) and competitive DNSMOS vs. MetricGAN-U (which directly optimizes DNSMOS). Example numbers from the paper’s Table 1: DNSMOS improves from 2.54 (noisy input) to ~3.03 with USE-DDP, and PESQ from 1.97 to ~2.47; CBAK trails some baselines because the explicit noise prior encourages more aggressive noise attenuation in non-speech segments.


Data choice is not a detail—it’s the result

A central finding: which clean-speech corpus defines the prior can swing outcomes and even create over-optimistic results on simulated tests.

  • In-domain prior (VCTK clean) on VCTK+DEMAND → best scores (DNSMOS ≈3.03), but this configuration unrealistically “peeks” at the target distribution used to synthesize the mixtures.
  • Out-of-domain prior → notably lower metrics (e.g., PESQ ~2.04), reflecting distribution mismatch and some noise leakage into the clean branch.
  • Real-world CHiME-3: using a “close-talk” channel as in-domain clean prior actually hurts—because the “clean” reference itself contains environment bleed; an out-of-domain truly clean corpus yields higher DNSMOS/UTMOS on both dev and test, albeit with some intelligibility trade-off under stronger suppression.

This clarifies discrepancies across prior unsupervised results and argues for careful, transparent prior selection when claiming SOTA on simulated benchmarks.

The proposed dual-branch encoder-decoder architecture treats enhancement as explicit two-source estimation with data-defined priors, not metric-chasing. The reconstruction constraint (clean + noise = input) plus adversarial priors over independent clean/noise corpora gives a clear inductive bias, and initializing from a neural audio codec is a pragmatic way to stabilize training. The results look competitive with unsupervised baselines while avoiding DNSMOS-guided objectives; the caveat is that “clean prior” choice materially affects reported gains, so claims should specify corpus selection.

