Abstract:
Substantial improvements in automatic speech recognition performance have been realized through supervised fine-tuning
after self-supervised pre-training of a speech foundation model. The large size of foundation models, along with their varied
losses and objective functions, makes it impractical to obtain optimal results with these models, and fine-tuning each model
independently for several downstream tasks is prohibitively expensive. We propose a methodology with three phases: feature
extraction, feature fusion, and prediction. During feature extraction, several pre-trained models, each with different losses
and objective functions, are used to derive representations. Then, during feature fusion, a co-attentional fusion mechanism
lets the network adaptively weight different fusion operations to learn representations that are common across
models. Finally, a connectionist temporal classification (CTC) layer generates the transcription predictions.
Moreover, the proposed self-supervised feature-fusion transformer block (SSF-FT), incorporating inter-model techniques,
effectively captures both shared and distinctive information across all fused representations. We conducted an interpretability study
in high-resource (English) and low-resource (Congolese) scenarios. In both settings, we observe that features performing well
with shallow ensemble methods also perform well with attention-weighted soft mixing. Experimental results show that
our approach complements existing ensemble techniques, with particular improvements in acoustically
challenging and low-resource scenarios.
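
As a rough illustration of the fusion and prediction phases described above, the following minimal Python sketch shows attention-weighted soft mixing of per-model features for a single frame, followed by greedy CTC decoding. It is not the SSF-FT block itself. The function names (soft_mix, ctc_greedy_decode), the single query vector used to score each model, and the toy inputs are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_mix(frames_per_model, query):
    """Attention-weighted soft mixing for one time step.

    frames_per_model: one feature vector per pre-trained model (same length).
    query: hypothetical vector used to score each model's features;
           in a trained system this would be learned.
    Returns the fused vector and the per-model attention weights.
    """
    dim = len(query)
    # Scaled dot-product score between the query and each model's features.
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in frames_per_model
    ]
    weights = softmax(scores)
    # Convex combination of the model features, one dimension at a time.
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, frames_per_model))
        for d in range(dim)
    ]
    return fused, weights


def ctc_greedy_decode(frame_scores, vocab, blank=0):
    """Greedy CTC decoding: take the best label per frame, merge
    consecutive repeats, then drop blanks."""
    best = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    out = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)


if __name__ == "__main__":
    # Toy features from two hypothetical models for a single frame.
    fused, weights = soft_mix([[1.0, 0.0], [0.0, 1.0]], query=[1.0, 0.0])
    print(fused, weights)
```

The mixing weights always sum to one, so the fused vector stays within the range spanned by the individual model features. A learned fusion block would replace the fixed query with trained parameters and operate on full sequences rather than single frames.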