Can You Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features? Apple Machine Learning Research

Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-banks as inputs, and thus, training them on self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised speech features inherently include information required for a downstream speaker verification task, and therefore, we can simplify the downstream model without sacrificing performance. To this end, we revisit the… Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-banks as inputs, and thus, training them on self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised speech features inherently include information required for a downstream speaker verification task, and therefore, we can simplify the downstream model without sacrificing performance. To this end, we revisit the… Read More