Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models | Apple Machine Learning Research

*=Equal Contributors
This paper was accepted at the Efficient Natural Language and Speech Processing workshop at NeurIPS 2023.
Interactions with virtual assistants often begin with a predefined trigger phrase followed by the user command. To make interactions with the assistant more natural, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We address this task by combining the decoder signals of an automatic speech recognition (ASR) system with acoustic and lexical representations as input features to a large language model…
