Thanks to the rapid development of smartphones and tablets, interacting with technology using voice is becoming commonplace. For example, Google offers the ability to search by voice [1] on Android devices, and Apple’s iOS devices are equipped with a conversational assistant named Siri. These products allow a user to tap a device and then speak a query or a command.

We are interested in enabling users to have a fully hands-free experience by developing a system that listens continuously for specific keywords to initiate voice input. This could be especially useful in situations like driving. The proposed system must be highly accurate, low-latency, small-footprint, and able to run in computationally constrained environments such as modern mobile devices. Running the system on the device avoids the latency and power costs of connecting to a server for recognition.

Keyword Spotting (KWS) aims at detecting predefined keywords in an audio stream, and it is a potential technique to provide the desired hands-free interface. There is an extensive literature on KWS, although most of the proposed methods are not suitable for low-latency applications in computationally constrained environments. For example, several KWS systems [2, 3, 4] assume offline processing of the audio using large vocabulary continuous speech recognition (LVCSR) systems to generate rich lattices. In this case, the task focuses on efficient indexing and search for keywords in the lattices. These systems are often used to search large databases of audio content. We focus instead on detecting keywords in the audio stream without any latency.

A commonly used technique for keyword spotting is the Keyword/Filler Hidden Markov Model (HMM) [5, 6, 7, 8, 9]. Despite being initially proposed over two decades ago, it remains highly competitive. In this generative approach, an HMM is trained for each keyword, and a filler HMM is trained from the non-keyword segments of the speech signal (fillers). At runtime, these systems require Viterbi decoding, which can be computationally expensive depending on the HMM topology. Other recent work explores discriminative models for keyword spotting based on large-margin formulation [10, 11] or recurrent neural networks [12, 13]. These systems show improvement over the HMM approach. The large-margin formulation based methods, however, require processing of the entire utterance to find the optimal keyword region, which increases detection latency. We have also been working on recurrent neural networks for keyword spotting, but that is work in progress and will not be discussed in this paper.

We propose a simple discriminative KWS approach based on deep neural networks that is appropriate for mobile devices. We refer to it as Deep KWS. A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s), followed by a posterior handling method that produces a final confidence score. In contrast with the HMM approach, this system does not require a sequence search algorithm (decoding), leading to a significantly simpler implementation, reduced runtime computation, and a smaller memory footprint. It also makes a decision every 10 ms, minimizing latency. We show that the Deep KWS system outperforms a standard HMM-based system on both clean and noisy test sets, even when a smaller amount of data is used for training.

We describe our DNN-based KWS framework in Section 2, and the baseline HMM-based KWS system in Section 3. The experimental setup, results and some discussion follow in Section 4. Section 5 closes with the conclusions.
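
To make the "posterior handling" step more concrete, here is a minimal Python sketch of how per-frame DNN posteriors could be turned into a confidence score without any sequence decoding. The text above does not spell out the method, so everything here is an assumption for illustration: the moving-average smoothing, the geometric-mean combination of the keyword labels, the function names, and the parameter values (`w_smooth`, `w_max`, `threshold`) are all hypothetical. Each frame is assumed to be a list of label posteriors produced once per 10 ms step.

```python
from collections import deque


def smooth_posteriors(frames, w_smooth):
    """Moving average of each label's posterior over the last w_smooth frames.

    Assumed smoothing scheme; `frames` is a list of per-frame posterior lists.
    """
    smoothed = []
    window = deque(maxlen=w_smooth)  # keeps only the most recent frames
    for frame in frames:
        window.append(frame)
        n = len(window)
        smoothed.append([sum(f[i] for f in window) / n
                         for i in range(len(frame))])
    return smoothed


def confidence_scores(smoothed, keyword_labels, w_max):
    """Per-frame confidence (assumed formula): geometric mean, over the
    keyword labels, of each label's maximum smoothed posterior within
    the last w_max frames."""
    scores = []
    for j in range(len(smoothed)):
        start = max(0, j - w_max + 1)
        product = 1.0
        for i in keyword_labels:
            product *= max(smoothed[k][i] for k in range(start, j + 1))
        scores.append(product ** (1.0 / len(keyword_labels)))
    return scores


def detect(frames, keyword_labels, w_smooth=3, w_max=10, threshold=0.5):
    """Return the index of the first frame whose confidence reaches the
    threshold, or None if the keyword is never detected."""
    smoothed = smooth_posteriors(frames, w_smooth)
    scores = confidence_scores(smoothed, keyword_labels, w_max)
    for j, score in enumerate(scores):
        if score >= threshold:
            return j
    return None
```

Because the score at frame `j` depends only on frames up to `j`, a decision can be made at every frame as audio arrives, which fits the low-latency goal described above. In this sketch a detection requires every keyword label to have been active somewhere in the recent window, so a single spurious label alone does not trigger it.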


Post 1, 2018-03-04 15:53