Our application requires a keyword spotting system with a small memory footprint, low computational cost, and high precision. To meet these requirements, we propose a simple approach based on deep neural networks. A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s), followed by a posterior handling method producing a final confidence score. Keyword recognition results achieve 45% relative improvement with respect to a competitive Hidden Markov Model-based system, while performance in the presence of babble noise shows 39% relative improvement.

Index Terms— Deep Neural Network, Keyword Spotting, Embedded Speech Recognition

* The author performed the work as a summer intern at Google, MTV.

1. INTRODUCTION

Thanks to the rapid development of smartphones and tablets, interacting with technology using voice is becoming commonplace. For example, Google offers the ability to search by voice [1] on Android devices, and Apple's iOS devices are equipped with a conversational assistant named Siri. These products allow a user to tap a device and then speak a query or a command.

We are interested in enabling users to have a fully hands-free experience by developing a system that listens continuously for specific keywords to initiate voice input. This could be especially useful in situations like driving. The proposed system must be highly accurate, low-latency, small-footprint, and run in computationally constrained environments such as modern mobile devices. Running the system on the device avoids the latency and power implications of connecting to a server for recognition.

Keyword Spotting (KWS) aims at detecting predefined keywords in an audio stream, and it is a potential technique to provide the desired hands-free interface. There is an extensive literature on KWS, although most of the proposed methods are not suitable for low-latency applications in computationally constrained environments. For example, several KWS systems [2, 3, 4] assume offline processing of the audio using large vocabulary continuous speech recognition (LVCSR) systems to generate rich lattices. In this case, their task focuses on efficient indexing and search for keywords in the lattices. These systems are often used to search large databases of audio content. We focus instead on detecting keywords in the audio stream without any latency.

A commonly used technique for keyword spotting is the Keyword/Filler Hidden Markov Model (HMM) [5, 6, 7, 8, 9]. Despite being initially proposed over two decades ago, it remains highly competitive. In this generative approach, an HMM model is trained for each keyword, and a filler HMM model is trained from the non-keyword segments of the speech signal (fillers). At runtime, these systems require Viterbi decoding, which can be computationally expensive depending on the HMM topology. Other recent work explores discriminative models for keyword spotting based on large-margin formulations [10, 11] or recurrent neural networks [12, 13]. These systems show improvement over the HMM approach. The large-margin methods, however, require processing of the entire utterance to find the optimal keyword region, which increases detection latency. We have also been working on recurrent neural networks for keyword spotting, but that work is still in progress and will not be discussed in this paper.

We propose a simple discriminative KWS approach based on deep neural networks that is appropriate for mobile devices. We refer to it as Deep KWS.
A deep neural network is trained to directly predict the keyword(s) or subword units of the keyword(s), followed by a posterior handling method producing a final confidence score. In contrast with the HMM approach, this system does not require a sequence search algorithm (decoding), leading to a significantly simpler implementation, reduced runtime computation, and a smaller memory footprint. It also makes a decision every 10 ms, minimizing latency. We show that the Deep KWS system outperforms a standard HMM-based system on both clean and noisy test sets, even when a smaller amount of data is used for training.

We describe our DNN-based KWS framework in Section 2, and the baseline HMM-based KWS system in Section 3. The experimental setup, results, and some discussion follow in Section 4. Section 5 closes with the conclusions.

2. DEEP KWS SYSTEM

The proposed Deep KWS framework is illustrated in Figure 1. The framework consists of three major components: (i) a feature extraction module, (ii) a deep neural network, and (iii) a posterior handling module. The feature extraction module (i) performs voice-activity detection and generates a vector of features every frame (10 ms). These features are stacked using the left and right context to create a larger vector, which is fed as input to the DNN (Section 2.1). We train a DNN (ii) to predict posterior probabilities for each output label from the stacked features. These labels can correspond to entire words or sub-words of the keywords (Section 2.2). Finally, a simple posterior handling module (iii) combines the label posteriors produced every frame into a confidence score used for detection (Section 2.3).

Fig. 1. Framework of Deep KWS system, components from left to right: (i) Feature Extraction, (ii) Deep Neural Network, (iii) Posterior Handling

In the example of Figure 1, the audio contains the key-phrase "okay google". The DNN in this case has only 3 output labels: "okay", "google", and "filler", and it generates the frame-level posterior scores shown in (iii). The posterior handling module combines these scores to provide a final confidence score for that window.

2.1. Feature Extraction

The feature extraction module is common to our proposed Deep KWS system and the baseline HMM system.

To reduce computation, we use a voice-activity detection system and only run the KWS algorithm in voice regions. The voice-activity detector, described in [14], uses 13-dimensional PLP features and their deltas and double-deltas as input to a trained 30-component diagonal covariance GMM, which generates speech and non-speech posteriors at every frame. This is followed by a hand-tuned state machine (SM), which performs temporal smoothing by identifying regions where many frame speech posteriors exceed a threshold.

For the speech regions, we generate acoustic features based on 40-dimensional log-filterbank energies computed every 10 ms over a window of 25 ms. Contiguous frames are stacked to add sufficient left and right context. The input window is asymmetric since each additional frame of future context adds 10 ms of latency to the system. For our Deep KWS system, we use 10 future frames and 30 frames in the past. For the HMM baseline system we use 5 future frames and 10 frames in the past, as this provided the best trade-off between accuracy, latency, and computation [15].
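To make the input construction concrete, here is a minimal Python/NumPy sketch of the frame stacking described above. The paper does not specify how context is handled at utterance boundaries, so clamping indices at the edges is our assumption, as are the function and variable names. With 30 past and 10 future frames of 40-dimensional features, each DNN input vector has (30 + 1 + 10) x 40 = 1640 dimensions.

import numpy as np

def stack_frames(feats, left=30, right=10):
    """Stack each frame with its left/right context (a sketch, not the
    production frontend). `feats` is (num_frames, 40) log-filterbank
    energies; context indices are clamped at utterance edges (our
    assumption -- boundary handling is unspecified in the paper)."""
    num_frames, dim = feats.shape
    stacked = np.empty((num_frames, (left + 1 + right) * dim), dtype=feats.dtype)
    for j in range(num_frames):
        idx = np.clip(np.arange(j - left, j + right + 1), 0, num_frames - 1)
        stacked[j] = feats[idx].reshape(-1)  # concatenate the context frames
    return stacked

For the HMM baseline frontend, the same sketch would be called with left=10 and right=5.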
2.2. Deep Neural Network

The deep neural network model is a standard feed-forward fully connected neural network with k hidden layers and n hidden nodes per layer, each node computing a non-linear function of the weighted sum of the outputs of the previous layer. The last layer has a softmax which outputs an estimate of the posterior of each output label. For the hidden layers, we have experimented with conventional logistic and rectified linear unit (ReLU) functions [16], and consistently found that ReLU outperforms logistic on our development set, while reducing computation. We present results with ReLU activations only. The size of the network is also dictated by the number of output labels. In the following sub-sections we describe in detail the label generation and training for our neural network. We also describe a learning technique that further improves the KWS performance.

Labeling. For our baseline HMM system, as in previous work [8, 9, 17], the labels in the output layer of the neural network are context-dependent HMM states. More specifically, the baseline system uses 2002 context-dependent states selected as described in [15]. For the proposed Deep KWS, the labels can represent entire words or sub-word units in the keyword/key-phrase. We report results with full word labels, as these outperform sub-word units. These labels are generated at training time via forced alignment using our 50M parameter LVCSR system [18]. Using entire word labels as output for the network, instead of the HMM states, has several advantages: (i) a smaller inventory of output labels reduces the number of neural network parameters in the last layer, which is computationally expensive; (ii) a simple posterior handling method can be used to make a decision (as explained in Section 2.3); (iii) whole word models achieve better performance, assuming the training data is adequate for each word label considered.

Training. Suppose p_{ij} is the neural network posterior for the i-th label and the j-th frame x_j (see Section 2.1), where i takes values in 0, 1, ..., n-1, with n the total number of labels and 0 the label for non-keyword. The weights and biases of the deep neural network, \theta, are estimated by minimizing the cross-entropy training criterion over the labeled training data {x_j, i_j}_j (previous paragraph):

    F(\theta) = -\sum_j \log p_{i_j j} .    (1)

The optimization is performed with the software framework DistBelief [19, 20], which supports distributed computation on multiple CPUs for deep neural networks. We use asynchronous stochastic gradient descent with an exponential decay for the learning rate.

Transfer learning. Transfer learning refers to the situation where (some of) the network parameters are initialized with the corresponding parameters of an existing network, and are not trained from scratch [21, 22]. Here, we use a deep neural network for speech recognition with a suitable topology to initialize the hidden layers of the network. All layers are updated in training. Transfer learning has the potential advantage that the hidden layers can learn a better and more robust feature representation by exploiting larger amounts of data and avoiding bad local optima [21]. In our experiments we find this to be the case.
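To fix ideas, the following NumPy sketch shows the forward computation and the per-frame term of the criterion in Eq. (1). It is an illustration under our own naming and data-layout assumptions, not the DistBelief implementation; the distributed asynchronous SGD training loop is omitted.

import numpy as np

def dnn_posteriors(x, weights, biases):
    """Feed-forward pass: k ReLU hidden layers, softmax output.
    `weights`/`biases` list the hidden layers followed by the output
    layer; returns the posteriors p_ij over the n labels for one
    stacked input frame x."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, W @ h + b)   # ReLU non-linearity
    logits = weights[-1] @ h + biases[-1]
    logits -= logits.max()               # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum()                   # softmax posteriors

def frame_cross_entropy(posteriors, label):
    """Per-frame term of Eq. (1): -log p_{i_j j} for the aligned label i_j."""
    return -np.log(posteriors[label])

Under transfer learning, the entries of `weights` and `biases` for the hidden layers would simply be copied from the existing speech recognition network before training continues on all layers.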
2.3. Posterior Handling

The DNN explained in Section 2.2 produces frame-level label posteriors. In this section we discuss our proposed simple, yet effective, approach to combine DNN posteriors into keyword/key-phrase confidence scores. A decision is then made if the confidence exceeds some predefined threshold. We describe the confidence computation assuming a single keyword; however, it can be easily modified to detect multiple keywords simultaneously.

Posterior smoothing. Raw posteriors from the neural network are noisy, so we smooth the posteriors over a fixed time window of size w_{smooth}. Let p'_{ij} denote the smoothed version of the posterior p_{ij} defined in Section 2.2. The smoothing is done with the following formula:

    p'_{ij} = \frac{1}{j - h_{smooth} + 1} \sum_{k = h_{smooth}}^{j} p_{ik}    (2)

where h_{smooth} = \max\{1, j - w_{smooth} + 1\} is the index of the first frame within the smoothing window.

Confidence. The confidence score at the j-th frame is computed within a sliding window of size w_{max}, as follows:

    confidence = \left( \prod_{i=1}^{n-1} \max_{h_{max} \le k \le j} p'_{ik} \right)^{1/(n-1)}    (3)

where p'_{ij} is the smoothed posterior of Eq. (2), and h_{max} = \max\{1, j - w_{max} + 1\} is the index of the first frame within the sliding window. We use w_{smooth} = 30 frames and w_{max} = 100 frames, as this gives the best performance on the development set; the performance, however, is not very sensitive to the window sizes. Although Eq. (3) does not enforce the order of the label sequence, we do not bother enforcing it because the stacked frames fed as input to the neural network help encode contextual information.
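The posterior handling translates almost directly into code. The NumPy sketch below implements Eqs. (2) and (3) with 0-based frame indices, so the first frame of a window becomes h = max{0, j - w + 1}; the array layout (one row of n posteriors per frame, label 0 = filler) and the function names are our assumptions.

import numpy as np

def smooth_posteriors(p, w_smooth=30):
    """Eq. (2): average the raw posteriors p (num_frames x n) over a
    trailing window of w_smooth frames."""
    num_frames, _ = p.shape
    p_smooth = np.empty_like(p)
    for j in range(num_frames):
        h = max(0, j - w_smooth + 1)      # 0-based h_smooth
        p_smooth[j] = p[h:j + 1].mean(axis=0)
    return p_smooth

def confidence(p_smooth, j, w_max=100):
    """Eq. (3): geometric mean, over the n-1 keyword labels, of each
    label's largest smoothed posterior within the sliding window."""
    h = max(0, j - w_max + 1)             # 0-based h_max
    window = p_smooth[h:j + 1, 1:]        # drop column 0 (filler label)
    n_minus_1 = window.shape[1]
    return float(np.prod(window.max(axis=0)) ** (1.0 / n_minus_1))

A keyword then fires at frame j whenever confidence(smooth_posteriors(p), j) exceeds the predefined threshold.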
3. BASELINE HMM KWS SYSTEM

We implement a standard Keyword-Filler Hidden Markov Model as our baseline. The basic idea is to create an HMM for the keyword and an HMM to represent all non-keyword segments of the speech signal (filler model). There are several choices for the filler model, from fully connected phonetic units [6] to a full LVCSR system whose lexicon excludes the keyword [23]. The latter approach yields a better filler model, but it requires a higher computational cost at runtime and a significantly larger memory footprint. Given the constraints of our application, we implemented a triphone-based HMM model as filler. In contrast to previous work [6, 23], our implementation uses a Deep Neural Network to compute the HMM state densities.

The Keyword-Filler HMM topology is shown in Figure 2. Keyword detection is achieved by running Viterbi decoding with this topology and checking whether the best path passes through the Keyword HMM or not. The trade-off between false alarms (a keyword is not present but the KWS system gives a positive decision) and false rejects (a keyword is present but the KWS system gives a negative decision) is controlled by the transition probability between the keyword and filler models: a high transition probability leads to a high false alarm rate, and vice versa.

An important advantage of the Keyword-Filler model is that it does not require keyword-specific data at training time. It simply learns a generative model for all triphone HMM states through likelihood maximization on general speech data. Knowledge of the keyword can be introduced only at runtime, by specifying the keyword in the decoder graph. However, if keyword-specific data is available for training, one can improve system performance using transfer learning (Section 2.2), i.e., by initializing the acoustic model network with a network trained on the general speech data and then continuing to train it using the keyword-specific data.

Fig. 2. HMM topology for KWS system, which consists of a keyword model (a chain of HMMs between the start and end nodes) and a triphone filler model

4. EXPERIMENTAL RESULTS

Experiments are performed on a data set which combines real voice search queries as negative examples and phrases including the keywords, sometimes followed by queries, as positive examples. A full list of the keywords evaluated is shown in Table 1. We train a separate Deep KWS and build a separate Keyword-Filler HMM KWS system for each key-phrase. Results are presented in the form of modified receiver operating characteristic (ROC) curves, where we replace the true positive rate with the false reject (FR) rate on the Y-axis; lower curves are better. The ROC for the baseline system is obtained by sweeping the transition probability for the Keyword HMM path in Figure 2. For the Deep KWS system, the ROC is obtained by sweeping the confidence threshold. We generate a curve for each keyword and average the curves vertically (at fixed FA rates) over all keywords tested. Detailed comparison is given at 0.5% FA rate, which is a typical operating point for practical applications.
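For concreteness, vertical averaging can be done by resampling each keyword's curve onto a common grid of FA rates and averaging the FR rates pointwise, as in this short sketch; the data format and function name are illustrative assumptions, not from the paper.

import numpy as np

def average_roc(per_keyword_rocs, fa_grid):
    """Vertically average ROC curves: interpolate each keyword's
    false-reject rate onto a shared false-alarm grid, then average.
    `per_keyword_rocs` is a list of (fa_rates, fr_rates) pairs, each
    sorted by increasing FA rate."""
    curves = [np.interp(fa_grid, fa, fr) for fa, fr in per_keyword_rocs]
    return np.mean(curves, axis=0)

# Reading off the operating point used below, e.g. 0.5% FA:
# fa_grid = np.linspace(0.0, 0.05, 501)   # 0% .. 5% FA in 0.01% steps
# avg_fr_at_0p5 = average_roc(rocs, fa_grid)[50]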
Table 1. Keywords used in evaluation

    answer call     dismiss alarm
    go back         ok google
    read aloud      record a video
    reject call     show more commands
    snooze alarm    take a picture

We compare the Deep KWS system and the HMM system with different sizes of neural networks (Section 4.3), evaluate the effect of transfer learning for both systems (Section 4.2), and show performance changes in the presence of babble noise (Section 4.4).

4.1. Data

We use two sets of training data. The first set is a general speech corpus, which consists of 3,000 hours of manually transcribed utterances (referred to as VS data). The second set is keyword-specific data (referred to as KW data), which includes around 2.3K training examples for each keyword, and 133K negative examples comprised of anonymized voice search queries or other short phrases. For the keyword "okay google", 40K positive examples are available for training.

The evaluation set contains roughly 1K positive examples for each keyword and 70K negative examples, giving a positive-to-negative ratio of 1.4% to match expected application usage. Again, for the keyword "okay google" we use instead 2.2K positive examples. The noisy test set is generated by adding babble noise to this test set at a 10 dB signal-to-noise ratio (SNR). Finally, we use a similarly sized, non-overlapping set of positive and negative examples as a development set to tune the decoder parameters and the DNN input window size parameters.

4.2. Results

We first evaluate the performance of the smaller neural networks trained for the baseline HMM and the Deep KWS systems. Both systems used the frontend described in Section 2.1, and both used a network with 3 hidden layers and 128 hidden nodes per layer with ReLU non-linearity. However, the number of parameters of the two networks is not identical. The DNN acoustic model used for the baseline HMM system uses an input window of 10 left frames and 5 right frames and outputs 2,002 HMM states, resulting in around 373K parameters. The Deep KWS uses instead 30 left frames and 10 right frames, but only produces word labels, reducing the output label inventory to 3 or 4 depending on the key-phrase evaluated. The total number of parameters for Deep KWS is no larger than 244K.

Figure 3 shows the performance of both systems. Baseline 3x128 (VS) refers to the HMM system with a DNN acoustic model trained on the voice search corpus. Baseline 3x128 (VS + KW) is this same system after adapting the DNN acoustic model using keyword-specific data. Deep 3x128 (KW) refers to the proposed Deep KWS system trained on keyword-specific data. Finally, Deep 3x128 (VS + KW) shows the performance when we initialize the Deep 3x128 KW network with a network trained on VS data, as explained in Section 2.2.

Fig. 3. HMM vs. Deep KWS system with a 3-hidden-layer, 128-hidden-node neural network

It is clear from the results that the proposed Deep KWS outperforms the baseline HMM KWS system even when it is trained with less data and has a smaller number of parameters; for example, see Deep 3x128 (KW) vs. Baseline 3x128 (VS + KW) in Figure 3. The gains are larger at very low false alarm rates, which is a desirable operating point for our application. At 0.5% FA rate, the Deep 3x128 (VS + KW) system achieves 45% relative improvement with respect to Baseline 3x128 (VS + KW). Training a network on the KW data takes only a couple of hours, while training it on VS + KW takes about a week using the DistBelief framework described in Section 2.2.

4.3. Model Size

Figure 4 presents the performance when evaluating both systems with a 6x512 network. In this case the number of parameters of the baseline increases to 2.6M, while that of the Deep model reaches 2.1M. The Deep 6x512 (KW) system actually performs worse than the smaller 3x128 models; we conjecture this is due to not having enough KW data to train the larger number of parameters. However, when both systems are trained on VS + KW data, we observe a consistent improvement with respect to their corresponding 3x128 systems. Here again, the Deep KWS system has superior performance to the baseline.

Fig. 4. HMM vs. Deep KWS system with a 6-hidden-layer, 512-hidden-node neural network

4.4. Noise Robustness

We also test the same models on a noisy test set, generated by adding babble noise to the original test set at a 10 dB SNR. Comparing Baseline 3x128 (VS + KW) in Figure 3 and Figure 5, at 0.5% FA rate the FR rate of the HMM system doubles from 5% to 10%. The Deep KWS system suffers a similar degradation; however, it still achieves 39% relative improvement with respect to the baseline.

Fig. 5. HMM vs. Deep KWS system with a 3-hidden-layer, 128-hidden-node neural network on NOISY data

Fig. 6. HMM vs. Deep KWS system with a 6-hidden-layer, 512-hidden-node neural network on NOISY data

5. CONCLUSION

We have presented a new deep neural network based framework for keyword spotting. Experimental results show that the proposed framework outperforms the standard HMM based system in both clean and noisy conditions. We further demonstrate that a Deep KWS model trained with only the KW data yields better search performance than the baseline HMM KWS system trained with both KW and VS data. The Deep KWS system also leads to a simpler implementation by removing the need for a decoder, as well as reduced runtime computation and a smaller model, and is thus favored for our embedded application.

Since the detection application we are working on only requires a real-time YES/NO decision, the proposed framework in this work does not model the keyword ending time explicitly. We will extend the proposed method to model keyword boundaries in future work.