4

Kaldi-nnet3-上下文和egs

 2 years ago
source link: http://placebokkk.github.io/kaldi-nnet3/2020/03/15/asr-kaldi-nnet3-context.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Kaldi-nnet3-上下文和egs

Mar 15, 2020

model-context

t时刻的输出label依赖的输入特征的上下文宽度叫做model-context,包括左右的model-context

  • model-left-context(也叫net-left-context, 有时在也称为left-context,需要根据语境判断)
  • model-right-context(也叫net-right-context,有时在也称为right-context,需要根据语境判断)

extra-context

kaldi中还有一个概念是extra-left-context和extra-right-context,这个是用于recurrent网络的recurrent计算, 需要多少context计算得到recurrent的输入。 对于Bi-RNN,还有右侧的recurrent输入,所以需要右侧的上下文。

网络上下文的计算

## tdnn.network.xconfig
  input dim=40 name=input
  relu-batchnorm-layer name=tdnn dim=64 input=Append(-2,0,2)
  fast-lstm-layer name=lstm cell-dim=32 decay-time=20 delay=-3
  output-layer name=output input=lstm dim=10 max-change=1.5

查看其左右上下文大小

## steps/nnet3/xconfig_to_configs.py --xconfig-file tdnn.network.xconfig --config-dir ./
## nnet3-info ref.raw
left-context: 2
right-context: 2
num-parameters: 20586
modulus: 1

各层参数,其中+1是affine层的bias项。相加等于20586

tdnn affine (40 * 3 +1)* 64 
lstm affine ((64+32)+1) * 32 * 4 
lstm nonlin 32* 3
softmax affine (32+1)*10 

看一个更复杂的网络的上下文结果,比如我们使用wsj的chain中的tdnn模型。

    input dim=100 name=ivector
    input dim=40 name=input

    idct-layer name=idct input=input dim=40 cepstral-lifter=22 affine-transform-file=$dir/configs/idct.mat
    delta-layer name=delta input=idct
    no-op-component name=input2 input=Append(delta, Scale(1.0, ReplaceIndex(ivector, t, 0)))

    # the first splicing is moved before the lda layer, so no splicing here
    relu-batchnorm-layer name=tdnn1 $tdnn_opts dim=1024 input=input2
    tdnnf-layer name=tdnnf2 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=1
    tdnnf-layer name=tdnnf3 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=1
    tdnnf-layer name=tdnnf4 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=1
    tdnnf-layer name=tdnnf5 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=0
    tdnnf-layer name=tdnnf6 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=3
    tdnnf-layer name=tdnnf7 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=3
    tdnnf-layer name=tdnnf8 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=3
    tdnnf-layer name=tdnnf9 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=3
    tdnnf-layer name=tdnnf10 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=3
    tdnnf-layer name=tdnnf11 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=3
    tdnnf-layer name=tdnnf12 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=3
    tdnnf-layer name=tdnnf13 $tdnnf_opts dim=1024 bottleneck-dim=128 time-stride=3
    linear-component name=prefinal-l dim=192 $linear_opts

    prefinal-layer name=prefinal-chain input=prefinal-l $prefinal_opts big-dim=1024 small-dim=192
    output-layer name=output include-log-softmax=false dim=$num_targets $output_opts

    prefinal-layer name=prefinal-xent input=prefinal-l $prefinal_opts big-dim=1024 small-dim=192
    output-layer name=output-xent dim=$num_targets learning-rate-factor=$learning_rate_factor $output_opts

查看其左右上下文大小

## steps/nnet3/xconfig_to_configs.py --xconfig-file tdnn-lstm.xconfig --config-dir ./
## nnet3-info wsj/configs/ref.raw
left-context: 29
right-context: 29
num-parameters: 8642592

其中tdnnf的time-stride表示左右上下文,累加起来1*3 + 3*8 = 27.那为何网络左右上下文是29呢,因为delta层也会引入左右上下文依赖。delta-delta时左右上下文都为2. 因此left-context=29,right-context=29.

delta(t) = f(t+1) - f(t-1) 
delta-delta(t)  = f(t-2) - 2f(t) + f(t+2)  
/steps/nnet3/chain/get_egs.sh
/steps/nnet3/get_egs.sh

nnet3需要提前生成好训练样本,生成的训练样本叫做egs(examples),会根据模型的上下文生成对应egs,一旦模型上下文变了, egs也需要重新生成。

每个example是一个长度固定chunk,即一组连续的label和其根据上下文需要的输入特征。

下面的是一个模型上下文为左侧16,右侧12对应的egs,01.wav这个音频被切分成了很多chunk,-87表示第87个chunk.

01.wav-87 
<Nnet3Eg> 
<NumIo> 2 
    <NnetIo> input 
        <I1V> 36 
        <I1> 0 -16 0 <I1> 0 -15 0 <I1> 0 -14 0 <I1> 0 -13 0 <I1> 0 -12 0 <I1> 0 -11 0 <I1> 0 -10 0 <I1> 0 -9 0 <I1> 0 -8 0 <I1> 0 -7 0 <I1> 0 -6 0 <I1> 0 -5 0 <I1> 0 -4 0 <I1> 0 -3 0 <I1> 0 -2 0 <I1> 0 -1 0 <I1> 0 0 0 <I1> 0 1 0 <I1> 0 2 0 <I1> 0 3 0 <I1> 0 4 0 <I1> 0 5 0 <I1> 0 6 0 <I1> 0 7 0 <I1> 0 8 0 <I1> 0 9 0 <I1> 0 10 0 <I1> 0 11 0 <I1> 0 12 0 <I1> 0 13 0 <I1> 0 14 0 <I1> 0 15 0 <I1> 0 16 0 <I1> 0 17 0 <I1> 0 18 0 <I1> 0 19 0
        [
        97.30364 -3.910992 -30.80304 -3.48616 -66.96941 -29.00227 -88.99083 21.78121 -62.15462 -4.151295 -35.50327 -4.110187 -18.79502 70.5752 -1.083632 38.90316 -15.14383 -4.67984 3.844811 0.7083166 -2.214117 2.550018 1.870166 1.769193 -1.508338 0.2067747 -6.741038 5.639647 -10.63978 10.62695 -7.235626 16.03898 -12.22836 3.21749 -4.498186 1.417891 2.665665 -1.110982 -5.823317 1.687254
        99.10075 2.414833 -38.25296 -1.217627 -71.97943 -41.72618 -68.26828 18.00008 -59.13413 -6.729376 -18.06305 -11.4845 -19.0238 69.58968 4.329955 42.28728 2.413088 -11.88971 -5.693924 5.658363 -0.2441978 4.228565 0.1156468 2.205399 -3.323665 -0.6477451 1.101045 5.354781 -3.874245 4.511787 -4.556805 9.047355 -13.33412 7.345819 -1.971818 -0.5560644 -3.297477 -3.503572 -4.566537 2.10817
        99.55003 9.926746 -53.82625 11.6374 -71.47052 -26.45749 -52.59174 5.153442 -57.12048 -26.77036 0.3643494 -47.99191 -0.248088 63.67664 -3.669809 49.73235 1.612537 -23.90616 -8.469775 -3.113039 -0.4495978 5.011887 0.7205315 1.575324 -1.529328 -0.7170305 3.179681 3.93045 6.694186 -0.7881775 -9.213675 13.28471 -13.14982 5.640185 8.76902 -1.971353 1.758175 -1.027197 -6.577384 2.859537
        ...
        ]
    </NnetIo>
    <NnetIo> output 
        <I1V> 8 
        <I1> 0 0 0 <I1> 0 1 0 <I1> 0 2 0 <I1> 0 3 0 <I1> 0 4 0 <I1> 0 5 0 <I1> 0 6 0 <I1> 0 7 0 
        rows=8 dim=2040 [ 228 1 ] dim=2040 [ 264 1 ] dim=2040 [ 264 1 ] dim=2040 [ 264 1 ] dim=2040 [ 1631 1 ] dim=2040 [ 1631 1 ] dim=2040 [ 1631 1 ] dim=2040 [ 1631 1 ]
    </NnetIo> 
</Nnet3Eg>

其中各标签含义如下

  • 后表示的个数,egs里有input和ouput两个,所以是2

  • 后是该NnetIo的类型,代码里根据改值对后面进行不同的格式解析。

  • 后面跟着帧数,输入是36帧,label是8帧。 8帧,t到t+7,所以需要的输入上下文是t-16到t+19

  • 后是每帧的Index (minibatch_index, time_index, conv_index)

  • input中给出所有输入帧的特征组成的矩阵
  • output中给出是8帧label的具体值[228,264,264,2641631,1631,1631,1631]. 1表示概率为1(也可以用小于1的比如0.5??label smoothing??). 2040是output pdf总数????

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK