Learning Speech Representations with Variational Predictive Coding
Summary
This paper introduces a variational predictive coding framework to reinterpret and improve the HuBERT objective for self-supervised speech representation learning. The authors argue that HuBERT's success stems from an underlying principle of predictive coding, which has not been explicitly formalized until now. By framing HuBERT within a variational framework, they demonstrate how its components—masked prediction and quantization—emerge naturally from a probabilistic formulation. This perspective enables new opportunities for improving parameterization and optimization. The paper proposes two modifications: softening the hard k-means quantization with a temperature-controlled soft assignment and using Gumbel-softmax for gradient-based optimization. These changes allow joint optimization of the reconstruction and prediction objectives, which were previously handled in separate steps. Empirical results show that these improvements lead to better pre-training performance and significant gains on four downstream tasks: phone classification, fundamental frequency tracking, speaker recognition, and automatic speech recognition. The framework also connects HuBERT to other self-supervised methods like VQ-APC, CPC, and wav2vec 2.0, highlighting the broader applicability of the predictive coding principle.
Sung-Lin Yeh, Peter Bell, Hao Tang
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
sunglin.yeh@ed.ac.uk, Peter.Bell@ed.ac.uk, hao.tang@ed.ac.uk

Abstract

Despite being the best known objective for learning speech representations, the HuBERT objective has not been further developed and improved. We argue that it is the lack of an underlying principle that stalls development, and, in this paper, we show that predictive coding under a variational view is the principle behind the HuBERT objective. Due to its generality, our formulation provides opportunities to improve parameterization and optimization, and we show two simple modifications that bring immediate improvements to the HuBERT objective. In addition, the predictive coding formulation has tight connections to various other objectives, such as APC, CPC, wav2vec, and BEST-RQ. Empirically, the improvement in pre-training brings significant improvements to four downstream tasks: phone classification, f0 tracking, speaker recognition, and automatic speech recognition, highlighting the importance of the predictive coding interpretation.

1 Introduction

Self-supervised learning has been the most successful approach to semi-supervised learning, leveraging unlabeled data for various downstream tasks (Zhang et al., 2022). The impact of self-supervised learning in speech processing has now been extended to the pursuit of more accessible yet compact, discrete representations that interact with language models (Lakhotia et al., 2021; Borsos et al., 2023; Wang et al., 2023). Behind these successes, other than the ease of scaling with Transformers, the training objectives are the major factor that makes the advancement possible.
The training objective of HuBERT (Hsu et al., 2021) is undoubtedly the most successful self-supervised objective. The HuBERT objective combines quantization and masked prediction as its main components. However, those choices are supported by empirical evidence rather than an underlying principle, and it becomes difficult to further improve the HuBERT objective or to design novel objectives. This is evidenced by the fact that subsequent work has focused on data augmentation (e.g., WavLM (Chen et al., 2022)), simplification of training (e.g., BEST-RQ (Chiu et al., 2022) and MelHuBERT (Lin et al., 2023)), pairing with other objectives (e.g., w2v-BERT (Chung et al., 2021), DinoSR (Liu et al., 2024), and MS-HuBERT (Yadav et al., 2024)), and training with different resolutions (Shi et al., 2023); none has improved the HuBERT objective itself.

In this paper, we show that predictive coding is the underlying principle of HuBERT. It is not difficult to see the lineage of predictive coding in HuBERT. HuBERT is inspired by wav2vec 2.0 (Baevski et al., 2020b) and DeepCluster (Caron et al., 2018), and wav2vec 2.0 is in turn influenced by BERT (Devlin et al., 2019) and contrastive predictive coding (CPC) (van den Oord et al., 2018). However, the HuBERT objective looks distinctively different from, say, the formulations in Atal and Hanauer (1971), Srinivasan et al. (1982), and Rao and Ballard (1999). What is missing is a general framework for predictive coding (to subsume masked prediction) and a discrete intermediate representation (to subsume quantization).
In this paper, the general framework we develop is a variational view of predictive coding, and we present HuBERT as a special case. We derive a training objective from first principles and show how it relates not only to the HuBERT objective but also to various other objectives, such as VQ-APC (Chung et al., 2020), VQ-CPC (van Niekerk et al., 2020), wav2vec 2.0, and BEST-RQ. Moreover, our derivation can spawn new objectives, and we will give an example that brings immediate improvement over HuBERT. Once we have an objective, its optimization, i.e., the algorithm for learning representations, naturally decouples and provides further opportunities for improvement. We will give an example of how optimizing the HuBERT objective with a different algorithm brings immediate improvement over HuBERT. We will also validate how much the improvement on the self-supervised objective transfers to downstream performance.

2 A Variational Framework for Predictive Coding

To see how predictive coding relates to HuBERT, in this section we briefly review predictive coding and its variational formulation.

Given its long history, predictive coding comes in various forms. Predictive coding at the conceptual level, as described in Elias (1955), involves an encoder on one end and a decoder on the other. The encoder processes a signal as it comes in (e.g., frames of a speech utterance in streaming mode), predicts what comes next, and computes the residuals (or how off the prediction is). The decoder receives the residuals, makes a prediction, and combines the two to reconstruct the signal. The hope is that the residuals require fewer bits to send than the original signal, achieving the goal of coding.
When predictive coding later evolved into algorithms, the goal became predicting one part of a signal given the other. For example, it is predicting the next waveform sample given the past samples in Atal and Hanauer (1971), and predicting the center pixel given the neighboring pixels in Srinivasan et al. (1982). A detailed exposition is beyond the scope of this paper and can be found in Makhoul (1975), Huang and Rao (2011), and Spratling (2017).

The general idea of predicting one part of a signal given the other can be formalized as follows. Let $x$ be a signal, for example, the waveform samples of a speech utterance. Predictive coding is about learning $p(x_B \mid x_A)$, or minimizing $-\log p(x_B \mid x_A)$, where $x_A$ and $x_B$ form a partition of $x$. To learn the entirety of $x$, the partition is not fixed but drawn stochastically, resulting in the objective

$$\mathbb{E}_{(x_A, x_B) \sim \mathcal{M}(x)}\big[-\log p(x_B \mid x_A)\big], \quad (1)$$

where $\mathcal{M}(x)$ is a distribution over many partitions of $x$. The formulation in Atal and Hanauer (1971) can be seen as choosing an $\mathcal{M}$ that partitions waveform samples into the future and the past, while in Srinivasan et al. (1982), $\mathcal{M}$ is chosen to partition the center pixel from the neighboring pixels. For the purposes of this paper, the signal $x$ is a sequence of acoustic frames $x_1, \ldots, x_T$, and we partition the signal into $x_A = \{x_i\}_{i \in A}$ and $x_B = \{x_i\}_{i \in B}$ based on two sets of time indices $A \subset \{1, \ldots, T\}$ and $B = \{1, \ldots, T\} \setminus A$. It is not difficult to see that, in HuBERT, $x_B$ will be the masked frames (the part being predicted) and $x_A$ the unmasked frames, and we will make this explicit in the next section.

From the coding perspective, a few important components are missing in equation 1: the encoder, the decoder, and the message sent from the encoder to the decoder.
Suppose the encoder encodes $x_A$ into a message $z$ and sends $z$ to the decoder to infer $x_B$. We assume that knowing $z$ is sufficient to infer $x_B$, i.e., $x_B \perp\!\!\!\perp x_A \mid z$; otherwise, the compression is deemed lossy. A variational upper bound of equation 1 can then be written as

$$\mathbb{E}_{(x_A, x_B) \sim \mathcal{M}(x)}\Big[\mathrm{KL}\big(q(z \mid x_B)\,\|\,p(z \mid x_A)\big) - \mathbb{E}_{z \sim q(z \mid x_B)}\big[\log p(x_B \mid z)\big]\Big], \quad (2)$$

where $q(z \mid x_B)$ is an auxiliary distribution of our choice.¹ The second term, $\mathbb{E}_{z \sim q(z \mid x_B)}[\log p(x_B \mid z)]$, is known as the reconstruction (or distortion in coding), where $p(z \mid x_A)$ is thought of as the encoder, $p(x_B \mid z)$ the decoder, and $z$ the message. The variational formulation has the advantage of making the encoder, decoder, and message explicit in the objective. Equation 2 is known as the negative free energy or the negative evidence lower bound (negative ELBO), and this view of predictive coding was first made explicit in Friston and Kiebel (2009) and later generalized in Feldman and Friston (2010). Our treatment adheres more to the variational lower bound in Kingma and Welling (2014) and Sohn et al. (2015), with the additional assumption that $x_B \perp\!\!\!\perp x_A \mid z$.

¹See Appendix A.1 for the derivation of our objective based on the variational lower bound.
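To make the bound concrete, here is a minimal sketch of equation 2 for a discrete latent variable, computing the KL and reconstruction terms by exact enumeration. It assumes frame-wise categorical distributions given as log-probability tensors; all names and shapes are illustrative, not the authors' implementation.

```python
# A minimal sketch of the negative ELBO in equation 2 for a discrete latent z,
# assuming frame-wise categorical distributions; names and shapes are
# illustrative, not the authors' implementation.
import torch

def negative_elbo(log_q, log_p_prior, log_lik):
    """
    log_q:       (T, K) log q(z_i | x_B), the auxiliary posterior
    log_p_prior: (T, K) log p(z_i | x_A), the encoder/predictor
    log_lik:     (T, K) log p(x_i | z_i = k), the decoder, for every codeword k
    Returns the per-utterance bound: KL(q || p) - E_q[log p(x_B | z)].
    """
    q = log_q.exp()
    kl = (q * (log_q - log_p_prior)).sum(dim=-1)   # KL term, per frame
    recon = (q * log_lik).sum(dim=-1)              # exact E_q[log p(x | z)]
    return (kl - recon).sum()                      # sum over the predicted frames

# toy usage with T = 5 frames and K = 4 codes
T, K = 5, 4
log_q = torch.log_softmax(torch.randn(T, K), dim=-1)
log_p_prior = torch.log_softmax(torch.randn(T, K), dim=-1)
log_lik = -0.5 * torch.rand(T, K)   # stand-in for a Gaussian log-likelihood
print(negative_elbo(log_q, log_p_prior, log_lik))
```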
3 HuBERT as Predictive Coding

Given the variational framework of predictive coding, we now turn to the HuBERT objective and discuss how it relates to predictive coding. Recall that HuBERT training consists of two steps: first quantizing the acoustic frames, and second training a Transformer to predict the cluster IDs of the quantized frames. The HuBERT objective often refers to the cross entropy of predicting the cluster IDs of each frame (shown as KL in Figure 1). However, the cluster IDs of the quantized frames are produced by k-means, so there is an implicit $\ell_2$ loss that measures the distortion (or reconstruction) of k-means (shown as MSE in Figure 1).

In the following subsections, we detail how the partition $\mathcal{M}$ and the parameterization of $q(z \mid x_B)$, $p(x_B \mid z)$, and $p(z \mid x_A)$ are chosen, such that the variational objective in equation 2 covers both the cross entropy and the $\ell_2$ loss. In particular, we assume the latent variables are discrete and correspond to the cluster IDs of frames.

3.1 Masked Prediction

When training HuBERT, a mask is generated at random for every utterance, and the frames in an utterance are partitioned into those that are masked and those that are not. The objective is to predict the cluster IDs of the masked frames given the unmasked frames. Formally, a mask is a subset of indices $M \subset \{1, \ldots, T\}$, and it forms a partition $x_M = \{x_i\}_{i \in M}$ and $x_{\setminus M} = \{x_i\}_{i \notin M}$. Let $\mathcal{M}(x)$ be the stochastic process of generating masks for the utterance $x$, where typically a frame has a small probability of being the start of a span of masked frames (Baevski et al., 2020b). With this choice of $\mathcal{M}$, the predictive coding objective (equation 2) becomes

$$\mathcal{L}_{\text{Masked}} = \mathbb{E}_{(x_{\setminus M}, x_M) \sim \mathcal{M}(x)}\big[\mathcal{L}_{x_{\setminus M}, x_M}\big], \quad (3)$$

where

$$\mathcal{L}_{x_{\setminus M}, x_M} = \mathrm{KL}\big(q(z \mid x_M)\,\|\,p(z \mid x_{\setminus M})\big) + \mathbb{E}_{z \sim q}\big[-\log p(x_M \mid z)\big]. \quad (4)$$

Since the HuBERT objective is frame-wise, we assume $q(z \mid x_M)$ and $p(x_M \mid z)$ factorize frame-wise, i.e., $q(z \mid x_M) = \prod_{i \in M} q(z_i \mid x_i)$ and $p(x_M \mid z) = \prod_{i \in M} p(x_i \mid z_i)$, where $z_1, \ldots, z_T$ are discrete latent variables for frames $x_1, \ldots, x_T$. After including the frame-wise independence, we have

$$\mathcal{L}_{x_{\setminus M}, x_M} = \sum_{i \in M} \mathbb{E}_{z_i \sim q}\big[\log q(z_i \mid x_i) - \log p(z_i \mid x_{\setminus M}) - \log p(x_i \mid z_i)\big]. \quad (5)$$

Each $z_i \in \{1, \ldots, K\}$ corresponds to the cluster ID of $x_i$, where $K$ is the total number of clusters. Note that the second term, $\mathbb{E}_{z_i \sim q}[-\log p(z_i \mid x_{\setminus M})]$, is the cross entropy, and the last term, $\mathbb{E}_{z_i \sim q}[-\log p(x_i \mid z_i)]$, is the reconstruction. We will discuss how these two terms become the cross entropy and the $\ell_2$ loss of k-means in the HuBERT objective. We will also discuss why the first term, $\mathbb{E}_{z_i \sim q}[\log q(z_i \mid x_i)]$, the negative entropy, does not appear in the HuBERT objective.
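As an illustration of the mask distribution $\mathcal{M}(x)$, the sketch below samples overlapping spans, with each frame starting a span with a small probability. The span length of 4 and start probability of 0.2 are the settings reported later in Section 6.1; the helper name is ours.

```python
# A sketch of the mask distribution M(x): every frame starts a masked span
# with a small probability; spans may overlap (the paper uses a probability of
# 0.2 and a span of 4 frames in Section 6.1). Names are illustrative.
import torch

def sample_mask(T, span=4, start_prob=0.2):
    """Return a boolean tensor of shape (T,); True marks a masked frame."""
    starts = torch.rand(T) < start_prob
    mask = torch.zeros(T, dtype=torch.bool)
    for i in starts.nonzero().flatten().tolist():
        mask[i : i + span] = True       # overlapping spans simply merge
    return mask

mask = sample_mask(100)
# x[mask] plays the role of x_M, and x[~mask] the role of x_{\M} in equation 3.
```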
[Figure 1: HuBERT as variational predictive coding. The set $v_1, \ldots, v_K$ are the codewords in the codebook, $x_{\setminus M}$ are the unmasked frames, and $x_M$ are the masked frames. Two loss functions are involved: the Kullback-Leibler divergence (KL) and the mean-squared error (MSE).]

3.2 Quantization and Reconstruction

In HuBERT, the cluster IDs of frames that later serve as targets for prediction are produced by k-means. The k-means algorithm finds the ID of the closest centroid for each individual frame, where closeness is defined by the $\ell_2$ loss. To realize the quantization with k-means in our predictive coding framework, we assign a point mass to the minimum and let

$$q(z_i \mid x_i) = \mathbb{1}_{z_i = \mathrm{argmin}_{k=1,\ldots,K} \|x_i - v_k\|^2}, \quad (6)$$

where $v_k$ is the $k$-th column of a matrix $V$, a codebook consisting of the centroids as columns. Note that $q$ in principle can depend on $x_M$, but for this particular choice, $q$ only depends on $x_i$.

The k-means objective is to minimize the distortion (or reconstruction) measured by the $\ell_2$ loss. Given the codebook $V$, using the cluster $z_i$ to reconstruct $x_i$ with the centroid $v_{z_i}$ gives a distortion $\|x_i - v_{z_i}\|^2$.
Naturally, we choose to parameterize $p(x_i \mid z_i)$ with a Gaussian and let

$$p(x_i \mid z_i) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{1}{2}\|x_i - v_{z_i}\|^2\right), \quad (7)$$

where $d$ is the dimension of an acoustic frame. The log of $p(x_i \mid z_i)$ gives the $\ell_2$ loss, i.e., the mean-squared error (MSE). It is now clear that the term $\mathbb{E}_{z_i \sim q}[-\log p(x_i \mid z_i)]$ in equation 5 involves both the quantization of a frame $x_i$ and its reconstruction with the closest centroid $v_{z_i}$. The negative entropy term $\mathbb{E}_{z_i \sim q}[\log q(z_i \mid x_i)]$ becomes 0 due to $q$ being a point mass.

3.3 Predicting Quantized Frames

Because $z_1, \ldots, z_T$ correspond to the targets for prediction, we parameterize $p(z_i \mid x_{\setminus M})$ as a softmax, i.e.,

$$p(z_i \mid x_{\setminus M}) = \frac{\exp\big(\mathrm{enc}(x_{\setminus M})_i^\top u_{z_i}\big)}{\sum_{k=1}^{K} \exp\big(\mathrm{enc}(x_{\setminus M})_i^\top u_k\big)}, \quad (8)$$

where $\mathrm{enc}(\cdot)$ is an encoder (typically a Transformer encoder), $\mathrm{enc}(x_{\setminus M})_i$ is the $i$-th frame of the encoder output after taking the unmasked frames $x_{\setminus M}$ as input, and $u_k$ is the $k$-th column of the final linear layer $U$. This choice of $q(z_i \mid x_i)$, paired with the $p(z_i \mid x_{\setminus M})$ above, completes the cross entropy $\mathbb{E}_{z_i \sim q}[-\log p(z_i \mid x_{\setminus M})]$ in equation 5 for predicting the cluster IDs of frames.
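Putting equations 6 through 8 together, the following sketch computes the two HuBERT terms, the k-means distortion and the cross entropy on cluster IDs, at the masked positions. The tensor layout and the function boundary are illustrative assumptions, not the reference implementation.

```python
# A sketch of how equations 6-8 assemble into the HuBERT objective: a
# point-mass q from k-means, a Gaussian reconstruction, and a softmax cross
# entropy. Shapes and names are illustrative.
import torch
import torch.nn.functional as F

def hubert_terms(x, enc_out, V, U):
    """
    x:       (N, d)       masked frames
    enc_out: (N, d_model) Transformer outputs at the masked positions
    V:       (K, d)       codebook of k-means centroids (held fixed, eq 6)
    U:       (K, d_model) final linear layer (eq 8)
    """
    dist = torch.cdist(x, V) ** 2              # squared l2 to every centroid
    z = dist.argmin(dim=-1)                    # point-mass q(z_i | x_i), eq 6
    recon = dist.gather(1, z[:, None]).mean()  # k-means distortion, i.e. the
                                               # -log p(x|z) of eq 7 up to constants
    logits = enc_out @ U.t()                   # enc(x_{\M})_i^T u_k, eq 8
    ce = F.cross_entropy(logits, z)            # cross entropy on cluster IDs
    return ce, recon                           # HuBERT optimizes these in two steps
```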
3.4 Two-Step Optimization

We have now instantiated the HuBERT objective from predictive coding. The cross entropy term $\mathbb{E}_{z_i \sim q}[-\log p(z_i \mid x_{\setminus M})]$ and the reconstruction term $\mathbb{E}_{z_i \sim q}[-\log p(x_i \mid z_i)]$ are both present in the objective (equation 5). In principle, both terms should be optimized together, but HuBERT takes a two-step approach, first finding the codebook by optimizing the reconstruction (running k-means) and then finding the Transformer parameters by optimizing the cross entropy. The two-step approach is reminiscent of the variational view of expectation maximization (Neal and Hinton, 1998; Attias, 1999). However, the codebook in HuBERT is never updated after the initial k-means, and is said to be optimized offline as opposed to jointly with the Transformer parameters. This provides an opportunity to improve the optimization of the HuBERT objective.

4 Extensions

Given that HuBERT is a special case of predictive coding, in this section we discuss several immediate extensions made possible by our framework.

4.1 Softening the Point Mass

Even though the two-step optimization in HuBERT likely leads to suboptimal solutions, it turns out to be difficult to optimize both the cross entropy and the reconstruction together. The difficulty stems from $q$ being a point mass, which cannot be optimized with gradient descent. Instead of exact minimization, we can use a soft-min and let

$$q(z_i \mid x_i) = \frac{\exp\big(-\|x_i - v_{z_i}\|^2 / \tau\big)}{\sum_{k=1}^{K} \exp\big(-\|x_i - v_k\|^2 / \tau\big)}, \quad (9)$$

where $\tau$ is the temperature. As $\tau \to 0$, the distribution collapses to exact minimization, i.e., a hard k-means assignment (Kulis and Jordan, 2011), where each $x_i$ is assigned to the code $v_{z_i}$ with the smallest squared Euclidean distance. With this parameterization, it is now possible to optimize the entire objective in equation 5 with gradient descent.

Since $q$ is no longer a point mass, we now have an additional negative entropy term $\mathbb{E}_{z_i \sim q}[\log q(z_i \mid x_i)]$ to optimize in equation 5. As entropy is maximized when $q$ is uniform, this term, when minimized with the other terms, encourages a diverse set of codes to be used and serves as a regularizer. This is reminiscent of the diversity loss in wav2vec 2.0, and we will discuss the differences in later sections.
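Below is a minimal sketch of the softened assignment in equation 9 together with the negative entropy term that now enters equation 5; variable names are illustrative.

```python
# A sketch of the softened assignment in equation 9 and the extra negative
# entropy term it introduces; with a learnable codebook V, all three terms of
# equation 5 become differentiable. Names are illustrative.
import torch

def soft_assign(x, V, tau=1.0):
    """q(z_i | x_i) = softmax(-||x_i - v_k||^2 / tau) over the K codewords."""
    dist = torch.cdist(x, V) ** 2            # (N, K) squared distances
    return torch.log_softmax(-dist / tau, dim=-1)

log_q = soft_assign(torch.randn(8, 80), torch.randn(100, 80), tau=0.5)
q = log_q.exp()
neg_entropy = (q * log_q).sum(-1).mean()     # E_q[log q], the regularizer
# As tau -> 0, q collapses to the hard k-means assignment of equation 6.
```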
4.2 Approximating the Expectation with Sampling

Since $p(x_i \mid z_i)$ is parameterized in a relatively simple form, computing the expectation $\mathbb{E}_{z_i \sim q}[-\log p(x_i \mid z_i)]$ by enumerating all values of $z_i$, i.e., exact marginalization, is feasible. However, exact marginalization is not always feasible when $p(x_i \mid z_i)$ becomes expensive to compute. An alternative is to approximate the expectation with sampling. There are various approaches to optimizing a function that involves sampling, and the simplest is to use Gumbel softmax (Jang et al., 2017). We will compare marginalization and Gumbel softmax in the experiments. Note that the expectation is optimized offline with k-means in HuBERT, which will also be included in the comparisons.
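The sketch below contrasts the two options for the reconstruction term: exact marginalization over the $K$ codewords versus a single straight-through Gumbel-softmax sample (Jang et al., 2017). It assumes the soft-min $q$ of equation 9 and illustrative shapes.

```python
# A sketch contrasting exact marginalization with a single Gumbel-softmax
# sample for E_{z_i ~ q}[-log p(x_i | z_i)]. torch's gumbel_softmax provides
# the reparameterized, straight-through gradient. Names are illustrative.
import torch
import torch.nn.functional as F

def expected_recon(x, V, tau=1.0, sample=False):
    dist = torch.cdist(x, V) ** 2                   # (N, K)
    log_q = torch.log_softmax(-dist / tau, dim=-1)  # soft-min of equation 9
    neg_log_lik = 0.5 * dist                        # -log p(x_i | z_i = k) + const
    if sample:
        # one-hot sample with straight-through gradients; one sample per frame
        y = F.gumbel_softmax(log_q, tau=tau, hard=True)
        return (y * neg_log_lik).sum(-1).mean()
    # exact marginalization: enumerate all K codewords
    return (log_q.exp() * neg_log_lik).sum(-1).mean()
```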
4.3 Future Prediction

Our framework of predictive coding can also instantiate future prediction given the past.² The partition of the signal is simpler than in masked prediction: choosing a time point partitions the signal into the past and the future. Formally, let $\mathcal{M}(x)$ be the stochastic process of choosing a time point $t$ in a signal $x$. The past $x_{<t}$ and the future $x_{\geq t}$ form a partition of $x$, and our autoregressive objective becomes

$$\mathcal{L}_{\text{Future}} = \mathbb{E}_{(x_{<t}, x_{\geq t}) \sim \mathcal{M}(x)}\big[\mathcal{L}_{x_{<t}, x_{\geq t}}\big], \quad (10)$$

where

$$\mathcal{L}_{x_{<t}, x_{\geq t}} = \sum_{i=t}^{T} \mathbb{E}_{z_i \sim q}\big[\log q(z_i \mid x_i) - \log p(z_i \mid x_{<i}) - \log p(x_i \mid z_i)\big]. \quad (11)$$

The form of equation 11 is nearly identical to masked prediction in equation 5, except for the term $p(z_i \mid x_{<i})$.³ In terms of parameterization, $p(z_i \mid x_{<i})$ is typically modeled with a unidirectional LSTM or a causal Transformer; the rest of the terms remain the same. Note that only the suffix of a signal participates in the objective, and $\mathcal{M}$ needs to be carefully chosen to avoid putting too much weight on the suffixes. In practice, instead of choosing a time point at random, all frames are predicted an equal number of times (van den Oord et al., 2018; Chung et al., 2019).

Since speech is generally smooth in time, an additional assumption $z_i \perp\!\!\!\perp x_{i-\kappa+1:i} \mid x_{\leq i-\kappa}$ is commonly made for a small $\kappa > 0$ (e.g., in van den Oord et al. (2018) and Chung et al. (2019)). In other words, once we know the past frames $x_{\leq i-\kappa}$ close enough to the current time point $i$, knowing the few additional frames $x_{i-\kappa+1:i}$ does not add much information about $z_i$. The term $p(z_i \mid x_{<i})$ becomes $p(z_i \mid x_{\leq i-\kappa})$ under this assumption.

²Next-token prediction in language modeling, e.g., is a form of future prediction.
³The derivation of future prediction can be found in Appendix A.1.
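A small sketch of how the future-prediction targets can be aligned under the assumption above: the output summarizing $x_{\leq i-\kappa}$ predicts the code of frame $i$. The causal encoder and $\kappa = 2$ (the value used later in Section 6.1) are assumptions for illustration.

```python
# A sketch of target alignment for future prediction under the smoothness
# assumption: h[i], summarizing x_{<=i}, predicts z[i + kappa], i.e.,
# p(z_i | x_{<= i - kappa}) in the text. A causal model and kappa = 2 (as in
# Section 6.1) are assumed; names are illustrative.
import torch

def future_targets(h, z, kappa=2):
    """
    h: (T, d_model) outputs of a causal encoder, h[i] summarizing x_{<=i}
    z: (T,)         cluster IDs of the frames
    Returns (inputs, targets) aligned so h[i] predicts z[i + kappa].
    """
    return h[:-kappa], z[kappa:]

h, z = torch.randn(10, 256), torch.randint(0, 100, (10,))
inputs, targets = future_targets(h, z)
```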
5 Connections to Prior Work

We have demonstrated in Section 3 how our framework instantiates HuBERT. Given the generality of our framework, we can also instantiate objectives similar to other self-supervised objectives. In this section, we discuss what our framework can achieve and how it differs from prior work, including approaches based on likelihood and contrastive learning. We do not consider combined objectives in this section, such as w2v-BERT (Chung et al., 2021).

5.1 APC and VQ-APC

We begin with other variants of predictive coding for speech representation that optimize the likelihood in equation 1, in which $x_A$ is $x_{<t}$ and $x_B$ is $x_{\geq t}$. First, autoregressive predictive coding (APC) (Chung et al., 2019; Yang et al., 2022) is a generalization of linear predictive coding (LPC) (Atal and Hanauer, 1971; Saito et al., 1967), where a model is trained to predict future frames given the past. Its vector-quantized variant, VQ-APC (Chung et al., 2020), includes a quantization layer in the neural network while optimizing APC. Both APC and VQ-APC optimize the likelihood, while our framework is based on the variational bound in equation 2. APC and VQ-APC in their original forms in Chung et al. (2019) and Chung et al. (2020) do not state the latent variables explicitly. Our framework makes the choice of latent variables explicit, and the proposed auxiliary distribution offers more flexibility to estimate the likelihood, especially when marginalization over the latent variables is not tractable. Similar to our approach, Yang et al. (2022) and Yeh and Tang (2022) make the latent variables explicit, leading to various other interpretations of APC with respect to, e.g., mutual information and co-training (McAllester, 2018).

5.2 MPC and DeCoAR

MPC (Jiang et al., 2019; Zhang et al., 2021) and DeCoAR (Ling et al., 2020) are both generalizations of APC, replacing future prediction with masked prediction. DeCoAR 2.0 (Ling and Liu, 2020) adds quantization similar to VQ-APC. Both optimize the likelihood in equation 1. The difference between them lies in how the frames are masked: MPC masks many small spans, while DeCoAR masks a single large span.
5.3 HuBERT, WavLM, and BEST-RQ

We have shown that HuBERT is a special case of our framework. Subsequent work, such as WavLM (Chen et al., 2022) and BEST-RQ (Chiu et al., 2022), uses the same HuBERT objective. WavLM improves over HuBERT with data augmentation. BEST-RQ avoids updating the codebook, but requires a careful initialization.

When training HuBERT in Hsu et al. (2021), there are multiple iterations, each of which trains a Transformer encoder from scratch. What differs for each iteration are the training targets. In the first iteration, quantized MFCCs are used as targets, while in the second iteration, quantized hidden vectors from layer 9 of the first iteration are used as targets. Our framework can also instantiate the second iteration with a pre-trained Transformer encoder from the first iteration. We simply let

$$q(z_i \mid h_i) = \frac{\exp\big(-\|h_i - v_{z_i}\|^2 / \tau\big)}{\sum_{k=1}^{K} \exp\big(-\|h_i - v_k\|^2 / \tau\big)}, \quad (12)$$

where the $i$-th frame $h_i$ now comes from an intermediate layer (e.g., the 6th layer) of the encoder from the first iteration.

5.4 CPC, VQ-CPC, and wav2vec 2.0

CPC (van den Oord et al., 2018), its vector-quantized variant VQ-CPC (van Niekerk et al., 2020), wav2vec (Schneider et al., 2019), and wav2vec 2.0 (Baevski et al., 2020b) are all based on noise contrastive estimation (NCE) (Smith and Eisner, 2005; Gutmann and Hyvärinen, 2010). In this section, we show how wav2vec 2.0 can be instantiated with our framework.
We derive the contrastive objective in wav2vec 2.0 from the cross entropy in the variational objective, $\mathbb{E}_{z_i \sim q}[-\log p(z_i \mid x_{\setminus M})]$. Recall that wav2vec 2.0 has a CNN followed by a VQ module and a Transformer, where the VQ produces targets for contrastive learning. We first choose

$$q(z_i \mid x_M) = \delta_{z_i = g(x_M)}, \quad (13)$$

$$p(z_i \mid x_{\setminus M}) = \frac{\exp\big(\mathrm{sim}(\mathrm{enc}(x_{\setminus M})_i, z_i)\big)}{\int \exp\big(\mathrm{sim}(\mathrm{enc}(x_{\setminus M})_i, z)\big)\,dz}, \quad (14)$$

where $g(\cdot)$ is the CNN with the VQ, $\mathrm{enc}(\cdot)$ is the Transformer, $\delta$ is the Dirac delta function, and $\mathrm{sim}(x, y) = \cos(x, y)$. The choice of $p(z_i \mid x_{\setminus M})$ is similar to that of equation 8 in our framework, except for the similarity function and a continuous $z$. The cross entropy becomes

$$\mathbb{E}_{z_i \sim q}[-\log p(z_i \mid x_{\setminus M})] = -\log \frac{\exp\big(\mathrm{sim}(\mathrm{enc}(x_{\setminus M})_i, z_i)\big)}{\int \exp\big(\mathrm{sim}(\mathrm{enc}(x_{\setminus M})_i, z')\big)\,dz'}. \quad (15)$$

Note that $z_i$ here is the codeword after quantization, i.e., a continuous vector (Baevski et al., 2020b). The denominator in the cross entropy is in general difficult to compute due to the integral.⁴ Inspired by NCE, van den Oord et al. (2018) and Baevski et al. (2020b) approximate the integral by summing over a set of negative samples $S_i$. The cross entropy becomes

$$-\log \frac{\exp\big(\mathrm{sim}(\mathrm{enc}(x_{\setminus M})_i, z_i)\big)}{\sum_{z' \in S_i} \exp\big(\mathrm{sim}(\mathrm{enc}(x_{\setminus M})_i, z')\big)}, \quad (16)$$

which is the objective used in wav2vec 2.0. Consequently, minimizing the contrastive loss is analogous to minimizing the cross entropy between the masked and unmasked frames after quantization.

⁴In fact, even when $\mathrm{sim}(x, y) = x^\top y$, i.e., $p(z_i \mid x_{\setminus M})$ being a von Mises-Fisher distribution, the denominator of $p$ is still difficult to compute.
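A sketch of the NCE approximation in equation 16 follows, scoring the Transformer output against the quantized target with cosine similarity and treating the other frames in the batch as the negative set $S_i$. The temperature follows wav2vec 2.0's practice rather than equation 16, and all names are illustrative.

```python
# A sketch of equation 16: cosine similarity between the Transformer output
# and the quantized target, with negatives drawn from the other frames in the
# batch. The temperature is wav2vec 2.0's convention (equation 16 omits it);
# names and the negative-sampling scheme are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(c, z, temperature=0.1):
    """
    c: (N, d) enc(x_{\M})_i at the masked positions
    z: (N, d) quantized targets g(x_M)_i (continuous codewords)
    For each i, the remaining N - 1 targets serve as the negative set S_i.
    """
    sim = F.cosine_similarity(c[:, None, :], z[None, :, :], dim=-1)  # (N, N)
    labels = torch.arange(c.size(0))     # the positive is the aligned target
    return F.cross_entropy(sim / temperature, labels)
```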
Several design choices in wav2vec 2.0 are taken verbatim by HuBERT, but the two have a few key differences. Both use a softmax to parameterize $p(z_i \mid x_{\setminus M})$, but wav2vec 2.0 assumes $z_i$ to be a continuous vector, while HuBERT assumes $z_i$ to be discrete. As a consequence, wav2vec 2.0 uses a softmax over the negative and positive samples, while HuBERT uses a softmax over the codewords in a codebook. The codebook in wav2vec 2.0 is trained together with the rest of the model. In other words, there is no reconstruction, and we can simply parameterize $p(x_i \mid z_i)$ as a uniform distribution. In contrast, the codebook in HuBERT is trained with an offline k-means. Compared to both, our framework offers the choice of optimizing the codebook offline or jointly with the Transformer.

There is an additional diversity loss in wav2vec 2.0 that computes the entropy of codeword usage within a batch, to encourage the use of individual codewords. The diversity loss was not meant to be part of the main objective (Baevski et al., 2020b), but is more of a regularizer that improves training. However, our instantiation of HuBERT naturally has a per-frame entropy term that serves a similar purpose as the diversity loss.

6 Experiments

We have shown that our framework subsumes the HuBERT objective and that the two are numerically identical. In the experiments, we study how the extensions in Section 4 lead to immediate improvements over HuBERT.
6.1 Experimental Settings

Unless otherwise stated, models are built with a Transformer encoder, a linear projection after the encoder, and a codebook (consisting of the centroids in k-means). We follow Chung et al. (2019), Jiang et al. (2019), Ling and Liu (2020), Misra et al. (2021), Chiu et al. (2022), and Lin et al. (2023) in using Mel spectrograms as acoustic frames. We extract 40-dimensional log Mel features with a 25 ms window and a 10 ms hop, and concatenate every two contiguous frames, giving 80-dimensional features at a 20 ms frame rate. For masked prediction, we use a masking span of 4 frames (80 ms), with every frame being the start of a span with probability 0.2; spans may overlap. For future prediction, we use a time shift (the $\kappa$ in Section 4.3) of 2 frames (40 ms).

In the experiments, we consider a BASE setting similar to that described by Baevski et al. (2020b) and Hsu et al. (2021), using 12-layer Transformers with 6 attention heads instead of 8. We set the codebook size to 100 for all models, following the cluster size in Hsu et al. (2021). We train models on the 360-hour subset of LibriSpeech (Panayotov et al., 2015) for 150 epochs. We detail hyperparameters in Appendix A.3.
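As an illustration of this input pipeline, the sketch below extracts 40-dimensional log Mel features with a 25 ms window and 10 ms hop and stacks adjacent pairs into 80-dimensional frames at a 20 ms rate. The use of torchaudio and the exact spectrogram parameters are our assumptions, not the authors' configuration.

```python
# A sketch of the input pipeline in Section 6.1: 40-dim log Mel features
# (25 ms window, 10 ms hop), with every two contiguous frames concatenated
# into 80-dim features at a 20 ms rate. torchaudio and the exact parameters
# are assumptions for illustration.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=40  # 25 ms / 10 ms
)

def frames(wav):
    """wav: (1, num_samples) -> (T // 2, 80) stacked log Mel frames."""
    feat = torch.log(mel(wav) + 1e-6).squeeze(0).t()   # (T, 40)
    T = feat.size(0) - feat.size(0) % 2                # drop an odd last frame
    return feat[:T].reshape(T // 2, 80)                # concat adjacent pairs
```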
6.2 Model Variants and Feature Summary

Before comparing different optimization strategies and downstream performance, we summarize the model variants evaluated in our experiments in Table 1. Each model corresponds to a particular instantiation of our framework, characterized by a combination of choices such as the prediction task, the codebook optimization, and hard (point-mass) or soft (soft-min) assignments of $q$.

Table 1: A summary of objectives that can be instantiated from our framework. Each loss is characterized by how the probability distributions are parameterized, how the codebook is optimized, and how the expectation is computed. When q is a point mass, the expectation becomes an assignment to the point (single point). The codebook can either be optimized with k-means (offline) or jointly optimized (online) with the Transformer.

Model              | p(z|x_A)        | q(z|x_B)           | p(x_B|z)        | Codebook   | E_{z~q(z|x_B)}[·]
MASKED PREDICTION
HuBERT Obj (eq 5)  | softmax (eq 8)  | point-mass (eq 6)  | Gaussian (eq 7) | offline    | single point
Masked-VPC (eq 5)  | softmax (eq 8)  | soft-min (eq 9)    | Gaussian (eq 7) | joint opt. | Gumbel / marginal
Masked-NCE (eq 5)  | softmax (eq 16) | point-mass (eq 13) | uniform         | joint opt. | single point
FUTURE PREDICTION
Future-VPC (eq 11) | softmax (eq 8)  | soft-min (eq 9)    | Gaussian (eq 7) | joint opt. | Gumbel / marginal
VQ-APC (eq 1)      | softmax (eq 8)  | n/a                | Gaussian (eq 7) | joint opt. | n/a

Specifically, we denote our variational predictive coding framework as Masked-VPC or Future-VPC, depending on the prediction task. The framework thus allows us to compare different features in isolation. Note that the original HuBERT model (Hsu et al., 2021) consists of two or more training iterations. Since we are concerned with the training objective rather than the models, we use HuBERT Obj to refer to its training objective and avoid confusion. The first few sets of experiments compare the different losses in the first iteration, and we leave the results of the second iteration to the last subsection.

6.3 Comparing Optimization Methods

We first focus on comparing the first two variants (HuBERT Obj, Masked-VPC) in Table 1, which optimize the same loss. As pointed out in Section 4, softening the point mass, i.e., moving from the hard assignment in k-means to a soft assignment, provides an opportunity to jointly optimize the cross entropy and the reconstruction. Not only can we jointly optimize both terms, we also have the option to choose between exact marginalization and sampling. When sampling, we take only a single sample and use Gumbel softmax to compute the gradient. There are a total of three options: the original hard assignment in HuBERT Obj, soft assignment with exact marginalization, and soft assignment with sampling. HuBERT uses k-means++ (Arthur and Vassilvitskii, 2007) for initialization, so we also compare random initialization and k-means++ initialization for the codebook.
Figure 2 (top) shows the variational objective (the negative ELBO in equation 5) of training BASE models on the 360-hour subset of LibriSpeech. The HuBERT objective starts off low because of the offline k-means, but both exact marginalization and sampling improve over the HuBERT objective after training. Similar to how HuBERT and BEST-RQ are sensitive to initialization (Chiu et al., 2022), marginalization ends up at drastically different final values depending on whether k-means++ is used. Sampling (and taking the gradient with Gumbel softmax) turns out to be both efficient and fast converging.

[Figure 2: Training losses (−ELBO) over 150 epochs for the different optimization approaches, implemented in our own BASE on log Mels (top) and in fairseq BASE on waveform (bottom), comparing HuBERT, Marginal (k-means++), Gumbel (random), and Marginal (random). The final loss values can be found in Table 6.]

To confirm the findings, we run the same experiments in fairseq⁵, in which a CNN is used to aggregate frame-wise features. Figure 2 (bottom) shows the same trend, except that marginalization works out of the box with a randomly initialized codebook. The final losses are presented in Table 6. For the rest of the experiments, unless otherwise stated, we use only Mel spectrograms as input. We also use sampling (and Gumbel softmax) to optimize our objective due to its faster convergence.

⁵The experiments mostly follow the default HuBERT configuration at hubert_base_librispeech.yaml, using waveform as input.

6.4 Downstream Evaluation

Given that we can improve pre-training, the next question is whether the improvement transfers to downstream tasks. Moreover, our framework allows for controlled experiments that differ only in the loss function while holding everything else, such as the model architecture and training hyperparameters, fixed.
Based on the connections to prior work in Section 5, we study two important design choices: comparing masked prediction to future prediction, and comparing the contrastive loss in wav2vec 2.0 to the softmax in the HuBERT objective.

The utility of speech representations on downstream tasks is evaluated with probing (van den Oord et al., 2018; Chung et al., 2019; Yang et al., 2021, 2022), where the pre-trained models are frozen and small classifiers are trained to complete tasks given the hidden vectors of one of the layers. We conduct four probing tasks: phone classification, speaker verification, fundamental frequency tracking, and automatic speech recognition. All settings follow Yang et al. (2022) unless otherwise noted.

For phone classification, we train a linear classifier to predict phone labels obtained from forced alignments on the Wall Street Journal corpus (Paul and Baker, 1992) (WSJ). We report phone error rates (PERs) on dev93 and eval92, using 10% of the training set si284 for development. For f0 tracking, we train a linear regression model to predict the f0 extracted with PYIN (Mauch and Dixon, 2014) on WSJ. We report the root-mean-square error (RMSE) in Hz on eval92.
For speaker verification, we train a two-layer classifier to predict speaker identities on VoxCeleb1 (Nagrani et al., 2020) and use the intermediate layer as the speaker embedding for speaker verification. We report equal error rates (EERs) on the test set. The reported numbers are with the best layer for each task and for each model. More details are in Appendix A.4.

Better pre-training leads to better downstream performance. We have shown that using soft assignment and sampling gives the best pre-training loss. To answer whether the improved pre-training transfers to downstream tasks, we compare HuBERT Obj with the soft-assignment variant optimized with sampling and Gumbel softmax (termed Masked-VPC). Results on three tasks are shown in Table 2. We see that improving pre-training indeed improves downstream tasks across the board. ASR results are deferred to later sections, but the conclusion stands.

Table 2: Downstream probing results for different speech representations on phone classification, speaker verification, and f0 tracking.

             PER ↓           EER ↓     RMSE ↓
             dev93  eval92   voxceleb  eval92
log Mel      49.8   50.0     24.6      38.4
i-vector     -      -        15.7      -
HuBERT Obj   12.8   12.6     18.3      23.4
Masked-VPC   11.8   11.3     14.4      20.9

Table 3: Downstream probing results comparing future prediction (Future-VPC) and masked prediction (Masked-VPC) on phone classification, speaker verification, and f0 tracking.

             PER ↓           EER ↓     RMSE ↓
             dev93  eval92   voxceleb  eval92
Masked-VPC   11.8   11.3     14.4      20.9
Future-VPC   16.0   15.5     13.6      20.5

Table 4: Downstream probing results comparing NCE (Masked-NCE), i.e., using a contrastive loss, with the HuBERT approaches (HuBERT Obj and Masked-VPC) on phone classification, speaker verification, and f0 tracking.

             PER ↓           EER ↓     RMSE ↓
             dev93  eval92   voxceleb  eval92
Masked-NCE   12.2   12.4     20.2      24.4
HuBERT Obj   12.8   12.6     18.3      23.4
Masked-VPC   11.8   11.3     14.4      20.9
Future prediction can be as strong as masked prediction on some downstream tasks. As shown in Section 4.3, under our framework it is possible (and relatively simple) to switch from masked prediction to future prediction while holding everything else fixed. Future prediction is more relevant than masked prediction in certain scenarios, such as streaming or pre-training decoder-only Transformers. Misra et al. (2021), for example, show that future prediction can be as strong as masked prediction for streaming ASR. Results comparing future prediction (Future-VPC) and masked prediction (Masked-VPC) are shown in Table 3. We find that, not surprisingly, future prediction is worse at phone classification compared to masked prediction. However, future prediction is better than masked prediction on speaker verification and fundamental frequency tracking.

NCE is on par with HuBERT Obj. To compare to the contrastive loss, we parameterize $q(z_i \mid x_i)$ and $p(z_i \mid x_{\setminus M})$ as detailed in Section 5.4. We characterize the model in Table 1 and denote it as Masked-NCE. The entropy term is a constant because of the Dirac delta. We follow Baevski et al. (2020b) and take the negative samples for NCE from frames within the same batch. The codebook is updated together with the contrastive loss using Gumbel softmax, as in Baevski et al. (2020b). Results are shown in Table 4. NCE alone performs surprisingly well and closely matches HuBERT's results, while our approach is still superior.
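To summarize the probing protocol used throughout this section, here is a minimal sketch: the pre-trained encoder is frozen, and only a small classifier on top of one layer's hidden vectors is trained. The encoder interface (including the return_layer argument) and the data loader are hypothetical placeholders.

```python
# A sketch of the probing protocol: the pre-trained encoder is frozen and a
# small classifier is trained on hidden vectors from one layer. The encoder
# API and the loader are hypothetical; names are illustrative.
import torch
import torch.nn as nn

def probe(encoder, layer, num_classes, loader, dim=768, epochs=10):
    encoder.eval()                                  # frozen; no gradients flow
    clf = nn.Linear(dim, num_classes)               # linear probe
    opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:                         # frames and frame labels
            with torch.no_grad():
                h = encoder(x, return_layer=layer)  # hypothetical API
            loss = nn.functional.cross_entropy(clf(h), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```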
6.5 Automatic Speech Recognition

Automatic speech recognition (ASR) has been the main driving force behind speech representation learning. In this section, we evaluate the quality of speech representations in two ASR settings: a probing setting with sequence-to-sequence (seq2seq) models, and a fine-tuning setting with connectionist temporal classification (CTC). We report CERs and WERs averaged over three runs.

Probing with seq2seq. In the probing setting, we follow Yang et al. (2022), extracting the layer that achieves the best phone classification, and train a seq2seq model with character outputs on WSJ.⁶ Following the spirit of probing, we make sure that the improvement is solely from the learned representations and that the phonetic information is accessible, by not using a language model or a lexicon, and by using a lightweight model. Beam search is used with a beam size of 5.

⁶More training details are in Appendix A.4.

Table 5: Probing results with lexicon-free seq2seq models on WSJ. No language models are used.

                          CER ↓           WER ↓
                          dev93  eval92   dev93  eval92
BASELINE
wav2letter++ †            6.3    4.1      19.5   13.9
18L Transformer ‡         -      -        22.2   17.9
PROBING WITH OUR SEQ2SEQ
log Mel                   6.8    5.1      18.2   14.7
VQ-APC                    5.8    5.2      16.8   14.9
Future-VPC                5.4    4.0      15.5   12.4
Masked-NCE                5.0    4.9      14.5   14.4
HuBERT Obj                5.2    5.0      15.2   14.5
Masked-VPC                4.4    3.6      13.6   11.4

† Numbers taken from Baevski et al. (2020a). ‡ Numbers taken from Higuchi et al. (2020).

Table 5 reports the ASR performance in terms of character error rates (CERs) and word error rates (WERs). Despite being lightweight, our baseline model on log Mel features is by no means weak: it is on par with wav2letter++ in Pratap et al. (2019) and better than an 18-layer Transformer baseline in Higuchi et al. (2020).
We then compare representations learned from the approaches characterized in Table 1, including two future prediction approaches (VQ-APC, Future-VPC) and three masked prediction approaches (Masked-NCE, HuBERT Obj, Masked-VPC). We first find that representations learned from future prediction are not far behind those from masked prediction. Second, the improvement from HuBERT Obj to Masked-VPC (in Section 6.3) transfers well to ASR.

Fine-tuning with CTC. To further confirm that improvement in pre-training transfers to ASR, we fine-tune the pre-trained models on WSJ (Paul and Baker, 1992) using CTC (Graves et al., 2006) with character outputs. We fine-tune our BASE for 100 epochs and fairseq BASE for 200 epochs on si284. Similar to Lee and Watanabe (2021) and Higuchi et al. (2020), we average the models from the last 10 epochs of training to obtain the final model parameters. Table 6 summarizes the pre-training losses (−ELBO) reported in Figure 2 and the corresponding WERs on dev93 and eval92. Inference is done with greedy decoding. The improvement in the objective indeed transfers well to ASR in both settings.

Table 6: Results of ASR fine-tuned on WSJ using CTC for different optimization approaches during pre-training. No language models are used.

                        −ELBO   dev93   eval92
FINE-TUNING (OURS)
HuBERT Obj              7.79    15.5    12.5
Marginal (k-means++)    7.50    14.3    11.0
Marginal (random)       8.70    15.0    11.7
Gumbel (random)         7.48    14.1    11.7
FINE-TUNING (FAIRSEQ)
HuBERT Obj              8.14    14.9    11.8
Marginal (k-means++)    7.91    14.5    11.1
Marginal (random)       7.91    14.2    10.9
Gumbel (random)         7.89    14.3    11.0

To see whether the improvement is still present when decoding with a language model, Table 7 compares the results after adding a 4-gram word language model (Heafield et al., 2013). A beam size of 2000 is used for decoding with the 4-gram language model. The improvement transfers well even with the use of a language model.
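As a small aside on the checkpoint averaging mentioned above, the sketch below averages the parameters from the last 10 epochs elementwise before evaluation (following Lee and Watanabe, 2021); the file naming is illustrative.

```python
# A sketch of checkpoint averaging: state dicts from the last 10 epochs are
# averaged elementwise. Paths and naming are illustrative.
import torch

def average_checkpoints(paths):
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}

# model.load_state_dict(average_checkpoints([f"ckpt_{e}.pt" for e in range(141, 151)]))
```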
Lastly, to see the effect on small datasets, we follow Baevski et al. (2020b) and evaluate phonetic recognition on TIMIT. We observe the same trend in Table 8: our approach consistently performs better than HuBERT Obj.

Table 7: Comparing ASR fine-tuned on WSJ using CTC with and without a 4-gram language model.

                    w/o LM          w/ LM
                    dev93  eval92   dev93  eval92
BASELINE
12L Transformer ‡   20.1   16.5     -      -
wav2letter++ †      19.5   -        -      8.57
FINE-TUNING
Masked-NCE          14.7   11.1     7.8    4.4
HuBERT Obj          15.5   12.5     7.3    4.6
Masked-VPC          14.1   10.3     6.8    4.6

‡ Numbers taken from Lee and Watanabe (2021). † Numbers taken from Baevski et al. (2020a).

Table 8: Phone recognition fine-tuned with CTC on TIMIT.

                    dev PER ↓   test PER ↓
BASELINE
12L Transformer     22.1        24.1
FINE-TUNING
Masked-NCE          11.2        12.9
HuBERT Obj          10.2        11.3
Masked-VPC          9.5         11.0

6.6 Second-Iteration Training

As detailed in Section 5.3, our framework extends to the second iteration as in HuBERT (Hsu et al., 2021). Following Hsu et al. (2021), the Transformer encoder from the previous iteration is held fixed in the second iteration. We run the second iteration using fairseq, based on its first iteration in Figure 2 (bottom). For the proposed approach, we choose Masked-VPC with k-means++ initialization from the first iteration, applying the same initialization to the second-iteration training. We adopt the 6th layer of the Transformers of HuBERT and Masked-VPC from the previous iteration, the layer that provides the best downstream ASR performance in Table 5. We use a temperature $\tau$ of 10 to avoid $q$ being one-hot. We train second-iteration HuBERT and Masked-VPC for 150 epochs, and evaluate them with CTC fine-tuning as in the previous section.
As shown in Table 9, the second iteration improves both approaches by a large margin. In particular, we observe a 4.1% absolute reduction in WER on dev93 with our approach. Moreover, the improvement in the objective transfers to the second iteration, in which Masked-VPC obtains 0.7% and 0.3% lower WERs than HuBERT on the two sets. A similar trend is observed with an n-gram language model for CTC decoding.

Table 9: Second-iteration results of ASR fine-tuned on WSJ using CTC, with and without an n-gram LM.

                    w/o LM          w/ LM
                    dev93  eval92   dev93  eval92
FIRST ITERATION
HuBERT Obj          14.9   11.8     7.9    6.0
Masked-VPC          14.5   11.1     7.8    5.3
SECOND ITERATION
HuBERT Obj          11.1   8.2      6.2    4.2
Masked-VPC          10.4   7.9      5.9    3.8

7 Conclusion

In this work, we provide an underlying principle for the HuBERT objective: a variational view of predictive coding. We show the utility of this framework by identifying opportunities to improve the HuBERT objective, i.e., its parameterization and optimization. Evaluating across several downstream tasks using probing and fine-tuning, we empirically show that the improved pre-training learns better speech representations. We further show that the predictive coding framework is general and has many connections to existing self-supervised objectives. Predictive coding has fruitful and mature theoretical underpinnings, while the theory of self-supervised learning is still in its infancy. We hope that making the connections between self-supervised learning and predictive coding clear, as done in this work, helps advance the understanding of both.

References
David Arthur and Sergei Vassilvitskii. 2007. k-means++: The advantages of careful seeding. In SODA.

Bishnu S. Atal and Suzanne L. Hanauer. 1971. Speech analysis and synthesis by linear prediction of the speech wave. The Journal of the Acoustical Society of America.

Hagai Attias. 1999. A variational Bayesian framework for graphical models. In NeurIPS.

Alexei Baevski, Steffen Schneider, and Michael Auli. 2020a. vq-wav2vec: Self-supervised learning of discrete speech representations. In ICLR.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020b. wav2vec 2.0: A framework for self-supervised learning of speech representations. In NeurIPS.

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. 2023. AudioLM: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. 2018. Deep clustering for unsupervised learning of visual features. In ECCV.

Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al. 2022. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing.

Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, and Yonghui Wu. 2022. Self-supervised learning with random-projection quantizer for speech recognition. In ICML.

Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James R. Glass. 2019. An unsupervised autoregressive model for speech representation learning. In Interspeech.
Yu-An Chung, Hao Tang, and James R. Glass. 2020. Vector-quantized autoregressive predictive coding. In Interspeech.

Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. 2021. w2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In ASRU.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Peter Elias. 1955. Predictive coding - I. IRE Transactions on Information Theory.

Zhiyun Fan, Meng Li, Shiyu Zhou, and Bo Xu. 2020. Exploring wav2vec 2.0 on speaker verification and language identification. In Interspeech.

Harriet Feldman and Karl J. Friston. 2010. Attention, uncertainty, and free-energy. Frontiers in Human Neuroscience.

Karl Friston and Stefan Kiebel. 2009. Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society B: Biological Sciences.

Jonas Geiping and Tom Goldstein. 2023. Cramming: Training a language model on a single GPU in one day. In ICML.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML.

Michael Gutmann and Aapo Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In ACL.
Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, and Tetsunori Kobayashi. 2020. Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict. In Interspeech.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Yanping Huang and Rajesh P. N. Rao. 2011. Predictive coding. Wiley Interdisciplinary Reviews: Cognitive Science.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with Gumbel-softmax. In ICLR.

Dongwei Jiang, Xiaoning Lei, Wubo Li, Ne Luo, Yuxuan Hu, Wei Zou, and Xiangang Li. 2019. Improving Transformer-based speech recognition using unsupervised pre-training. arXiv preprint arXiv:1910.09932.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In ICLR.

Brian Kulis and Michael I. Jordan. 2011. Revisiting k-means: New algorithms via Bayesian nonparametrics. In ICML.

Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, et al. 2021. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics.

Jaesong Lee and Shinji Watanabe. 2021. Intermediate loss regularization for CTC-based speech recognition. In ICASSP.

Tzu-Quan Lin, Hung-yi Lee, and Hao Tang. 2023. MelHuBERT: A simplified HuBERT on Mel spectrograms. In ASRU.

Shaoshi Ling and Yuzong Liu. 2020. DeCoAR 2.0: Deep contextualized acoustic representations with vector quantization. arXiv preprint arXiv:2012.06659.
Shaoshi Ling, Yuzong Liu, Julian Salazar, and Katrin Kirchhoff. 2020. Deep contextualized acoustic representations for semi-supervised speech recognition. In ICASSP.
Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, and Jim Glass. 2024. DinoSR: Self-distillation and online clustering for self-supervised speech representation learning. In NeurIPS.
John Makhoul. 1975. Linear prediction: A tutorial review. Proceedings of the IEEE.
Matthias Mauch and Simon Dixon. 2014. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In ICASSP.
David McAllester. 2018. Information theoretic co-training. arXiv preprint arXiv:1802.07572.
Ananya Misra, Dongseong Hwang, Zhouyuan Huo, Shefali Garg, Nikhil Siddhartha, Arun Narayanan, and Khe Chai Sim. 2021. A comparison of supervised and unsupervised pre-training of end-to-end models. In Interspeech.
Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman. 2020. VoxCeleb: Large-scale speaker verification in the wild. Computer Speech & Language.
Radford M. Neal and Geoffrey E. Hinton. 1998. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models. Springer.
Benjamin van Niekerk, Leanne Nortje, and Herman Kamper. 2020. Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge. In Interspeech.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. LibriSpeech: An ASR corpus based on public domain audio books. In ICASSP.
Douglas B. Paul and Janet Baker. 1992. The design for the Wall Street Journal-based CSR corpus. In Speech and Natural Language Workshop.
Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and Ronan Collobert. 2019. wav2letter++: A fast open-source speech recognition system. In ICASSP.
Rajesh P. N. Rao and Dana H. Ballard. 1999. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience.
S. Saito, F. Itakura, et al. 1967. Theoretical consideration of the statistical optimum recognition of the spectral density of speech. J. Acoust. Soc. Japan.
Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. 2019. wav2vec: Unsupervised pre-training for speech recognition. In Interspeech.
Jiatong Shi, Hirofumi Inaguma, Xutai Ma, Ilia Kulikov, and Anna Sun. 2023. Multi-resolution HuBERT: Multi-resolution speech self-supervised learning with masked unit prediction. In ICLR.
Noah A. Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In ACL.
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning structured output representation using deep conditional generative models. In NeurIPS.
Michael W. Spratling. 2017. A review of predictive coding algorithms. Brain and Cognition.
Mandyam Veerambudi Srinivasan, Simon Barry Laughlin, and Andreas Dubs. 1982. Predictive coding: A fresh view of inhibition in the retina. Proceedings of the Royal Society of London. Series B. Biological Sciences.
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. 2023. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. 2020. On layer normalization in the Transformer architecture. In ICML.
Hemant Yadav, Sunayana Sitaram, and Rajiv Ratn Shah. 2024. MS-HuBERT: Mitigating pre-training and inference mismatch in masked language modelling methods for learning speech representations. In Interspeech.
Gene-Ping Yang, Sung-Lin Yeh, Yu-An Chung, James Glass, and Hao Tang. 2022. Autoregressive predictive coding: A comprehensive study. IEEE Journal of Selected Topics in Signal Processing.
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee. 2021. SUPERB: Speech Processing Universal PERformance Benchmark. In Interspeech.
Sung-Lin Yeh and Hao Tang. 2022. Autoregressive co-training for learning discrete speech representations. In Interspeech.
Ruixiong Zhang, Haiwei Wu, Wubo Li, Dongwei Jiang, Wei Zou, and Xiangang Li. 2021. Transformer-based unsupervised pre-training for acoustic representation learning. In ICASSP.
Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, et al. 2022. BigSSL: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. IEEE Journal of Selected Topics in Signal Processing.
A Appendix

A.1 Negative Free Energy of Predictive Coding

The proof of equation 2 mirrors that of variational autoencoders (Kingma and Welling, 2014). We start with the KL divergence between the posterior distribution $p(z \mid x_A, x_B)$ and the auxiliary distribution $q(z \mid x_B)$:

$$\mathrm{KL}\big(q(z \mid x_B) \,\|\, p(z \mid x_A, x_B)\big) = \mathbb{E}_{z \sim q}\left[\log \frac{q(z \mid x_B)}{p(x_B \mid z)\, p(z \mid x_A)}\right] + \log p(x_B \mid x_A), \quad (17)$$

where we assume $x_B \perp\!\!\!\perp x_A \mid z$, i.e., $p(x_B \mid x_A, z) = p(x_B \mid z)$. Because the KL divergence is always nonnegative, we obtain

$$-\log p(x_B \mid x_A) \le \mathrm{KL}\big(q(z \mid x_B) \,\|\, p(z \mid x_A)\big) - \mathbb{E}_{z \sim q}\big[\log p(x_B \mid z)\big]. \quad (18)$$

A.2 Future Prediction

The proof of equation 11 involves unrolling $p(x_B, z \mid x_A)$, or $p(x_{\ge t}, z_{\ge t} \mid x_{<t})$, where $x_B = x_{\ge t}$ and $x_A = x_{<t}$. Based on the definition of conditional probability, we have

$$p(x_{\ge t}, z_{\ge t} \mid x_{<t}) = \prod_{i=t}^{T} p(x_i, z_i \mid x_{<i}, z_{t:i-1}) \quad (19)$$
$$= \prod_{i=t}^{T} p(x_i \mid x_{<i}, z_{t:i})\, p(z_i \mid x_{<i}, z_{t:i-1}) \quad (20)$$
$$= \prod_{i=t}^{T} p(x_i \mid z_i)\, p(z_i \mid x_{<i}), \quad (21)$$

where the last line makes two reasonable assumptions: $x_i \perp\!\!\!\perp z_{t:i-1} \mid z_i$ and $z_i \perp\!\!\!\perp z_{t:i-1} \mid x_{<i}$. The variable $z_i$ should represent $x_i$ well without relying on the previous latent variables $z_{t:i-1}$, and, given the history $x_{<i}$, computing the representation $z_i$ should not depend on the previous latent variables $z_{t:i-1}$.
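As a numerical sanity check on the bound in equation 18, the following toy sketch (ours, not part of the paper's implementation; it assumes only numpy) builds a random discrete model over a small latent alphabet, evaluates both sides of the inequality, and confirms that the negative log-likelihood never exceeds the variational upper bound:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # size of a toy discrete latent alphabet for z

# Condition on fixed observed values x_A and x_B, so the model reduces
# to distributions over z plus a likelihood value p(x_B | z) per code.
p_z_given_xA = rng.dirichlet(np.ones(K))    # p(z | x_A)
p_xB_given_z = rng.uniform(0.05, 0.95, K)   # p(x_B | z)
q_z_given_xB = rng.dirichlet(np.ones(K))    # auxiliary q(z | x_B)

# Marginal likelihood under the assumption p(x_B | x_A, z) = p(x_B | z):
# p(x_B | x_A) = sum_z p(x_B | z) p(z | x_A).
p_xB_given_xA = np.sum(p_xB_given_z * p_z_given_xA)

# Right-hand side of equation 18:
# KL(q(z | x_B) || p(z | x_A)) - E_{z~q}[log p(x_B | z)].
kl = np.sum(q_z_given_xB * np.log(q_z_given_xB / p_z_given_xA))
rhs = kl - np.sum(q_z_given_xB * np.log(p_xB_given_z))
lhs = -np.log(p_xB_given_xA)

print(f"lhs = {lhs:.4f}, rhs = {rhs:.4f}")
assert lhs <= rhs + 1e-12  # the bound holds for any choice of q
```

The gap between the two sides is exactly the KL divergence in equation 17, so the bound is tight when $q(z \mid x_B)$ matches the true posterior.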
A.3 Pre-Training Recipes

We set a maximum length of 1400 frames per utterance, corresponding to about 28 seconds. The learning rate is fixed to $10^{-4}$ under the Adam optimizer, and no learning rate schedule is applied. We pre-train BASE models on a single A40 GPU with a batch size of 16 for 150 epochs. The model dimension is 768, with the inner dimension of the feedforward networks being 3072. A dropout probability of 0.1 is applied.

For variational training, the temperature is set to 1 without annealing. In wav2vec 2.0, the similarity value is re-scaled by dividing it by 0.1, and the Gumbel-softmax temperature (Jang et al., 2017), i.e., the $\tau$ in equation 9, is annealed from 2 to a minimum of 0.5 with a per-step decay rate of 0.999995; a sketch of this schedule is given at the end of this subsection. We use 100 negative samples following Baevski et al. (2020b).

There are a few architectural differences between our setting and Baevski et al. (2020b). We employ a single codebook in our wav2vec 2.0 for quantization. Rather than Post-LN Transformers, we use Pre-LN Transformers for pre-training and remove the warm-up stage (Geiping and Goldstein, 2023; Xiong et al., 2020). We use sinusoidal positional embeddings rather than the relative positional embeddings used in Baevski et al. (2020b).
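To make the annealing schedule concrete, here is a minimal sketch assuming PyTorch; the helper name `tau_at_step` and the codebook size of 320 are ours, chosen for illustration:

```python
import torch
import torch.nn.functional as F

def tau_at_step(step: int, tau_start: float = 2.0,
                tau_min: float = 0.5, decay: float = 0.999995) -> float:
    """Per-step exponential decay of the Gumbel-softmax temperature,
    clamped at a minimum value (the schedule described above)."""
    return max(tau_min, tau_start * decay ** step)

# Toy example: logits over a codebook for a batch of 8 frames;
# the codebook size is illustrative, not from the paper.
logits = torch.randn(8, 320)

tau = tau_at_step(100_000)  # ~1.213 after 100k steps

# Soft, differentiable code assignments; hard=True would instead return
# one-hot codes in the forward pass with straight-through gradients.
codes = F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)
print(round(tau, 3), codes.shape)  # 1.213 torch.Size([8, 320])
```

With this decay rate, the temperature only reaches the floor of 0.5 after roughly 277k steps, so most of training operates with relatively soft assignments.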
A.4 Downstream Evaluation

We provide additional details on the experimental setups of the downstream tasks. The layers chosen for each task are noted in Table 10.

Phone Classification (PC) We freeze the pre-trained model and train only a linear layer with a learning rate of $10^{-3}$ for 10 epochs.

Speaker Verification (SV) We average frame representations to obtain utterance-level representations, and employ two linear layers to predict 1251 speakers following Fan et al. (2020). The linear classifier is optimized with a learning rate of $10^{-3}$ for 10 epochs. After training, we take the output of the first linear layer as speaker vectors for speaker verification. We set the dimension of the speaker vectors to 512.

F0 Tracking (F0) We train a linear regression layer with a learning rate of $10^{-3}$ for 10 epochs. We set the minimum and maximum frequencies to 50 Hz and 600 Hz, respectively.

Automatic Speech Recognition (ASR) We use a lightweight sequence-to-sequence model for downstream ASR. The encoder contains two convolutional layers with (32, 32) channels and (2, 1) strides, followed by a 4-layer, 256-dimensional bidirectional GRU. The decoder is a unidirectional 256-dimensional GRU. We adopt a fixed scheduled sampling probability of 0.4 during training. We use Adam with a learning rate of $10^{-4}$ for all seq2seq models. We employ a dropout rate of 0.2 and a label smoothing rate of 0.1 for regularization. We train the seq2seq models for 100 epochs, and then lower the learning rate by a factor of 0.1 for another 20 epochs. A sketch of this probe is given after Table 10.

          VQ-APC   Future-VPC   Masked-NCE   HuBERT   Masked-VPC
PC/ASR    8        8            9            8        8
f0        3        3            2            2        3
SV        4        3            3            3        3

Table 10: Layers selected for downstream experiments for each model variant.
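The following is a minimal sketch of the seq2seq ASR probe described above, assuming PyTorch. Only the channel counts, strides, GRU sizes, dropout, and the scheduled sampling probability come from the text; the class name, kernel sizes, vocabulary size, how conv channels are merged before the encoder, and the attention-free decoder conditioning are our illustrative guesses:

```python
import torch
import torch.nn as nn

class Seq2SeqProbe(nn.Module):
    """Sketch of the ASR probe: two conv layers with (32, 32) channels
    and (2, 1) time strides, a 4-layer 256-dim bidirectional GRU
    encoder, and a unidirectional 256-dim GRU decoder trained with
    scheduled sampling."""

    def __init__(self, feat_dim: int = 768, vocab: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=(2, 1), padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=(1, 1), padding=1), nn.ReLU(),
        )
        self.encoder = nn.GRU(32 * feat_dim, 256, num_layers=4,
                              batch_first=True, bidirectional=True,
                              dropout=0.2)
        self.embed = nn.Embedding(vocab, 256)
        # The decoder consumes the previous token embedding concatenated
        # with a time-averaged encoder summary at every step.
        self.decoder = nn.GRU(256 + 512, 256, batch_first=True)
        self.proj = nn.Linear(256, vocab)

    def forward(self, feats, tokens, sample_prob=0.4):
        # feats: (batch, time, feat_dim); tokens: (batch, length)
        x = self.conv(feats.unsqueeze(1))         # (B, 32, T', D)
        x = x.permute(0, 2, 1, 3).flatten(2)      # (B, T', 32 * D)
        enc, _ = self.encoder(x)                  # (B, T', 512)
        context = enc.mean(dim=1, keepdim=True)   # (B, 1, 512)

        logits, prev, state = [], tokens[:, :1], None
        for t in range(tokens.size(1)):
            inp = torch.cat([self.embed(prev), context], dim=-1)
            out, state = self.decoder(inp, state)
            step_logits = self.proj(out)          # (B, 1, vocab)
            logits.append(step_logits)
            # Scheduled sampling: with probability 0.4, feed the model's
            # own prediction instead of the ground-truth token.
            if torch.rand(()) < sample_prob:
                prev = step_logits.argmax(dim=-1)
            else:
                prev = tokens[:, t:t + 1]
        return torch.cat(logits, dim=1)

probe = Seq2SeqProbe()
feats = torch.randn(2, 100, 768)                  # frozen SSL features
tokens = torch.randint(0, 32, (2, 20))            # toy target tokens
print(probe(feats, tokens).shape)                 # torch.Size([2, 20, 32])
```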