跳至主要內容

Source codes analysis for Cornac BiVAE

Kevin 吴嘉文大约 13 分钟知识笔记Recommender systemIn English

The latent factor and matrix factorization models are limited to capture non-linear patterns of the user and latent spaces. However, VAE[1] solve this problem by finding the latent factor distribution after projecting User or Item Ratings, and using the latent factor distribution to generate the corresponding rating predictions.

While the architecture of VAE do not matched two-way nature of preference data. That drawback was solved by Bilateral Variational Autoencoder (BiVAE)[2], which what we are going to discuss in detail.

BiVAE

cornac bivae sourceopen in new window

Notation

R=(rui)R = (r_{ui}) The integer preference rating matrix for user u and item i.

ru:r_{u*}: the row in RR corresponding to user uu.

ri:r_{*i}: the column in RR corresponding to item ii.

θu,βi:\theta_u, \beta_i: latent factor for user and item. It is assume that they are follow normal distribution, that is: p(θi)=N(0,I)p({\theta}_i) = {N}({0},{I}) and p(βi)=N(0,I)p({\beta}_i) = {N}({0},{I}).

It is assume that Conditional on the latent variables, the observations are drawn from a univariate exponential family.

p(ruiθu,βi)=EXPFAM(rui;η(θu;βi;ω))=h(rui)exp{η(θu;βi;ω)ruia(η(θu;βi;ω))} \begin{aligned} p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right) &=\operatorname{EXPFAM}\left(r_{u i} ; \eta\left(\boldsymbol{\theta}_{u} ; \boldsymbol{\beta}_{i} ; \omega\right)\right) \\ &=h\left(r_{u i}\right) \exp \left\{\eta\left(\boldsymbol{\theta}_{u} ; \boldsymbol{\beta}_{i} ; \omega\right) r_{u i}-a\left(\eta\left(\boldsymbol{\theta}_{u} ; \boldsymbol{\beta}_{i} ; \omega\right)\right)\right\} \end{aligned}

By testing Poisson distribution, Gaussian distribution, Bernoulli distribution on self-designed task. The Poisson distribution outperforms the others, which is:

p(ruiθu,βi)=1rui!exp{ruiloggω(θu;βi)gω(θu;βi)} p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right)=\frac{1}{r_{u i} !} \exp \left\{r_{u i} \log g_{\omega}\left(\boldsymbol{\theta}_{u} ; \boldsymbol{\beta}_{i}\right)-g_{\omega}\left(\boldsymbol{\theta}_{u} ; \boldsymbol{\beta}_{i}\right)\right\}

Encoder

input: Rating matrix of user or item. Shape [batch_size, num_item] for user.

output: μ,σ\mu, \sigma for user of item. Shape [k,] for both.

Encoder inputs a batch of rur_{u*} and project them to μu,σu\mu_u,\sigma_u by the following:

  • Ex=a(WeTX)\begin{aligned} E_x = a(W_e ^T X) \end{aligned} W size [num_item, hidden_size]
  • μ=WμTEx\mu = W_{\mu}^T E_x W_mu size [hidden_size, k]
  • σ=sigmoid(WσTEx)\sigma = sigmoid(W^T_{\sigma}E_x) W_std size [hidden_size, k]

Where θ=N(μ,σ)\theta= \mathcal{N}(\mu,\sigma)

The first step of encoding might include multiply fully-connected layers. Similarly, we encode rir_{*i} to get μi,σi\mu_i,\sigma_i.

user_encoder_structure = [16,16]  # sample encoder structure, which include 2 16-node layers 
# line 68
self.user_encoder = nn.Sequential()
for i in range(len(user_encoder_structure) - 1):
self.user_encoder.add_module(
        "fc{}".format(i),
        nn.Linear(user_encoder_structure[i], user_encoder_structure[i + 1]),
    )
self.user_encoder.add_module("act{}".format(i), self.act_fn)
self.user_mu = nn.Linear(user_encoder_structure[-1], k)  # mu
self.user_std = nn.Linear(user_encoder_structure[-1], k)
# line 104
def encode_user(self, x):
	h = self.user_encoder(x)
	return self.user_mu(h), torch.sigmoid(self.user_std(h))

Sampling and decoding

input: μ,σ\mu, \sigma for user of item. Shape [k,] for both.

output: rating prediction of user or item. Shape [batch_size, num_item] for user.

  • θ=μ+σrandom()\theta = \mu + \sigma*random() sampling user factor[num_user,k]
    • during each epochs, user and item factors will be replaced with sampled factors
  • β=μ+σrandom()\beta = \mu+\sigma * random()sampling item factor[num_item,k]
  • r^u=θβT\hat r_u = \theta\beta^T user prediction
  • r^i=βθT\hat r_i=\beta\theta^T item prediction
def reparameterize(self, mu, std): # line 120
    eps = torch.randn_like(mu)
    return mu + eps * std

def decode_user(self, theta, beta): # line 112
    h = theta.mm(beta.t())
    return torch.sigmoid(h)

And the BiVAE network is a combination of Encoder and Decoder:

def forward(self, x, user=True, beta=None, theta=None):  # line 124
    if user:
        mu, std = self.encode_user(x)
        theta = self.reparameterize(mu, std)
        return theta, self.decode_user(theta, beta), mu, std
    else:
        mu, std = self.encode_item(x)
        beta = self.reparameterize(mu, std)
        return beta, self.decode_item(theta, beta), mu, std

Criterion Function

Inputs

x [batch_size, num_item] : Rating ground truth . shape=[batch_size, num_user] for item prediction x_ [x.shape] : rating prediction mu [k,] : μ\mu for user or item latent factor mu_prior [k,] : std [k,] : σ\sigma for user or item latent factor kl_beta Int : penalty weight of kl term, larger kl beta, larger penalty. default is 1.

The loss function is the Evidence Lower Bound (ELBO):

L=u,iEq(θuru)Eq(βiri)[logp(ruiθu,βi)]uKL(q(θuru)p(θu))iKL(q(βiri)p(βi)) \begin{aligned} \mathcal{L}=& \sum_{u, i} \mathbb{E}_{q\left(\theta_{u} \mid \mathbf{r}_{u *}\right)} \mathbb{E}_{q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right)}\left[\log p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right)\right] \\ &-\sum_{u} \operatorname{KL}\left(q\left(\boldsymbol{\theta}_{u} \mid \mathbf{r}_{u *}\right) \| p\left(\boldsymbol{\theta}_{u}\right)\right)-\sum_{i} \operatorname{KL}\left(q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right) \| p\left(\boldsymbol{\beta}_{i}\right)\right) \end{aligned}

Regarding to our assumption for output distribution, the term u,iEq(θuru)Eq(βiri)[logp(ruiθu,βi)]\sum_{u, i} \mathbb{E}_{q\left(\theta_{u} \mid \mathbf{r}_{u *}\right)} \mathbb{E}_{q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right)}\left[\log p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right)\right]

could be:

  • 1mim(yilog(y^i)y^i)-\frac1m \sum^m_i(y^ilog(\hat y^i)-\hat y^i) for Poisson.
  • 1mim(yiy^i)2-\frac 1{m}\sum^m_i(y^i-\hat y^i)^2 for Gaussian.
  • 1mim(yilog(y^i)+(1yi)log(1y^))-\frac1m\sum^m_i(y^i\log(\hat y^i)+(1-y^i)\log(1-\hat y)) for Bernoulli.

KL term will be derived as:

KL(q(βiri)p(βi))=log(σ1)12+σ2+(μ1μ2)2 \operatorname{KL}\left(q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right) \| p\left(\boldsymbol{\beta}_{i}\right)\right)=-log(\sigma_1)-\frac12+\sigma^2+(\mu_1-\mu_2)^2

Total Loss:

Luser=1mim(yuilog(y^ui)y^ui)log(σu1)12+σu12+(μu1μu2)2Litem=1mim(yiilog(y^ii)y^ii)log(σi1)12+σi12+(μi1μi2)2L=Luser+Litem L_{user=}-\frac1m \sum^m_i(y^i_ulog(\hat y^i_u)-\hat y^i_u)-log(\sigma_{u1})-\frac12+\sigma_{u1}^2+(\mu_{u1}-\mu_{u2})^2\\ L_{item} = -\frac1m \sum^m_i(y^i_ilog(\hat y^i_i)-\hat y^i_i)-log(\sigma_{i1})-\frac12+\sigma_{i1}^2+(\mu_{i1}-\mu_{i2})^2\\ L = L_{user} + L_{item}

The detail math behind the criterion function is explained [in the following](# Math of ELBO).

def loss(self, x, x_, mu, mu_prior, std, kl_beta):
    # Likelihood
    ll_choices = {
        "bern": x * torch.log(x_ + EPS) + (1 - x) * torch.log(1 - x_ + EPS),
        "gaus": -(x - x_) ** 2,
        "pois": x * torch.log(x_ + EPS) - x_,
    }

    ll = ll_choices.get(self.likelihood, None)
    if ll is None:
        raise ValueError("Supported likelihoods: {}".format(ll_choices.keys()))
	ll = torch.sum(ll, dim=1)

        # KL term
	kld = -0.5 * (1 + 2.0 * torch.log(std) - (mu - mu_prior).pow(2) - std.pow(2))
	kld = torch.sum(kld, dim=1)

    return torch.mean(kl_beta * kld - ll)

Optimizing with out cap priors

Two optimizer are used for user and item:

u_optimizer = torch.optim.Adam(params=user_params, lr=learn_rate)
i_optimizer = torch.optim.Adam(params=item_params, lr=learn_rate)

An interesting operation before training is that, the cornac BiVAE map all ratings to 1 and 0 for item and user without rating. That is the ground truth of rating with 5 and rating with 1 will be both mapped to 1.

the train_set.matrix is CSR matrix, which is both time and space efficient.

x = train_set.matrix.copy()
x.data = np.ones_like(x.data)  # Binarize data
tx = x.transpose()

Since sometimes, rating with 1 indicates that the user do not like the item, if we define the ground truth as 1 if the user like it. This labeling approach might not be suitable. Therefore, one approach mentioned by the paper is to remain items with at least 5 rating.

Referred to the optimizing step, β,μ\beta, \mu are recomputed and updated after gradient step. Since the beta and mu will be used in optimizing user latent factors.

i_batch = tx[i_ids, :]
i_batch = i_batch.A  # transfer csr to np.narray
i_batch = torch.tensor(i_batch, dtype=dtype, device=device)

# Reconstructed batch
beta, i_batch_, i_mu, i_std = bivae(i_batch, user=False, theta=bivae.theta)

i_mu_prior = 0.0  # zero mean for standard normal prior if not CAP prior
i_loss = bivae.loss(i_batch, i_batch_, i_mu, i_mu_prior, i_std, beta_kl)
i_optimizer.zero_grad()
i_loss.backward()
i_optimizer.step()

i_sum_loss += i_loss.data.item()
i_count += len(i_batch)

beta, _, i_mu, _ = bivae(i_batch, user=False, theta=bivae.theta)

bivae.beta.data[i_ids] = beta.data
bivae.mu_beta.data[i_ids] = i_mu.data

After optimizing user latent factor, which is almost same as above, the μ\mu will be updated again since we have a new user factor θ\theta.

for i_ids in train_set.item_iter(batch_size, shuffle=False):
    i_batch = tx[i_ids, :]
    i_batch = i_batch.A
    i_batch = torch.tensor(i_batch, dtype=dtype, device=device)
    beta, _, i_mu, _ = bivae(i_batch, user=False, theta=bivae.theta)
    bivae.mu_beta.data[i_ids] = i_mu.data

Optimizing with cap priors

The Constrained Adaptive Prior (CAP) is introduced in the paper to solve posterior collapse issue. The parameter of CAP is computed from side information using VAE. For example, the user review, also-Viewed information are project into 20-dimensional embeddings using VAE, which are used as the fixed parameter for user factor prior. And the item factor prior parameters could be computed using the same approach.

# when model init
if bivae.cap_priors.get("user", False):
    user_params = it.chain(user_params, bivae.user_prior_encoder.parameters())
    user_features = train_set.user_feature.features[: train_set.num_users]

# ...
# during training
if bivae.cap_priors.get("user", False):
	u_batch_f = user_features[u_ids]
    u_batch_f = torch.tensor(u_batch_f, dtype=dtype, device=device)
    u_mu_prior = bivae.encode_user_prior(u_batch_f)

As shown above, the CAP will be used to force the prior being a standard Gaussian.

Math of ELBO

Inference Model. The starting point of VB is to introduce a tractable inference model qq, governed by a set of variational parameters ν\nu, which will be used as a proxy for the true but intractable posterior. A variational distribution, which breaks the coupling between β\beta and θ\theta - a main source of intractability in our model, is chosen as:

q(θ1:U,β1:IR)=q(θ1:UR)q(β1:IR) q\left(\boldsymbol{\theta}_{1: U}, \boldsymbol{\beta}_{1: I} \mid \mathbf{R}\right)=q\left(\boldsymbol{\theta}_{1: U} \mid \mathbf{R}\right) q\left(\boldsymbol{\beta}_{1: I} \mid \mathbf{R}\right)

with

q(θ1:U=uq(θuRu)q(β1:IR)=iq(βiR) \begin{aligned} q\left(\boldsymbol{\theta}_{1}: U\right.&=\prod_{u} q\left(\boldsymbol{\theta}_{u} \mid \mathbf{R}_{u} *\right) \\ q\left(\boldsymbol{\beta}_{1: I} \mid \mathbf{R}\right) &=\prod_{i} q\left(\boldsymbol{\beta}_{i} \mid \mathbf{R}_{*}\right) \end{aligned}

Without loss of generality, the following forms are adopted:

q(θuRu)=N(μ~ψ~(Ru),σ~ψ~(Ru))q(βiRi)=N(μ~ϕ~(R),σ~ϕ~(R)) \begin{aligned} q\left(\boldsymbol{\theta}_{u} \mid \mathbf{R}_{u} *\right) &=\mathcal{N}\left(\tilde{\boldsymbol{\mu}}_{\tilde{\psi}}\left(\mathbf{R}_{u} *\right), \tilde{\boldsymbol{\sigma}}_{\tilde{\psi}}\left(\mathbf{R}_{u} *\right)\right) \\ q\left(\boldsymbol{\beta}_{i} \mid \mathbf{R}_{*} i\right) &=\mathcal{N}\left(\tilde{\boldsymbol{\mu}}_{\tilde{\phi}}\left(\mathbf{R}_{*}\right), \tilde{\boldsymbol{\sigma}}_{\tilde{\phi}}\left(\mathbf{R}_{*}\right)\right) \end{aligned}

where , and are vector-valued functions (e.g., multilayer perceptrons) parameterized by /, outputting respectively the mean and covariance parameters of the variational distributions.

With in place, we can proceed with approximate inference by optimizing the Evidence Lower BOund (ELBO), w.r.t. the model and variational parameters, given in our case by,

L=u,iEq(θuru)Eq(βiri)[logp(ruiθu,βi)]uKL(q(θuru)p(θu))iKL(q(βiri)p(βi)) \begin{aligned} \mathcal{L}=& \sum_{u, i} \mathbb{E}_{q\left(\theta_{u} \mid \mathbf{r}_{u *}\right)} \mathbb{E}_{q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right)}\left[\log p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right)\right] \\ &-\sum_{u} \operatorname{KL}\left(q\left(\boldsymbol{\theta}_{u} \mid \mathbf{r}_{u *}\right) \| p\left(\boldsymbol{\theta}_{u}\right)\right)-\sum_{i} \operatorname{KL}\left(q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right) \| p\left(\boldsymbol{\beta}_{i}\right)\right) \end{aligned}

As we assuming p(ruiθu,βi)p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right) follows Poisson Distribution, and p(θu)=N(0,I)p(\bf{\theta}_u) = \mathcal{N}(\bf{0},\bf{I}) , p(βi)=N(0,I)p(\bf{\beta}_i) = \mathcal{N}(\bf{0},\bf{I}) , we could derive the KL for two Gaussian as:

KL(N(μ1,σ12)N(μ2,σ22))=x12πσ1e(xμ1)22σ12log12πσ1e(xμ1)22σ1212πσ2e(xμ2)22σ22dx=x12πσ1e(xμ1)22σ12[logσ2σ1(xμ1)22σ12+(xμ2)22σ22]dx \begin{aligned} K L\left(\mathcal{N}\left(\mu_{1}, \sigma_{1}^{2}\right) \| \mathcal{N}\left(\mu_{2}, \sigma_{2}^{2}\right)\right) &=\int_{x} \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}} \log \frac{\frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}}}{\frac{1}{\sqrt{2 \pi} \sigma_{2}} e^{-\frac{\left(x-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}}} d x \\ &=\int_{x} \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}}\left[\log \frac{\sigma_{2}}{\sigma_{1}}-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}+\frac{\left(x-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}\right] d x \end{aligned}

while

logσ2σ1x12πσ1e(xμ1)22σ12dx=logσ2σ1(4) \log \frac{\sigma_{2}}{\sigma_{1}} \int_{x} \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}} d x=\log \frac{\sigma_{2}}{\sigma_{1}}\tag 4\\

12σ12x(xμ1)212πσ1e(xμ1)22σ12dx=12σ12σ12=12(5) -\frac{1}{2 \sigma_{1}^{2}} \int_{x}\left(x-\mu_{1}\right)^{2} \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}} d x=-\frac{1}{2 \sigma_{1}^{2}} \sigma_{1}^{2}=-\frac{1}{2} \tag 5 \\

12σ22x(xμ2)212πσ1e(xμ1)22σ12dx=12σ22x(x22μ2x+μ22)12πσ1e(xμ1)22σ12dx=σ12+μ122μ1μ2+μ222σ22=σ12+(μ1μ2)22σ22(6) \begin{aligned} \frac{1}{2 \sigma_{2}^{2}} \int_{x}\left(x-\mu_{2}\right)^{2} \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}} d x &=\frac{1}{2 \sigma_{2}^{2}} \int_{x}\left(x^{2}-2 \mu_{2} x+\mu_{2}^{2}\right) \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}} d x\\ &=\frac{\sigma_{1}^{2}+\mu_{1}^{2}-2 \mu_{1} \mu_{2}+\mu_{2}^{2}}{2 \sigma_{2}^{2}}=\frac{\sigma_{1}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}} \end{aligned} \tag6

Combine above:

KL(N(μ1,σ12)N(μ2,σ22))=logσ2σ112+σ12+(μ1μ2)22σ22 K L\left(\mathcal{N}\left(\mu_{1}, \sigma_{1}^{2}\right) \| \mathcal{N}\left(\mu_{2}, \sigma_{2}^{2}\right)\right)=\log \frac{\sigma_{2}}{\sigma_{1}}-\frac{1}{2}+\frac{\sigma_{1}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}

KL(q(βiri)p(βi))=log(σ1)12+σ2+μ12(6) \operatorname{KL}\left(q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right) \| p\left(\boldsymbol{\beta}_{i}\right)\right)=-log(\sigma_1)-\frac12+\sigma^2+\mu_1^2\tag6

While Constrained Adaptive Priors (CAP) is used to push the posteriors to learn from observations R. In this case p(θu)=N(0,I)p(\bf{\theta}_u) = \mathcal{N}(\bf{0},\bf{I}) , p(βi)=N(0,I)p(\bf{\beta}_i) = \mathcal{N}(\bf{0},\bf{I}) do not hold. Therefore KL divergence (6)(6) should be:

KL(q(βiri)p(βi))=log(σ1)12+σ2+(μ1μ2)2 \operatorname{KL}\left(q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right) \| p\left(\boldsymbol{\beta}_{i}\right)\right)=-log(\sigma_1)-\frac12+\sigma^2+(\mu_1-\mu_2)^2

the objective function could be written as:

Luser=1mim(yuilog(y^ui)y^ui)log(σu1)12+σu12+(μu1μu2)2Litem=1mim(yiilog(y^ii)y^ii)log(σi1)12+σi12+(μi1μi2)2L=Luser+Litem L_{user=}-\frac1m \sum^m_i(y^i_ulog(\hat y^i_u)-\hat y^i_u)-log(\sigma_{u1})-\frac12+\sigma_{u1}^2+(\mu_{u1}-\mu_{u2})^2\\ L_{item} = -\frac1m \sum^m_i(y^i_ilog(\hat y^i_i)-\hat y^i_i)-log(\sigma_{i1})-\frac12+\sigma_{i1}^2+(\mu_{i1}-\mu_{i2})^2\\ L = L_{user} + L_{item}

Reference

  1. Liang, Dawen, et al. "Variational autoencoders for collaborative filtering." Proceedings of the 2018 World Wide Web Conference. 2018.

  2. TRUONG, Quoc Tuan; SALAH, Aghiles; and LAUW, Hady W.. Bilateral variational autoencoder for collaborative filtering. (2021). WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual, March 8-12. 292-300. Research Collection School Of Computing and Information Systems.open in new window

  3. Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapseopen in new window

  4. Z-Forcing: Training Stochastic Recurrent Networksopen in new window

  5. Adji B. Dieng, Yoon Kim, Alexander M. Rush, David M. Blei, "Avoiding Latent Variable Collapse with Generative Skip Models" 2019open in new window

  6. (zhihu)变分推断基础中的基础open in new window

  7. 高斯 KL 散度open in new window

  8. https://github.com/microsoft/recommenders/blob/main/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb

上次编辑于:
贡献者: kevinng77