Source codes analysis for Cornac BiVAE

Kevin 吴嘉文大约 13 分钟

The latent factor and matrix factorization models are limited to capture non-linear patterns of the user and latent spaces. However, VAE[1] solve this problem by finding the latent factor distribution after projecting User or Item Ratings, and using the latent factor distribution to generate the corresponding rating predictions.

While the architecture of VAE do not matched two-way nature of preference data. That drawback was solved by Bilateral Variational Autoencoder (BiVAE)[2], which what we are going to discuss in detail.

BiVAE

cornac bivae sourceopen in new window

Notation

$R = (r_{ui})$ The integer preference rating matrix for user u and item i.
$r_{u*}:$ the row in $R$ corresponding to user $u$ .
$r_{*i}:$ the column in $R$ corresponding to item $i$ .
$\theta_u, \beta_i:$ latent factor for user and item. It is assume that they are follow normal distribution, that is: $p({\theta}_i) = {N}({0},{I})$ and $p({\beta}_i) = {N}({0},{I})$ .

It is assume that Conditional on the latent variables, the observations are drawn from a univariate exponential family.

\begin{aligned} p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right) &=\operatorname{EXPFAM}\left(r_{u i} ; \eta\left(\boldsymbol{\theta}_{u} ; \boldsymbol{\beta}_{i} ; \omega\right)\right) \\ &=h\left(r_{u i}\right) \exp \left\{\eta\left(\boldsymbol{\theta}_{u} ; \boldsymbol{\beta}_{i} ; \omega\right) r_{u i}-a\left(\eta\left(\boldsymbol{\theta}_{u} ; \boldsymbol{\beta}_{i} ; \omega\right)\right)\right\} \end{aligned}

By testing Poisson distribution, Gaussian distribution, Bernoulli distribution on self-designed task. The Poisson distribution outperforms the others, which is:

p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right)=\frac{1}{r_{u i} !} \exp \left\{r_{u i} \log g_{\omega}\left(\boldsymbol{\theta}_{u} ; \boldsymbol{\beta}_{i}\right)-g_{\omega}\left(\boldsymbol{\theta}_{u} ; \boldsymbol{\beta}_{i}\right)\right\}

Encoder

input: Rating matrix of user or item. Shape [batch_size, num_item] for user.
output: $\mu, \sigma$ for user of item. Shape [k,] for both.

Encoder inputs a batch of $r_{u*}$ and project them to $\mu_u,\sigma_u$ by the following:

$\begin{aligned} E_x = a(W_e ^T X) \end{aligned}$ W size [num_item, hidden_size]
$\mu = W_{\mu}^T E_x$ W_mu size [hidden_size, k]
$\sigma = sigmoid(W^T_{\sigma}E_x)$ W_std size [hidden_size, k]

Where $\theta= \mathcal{N}(\mu,\sigma)$

The first step of encoding might include multiply fully-connected layers. Similarly, we encode $r_{*i}$ to get $\mu_i,\sigma_i$ .

user_encoder_structure = [16,16]  # sample encoder structure, which include 2 16-node layers 
# line 68
self.user_encoder = nn.Sequential()
for i in range(len(user_encoder_structure) - 1):
self.user_encoder.add_module(
        "fc{}".format(i),
        nn.Linear(user_encoder_structure[i], user_encoder_structure[i + 1]),
    )
self.user_encoder.add_module("act{}".format(i), self.act_fn)
self.user_mu = nn.Linear(user_encoder_structure[-1], k)  # mu
self.user_std = nn.Linear(user_encoder_structure[-1], k)
# line 104
def encode_user(self, x):
	h = self.user_encoder(x)
	return self.user_mu(h), torch.sigmoid(self.user_std(h))

Sampling and decoding

input: $\mu, \sigma$ for user of item. Shape [k,] for both.
output: rating prediction of user or item. Shape [batch_size, num_item] for user.

$\theta = \mu + \sigma*random()$ $θ = μ + σ * r an d o m ()$ sampling user factor[num_user,k]
- during each epochs, user and item factors will be replaced with sampled factors
$\beta = \mu+\sigma * random()$ sampling item factor[num_item,k]
$\hat r_u = \theta\beta^T$ user prediction
$\hat r_i=\beta\theta^T$ item prediction

def reparameterize(self, mu, std): # line 120
    eps = torch.randn_like(mu)
    return mu + eps * std

def decode_user(self, theta, beta): # line 112
    h = theta.mm(beta.t())
    return torch.sigmoid(h)

And the BiVAE network is a combination of Encoder and Decoder:

def forward(self, x, user=True, beta=None, theta=None):  # line 124
    if user:
        mu, std = self.encode_user(x)
        theta = self.reparameterize(mu, std)
        return theta, self.decode_user(theta, beta), mu, std
    else:
        mu, std = self.encode_item(x)
        beta = self.reparameterize(mu, std)
        return beta, self.decode_item(theta, beta), mu, std

Criterion Function

Inputs
x [batch_size, num_item] : Rating ground truth . shape=[batch_size, num_user] for item prediction x_ [x.shape] : rating prediction mu [k,] : $\mu$ for user or item latent factor mu_prior [k,] : std [k,] : $\sigma$ for user or item latent factor kl_beta Int : penalty weight of kl term, larger kl beta, larger penalty. default is 1.

The loss function is the Evidence Lower Bound (ELBO):

\begin{aligned} \mathcal{L}=& \sum_{u, i} \mathbb{E}_{q\left(\theta_{u} \mid \mathbf{r}_{u *}\right)} \mathbb{E}_{q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right)}\left[\log p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right)\right] \\ &-\sum_{u} \operatorname{KL}\left(q\left(\boldsymbol{\theta}_{u} \mid \mathbf{r}_{u *}\right) \| p\left(\boldsymbol{\theta}_{u}\right)\right)-\sum_{i} \operatorname{KL}\left(q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right) \| p\left(\boldsymbol{\beta}_{i}\right)\right) \end{aligned}

Regarding to our assumption for output distribution, the term $\sum_{u, i} \mathbb{E}_{q\left(\theta_{u} \mid \mathbf{r}_{u *}\right)} \mathbb{E}_{q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right)}\left[\log p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right)\right]$

could be:

$-\frac1m \sum^m_i(y^ilog(\hat y^i)-\hat y^i)$ for Poisson.
$-\frac 1{m}\sum^m_i(y^i-\hat y^i)^2$ for Gaussian.
$-\frac1m\sum^m_i(y^i\log(\hat y^i)+(1-y^i)\log(1-\hat y))$ for Bernoulli.

KL term will be derived as:

\operatorname{KL}\left(q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right) \| p\left(\boldsymbol{\beta}_{i}\right)\right)=-log(\sigma_1)-\frac12+\sigma^2+(\mu_1-\mu_2)^2

Total Loss:

L_{user=}-\frac1m \sum^m_i(y^i_ulog(\hat y^i_u)-\hat y^i_u)-log(\sigma_{u1})-\frac12+\sigma_{u1}^2+(\mu_{u1}-\mu_{u2})^2\\ L_{item} = -\frac1m \sum^m_i(y^i_ilog(\hat y^i_i)-\hat y^i_i)-log(\sigma_{i1})-\frac12+\sigma_{i1}^2+(\mu_{i1}-\mu_{i2})^2\\ L = L_{user} + L_{item}

The detail math behind the criterion function is explained [in the following](# Math of ELBO).

def loss(self, x, x_, mu, mu_prior, std, kl_beta):
    # Likelihood
    ll_choices = {
        "bern": x * torch.log(x_ + EPS) + (1 - x) * torch.log(1 - x_ + EPS),
        "gaus": -(x - x_) ** 2,
        "pois": x * torch.log(x_ + EPS) - x_,
    }

    ll = ll_choices.get(self.likelihood, None)
    if ll is None:
        raise ValueError("Supported likelihoods: {}".format(ll_choices.keys()))
	ll = torch.sum(ll, dim=1)

        # KL term
	kld = -0.5 * (1 + 2.0 * torch.log(std) - (mu - mu_prior).pow(2) - std.pow(2))
	kld = torch.sum(kld, dim=1)

    return torch.mean(kl_beta * kld - ll)

Optimizing with out cap priors

Two optimizer are used for user and item:

u_optimizer = torch.optim.Adam(params=user_params, lr=learn_rate)
i_optimizer = torch.optim.Adam(params=item_params, lr=learn_rate)

An interesting operation before training is that, the cornac BiVAE map all ratings to 1 and 0 for item and user without rating. That is the ground truth of rating with 5 and rating with 1 will be both mapped to 1.

the train_set.matrix is CSR matrix, which is both time and space efficient.

x = train_set.matrix.copy()
x.data = np.ones_like(x.data)  # Binarize data
tx = x.transpose()

Since sometimes, rating with 1 indicates that the user do not like the item, if we define the ground truth as 1 if the user like it. This labeling approach might not be suitable. Therefore, one approach mentioned by the paper is to remain items with at least 5 rating.

Referred to the optimizing step, $\beta, \mu$ are recomputed and updated after gradient step. Since the beta and mu will be used in optimizing user latent factors.

i_batch = tx[i_ids, :]
i_batch = i_batch.A  # transfer csr to np.narray
i_batch = torch.tensor(i_batch, dtype=dtype, device=device)

# Reconstructed batch
beta, i_batch_, i_mu, i_std = bivae(i_batch, user=False, theta=bivae.theta)

i_mu_prior = 0.0  # zero mean for standard normal prior if not CAP prior
i_loss = bivae.loss(i_batch, i_batch_, i_mu, i_mu_prior, i_std, beta_kl)
i_optimizer.zero_grad()
i_loss.backward()
i_optimizer.step()

i_sum_loss += i_loss.data.item()
i_count += len(i_batch)

beta, _, i_mu, _ = bivae(i_batch, user=False, theta=bivae.theta)

bivae.beta.data[i_ids] = beta.data
bivae.mu_beta.data[i_ids] = i_mu.data

After optimizing user latent factor, which is almost same as above, the $\mu$ will be updated again since we have a new user factor $\theta$ .

for i_ids in train_set.item_iter(batch_size, shuffle=False):
    i_batch = tx[i_ids, :]
    i_batch = i_batch.A
    i_batch = torch.tensor(i_batch, dtype=dtype, device=device)
    beta, _, i_mu, _ = bivae(i_batch, user=False, theta=bivae.theta)
    bivae.mu_beta.data[i_ids] = i_mu.data

Optimizing with cap priors

The Constrained Adaptive Prior (CAP) is introduced in the paper to solve posterior collapse issue. The parameter of CAP is computed from side information using VAE. For example, the user review, also-Viewed information are project into 20-dimensional embeddings using VAE, which are used as the fixed parameter for user factor prior. And the item factor prior parameters could be computed using the same approach.

# when model init
if bivae.cap_priors.get("user", False):
    user_params = it.chain(user_params, bivae.user_prior_encoder.parameters())
    user_features = train_set.user_feature.features[: train_set.num_users]

# ...
# during training
if bivae.cap_priors.get("user", False):
	u_batch_f = user_features[u_ids]
    u_batch_f = torch.tensor(u_batch_f, dtype=dtype, device=device)
    u_mu_prior = bivae.encode_user_prior(u_batch_f)

As shown above, the CAP will be used to force the prior being a standard Gaussian.

Math of ELBO

Inference Model. The starting point of VB is to introduce a tractable inference model $q$ , governed by a set of variational parameters $\nu$ , which will be used as a proxy for the true but intractable posterior. A variational distribution, which breaks the coupling between $\beta$ and $\theta$ - a main source of intractability in our model, is chosen as:

q\left(\boldsymbol{\theta}_{1: U}, \boldsymbol{\beta}_{1: I} \mid \mathbf{R}\right)=q\left(\boldsymbol{\theta}_{1: U} \mid \mathbf{R}\right) q\left(\boldsymbol{\beta}_{1: I} \mid \mathbf{R}\right)

with

\begin{aligned} q\left(\boldsymbol{\theta}_{1}: U\right.&=\prod_{u} q\left(\boldsymbol{\theta}_{u} \mid \mathbf{R}_{u} *\right) \\ q\left(\boldsymbol{\beta}_{1: I} \mid \mathbf{R}\right) &=\prod_{i} q\left(\boldsymbol{\beta}_{i} \mid \mathbf{R}_{*}\right) \end{aligned}

Without loss of generality, the following forms are adopted:

\begin{aligned} q\left(\boldsymbol{\theta}_{u} \mid \mathbf{R}_{u} *\right) &=\mathcal{N}\left(\tilde{\boldsymbol{\mu}}_{\tilde{\psi}}\left(\mathbf{R}_{u} *\right), \tilde{\boldsymbol{\sigma}}_{\tilde{\psi}}\left(\mathbf{R}_{u} *\right)\right) \\ q\left(\boldsymbol{\beta}_{i} \mid \mathbf{R}_{*} i\right) &=\mathcal{N}\left(\tilde{\boldsymbol{\mu}}_{\tilde{\phi}}\left(\mathbf{R}_{*}\right), \tilde{\boldsymbol{\sigma}}_{\tilde{\phi}}\left(\mathbf{R}_{*}\right)\right) \end{aligned}

where $\nu=\{\tilde\phi,\tilde\psi\}$ , $\tilde{\bf\mu}(\cdot)$ and $\tilde{\bf\sigma}(\cdot)$ are vector-valued functions (e.g., multilayer perceptrons) parameterized by $\tilde\phi$ / $\tilde\psi$ , outputting respectively the mean and covariance parameters of the variational distributions.

With $q$ in place, we can proceed with approximate inference by optimizing the Evidence Lower BOund (ELBO), w.r.t. the model $\omega$ and variational $\nu$ parameters, given in our case by,

\begin{aligned} \mathcal{L}=& \sum_{u, i} \mathbb{E}_{q\left(\theta_{u} \mid \mathbf{r}_{u *}\right)} \mathbb{E}_{q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right)}\left[\log p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right)\right] \\ &-\sum_{u} \operatorname{KL}\left(q\left(\boldsymbol{\theta}_{u} \mid \mathbf{r}_{u *}\right) \| p\left(\boldsymbol{\theta}_{u}\right)\right)-\sum_{i} \operatorname{KL}\left(q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right) \| p\left(\boldsymbol{\beta}_{i}\right)\right) \end{aligned}

As we assuming $p\left(r_{u i} \mid \boldsymbol{\theta}_{u}, \boldsymbol{\beta}_{i}\right)$ follows Poisson Distribution, and $p(\bf{\theta}_u) = \mathcal{N}(\bf{0},\bf{I})$ , $p(\bf{\beta}_i) = \mathcal{N}(\bf{0},\bf{I})$ , we could derive the KL for two Gaussian as:

\begin{aligned} K L\left(\mathcal{N}\left(\mu_{1}, \sigma_{1}^{2}\right) \| \mathcal{N}\left(\mu_{2}, \sigma_{2}^{2}\right)\right) &=\int_{x} \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}} \log \frac{\frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}}}{\frac{1}{\sqrt{2 \pi} \sigma_{2}} e^{-\frac{\left(x-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}}} d x \\ &=\int_{x} \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}}\left[\log \frac{\sigma_{2}}{\sigma_{1}}-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}+\frac{\left(x-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}\right] d x \end{aligned}

while

\log \frac{\sigma_{2}}{\sigma_{1}} \int_{x} \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}} d x=\log \frac{\sigma_{2}}{\sigma_{1}}\tag 4\\

-\frac{1}{2 \sigma_{1}^{2}} \int_{x}\left(x-\mu_{1}\right)^{2} \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}} d x=-\frac{1}{2 \sigma_{1}^{2}} \sigma_{1}^{2}=-\frac{1}{2} \tag 5 \\

\begin{aligned} \frac{1}{2 \sigma_{2}^{2}} \int_{x}\left(x-\mu_{2}\right)^{2} \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}} d x &=\frac{1}{2 \sigma_{2}^{2}} \int_{x}\left(x^{2}-2 \mu_{2} x+\mu_{2}^{2}\right) \frac{1}{\sqrt{2 \pi} \sigma_{1}} e^{-\frac{\left(x-\mu_{1}\right)^{2}}{2 \sigma_{1}^{2}}} d x\\ &=\frac{\sigma_{1}^{2}+\mu_{1}^{2}-2 \mu_{1} \mu_{2}+\mu_{2}^{2}}{2 \sigma_{2}^{2}}=\frac{\sigma_{1}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}} \end{aligned} \tag6

Combine above:

K L\left(\mathcal{N}\left(\mu_{1}, \sigma_{1}^{2}\right) \| \mathcal{N}\left(\mu_{2}, \sigma_{2}^{2}\right)\right)=\log \frac{\sigma_{2}}{\sigma_{1}}-\frac{1}{2}+\frac{\sigma_{1}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}{2 \sigma_{2}^{2}}

\operatorname{KL}\left(q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right) \| p\left(\boldsymbol{\beta}_{i}\right)\right)=-log(\sigma_1)-\frac12+\sigma^2+\mu_1^2\tag6

While Constrained Adaptive Priors (CAP) is used to push the posteriors to learn from observations R. In this case $p(\bf{\theta}_u) = \mathcal{N}(\bf{0},\bf{I})$ , $p(\bf{\beta}_i) = \mathcal{N}(\bf{0},\bf{I})$ do not hold. Therefore KL divergence $(6)$ should be:

\operatorname{KL}\left(q\left(\boldsymbol{\beta}_{i} \mid \mathbf{r}_{* i}\right) \| p\left(\boldsymbol{\beta}_{i}\right)\right)=-log(\sigma_1)-\frac12+\sigma^2+(\mu_1-\mu_2)^2

the objective function could be written as:

L_{user=}-\frac1m \sum^m_i(y^i_ulog(\hat y^i_u)-\hat y^i_u)-log(\sigma_{u1})-\frac12+\sigma_{u1}^2+(\mu_{u1}-\mu_{u2})^2\\ L_{item} = -\frac1m \sum^m_i(y^i_ilog(\hat y^i_i)-\hat y^i_i)-log(\sigma_{i1})-\frac12+\sigma_{i1}^2+(\mu_{i1}-\mu_{i2})^2\\ L = L_{user} + L_{item}

Reference

Liang, Dawen, et al. "Variational autoencoders for collaborative filtering." Proceedings of the 2018 World Wide Web Conference. 2018.
TRUONG, Quoc Tuan; SALAH, Aghiles; and LAUW, Hady W.. Bilateral variational autoencoder for collaborative filtering. (2021). WSDM '21: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, Virtual, March 8-12. 292-300. Research Collection School Of Computing and Information Systems.open in new window
Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapseopen in new window
Z-Forcing: Training Stochastic Recurrent Networksopen in new window
Adji B. Dieng, Yoon Kim, Alexander M. Rush, David M. Blei, "Avoiding Latent Variable Collapse with Generative Skip Models" 2019open in new window
(zhihu)变分推断基础中的基础open in new window
高斯 KL 散度open in new window
https://github.com/microsoft/recommenders/blob/main/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb

Source codes analysis for Cornac BiVAE

# BiVAE

# Notation

# Encoder

# Sampling and decoding

# Criterion Function

# Optimizing with out cap priors

# Optimizing with cap priors

# Math of ELBO

# Reference