Source Code Analysis of Cornac BiVAE
Latent factor and matrix factorization models are limited in their ability to capture non-linear patterns in the user and item latent spaces. VAE [1] addresses this problem by projecting user or item ratings onto a latent factor distribution, and then using that distribution to generate the corresponding rating predictions.
However, the architecture of VAE does not match the two-way nature of preference data. That drawback is addressed by the Bilateral Variational Autoencoder (BiVAE) [2], which is what we are going to discuss in detail.
BiVAE
Notation
- $R$: the integer preference rating matrix over users $u$ and items $i$.
- $r_u$: the row in $R$ corresponding to user $u$.
- $r_i$: the column in $R$ corresponding to item $i$.
- $\theta_u$ and $\beta_i$: latent factors for user $u$ and item $i$. They are assumed to follow normal distributions, that is, $\theta_u \sim \mathcal{N}(0, I_K)$ and $\beta_i \sim \mathcal{N}(0, I_K)$.

Conditional on the latent variables, the observations are assumed to be drawn from a univariate exponential family whose parameter is a function of $\theta_u^{\top}\beta_i$.
Testing the Poisson, Gaussian, and Bernoulli distributions on a self-designed task shows that the Poisson distribution outperforms the others, which is:

$$p(r_{ui} \mid \theta_u, \beta_i) = \frac{(\theta_u^{\top}\beta_i)^{r_{ui}} \exp(-\theta_u^{\top}\beta_i)}{r_{ui}!}$$
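To make this concrete, the toy check below (illustrative numbers, not from the Cornac source) verifies that `torch.distributions.Poisson` gives the log-probability $r_{ui}\log(\theta_u^{\top}\beta_i) - \theta_u^{\top}\beta_i - \log(r_{ui}!)$, whose non-constant part is exactly the Poisson term used later in the criterion function.

```python
import torch

# Toy numbers (not from the Cornac source): one user/item pair, k = 3, rating r = 4.
theta_u = torch.tensor([0.5, 1.0, 0.2])
beta_i = torch.tensor([1.2, 0.8, 0.5])
r = torch.tensor(4.0)

rate = theta_u @ beta_i  # Poisson rate = theta_u^T beta_i
manual_ll = r * torch.log(rate) - rate - torch.lgamma(r + 1)  # r*log(rate) - rate - log(r!)
torch_ll = torch.distributions.Poisson(rate).log_prob(r)

print(manual_ll.item(), torch_ll.item())  # the two values agree
```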
Encoder
- input: rating matrix of users or items, of shape `[batch_size, num_item]` for users.
- output: $\mu$ and $\sigma$ for a user or item, of shape `[k,]` for both.
The encoder takes a batch of $r_u$ and projects it to $(\mu_u, \sigma_u)$ as follows:

- $h = \mathrm{act}(r_u W)$, with `W` of size `[num_item, hidden_size]`
- $\mu_u = h W_{\mu}$, with `W_mu` of size `[hidden_size, k]`
- $\sigma_u = \mathrm{sigmoid}(h W_{\mathrm{std}})$, with `W_std` of size `[hidden_size, k]`

where $\mathrm{act}$ is the activation function (`self.act_fn` in the code below). The first encoding step may consist of multiple fully-connected layers. Similarly, we encode $r_i$ to get $(\mu_i, \sigma_i)$.
```python
user_encoder_structure = [16, 16]  # sample structure: input dimension (num_item) followed by hidden layer sizes

# line 68
self.user_encoder = nn.Sequential()
for i in range(len(user_encoder_structure) - 1):
    self.user_encoder.add_module(
        "fc{}".format(i),
        nn.Linear(user_encoder_structure[i], user_encoder_structure[i + 1]),
    )
    self.user_encoder.add_module("act{}".format(i), self.act_fn)
self.user_mu = nn.Linear(user_encoder_structure[-1], k)  # mu
self.user_std = nn.Linear(user_encoder_structure[-1], k)

# line 104
def encode_user(self, x):
    h = self.user_encoder(x)
    return self.user_mu(h), torch.sigmoid(self.user_std(h))
```
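To see the shapes involved, here is a standalone re-creation of the encoder with made-up sizes (`num_item = 16`, one 16-node hidden layer, `k = 10`); it is only a sketch, not the Cornac class itself:

```python
import torch
import torch.nn as nn

# Made-up sizes for illustration: 16 items, one 16-node hidden layer, k = 10.
num_item, hidden, k, batch_size = 16, 16, 10, 4

user_encoder = nn.Sequential(nn.Linear(num_item, hidden), nn.Tanh())
user_mu = nn.Linear(hidden, k)
user_std = nn.Linear(hidden, k)

x = torch.rand(batch_size, num_item)  # a toy batch of (binarized) ratings
h = user_encoder(x)
mu, std = user_mu(h), torch.sigmoid(user_std(h))

print(mu.shape, std.shape)  # torch.Size([4, 10]) for both
```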
Sampling and decoding
- input: $\mu$ and $\sigma$ for a user or item, of shape `[k,]` for both.
- output: rating predictions for a user or item, of shape `[batch_size, num_item]` for users.

The sampling and decoding steps are:

- $\theta = \mu_{\theta} + \epsilon \odot \sigma_{\theta}$, $\epsilon \sim \mathcal{N}(0, I)$: sampled user factors, of shape `[num_user, k]`. During each epoch, the stored user and item factors are replaced with the newly sampled factors.
- $\beta = \mu_{\beta} + \epsilon \odot \sigma_{\beta}$: sampled item factors, of shape `[num_item, k]`.
- $\hat{r}_u = \mathrm{sigmoid}(\theta_u \beta^{\top})$: user rating prediction.
- $\hat{r}_i = \mathrm{sigmoid}(\beta_i \theta^{\top})$: item rating prediction.
```python
def reparameterize(self, mu, std):  # line 120
    eps = torch.randn_like(mu)
    return mu + eps * std

def decode_user(self, theta, beta):  # line 112
    h = theta.mm(beta.t())
    return torch.sigmoid(h)
```
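The reparameterization trick draws $\epsilon \sim \mathcal{N}(0, I)$ and returns $\mu + \epsilon \odot \sigma$, which keeps the sample differentiable w.r.t. $\mu$ and $\sigma$. The toy check below (illustrative values only) confirms that repeated samples have roughly the requested mean and standard deviation:

```python
import torch

def reparameterize(mu, std):
    # Same computation as the method above, outside the class for illustration.
    eps = torch.randn_like(mu)
    return mu + eps * std

# Toy parameters (k = 2): sample the same Gaussian 10,000 times.
mu = torch.tensor([1.0, -2.0])
std = torch.tensor([0.5, 0.1])
samples = torch.stack([reparameterize(mu, std) for _ in range(10_000)])

print(samples.mean(dim=0))  # roughly [1.0, -2.0]
print(samples.std(dim=0))   # roughly [0.5, 0.1]
```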
The forward pass of the BiVAE network is a combination of the encoder and the decoder:
```python
def forward(self, x, user=True, beta=None, theta=None):  # line 124
    if user:
        mu, std = self.encode_user(x)
        theta = self.reparameterize(mu, std)
        return theta, self.decode_user(theta, beta), mu, std
    else:
        mu, std = self.encode_item(x)
        beta = self.reparameterize(mu, std)
        return beta, self.decode_item(theta, beta), mu, std
```
Criterion Function
Inputs:

- `x` `[batch_size, num_item]`: rating ground truth (of shape `[batch_size, num_user]` for item prediction).
- `x_` `[x.shape]`: rating prediction.
- `mu` `[k,]`: $\mu$ of the user or item latent factor.
- `mu_prior` `[k,]`: $\mu$ of the prior.
- `std` `[k,]`: $\sigma$ of the user or item latent factor.
- `kl_beta` (int): penalty weight of the KL term; the larger `kl_beta`, the larger the penalty. The default is 1.
The loss function is the negative of the Evidence Lower Bound (ELBO):

$$\mathcal{L} = \mathbb{E}_{q}\big[\log p(x \mid \theta, \beta)\big] - \mathrm{KL}\big(q(\cdot \mid x)\,\|\,p(\cdot)\big)$$

Depending on our assumption about the output distribution, the log-likelihood term $\log p(x \mid \cdot)$ (dropping constants that do not depend on the prediction $\hat{x}$) could be:

- $x \log \hat{x} - \hat{x}$ for Poisson.
- $-(x - \hat{x})^2$ for Gaussian.
- $x \log \hat{x} + (1 - x)\log(1 - \hat{x})$ for Bernoulli.

The KL term between the posterior $\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and the prior $\mathcal{N}(\mu_{\mathrm{prior}}, I)$ is derived as:

$$\mathrm{KL} = -\frac{1}{2}\sum_{k}\Big(1 + 2\log\sigma_k - (\mu_k - \mu_{\mathrm{prior},k})^2 - \sigma_k^2\Big)$$

Total loss (averaged over the batch):

$$\mathrm{loss} = \mathrm{mean}\big(\text{kl\_beta} \cdot \mathrm{KL} - \mathrm{LL}\big)$$
The detailed math behind the criterion function is explained [in the following](#math-of-elbo).
```python
def loss(self, x, x_, mu, mu_prior, std, kl_beta):
    # Likelihood
    ll_choices = {
        "bern": x * torch.log(x_ + EPS) + (1 - x) * torch.log(1 - x_ + EPS),
        "gaus": -(x - x_) ** 2,
        "pois": x * torch.log(x_ + EPS) - x_,
    }

    ll = ll_choices.get(self.likelihood, None)
    if ll is None:
        raise ValueError("Supported likelihoods: {}".format(ll_choices.keys()))
    ll = torch.sum(ll, dim=1)

    # KL term
    kld = -0.5 * (1 + 2.0 * torch.log(std) - (mu - mu_prior).pow(2) - std.pow(2))
    kld = torch.sum(kld, dim=1)

    return torch.mean(kl_beta * kld - ll)
```
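For a quick sanity check of the criterion, the standalone re-implementation below uses made-up tensors (2 users, 3 items, `k = 2`) and assumes `EPS` is a small constant such as `1e-10`; with `mu` equal to the prior mean and `std = 1`, the KL term is exactly zero and only the likelihood term contributes:

```python
import torch

EPS = 1e-10  # assumed small constant to avoid log(0)

def bivae_loss(x, x_, mu, mu_prior, std, kl_beta, likelihood="pois"):
    # Standalone copy of the criterion above, for illustration only.
    ll_choices = {
        "bern": x * torch.log(x_ + EPS) + (1 - x) * torch.log(1 - x_ + EPS),
        "gaus": -(x - x_) ** 2,
        "pois": x * torch.log(x_ + EPS) - x_,
    }
    ll = torch.sum(ll_choices[likelihood], dim=1)
    kld = -0.5 * (1 + 2.0 * torch.log(std) - (mu - mu_prior).pow(2) - std.pow(2))
    kld = torch.sum(kld, dim=1)
    return torch.mean(kl_beta * kld - ll)

# Toy batch: 2 users, 3 items, k = 2; posterior equals the standard normal prior.
x = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
x_ = torch.tensor([[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]])
mu, std = torch.zeros(2, 2), torch.ones(2, 2)

print(bivae_loss(x, x_, mu, 0.0, std, kl_beta=1.0))  # KL term is 0, only -ll remains
```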
Optimizing without CAP priors
Two optimizers are used, one for the user parameters and one for the item parameters:
```python
u_optimizer = torch.optim.Adam(params=user_params, lr=learn_rate)
i_optimizer = torch.optim.Adam(params=item_params, lr=learn_rate)
```
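The user and item parameter groups are kept separate so that the two sides can be updated in alternating steps. Below is a sketch of how the groups might be collected, reusing the attribute names from the encoder code above; the item-side names are assumed by symmetry and the exact chaining in the Cornac source may differ:

```python
import itertools as it

# Sketch only: the user-side attributes appear in the encoder code above;
# the item-side attributes are assumed to exist by symmetry.
user_params = it.chain(
    bivae.user_encoder.parameters(),
    bivae.user_mu.parameters(),
    bivae.user_std.parameters(),
)
item_params = it.chain(
    bivae.item_encoder.parameters(),
    bivae.item_mu.parameters(),
    bivae.item_std.parameters(),
)
```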
An interesting operation before training is that Cornac's BiVAE binarizes the ratings: every observed rating is mapped to 1, and entries without a rating are 0. That is, a ground-truth rating of 5 and a rating of 1 are both mapped to 1. The `train_set.matrix` is a CSR (compressed sparse row) matrix, which is both time and space efficient.
```python
x = train_set.matrix.copy()
x.data = np.ones_like(x.data)  # Binarize data
tx = x.transpose()
```
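Here is a toy illustration of the binarization on a small hand-made CSR matrix (not an actual Cornac `train_set`):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 2-user x 3-item rating matrix; 0 means "no rating observed".
ratings = csr_matrix(np.array([[5, 0, 1],
                               [0, 3, 0]]))

x = ratings.copy()
x.data = np.ones_like(x.data)  # every observed rating becomes 1
print(x.toarray())
# [[1 0 1]
#  [0 1 0]]
```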
Since a rating of 1 can indicate that the user does not like the item, this labeling approach may not be suitable if a ground truth of 1 is meant to mean that the user likes the item. One mitigation mentioned in the paper is to keep only items with at least 5 ratings, as sketched below.
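A minimal sketch of such a filter on the binarized matrix (illustrative only, not the preprocessing code used by Cornac):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy binarized matrix: 20 users x 8 items with random observations.
x = csr_matrix(np.random.binomial(1, 0.3, size=(20, 8)))

ratings_per_item = np.asarray(x.sum(axis=0)).ravel()  # observed ratings per item
keep = np.where(ratings_per_item >= 5)[0]             # item ids with >= 5 ratings
x_filtered = x[:, keep]

print(x.shape, "->", x_filtered.shape)
```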
In the item optimization step below, $\beta$ and $\mu_{\beta}$ are recomputed and updated after the gradient step, because they are needed when optimizing the user latent factors.
```python
i_batch = tx[i_ids, :]
i_batch = i_batch.A  # convert CSR to np.ndarray
i_batch = torch.tensor(i_batch, dtype=dtype, device=device)

# Reconstructed batch
beta, i_batch_, i_mu, i_std = bivae(i_batch, user=False, theta=bivae.theta)

i_mu_prior = 0.0  # zero mean for standard normal prior if not CAP prior

i_loss = bivae.loss(i_batch, i_batch_, i_mu, i_mu_prior, i_std, beta_kl)
i_optimizer.zero_grad()
i_loss.backward()
i_optimizer.step()

i_sum_loss += i_loss.data.item()
i_count += len(i_batch)

beta, _, i_mu, _ = bivae(i_batch, user=False, theta=bivae.theta)
bivae.beta.data[i_ids] = beta.data
bivae.mu_beta.data[i_ids] = i_mu.data
```
After optimizing the user latent factors (which is almost the same as above), $\mu_{\beta}$ is updated again, since we now have new user factors $\theta$.
```python
for i_ids in train_set.item_iter(batch_size, shuffle=False):
    i_batch = tx[i_ids, :]
    i_batch = i_batch.A
    i_batch = torch.tensor(i_batch, dtype=dtype, device=device)

    beta, _, i_mu, _ = bivae(i_batch, user=False, theta=bivae.theta)
    bivae.mu_beta.data[i_ids] = i_mu.data
```
Optimizing with CAP priors
The Constrained Adaptive Prior (CAP) is introduced in the paper to address the posterior collapse issue. The parameters of the CAP are computed from side information using a VAE: for example, user reviews and also-viewed information are projected into 20-dimensional embeddings, which serve as the parameters of the user factor prior. The item factor prior parameters can be computed in the same way.
```python
# when model init
if bivae.cap_priors.get("user", False):
    user_params = it.chain(user_params, bivae.user_prior_encoder.parameters())
    user_features = train_set.user_feature.features[: train_set.num_users]
# ...

# during training
if bivae.cap_priors.get("user", False):
    u_batch_f = user_features[u_ids]
    u_batch_f = torch.tensor(u_batch_f, dtype=dtype, device=device)
    u_mu_prior = bivae.encode_user_prior(u_batch_f)
```
As shown above, when CAP priors are enabled, the prior mean is learned from side information instead of being fixed at zero as in the standard Gaussian prior.
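Putting the pieces together, the only change relative to the item step shown earlier is that the zero prior mean is replaced by the encoded side information before calling the loss. The sketch below mirrors the snippets above; the item-side names (`cap_priors["item"]`, `encode_item_prior`, `item_features`) are assumed by symmetry with the user-side code:

```python
# Same item step as before, except mu_prior now comes from side information.
# `item_features`, cap_priors["item"] and encode_item_prior are assumed by
# symmetry with the user-side snippet above.
i_batch = torch.tensor(tx[i_ids, :].A, dtype=dtype, device=device)

if bivae.cap_priors.get("item", False):
    i_batch_f = torch.tensor(item_features[i_ids], dtype=dtype, device=device)
    i_mu_prior = bivae.encode_item_prior(i_batch_f)  # learned prior mean
else:
    i_mu_prior = 0.0  # fall back to the standard normal prior

beta, i_batch_, i_mu, i_std = bivae(i_batch, user=False, theta=bivae.theta)
i_loss = bivae.loss(i_batch, i_batch_, i_mu, i_mu_prior, i_std, beta_kl)
```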
Math of ELBO
Inference Model. The starting point of variational Bayes (VB) is to introduce a tractable inference model $q_{\Psi}(\theta, \beta \mid R)$, governed by a set of variational parameters $\Psi$, which is used as a proxy for the true but intractable posterior. A variational distribution that breaks the coupling between $\theta$ and $\beta$ (a main source of intractability in the model) is chosen as:

$$q_{\Psi}(\theta, \beta) = q_{\Psi}(\theta)\, q_{\Psi}(\beta)$$

with

$$q_{\Psi}(\theta) = \prod_{u} q(\theta_u), \qquad q_{\Psi}(\beta) = \prod_{i} q(\beta_i).$$

Without loss of generality, the following forms are adopted:

$$q(\theta_u) = \mathcal{N}\big(\theta_u;\ \mu_{\theta}(r_u),\ \mathrm{diag}(\sigma^2_{\theta}(r_u))\big), \qquad q(\beta_i) = \mathcal{N}\big(\beta_i;\ \mu_{\beta}(r_i),\ \mathrm{diag}(\sigma^2_{\beta}(r_i))\big),$$

where $\mu_{\theta}(\cdot)$, $\sigma_{\theta}(\cdot)$, $\mu_{\beta}(\cdot)$ and $\sigma_{\beta}(\cdot)$ are vector-valued functions (e.g., multilayer perceptrons) parameterized by $\psi_{\theta}$ / $\psi_{\beta}$, outputting respectively the mean and covariance parameters of the variational distributions.
With $q_{\Psi}$ in place, we can proceed with approximate inference by optimizing the Evidence Lower BOund (ELBO) w.r.t. the model and variational parameters, given in our case by

$$\mathcal{L}(\omega, \Psi) = \mathbb{E}_{q_{\Psi}}\big[\log p_{\omega}(R \mid \theta, \beta)\big] - \mathrm{KL}\big(q_{\Psi}(\theta)\,\|\,p(\theta)\big) - \mathrm{KL}\big(q_{\Psi}(\beta)\,\|\,p(\beta)\big).$$
Since we assume $r_{ui}$ follows a Poisson distribution, and $q(\theta_u) = \mathcal{N}\big(\mu_u, \mathrm{diag}(\sigma_u^2)\big)$ with prior $p(\theta_u) = \mathcal{N}(0, I_K)$, we can derive the KL divergence between the two Gaussians as:

$$\mathrm{KL}\big(q(\theta_u)\,\|\,p(\theta_u)\big) = -\frac{1}{2}\sum_{k=1}^{K}\Big(1 + 2\log\sigma_{uk} - \mu_{uk}^2 - \sigma_{uk}^2\Big),$$

while the Poisson log-likelihood is

$$\log p(r_{ui} \mid \theta_u, \beta_i) = r_{ui}\log(\theta_u^{\top}\beta_i) - \theta_u^{\top}\beta_i - \log(r_{ui}!).$$

Combining the above (and dropping the constant $\log(r_{ui}!)$):

$$\mathcal{L} = \sum_{u,i}\Big(r_{ui}\log(\theta_u^{\top}\beta_i) - \theta_u^{\top}\beta_i\Big) - \sum_{u}\mathrm{KL}\big(q(\theta_u)\,\|\,p(\theta_u)\big) - \sum_{i}\mathrm{KL}\big(q(\beta_i)\,\|\,p(\beta_i)\big).$$

When Constrained Adaptive Priors (CAP) are used to push the posteriors to learn from the observations $R$, the zero-mean priors $p(\theta_u) = \mathcal{N}(0, I_K)$ and $p(\beta_i) = \mathcal{N}(0, I_K)$ no longer hold: the prior means $\mu_u^{\mathrm{prior}}$ and $\mu_i^{\mathrm{prior}}$ are computed from side information. Therefore the KL divergence becomes:

$$\mathrm{KL}\big(q(\theta_u)\,\|\,p(\theta_u)\big) = -\frac{1}{2}\sum_{k=1}^{K}\Big(1 + 2\log\sigma_{uk} - (\mu_{uk} - \mu^{\mathrm{prior}}_{uk})^2 - \sigma_{uk}^2\Big),$$

and the objective function can be written as:

$$\mathcal{L} = \sum_{u,i}\Big(r_{ui}\log(\theta_u^{\top}\beta_i) - \theta_u^{\top}\beta_i\Big) - \sum_{u}\mathrm{KL}\big(q(\theta_u)\,\|\,\mathcal{N}(\mu_u^{\mathrm{prior}}, I_K)\big) - \sum_{i}\mathrm{KL}\big(q(\beta_i)\,\|\,\mathcal{N}(\mu_i^{\mathrm{prior}}, I_K)\big).$$
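The closed-form Gaussian KL used in `bivae.loss` can be cross-checked against `torch.distributions.kl_divergence` with toy numbers:

```python
import torch
from torch.distributions import Normal, kl_divergence

# Toy posterior and prior parameters (k = 3), unit-variance prior.
mu = torch.tensor([0.3, -0.5, 1.0])
std = torch.tensor([0.8, 1.2, 0.5])
mu_prior = torch.tensor([0.1, 0.0, 0.4])

# Closed form used in bivae.loss.
kld_manual = -0.5 * (1 + 2.0 * torch.log(std) - (mu - mu_prior).pow(2) - std.pow(2))

# Same quantity per dimension from torch.distributions.
kld_torch = kl_divergence(Normal(mu, std), Normal(mu_prior, torch.ones(3)))

print(kld_manual.sum().item(), kld_torch.sum().item())  # equal up to float error
```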
References
1. Liang, Dawen, et al. "Variational Autoencoders for Collaborative Filtering." Proceedings of the 2018 World Wide Web Conference. 2018.
2. Truong, Quoc-Tuan, Aghiles Salah, and Hady W. Lauw. "Bilateral Variational Autoencoder for Collaborative Filtering." Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM). 2021.
3. Lucas, James, et al. "Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse." Advances in Neural Information Processing Systems 32 (NeurIPS). 2019.
4. https://github.com/microsoft/recommenders/blob/main/examples/02_model_collaborative_filtering/cornac_bivae_deep_dive.ipynb