Visualizing the RecBERT Recommendation System

1. Training the RecBERT Embedding Model

RecBERT first adapts a base transformer model to the specific domain of user comments and then fine-tunes it to generate meaningful sentence-level embeddings.

The pipeline has three stages:

  • Base model: RoBERTa, pre-trained on general text.
  • Domain adaptation: continued MLM pre-training on the user comments dataset (e.g., MyAnimeList reviews) produces a domain-adapted RoBERTa that understands comment-specific language.
  • Fine-tuning: a Siamese network trained with SimCSE contrastive learning and the Multiple Negatives Ranking (MNR) loss produces the final model, RecBERT, which generates semantic comment embeddings.

Benefit: This process creates a model that can accurately represent the semantic meaning of entire user comments as dense vectors (embeddings), tailored to the specific language used in those comments. This is crucial for comparing comments and queries effectively.
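The MNR training signal described above can be illustrated with a minimal NumPy sketch. This is only a conceptual illustration: the real model is a Siamese transformer trained with SimCSE dropout augmentation, and `mnr_loss` and the toy one-hot embeddings here are hypothetical. Each pair in a batch treats its partner as the positive and every other batch row as an in-batch negative.

```python
import numpy as np

def mnr_loss(left_embs: np.ndarray, right_embs: np.ndarray) -> float:
    """Multiple Negatives Ranking loss over one batch of embedding pairs.

    Row i of left_embs is the positive partner of row i of right_embs;
    every other row in the batch serves as an in-batch negative."""
    # L2-normalise so dot products are cosine similarities
    a = left_embs / np.linalg.norm(left_embs, axis=1, keepdims=True)
    b = right_embs / np.linalg.norm(right_embs, axis=1, keepdims=True)
    sims = a @ b.T                                    # (batch, batch)
    # softmax cross-entropy with the diagonal as the target class
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Matched pairs should yield a lower loss than mismatched pairs
aligned = mnr_loss(np.eye(4), np.eye(4))
shuffled = mnr_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
assert aligned < shuffled
```

Minimizing this loss pulls each comment toward its positive partner and pushes it away from the rest of the batch, which is what shapes the embedding space for the retrieval step below.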

2. Query Processing & Ranking Retrieval

When a user query arrives, RecBERT segments it using an LLM and calculates similarity scores through two channels (full query and subqueries) to rank relevant classes (e.g., stories, items).

User Query (γ):
"isekai story with strong female lead and magic system"

LLM Query Segmentation (few-shot) splits the query into subqueries (γ1, γ2, γ3):
  • isekai story (γ1)
  • strong female lead (γ2)
  • magic system (γ3)
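The segmentation prompt itself is not spelled out above; the sketch below shows one plausible shape for a few-shot prompt. The wording and the example facets are illustrative assumptions, not RecBERT's actual prompt.

```python
# Hypothetical few-shot prompt for the LLM segmentation step.
# The example query/facets are invented for illustration.
FEW_SHOT_PROMPT = """\
Split the user query into independent facets, one per line.

Query: time-travel romance set in feudal Japan
Facets:
- time-travel romance
- set in feudal Japan

Query: {query}
Facets:
"""

print(FEW_SHOT_PROMPT.format(
    query="isekai story with strong female lead and magic system"))
```

Each returned facet line becomes one subquery γi for Channel 2 below.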

Channel 1: Full Query Similarity (S1)

  • Embed the full query: e(γ).
  • Run a KNN search of e(γ) against all comment embeddings e(A).
  • Take the maximum similarity per class:
    S1 = max(cos_sim(e(γ), e(A)))
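Channel 1's scoring can be sketched in NumPy. For clarity this uses a brute-force similarity scan in place of a KNN index, and the function and class names are illustrative:

```python
import numpy as np

def channel1_scores(query_emb, comment_embs, comment_classes):
    """S1 per class: the maximum cosine similarity between the full-query
    embedding e(γ) and any comment embedding belonging to that class."""
    q = query_emb / np.linalg.norm(query_emb)
    c = comment_embs / np.linalg.norm(comment_embs, axis=1, keepdims=True)
    sims = c @ q                                  # cosine sim to each comment
    s1 = {}
    for cls, sim in zip(comment_classes, sims):
        s1[cls] = max(s1.get(cls, -1.0), float(sim))
    return s1

# Toy example: three comment embeddings spread across two classes.
# class_a's comment matches the query exactly; class_b's best scores 0.6.
scores = channel1_scores(
    np.array([1.0, 0.0]),
    np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]),
    ["class_a", "class_b", "class_b"],
)
```

Taking the per-class maximum means a single strongly matching comment is enough to surface its class.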

Channel 2: Subquery Similarity (S2)

  • Embed each subquery: e(γ1), e(γ2), ...
  • Run a KNN search of each subquery embedding against all comment embeddings e(B).
  • Average the per-subquery maximum similarities within each class:
    s2 = avg(max(cos_sim(e(γi), e(B))))
  • Apply the non-linear adjustment:
    S2 = clamp(tanh⁻¹(s2), max=1)
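A matching sketch for Channel 2 computes s2 and the adjusted S2 for a single class (again brute-force in place of KNN; `channel2_score` is an illustrative name). Note why the clamp matters: tanh⁻¹(0.8) ≈ 1.10, so a class whose comments match every subquery strongly saturates at 1.

```python
import math
import numpy as np

def channel2_score(subquery_embs, class_comment_embs):
    """S2 for one class: average the per-subquery maximum cosine
    similarities to get s2, then apply tanh⁻¹ clamped at 1."""
    q = subquery_embs / np.linalg.norm(subquery_embs, axis=1, keepdims=True)
    c = class_comment_embs / np.linalg.norm(class_comment_embs, axis=1,
                                            keepdims=True)
    sims = q @ c.T                                # (n_subqueries, n_comments)
    s2 = float(sims.max(axis=1).mean())           # avg of per-subquery maxima
    s2 = max(min(s2, 1.0 - 1e-12), -1.0 + 1e-12)  # keep atanh finite
    return min(math.atanh(s2), 1.0)               # S2 = clamp(tanh⁻¹(s2), max=1)

# Two subqueries that each match a different comment perfectly saturate S2
assert channel2_score(np.eye(2), np.eye(2)) == 1.0
```

Because tanh⁻¹ grows steeply near 1, a class covering all facets well is rewarded more than linearly, while weak partial matches are barely boosted.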

Final Class Ranking

For each class, the two channel similarities S1 and S2 are combined into a final score:
  S = max(S1, S2)
Classes are then ranked in descending order of S to produce the final ranked list.
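Putting the channels together, the final ranking step is a per-class max followed by a sort. A minimal sketch, assuming a default score of 0 for a class absent from one channel (the text above does not specify this case):

```python
def final_ranking(s1_by_class: dict, s2_by_class: dict) -> list:
    """Rank classes by S = max(S1, S2), highest first.

    A class missing from one channel defaults to 0.0 (an assumption)."""
    classes = set(s1_by_class) | set(s2_by_class)
    scored = {c: max(s1_by_class.get(c, 0.0), s2_by_class.get(c, 0.0))
              for c in classes}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# show_a wins via its full-query match; show_b via its subquery channel
ranking = final_ranking({"show_a": 0.9, "show_b": 0.4},
                        {"show_a": 0.5, "show_b": 0.7})
```

Taking the max lets either a single on-point comment (S1) or good multi-facet coverage (S2) carry a class to the top.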

Benefit: Query segmentation allows RecBERT to understand and match different facets of a complex query that may be discussed in separate comments within the same class. Combining the full-query and subquery similarities provides a robust ranking, capturing both direct matches and composite relevance. The `tanh⁻¹` adjustment non-linearly boosts scores when multiple subqueries match well within a class, and the clamp at 1 keeps S2 on the same scale as S1.