Visualizing the RecBERT Recommendation System

1. Training the RecBERT Embedding Model

RecBERT first adapts a base transformer model to the specific domain of user comments and then fine-tunes it to generate meaningful sentence-level embeddings.

The pipeline has three stages:

  • Base model: RoBERTa, pre-trained on general text.
  • Domain adaptation: continued MLM pre-training on the user comments dataset (e.g., MyAnimeList reviews) produces a domain-adapted RoBERTa that understands comment-specific language.
  • Fine-tuning: a Siamese network trained with SimCSE contrastive learning and the Multiple Negatives Ranking (MNR) loss produces the final model, RecBERT, which generates semantic comment embeddings.

Benefit: This process creates a model that can accurately represent the semantic meaning of entire user comments as dense vectors (embeddings), tailored to the specific language used in those comments. This is crucial for comparing comments and queries effectively.
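The MNR training signal described above can be illustrated with a minimal NumPy sketch. This is only a conceptual illustration: the real model is a Siamese transformer trained with SimCSE dropout augmentation, and `mnr_loss` and the toy one-hot embeddings here are hypothetical. Each pair in a batch treats its partner as the positive and every other batch row as an in-batch negative.

```python
import numpy as np

def mnr_loss(left_embs: np.ndarray, right_embs: np.ndarray) -> float:
    """Multiple Negatives Ranking loss over one batch of embedding pairs.

    Row i of left_embs is the positive partner of row i of right_embs;
    every other row in the batch serves as an in-batch negative."""
    # L2-normalise so dot products are cosine similarities
    a = left_embs / np.linalg.norm(left_embs, axis=1, keepdims=True)
    b = right_embs / np.linalg.norm(right_embs, axis=1, keepdims=True)
    sims = a @ b.T                                    # (batch, batch)
    # softmax cross-entropy with the diagonal as the target class
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Matched pairs should yield a lower loss than mismatched pairs
aligned = mnr_loss(np.eye(4), np.eye(4))
shuffled = mnr_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
assert aligned < shuffled
```

Minimizing this loss pulls each comment toward its positive partner and pushes it away from the rest of the batch, which is what shapes the embedding space for the retrieval step below.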

2. Query Processing & Ranking Retrieval

When a user query arrives, RecBERT segments it using an LLM and calculates similarity scores through two channels (full query and subqueries) to rank relevant classes (e.g., stories, items).

User Query (γ):
"isekai story with strong female lead and magic system"

LLM Query Segmentation (few-shot) splits the query into subqueries (γ1, γ2, γ3):
  • isekai story (γ1)
  • strong female lead (γ2)
  • magic system (γ3)
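The segmentation prompt itself is not spelled out above; the sketch below shows one plausible shape for a few-shot prompt. The wording and the example facets are illustrative assumptions, not RecBERT's actual prompt.

```python
# Hypothetical few-shot prompt for the LLM segmentation step.
# The example query/facets are invented for illustration.
FEW_SHOT_PROMPT = """\
Split the user query into independent facets, one per line.

Query: time-travel romance set in feudal Japan
Facets:
- time-travel romance
- set in feudal Japan

Query: {query}
Facets:
"""

print(FEW_SHOT_PROMPT.format(
    query="isekai story with strong female lead and magic system"))
```

Each returned facet line becomes one subquery γi for Channel 2 below.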

Channel 1: Full Query Similarity (S1)

  • Embed the full query: e(γ).
  • Run a KNN search of e(γ) against all comment embeddings e(A).
  • Take the maximum similarity per class:
    S1 = max(cos_sim(e(γ), e(A)))
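Channel 1's scoring can be sketched in NumPy. For clarity this uses a brute-force similarity scan in place of a KNN index, and the function and class names are illustrative:

```python
import numpy as np

def channel1_scores(query_emb, comment_embs, comment_classes):
    """S1 per class: the maximum cosine similarity between the full-query
    embedding e(γ) and any comment embedding belonging to that class."""
    q = query_emb / np.linalg.norm(query_emb)
    c = comment_embs / np.linalg.norm(comment_embs, axis=1, keepdims=True)
    sims = c @ q                                  # cosine sim to each comment
    s1 = {}
    for cls, sim in zip(comment_classes, sims):
        s1[cls] = max(s1.get(cls, -1.0), float(sim))
    return s1

# Toy example: three comment embeddings spread across two classes.
# class_a's comment matches the query exactly; class_b's best scores 0.6.
scores = channel1_scores(
    np.array([1.0, 0.0]),
    np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]),
    ["class_a", "class_b", "class_b"],
)
```

Taking the per-class maximum means a single strongly matching comment is enough to surface its class.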

Channel 2: Subquery Similarity (S2)

  • Embed each subquery: e(γ1), e(γ2), ...
  • Run a KNN search of each subquery embedding against all comment embeddings e(B).
  • Average the per-subquery maximum similarities within each class:
    s2 = avg(max(cos_sim(e(γi), e(B))))
  • Apply the non-linear adjustment:
    S2 = clamp(tanh⁻¹(s2), max=1)
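A matching sketch for Channel 2 computes s2 and the adjusted S2 for a single class (again brute-force in place of KNN; `channel2_score` is an illustrative name). Note why the clamp matters: tanh⁻¹(0.8) ≈ 1.10, so a class whose comments match every subquery strongly saturates at 1.

```python
import math
import numpy as np

def channel2_score(subquery_embs, class_comment_embs):
    """S2 for one class: average the per-subquery maximum cosine
    similarities to get s2, then apply tanh⁻¹ clamped at 1."""
    q = subquery_embs / np.linalg.norm(subquery_embs, axis=1, keepdims=True)
    c = class_comment_embs / np.linalg.norm(class_comment_embs, axis=1,
                                            keepdims=True)
    sims = q @ c.T                                # (n_subqueries, n_comments)
    s2 = float(sims.max(axis=1).mean())           # avg of per-subquery maxima
    s2 = max(min(s2, 1.0 - 1e-12), -1.0 + 1e-12)  # keep atanh finite
    return min(math.atanh(s2), 1.0)               # S2 = clamp(tanh⁻¹(s2), max=1)

# Two subqueries that each match a different comment perfectly saturate S2
assert channel2_score(np.eye(2), np.eye(2)) == 1.0
```

Because tanh⁻¹ grows steeply near 1, a class covering all facets well is rewarded more than linearly, while weak partial matches are barely boosted.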

Final Class Ranking

For each class, the two channel similarities S1 and S2 are combined into a final score:
  S = max(S1, S2)
Classes are then ranked in descending order of S to produce the final ranked list.
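Putting the channels together, the final ranking step is a per-class max followed by a sort. A minimal sketch, assuming a default score of 0 for a class absent from one channel (the text above does not specify this case):

```python
def final_ranking(s1_by_class: dict, s2_by_class: dict) -> list:
    """Rank classes by S = max(S1, S2), highest first.

    A class missing from one channel defaults to 0.0 (an assumption)."""
    classes = set(s1_by_class) | set(s2_by_class)
    scored = {c: max(s1_by_class.get(c, 0.0), s2_by_class.get(c, 0.0))
              for c in classes}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

# show_a wins via its full-query match; show_b via its subquery channel
ranking = final_ranking({"show_a": 0.9, "show_b": 0.4},
                        {"show_a": 0.5, "show_b": 0.7})
```

Taking the max lets either a single on-point comment (S1) or good multi-facet coverage (S2) carry a class to the top.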

Benefit: Query segmentation allows RecBERT to understand and match different facets of a complex query that may be discussed in separate comments within the same class. Combining the full-query and subquery similarities provides a robust ranking, capturing both direct matches and composite relevance. The `tanh⁻¹` adjustment non-linearly boosts scores when multiple subqueries match well within a class, and the clamp at 1 keeps S2 on the same scale as S1.