Model Architecture

Our model builds on a VideoMAE-based visual backbone and extends it
with a scene-level embedding head and a text-guided contrastive learning
module. The architecture is designed to combine visual understanding,
scene segmentation, and world-model reasoning within a unified
framework. The proposed architecture for semantic distance
understanding therefore contains three components: a video understanding
backbone (VideoMAE), a scene embedding head, and an LLM-based semantic
distance module (contrastive layer).
Video Understanding Backbone (VideoMAE)
The base of our system is the pretrained MCG-NJU/videomae-base
transformer encoder. We load a fine-tuned checkpoint already trained on
the FineVideo dataset for video understanding tasks. To adapt this
backbone for scene-level semantic modeling, we support two kinds of
supervisory signals: frame-level labels, where each frame receives an
individual label; and scene-level shared labels, where all frames within
a timestamp interval share one scene label. These two settings allow the
model to shift from coarse video understanding toward the more
structured task of scene segmentation.
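As an illustration, the scene-level setting amounts to broadcasting one label across every frame inside a timestamp interval. A minimal sketch (the function name and interval format are our own, not from the codebase):

```python
def scene_labels_to_frames(intervals, num_frames, fps=1.0):
    """Expand (start_sec, end_sec, label) scene intervals into per-frame labels,
    so every frame inside a timestamp interval shares the scene's label."""
    labels = [None] * num_frames
    for start, end, label in intervals:
        for f in range(int(start * fps), min(int(end * fps), num_frames)):
            labels[f] = label
    return labels

# Two scenes over a 4-frame clip sampled at 1 fps.
print(scene_labels_to_frames([(0, 2, "kitchen"), (2, 4, "garden")], num_frames=4))
# -> ['kitchen', 'kitchen', 'garden', 'garden']
```

In the frame-level setting, by contrast, each frame would carry its own label directly and no expansion is needed.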
During fine-tuning, only the last K layers of VideoMAE’s transformer
encoder (K = 2–4) are unfrozen. This preserves low-level visual
representations while allowing high-level scene semantics to adapt to
the downstream objective.
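The selective unfreezing can be sketched as a filter over parameter names. The sketch below assumes VideoMAE-style names of the form `encoder.layer.<idx>...`; in a real setup one would set `requires_grad = True` on the matching torch parameters and `False` on everything else:

```python
import re

def names_to_unfreeze(param_names, num_layers=12, k=3):
    """Return the parameter names belonging to the last k transformer layers.
    These are the only backbone parameters that would be trained."""
    keep = set(range(num_layers - k, num_layers))
    selected = []
    for name in param_names:
        match = re.search(r"encoder\.layer\.(\d+)\.", name)
        if match and int(match.group(1)) in keep:
            selected.append(name)
    return selected

names = [f"encoder.layer.{i}.attention.weight" for i in range(12)]
print(names_to_unfreeze(names, num_layers=12, k=3))  # layers 9, 10, 11 only
```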
Scene Embedding Head (MLP)
To produce a scene-level representation, we append a lightweight MLP
head (Linear → ReLU → Linear) to the output CLS token. The MLP
compresses the high-dimensional VideoMAE features into a compact
embedding space in which scene relationships can be reasoned about. This
embedding space is used both for segmentation tasks and for the semantic
distance learning described below.
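A minimal plain-Python sketch of the head's forward pass, using toy dimensions (the real head operates on VideoMAE's hidden size):

```python
def linear(x, W, b):
    """Affine map: W is an out_dim x in_dim matrix, b an out_dim bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def mlp_head(cls_token, W1, b1, W2, b2):
    """Linear -> ReLU -> Linear: compresses the CLS feature to a scene embedding."""
    return linear(relu(linear(cls_token, W1, b1)), W2, b2)

# Toy example: 4-dim CLS feature -> 3 hidden units -> 2-dim scene embedding.
cls_token = [1.0, 0.0, -1.0, 2.0]
W1 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1, 1, 1], [0, 2, 0]]
b2 = [0.5, -0.5]
print(mlp_head(cls_token, W1, b1, W2, b2))  # -> [2.5, -0.5]
```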
LLM-Based Semantic Distance Module (Contrastive Training)
While VideoMAE provides strong visual features, it lacks the world
knowledge needed for higher-level scene reasoning. Film editing often
omits intermediate transitions because humans can infer continuity
through common-sense reasoning, and some videos take place in fictional
universes where everyday expectations do not apply; such cases require
grounding in the actual visual content rather than generic knowledge. To
address both limitations, we introduce a language-model-powered semantic
distance module that uses text as an external world model.
For each scene, we extract metadata descriptions (scene description,
props, context) and encode them with a large language model to obtain a
normalized embedding $t_i$ for scene $i$. Semantic relatedness between
two scenes $i$ and $j$ is measured by cosine similarity, which for
unit-normalized embeddings reduces to a dot product:
$s_{ij} = t_i^\top t_j$.
These embeddings provide semantic relationships that reflect narrative
continuity, spatial logic, and object-level reasoning.
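Concretely, the similarity computation reduces to a dot product between unit-normalized text embeddings. The toy vectors below are illustrative stand-ins, not real LLM embeddings:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def semantic_similarity(t_i, t_j):
    """Cosine similarity; for unit-normalized embeddings this is a dot product."""
    return sum(a * b for a, b in zip(t_i, t_j))

t_kitchen = normalize([0.9, 0.1, 0.2])   # "cooking in a kitchen, pots, stove"
t_dining  = normalize([0.8, 0.2, 0.3])   # "family dinner, table, plates"
t_space   = normalize([0.1, 0.9, 0.1])   # "spaceship bridge, control panels"
print(semantic_similarity(t_kitchen, t_dining))  # high: related scenes
print(semantic_similarity(t_kitchen, t_space))   # low: unrelated scenes
```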
We construct soft supervision weights from $s_{ij}$:
$w^{\mathrm{pos}}_{ij} = f(s_{ij})$ controls how strongly we pull a pair
together, and $w^{\mathrm{neg}}_{ij} = g(s_{ij})$ controls how strongly
we push it apart. We experiment with two implementations of $f$ and $g$:
Linear mapping: $f(s_{ij}) = \tfrac{1 + s_{ij}}{2}$, $g(s_{ij}) = \tfrac{1 - s_{ij}}{2}$.
Squared mapping: $f(s_{ij}) = \bigl(\tfrac{1 + s_{ij}}{2}\bigr)^2$, $g(s_{ij}) = \bigl(\tfrac{1 - s_{ij}}{2}\bigr)^2$.
In both cases, high text similarity leads to a strong pull, low
similarity leads to a strong push, and mid-range similarities receive
relatively small weights. Squaring enhances this effect by emphasizing
very similar or very dissimilar pairs.
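The two mappings can be sketched as follows. The specific forms (1+s)/2 and (1-s)/2 are one plausible instantiation of the behavior described above (strong pull at high similarity, strong push at low similarity, damped mid-range under squaring), written down here as an assumption rather than the exact formulas used:

```python
def weights_linear(s):
    """Map cosine similarity s in [-1, 1] to (pull, push) weights in [0, 1]."""
    return (1 + s) / 2, (1 - s) / 2

def weights_squared(s):
    """Squaring de-emphasizes mid-range pairs and stresses the extremes."""
    w_pos, w_neg = weights_linear(s)
    return w_pos ** 2, w_neg ** 2

for s in (0.9, 0.0, -0.9):
    print(s, weights_linear(s), weights_squared(s))
```

At s = 0 the linear mapping yields (0.5, 0.5) while the squared mapping yields (0.25, 0.25), so ambiguous pairs contribute less to the squared objective.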
To achieve this, given the video embeddings $v_i$ from the MLP head, we
first normalize them, $\hat{v}_i = v_i / \lVert v_i \rVert_2$, and
compute the video-space similarity $z_{ij} = \hat{v}_i^\top \hat{v}_j$.
The positive (pulling) term is
$\mathcal{L}_{\mathrm{pos}} = \sum_{i \neq j} w^{\mathrm{pos}}_{ij}\,(1 - z_{ij})$,
and the negative (pushing) term is
$\mathcal{L}_{\mathrm{neg}} = \sum_{i \neq j} w^{\mathrm{neg}}_{ij}\,\max(0,\, z_{ij} - m)$
with margin $m$. The final objective is
$\mathcal{L} = \mathcal{L}_{\mathrm{pos}} + \lambda\,\mathcal{L}_{\mathrm{neg}}$.
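Putting the pieces together, a plain-Python sketch of the objective over pairwise similarity matrices. The hinge margin and the pos/neg balance are assumed hyperparameters, not values reported by the paper:

```python
def contrastive_loss(z, w_pos, w_neg, margin=0.2, lam=1.0):
    """z, w_pos, w_neg: n x n pairwise matrices over scenes (lists of lists).
    Pulls weighted-similar pairs toward z_ij = 1 and pushes weighted-dissimilar
    pairs below the margin."""
    n = len(z)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    l_pos = sum(w_pos[i][j] * (1 - z[i][j]) for i, j in pairs)
    l_neg = sum(w_neg[i][j] * max(0.0, z[i][j] - margin) for i, j in pairs)
    return l_pos + lam * l_neg

# Two scenes whose video embeddings are moderately similar (z = 0.5) but whose
# text similarity says "pull together": the loss penalizes the remaining gap.
z = [[1.0, 0.5], [0.5, 1.0]]
w_pos = [[0.0, 1.0], [1.0, 0.0]]
w_neg = [[0.0, 0.0], [0.0, 0.0]]
print(contrastive_loss(z, w_pos, w_neg))  # -> 1.0
```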
This module forces the video embedding space to align with the semantic
topology implied by the LLM while remaining grounded in actual visual
evidence. A key motivation for this architecture is that video-based and
language-based reasoning possess complementary strengths:
VideoMAE compensates for LLM limitations: (1) fictional or stylized
worlds violate real-world logic; (2) visual cues reveal spatial layout,
lighting, actor identity, and more.
The LLM compensates for VideoMAE limitations: (1) editing creates
discontinuities that humans resolve through world knowledge; (2)
visually similar scenes may serve very different narrative functions.
By coupling the two with contrastive learning, the model learns an
embedding space that reflects both visual relations and
narrative/world-level semantic similarity. Thus, the final
representation is capable of supporting fine-grained scene segmentation
and semantic distance reasoning within longer video sequences.