Multimodal Scene Representation Learning for Spatial and Temporal Understanding in Video
Deniz Erus Yijiang Liu Qilmeg Doudatcz
Final project for 6.7960, MIT
Outline

Introduction

Related Works

Model Architecture

Data and Training Setup

Results and Discussion

Conclusion

Implications and Limitations

References

Figure: Detecting a spatial change. (The top plot shows the similarity between consecutive frame embeddings over time. A sharp drop in the curve around frame 15 marks a scene change).

Introduction

In films and short videos with narrative content, editing is a common tool directors use to highlight key moments and guide the viewer’s attention. Compared with the real physical world, however, editing weakens the continuity of time and space in the “story world”: a character may instantly appear in a far-away location in the next shot, and long actions may be compressed or skipped entirely. Filmmakers use these techniques to guide the viewer’s attention and construct a directed encounter with space, yet architects do not assume that visitors move through environments with such fixed viewpoints. This raises an important question: can a model learn a representation of video that reflects the identity of a space across many views, lighting conditions, and points in time?

Our project addresses this question by learning multimodal scene representations that can tell when different clips refer to the same physical environment. Rather than building a full continuity-checking system, for the scope of this project, we focus on how video and text features organize scenes in an embedding space and how this structure changes under multimodal supervision.

We combine visual embeddings from a VideoMAE encoder with text-based features from Qwen2-VL and study how these signals shape the geometry of the video embedding space. We use low-dimensional projections and similarity matrices to see how scenes organize by location and how they recur across an entire film. This is a key step in identifying continuity concerns such as lighting inconsistencies or shifts in temporal order.

Within this goal, we explore how much spatial structure is already present in a self-supervised video model and how much additional signal emerges through multimodal supervision. To do this, we compare three settings on the same videos: (1) a frozen VideoMAE baseline with a lightweight embedding head, (2) text-guided contrastive fine-tuning driven by LLM text embeddings, and (3) a teacher-supervised classifier trained on Qwen2-VL location labels.

By contrasting these methods, we examine the degree to which a self-supervised video model can recover spatial scene categories without human labels and how much its representation improves when guided by semantic structure from an external teacher.

Model Architecture

Our model builds upon a VideoMAE-based visual backbone and extends it with a scene-level embedding head and a text-guided contrastive learning module. The architecture is designed to combine visual understanding, scene segmentation, and world-model reasoning within a unified framework. Thus, the proposed architecture for semantic distance understanding contains three components: a video understanding backbone (VideoMAE), a scene embedding head, and an LLM-based semantic distance module (contrastive layer).

Video Understanding Backbone (VideoMAE)

The base of our system is the pretrained MCG-NJU/videomae-base transformer encoder. We load a fine-tuned checkpoint that has already been trained on the FineVideo dataset for video understanding tasks. To adapt this backbone for scene-level semantic modeling, we support two kinds of supervisory signals: Frame-level labels, where each frame receives an individual label; and Scene-level shared labels, where all frames within a timestamp interval share one scene label. These two settings allow the model to shift from coarse video understanding toward the more structured task of scene segmentation.

During fine-tuning, only the last K layers of VideoMAE’s transformer encoder (K = 2–4) are unfrozen. This preserves low-level visual representations while allowing high-level scene semantics to adapt to the downstream objective.
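As a sketch, partial unfreezing can be implemented by toggling `requires_grad` on the last K layers; here a generic PyTorch transformer stack stands in for the actual VideoMAE encoder:

```python
import torch.nn as nn

# Stand-in for VideoMAE-Base: 12 transformer layers with 768-d hidden states.
encoder = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
     for _ in range(12)]
)

K = 4  # number of top layers left trainable
for i, layer in enumerate(encoder):
    trainable = i >= len(encoder) - K  # only the last K layers adapt
    for p in layer.parameters():
        p.requires_grad = trainable
```

This keeps the low-level visual features of the early layers fixed while letting the top of the network adapt to the scene-level objective.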

Scene Embedding Head (MLP)

To produce a scene-level representation, we append a lightweight MLP head (Linear → ReLU → Linear) to the output CLS token. The MLP compresses the high-dimensional VideoMAE features into a compact embedding space in which scene relationships can be reasoned about. This embedding space is used both for segmentation tasks and for the semantic distance learning described below.
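A minimal PyTorch sketch of such a head (the hidden and output sizes here are our assumptions; VideoMAE-Base CLS features are 768-dimensional):

```python
import torch
import torch.nn as nn

class SceneEmbeddingHead(nn.Module):
    """Linear -> ReLU -> Linear head mapping CLS tokens to scene embeddings."""
    def __init__(self, in_dim=768, hidden_dim=512, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, cls_token):
        return self.mlp(cls_token)

head = SceneEmbeddingHead()
emb = head(torch.randn(4, 768))  # batch of 4 CLS tokens -> (4, 128) embeddings
```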

Semantic Distance (LLM-Based) Contrastive Training

While VideoMAE provides strong visual features, it lacks the world knowledge needed for higher-level scene reasoning. Film editing often omits intermediate transitions because humans can infer continuity from common-sense reasoning, and some videos take place in fictional universes where everyday expectations do not apply. This requires grounding in the actual visual content rather than generic knowledge. To address both limitations, we introduce a language-model-powered semantic distance module that uses text as an external world model.

For each scene, we extract metadata descriptions (scene description, props, context) and encode them using a large language model to obtain a normalized embedding t_i for scene i. Semantic relatedness between two scenes i and j is measured as the cosine similarity

s_ij = cos(t_i, t_j)

These embeddings provide semantic relationships that reflect narrative continuity, spatial logic, and object-level reasoning.
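As a sketch, the similarity computation reduces to a normalized dot product (a toy NumPy version; the real t_i come from the LLM):

```python
import numpy as np

def cosine_similarity(t_i, t_j):
    """s_ij = cos(t_i, t_j) between two text embeddings."""
    t_i = t_i / np.linalg.norm(t_i)
    t_j = t_j / np.linalg.norm(t_j)
    return float(t_i @ t_j)

# Toy 3-d "embeddings" for illustration only.
s_same = cosine_similarity(np.array([1.0, 0.0, 0.0]), np.array([2.0, 0.0, 0.0]))
s_orth = cosine_similarity(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
```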

We construct soft supervision weights from s_ij,

w_ij^pos = f(s_ij),   w_ij^neg = g(s_ij)

where w_ij^pos controls how strongly we pull a pair together and w_ij^neg controls how strongly we push it apart. We experiment with two implementations for f and g:

Linear mapping:

f_lin(s) = s,   g_lin(s) = 1 − s

Squared mapping:

f_sq(s) = s²,   g_sq(s) = (1 − s)²

In both cases, high text similarity leads to a strong pull, low similarity leads to a strong push, and mid-range similarities receive relatively small weights. Squaring enhances this effect by emphasizing very similar or very dissimilar pairs.
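The two mappings can be written directly in code, which also makes the mid-range suppression explicit (a minimal sketch):

```python
def f_lin(s):
    """Linear positive weight: pull strength grows with text similarity."""
    return s

def g_lin(s):
    """Linear negative weight: push strength grows as similarity falls."""
    return 1.0 - s

def f_sq(s):
    """Squared positive weight: emphasizes very similar pairs."""
    return s ** 2

def g_sq(s):
    """Squared negative weight: emphasizes very dissimilar pairs."""
    return (1.0 - s) ** 2

# A mid-range similarity of 0.5 gets half the pull weight under squaring:
# f_sq(0.5) = 0.25 vs f_lin(0.5) = 0.5
```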

To achieve this, given the video embeddings v_i from the MLP head, we first normalize them:

v̂_i = v_i / ‖v_i‖

and compute video-space similarity:

sim_ij^video = v̂_i · v̂_j

Positive loss term (pulling):

L_pos = E[ w_ij^pos · (1 − sim_ij^video) ]

Negative loss term (pushing):

L_neg = E[ w_ij^neg · max(sim_ij^video − m, 0) ]

where m is a margin hyperparameter.

And then we have the final objective:

L = L_pos + L_neg
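Putting the pieces together, a minimal NumPy sketch of the objective (the margin value of 0.2 and the per-batch averaging are our assumptions):

```python
import numpy as np

def contrastive_loss(v, s, m=0.2, squared=False):
    """v: (N, D) video embeddings; s: (N, N) text similarities s_ij; m: margin."""
    v_hat = v / np.linalg.norm(v, axis=1, keepdims=True)  # normalize embeddings
    sim = v_hat @ v_hat.T                                 # sim_ij^video
    w_pos = s ** 2 if squared else s                      # f(s_ij)
    w_neg = (1 - s) ** 2 if squared else 1 - s            # g(s_ij)
    mask = ~np.eye(len(v), dtype=bool)                    # exclude self-pairs
    l_pos = np.mean(w_pos[mask] * (1 - sim[mask]))        # pull term
    l_neg = np.mean(w_neg[mask] * np.maximum(sim[mask] - m, 0.0))  # push term
    return l_pos + l_neg
```

When embeddings are perfectly aligned and text similarities equal 1, both terms vanish, so the loss correctly rewards agreement between the two spaces.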

This module forces the video embedding space to align with the semantic topology implied by the LLM, while remaining grounded in actual visual evidence. A key motivation for this architecture is that video-based and language-based reasoning have complementary strengths:

VideoMAE compensates for LLM limitations: (1) fictional or stylized worlds violate real-world logic; (2) visual cues reveal spatial layout, lighting, actor identity, etc.

The LLM compensates for VideoMAE limitations: (1) editing creates discontinuities that humans resolve through world knowledge; (2) visually similar scenes may have very different narrative functions.

By coupling the two with contrastive learning, the model learns an embedding space that reflects both visual relations and narrative/world-level semantic similarity. Thus, the final representation is capable of supporting fine-grained scene segmentation and semantic distance reasoning within longer video sequences.

Figure: Scene distance of the original VideoMAE model before training. (The blue line shows the interpolated semantic distance and the orange line shows the visual distance; the red horizontal line is the semantic threshold. The LLM semantic distance fluctuates strongly and often crosses the threshold, while the original VideoMAE visual distance stays near zero with only small bumps.)
Figure: Semantic similarity matrix. (Almost all visual distances are extremely small, clustered below about 0.02. This indicates that VideoMAE sees the majority of clips as visually very similar: the spatial layout and low-level appearance of the footage remain stable across most of the timeline.)

Data and Training Setup

Dataset: We use the FineVideo dataset from HuggingFace, a large-scale collection of short video clips paired with detailed metadata. Each entry contains: raw video frames (mp4), structured scene annotations, object and prop descriptions, timestamps identifying scene boundaries, and contextual descriptions of each clip. This dataset is well-suited for training both visual scene classifiers and semantic video embedding models, because every scene is accompanied by rich natural-language descriptions.

Preprocessing and Scene Extraction: FineVideo organizes videos at the segment level and already provides labels, so we treat each unique timestamp interval as one scene:

We randomly stream and store 3,000 videos, covering 29,628 scene segments.

Baseline head training: We begin with the pretrained VideoMAE-Base backbone and add a lightweight fully connected head to produce scene-level embeddings. In this stage the backbone is frozen and only the head is trained. We use AdamW with a learning rate of 10⁻⁴ for the head and a weight decay of 10⁻⁴. This gives us a visual baseline representation for downstream analysis and for the teacher-supervised classifier.

Text-guided contrastive fine-tuning: After establishing a visual baseline and training several models with it, we introduce semantic supervision from text embeddings using our contrastive loss. At this stage we partially unfreeze the last 2–4 transformer layers (K = 2 in early experiments, K = 4 in final experiments); the learning rates are 5 × 10⁻⁵ for the unfrozen backbone layers and 5 × 10⁻⁴ for the MLP head. During this stage, the video embeddings are trained to reflect the semantic distances implied by the textual descriptions: scenes that are close in text space receive stronger “pull” weights, and scenes that are far apart receive stronger “push” weights.
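A sketch of the corresponding optimizer setup with per-group learning rates; the two modules here are hypothetical stand-ins for the unfrozen backbone layers and the MLP head:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the unfrozen VideoMAE layers and the MLP head.
backbone_tail = nn.Linear(768, 768)
head = nn.Linear(768, 128)

optimizer = torch.optim.AdamW(
    [
        {"params": backbone_tail.parameters(), "lr": 5e-5},  # unfrozen backbone
        {"params": head.parameters(), "lr": 5e-4},           # MLP head
    ],
    weight_decay=1e-4,
)
```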

Teacher-supervised variant: To add explicit semantic structure, we also experiment with a teacher-supervised classifier. For each segment, we prompt Qwen2-VL with a short video clip and ask it to assign a location label such as “kitchen”, “living room”, or “outside”. These labels serve as targets for a linear classifier head on top of the scene embeddings. We keep the VideoMAE backbone frozen and train the classifier with cross-entropy loss. Comparing this teacher-supervised classifier to the unsupervised PCA baseline lets us qualitatively assess how much additional structure we gain from multimodal supervision by inspecting the PCA projections before and after training.
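A minimal sketch of this teacher-supervised stage; the embedding size, label count, and random inputs are illustrative assumptions, and in practice the labels come from Qwen2-VL prompts:

```python
import torch
import torch.nn as nn

# Linear classifier over frozen 128-d scene embeddings, 3 location labels
# (e.g. outside / kitchen / living room).
classifier = nn.Linear(128, 3)
criterion = nn.CrossEntropyLoss()
opt = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

emb = torch.randn(8, 128)            # scene embeddings (backbone stays frozen)
labels = torch.randint(0, 3, (8,))   # pseudo-labels from the Qwen2-VL teacher

loss = criterion(classifier(emb), labels)  # cross-entropy against teacher labels
loss.backward()
opt.step()
```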

Results and Discussion

Figure: PCA of scene embeddings before and after Qwen supervision on a short sample video with three locations: outside (blue), kitchen (green), living room (red).

PCA Analysis of scene embeddings

We first analyze the unsupervised structure of VideoMAE embeddings using principal component analysis. For each scene, we take its CLS embedding, compute a 2D PCA projection, and color the point by Qwen2-VL’s location label.
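The projection step can be sketched with a plain SVD-based PCA (dimensions here are illustrative):

```python
import numpy as np

def pca_2d(X):
    """Project scene embeddings X of shape (N, D) onto the first two PCs."""
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                             # (N, 2) projection

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 768))  # 10 dummy CLS embeddings
proj = pca_2d(emb)                # 2D coordinates for the scatter plot
```

Each projected point is then colored by the Qwen2-VL location label to produce the scatter plots shown in the figure.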

Baseline (pre-supervision): Using embeddings from the pretrained VideoMAE model, PCA reveals that indoor and outdoor scenes often drift into somewhat different regions of the 2D plane, most likely due to the large brightness difference between them. However, indoor categories heavily overlap, and some scenes from the same physical location appear far apart in PCA space when the camera angle changes. This indicates that the pretrained model does not enforce a “place identity”.

After Qwen-guided training (post-supervision): We recompute scene embeddings and PCA projections. The distribution changes in several ways: Scenes sharing the same location label form tighter groups. Different location categories become more separated, for example, kitchens and living rooms now occupy more distinct regions.

The PCA plots before and after supervision give a qualitative picture of how the embedding space is being reshaped: from a generic representation dominated by low-level visual similarity to a more location-aware embedding where place identity plays a stronger role.

Text-guided contrastive fine-tuning

The LLM-generated embeddings allow us to compute semantic relatedness between scenes, which serves as the supervisory signal for video embedding training. Qualitative inspection confirms that the LLM is able to assign reasonable similarity scores based on narrative context, object composition, and scene semantics.

Baseline VideoMAE Scene Similarity: Before introducing text-driven supervision, we evaluate the baseline VideoMAE model on a multi-scene video. The model produces extremely high similarity (0.9-1.0) across all scenes, regardless of actual semantic differences. This indicates that although VideoMAE captures visual content well, its embedding space lacks the semantic structure needed to distinguish narrative roles or scene types.

Fine-tuning with Linear Text-Similarity Weights: In the second experiment, we introduce contrastive supervision using linearly mapped text similarity as positive/negative weights. We unfreeze the last two transformer layers and train for 2,000 steps. Training loss fluctuates between 0.27 and 0.30, showing only a mild downward trend. Despite the small change in loss, the model meaningfully restructures its embedding space and succeeds in separating different categories of scenes.

We hypothesize why the loss barely changes while the model still learns. First, the loss averages over all pairwise interactions in the batch, so even small adjustments in embedding geometry are diluted when averaged across dozens or hundreds of pairs. Second, the contrastive weights are soft and continuous rather than binary, which makes the optimization landscape smooth and produces subtle numerical changes even when the embedding space is being noticeably reorganized. Third, semantic improvements mainly affect relative distances, while the global magnitude of the loss may remain similar.

Fine-tuning with Squared Text-Similarity Weights: In the third experiment, we modify the weighting function as

w^pos = s²,   w^neg = (1 − s)²

This mapping suppresses mid-level weights and concentrates samples into strong positives (≈0.7–0.8) and strong negatives (≈0.1–0.2). We unfreeze four transformer layers and train for 4,000 steps. Training loss stabilizes between 0.11 and 0.13, with no clear downward trend. The model shows two striking behaviors:

  1. It very clearly separates transitional frames (fades, black frames, title cards). These consistently receive the lowest similarity in the embedding space.
  2. However, all narrative scenes collapse into an extremely tight cluster, producing similarity scores equal to 1 between nearly all meaningful scenes.
We hypothesize the following reasons. First, squared weighting amplifies strong positives disproportionately: high text-similarity pairs dominate the gradient signal, pulling many narrative scenes tightly together. Second, unfreezing more layers increases model plasticity; with four layers updated, the model can reshape its embedding space more dramatically, risking representation collapse toward a single cluster of “narrative scenes”. Finally, most narrative scenes share similar objects, actors, or contexts, so their text embeddings form a dense cluster, and after squaring their weights become even more dominant.

Figure: Similarity heatmap of 20 scenes from the Star Wars movie

Conclusion

Overall, the project suggests that a frozen, self-supervised video backbone already encodes some useful spatial and appearance structure. But it is not a stable world model of place. Multimodal supervision from an LLM-based teacher can significantly improve alignment between visual embeddings and scene identity, even when labels and text are only pseudo-annotations. PCA projections and similarity matrices provide a useful view into how these representations are organized and how they change with supervision.

Implications and limitations

This project shows that self-supervised video models, especially with teacher supervision, can capture meaningful scene structure. VideoMAE provides embeddings that can sometimes approximate the teacher's decision about whether two segments belong to the same location, which suggests that a good amount of spatial information is already present in the representation.

Qwen2-VL makes it possible to annotate many segments without manual work, which in turn lets us train stronger heads and fine-tune video transformers for scene understanding.

This project also has several limitations. The experiments are limited to a small number of videos due to storage and compute constraints, which makes the conclusions qualitative rather than solid quantitative results. Another limitation is that we use teacher labels as ground truth: if the teacher systematically mislabels scenes, the student will reproduce these errors without our being aware of it. A more rigorous solution would be to add human annotations of scene types and boundaries.

Future work could scale this idea up to use more data, try different teachers and architectures, and check performance in automated continuity checking while editing videos.


References:

[1] Chen, Shixing, Chun-Hao Liu, Xiang Hao, Xiaohan Nie, Maxim Arap, and Raffay Hamid. "Movies2Scenes: Using Movie Metadata to Learn Scene Representation." 2023.

[2] Chen, Shixing, Xiaohan Nie, David Fan, Dongqing Zhang, Vimal Bhat, and Raffay Hamid. "Shot Contrastive Self-Supervised Learning for Scene Boundary Detection." 2021.

[3] "VideoMAE/FINETUNE.md at main · MCG-NJU/VideoMAE." GitHub. Accessed December 1, 2025.

[4] Huang, Qingqiu, Yu Xiong, and Anyi Rao. "MovieNet." Accessed November 23, 2025.

[5] Multimedia Computing Group, Nanjing University. "MCG-NJU/VideoMAE." 2022. Python.

[6] Tan, Jiawei, Hongxing Wang, Kang Dang, Jiaxin Li, and Zhilong Ou. "Modality-Aware Shot Relating and Comparing for Video Scene Detection." 2025.

[7] Tong, Zhan, Yibing Song, Jue Wang, and Limin Wang. "VideoMAE: Masked Autoencoders Are Data-Efficient Learners for Self-Supervised Video Pre-Training." 2022.

[8] Yang, Antoine, and Arsha Nagrani. "Vid2Seq: A Pretrained Visual Language Model for Describing Multi-Event Videos." Accessed December 1, 2025.

[9] Wang, Limin, Bingkun Huang, Zhiyu Zhao, et al. "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking." 2023.

[10] Yang, Yang, Yurui Huang, Weili Guo, Baohua Xu, and Dingyin Xia. "Towards Global Video Scene Segmentation with Context-Aware Transformer." 2023.

[11] Alibaba Cloud. "Qwen on HuggingFace." 2025.