Multimodal Scene Representation Learning for Spatial and Temporal Understanding in Video
Deniz Erus Yijiang Liu Qilmeg Doudatcz
Final project for 6.7960, MIT
Outline

Introduction

Related Works

Model Architecture

Data and Training Setup

Results and Discussion

Conclusion

Implications and Limitations

References

Figure: Detecting a spatial change. (The top plot shows the similarity between consecutive frame embeddings over time. A sharp drop in the curve around frame 15 marks a scene change).

Introduction

In films and short videos with narrative content, editing is a common tool directors use to highlight key moments and guide the viewer’s attention. Compared with the real physical world, however, editing weakens the continuity of time and space in the “story world”: a character may instantly appear in a far-away location in the next shot, and long actions may be compressed or skipped entirely. Filmmakers use these techniques to guide the viewer’s attention and construct a directed encounter with space, yet architects do not assume that visitors move through environments with such fixed viewpoints. This raises an important question: can a model learn a representation of video that reflects the identity of a space across many views, lighting conditions, and points in time?

Our project addresses this question by learning multimodal scene representations that can tell when different clips refer to the same physical environment. Rather than building a full continuity-checking system, for the scope of this project, we focus on how video and text features organize scenes in an embedding space and how this structure changes under multimodal supervision.

We combine visual embeddings from a VideoMAE encoder with text-based features from Qwen2-VL and study how these signals shape the geometry of the video embedding space. We use low-dimensional projections and similarity matrices to see how scenes organize by location and how they recur across an entire film. This is a key step in identifying continuity concerns such as lighting inconsistencies or shifts in temporal order.

Within this goal, we explore how much spatial structure is already present in a self-supervised video model and how much additional signal emerges through multimodal supervision. To do this, we compare three settings on the same videos: (1) a frozen VideoMAE baseline with a lightweight embedding head, (2) text-guided contrastive fine-tuning driven by LLM text embeddings, and (3) a teacher-supervised classifier trained on Qwen2-VL location labels.

By contrasting these methods, we examine the degree to which a self-supervised video model can recover spatial scene categories without human labels and how much its representation improves when guided by semantic structure from an external teacher.

Model Architecture

Our model builds upon a VideoMAE-based visual backbone and extends it with a scene-level embedding head and a text-guided contrastive learning module. The architecture is designed to combine visual understanding, scene segmentation, and world-model reasoning within a unified framework. Thus, the proposed architecture for semantic distance understanding contains three components: a video understanding backbone (VideoMAE), a scene embedding head, and an LLM-based semantic distance module (contrastive layer).

Video Understanding Backbone (VideoMAE)

The base of our system is the pretrained MCG-NJU/videomae-base transformer encoder. We load a fine-tuned checkpoint that has already been trained on the FineVideo dataset for video understanding tasks. To adapt this backbone for scene-level semantic modeling, we support two kinds of supervisory signals: Frame-level labels, where each frame receives an individual label; and Scene-level shared labels, where all frames within a timestamp interval share one scene label. These two settings allow the model to shift from coarse video understanding toward the more structured task of scene segmentation.

During fine-tuning, only the last K layers of VideoMAE’s transformer encoder (K = 2–4) are unfrozen. This preserves low-level visual representations while allowing high-level scene semantics to adapt to the downstream objective.
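As a sketch, partial unfreezing can be implemented by toggling `requires_grad` on the last K layers; here a generic PyTorch transformer stack stands in for the actual VideoMAE encoder:

```python
import torch.nn as nn

# Stand-in for VideoMAE-Base: 12 transformer layers with 768-d hidden states.
encoder = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
     for _ in range(12)]
)

K = 4  # number of top layers left trainable
for i, layer in enumerate(encoder):
    trainable = i >= len(encoder) - K  # only the last K layers adapt
    for p in layer.parameters():
        p.requires_grad = trainable
```

This keeps the low-level visual features of the early layers fixed while letting the top of the network adapt to the scene-level objective.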

Scene Embedding Head (MLP)

To produce a scene-level representation, we append a lightweight MLP head (Linear → ReLU → Linear) to the output CLS token. The MLP compresses the high-dimensional VideoMAE features into a compact embedding space in which scene relationships can be reasoned about. This embedding space is used both for segmentation tasks and for the semantic distance learning described below.
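A minimal PyTorch sketch of such a head (the hidden and output sizes here are our assumptions; VideoMAE-Base CLS features are 768-dimensional):

```python
import torch
import torch.nn as nn

class SceneEmbeddingHead(nn.Module):
    """Linear -> ReLU -> Linear head mapping CLS tokens to scene embeddings."""
    def __init__(self, in_dim=768, hidden_dim=512, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, cls_token):
        return self.mlp(cls_token)

head = SceneEmbeddingHead()
emb = head(torch.randn(4, 768))  # batch of 4 CLS tokens -> (4, 128) embeddings
```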

Semantic Distance (LLM-Based) Contrastive Training

While VideoMAE provides strong visual features, it lacks the world knowledge needed for higher-level scene reasoning. Film editing often omits intermediate transitions because humans can infer continuity from common-sense reasoning, and some videos take place in fictional universes where everyday expectations do not apply. This requires grounding in the actual visual content rather than generic knowledge. To address both limitations, we introduce a language-model-powered semantic distance module that uses text as an external world model.

For each scene, we extract metadata descriptions (scene description, props, context) and encode them using a large language model to obtain a normalized embedding t_i for scene i. Semantic relatedness between two scenes i and j is measured as the cosine similarity

s_ij = cos(t_i, t_j)

These embeddings provide semantic relationships that reflect narrative continuity, spatial logic, and object-level reasoning.
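As a sketch, the similarity computation reduces to a normalized dot product (a toy NumPy version; the real t_i come from the LLM):

```python
import numpy as np

def cosine_similarity(t_i, t_j):
    """s_ij = cos(t_i, t_j) between two text embeddings."""
    t_i = t_i / np.linalg.norm(t_i)
    t_j = t_j / np.linalg.norm(t_j)
    return float(t_i @ t_j)

# Toy 3-d "embeddings" for illustration only.
s_same = cosine_similarity(np.array([1.0, 0.0, 0.0]), np.array([2.0, 0.0, 0.0]))
s_orth = cosine_similarity(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
```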

We construct soft supervision weights from s_ij,

w_ij^pos = f(s_ij),   w_ij^neg = g(s_ij)

where w_ij^pos controls how strongly we pull a pair together and w_ij^neg controls how strongly we push it apart. We experiment with two implementations for f and g:

Linear mapping:

f_lin(s) = s,   g_lin(s) = 1 − s

Squared mapping:

f_sq(s) = s²,   g_sq(s) = (1 − s)²

In both cases, high text similarity leads to a strong pull, low similarity leads to a strong push, and mid-range similarities receive relatively small weights. Squaring enhances this effect by emphasizing very similar or very dissimilar pairs.
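The two mappings can be written directly in code, which also makes the mid-range suppression explicit (a minimal sketch):

```python
def f_lin(s):
    """Linear positive weight: pull strength grows with text similarity."""
    return s

def g_lin(s):
    """Linear negative weight: push strength grows as similarity falls."""
    return 1.0 - s

def f_sq(s):
    """Squared positive weight: emphasizes very similar pairs."""
    return s ** 2

def g_sq(s):
    """Squared negative weight: emphasizes very dissimilar pairs."""
    return (1.0 - s) ** 2

# A mid-range similarity of 0.5 gets half the pull weight under squaring:
# f_sq(0.5) = 0.25 vs f_lin(0.5) = 0.5
```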

To achieve this, given the video embeddings v_i from the MLP head, we first normalize them:

v̂_i = v_i / ‖v_i‖

and compute video-space similarity:

sim_ij^video = v̂_i · v̂_j

Positive loss term (pulling):

L_pos = E[ w_ij^pos · (1 − sim_ij^video) ]

Negative loss term (pushing):

L_neg = E[ w_ij^neg · max(sim_ij^video − m, 0) ]

where m is a margin hyperparameter.

And then we have the final objective:

L = L_pos + L_neg
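Putting the pieces together, a minimal NumPy sketch of the objective (the margin value of 0.2 and the per-batch averaging are our assumptions):

```python
import numpy as np

def contrastive_loss(v, s, m=0.2, squared=False):
    """v: (N, D) video embeddings; s: (N, N) text similarities s_ij; m: margin."""
    v_hat = v / np.linalg.norm(v, axis=1, keepdims=True)  # normalize embeddings
    sim = v_hat @ v_hat.T                                 # sim_ij^video
    w_pos = s ** 2 if squared else s                      # f(s_ij)
    w_neg = (1 - s) ** 2 if squared else 1 - s            # g(s_ij)
    mask = ~np.eye(len(v), dtype=bool)                    # exclude self-pairs
    l_pos = np.mean(w_pos[mask] * (1 - sim[mask]))        # pull term
    l_neg = np.mean(w_neg[mask] * np.maximum(sim[mask] - m, 0.0))  # push term
    return l_pos + l_neg
```

When embeddings are perfectly aligned and text similarities equal 1, both terms vanish, so the loss correctly rewards agreement between the two spaces.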

This module forces the video embedding space to align with the semantic topology implied by the LLM, while remaining grounded in actual visual evidence. A key motivation for this architecture is that video-based and language-based reasoning have complementary strengths:

VideoMAE compensates for LLM limitations: (1) fictional or stylized worlds violate real-world logic; (2) visual cues reveal spatial layout, lighting, actor identity, etc.

The LLM compensates for VideoMAE limitations: (1) editing creates discontinuities that humans resolve through world knowledge; (2) visually similar scenes may have very different narrative functions.

By coupling the two with contrastive learning, the model learns an embedding space that reflects both visual relations and narrative/world-level semantic similarity. Thus, the final representation is capable of supporting fine-grained scene segmentation and semantic distance reasoning within longer video sequences.

Figure: Scene distance of the original VideoMAE model before training. (The blue line shows the interpolated semantic distance and the orange line shows the visual distance; the red horizontal line is the semantic threshold. The LLM semantic distance fluctuates strongly and often crosses the threshold, while the original VideoMAE visual distance stays near zero with only small bumps.)
Figure: Semantic similarity matrix. (Almost all visual distances are extremely small, clustered below about 0.02. This indicates that VideoMAE sees the majority of clips as visually very similar: the spatial layout and low-level appearance of the footage remain stable across most of the timeline.)

Data and Training Setup

Dataset: We use the FineVideo dataset from HuggingFace, a large-scale collection of short video clips paired with detailed metadata. Each entry contains: raw video frames (mp4), structured scene annotations, object and prop descriptions, timestamps identifying scene boundaries, and contextual descriptions of each clip. This dataset is well-suited for training both visual scene classifiers and semantic video embedding models, because every scene is accompanied by rich natural-language descriptions.

Preprocessing and Scene Extraction: FineVideo organizes videos at the segment level and already provides labels, so we treat each unique timestamp interval as one scene:

We randomly stream and store 3,000 videos, covering 29,628 scene segments.

Baseline head training: We begin with the pretrained VideoMAE-Base backbone and add a lightweight fully connected head to produce scene-level embeddings. In this stage the backbone is frozen and only the head is trained. We use AdamW with a learning rate of 10⁻⁴ for the head and a weight decay of 10⁻⁴. This gives us a visual baseline representation for downstream analysis and for the teacher-supervised classifier.

Text-guided contrastive fine-tuning: After establishing a visual baseline and training several models with it, we introduce semantic supervision from text embeddings using our contrastive loss. At this stage we partially unfreeze the last 2–4 transformer layers (K = 2 in early experiments, K = 4 in final experiments); the learning rates are 5 × 10⁻⁵ for the unfrozen backbone layers and 5 × 10⁻⁴ for the MLP head. During this stage, the video embeddings are trained to reflect the semantic distances implied by the textual descriptions: scenes that are close in text space receive stronger “pull” weights, and scenes that are far apart receive stronger “push” weights.
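A sketch of the corresponding optimizer setup with per-group learning rates; the two modules here are hypothetical stand-ins for the unfrozen backbone layers and the MLP head:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the unfrozen VideoMAE layers and the MLP head.
backbone_tail = nn.Linear(768, 768)
head = nn.Linear(768, 128)

optimizer = torch.optim.AdamW(
    [
        {"params": backbone_tail.parameters(), "lr": 5e-5},  # unfrozen backbone
        {"params": head.parameters(), "lr": 5e-4},           # MLP head
    ],
    weight_decay=1e-4,
)
```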

Teacher-supervised variant: To add explicit semantic structure, we also experiment with a teacher-supervised classifier. For each segment, we prompt Qwen2-VL with a short video clip and ask it to assign a location label such as “kitchen”, “living room”, or “outside”. These labels serve as targets for a linear classifier head on top of the scene embeddings. We keep the VideoMAE backbone frozen and train the classifier with cross-entropy loss. Comparing this teacher-supervised classifier to the unsupervised PCA baseline lets us qualitatively assess how much additional structure we gain from multimodal supervision by inspecting the PCA projections before and after training.
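A minimal sketch of this teacher-supervised stage; the embedding size, label count, and random inputs are illustrative assumptions, and in practice the labels come from Qwen2-VL prompts:

```python
import torch
import torch.nn as nn

# Linear classifier over frozen 128-d scene embeddings, 3 location labels
# (e.g. outside / kitchen / living room).
classifier = nn.Linear(128, 3)
criterion = nn.CrossEntropyLoss()
opt = torch.optim.AdamW(classifier.parameters(), lr=1e-3)

emb = torch.randn(8, 128)            # scene embeddings (backbone stays frozen)
labels = torch.randint(0, 3, (8,))   # pseudo-labels from the Qwen2-VL teacher

loss = criterion(classifier(emb), labels)  # cross-entropy against teacher labels
loss.backward()
opt.step()
```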

Results and Discussion

Figure: PCA of scene embeddings before and after Qwen supervision on a short sample video with three locations: outside (blue), kitchen (green), living room (red).

PCA Analysis of scene embeddings

We first analyze the unsupervised structure of VideoMAE embeddings using principal component analysis. For each scene, we take its CLS embedding, compute a 2D PCA projection, and color the point by Qwen2-VL’s location label.
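The projection step can be sketched with a plain SVD-based PCA (dimensions here are illustrative):

```python
import numpy as np

def pca_2d(X):
    """Project scene embeddings X of shape (N, D) onto the first two PCs."""
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                             # (N, 2) projection

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 768))  # 10 dummy CLS embeddings
proj = pca_2d(emb)                # 2D coordinates for the scatter plot
```

Each projected point is then colored by the Qwen2-VL location label to produce the scatter plots shown in the figure.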

Baseline (pre-supervision): Using embeddings from the pretrained VideoMAE model, PCA reveals that indoor and outdoor scenes often drift into somewhat different regions of the 2D plane, most likely due to the large brightness difference between them. However, indoor categories heavily overlap, and some scenes from the same physical location appear far apart in PCA space when the camera angle changes. This indicates that the pretrained model does not enforce a “place identity”.

After Qwen-guided training (post-supervision): We recompute scene embeddings and PCA projections. The distribution changes in several ways: Scenes sharing the same location label form tighter groups. Different location categories become more separated, for example, kitchens and living rooms now occupy more distinct regions.

The PCA plots before and after supervision give a qualitative picture of how the embedding space is being reshaped: from a generic representation dominated by low-level visual similarity to a more location-aware embedding where place identity plays a stronger role.

Text-guided contrastive fine-tuning

The LLM-generated embeddings allow us to compute semantic relatedness between scenes, which serves as the supervisory signal for video embedding training. Qualitative inspection confirms that the LLM is able to assign reasonable similarity scores based on narrative context, object composition, and scene semantics.

Baseline VideoMAE Scene Similarity: Before introducing text-driven supervision, we evaluate the baseline VideoMAE model on a multi-scene video. The model produces extremely high similarity (0.9-1.0) across all scenes, regardless of actual semantic differences. This indicates that although VideoMAE captures visual content well, its embedding space lacks the semantic structure needed to distinguish narrative roles or scene types.

Fine-tuning with Linear Text-Similarity Weights: In the second experiment, we introduce contrastive supervision using linearly mapped text similarity as positive/negative weights. We unfreeze the last two transformer layers and train for 2,000 steps. Training loss fluctuates between 0.27 and 0.30, showing only a mild downward trend. Despite the small change in loss, the model meaningfully restructures its embedding space and succeeds in separating different categories of scenes.

We hypothesize why the loss barely changes while the model still learns. First, the loss averages over all pairwise interactions in the batch, so even small adjustments in embedding geometry are diluted when averaged across dozens or hundreds of pairs. Second, the contrastive weights are soft and continuous rather than binary, which makes the optimization landscape smooth and produces subtle numerical changes even when the embedding space is being noticeably reorganized. Third, semantic improvements mainly affect relative distances, while the global magnitude of the loss may remain similar.

Fine-tuning with Squared Text-Similarity Weights: In the third experiment, we modify the weighting function as

w^pos = s²,   w^neg = (1 − s)²

This mapping suppresses mid-level weights and concentrates samples into strong positives (≈0.7–0.8) and strong negatives (≈0.1–0.2). We unfreeze four transformer layers and train for 4,000 steps. Training loss stabilizes between 0.11 and 0.13, with no clear downward trend. The model shows two striking behaviors:

  1. It very clearly separates transitional frames (fades, black frames, title cards). These consistently receive the lowest similarity in the embedding space.
  2. However, all narrative scenes collapse into an extremely tight cluster, producing similarity scores equal to 1 between nearly all meaningful scenes.
We hypothesize the following reasons. First, squared weighting amplifies strong positives disproportionately: high text-similarity pairs dominate the gradient signal, pulling many narrative scenes tightly together. Second, unfreezing more layers increases model plasticity; with four layers updated, the model can reshape its embedding space more dramatically, risking representation collapse toward a single cluster of “narrative scenes”. Finally, most narrative scenes share similar objects, actors, or contexts, so their text embeddings form a dense cluster, and after squaring their weights become even more dominant.

Figure: Similarity heatmap of 20 scenes from the Star Wars movie

Conclusion

Overall, the project suggests that a frozen, self-supervised video backbone already encodes some useful spatial and appearance structure. But it is not a stable world model of place. Multimodal supervision from an LLM-based teacher can significantly improve alignment between visual embeddings and scene identity, even when labels and text are only pseudo-annotations. PCA projections and similarity matrices provide a useful view into how these representations are organized and how they change with supervision.

Implications and limitations

This project shows that self-supervised video models, especially with teacher supervision, can capture meaningful scene structure. VideoMAE provides embeddings that can sometimes approximate the teacher's decision about whether two segments belong to the same location, which suggests that a good amount of spatial information is already present in the representation.

Qwen2-VL makes it possible to annotate many segments without manual work, which in turn lets us train stronger heads and fine-tune video transformers for scene understanding.

This project also has several limitations. The experiments are limited to a small number of videos due to storage and compute constraints, which makes the conclusions qualitative rather than solid quantitative results. Another limitation is that we use teacher labels as ground truth: if the teacher systematically mislabels scenes, the student will reproduce these errors without our being aware of it. A more rigorous solution would be to add human annotations of scene types and boundaries.

Future work could scale this idea up to use more data, try different teachers and architectures, and check performance in automated continuity checking while editing videos.


References:

[1] Chen, Shixing, Chun-Hao Liu, Xiang Hao, Xiaohan Nie, Maxim Arap, and Raffay Hamid. "Movies2Scenes: Using Movie Metadata to Learn Scene Representation." 2023.

[2] Chen, Shixing, Xiaohan Nie, David Fan, Dongqing Zhang, Vimal Bhat, and Raffay Hamid. "Shot Contrastive Self-Supervised Learning for Scene Boundary Detection." 2021.

[3] "VideoMAE/FINETUNE.md at main · MCG-NJU/VideoMAE." GitHub. Accessed December 1, 2025.

[4] Huang, Qingqiu, Yu Xiong, and Anyi Rao. "MovieNet." Accessed November 23, 2025.

[5] Multimedia Computing Group, Nanjing University. "MCG-NJU/VideoMAE." 2022. Python.

[6] Tan, Jiawei, Hongxing Wang, Kang Dang, Jiaxin Li, and Zhilong Ou. "Modality-Aware Shot Relating and Comparing for Video Scene Detection." 2025.

[7] Tong, Zhan, Yibing Song, Jue Wang, and Limin Wang. "VideoMAE: Masked Autoencoders Are Data-Efficient Learners for Self-Supervised Video Pre-Training." 2022.

[8] Yang, Antoine, and Arsha Nagrani. "Vid2Seq: A Pretrained Visual Language Model for Describing Multi-Event Videos." Accessed December 1, 2025.

[9] Wang, Limin, Bingkun Huang, Zhiyu Zhao, et al. "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking." 2023.

[10] Yang, Yang, Yurui Huang, Weili Guo, Baohua Xu, and Dingyin Xia. "Towards Global Video Scene Segmentation with Context-Aware Transformer." 2023.

[11] Alibaba Cloud. "Qwen on HuggingFace." 2025.