From Frames to Shots: A Deep Learning Perspective on Multimodal, Graph-Based, and Transformer Video Summarization – A review

Authors

  • Ali H. Ahmed Department of Computer Science, College of Science, Al-Mustansiriya University, Baghdad, Iraq.
  • Ibraheem N. Ibraheem Department of Computer Science, College of Science, Al-Mustansiriya University, Baghdad, Iraq.
  • Dheyaa A. Ibrahim Department of Financial and Banking Sciences, College of Administration and Economics, University of Fallujah, Fallujah, Iraq.

DOI:

https://doi.org/10.55145/ajest.2025.04.02.009

Keywords:

Deep Learning, Video Summarization, Graph

Abstract

Video summarization has become a vital solution for handling the explosive growth of video data across domains such as surveillance, education, entertainment, and healthcare. As visual media increasingly dominate digital communication, users and systems alike require fast, semantically rich access to content without viewing entire videos. Deep learning has fundamentally transformed this task, enabling models to detect, rank, and condense relevant segments into concise summaries that retain meaning, context, and narrative coherence.

However, progress is still hindered by recurring challenges: the wide variety of video formats and domains, non-uniform temporal structures, and the limited scale of annotated datasets available for supervised learning. Traditional frame-based models often suffer from redundancy and fragmented outputs, while supervised methods are constrained by annotation cost and poor domain generalization. Many summarization systems still under-address multimodal fusion, temporal alignment, and long-range semantic reasoning, and benchmark evaluations rarely account for cross-modal contributions or human subjectivity in summary preferences. This review offers a comprehensive and technically grounded survey of 46 deep learning-based approaches, organized around five foundational techniques: multimodal representation and fusion, segment/shot-level summarization, graph-based modeling, transformer architectures, and learning paradigms including supervised, unsupervised, and self-supervised frameworks. By structuring the discussion around architectural innovations rather than individual models or datasets, we identify core methodological patterns, highlight the evolution of learning strategies, and analyze the impact of unit granularity and modality integration on summarization quality. We conclude with an original synthesis of trends, research gaps, and future opportunities in real-time, hybrid, and label-free summarization design. Key results from our comparative analysis show that segment- or shot-based methods comprise over 70% of modern models, reflecting a broad shift away from frame-based summarization, and that transformer-based architectures, often combined with GNNs or hierarchical encoders, have overtaken RNNs as the dominant sequence modeling strategy. Our comparative tables further reveal that intermediate fusion techniques consistently outperform early and late fusion strategies, especially when paired with attention-based alignment. Together, these results link specific architectural and learning design decisions to better semantic coverage, scalability, and performance, offering practical guidance to researchers and developers building the next generation of summarization systems.

Published

2025-08-31

How to Cite

Ahmed, A. H., Ibraheem, I. N., & Ibrahim, D. A. (2025). From Frames to Shots: A Deep Learning Perspective on Multimodal, Graph-Based, and Transformer Video Summarization – A review. Al-Salam Journal for Engineering and Technology, 4(2), 108–128. https://doi.org/10.55145/ajest.2025.04.02.009

Section

Articles
