Video grounding is a fundamental problem in multimodal content understanding, aiming to localize specific natural language queries in an untrimmed video. However, current video grounding datasets mostly focus on simple events and are limited to either shorter videos or brief sentences, which hinders models from developing stronger multimodal understanding capabilities. To address these limitations, we present a large-scale video grounding dataset named SynopGround, in which more than 2800 hours of videos are sourced from popular TV dramas and are paired with accurately localized human-written synopses. Each paragraph in the synopsis serves as a language query and is manually annotated with precise temporal boundaries in the long video. These paragraph queries are tightly correlated with each other and contain a wealth of abstract expressions summarizing video storylines as well as specific descriptions portraying event details, which enables the model to learn multimodal perception of more intricate concepts over longer context dependencies. Based on the dataset, we further introduce a more complex setting of video grounding dubbed Multi-Paragraph Video Grounding (MPVG), which takes multiple paragraphs and a long video as input and grounds each paragraph query to its temporal interval. In addition, we propose a novel Local-Global Multimodal Reasoner (LGMR) to explicitly model the local-global structures of long-term multimodal inputs for MPVG. Our method provides an effective baseline solution to the multi-paragraph video grounding problem. Extensive experiments verify the proposed model's effectiveness as well as its superiority in long-term multi-paragraph video grounding over prior state-of-the-art methods. Dataset and code are publicly available.
Currently, most commonly used datasets are based on short videos and brief sentence queries. This setup prevents models from developing stronger abilities such as long-term contextual multimodal understanding that bridges long-form videos and long-text queries. Besides, short queries describing detailed events are more prone to semantic ambiguity in referring expressions, i.e., one-to-many correspondences between queries and moments, which adversely affects model learning. In this work, we curate and present a large-scale dataset called SynopGround, in which over 2800 hours of narrative videos with human-written synopses are manually annotated with dense timestamps. Different from the short, general descriptions like "She steps closer." widely used in previous datasets, we use synopses involving both high-level expressions conveying abstract concepts and concrete descriptions picturing specific details. As shown in the table above, queries from our dataset contain very concrete descriptions of visible activities like "go to the cabin", as well as extremely concise and abstract expressions like "spent a happy time". Such language queries are more challenging, less ambiguous, and force the model to learn long-term cross-modal reasoning over higher-level concepts and storylines.
Multi-Paragraph Video Grounding (MPVG) and two representative dataset samples. Given a video and a synopsis Q that contains N paragraphs {Q1, Q2, ..., QN}, the model should predict the corresponding temporal interval for each paragraph Qi in the form of a starting and ending time. This task is much more complex than conventional video grounding and requires understanding both short-term intra-paragraph semantics and long-term inter-paragraph dependencies, connecting the complex temporal structures of long videos with the complicated semantics of long paragraphs.
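To make the task format concrete, below is a minimal Python sketch of what an MPVG sample and a per-paragraph evaluation could look like. The field names (video_id, paragraphs, start, end) and the Recall@IoU-style metric are illustrative assumptions, not the official SynopGround annotation schema or evaluation protocol.

```python
from typing import Dict, List


def temporal_iou(pred: Dict[str, float], gt: Dict[str, float]) -> float:
    """Temporal IoU between a predicted and a ground-truth interval (times in seconds)."""
    inter = max(0.0, min(pred["end"], gt["end"]) - max(pred["start"], gt["start"]))
    union = (pred["end"] - pred["start"]) + (gt["end"] - gt["start"]) - inter
    return inter / union if union > 0 else 0.0


def evaluate_sample(predictions: List[Dict[str, float]],
                    annotations: List[Dict[str, float]],
                    threshold: float = 0.5) -> float:
    """Fraction of paragraphs whose predicted interval reaches IoU >= threshold."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(predictions, annotations))
    return hits / len(annotations)


# One MPVG sample: a long video paired with N paragraph queries, each annotated
# with a start/end time in the video (field names are hypothetical).
sample = {
    "video_id": "drama_episode_001",
    "paragraphs": ["Paragraph query Q1 ...", "Paragraph query Q2 ..."],
    "annotations": [{"start": 12.0, "end": 310.5},
                    {"start": 320.0, "end": 845.2}],
}
```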
(a): Genre distribution of TV dramas. (b): Normalized duration of target video segments. (c): Number of queries per video. (d): Normalized start timestamp distribution. (e): Normalized end timestamp distribution. As shown in figure (a) above, the TV dramas in our dataset cover a wide spectrum of genres, which demonstrates the diversity of the collected data. In figure (b) above, we show the normalized duration of the target video segments. Most target segments cover less than 20% of the full video, which makes them challenging for the model to localize correctly. In figure (c) above, we visualize the distribution of the number of queries/paragraphs per synopsis; most synopses are composed of 5-13 paragraphs. Exploiting the contextual information among these paragraphs is important for achieving promising performance in our multi-paragraph video grounding task. In figures (d) and (e) above, we visualize the temporal distributions of the starting and ending timestamps of the target video segments. Both are approximately uniform, which ensures that the model cannot benefit much from distribution biases.
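For clarity, the following sketch shows how the per-video statistics behind panels (b)-(e) could be computed from segment-level annotations. The annotation layout used here (video duration plus a list of start/end times in seconds) is a simplified assumption, not the official release format.

```python
from typing import Dict, List


def per_video_stats(video_duration: float,
                    segments: List[Dict[str, float]]) -> Dict[str, object]:
    """Query count plus normalized durations and start/end timestamps of target segments."""
    return {
        "num_queries": len(segments),                                                    # panel (c)
        "norm_durations": [(s["end"] - s["start"]) / video_duration for s in segments],  # panel (b)
        "norm_starts": [s["start"] / video_duration for s in segments],                  # panel (d)
        "norm_ends": [s["end"] / video_duration for s in segments],                      # panel (e)
    }


# Example: a 45-minute episode with two annotated paragraph queries.
print(per_video_stats(2700.0, [{"start": 120.0, "end": 480.0},
                               {"start": 900.0, "end": 1150.0}]))
```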
Comprehensive comparisons of our SynopGround with existing video grounding datasets. As shown, the videos in our dataset are much longer than those in Charades-STA, ActivityNet-Captions, DiDeMo, TACoS and Ego4D-NLQ. Although the average video duration in our dataset is shorter than that of MAD, our total video duration is more than twice that of MAD, confirming the large scale of our dataset. Furthermore, the target segments in our dataset are significantly longer, which requires the model to capture the full picture of the holistic story conveyed in the language queries and the video and is therefore challenging. Besides, our dataset is the first benchmark to introduce paragraph queries, and the average number of words per query is significantly larger than in other datasets, which greatly reduces the semantic ambiguity of the queries and poses greater challenges to the model's cross-modal understanding abilities. Moreover, our synopsis queries involve both abstract expressions and concrete descriptions, enabling the model to learn semantic concepts at more diverse abstraction levels.
Overview of our proposed Local-Global Multimodal Reasoner (LGMR). It consists of a local-global temporal encoder for structured long-term temporal modeling and a local-global iterative decoder that adaptively extracts local subparagraph features guided by global paragraph semantics and reasons over the local and global semantics for multi-paragraph video grounding. The video encoder decomposes the temporal correlations of long videos into intra-window and inter-window parts for efficient long-term temporal modeling. The query decoder first extracts subparagraph representations with a set of learnable queries guided by the global semantics of paragraphs, and then repeatedly conducts cross-modal reasoning within and across the local and global queries.
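To illustrate the intra-window/inter-window decomposition used by the temporal encoder, here is a minimal PyTorch sketch. The window length, feature dimension, mean-pooled window summaries, and the way global context is broadcast back to clips are assumptions for illustration and do not reproduce the exact LGMR architecture.

```python
import torch
import torch.nn as nn


class LocalGlobalTemporalEncoder(nn.Module):
    """Decomposes long-video temporal modeling into intra-window and inter-window attention."""

    def __init__(self, dim: int = 256, heads: int = 8, window: int = 32):
        super().__init__()
        self.window = window
        self.intra_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, C) clip-level features of a long video
        B, T, C = video.shape
        pad = (-T) % self.window
        x = nn.functional.pad(video, (0, 0, 0, pad))  # pad T to a multiple of the window size
        n_win = x.shape[1] // self.window

        # Intra-window attention: short-term correlations inside each window.
        local = x.reshape(B * n_win, self.window, C)
        local, _ = self.intra_attn(local, local, local)

        # Inter-window attention: long-term correlations across window summaries.
        summaries = local.reshape(B, n_win, self.window, C).mean(dim=2)
        global_ctx, _ = self.inter_attn(summaries, summaries, summaries)

        # Broadcast the global context back to every clip within its window.
        out = local.reshape(B, n_win, self.window, C) + global_ctx.unsqueeze(2)
        return out.reshape(B, n_win * self.window, C)[:, :T]


# Example: encode a batch of 2 videos with 500 clip-level features each.
features = torch.randn(2, 500, 256)
encoded = LocalGlobalTemporalEncoder()(features)  # -> (2, 500, 256)
```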
As shown above, we evaluate our proposed LGMR on the challenging multi-paragraph video grounding task and compare it with the existing state-of-the-art methods DepNet and PRVG. The comparison results demonstrate that our model achieves the best performance and outperforms the others by a significant margin, which validates the effectiveness of LGMR for addressing MPVG. In addition, we conduct detailed experiments to verify our idea of modeling and reasoning over the local-global structures of long queries. First, the model using only local queries in the cross-modal decoding process performs significantly worse than our final model. Second, we observe significant performance gains when jointly modeling the local and global queries from the long text inputs during decoding, showing the importance of our local-global query modeling. Finally, we find that adding a cross-level interaction module further improves performance, which suggests the benefit of mining and reasoning over complementary local-global information.
@inproceedings{tan2024synopground,
author={Tan, Chaolei and Lin, Zihang and Pu, Junfu and Qi, Zhongang and Pei, Wei-Yi and Qu, Zhi and Wang, Yexin and Shan, Ying and Zheng, Wei-Shi and Hu, Jian-Fang},
title={SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses},
booktitle={ACM MM},
year={2024}
}
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Usage and License Notices: The dataset and code are intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.