An Inverse Partial Optimal Transport Framework for
Music-guided Movie Trailer Generation

Yutong Wang1*, Sidan Zhu1*, Hongteng Xu2, Dixin Luo1†
1Beijing Institute of Technology 2Renmin University of China
*Indicates Equal Contribution. †Indicates Corresponding Author.

ACM Multimedia 2024

TL;DR:   Given a raw video and a piece of music, we can generate an audio-visually coherent and appealing video trailer/montage.

Abstract

Trailer generation is a challenging video clipping task that aims to select highlight shots from long videos such as movies and reorganize them in an attractive way. In this study, we propose an inverse partial optimal transport (IPOT) framework to achieve music-guided movie trailer generation. In particular, we formulate the trailer generation task as selecting and sorting key movie shots based on audio shots, which involves matching the latent representations across visual and acoustic modalities. We learn a multi-modal latent representation model in the proposed IPOT framework to achieve this aim. In this framework, a two-tower encoder derives the latent representations of movie and music shots, respectively, and an attention-assisted Sinkhorn matching network parameterizes the grounding distance between the shots' latent representations and the distribution of the movie shots. Taking the correspondence between a movie's shots and its trailer's music shots as the observed optimal transport plan defined on the grounding distances, we learn the model by solving an inverse partial optimal transport problem, leading to a bi-level optimization strategy. We collect real-world movies and their trailers to construct a dataset with abundant label information, called CMTD, and use it to train and evaluate various automatic trailer generators. Compared with state-of-the-art methods, our IPOT method consistently shows superiority in subjective visual effects and objective quantitative measurements.
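To make the "selecting and sorting" idea concrete, the following is a minimal sketch of the forward matching step only: an entropic partial optimal transport problem solved with plain Sinkhorn iterations. It is not the attention-assisted network or the inverse (bi-level) learning procedure of the paper. The uniform marginals, the dummy "unused" column that absorbs unselected movie shots, the one-hot toy embeddings, and the names `sinkhorn` and `partial_match` are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.05, n_iters=500):
    """Entropic OT plan between histograms a and b (equal total mass)."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # rescale to fit column marginals
        u = a / (K @ v)                  # rescale to fit row marginals
    return u[:, None] * K * v[None, :]

def partial_match(cost, eps=0.05, n_iters=500):
    """Match every music shot to a movie shot; surplus movie shots stay unused.

    Assumption: each movie shot has capacity 1/n_music, so it can serve at
    most one music shot. The surplus capacity is routed to a zero-cost dummy
    column, which turns the partial problem into a balanced one.
    """
    n_movie, n_music = cost.shape
    assert n_movie > n_music, "expects more movie shots than music shots"
    a = np.full(n_movie, 1.0 / n_music)  # capacity of each movie shot
    b = np.full(n_music, 1.0 / n_music)  # each music shot must be covered
    slack = a.sum() - b.sum()            # mass sent to the dummy column
    cost_ext = np.hstack([cost, np.zeros((n_movie, 1))])
    plan_ext = sinkhorn(cost_ext, a, np.append(b, slack), eps, n_iters)
    plan = plan_ext[:, :n_music]         # drop the dummy column
    return plan, plan.argmax(axis=0)     # movie shot chosen per music shot

# Toy data (hypothetical): 6 movie shots with one-hot embeddings, and 3 music
# shots whose embeddings coincide with movie shots 4, 1 and 3, in that order.
movie = np.eye(6)
music = movie[[4, 1, 3]]
cost = ((movie[:, None, :] - music[None, :, :]) ** 2).sum(axis=-1)

plan, order = partial_match(cost)
print(order)  # [4 1 3]
```

Reading the plan column by column gives both the selection (which movie shots are used) and the order (the music timeline they follow). In the paper's setting the cost would come from learned latent representations rather than fixed toy vectors.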

Pipeline

[Figure: overview of the proposed pipeline]

Dataset

We construct a new, publicly available comprehensive movie-trailer dataset (CMTD) for movie trailer generation and future video understanding tasks, and we train and evaluate various trailer generators on it. The CMTD dataset can be downloaded here: [CMTD Google Drive]. We also provide a music video dataset (MV) for the pre-training process, which can be downloaded here: [MV Google Drive]. The MV videos are a subset of the [SymMV dataset].


Visualization

[Figures: visualization results]

Gallery:  Trailer Comparison

Trailer Index:   (1) Official trailer  (2) Ours  (3) PPBVAM  (4) V2T  (5) M2T


Movie:  300: Rise of an Empire


Movie:  The Hobbit 2


More baseline trailer videos can be downloaded from [Google Drive].

BibTeX

@inproceedings{wang2024inverse,
  title={An Inverse Partial Optimal Transport Framework for Music-guided Movie Trailer Generation},
  author={Wang, Yutong and Zhu, Sidan and Xu, Hongteng and Luo, Dixin},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={9739--9748},
  year={2024}
}