| Publication Type | honors thesis |
| School or College | School of Computing |
| Department | Computer Science |
| Faculty Mentor | Thomas C. Henderson |
| Creator | de Freitas, Christopher |
| Title | Montage: injecting textual context into video editing |
| Date | 2021 |
| Description | Montage is a video editing plug-in designed to save an editor time and streamline their workflow in Adobe Premiere Pro. Footage dragged into the Premiere timeline is securely and automatically uploaded to the Montage server, and the audio is then analyzed to produce transcripts, topics, sentiments, and speaker spans. This added context is presented to users in the Montage Storyboard and Spotlight panels. The Storyboard displays the transcripts of users' clips and is synchronized with the timeline, so changes to transcript order in the Storyboard are reflected in the clips themselves. The Spotlight houses Montage's search functionality, allowing users to look for clips by the words contained within them rather than sifting through file names. Together, the Montage extension and the Montage server allow users to peek into their footage without having to re-watch it themselves, saving time and energy. |
| Type | Text |
| Publisher | University of Utah |
| Subject | video editing automation; transcript-based editing; workflow optimization |
| Language | eng |
| Rights Management | (c) Christopher de Freitas |
| Format Medium | application/pdf |
| ARK | ark:/87278/s6yfysfx |
| Setname | ir_htoa |
| ID | 2930286 |
| OCR Text | ABSTRACT

Montage is a video editing plug-in designed to save an editor time and streamline their workflow in Adobe Premiere Pro. Footage dragged into the Premiere timeline is securely and automatically uploaded to the Montage server, and the audio is then analyzed to produce transcripts, topics, sentiments, and speaker spans. This added context is presented to users in the Montage Storyboard and Spotlight panels. The Storyboard displays the transcripts of users' clips and is synchronized with the timeline, so changes to transcript order in the Storyboard are reflected in the clips themselves. The Spotlight houses Montage's search functionality, allowing users to look for clips by the words contained within them rather than sifting through file names. Together, the Montage extension and the Montage server allow users to peek into their footage without having to re-watch it themselves, saving time and energy.

TABLE OF CONTENTS
ABSTRACT
INTRODUCTION
MOTIVATION
METHODS
LIMITATIONS
CONCLUSION
REFERENCES

INTRODUCTION

Video editors often work with huge libraries of footage to create memorable and entertaining experiences. Any movie theatre blockbuster with a 90-minute runtime started out as several thousand hours of chaotic, disorganized footage spread over a dozen hard drives. Cutting and labeling a video or movie from those clips is a monumental task. For editors, the biggest time sink comes from two problems: editors don't know what's in a clip until they watch it, and editors have limited human memory. Editors are left to work around these problems by spending hours watching and re-watching footage. See Figure 1 for a look at Premiere's timeline. Each bar is a clip, but the only information editors have about what's inside each clip is the file name. To order these clips correctly, an editor would have to watch each piece of footage multiple times. These massive video reviews take away valuable time editors could otherwise be using to edit the footage they have. Given the complex, tedious, and often inefficient nature of video review and editing, we present a simple and elegant solution: Montage.

Figure 1 - Adobe Premiere Pro, no Montage

Montage is an Adobe Premiere extension created to streamline the editing process in two important ways. First, Montage incorporates speech-to-text and information retrieval methods to instantly find footage corresponding to sentences, individuals, or topics, captioning and tagging each second of footage. Montage provides all the tools needed to know the footage inside and out and, in turn, makes for better editing and happier editors. However, all this rich and concise information would go to waste without Montage's second unique feature: the integrated timeline video editor. The Montage extension allows users to rearrange, cut, and delete parts of the transcript and have those changes reflected in the Premiere timeline, letting an editor edit footage without ever touching the video. On top of that, Montage's quick clip preview allows editors to drag and drop various clips of footage directly onto their timeline. Montage's accessible interface and powerful features make it a valuable resource for editors working on any project.

MOTIVATION

The Users

Video editors with enormous amounts of footage can have a hard time parsing through their catalogs when compiling a project. This is true in cases ranging from YouTubers with dozens of hours of footage to edit by themselves, to professional teams of editors who work together to develop comprehensive corporate projects. Montage is designed to serve that full range of work.

Their Problem

Hours of footage, especially footage with spoken audio, can be a nightmare to sift through and align. Even footage as simple as B-roll can be frustrating to search through to find needed clips. With footage that has spoken audio in it, the problem is exacerbated: thumbnails can't effectively show audio, so editors can't make an informed decision on whether and how to use a clip without watching it. Editors have human brains with limited memory, so in projects with large amounts of raw footage, they often find themselves re-watching individual clips multiple times to determine whether they can be used. A task like this is inefficient and frustrating, and prevents editors from maximizing their potential on a project.

Our Solution

We approached this problem from two directions. First, editors need to be able to know what's in a clip without re-watching it multiple times. Creating a product that reduces the need for editors to re-watch dozens of hours of footage can cut back significantly on the time an editor spends editing and, in turn, improve project productivity. Second, editors need to be able to leverage whatever added context we provide to make editing faster. By providing a clean, intuitive, and simple way of cutting and recutting footage together by transcript text rather than by clip, Montage allows editors to focus solely on the story they want to tell, rather than the tedious intricacies of editing their footage manually.

Figure 2 - Simple View of Montage's data transfer scheme

Figure 3 - Montage Server Internal Pipeline

Feature Summary

Montage consists of two panels: the Storyboard and the Spotlight. Montage automatically sends footage dropped in the timeline to the Montage server, as shown in Figure 2. Once it reaches the server, the footage is processed through the pipeline outlined in Figure 3. See the Methods section for more details on how the server generates its response. On response, the Storyboard, shown in Figure 4, receives and displays transcripts for each clip. Those transcripts can be dragged and dropped in any order, and Montage reflects those changes on the Premiere timeline. The Storyboard gives editors another angle, allowing them to edit by text rather than exclusively by footage.

Figure 4 - Montage Storyboard

Figure 5 - Montage Spotlight

The Spotlight, shown in Figure 5, was designed to tell editors exactly where footage exists based on simple queries over the words spoken in the clips. For example, if an editor is looking for all the clips where people are arguing about politics, topic modeling combined with sentiment analysis can effectively segment and locate the clips where politics-related words were spoken with aggressive or negative sentiment.
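To make that kind of query concrete, here is a small, hypothetical sketch (not the Spotlight's actual implementation) of filtering clips by a topic word and a sentiment label. It assumes the server has already attached topic words and sentiment-labeled transcript spans to each clip; the data shapes and names below are invented for illustration.

```python
# Hypothetical illustration of a Spotlight-style query over server output.
# The Clip shape below is invented for this sketch, not Montage's real schema.
from dataclasses import dataclass

@dataclass
class Clip:
    name: str
    topics: list[str]              # topic words assigned to the clip
    spans: list[tuple[str, str]]   # (transcript text, "positive"/"negative"/"neutral")

def find_clips(clips: list[Clip], topic_word: str, sentiment: str = "negative") -> list[str]:
    """Return names of clips tagged with topic_word that contain a span
    carrying the requested sentiment, e.g. a heated discussion of politics."""
    hits = []
    for clip in clips:
        if topic_word in clip.topics and any(label == sentiment for _, label in clip.spans):
            hits.append(clip.name)
    return hits

# Example: find clips where people argue about politics.
# find_clips(all_clips, "politics", "negative")
```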
The overall design goal was to develop a product that is powerful enough for professional use but intuitive and simple enough to allow people with very little editing experience to effectively cut footage together. This would allow anyone to follow a script and edit a video against that script using Montage. Montage is, at its core, software intended to maximize the time editors spend editing in Adobe Premiere and minimize the time they spend looking for information already sitting in the timeline.

My Role

I was responsible for the development of Montage's server (de Freitas, 2021). As such, I'll focus mainly on the work done behind the scenes to power the Storyboard and Spotlight UI elements. The work done on the server includes the server's API, web interface, audio extraction and conversion, database caching, topic modeling, sentiment analysis, and speaker diarization. The server itself was written in .NET Core; the text and audio analysis was done in Python and C++. The server was hosted on an AWS T2 medium EC2 instance, and our website was registered at www.montage.tk.

METHODS

Montage's Spotlight and Storyboard panels are powered by a variety of information extraction (IE) techniques. After footage is sent to the server, the pipeline outlined in Figure 3 begins. Each individual step is described below.

Audio Extraction and Standardization

Audio is extracted from the video, upsampled or downsampled to an 8 kHz sample rate with a 16 kbps bit rate, then converted to a WAV file, all using FFmpeg. We found this standardization greatly improved transcription speed and consistency between different footage, with a negligible impact on accuracy. Because each standard audio format requires a different codec for conversion, Montage will only accept audio files in WAV, MP3, M4V, AAC, FLAC, or Ogg formats, and will only accept video files in MP4, MOV, WMV, WebM, F4V, AVI, AMV, or MTS formats. These formats should cover the vast majority of use cases, and include all the most common (along with some uncommon) audio and video formats.
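As a rough sketch of this standardization step, the call below assumes FFmpeg is available on the system path and writes a mono, 8 kHz, 16-bit PCM WAV file. The exact flags and bit-rate handling used by the Montage server aren't documented here, so these arguments are illustrative.

```python
# Illustrative FFmpeg invocation for audio extraction and standardization.
import subprocess
from pathlib import Path

def standardize_audio(video_path: str, out_dir: str = ".") -> Path:
    """Extract the audio track from a video and convert it to 8 kHz mono WAV via FFmpeg."""
    out_path = Path(out_dir) / (Path(video_path).stem + ".wav")
    subprocess.run(
        [
            "ffmpeg",
            "-y",                    # overwrite an existing output file
            "-i", video_path,        # input footage
            "-vn",                   # drop the video stream
            "-ac", "1",              # mono
            "-ar", "8000",           # resample to 8 kHz
            "-acodec", "pcm_s16le",  # 16-bit PCM WAV
            str(out_path),
        ],
        check=True,
    )
    return out_path
```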
Speech to Text

For speech to text, we initially used the CMU Sphinx model developed by Carnegie Mellon University (Lamere et al., 2003), but it proved too computationally expensive on the fairly cheap AWS machine we were running our server on. To offset this cost, we outsourced the speech to text procedure to Azure instead. Azure offers a speech to text API as part of its Cognitive Services suite. Azure's speech to text model was significantly more accurate than our own, and although an API call introduces network latency, profiling indicated the trade-off on overall response time was in Azure's favor.
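The thesis doesn't specify which Azure interface was used; the sketch below shows one plausible way to call Azure's speech-to-text service from Python with the Cognitive Services Speech SDK. The key, region, and file path are placeholders, and a production pipeline would use the SDK's continuous-recognition mode for full-length clips.

```python
# Hypothetical transcription call using Azure's Speech SDK
# (pip install azure-cognitiveservices-speech). The key, region, and filename
# are placeholders, not values from the Montage deployment.
import azure.cognitiveservices.speech as speechsdk

def transcribe_first_utterance(wav_path: str, key: str, region: str) -> str:
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                            audio_config=audio_config)
    # recognize_once() returns only the first recognized utterance; a full clip
    # would need the SDK's continuous-recognition mode instead.
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return ""
```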
Topic Modeling

Topic modeling is a machine learning technique for discovering hidden semantic structure within text. It clusters words which co-occur in similar contexts into groups called "topics." It can then be inferred which documents (in our case, short transcript segments) correspond most strongly to which topic. The technique is considered "unsupervised" because there is no ground truth or statistical prior presented beforehand; we simply observe which words appear together or are used in similar contexts, and cluster those words accordingly. For example, if the phrases "bucket of water" and "bucket of sand" appear often in a text, "bucket," "sand," and "water" may appear in the same topic. Further extending this notion of "context surrounding a word" can give us complex topic clusters and make searching for information far easier. State-of-the-art topic modeling typically uses variants of the Latent Dirichlet Allocation (LDA) model described in Blei et al. (2003). The particular model I used was the Biterm Topic Model (BTM), a derivative of LDA, as it was developed specifically with short texts, like tweets, in mind (Yan et al., 2013). See Figure 6 for the plate diagrams showing the difference between BTM and prior methods. Both the mixture-of-unigrams model and LDA model documents as collections of words, which can introduce a lot of sparsity when each document is short (as is often the case when editing many clips together). To alleviate that concern, BTM models the words directly, without a document substructure. Instead, unordered pairs of words which co-occur in short contexts are grouped into biterms, which makes each corpus a dense collection of biterms rather than a sparse collection of short documents.

Figure 6 - Topic Model Diagrams: (a) LDA, (b) mixture of unigrams, and (c) BTM

Sentiment Analysis

Sentiment analysis is a method for assigning human emotionality to words or phrases in text. Humans express their emotions using a mixture of body language, intonation, and an understanding of context and rapport with each other. The text on its own won't be enough to make any guaranteed inferences about the sentiment of a conversation, but with enough data we can certainly try. VADER (Hutto & Gilbert, 2014) is a classifier able to predict whether a span of text is generally positive or negative. It was trained from a large sentiment dictionary built using responses from questionnaires like the one in Figure 7. Montage took advantage of VADER by predicting the sentiment of each word in the generated transcript, and returned the results to the user in labeled spans as shown in Figure 8.

Figure 7 - VADER Sentiment Analysis Questionnaire

Figure 8 - Sentiment Analysis Diagram
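A minimal sketch of VADER scoring with the vaderSentiment package follows. Montage labels individual words and spans; this simplified version scores whole transcript segments and buckets them with the conventional compound-score thresholds.

```python
# Sketch of VADER scoring (pip install vaderSentiment). Montage's per-word span
# labeling is approximated here by scoring whole transcript segments.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def label_segments(segments):
    """Attach a coarse sentiment label to each transcript segment."""
    labeled = []
    for text in segments:
        compound = analyzer.polarity_scores(text)["compound"]
        if compound >= 0.05:
            label = "positive"
        elif compound <= -0.05:
            label = "negative"
        else:
            label = "neutral"
        labeled.append((text, label))
    return labeled
```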
Speaker Diarization

Speaker diarization is an audio clustering method used to distinguish spans of audio spoken by different speakers. We implemented a slightly simplified version of the speaker segmentation algorithm in Kotti et al. (2008). Audio features are extracted, then clustered. Because the number of speakers, and thus the number of expected clusters, is unknown for each clip, we chose agglomerative clustering over a faster partition-based clustering method like k-means. Hierarchical clustering strategies can infer the number of clusters after clustering is complete, but they are also much slower, typically O(n^2) versus k-means' O(n) complexity. We could have added a UI element asking the user how many speakers are in each clip, which would have enabled a faster clustering strategy, but our design goal was to make Montage as hands-off as possible, so we felt the trade-off was worthwhile. After clustering is complete, time spans can be reconstructed from each feature vector, since the bit rate and sample rate are always the same after conversion. Each time span is labeled with an ordinal number to indicate which speaker is speaking during which periods, as shown in Figure 9. The intended use case for diarization in this context is footage like interviews, where extremely long clips might need to be cut at every speaker change before editing can even begin. Rather than slowly scrubbing through the footage to find the speaker shifts, Montage finds and suggests those places, limiting the editor's work to cutting a few seconds of footage rather than a few hours.

Figure 9 - Speaker Diarization Diagram
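The thesis doesn't name the feature set or clustering library, so the sketch below is one plausible reading of this step: MFCC features averaged over one-second windows, grouped with scikit-learn's agglomerative clustering using a distance threshold so the number of speakers never has to be specified. The window length and threshold are illustrative guesses, not Montage's actual parameters.

```python
# Rough sketch of speaker clustering (not the exact Montage implementation).
# Requires: pip install librosa scikit-learn numpy
import numpy as np
import librosa
from sklearn.cluster import AgglomerativeClustering

def diarize(wav_path: str, window_s: float = 1.0, threshold: float = 50.0):
    y, sr = librosa.load(wav_path, sr=8000)  # matches Montage's 8 kHz standard
    hop = int(sr * window_s)
    windows = [y[i:i + hop] for i in range(0, len(y) - hop, hop)]
    if not windows:
        return []
    # One feature vector per window: the mean of 13 MFCCs across that window.
    feats = np.array([librosa.feature.mfcc(y=w, sr=sr, n_mfcc=13).mean(axis=1)
                      for w in windows])
    # n_clusters=None with a distance threshold lets the hierarchy decide how
    # many speakers there are, mirroring the trade-off described above.
    labels = AgglomerativeClustering(n_clusters=None,
                                     distance_threshold=threshold).fit_predict(feats)
    # Each window maps back to a time span; adjacent windows sharing a label
    # form one speaker span.
    return [(i * window_s, (i + 1) * window_s, int(lab)) for i, lab in enumerate(labels)]
```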
Data Caching

In an effort to reduce computation time, the Montage server stores a copy of the transcript along with the generated topics, sentiment vector, and speaker spans in a MySQL database indexed by a hash of the raw footage file itself. The server does not store the footage, audio, or any of the file's metadata. When the exact same footage file is sent back up to Montage, the server returns the previously completed analysis results. Because the Montage extension has no access to a Premiere user's file system, it can't save any of those results locally, so this caching prevents Montage from having to re-analyze every clip in a project each time the project is opened and closed.

LIMITATIONS

The Montage server is a powerful audio processing tool with an open API, enabling users anywhere to gather insight into clips sent to it. When paired with the Montage extension for Adobe Premiere Pro, it becomes a powerful editing tool, saving editors time and providing a text-based editing experience in addition to the traditional footage-based one. The Montage front end was unfortunately hampered by the Premiere API, which has some well-known undefined and inconsistent behaviors. This resulted in many of the server features lacking a matching front-end component, such as the sentiment analysis and speaker diarization. Those features are, however, complete, and our open API enables power users to get that information with a single API call. Additionally, logged-in users have access to all the information we cache about their audio on our website, montage.tk, including the sentiment analysis and speaker diarization results. This workaround ended up inspiring an additional feature on our website, empowering users to control their own data: users are provided the means to edit or delete any of the information stored in our database under their user ID. Other limitations are partially due to our lack of funds as students. We put a lot of work into profiling and optimizing the communication between the Montage server and extension, but there was only so much we could do with 4 GB of RAM on a T2 AWS machine. The server has a processing queue to prevent requests from being lost, and it handles clips between five and ten minutes fairly easily, but anything longer slows dramatically. Although there are certainly more stones to turn over with respect to optimization, more funding could enable Montage to work well with footage up to an hour long.

CONCLUSION

Applications designed to save time should be fast and invisible. That was the design philosophy we started Montage with, and in the end, we succeeded in making an extension that can save editors time. Our vision hasn't been fully satisfied yet, though. Future development on Montage would include many additions to the UI, providing editors with richer information not just about what was said, but about the sentiments at play and the speakers involved. One feature that was dropped due to computational constraints was object detection, where video frames would be sampled and scanned for common objects, providing visual context on top of textual context for each clip. Image processing turned out to be prohibitively slow in the environment we had set up, but with more funding and time it would be an excellent addition. Developing Montage was a formative and enriching experience, and I look forward to applying the skills I learned in the future.

REFERENCES

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. The Journal of Machine Learning Research, 3, 993–1022.

de Freitas, C. (2021). Montage Server [Source code]. github.com/christopherdef/montage-server

Hutto, C. J., & Gilbert, E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. The International AAAI Conference on Web and Social Media (ICWSM), 10.

Kotti, M., Moschou, V., & Kotropoulos, C. (2008). Speaker segmentation and clustering. Signal Processing, 88(5), 1091–1124. https://doi.org/10.1016/j.sigpro.2007.11.017

Lamere, P., Kwok, P., Gouvêa, E., Raj, B., Singh, R., Walker, W., Warmuth, M., & Wolf, P. (2003). The CMU SPHINX-4 Speech Recognition System. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1, 2–5.

Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A Biterm Topic Model for Short Texts. Proceedings of the 22nd International Conference on World Wide Web, 1445–1456. https://doi.org/10.1145/2488388.2488514

Name of Candidate: Christopher de Freitas
Birth date: November 24, 1997
Birth place: Salt Lake City, Utah
Address: 6210 Steeple Chase Lane, Salt Lake City, Utah, 84121 |
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6yfysfx |



