Utilizing Large Language Models for Semantic Search and Summarization of International Television News Archives

Abstract

Among many different media types, the Internet Archive also preserves television news from various international TV channels in many different languages. The GDELT Project leverages Google Cloud services to transcribe and translate these archived TV news collections, making them more accessible. However, the amount of transcribed and translated text produced daily can be overwhelming for human consumption in its raw form. In this work, we leverage Large Language Models (LLMs) to summarize daily news and to facilitate semantic search and question answering against the longitudinal index of the TV news archive. The end-to-end pipeline includes TV stream archiving, audio extraction, transcription, translation, chunking, vectorization, clustering, sampling, summarization, and representation. Translated transcripts are split into smaller chunks of about 30 seconds (a tunable parameter), on the assumption that this duration is neither so long that a chunk spans multiple concepts nor so short that it captures only part of a concept discussed on TV. These chunks are treated as independent documents for which vector representations (document embeddings) are created. These vectors are clustered using algorithms such as k-means or DBSCAN to identify pieces of transcripts throughout the day that repeat similar concepts. The chunk closest to the centroid of each cluster is selected as the representative sample for its topic. GPT models are leveraged to summarize each sample: we have crafted a prompt that instructs the GPT model to synthesize the most prominent headlines, their descriptions, various types of classifications, and keywords/entities from the provided transcripts. We also classify clusters to identify whether they represent ads or local news that might not be of interest to an international audience. After excluding such clusters, an interactive summary of each headline is rendered in a web application.
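As a minimal sketch of the clustering and sampling steps described above, the following pure-Python code clusters toy chunk embeddings with a DBSCAN-style algorithm over cosine distance and picks the member nearest the cluster mean as the representative sample. The function names, thresholds, and two-dimensional vectors are illustrative assumptions, not the project's actual implementation (which would use model-generated embeddings and a library such as scikit-learn).

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def dbscan(vectors, eps=0.1, min_pts=2):
    """Minimal DBSCAN: label each vector with a cluster ID, or -1 for noise."""
    n = len(vectors)
    labels = [None] * n
    cluster_id = 0

    def neighbors(i):
        return [j for j in range(n)
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may later be absorbed as a border point)
            continue
        labels[i] = cluster_id
        queue = [j for j in nbrs if j != i]
        while queue:  # expand the cluster from each density-reachable point
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point: absorb, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            jn = neighbors(j)
            if len(jn) >= min_pts:  # j is a core point, keep expanding
                queue.extend(jn)
        cluster_id += 1
    return labels

def representative(vectors, members):
    """Pick the member closest to the cluster mean as the sample to summarize."""
    dim = len(vectors[0])
    mean = [sum(vectors[m][d] for m in members) / len(members)
            for d in range(dim)]
    return min(members, key=lambda m: cosine_distance(vectors[m], mean))
```

Only the representative chunk of each cluster would then be passed to the GPT summarization prompt, so repeated broadcasts of the same story are summarized once.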
We also maintain metadata for each chunk (video ID and timestamps), which we use in the rendered summary to embed the corresponding small part of the archived video for reference. Furthermore, valuable transcript chunks and their associated metadata are stored in a vector database to facilitate semantic search and LLM-powered Retrieval-Augmented Generation (RAG). We have deployed a test instance of our experiment and open-sourced our implementation (https://github.com/internetarchive/newsum).
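The retrieval side can be sketched similarly. Below, a toy in-memory stand-in for the vector database stores each chunk's embedding alongside its video ID and timestamps, and ranks chunks by cosine similarity to a query vector; the top results would be embedded as video segments or stuffed into an LLM prompt for RAG. `ChunkStore` and its fields are hypothetical names for illustration; a production deployment would use a real vector database and model-generated query embeddings.

```python
import math

class ChunkStore:
    """Toy in-memory vector store: each transcript chunk keeps its embedding
    plus metadata (video ID, start/end seconds) so search results can link
    back to the exact segment of the archived video."""

    def __init__(self):
        self.items = []  # list of (vector, metadata) pairs

    def add(self, vector, metadata):
        self.items.append((vector, metadata))

    def search(self, query, k=3):
        """Return metadata of the k chunks most similar to the query vector."""
        def sim(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = (math.sqrt(sum(x * x for x in a))
                    * math.sqrt(sum(y * y for y in b)))
            return dot / norm

        ranked = sorted(self.items, key=lambda it: sim(query, it[0]),
                        reverse=True)
        return [meta for _, meta in ranked[:k]]
```

For RAG, the returned chunks (with their timestamps) are concatenated into the LLM prompt as grounding context, and the same metadata drives the embedded video player in the web application.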

Details

Creators
Sawood Alam
Date
2024-09-18 11:15:00 +0100
Keywords
information technology for dp; from document to data
Publication Type
lightning talk
License
Creative Commons Attribution Share-Alike 4.0 (CC-BY-SA-4.0)
