I need someone who can assemble a large dataset of podcast-related data. The variables I need should all be publicly available, but collecting them will require someone who can scrape data at scale through APIs or other means. For a large sample of podcasts, I am hoping for the following variables: podcast name, genre, number of episodes, rating, number of streams, number of hosts, host gender, and any other readily available information. I anticipate that the most difficult part of the project will be obtaining a short audio sample of each host's voice, which then needs to be analyzed using OpenSmile in Python.
1. Obtain podcast names and ratings. Write a Python script that can, given the name of a podcast, either pull popularity metrics from Rephonic or scrape usage statistics from Castbox and/or Podcast Addict.
2. Extract podcast metadata and audio. Write a Python script that can, given the name of a podcast, find that podcast via the iTunes API, extract useful podcast metadata fields, and download podcast audio.
3. Isolate the host's voice. Write a Python script that isolates the podcast host's voice without supervision by identifying the speakers within an episode and splitting the audio into distinct segments. This would likely require extracting MFCCs (mel-frequency cepstral coefficients) from each segment, clustering the segments within an episode to find individual speakers, and computing each cluster's centroid as that speaker's "vocal signature". Clustering these signatures across episodes should then show which ones recur most frequently; the cluster(s) containing the majority of data points should correspond to the voice of the host.
4. Using the OpenSmile package in Python, analyze the audio of each host's voice.
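To make step 2 more concrete, here is a rough sketch of what I have in mind, using only the Python standard library. It assumes the public iTunes Search API endpoint (`https://itunes.apple.com/search`) and the response fields it returns for podcasts as far as I know (`collectionName`, `artistName`, `primaryGenreName`, `trackCount`, `feedUrl`). The function names and the output field names are hypothetical, and the final script may well be structured differently. Episode audio is located through the `<enclosure>` tags of the podcast's RSS feed.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint of the public iTunes Search API.
ITUNES_SEARCH_URL = "https://itunes.apple.com/search"


def build_search_url(podcast_name, limit=5):
    """Build a search URL restricted to podcasts."""
    query = urllib.parse.urlencode(
        {"term": podcast_name, "media": "podcast", "entity": "podcast", "limit": limit}
    )
    return f"{ITUNES_SEARCH_URL}?{query}"


def parse_search_results(payload):
    """Keep a few metadata fields from a decoded search response.

    The output keys are illustrative; missing fields come back as None.
    """
    return [
        {
            "name": item.get("collectionName"),
            "publisher": item.get("artistName"),
            "genre": item.get("primaryGenreName"),
            "episode_count": item.get("trackCount"),
            "feed_url": item.get("feedUrl"),
        }
        for item in payload.get("results", [])
    ]


def episode_audio_urls(rss_xml):
    """Return the enclosure URL of every <item> in an RSS feed."""
    root = ET.fromstring(rss_xml)
    urls = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is not None and enclosure.get("url"):
            urls.append(enclosure.get("url"))
    return urls


def fetch(url, timeout=30):
    """Download a URL and return the raw bytes (network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def find_podcast(podcast_name):
    """Search by name and return the candidate podcasts (network call)."""
    return parse_search_results(json.loads(fetch(build_search_url(podcast_name))))
```

In use, `find_podcast("Some Podcast")` would return candidate matches, `episode_audio_urls(fetch(candidate["feed_url"]))` would list the audio files to download, and those files would then feed into steps 3 and 4. Rate limiting, retries, and disambiguating podcasts with similar names are left out of this sketch.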