Dialogue Rate in Movies

Audio highly recommended. Rate of dialogue throughout a selected movie (each point is a 3-minute average of words-per-minute). Extreme points are detected and 1-minute clips are automatically extracted. Use the movie selector above to choose a movie. Click a clip thumbnail to view the scene.

Making it

Analysis and extraction were done using Python and FFMPEG. The data was then exported and plotted using D3.

Key steps:
1. Receive any movie & subtitle file as input.
2. Clean subtitle file and calculate number of words spoken per minute.
3. Find the most extreme maximum/minimum every 20 minutes.
4. Automatically extract relevant 1-minute clip from the movie.
5. Plot everything.


The only input needed here is the path to a folder which contains an .mp4 movie file and an .srt subtitle file.
# Path to folder with .mp4 and .srt files
folder = '~/MoviesDialogue/FightClub'
# Movie title
label = 'Fight Club'
# Plot color
color = 'red'

Clean subtitle file and create word-min DataFrame

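import glob
import json
import os

import numpy as np
import pandas as pd

# glob does not expand '~', so expand it before searching
folder = os.path.expanduser(folder)

# Get path to subtitle and movie file using glob with suffix search
sub_file_path = glob.glob(os.path.join(folder, '*.srt'))[0]
movie_file_path = glob.glob(os.path.join(folder, '*.mp4'))[0]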

# Read subtitle file
# Structure of file is:
# ...
# <TEXT (Possibly multi lined)>
# ...
with open(sub_file_path, encoding='utf-8', errors='ignore') as handle:
    content = handle.readlines()

# Find the indices of the empty rows ( == '\n') using numpy where,
# and split the original array to "blocks" representing each subtitle appearance
# +1 is added to the split indices so that each block starts with the line number and ends with the separator
sub_blocks = np.split(content, np.where(np.array(content) == '\n')[0] + 1)

# Iterate the sub blocks, position 1 is the timing line, position 2 and forward is the content.
# Create an array of dictionaries,
# containing the text (join of all content lines), and the rounded minute (read from timing line)
sub_min_text = [{'min':int(x[1].split(':')[1]) + 60 * int(x[1].split(':')[0]),
               'text':" ".join(x[2:])} for x in sub_blocks if len(x) > 0]

# Build DataFrame
sub_df = pd.DataFrame(sub_min_text)
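
For readers who want to try the parsing step without a real subtitle file, here is a minimal standalone sketch of the same idea. The function name `srt_to_minutes` and the sample subtitle text are made up for illustration, and it assumes the usual .srt layout of a counter line, a timing line, one or more text lines, and a blank separator line.

```python
import re

import pandas as pd


def srt_to_minutes(srt_text):
    """Return a DataFrame with one row per subtitle block: start minute and text."""
    rows = []
    # Blocks are separated by one or more blank lines
    for block in re.split(r'\n\s*\n', srt_text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 3:
            continue  # skip malformed blocks that have no text lines
        # Timing line looks like "00:01:15,200 --> 00:01:17,900";
        # the first two fields are the start hour and start minute
        hours, minutes = lines[1].split(':')[:2]
        rows.append({'min': int(hours) * 60 + int(minutes),
                     'text': ' '.join(lines[2:])})
    return pd.DataFrame(rows)


# Hypothetical three-block subtitle file
sample = """1
00:00:05,000 --> 00:00:07,000
Hello there.

2
00:01:15,200 --> 00:01:17,900
How are
you today?

3
01:02:03,000 --> 01:02:04,000
Fine.
"""

demo_df = srt_to_minutes(sample)
print(demo_df)
```

Unlike the numpy split above, this version skips blocks that have no text instead of failing on them, which may help with messy subtitle files.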

Find the most extreme min/max every 20 minutes

# Calculate total words per subtitle block
sub_df['total_words'] = sub_df['text'].apply(lambda x: len(x.split()))

# Smoothing window and segment length, taken from the description above
# (3-minute average, one extreme point every 20 minutes)
ROLL_MINS = 3
SEGMENT_SIZE = 20

per_min_df = (
    # Group the dataframe based on subtitle minute, and get sum of total_words per group
    sub_df.groupby('min')[['total_words']].sum()
    # Reindex the dataframe for a minute based index (0 to last-min, inclusive)
    # this is used so we'll also have a representation for minutes without any subtitle blocks
    .reindex(np.arange(0, np.max(sub_df['min']) + 1))
    # Minutes without subtitle blocks have no value after the reindex,
    # therefore the next step is to fill those empty indices with 0 total words
    .fillna(0)
    # Roll the dataframe based on ROLL_MINS, making the output less noisy
    # (min_periods=1 is an assumption here; it keeps the first minutes from being NaN)
    .rolling(ROLL_MINS, min_periods=1)
    # Solve the roll using mean value - every minute's "total words" will now be
    # the average total words of the last ROLL_MINS
    .mean()
)

# Use np.roll to measure the backwards change (current - previous) in total words
# and the forwards change in total words (next - current)
change_pre = per_min_df['total_words'] - np.roll(per_min_df['total_words'], 1)
change_post = np.roll(per_min_df['total_words'], -1) - per_min_df['total_words']

# Find maximum and minimum points, define extrema points as either one
maxima = (change_pre > 0) & (change_post < 0)
minima = (change_pre < 0) & (change_post > 0)
extrema_points = (maxima) | (minima)

# Create DF for extrema points, calculate distance of every extrema from the mean words
extrema_df = pd.DataFrame(np.abs(per_min_df.loc[extrema_points, 'total_words'] - per_min_df['total_words'].mean()))

# Chunk into segments, group by the segments and get the index (minute) of the extremum that's farthest from the mean
extrema_df['segment'] = (np.array(extrema_df.index) // SEGMENT_SIZE).astype(int)
# 'idxmax' returns the index label (the minute); np.argmax would return a position inside the group
segmented_extrema = extrema_df.groupby('segment').agg(segment_top_extrema_min=('total_words', 'idxmax'))

# Information for plot, x y for line and x y for each extrema dot
line_x = list(per_min_df.index)
line_y = per_min_df['total_words']
extrema_x = segmented_extrema['segment_top_extrema_min']
extrema_y = per_min_df.loc[segmented_extrema['segment_top_extrema_min'], 'total_words']
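
To see the smoothing and extrema selection in isolation, here is a small sketch on made-up numbers. The word counts, the 6-minute segment size and the variable names are assumptions for the demo only (the post itself uses 20-minute segments), and it uses `diff`/`shift` instead of `np.roll`, so the first and last minute are never flagged as extrema.

```python
import numpy as np
import pandas as pd

ROLL_MINS = 3      # smoothing window, matching the 3-minute average
SEGMENT_SIZE = 6   # toy segment length for this demo only

# Hypothetical words-per-minute counts for a 12-minute "movie"
words = pd.Series([10, 50, 90, 40, 0, 0, 20, 80, 120, 60, 30, 10], dtype=float)

# Trailing rolling mean; min_periods=1 avoids NaN in the first minutes
smooth = words.rolling(ROLL_MINS, min_periods=1).mean()

# A local extremum is a point where the direction of change flips
change_pre = smooth.diff()                 # current - previous
change_post = smooth.shift(-1) - smooth    # next - current
maxima = (change_pre > 0) & (change_post < 0)
minima = (change_pre < 0) & (change_post > 0)
is_extremum = maxima | minima

# Distance of each extremum from the overall mean
distance = (smooth[is_extremum] - smooth.mean()).abs()

# Keep, per segment, the minute whose extremum is farthest from the mean
segments = np.asarray(distance.index) // SEGMENT_SIZE
top_minutes = distance.groupby(segments).idxmax()
print(top_minutes.tolist())
```

In this toy series minutes 3, 6 and 9 are local extrema. Minutes 6 and 9 share a segment, and minute 9 wins because its smoothed value is farther from the mean.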

Extract 1-minute clips, export data for plot

This part assumes FFMPEG is installed on the machine. For an explanation of the command used, see this Super User question on how to clip a video, and this FFMPEG page on re-encoding to reduce file size.

# Create folder if it doesn't exist
plot_data_path = os.path.join('./plot_data', label)
if not os.path.exists(plot_data_path):
    os.makedirs(plot_data_path)

# Iterate the extrema, for each one call FFMPEG with
# 1. second of clip start
# 2. path to mp4 file
# 3. length of clip (1 min)
# 4. path to output file
for peak_min in list(segmented_extrema['segment_top_extrema_min']):
    os.system('ffmpeg -y -ss {} -i "{}" -t {} -c:v libx264 -x264-params "nal-hrd=cbr" -b:v 1M -minrate 1M -maxrate 1M -bufsize 2M "{}.mp4"'
              .format(int(peak_min) * 60,
                      movie_file_path,
                      1 * 60,
                      os.path.join(plot_data_path, str(int(peak_min)))))

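# Dump all data for plotting in client
# NOTE: only 'name' and 'color' appear in this write-up; the remaining key names are assumed
json.dump({'name': label,
            'color': color,
            'line_x': [int(v) for v in line_x], 'line_y': [float(v) for v in line_y],
            'extrema_x': [int(v) for v in extrema_x], 'extrema_y': [float(v) for v in extrema_y]},
          open(os.path.join(plot_data_path, 'out.json'), 'w'))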
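
If movie paths may contain quotes or other shell-special characters, one option is to pass FFMPEG an argument list through `subprocess` instead of a formatted string. The sketch below reuses the flags from the command above. The helper name `build_clip_command` and the example paths are hypothetical, and the actual FFMPEG call is left commented out so the snippet runs without FFMPEG installed.

```python
import os
import subprocess


def build_clip_command(movie_path, start_min, out_dir, clip_secs=60):
    """Build the ffmpeg argument list for one clip starting at start_min."""
    out_path = os.path.join(out_dir, '{}.mp4'.format(int(start_min)))
    return ['ffmpeg', '-y',
            '-ss', str(int(start_min) * 60),   # second of clip start
            '-i', movie_path,                  # path to mp4 file
            '-t', str(clip_secs),              # length of clip in seconds
            '-c:v', 'libx264', '-x264-params', 'nal-hrd=cbr',
            '-b:v', '1M', '-minrate', '1M', '-maxrate', '1M', '-bufsize', '2M',
            out_path]                          # path to output file


# Hypothetical input: a clip starting at minute 42
cmd = build_clip_command('movie.mp4', 42, 'plot_data')
print(' '.join(cmd))

# To actually cut the clip (requires ffmpeg on the PATH):
# subprocess.run(cmd, check=True)
```

With `check=True`, a failed FFMPEG run raises an exception instead of passing silently, which `os.system` would not do.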

Plot everything

At this point the JSON file and the mp4 files are read into the D3 plot. I won’t be diving into the code at this time (sorry!) but everything is readily available on this GitHub repo.

Key steps/conclusions were:
1. As always with D3 (in my opinion), start with an example that’s most similar to your goal. This one did the trick for me.
2. D3 is drawn in SVG. To add non-SVG elements (such as the video containers) you need to wrap them in a <foreignObject> element.
3. I have used several different JS carousel libraries; this time I tried Glide.js and I think it takes the cake.

Happy viewing!