
The Kinetics Dataset Explorer Using GIFs

source link: https://towardsdatascience.com/the-kinetics-dataset-explorer-using-gifs-8ceeebcbdaba?gi=1f2d1bd27da3

Making it easier for humans to sift through the kinetics dataset

Feb 24 · 6 min read


The Kinetics Dataset Explorer is a website containing the Kinetics videos converted to GIF format, making it easier for humans to sift through a large amount of temporal data quickly. The GIFs are generated from frames sampled at a lower frequency than the original video (4 fps vs. 25 fps), and the resulting frame rate is adjusted to 16 fps. This means a 10-second video can be viewed in about 2.5 seconds. The width and height are proportionally scaled down as well to reduce the file size and allow for faster load times.
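
As a quick sanity check on those numbers, the arithmetic works out as follows (a back-of-the-envelope calculation in Python, not code from the explorer itself):

# 10 s of video sampled at 4 fps, then played back at 16 fps
video_duration_s = 10
sample_fps = 4
playback_fps = 16

n_frames = video_duration_s * sample_fps        # 40 frames are kept
gif_duration_s = n_frames / playback_fps        # 40 / 16 = 2.5 seconds
print(n_frames, gif_duration_s)                 # 40 2.5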

The explorer can be found by going to the following link. The rest of this blog post goes into the why (motivation) and the how of the explorer's creation, as well as the assumptions made.

The Problem

The Kinetics dataset is one of the more prominent datasets for human action recognition. The dataset, however, is quite large, containing about 650,000 videos, each with a duration of 10 seconds. This makes quickly sifting through the dataset to understand the context of a category challenging.

This is because videos normally tend to be viewed sequentially rather than in parallel (e.g. YouTube vs. Giphy). So in order to grasp the overall temporal structure of an action category within a video, one has to watch the video in its entirety. Therefore, if a video is 10 seconds long at a standard frame rate of 25 fps, the human eye needs to scan a total of 250 frames to grasp the temporal structure within the video.

The Solution

Videos are simply a sequence of images over time, where the order of the image sequence encodes some form of temporal information, e.g. a person dancing, a person walking, or any form of displacement that our minds interpret as movement. The more images condensed into a single second, the more seamless the video appears and the closer it is to a real-world interpretation, e.g. a 60 fps video appears more seamless than a 25 fps video. This means that videos are a high-density temporal encoding medium, provided the frame rates are high enough.

In order to find a good solution to this problem, one needs to consider the temporal identity of an action category. By this, I mean the minimum time and frame rate needed to ensure, with a high degree of accuracy, that an action of a particular category has occurred. Thus, it can be argued that videos contain a lot of redundancy while encoding temporal information. As such, only a small number of uniformly sampled frames is needed to encode the necessary temporal identity.
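
To make the idea of uniform sampling concrete, here is a minimal sketch (my own illustration, not code from the explorer) of how a small set of uniformly spaced frame indices can be chosen from a clip:

# Keep roughly every (source_fps / target_fps)-th frame of a clip.
def uniform_sample_indices(total_frames, source_fps, target_fps):
    step = max(1, round(source_fps / target_fps))
    return list(range(0, total_frames, step))

# A 10 s clip at 25 fps (250 frames) sampled down to roughly 4 fps
indices = uniform_sample_indices(total_frames=250, source_fps=25, target_fps=4)
print(len(indices))  # 42 frames instead of 250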

To figure out the correct frame rate, we need to consider at what temporal resolution the information becomes incoherent and no longer captures the temporal structure of the action in the videos in a meaningful manner.

The analogous problem in images is pixelation, as shown below.

[Image: a pixelated photo of a tiger. Source: https://miscellaneousdetails.com/tag/mona-chalabi/]

From the above image, we can roughly tell that the subject could be a tiger; however, it is pixelated, i.e. a small number of pixels is used to represent the image. This means the spatial identity of the image is low, as we are not able to clearly tell what it is.

Increasing the pixel count (thus lowering pixelation) will increase the spatial identity of the subject in question, the trade-off being that the image will probably have larger memory requirements. It can be argued that the greater the spatial resolution, the higher the confidence in the spatial identity of the subject. There is, however, a limit beyond which adding more pixels does not increase the spatial identity.
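
The same effect is easy to reproduce in a couple of lines; the sketch below (using Pillow, with a hypothetical file name) simply throws away pixels and scales the result back up:

from PIL import Image

img = Image.open('tiger.jpg')                       # hypothetical input image
small = img.resize((32, 32), Image.NEAREST)         # discard most spatial detail
pixelated = small.resize(img.size, Image.NEAREST)   # scale back up: blocky, low spatial identity
pixelated.save('tiger_pixelated.jpg')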

Applying the above logic to videos, we go from the continuous real-world scenario, which has the highest possible temporal resolution (by the nature of being continuous), to high-quality video encodings, which have a temporal resolution of about 60 fps (to put this in context, the upper limit of the human eye is about 1000 fps). With high-quality videos, a lot of memory is used to store the temporal information, but with a lot of redundancy. As more frames are uniformly removed from the video, some temporal information is lost, up to the point where the video encoding becomes incoherent. This means we are no longer able to tell the action category based on the temporal information, i.e. the temporal identity is indeterminate. (Videos also contain spatial structure, so it is often easy to tell what is going on purely from a single image, e.g. swimming vs. playing soccer. Quite a lot of video-based deep learning algorithms take advantage of this factor and don't really consider the time domain.)

The GIF encoding format is a good candidate to solve this problem as GIFs:

  • are usually at a lower frame rate than videos, which means a lower temporal resolution/density;
  • loop automatically by default, allowing multiple GIFs to play sequentially;
  • have mass adoption in web browsers.

Therefore, converting the Kinetics videos to GIFs allows for a more efficient method of sifting through a large video dataset.

Implementation Details

Implementing the video-to-GIF conversion is quite easy, as shown below.

import subprocess

# video_file, png_file and gif_file are assumed to be defined elsewhere;
# png_file should be an image-sequence pattern such as 'frames/%04d.png'.

# Step 1: uniformly sample the source video at 4 fps, rescale it to a width
# of 180 px (height follows the aspect ratio) and store the frames as PNGs.
subprocess.call([
    'ffmpeg',
    '-i', str(video_file),
    '-filter_complex',
    '[0:v] fps=4,scale={scale}:-1'.format(scale='180'),
    '-f', 'image2',
    '-loglevel', 'warning',
    '-y',
    png_file,
])

# Step 2: reassemble the sampled frames into a GIF played back at 16 fps.
# (-framerate is given before -i so that it applies to the image-sequence input.)
subprocess.call([
    'ffmpeg',
    '-framerate', '16',
    '-i', png_file,
    '-f', 'gif',
    '-loglevel', 'warning',
    '-y',
    str(gif_file),
])

The ffmpeg tool is used to do the conversion, where the first subprocess call uniformly samples the original video at 4 fps and stores the frames as images. (It is technically possible to create a GIF directly in this step; however, it would play at 4 fps and would still take 10 seconds to play straight through.) The filter_complex argument rescales the video to a width of 180 px while maintaining the aspect ratio.

The second subprocess call creates a GIF from the generated PNG files at a frame rate of 16 fps, which brings the total length of the GIF to 2.5 s, a 75% reduction in length.

Disadvantages of GIFs

GIFs, however, have some disadvantages, one of them being that they can at times have higher resource requirements than videos (both CPU and storage). It has been suggested that it is better to convert GIFs into videos because of this problem. There are other encodings, such as WebP, which are more efficient than both; however, they are not as widely adopted as the GIF format.

The main concept, however, remains the same: reduce the temporal resolution by uniformly extracting frames from the source video, then combine the extracted frames with a new frame rate to reduce the duration relative to the source video.
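
For illustration, the two ffmpeg calls from the implementation section can be wrapped into one parameterised helper. This is only a sketch; the function name and its sample_fps, playback_fps and width parameters are my own, not part of the original code:

import subprocess

def video_to_gif(video_file, png_pattern, gif_file,
                 sample_fps=4, playback_fps=16, width=180):
    """Uniformly sample a video, then reassemble the frames as a shorter GIF."""
    # Step 1: extract frames at the reduced sampling rate and width.
    subprocess.call([
        'ffmpeg', '-i', str(video_file),
        '-filter_complex', '[0:v] fps={0},scale={1}:-1'.format(sample_fps, width),
        '-f', 'image2', '-loglevel', 'warning', '-y',
        png_pattern,                     # e.g. 'frames/%04d.png'
    ])
    # Step 2: play the extracted frames back at the (higher) GIF frame rate.
    subprocess.call([
        'ffmpeg', '-framerate', str(playback_fps),
        '-i', png_pattern,
        '-f', 'gif', '-loglevel', 'warning', '-y',
        str(gif_file),
    ])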

Implications for Deep Learning

Some deep learning papers have mentioned the idea of sub-sampling frames, e.g. the Let's Dance: Learning From Online Dance Videos paper. However, as far as I know, no research has been conducted on the optimal sub-sampling frequency for video data. And since standard video data (e.g. on YouTube) contains a lot of temporal redundancy, an interesting area of research would be at what temporal resolution temporal identity is lost and what factors affect it.

In the meantime, the GIF method can act as a stopgap if you have a large amount of video training data that needs to be subsampled, since it is computationally expensive to train using every frame of a video. Simply extract the frames at uniform intervals, convert them to GIFs and look through your data. If a human can tell what is happening at the reduced temporal resolution, then a deep learning algorithm should, in theory, be able to identify the action sequence. Increase the extraction interval until a human can no longer tell what action is occurring in the frames. At this point, the temporal identity of the video sequence becomes incoherent.
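
One rough way to put this into practice (the loop and file names below are my own assumptions, reusing the hypothetical video_to_gif helper sketched above) is to generate GIFs at progressively coarser sampling rates and keep lowering the rate until the action is no longer recognisable:

# Generate GIFs of the same clip at progressively coarser sampling rates.
for sample_fps in [8, 4, 2, 1]:
    video_to_gif(
        video_file='clip.mp4',                                  # hypothetical input clip
        png_pattern='clip_{0}fps_%04d.png'.format(sample_fps),  # frame pattern per rate
        gif_file='clip_{0}fps.gif'.format(sample_fps),
        sample_fps=sample_fps,
    )
# The coarsest rate at which a human can still name the action is a reasonable
# lower bound for sub-sampling the training data.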

Conclusion

Hopefully, the Kinetics explorer tool will be useful to individuals looking to perform deep learning using the Kinetics dataset.

We have also looked at the reasoning behind converting videos to GIFs and decreasing the temporal resolution. If there are any logical inconsistencies in my reasoning, please point them out in the comment section below. I am always looking to learn and refine my knowledge.
