How to Estimate the Total Number of YouTube Videos

source link: https://blog.gslin.org/archives/2023/12/24/11552/%e4%bc%b0%e7%ae%97-youtube-%e5%bd%b1%e7%89%87%e7%b8%bd%e9%87%8f%e7%9a%84%e6%96%b9%e5%bc%8f/

I saw "How big is YouTube? (ethanzuckerman.com)" on Hacker News Daily; the original article is "How Big is YouTube?".

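This is something of an old question, and the approach is probably one of the simpler ones in statistics. First, the author's end result: "TubeStats".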

The author's method is to look at YouTube's video ID (vid):

Here’s how this works: YouTube URLs look like this: https://www.youtube.com/watch?v=vXPJVwwEmiM

Analysis shows that a vid carries 64 bits of information, and to an engineer this kind of data type looks very much like it is uniformly distributed:

That bit after “watch?v=” is an 11 digit string. The first ten digits can be a-z,A-Z,0-9 and _-. The last digit is special, and can only be one of 16 values. Turns out there are 2^64 possible YouTube addresses, an enormous number: 18.4 quintillion. There are lots of YouTube videos, but not that many. Let’s guess for a moment that there are 1 billion YouTube videos – if you picked URLs at random, you’d only get a valid address roughly once every 18.4 billion tries.
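The structure described in the quote can be sketched in a few lines of Python. This is only an illustration: it assumes the vid is the unpadded base64url encoding of a 64-bit value, which matches the character set and the 16-value last character described above but is not confirmed by the article. The names `random_vid` and `last_char_values` are made up for this example.

```python
import base64
import secrets

# Size of the ID space as described in the quote:
# ten characters with 64 choices each, plus a last character with 16 choices.
ID_SPACE = 64 ** 10 * 16


def random_vid():
    """Return a random 11-character ID.

    Assumption: a vid is the base64url encoding of 8 random bytes
    with the trailing '=' padding removed.
    """
    raw = secrets.token_bytes(8)
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


def last_char_values():
    """Collect every last character that 8 encoded bytes can end with."""
    values = set()
    for last_byte in range(256):
        raw = bytes(7) + bytes([last_byte])
        encoded = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
        values.add(encoded[-1])
    return values


print(random_vid())
print(ID_SPACE == 2 ** 64)
print(sorted(last_char_values()))
# With a guessed 1 billion videos, the expected number of random tries per hit:
print(f"{ID_SPACE / 1e9:.3e}")
```

Under this assumption the last character only carries 4 bits, which is why it is limited to 16 values, and the sample ID from the article (ending in `M`) fits the pattern.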

Then it is just a matter of generating random vids and probing them. This is much like drunk dialing, and it amounts to random sampling:

We refer to this method as “drunk dialing”, as it’s basically as sophisticated as taking swigs from a bottle of bourbon and mashing digits on a telephone, hoping to find a human being to speak to. Jason found a couple of cheats that makes the method roughly 32,000 times as efficient, meaning our “phone call” connects lots more often. Kevin Zheng wrote a whole bunch of scripts to do the dialing, and over the course of several months, we collected more than 10,000 truly random YouTube videos.
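The idea behind this kind of sampling can be shown with a small simulation. This is a toy sketch, not the authors' scripts: it uses a made-up ID space of one million values with 20,000 "valid" IDs instead of real YouTube requests, and the function name `estimate_population` and all of its parameters are hypothetical.

```python
import random


def estimate_population(space_size, true_count, tries, seed=0):
    """Estimate how many IDs are valid by probing uniformly random IDs."""
    rng = random.Random(seed)
    # Toy stand-in for "videos that exist": a random subset of the ID space.
    valid_ids = set(rng.sample(range(space_size), true_count))
    # "Dial" random IDs and count how many of them are valid.
    hits = sum(1 for _ in range(tries) if rng.randrange(space_size) in valid_ids)
    # The hit rate scaled up to the whole space gives the estimate.
    return space_size * hits / tries


estimate = estimate_population(space_size=1_000_000, true_count=20_000, tries=200_000)
print(f"estimated {estimate:,.0f} valid IDs (true value: 20,000)")
```

The estimate is simply the observed hit rate multiplied by the size of the ID space, so it only works if the probed IDs are drawn uniformly at random.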

In addition, a method based on the autocomplete mechanism was proposed back in 2011:

By comparing our results to other ways of generating lists of YouTube videos, we can declare them “plausibly random” if they generate similar results. Fortunately, one method does – it was discovered by Jia Zhou et. al. in 2011, and it’s far more efficient than our naïve method. (You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.) Kevin now polls YouTube using the “dash method” and uses the results to maintain our dashboard at Tubestats.

Their current estimate is about 13B videos, which works out to about 33.63 bits of the ID space being used (2^33.6):

In our case, our drunk dials tried roughly 32k numbers at the same time, and we got a “hit” every 50,000 times or so. Our current estimate for the size of YouTube is 13.325 billion videos – we are now updating this number every few weeks at tubestats.org.

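As for the part about getting one hit per 32768 * 50k tries, that comes to about 30.61 bits, so the two add up to roughly 64 bits (33.63 + 30.61 = 64.24), as expected.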
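The bit arithmetic above can be checked with a short Python snippet. The constants are taken from the quoted passage (13.325 billion videos, roughly 32k numbers per dial, one hit every 50,000 dials or so); the variable names are only for illustration.

```python
import math

ID_SPACE = 2 ** 64            # total number of possible vids
ESTIMATED_VIDEOS = 13.325e9   # the estimate quoted in the article
BATCH_SIZE = 32768            # "roughly 32k numbers at the same time"
DIALS_PER_HIT = 50_000        # "a hit every 50,000 times or so"

# Bits of the ID space "used" by the estimated number of videos.
bits_used = math.log2(ESTIMATED_VIDEOS)
# Bits corresponding to the number of IDs tried per hit.
bits_per_hit = math.log2(BATCH_SIZE * DIALS_PER_HIT)
# Video count implied by the rounded figures alone.
implied_videos = ID_SPACE / (BATCH_SIZE * DIALS_PER_HIT)

print(f"bits used by videos: {bits_used:.2f}")
print(f"bits per hit:        {bits_per_hit:.2f}")
print(f"sum:                 {bits_used + bits_per_hit:.2f}")
print(f"implied video count: {implied_videos:.3e}")
```

Plugging in the rounded 50,000 figure gives roughly 11.3 billion videos rather than 13.325 billion. The published estimate corresponds to a hit about every 42,000 dials, which is still consistent with "every 50,000 times or so".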

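Note, however, that they do not give an interval, so the true figure could be off from 13B by a factor of two or so in either direction (something like 6.5B to 26B). It is better to treat this number as a rough idea...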


Author: Gea-Suan Lin. Posted on December 24, 2023. Categories: Computer, Murmuring, Network, Service. Tags: big, dialing, drunk, platform, random, sampling, size, video, youtube.
