
I found URLs to ~400,000 normally inaccessible video newscasts published by every major CBS affiliate TV station, hoping someone finds it useful

Hello!

**tl;dr:** Viewable at https://romanport.com/p/cbsindex/viewer.html. __This has not downloaded any video, only URLs to videos.__

A few days ago, out of curiosity, I looked at the API used on the website for one of my local TV stations, [WCCO](https://minnesota.cbslocal.com/video/category/news/). Normally, videos on that website disappear once they’re pushed past the 10th page (~2 weeks). However, I noticed that video IDs were stored (kind of) sequentially. I also found out that the server that handles the metadata for these IDs has no rate limit. You know where this is going…

I set up a program real quick to search through all video IDs and save valid IDs along with their metadata (in the .smil format) and HTTP headers. I was able to search these video IDs surprisingly quickly, at about 150/second. What I found was much more than I expected.

In total, I found metadata for **1,143,894** videos going back to 2015, published by **every major CBS affiliate TV station in the US**; I'd estimate **400,000** of these still have the video files accessible. It seems that the video files are removed from the server about two years after they're published, but the metadata isn't removed.

Obviously, I can’t download 400,000 videos. That’s what this subreddit is for though. I’m hoping someone will find this index useful. I think that having these clips stored safely in a public archive would be beneficial to archiving history, but I don’t have the disk space or the internet connection to do so.

## The index

I’ve built a simple web viewer to browse videos that are likely still watchable. You can access it below; just keep in mind that it downloads a 70 MB JSON index file. https://romanport.com/p/cbsindex/viewer.html

* I’ve also uploaded the raw index in two formats. One is a smaller, human-readable .txt file that lists the ID, URL, and (HTTP header) timestamp for each video. It only has one quality level and is really only useful for browsing. [Download (59 MB gzipped, 231 MB ungzipped)](https://romanport.com/p/cbsindex/output.txt.gz)

* The other is a more advanced binary file containing the raw data I downloaded. The data is stored in the following custom binary format, repeated until EOF, then gzipped. The SMIL content is the metadata downloaded from the server; it contains 3-4 URLs to various bitrates/quality levels. [Download (231 MB gzipped, 1,705 MB ungzipped)](https://romanport.com/p/cbsindex/output.bin.gz)

Binary format (old Reddit seems to break this table if it follows bullet points):

```
Name        | Size (bytes) | Offset | Info
============|==============|========|=========================================
Magic       | 4            | 0      | "DATA" in ASCII
ID          | 4            | 4      | The original ID it was requested with
Header Len  | 2            | 8      | Length of the saved HTTP headers in-file
Content Len | 2            | 10     | Length of the SMIL data
Headers     | ?            | 12     | Raw HTTP headers from the request
Content     | ?            | ?      | Raw SMIL file (XML)
```
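The record layout above can be walked with a short Python sketch. This is my reading of the format, not an official parser: it assumes the ID and length fields are unsigned little-endian integers (a commenter below confirms little-endian) and that records are packed back to back inside the gzip stream.

```python
import gzip
import struct

def read_records(path):
    """Iterate over (id, headers, smil) records in the gzipped index file."""
    with gzip.open(path, "rb") as f:
        while True:
            fixed = f.read(12)  # magic (4) + ID (4) + header len (2) + content len (2)
            if len(fixed) < 12:
                break  # clean EOF
            # "<" = little-endian; 4s = magic, I = ID, H/H = the two length fields.
            magic, vid, hdr_len, content_len = struct.unpack("<4sIHH", fixed)
            if magic != b"DATA":
                raise ValueError("lost sync: bad magic %r" % magic)
            headers = f.read(hdr_len)    # raw HTTP headers from the request
            smil = f.read(content_len)   # raw SMIL file (XML)
            yield vid, headers, smil

# Example: print the ID of the first record.
# for vid, headers, smil in read_records("output.bin.gz"):
#     print(vid)
#     break
```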

Unfortunately, the date of these files is a bit hard to pin down. The HTTP “Last-Modified” header contains a date, and so does the URL path, but the two often conflict. I’ve also found files with dates in the future and dates close to the Unix epoch. There’s likely another API that could be queried for this information, though.

It appears that the video files are automatically removed from the server about two years after they’re published, so it’d be most important to download those first. This is just a guess after browsing through the files.

## How the index was built

While looking through the API requests made by WCCO’s website, I discovered that I could get metadata for a video ID at the URL “http://cbslocal-download.storage.googleapis.com/anv-videos/variant/<ID>.smil”. I also noticed that IDs are (kind of) sequential. What I mean by “kind of” is that there are gaps between valid IDs.

I took the ID of the latest video I could find, “5,540,889”, and just started counting backwards. The program I wrote could check about 150 IDs/second, so I let it run overnight. When I next checked on it, it was at ID “891,880” and had stopped finding valid IDs, so I stopped it there.
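The countdown scan can be sketched in Python. This is a serial sketch of the idea, not the author's program: the endpoint URL is the one given above, and reaching 150 checks/second would need concurrent workers (e.g. a thread pool) rather than this simple loop.

```python
import urllib.error
import urllib.request

# Metadata endpoint observed on WCCO's website (see above).
BASE = "http://cbslocal-download.storage.googleapis.com/anv-videos/variant/%d.smil"

def url_for(video_id):
    """Build the .smil metadata URL for a numeric video ID."""
    return BASE % video_id

def probe(video_id, timeout=10):
    """Return (headers, body) if the ID is valid, or None on an HTTP error."""
    try:
        with urllib.request.urlopen(url_for(video_id), timeout=timeout) as resp:
            return dict(resp.headers), resp.read()
    except urllib.error.HTTPError:
        return None  # e.g. 404 for a gap between valid IDs

def scan(start_id, stop_id):
    """Count down from start_id toward stop_id, yielding hits as they're found."""
    for vid in range(start_id, stop_id, -1):
        hit = probe(vid)
        if hit is not None:
            yield vid, hit[0], hit[1]
```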

## Downloading content

As I said, this is just an index. I have not actually saved any videos. I just have URLs to videos.
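Turning one of those SMIL metadata files into downloadable video URLs is straightforward if the files follow the common SMIL layout, where each rendition is a `<video src="..."/>` element. That layout is an assumption here; the real CBS files may nest or name things differently.

```python
import xml.etree.ElementTree as ET

def video_urls(smil_bytes):
    """Extract candidate video URLs from a SMIL document.

    Assumes the common SMIL pattern of <video src="..."/> elements,
    one per bitrate/quality level, possibly under an XML namespace.
    """
    root = ET.fromstring(smil_bytes)
    urls = []
    for el in root.iter():
        # Strip any "{namespace}" prefix from the tag name before comparing.
        tag = el.tag.rsplit("}", 1)[-1]
        if tag == "video" and "src" in el.attrib:
            urls.append(el.attrib["src"])
    return urls
```

Feeding each record's SMIL content through this would give the per-quality URLs to pass to a downloader.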

I threw together a WinForms downloader that’ll download videos from a given region/station between two dates. I wrote it in 25 minutes, so it might have bugs. Windows build [here](https://romanport.com/p/cbsindex/CbsDownloaderBin.zip), C# source [here](https://romanport.com/p/cbsindex/CbsDownloaderSrc.zip). When you first run it, it’ll download the 231 MB index file.


15 Comments

  1. This post is why I subscribe to this sub. I will learn and eventually join the club

  2. For anyone else parsing the binary file: the integers are stored in little endian.

  3. thanks for not only posting great stuff, but also knowing that old reddit breaks charts after bullets and preventing that in your post. truly a hero we do not deserve.

  4. How much storage would it use if you downloaded all of the videos?

  5. I ‘member a hunt for a prolific Oprah show interview with Trump. Was that found and is it possible that it will be in this archive?

  6. On a similar note, I found a Google Drive somewhere with (apparently) whole (or large) collections of the classic Mexican comedy shows [El Chapulín Colorado](https://drive.google.com/drive/folders/1MqYxgKa-cav7KgDi_7Agr4GfO6YdRuZA) and [El Chavo del 8](https://drive.google.com/drive/folders/15UwIEiSEBf7RGHMshtc8kajF0dNit3uL), which [will no longer be broadcast for the foreseeable future](https://en.wikipedia.org/wiki/Chespirito#Ending_of_his_series). I plan to download them all at some point; if anyone else could help preserve these classics, it would be appreciated!

  7. I work IT for a CBS affiliate and while we’re pretty big, we’re nowhere near as big as these major cities. Right off hand without actually looking into it… I wonder what software they’re using for archiving? And seemingly “on the cloud”? Cause we get a 50TB LTO, and we’ve used about 27TB of it in the last 6 years, but still the lookup speed is nowhere near this fast. Recently I’ve heard about a new system going in maybe 4th quarter, I wonder if this is it!?

    Anyway I digress.

    Awesome find though.

    Edit: This actually looks like ENPS or something. I’m like 90% sure it’s not Avid INews. The naming structure looks straight from the news software though. “Slug names”… This is cool as hell.

    Edit 2: I just realized you said this isn’t archived video… which means… well I don’t know of anything going directly to air via web, cause of the obvious potential “gotta have it LIVE” problems, but I guess this could be used for their websites’ videos? Either way, bad ass.

  8. nice find, tho sadly i also lack sufficient disk space and internet speed..