
here is a one-liner for downloading certain file types from a webpage using lynx, awk, grep and aria2

hello, i was looking for a command to download certain file types from [archive.org](https://archive.org). i searched a lot and everyone was recommending wget, which worked but was very slow, so after some trial and error i found something that worked for me.

`lynx --dump https://archive.org/download/alice_in_wonderland_librivox | awk '/http/{print $2}' | grep -E '\.mp3' | grep -v "_64kb" | aria2c -i - -c -x 16 -j 4`

lynx is a cli web browser that is very fast and can easily and quickly load the page,

pipe it to awk to get all of the links in the page,

pipe the result to grep with the -E flag to find every mp3 link in the page,

pipe it again to grep with -v to remove every link that has "_64kb" in it since we don't need duplicates of these mp3 files (skip this step if the files you want to download don't have multiple qualities like this archive.org page),

finally pipe the result to aria2 (the package is usually called aria2 in most linux distros and the command is aria2c). -i - tells aria2c to read the list of links from standard input, -c tells it to continue downloading partially downloaded files if there are any, -x 16 sets the max connections per server to 16 and -j 4 sets aria2 to download 4 files at a time, you can set this number higher if your internet speed can handle it.
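
for reference, here is the same pipeline written out one stage per line with comments (just a sketch of the command above, assuming bash):

```
# same pipeline as above, one stage per line
lynx --dump https://archive.org/download/alice_in_wonderland_librivox |
  awk '/http/{print $2}' |     # print the url field of every line that contains a link
  grep -E '\.mp3' |            # keep only the mp3 links
  grep -v "_64kb" |            # drop the 64kb duplicates
  aria2c -i - -c -x 16 -j 4    # read the list from stdin, resume, 16 connections per server, 4 files at a time
```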


here is another example for downloading both jpg and png files from the same page:

`lynx --dump https://archive.org/download/alice_in_wonderland_librivox | awk '/http/{print $2}' | grep -E '(\.jpg|\.png)' | aria2c -i - -c -x 16 -j 4`

this command works with other websites too. if you know a way to improve it, please leave a comment and let me know.

if you are on windows you can use msys2, cygwin, git bash for windows or wsl/wsl2 to get access to lynx, awk and grep; aria2 already has a windows version.


edit: here is another solution that removes the need for awk, thanks to u/MultiplyAccumulate for the info

`lynx --dump --listonly --nonumbers --hiddenlinks=ignore https://archive.org/download/alice_in_wonderland_librivox | grep -E '\.ogg' | aria2c -i - -c -x 16 -j 4`


edit2: here are some bash/zsh functions to make downloading files easier:

`pg_dl () { lynx --dump "$1" | awk '/http/{print $2}' | grep -E "\.$2" | aria2c -i - -c -x 16 -j 4; }`

the above function downloads 1 file type from any page, use it like this: pg_dl weblink extension

`pg_dl+ () { lynx --dump "$1" | awk '/http/{print $2}' | grep -E "(\.$2|\.$3)" | aria2c -i - -c -x 16 -j 4; }`

the above function is the same but for downloading two file types, use it like this: pg_dl+ weblink extension1 extension2

`pg_dl_r () { lynx --dump "$1" | awk '/http/{print $2}' | grep -E "\.$2" | grep -v "$3" | aria2c -i - -c -x 16 -j 4; }`

the above function is for downloading a filetype that has duplicate entries/files in a page (like our example page that has both normal and 64kb compressed mp3 files), use it like this: pg_dl_r weblink extension string_to_remove
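
here is a rough sketch of a single function that takes any number of extensions (untested, the name pg_dl_any is just an example); it uses the --listonly trick from the edit above:

```
# hypothetical helper: pg_dl_any url ext1 [ext2 ...]
# builds a pattern like "\.(mp3|ogg)$" from the extensions you pass in
pg_dl_any () {
  local url="$1"; shift
  local pattern
  pattern="\.($(IFS='|'; echo "$*"))$"
  lynx --dump --listonly --nonumbers --hiddenlinks=ignore "$url" |
    grep -E "$pattern" |
    aria2c -i - -c -x 16 -j 4
}
# usage: pg_dl_any https://archive.org/download/alice_in_wonderland_librivox mp3 ogg
```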


2 Comments

  1. Yeah, I frequently use lynx to get url lists and *grep to select which urls to download.

    why was wget slow? Because you were downloading too many redundant versions/files? Because you weren't abusing archive.org by downloading multiple large files simultaneously? Many sites limit downloads because people were hitting them too hard; the terms of service on some specifically say not to hit the site harder than a normal interactive user would, and many downloaders, including wget, include options to insert delays between downloads so as not to abuse the site. wget isn't slow, internet archive is (see below). wget runs at the same speed as the official client.

    Lynx has some better options to output just the list of links. You want:

    lynx --dump --listonly --nonumbers

    Note that --listonly and --nonumbers aren't listed in the online manual on the website but are in the manpage and lynx --help output.

    And you might occasionally want --hiddenlinks=ignore if you don't want to see links that aren't visible.

    The example given only downloads http/https urls, no ftp. cut or sed can be used instead of awk, but there is nothing wrong with awk and it does a nice job of matching and extracting in one simple command. But it turns out you don’t need any of those if you use lynx properly.
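
    For example, a sed version (untested sketch, assuming the numbered References list that lynx --dump prints at the end of its output):

    lynx --dump https://archive.org/download/alice_in_wonderland_librivox | sed -n 's/^ *[0-9][0-9]*\. \(http.*\)/\1/p' | grep -E '\.mp3$'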

    Here is yet another command that extracts urls from html source, i.e. quoted strings that begin with http/https/ftp. It uses the closing quote to locate the end of the url. Then it removes the quotes at the beginning and end of line.
    curl https://www.ultimatebootcd.com/ | egrep -Eio '"(http|https|ftp):[^"]*"' | sed -e 's/^"//' -e 's/"$//'
    The "-o" option on egrep tells it to print only the portion that matches. I have frequently used sed substitutions to extract the URL part but there could be more than one url per line, so this might work better; it should put each match on a separate line.
    It won’t match other urls, like “geoURI” or urls that only appear in the text, not the output. If you want, you can make it only match href= and src= urls.
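
    If you want only href= and src= urls, a variant along these lines should work (untested sketch):

    curl -s https://www.ultimatebootcd.com/ | grep -Eio '(href|src)="(https?|ftp):[^"]*"' | sed -e 's/^[^"]*"//' -e 's/"$//'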

    Also, you can insert output of a command pipeline into a wget/curl/aria/ia/go-internetarchive command, since some downloaders don't necessarily like url lists on standard in.
    * using backticks: wget `command`
    * using $() which nests better: wget $(command)
    * using xargs: command | xargs wget
    xargs has an option to execute multiple copies in parallel. Very recent versions of curl have parallel download ability. Gnu parallel can also run commands in parallel. xargs has an option that lets you specify how many filenames to pass to each command, for programs that have built-in parallelism. wget can take a file list on stdin using "-i -".
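
    For example, an xargs version that runs four wget processes at a time (untested sketch; -n 1 passes one url per wget and -P 4 sets the parallelism):

    lynx --dump --listonly --nonumbers https://archive.org/download/alice_in_wonderland_librivox | grep -Ei '\.mp3$' | grep -v 64kb | xargs -n 1 -P 4 wget -c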

    wget --content-disposition sometimes results in saner filenames, getting rid of the url query strings if the server specifies a content disposition header.
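
    For example (hypothetical url): wget --content-disposition "https://example.com/get?file=123" saves the file under the name from the Content-Disposition header instead of "get?file=123".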

    also note that there is an official command line interface to internet archive:
    https://archive.org/services/docs/api/internetarchive/cli.html
    sudo python3 -mpip install internetarchive
    ia list --location alice_in_wonderland_librivox
    Note that you need just the last part of the url (basename). It also has download options that let you specify glob patterns for filenames and specify which formats to download. It is rumored to have the ability to download in parallel, but I didn't see that; there is an unofficial go-internetarchive that does.
    time ia download --format '128Kbps MP3' --format 'Ogg Vorbis' alice_in_wonderland_librivox
    It does seem on the slow side, downloading 267 megabytes in 24 files, in 7:54 or 4.45 megabits/second on a connection I tested at 60Mbps. It also may have seemed slow since it only printed a single “d” for each file downloaded. But if internet archive is busy enough that they can only fill 1/12 of a relatively slow internet connection, I don’t really want to hit them harder. Wget was producing similar speeds, though expressed in megabytes per second rather than bits, so it wasn’t that the official client was deliberately going slower.
    time wget $(lynx --dump --listonly --nonumbers https://archive.org/download/alice_in_wonderland_librivox | egrep -i '(\.mp3|\.ogg)' | egrep -v 64kb )
    Aria2 was quite a bit faster (1:37), if more abusive, but still didn't saturate the wire, again probably indicating that internet archive is loaded and that any speed gains you experience are coming at other users' expense.
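
    As mentioned above, ia download also accepts glob patterns for filenames; something like this should work (untested sketch, see ia download --help for the exact option):

    ia download alice_in_wonderland_librivox --glob='*.ogg'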

    Internet archive blog has some (old) suggestions on using wget with internet archive. One interesting one is using the -A (accept) option to select file formats when crawling.
    http://blog.archive.org/2012/04/26/downloading-in-bulk-using-wget/
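
    For example, a crawl limited to one level that keeps only mp3 and ogg files might look like this (untested sketch based on the wget manual, not the blog post):

    wget -r -l 1 -nd -np -A mp3,ogg https://archive.org/download/alice_in_wonderland_librivox/

    -r -l 1 crawls one level deep, -nd avoids creating a directory tree, -np stays below the starting url, and -A keeps only the listed extensions.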