Introducing DataHoarderCloud (a new standard for hoarding and sharing)

Disclaimer: I'm posting this on behalf of my internet friend /u/soul-trader, who posted this yesterday. It got removed by the AutoModerator for “account age”. He did not factor that in, ha!

Hello fellow hoarders. I have been part of this community for a long time, but this account was made specifically for this project.

I have been working on the theory behind this project for about a year, and I now think I finally have a basis to bring to the public for review and for input to improve the concept.

**The goal**

I was actually inspired to do this by people making joke comments about the contradiction of establishing a cloud for hoarders, since many here feel that no cloud can really be trusted. I meditated on the idea for a bit and realized that this is not entirely true. There is one specific application where a cloud makes sense, namely saving space while still preserving content that would otherwise be deleted from the internet.

I noticed that every time a post about a site going down appeared here, a torrent would quickly form and 100-200 people would usually be seeding it by the end of the day. Now I know it might sound like I am going against the stream here, but I think that is 80-180 seeders too many. Those people are just keeping the data on their disks for no reason, as the purpose has long since been fulfilled.

In other words, every time there is content to rescue and back up, everyone just storms in and in the end we have far too many copies. The effort completely lacks organization, and I think that with some coordination our resources could be allocated far more efficiently, letting us save more overall.

So my initial concept was to figure out a way for people to look up what other reputable people have already saved, so they can see what still needs downloading and what would not be worth their time (unless they are interested in the content themselves, of course), with the prospect of establishing a coordination and sharing network on that basis later. Over time I saw the potential it could have for many more things.

**The process**

The first thought that came to mind was of course to use hashes for the files, at which point I tried to figure out which algorithm would offer enough security for this purpose. It turned out that SHA-256 plus file size is far better suited than MD5, because over the last few years MD5 has become vulnerable to relatively affordable (in terms of computation cost) collision attacks.
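
A minimal sketch of how such a content fingerprint could be computed, hashing only the raw bytes and recording the file size alongside (illustrative code, not an existing tool):

```python
import hashlib

def content_fingerprint(path, chunk_size=1 << 20):
    """Return (sha256_hex, size_in_bytes), computed over the raw file content only."""
    h = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
            size += len(chunk)
    return h.hexdigest(), size
```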

At this point I was heavily inspired by the magnet links that torrents are moving towards today. I first researched them extensively and then tried to determine whether they could be adopted outright, and if not, to locate the flaws in the system that needed to be addressed.

This research concluded that magnet links are not suited to the purpose I had in mind, not really because of their technical structure, but because of the way they are used. Magnet links and the torrent framework itself suffer immensely from essentially identical files floating around under different hashes (because some provider put their name in a readme file somewhere), which would clutter up any kind of database quite quickly, and retrieving files depends entirely on who keeps allocating resources to keep the torrent up. Thus a file can be unavailable even though more than a few people have it saved on their disks.

I concluded that the best structure for a searchable file index would be the simplest one that still avoids collisions between different *content*:

[4 bit] type of encryption algorithm (for backwards-compatibility only once sha-256 falls out of favour, not for differently encrypted files floating around. Thus for the next few years, all qualified files would be restricted to 0000 until agreed on otherwise)

[256 bit] in case of sha-256 the hash (calculated exclusively from the *content*, no filenames, file-attributes, file-size etc involved)

[44 bit] file-size for a maximum of ~4.5 TB

This sums up to a 38-byte index per file, which is still quite large if you factor in that an average user around here seems to have up to 1M files (38 MB of index), but it is as low as we can get today without any collisions.
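
For illustration, here is how one such 38-byte entry could be packed and unpacked under this layout; the field order and big-endian packing below are my own assumptions, not a finalized spec:

```python
ALGO_SHA256 = 0b0000  # the only algorithm id allowed for now under the proposed rules

def pack_index_entry(sha256_digest: bytes, size: int) -> bytes:
    """Pack a 4-bit algorithm id, a 256-bit hash and a 44-bit size into 38 bytes."""
    assert len(sha256_digest) == 32
    assert 0 <= size < (1 << 44), "size exceeds the 44-bit field"
    value = (ALGO_SHA256 << 300) | (int.from_bytes(sha256_digest, "big") << 44) | size
    return value.to_bytes(38, "big")  # 4 + 256 + 44 bits = 304 bits = 38 bytes

def unpack_index_entry(entry: bytes):
    value = int.from_bytes(entry, "big")
    algo = value >> 300
    digest = ((value >> 44) & ((1 << 256) - 1)).to_bytes(32, "big")
    size = value & ((1 << 44) - 1)
    return algo, digest, size
```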

This is the point where I realized how widely applicable such a deliberately simple structure is (though in retrospect it is not much different from magnet links, just simpler in design and with a focus on rules that achieve what we want). It does not require any system like the torrent framework to function: if each file has exactly one community-accepted hash, instead of one hash covering a collection of differently packed files, it becomes extremely easy to search any distributed platform for it.

So this is where my process branched out: refining the structure and the limits on the files accepted for it, and building the theory for a platform specifically aimed at acting as a database for it.

**The structure**

The structure basically describes the index standard that any program generating indexes would need to hold itself to.

I decided on a process focused specifically on files others would want and that have no collisions, so I settled on two whitelists of accepted file extensions, plus some exceptional rules:

The first whitelist:

1. Executables, binary packages and isos (for software installations)

2. Document formats

3. **Lossless** video formats (no .mp4 etc., because the many rips and repacks would each end up with a different hash)

4. **Lossless** audio formats

The second whitelist:

1. Lossy video formats

2. Lossy audio formats

3. Zips (exclusively the zip format, to avoid identical files being packaged differently. All zips should be packaged with the same arguments, which still need to be specified; input is very welcome.) This category is for all those exotic files: database files, scientific content, data packages that rely on each other (yet cannot be converted into binary packages the way software can), etc.

The first whitelist is focused on maximum efficiency in avoiding identical files with different hashes; the second is more for extended and casual use and sharing.

Note that neither whitelist includes image files, to avoid accidental uploading of your wedding photos, which, with all due respect, nobody outside your family cares about. If you want to share things like image scans, old maps, digital art etc., the files should be packed into a *single* zip and then hashed.

Additionally, no .zip is allowed to contain an executable, to avoid things that belong in whitelist 1 being spread out ad infinitum (a rough sketch of such a check follows below).
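
A rough sketch of how a client could enforce these whitelist rules; the extension sets below are placeholders only, since the actual lists still need to be agreed on:

```python
import zipfile
from pathlib import Path

# Placeholder extension sets; the real lists still need to be agreed on.
WHITELIST_1 = {".iso", ".exe", ".deb", ".rpm", ".pdf", ".epub", ".flac", ".wav"}
WHITELIST_2 = {".mp4", ".mp3", ".zip"}
EXECUTABLE_EXTS = {".exe", ".msi", ".bat", ".sh", ".deb", ".rpm"}

def classify(path: str):
    """Return 1 or 2 for the whitelist a file falls under, or None if it is not indexable."""
    ext = Path(path).suffix.lower()
    if ext in WHITELIST_1:
        return 1
    if ext in WHITELIST_2:
        # Rule: zips must not contain executables, so whitelist-1 material
        # cannot be smuggled in under arbitrary packagings.
        if ext == ".zip":
            with zipfile.ZipFile(path) as zf:
                if any(Path(name).suffix.lower() in EXECUTABLE_EXTS for name in zf.namelist()):
                    return None
        return 2
    return None  # e.g. loose image files are deliberately excluded
```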

**The platform**

My concept of a platform leveraging this structure consists of a client and a server holding the index tables.

The client provides an interface where you can select which files you want to add to your index table and how you want to hash them (individually, the default, or as .zips for folders whose files rely on each other). In addition to the index file that follows the standard described above, it builds a name index as a convenience feature to help you search through your files. You should be able to manage and search all your files with one single piece of software.

The index file is then uploaded to a server and associated with a keypair, which is saved in the database to identify the uploader and to ensure that only the uploader can change their respective indexes.
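
One possible way to realize this, sketched here with Ed25519 keys via the third-party `cryptography` package (my choice of primitives for illustration, not part of the proposal itself):

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Client side: generate a keypair once, then sign every index upload with it.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

index_blob = b"...concatenated 38-byte index entries..."  # placeholder payload
signature = private_key.sign(index_blob)

# Server side: the stored public key identifies the uploader and rejects
# changes to an index that were not signed by its owner.
try:
    public_key.verify(signature, index_blob)
    accepted = True
except InvalidSignature:
    accepted = False
```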

The server would then add it to its own table as a user-index pair and calculate how many copies of each file are saved in total, which could be browsed publicly.

This is where it became difficult. If you have a service dedicated to collecting an index of all existing files and the people who own them, you first have to deal with the massive amount of space needed to store the hashes for trillions of files; second, you need a way to deal with attackers who maliciously inject nonexistent hashes, or existing ones for files they do not own; and third, you need to take care of the legal complications today's political climate would bring (i.e. the bullshit “secondary file provider” concept torrent sites are being attacked with today, and, related to that, “illegal numbers”).

So my first idea is a maximum number of files you can upload per IP address per month (which unfortunately means storing IP addresses alongside files in a database), deleting entries older than three months so they get replaced by fresh ones (which should be done anyway, since the only way to ensure that the person who claimed to own files still owns them, or is still alive for that matter, is to continually require updates), and maintaining a list of confirmed malicious static IP addresses.
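
A toy in-memory sketch of the per-IP quota and three-month expiry described above (the quota value is purely illustrative, and the monthly counter reset is omitted):

```python
import time
from collections import defaultdict

MAX_UPLOADS_PER_MONTH = 10_000   # illustrative quota, not a proposed value
ENTRY_TTL = 90 * 24 * 3600       # entries older than roughly three months expire

uploads_this_month = defaultdict(int)  # ip -> number of entries uploaded this month
entries = {}                           # (ip, index_entry) -> timestamp of last refresh

def submit(ip: str, index_entry: bytes) -> bool:
    """Accept an index entry if the IP is still within its monthly quota."""
    if uploads_this_month[ip] >= MAX_UPLOADS_PER_MONTH:
        return False
    uploads_this_month[ip] += 1
    entries[(ip, index_entry)] = time.time()
    return True

def expire_old_entries() -> None:
    """Drop ownership claims that have not been refreshed within the TTL."""
    cutoff = time.time() - ENTRY_TTL
    for key, ts in list(entries.items()):
        if ts < cutoff:
            del entries[key]
```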

The second idea is to hash the table chunks themselves and spread the tables out to other nodes, much like a distributed hash table, to be requested on demand and updated/rehashed. Complete decentralization of this process could theoretically be achieved with a blockchain-like system that confirms the integrity of the master nodes, which have the privilege of updating the IP tables (and hold a larger share of them), allowing the server system to be redundant instead of relying on one central node.

Additionally, in the future this could be expanded with the ability to log in to the system with your key, communicate with other users, and request files to be exchanged via another service of choice.

I think this has the potential to be a true successor to magnet links, as this system also factors in the resources the torrent system considers dead, by establishing the grounds for a simple request network. Note that it is not the same as a P2P network like Gnutella, as it focuses on a much simpler unifying concept that any other service could build on. At its core it is just a simple lookup service to check who else has your file, so you are not forced to keep something a few reputable users already have and which is therefore always available to you on request. A true cloud for data hoarders.

There are still a few more things I would like to talk about, but as this post has become quite long I am taking a break from writing now. I am very interested to hear thoughts, suggestions and critique, and am happy about any discussion.

27 Comments

  1. This project sounds exactly like https://zeronet.io/. Everything stays up as long as someone is hosting the file.

  2. >I concluded that the best structure for a searchable file index would be the most simple that still avoids collisions of different content:
    >
    >[4 bit] type of encryption algorithm (for backwards-compatibility only once sha-256 falls out of favour, not for differently encrypted files floating around. Thus for the next few years, all qualified files would be restricted to 0000 until agreed on otherwise)
    >
    >[256 bit] in case of sha-256 the hash (calculated exclusively from the content, no filenames, file-attributes, file-size etc involved)

    You should look into [multihash](https://github.com/multiformats/multihash), which is already a standard used by projects such as IPFS.

  3. I think you can break this apart into a couple separate problems:

    1. **P2P File Hosting / Global De-Duplication**. This already exists: [https://ipfs.io/](https://ipfs.io/). IPFS is made to help host things forever, with de-duplication, with authority, with P2P, with hierarchy, and with fingerprinting. I’d recommend trying it out before thinking too much more about this space, because IPFS is already pretty much the expert in it.
    2. **Coordinating a mass distributed rip.** This is a little harder, because you’re going to have to find some authority and standard for how you can coordinate ripping everything from a single website. How will you forward information to other peers about how far you’ve gotten? Are we talking about the archive team’s format? Is it possible to even invent a standard for coordinating this work, when every website’s shape will be completely different, and when no-one is in charge?
    3. **Organizing the resulting backup.** Getting anyone to rip anything in a way that’s *relationally correct in all cases* and *useful in its organization for all uses* is going to be a heavily disputed problem. So much so that I think it’s impossible to be right all the time when trying to preserve the structure (much less be correct even *most of the time*). Every website’s structural needs are completely different. You’ll end up having to write structures for all the maths and logics in a sort of foundational way, because all the foundational logics out there are used in all of the internet’s relations. Things like “Users”, “Channels”, “Topics”, “Forums”, “Embeds”… this is a perpetually growing problem. That graph’s shape is not something you can perfectly plan for, nor is it immutable, because everyone will make mistakes when trying to fill it out and connect the dots, and websites tend to change their structures too.

    I’m being *very terse* here. Personally I think we’d do better to skip past #2 for now, and just worry about #1 and #3. We can use existing solutions for #1 *wherever it is most convenient* for now, because we can’t really tell people where or when to host their stuff anyway. #3 is a solvable problem by just thinking about storage systems and relations a bit more abstractly. And #2 is a complete WTFBBQ — I don’t even know where to start on that one.

  4. Alright, here is the longer writeup I promised.

    So, your idea here is to make a catalog where people can look up what other people already have. To solve it, you offer a compact hashing mechanism with some rules for inclusion. There are a few basic difficulties here with a simple client-server hash catalog approach:

    * How do I know that the people listing stuff really have them? How can I verify it? A catalog needs to guard against lies and sabotage.
    * How can I get access to the files that others have? Files are useless if there is no way to get them.
    * How can I keep my contributions to the catalog up to date with minimal hassle? It needs automation.
    * How can I query the catalog with minimal hassle? It needs usability.
    * How can I motivate people to list their assets in the catalog?

    There are various solutions possible to cover all these angles but the most precious resource in the community is not storage space or novel ideas. The most precious resource is time of the contributors. Whatever solution is used must be very cost-effective in terms of using up contributors’ time. Creating a big system is only useful if it actually gets made and if people actually use it.

    IPFS was already mentioned as something in a similar vein. In fact, it is something more akin to a storage backend that is capable of doing distributed storage of the files but needs a fairly significant app layer on top to be of any use.

    Let’s review some basic facts about IPFS. What does it do?

    1. You can create a unique identifier for a file if you know its contents (a hash).
    1. If you know the unique identifier, you can obtain a file from the IPFS network.
    1. The file is provided by peers who “pin” it. Anyone can pin any file.
    1. To add more mirrors to a file, a peer will ask IPFS for the file and then pin it.
    1. To cease being a mirror, a peer will remove the pin.

    You can think of IPFS as a network connecting “pins” with downloaders. It is a directory of who has pinned what. You see the resemblance here?

    Now IPFS itself is just a storage mechanism. To be valuable to data hoarders, an app layer needs to be made on top for proper file management. Most importantly, hoarded files tend to be in various “collections” which is a concept that does not exist in IPFS (which deals with individual files). How can these groupings be made? You need a catalog system. You need mechanisms for obtaining the collections and for updating/maintaining them. If someone wants to mirror files, they need to understand what the catalog comprises.

    This all needs to work in a distributed manner, as centralized systems fail once people lose interest and stop maintaining them (see much of the archiveteam tooling that has fallen into disrepair). That is not to say it needs to be peer-to-peer (as that would open it up to spam and remove the possibility of protecting against poisoning attacks), but it needs to support mirroring of the catalog mechanism as well as the content.

    Given these basic requirements for survivability of the solution, what kind of user experience would be satisfactory?

    1. As a system designer, I do not want items in the catalog that are not actually available for download. That would complicate things massively. Online only.
    1. As a data hoarder with content to publish, I want an app that I can tell “Folder X should be published as ‘Ganna-Varvara Cartoons 1990-1995 x265 1.2Mbps AAC-HE MP4’” and have it take care of making this content available, publishing it in the catalog and doing all my bookkeeping.
    1. As a data hoarder with space and bandwidth to contribute, I want to see a list of collections, see how many mirrors they have and just hit a button to start mirroring one.
    1. As a system designer, I want to keep some honesty in the number of mirrors – to track their uptime at least and not count mirrors with poor uptime.
    1. As a data hoarder authoring a collection, I may wish to update the collection to add new content or modify existing content (including replacing/deleting). The app should make this transparent to me, publishing the changes I make without any needed effort.

    This is just a bare minimum solution and already quite a nontrivial piece of engineering. Yikes! I hope you have a government research grant standing by to pay for the development!

    Speaking of research grants, there is also Tribler, which is a kind of “anonymous torrents with feeds” system made with some research money. Perhaps not ideal for this use case but worth considering.

  5. > Lossless video formats

    I assume you mean source files like blu-ray and DVD images, instead of actual lossless video.

  6. I would encourage you to consider IPFS for the storage backend and to focus on making a good frontend for IPFS that manages file pinning (to avoid needlessly high duplication) and updates/upgrades for users.

    IPFS is a good storage concept that lacks proper productization into end user usable apps. This could be one such app.

    I will try to do a longer writeup within 10 days to better explain how the two might be merged. Do you have any off-reddit discussion group where the discussion could be better tracked?

  7. It sounds like your only problem with the torrent approach is that it doesn’t aim for a target number of seeders and discourage going significantly over or under that.

    Isn’t that something that could be solved perfectly with a private tracker community where the rules are set to encourage exactly 20 or so seeders, including prominently featuring any torrents under that seeder count and discouraging the downloading of anything over it unless you actually want the content?

    So someone could join, go to the list of under-seeded torrents, grab until they hit their personal GB limit, and move on.

  8. > everytime there is content to rescue and backup, everyone just storms on it and at the end we have way too many copies

    A big thing in this community is that you back up data yourself because you don’t trust other people to do it “right”, or at least the way you want.

    If I think something is valuable enough to save, I want it on my own disks, so I can access it quickly and easily over my LAN, and I’m not dependent on anyone else.

    For example, the Internet Archive – great project, I fully support it. Will it still be around in 20 years? 50? 100? No idea. But I think I can keep my own data for as long as I’m alive.

    EDIT: Oh, and also [this](https://imgs.xkcd.com/comics/standards.png)

  9. If you’re serious, create an RFC and a GitHub repo, and start on a PoC.

  10. This also breaks down for any controversial or illegal content. It might be good to back up ROMs, for instance, but if I were to do that I sure as hell wouldn’t want to be on a list. That’s asking for trouble. I think most data at risk probably falls into this category.

  11. /u/soul-trader, I hope this comment is not too late, but I believe you should cooperate with, or at least get in touch with, the people behind the web archive project [web.archive.org](https://archive.org/about/bios.php), as they certainly have good experience in at least part of what you’re trying to achieve.

    You have done a very good study, and I thank you lots for that 🙂

  12. I would definitely be interested in joining once there’s client software

    I think a p2p setup is almost necessary for this, even if it’s just as clunky as making every client also run a full server, with N randomly chosen clients designated as servers for the rest.

  13. > [4 bit] type of encryption algorithm (for backwards-compatibility only once sha-256 falls out of favour, not for differently encrypted files floating around.

    sha256 is a cryptographic hash function, not an encryption scheme

  14. What you’re proposing sounds like IPFS.

    It has the same concept of “pinning” files, where you say that you want to keep a file available on the IPFS network, and as long as someone has that file pinned it’s accessible by anyone. It can change hands `N` different times, where A pins, B pins and A unpins, C pins and B unpins, etc. and it will always be accessible.

    Some comments on your structure:

    Don’t use bits, just bytes. Don’t save “half a byte” and make it a pain in the ass to work with, give every field at least one byte. If you’re going with bit fields, pack them into a byte and use that as a flag byte.

    4.5TB as a limit _today_ is dumb. I haven’t seen a single file that big, but there are a few torrents that size, and if you’re building something new you should expect bigger stuff in the future. Go with 64 bits; that’s 8 or 16 exabytes and a limit you won’t see for at least 20 years.

    Client-server is going to be a bottleneck for any system of the scale you’re talking about. It will work for a long while even if it’s centralized on the /u/soul-trader server, but if adoption gets to the same scale as any of the other P2P systems then it’s gonna get weird. You could have “federation” of a sort, where you have multiple tiers or shared data pooling between separate instances, but something more like DHT will work better long-term.

    Don’t hash zips/archives directly, or at least give the option to hash the stuff inside them as well. That will help you avoid the “someone adds an NFO” problem invalidating your content, and will help you dedup when someone takes all 30 RAR files and packages them as a single uncompressed / recompressed torrent. Same goes for content archives: if 50 different 4chan dumps have the same file, you’d be better off indexing and storing it once. It would also solve a problem I encounter regularly, where I repackage content with `advdef` or `zopfli` to get better compression for identical source bits.
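
    A rough Python sketch of that suggestion, hashing each member of a zip individually without extracting it to disk (illustrative only):

    ```python
    import hashlib
    import zipfile

    def hash_zip_members(zip_path: str) -> dict:
        """Map each member name to the SHA-256 of its uncompressed content,
        so identical payloads dedup even when the surrounding packaging differs."""
        digests = {}
        with zipfile.ZipFile(zip_path) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                h = hashlib.sha256()
                with zf.open(info) as member:
                    for chunk in iter(lambda: member.read(1 << 20), b""):
                        h.update(chunk)
                digests[info.filename] = h.hexdigest()
        return digests
    ```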

    A limit per IP would be rough. Figuring out a web-of-trust would be a better plan, and your blockchain is one of the only useful applications of that kind of technology! Same idea as bitcoin or GPG: I sign that I have / own / publish something and some other people vouch that they got matching content from me. Thinking about it a bit, that could be the solution for a lot of what you’re talking about – make a chain that says when someone hosts or stops hosting a thing (tied to your content hashing scheme) and you can chain everything from there. If I try to fetch from `$source` and it doesn’t have the thing I want, I would publish a message to that effect, and eventually my `$source doesn’t have content XYZ` would override the original `$source is hosting XYZ` when enough other entities confirm that fact.

    Let me know if you go forward with this. I have a bunch of random stuff archived and would like to see how this kind of system would handle it. I also have some extreme weird cases (edge cases of edge cases) that I would be curious to see whether this approach can handle.

  15. > [44 bit] file-size for a maximum of ~4.5 TB

    How long do you expect your project to last? What do you do when you need to handle bigger files?

    > [4 bit] type of encryption algorithm (for backwards-compatibility only once sha-256 falls out of favour, not for differently encrypted files floating around. Thus for the next few years, all qualified files would be restricted to 0000 until agreed on otherwise)

    > [256 bit] in case of sha-256 the hash (calculated exclusively from the content, no filenames, file-attributes, file-size etc involved)

    How will this avoid duplication of files? The point of hashing isn’t encryption, but to verify that two files are identical without comparing the entire file bit by bit. Because there can be collisions, you’re looking for ways to avoid them. The simplest way is to record:

    * hash
    * size
    * some other data unique to that file, e.g. first/last x bits

  16. About limiting fraudulent uploads: how about, instead of using a hash of the whole file, splitting the file into e.g. 16 chunks, hashing these, and making an “überhash” out of them (concatenate the hashes into one string and hash the resulting string)? Any node that wants to check whether a file exists in the network would just need the “überhash”, but anyone wanting to announce their IP as an owner of that file would have to present all 16 hashes. I think it would only work if the database were centralized, though, as otherwise nodes would be able to replay others’ messages.
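
    For illustration, a small Python sketch of this chunked “überhash” scheme, assuming fixed equal-size chunks (the exact chunking is an assumption):

    ```python
    import hashlib

    def chunk_hashes(data: bytes, n_chunks: int = 16) -> list:
        """Split the content into roughly n equal chunks and hash each one."""
        chunk_len = max(1, -(-len(data) // n_chunks))  # ceiling division
        return [hashlib.sha256(data[i:i + chunk_len]).hexdigest()
                for i in range(0, len(data), chunk_len)]

    def ueberhash(data: bytes) -> str:
        """Hash of the concatenated chunk hashes: enough to look a file up,
        while proving ownership requires presenting the individual chunk hashes."""
        return hashlib.sha256("".join(chunk_hashes(data)).encode()).hexdigest()
    ```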

  17. I know the torrent client Vuze uses Swarm Merging to get identical files from different torrents: https://wiki.vuze.com/w/Swarm_Merging
    Might be useful to look into it.

  18. Sounds like you want an altcoin that is earned by reputation and quality of archive. The prize would be… karma? Some other imaginary internet point? Steem seems super scammy to me since they premined, but they are kind of doing this, just with “journalism” instead of “archiving.”

  19. What we need is a dynamic, public Ceph network, with some agreed-upon ratio of personal storage space to replicated public storage.

    For example: I add a 3 TB node, of which I get 1 TB of space, and 2 TB is used for replication of other content (which is not under my control).

    My 1 TB gets replicated/redistributed elsewhere, and the other 2 TB is used to store replicated/redistributed content from other anonymous providers.

    Ceph would manage the distribution, de-duplication and high availability of the network.

  20. Of course it’s a great and interesting idea. As others have mentioned, the tough part is implementing it in an easy to use way and getting people on board. Definitely curious to see what others say, especially people like u/-Archivist

  21. Ambitious idea. Similar ideas have been proposed many times over at r/trackers but have never gained steam or were dismissed by the community outright. Over there, the idea of a new, improved data distribution/indexing platform ranks up there with someone telling the community they’re going to write a newer, better Gazelle or Ocelot. The ideas are fantastic, but actually mustering the manpower to turn these dreams into reality is always the roadblock.

    What you’re proposing sounds great and I truly hope it can gain momentum.

  22. I’m not even from this subreddit, but I read through your entire post and would love to see this idea become reality!

  23. Thank you, now it’s working. I hope we can get some discussion going!

    Now I think I should explain a little more about why I think this is a good idea. In the post I only said that I am going against the stream, but I never followed up by explaining why I am really not.

    I think that if we had some organization, in the form of a list of who owns what and who is ready to share what, we could have much better decentralized organization than we have now, not to mention all the other benefits it could bring if the format is expanded to other areas. If, out of the 100 people who would normally jump on one torrent, 80 jumped on 4 other torrents instead and kept them up, we could a(r)chieve so much more! The core idea is that seedboxes and disk space are essentially a limited resource, so their use should be coordinated, and that is how hoarding becomes more efficient in a cloud-like way.