
A plea for help: looking for best tools to help dedupe/consolidate father’s data

**The Situation:**
I’m trying to help my father with his data. For most of his life he has been a bit of a DH…..and he now wants to make sure that he doesn’t leave that mess to me. It has started to cause him anxiety, and he is struggling to start the project. I want to help and get him to a realistic starting place where he can take over the organization of the data.

He has a lot (including 9-track tapes dating back to an IBM S/360…..but that’s outside the scope of this project). The data is mainly documents, pictures, and videos spread across many external hard drives. He loses track of what is backed up, and then will make backups of backups…..(I curse the day that Costco started selling 5-packs of external hard drives! There should be a warning!)

**The Question / Ask:**
I want to help him get everything deduped and centralized on a **single** drive and backed up to a NAS I have built him. I imagine there must be some good programs to help with this process…..something that indexes all the drives, dedupes, organizes…..and then copies data? Maybe it’s better to copy everything first and then run a program? I’m open to any suggestions that will make this as painless as possible (and ensure this is a one-time event!). *(thank you!!)*

**Hardware Details:**

* 20-30 internal/external hard drives (SCSI, IDE, & SATA, ranging from 16 GB to 512 GB)
* I have the means to read all of them
* post-organization plan: (I think we’re good on this front) I built two 16 TB NASes that back up to each other, and a **single** 2 TB external SSD for him. *My gut is that after dedupe he’ll have less than 2 TB, but if not I’ll trade the solid state for a single 5 TB external.*

**tldr;** What’s the best tool (or tools) to collect and dedupe data from across many external hard drives?

9 Comments

  1. >tldr; What’s the best tool (or tools) to collect and dedupe data from across many external hard drives?

    1. Backup before proceeding.
    2. Use the StarWind deduplication analyzer, WinMerge, and dupeGuru (links below) to identify duplicated and modified files, and to estimate your storage savings from removing the duplicates (a rough sketch of that estimate follows this list).
    [https://www.starwindsoftware.com/starwind-deduplication-analyzer](https://www.starwindsoftware.com/starwind-deduplication-analyzer)
    [https://winmerge.org/source-code/](https://winmerge.org/source-code/)
    [https://dupeguru.voltaicideas.net/](https://dupeguru.voltaicideas.net/)
    3. Organize the actual data you have on your storage. [https://www.reddit.com/r/DataHoarder/comments/1endkt/what_are_the_best_ways_to_organize_an_incredibly/](https://www.reddit.com/r/DataHoarder/comments/1endkt/what_are_the_best_ways_to_organize_an_incredibly/)
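
    As a rough illustration of the savings estimate in step 2, here is a minimal Python sketch (my own, not part of any of the tools above; the scan path is a placeholder): group files by content hash and sum the bytes the extra copies occupy.

    ```python
    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def sha256_of(path, chunk=1 << 20):
        """Hash a file in chunks so large videos don't exhaust memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk):
                h.update(block)
        return h.hexdigest()

    def estimate_savings(root):
        """Bytes reclaimable by keeping one copy per unique content hash."""
        groups = defaultdict(list)
        for p in Path(root).rglob("*"):
            if p.is_file():
                groups[sha256_of(p)].append(p)
        # Every copy beyond the first in each group is reclaimable space.
        return sum(p.stat().st_size for dupes in groups.values() for p in dupes[1:])

    # Placeholder path; point it at the folder holding the imported drives.
    print(f"{estimate_savings('/mnt/imported') / 1e9:.1f} GB reclaimable")
    ```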

  2. How are you at scripting? Many years ago I had a little program that compared file sizes, and if it found two identical sizes it computed the hashes to identify duplicates. It was extremely useful; something like the sketch below.
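
    A minimal sketch of that size-first idea (the path is a placeholder): sizes are free to compare, so only files whose sizes collide ever get hashed.

    ```python
    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def find_duplicates(root):
        """Group files by size first, then hash only the size collisions."""
        by_size = defaultdict(list)
        for p in Path(root).rglob("*"):
            if p.is_file():
                by_size[p.stat().st_size].append(p)

        by_hash = defaultdict(list)
        for paths in by_size.values():
            if len(paths) < 2:
                continue  # a unique size cannot have a duplicate
            for p in paths:
                # read_bytes is fine for a sketch; chunk reads for huge files
                by_hash[hashlib.sha256(p.read_bytes()).hexdigest()].append(p)
        return {h: ps for h, ps in by_hash.items() if len(ps) > 1}

    for digest, paths in find_duplicates("/mnt/drive01").items():
        print(digest[:12], *paths, sep="\n  ")
    ```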

  3. Just in case, I would make a backup of everything before deleting duplicates or editing data. I once had a problem where I thought I was deleting duplicates but was actually deleting the only copy of my files. Be careful!

  4. Others have given you really good advice. I would just add some ideas around the edges on how to think through this. Given the media you have mentioned, it doesn’t sound like it’s a huge amount of data size-wise. That’s helpful, as it means that if the process of getting things organized results in temporarily duplicating things further (in an organized way), that’s likely OK.

    I would start with something others have suggested: label every piece of media. Then copy everything onto the NAS (maybe something like “/original/<media name>”). Now you have everything on new (and hopefully redundant) media. Once everything is copied over, this is a great time to take a backup.

    Next, think about what you have and try to figure out the big categories. Put together a folder structure that you think will cover most of your files, and do that in its own location (maybe “/organized/<topic>/<subtopic>” or whatever).

    Next, copy the “/original/” folder to “/staged/”. This means you now have two copies of everything. From there, go through the “/staged” directory and start moving things from “staged” to the proper place in “organized”. When you empty a directory, delete it from “staged”. If you get stuck on something, you can always leave it and come back to it later. If you find things that are truly duplicates, delete them from “staged”.

    This move strategy provides a sense of accomplishment because you can see the amount of work left to do going down. It also makes it really easy to track where you are, as everything in “staged” still needs to be worked through. Finally, you know you have a clean original copy at “/original” as well as on the original media, so you are unlikely to mess things up in a way you can’t recover from easily.
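
    A minimal sketch of that staged-to-organized loop, assuming the layout above (the mount points and the extension-based routing rule are placeholders for whatever categories you settle on):

    ```python
    import shutil
    from pathlib import Path

    STAGED = Path("/mnt/nas/staged")        # assumed mount point
    ORGANIZED = Path("/mnt/nas/organized")  # assumed mount point

    def route(path: Path) -> Path:
        """Placeholder rule: file by extension; replace with real topics."""
        return ORGANIZED / path.suffix.lstrip(".").lower() / path.name

    # Materialize the listing first, since we move files while walking.
    for f in sorted(p for p in STAGED.rglob("*") if p.is_file()):
        dest = route(f)
        dest.parent.mkdir(parents=True, exist_ok=True)
        if dest.exists():
            continue  # name collision: leave it in staged to resolve by hand
        shutil.move(str(f), str(dest))

    # Drop now-empty directories so "staged" visibly shrinks (deepest first).
    for d in sorted((p for p in STAGED.rglob("*") if p.is_dir()), reverse=True):
        if not any(d.iterdir()):
            d.rmdir()
    ```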

  5. I’ve done this sort of thing before, although at a much smaller scale (~6 drives, about 4 TB raw), and it basically boils down to one thing: patience.

    First thing, label every drive, then set up a filesystem that natively supports deduplication on your NAS, such as Btrfs or ZFS. Then, drive by drive, import everything into there, each drive in a separate folder. Then start the deduplication (on Btrfs you’ll need a tool like bees; on ZFS, lots of RAM). That should cut down on raw space used while you import things.

    When you are done with that, you should make a few backups to at least 3 different sites, like an external HDD (export the filesystem using btrfs/zfs send), or use borgbackup and a provider like BorgBase/rsync.net/etc.

    At this point, you should have everything required to start pruning stuff. Everybody does this differently, but I like to start with dupeGuru (or similar utilities), as it allows me to set up a “reference” folder and have that copy of the files be kept. For example, if you have one fairly organized music library on a drive, you can set it as the reference and then move any duplicates outside it. This way you keep the organized copy and get the duplicates out of your way.

    When you encounter useless stuff, don’t delete it; move it instead. If you accidentally delete anything (which will happen), you’d have to go and find that file again on the original drive. So move stuff into a separate subvolume/dataset and use that as a sort of recycle bin (there’s a sketch of this pattern after the tips below).

    After you are done with that, you should have each HDD folder “clean” and you can start merging them into one. This is a good place to take another set of backups.

    After that, you should have a single folder containing everything you want, and the recycle-bin folder containing everything else. Double-check it and let it “marinate”, as sometimes you’ll think something is not useful when it might be, and once you are OK with it, delete the recycle-bin folder.

    Now take another set of backups and enjoy.

    A few tips:

    * For reliably copying data, use rsync on Linux. Also avoid external USB adapters; use only internal interfaces where possible.
    * If the filesystem on a disk is unmountable, use TestDisk to copy it onto the NAS, then create a copy of the image and only try to repair the copy. Run PhotoRec on a loop-mounted image if all attempts to repair the filesystem fail.
    * Try to centralize the “source of truth” afterwards. Instead of random HDDs and devices everywhere, set up Syncthing (or similar utilities) to synchronize data across devices, and have one such device back up to your NAS (or have your NAS participate in the Syncthing share).
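
    A minimal sketch of the move-instead-of-delete pattern above (both paths are assumptions; pointing TRASH at a separate subvolume/dataset is what the tip suggests). Keeping the relative path makes an undo a single move back:

    ```python
    import shutil
    from pathlib import Path

    ROOT = Path("/mnt/nas/staged")        # assumed: the tree being pruned
    TRASH = Path("/mnt/nas/recycle-bin")  # assumed: separate subvolume/dataset

    def soft_delete(path: Path) -> None:
        """Move a file into the recycle bin, keeping its relative path,
        so an accidental 'delete' can be undone with one move back."""
        dest = TRASH / path.relative_to(ROOT)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(dest))

    # Hypothetical file, kept only as an example.
    soft_delete(ROOT / "drive-03/misc/maybe-junk.doc")
    ```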

  6. An app? Just take them and put them on the disk/NAS. Literally drag and drop. WAY overthinking it.

  7. I like to image my old hard drives as .iso or .zip files because it avoids the temptation to delete stuff from the image, and that means I never question whether I’ve accidentally deleted something. Hard drive space is cheap again (finally) and it will only get cheaper, so just to be safe keep those images around forever, untouched.

    Consolidation+deduplication is useful as a way to quickly find stuff, so that’s step 2. Do this in a new directory.
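
    A minimal sketch of that image-and-leave-alone habit, using only Python’s standard library (the mount point, label, and destination are placeholders; note this captures files rather than a block-level .iso, for which a tool like GNU ddrescue is a better fit):

    ```python
    import shutil
    from pathlib import Path

    def archive_drive(mount_point: str, label: str,
                      dest_dir: str = "/mnt/nas/images") -> str:
        """Snapshot a mounted drive's files into one untouched .zip."""
        dest = Path(dest_dir)
        dest.mkdir(parents=True, exist_ok=True)
        # make_archive appends ".zip" to the base name it is given.
        return shutil.make_archive(str(dest / label), "zip", root_dir=mount_point)

    print(archive_drive("/media/drive-07", "drive-07"))
    ```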

  8. I don’t think dedup is the right word here, unless I misunderstood you.
    You want to clean up / remove all duplicate data, right?
    Dedup, in the sense I know it, is for saving disk space on blocks that exist more than once in a given dataset.
    Assuming it is what I expect, there is a tool for Linux: fdupes. It lets you decide how to compare different files, and it can also delete duplicate files automatically.
    This is the initial process.
    Afterwards it might be useful to make a cron job that runs maybe once a month and cleans out duplicates again.

  9. 1. Make an inventory – label drives – write down what’s on them and how big they are.

    2. When you know what type of data you’re dealing with and how big the pile is, buy enough drives to store it all on one machine. Segregate it roughly by category, date, or whatever fits you best.

    3. If you realize that you cannot fit all the data on one drive, consider RAID or a mergerfs pool.

    4. If you can delete duplicates (photo albums, for example, often contain the same photos), fdupes can delete them for you.

    5. With software, be careful with deduplication: the same library can be in two programs. Hard linking works better here (see the sketch after this list).

    6. Split your big categories into smaller ones; for example, if you have photos split by year, split them further by month. Your target here is to split the data into 30-minute chunks of sorting.

    7. Sort with your father when you have time for it. Be respectful when you’re deciding what to delete. Ask why he keeps something and why he feels he needs it.
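
    Since item 5 mentions hard linking, here is a minimal sketch of replacing a duplicate with a hard link to a canonical copy (the paths are hypothetical, and both files must sit on the same filesystem for os.link to work):

    ```python
    import filecmp
    import os
    from pathlib import Path

    def hardlink_duplicate(canonical: Path, duplicate: Path) -> None:
        """Replace `duplicate` with a hard link to `canonical`, so two
        programs each keep 'their' library while storing it only once."""
        if not filecmp.cmp(canonical, duplicate, shallow=False):
            raise ValueError("files differ; refusing to link")
        duplicate.unlink()
        os.link(canonical, duplicate)

    # Hypothetical example paths.
    hardlink_duplicate(
        Path("/apps/program-a/lib/common.dll"),
        Path("/apps/program-b/lib/common.dll"),
    )
    ```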

    During the sorting you can see how reasonable he is about his worries and what the source of the problem is. Psychologists can help with that, if he’s willing to go to one. Hoarding is usually a manifestation of something. I know people who hoard for the pleasure of possession; I call them dragons. Those people usually do not have problems with their collection and do not stress too much about its state, as long as it’s slowly growing and safe.

    But if your father stresses about it, it’s likely that the hoarding is a manifestation of some kind of fear, and a professional can help him realize that while you sort things together.