So I have finally finished deduping my 60,000-odd song collection. There were several thousand of what I consider duplicates.
First, for those interested in how so many duplicates snuck in: it's partly history. I initially ripped several thousand CDs to m4a, using iTunes and then Winamp, to listen to on an iPod. At first I didn't even mark which CDs were ripped, and they got mixed back into the collection as they got played in my home CD player. Then I started using stick-on "dots", which unfortunately fall off. But anyway, I then got a good "Home Theater" receiver and decided to start over, ripping to FLAC with dBpoweramp, in a different directory on my NAS. In actuality, I first ripped the CDs that (supposedly) had never been ripped to m4a, and, a year or so ago, started re-ripping the CDs previously ripped to m4a. While I tried to remove the re-ripped albums from the m4a directory, it was hard to find some of the old rips, and as I had purchased PerfectTunes, I figured I'd just wait and find the duplicates later.
So a few months ago, I began to see all too many duplicates, so I ran Dedup. Since this was going to be a slow process for me, I exported the duplicate list to Excel, and used that to locate the duplicates on the file server.
Well, the issue is that Dedup really works too well. It found thousands of duplicates or likely duplicates, and the Excel export doesn't tell which is which, or allow filtering.
It certainly found the duplicate albums (which is what I was after), but on a track-by-track basis. As I expected, it found the many duplicates that were on "best of" or compilation CDs, which I didn't want to delete unless the whole compilation CD was in fact a duplicate CD. It found remixes done by the same artist, live versions, "disco mixes", etc. I didn't want to delete those. It found cover versions by other artists, a lot of them. It found instrumental versions, even ones on totally different instruments, like steelband versions of calypso songs (my collection is heavily Caribbean). It took me over two months to go through the spreadsheet line by line and separate the wheat from the chaff, deleting only the true duplicates.
I did find some interesting issues. I had quite a few more duplicate copies of CDs than I realized, and if the title or artist of a track was slightly different when the second copy got ripped, I ended up with duplicate tracks in the directory for that CD album, with slightly different filenames. Of course, Dedup found those. I also found that some producers had a bad habit of re-releasing CDs, particularly compilation CDs, with a different title and different artwork, but the same tracks. I even found one where, in addition to changing the title and artwork, the producer had shuffled the tracks around. Same tracks, but in a different order.
Now here's my wishlist item: add an option to Dedup that would report only duplicate albums, i.e. where all or most of the tracks are duplicates, instead of every duplicate track. And add a second option that would simply scan each album directory for duplicate track numbers, flagging every directory with a duplicate track 1 or 2 or 15, or whatever. In my case, that tells me there is some kind of duplication or filing error. (I did find a couple of compilation albums I had ripped that had identical titles, and therefore all the tracks for both albums ended up in the same directory.)
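For anyone who wants the second check before it exists in the software, here is a minimal sketch of the duplicate-track-number scan in Python. It is not anything Dedup does. It assumes every audio filename starts with its track number (e.g. "01 Title.flac"), which is how my rips happen to be named; the function names and the extension list are my own inventions for illustration.

```python
import os
import re
from collections import defaultdict

# Assumption: filenames begin with a 1-3 digit track number, e.g. "01 Title.flac".
TRACK_RE = re.compile(r"^\s*(\d{1,3})(?!\d)")
# Assumption: only these extensions count as audio tracks.
AUDIO_EXTS = {".flac", ".m4a", ".mp3"}


def duplicate_track_numbers(filenames):
    """Return {track_number: [filenames]} for numbers used more than once."""
    by_number = defaultdict(list)
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext.lower() not in AUDIO_EXTS:
            continue  # skip artwork, logs, cue sheets, etc.
        match = TRACK_RE.match(stem)
        if match:
            by_number[int(match.group(1))].append(name)
    return {n: sorted(names) for n, names in by_number.items() if len(names) > 1}


def scan_library(root):
    """Yield (directory, duplicates) for each directory with a clashing track number."""
    for dirpath, _dirnames, filenames in os.walk(root):
        dups = duplicate_track_numbers(filenames)
        if dups:
            yield dirpath, dups
```

Looping over `scan_library()` with the top-level music folder as its argument would list every album directory holding two files with the same track number. Multi-disc albums kept in one folder with names like "1-01 Title.flac" would need a different pattern.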
While I found it interesting to see all the extraneous "duplicates" that were really cover versions or "best of" or "live" tracks, the deduping process would have been much faster and less painful if the software could be set to simply identify duplicate albums, and secondarily to identify directories with duplicate track numbers.
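The album-level report could be approximated from the track-level results along these lines. This is only a sketch under stated assumptions: it supposes the duplicate list can be turned into pairs of file paths (I am not claiming the Excel export has that shape), that an album is simply the directory holding the track, and that "most of the tracks" means 80% of the larger album. The function name, the `track_counts` argument, and the threshold are all hypothetical.

```python
import os
from collections import defaultdict


def duplicate_albums(pairs, track_counts, threshold=0.8):
    """Return sorted [(dir_a, dir_b, fraction)] for albums sharing most tracks.

    pairs: iterable of (path_a, path_b) track-level duplicates (assumed input shape).
    track_counts: {album_dir: number of tracks in that directory}.
    """
    shared = defaultdict(set)
    for path_a, path_b in pairs:
        dir_a, dir_b = os.path.dirname(path_a), os.path.dirname(path_b)
        if dir_a == dir_b:
            continue  # clashes inside one directory are a separate problem
        key = tuple(sorted((dir_a, dir_b)))
        # Count each matched track once, from the first album of the pair.
        shared[key].add(path_a if dir_a == key[0] else path_b)

    results = []
    for (dir_a, dir_b), tracks in shared.items():
        # Measure against the larger album so a compilation that merely
        # contains a few tracks from another CD is not flagged.
        larger = max(track_counts[dir_a], track_counts[dir_b])
        fraction = len(tracks) / larger
        if fraction >= threshold:
            results.append((dir_a, dir_b, fraction))
    return sorted(results)
```

Because it compares directories rather than track order, a re-release with shuffled tracks would still be reported, while a "best of" that shares one or two tracks with each original album would not.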