[–] PuttItOut 0 points 10 points (+10|-0) ago 

A flexible solution would be to detect duplicates (assuming you are doing hash-based detection) and have both URLs point to the same backing image. This would require a *:1 mapping of URL to destination IMAGE. It sounds like right now you have a 1:1 mapping.

If you do account-based removals (i.e. a logged-in user deletes an upload associated with their account), you simply remove that mapping of the URL to that particular backing image. If an image has zero URL maps associated with it, you can delete the image safely, as it has no valid remaining references.
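
To make that concrete, here's a rough sketch of the mapping and the reference-counted delete, using SQLite and made-up table/column names (images, url_maps) rather than whatever you actually run:

```python
import sqlite3

con = sqlite3.connect("uploads.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS images (
    id   INTEGER PRIMARY KEY,
    sha1 TEXT UNIQUE,          -- hash of the backing file
    path TEXT                  -- where the bytes actually live
);
CREATE TABLE IF NOT EXISTS url_maps (
    url      TEXT PRIMARY KEY, -- the public URL handed to the uploader
    user_id  INTEGER,          -- who "owns" this URL
    image_id INTEGER REFERENCES images(id)
);
""")

def add_upload(url, user_id, sha1, path):
    # Reuse the existing backing image if the hash already exists,
    # otherwise create a new one; either way add a URL -> image mapping.
    row = con.execute("SELECT id FROM images WHERE sha1 = ?", (sha1,)).fetchone()
    if row:
        image_id = row[0]
    else:
        image_id = con.execute(
            "INSERT INTO images (sha1, path) VALUES (?, ?)", (sha1, path)
        ).lastrowid
    con.execute("INSERT INTO url_maps (url, user_id, image_id) VALUES (?, ?, ?)",
                (url, user_id, image_id))
    con.commit()

def remove_upload(url):
    # Drop only the mapping; delete the backing image once nothing points at it.
    row = con.execute("SELECT image_id FROM url_maps WHERE url = ?", (url,)).fetchone()
    if row is None:
        return
    image_id = row[0]
    con.execute("DELETE FROM url_maps WHERE url = ?", (url,))
    remaining = con.execute("SELECT COUNT(*) FROM url_maps WHERE image_id = ?",
                            (image_id,)).fetchone()[0]
    if remaining == 0:
        con.execute("DELETE FROM images WHERE id = ?", (image_id,))
    con.commit()
```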

Just my quick thoughts on duplicates. Keep up the good work.

Let us know what you decide to do.

[–] RotaryProphet 0 points 2 points (+2|-0) ago 

If the files are stored as-is on the filesystem (i.e., not in a database), you could probably run a script in the background to replace duplicates with hard links. The advantages: all the duplicates point to the same data (space saving), it's seamless (requires no code changes), and the data persists as long as at least one link remains, then is deleted when the last link is removed.
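
Rough sketch of such a script, assuming the images sit in one flat directory (the path below is made up) on a filesystem that supports hard links:

```python
import hashlib
import os

UPLOAD_DIR = "/var/www/uploads"  # hypothetical image directory

def file_sha1(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dedupe_with_hardlinks(directory):
    seen = {}  # hash -> path of the first copy encountered
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        digest = file_sha1(path)
        original = seen.get(digest)
        if original is None:
            seen[digest] = path
        elif not os.path.samefile(original, path):
            # Replace the duplicate with a hard link to the first copy;
            # the data survives until the last link is removed.
            os.remove(path)
            os.link(original, path)

dedupe_with_hardlinks(UPLOAD_DIR)
```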

[–] MichelleObamasPenis 0 points 2 points (+2|-0) ago 

Just leave it. Not a big deal.

[–] d0c5 0 points 1 points (+1|-0) ago 

Let them ride. People do things for a reason.

[–] JustGuessing 1 points 1 points (+2|-1) ago 

What qualifies as a duplicate?

If a user uploads differently sized versions of the same picture, would those count as duplicates?

[–] brandon816 1 points 1 points (+2|-1) ago 

Don't worry about removing duplicate images for now. When the time comes, maybe you can just have them redirect internally to the original image, so you still don't have broken links.

The ideal solution is actually easy to do. On upload, take an SHA1 hash (or some other alternative) of the file data, compare it against the hashes already in the database, then compare the actual file data for any matches. That way you aren't comparing giant blobs unless you're already pretty sure they're going to match, and an SHA1 hash is short enough that you can create a non-unique index on it to speed up that comparison without crippling your upload time.

If you think you might want to try this, save yourself the hassle and start taking a hash of the image data and storing it next to the data ASAP, if you aren't already. That way you won't have to backfill so many entries later if/when you want to start comparing file data, and it costs you almost nothing in CPU, bandwidth, or storage even if you decide not to do anything with it for the moment.
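
For illustration, roughly like this in Python, with a hypothetical find_paths_by_hash() standing in for the indexed database lookup:

```python
import hashlib

def sha1_of(data):
    return hashlib.sha1(data).hexdigest()

def find_duplicate(upload_bytes, find_paths_by_hash):
    """Return the stored path of an identical image, or None.

    find_paths_by_hash(digest) is assumed to hit the non-unique
    index on the stored hash column and yield candidate file paths.
    """
    digest = sha1_of(upload_bytes)
    for path in find_paths_by_hash(digest):
        # The expensive full comparison only runs for candidates
        # whose hash already matched.
        with open(path, "rb") as f:
            if f.read() == upload_bytes:
                return path
    return None
```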

[–] Phen 0 points 1 points (+1|-0) ago 

Could you retain the URLs but internally have them all point to the same data?

Then maybe on an image page, you could list out all URLs that point to the same image to make it easier to reverse search.

[–] stolencatkarma 0 points 0 points (+0|-0) ago 

Use hashes to decide whether files are the same. If they are, just redirect. No reason to keep more than one copy.
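
Sketch of the redirect side, using Flask purely for illustration (not the site's actual stack) and made-up in-memory tables standing in for the real lookups:

```python
from flask import Flask, redirect, send_file

app = Flask(__name__)

# Hypothetical tables, filled in at upload time from the hash check.
CANONICAL = {"abc123": "xyz789"}                    # duplicate id -> original id
FILES = {"xyz789": "/var/www/uploads/xyz789.jpg"}   # id -> path on disk

@app.route("/v/<image_id>")
def serve(image_id):
    original = CANONICAL.get(image_id)
    if original is not None:
        # This URL was a duplicate upload: 301 to the single kept copy.
        return redirect(f"/v/{original}", code=301)
    return send_file(FILES[image_id])
```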

[–] Lord_of_the_rats 1 points -1 points (+0|-1) ago 

I suggest you use a cheap, low-quality hash algorithm first: very fast, but with many collisions.

If the cheap hash suggests two images are duplicates, use a better hash, e.g. SHA-256, to verify, so that a collision from the cheap hash doesn't cause a legitimate image to be deleted.
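
Something along those lines, using CRC32 as the cheap first pass (any fast hash would do) and SHA-256 to confirm before anything actually gets treated as a duplicate:

```python
import hashlib
import zlib

def crc32_of(path):
    # Cheap first-pass hash: fast, but collisions are common enough
    # that it must never be trusted on its own.
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return crc

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def is_duplicate(path_a, path_b):
    # Stage 1: the cheap hash rules out most pairs immediately.
    if crc32_of(path_a) != crc32_of(path_b):
        return False
    # Stage 2: the strong hash confirms, so a CRC collision can't get
    # a legitimate image deleted.
    return sha256_of(path_a) == sha256_of(path_b)
```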