Treat http:// and https:// guid as being the same when fetching
author Magnus Hagander <magnus@hagander.net>
Mon, 19 Mar 2018 11:11:03 +0000 (12:11 +0100)
committer Magnus Hagander <magnus@hagander.net>
Mon, 19 Mar 2018 11:11:03 +0000 (12:11 +0100)
This means that if the same blog post shows up under both http:// and
https:// (for example when somebody changes their blog URL from http to
https, but the feed still contains the old posts), we treat the two as
the same post and do not fetch a second copy of it.

We handle both http->https (common) and https->http (would probably
indicate a misconfiguration) scenarios.
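
The scheme swap described above can be sketched as a standalone function,
outside Django. This is only an illustration: the names alternate_guid and
is_known are hypothetical and do not appear in the commit, and a plain set
stands in for the Post table. The branching mirrors the diff below.

```python
def alternate_guid(guid):
    """Return guid with http:// and https:// swapped, or None if neither appears.

    Mirrors the diff: a substring test followed by str.replace, so every
    occurrence of the scheme in the guid is swapped, not only a leading one.
    """
    if 'http://' in guid:
        return guid.replace('http://', 'https://')
    elif 'https://' in guid:
        return guid.replace('https://', 'http://')
    return None


def is_known(guid, existing_guids):
    """True if guid, or its scheme-swapped twin, is already stored.

    existing_guids is a plain set used here in place of the database lookup.
    """
    return guid in existing_guids or alternate_guid(guid) in existing_guids


stored = {"http://example.org/post/1"}
print(is_known("https://example.org/post/1", stored))  # same post, other scheme
print(is_known("https://example.org/post/2", stored))  # genuinely new post
```

A guid that is not a URL (for example a tag: URI) has no alternate form, so
only the exact match applies to it.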

hamnadmin/hamnadmin/register/management/commands/aggregate_feeds.py

index ab3d9092243ade3c017b84e4ff4fcbfba217a278..e9256146e57f6713da589c13a9c1e7ef8048d6e1 100644
@@ -75,8 +75,18 @@ class Command(BaseCommand):
                                                for entry in results:
                                                        self.trace("Found entry at %s" % entry.link)
                                                        # Entry is a post, but we need to check if it's already there. Check
-                                                       # is done on guid.
-                                                       if not Post.objects.filter(feed=feed, guid=entry.guid).exists():
+                                                       # is done on guid. Some blogs use http and https in the guid, and
+                                                       # also change between them depending on how the blog is fetched,
+                                                       # so check for those two explicitly.
+                                                       if 'http://' in entry.guid:
+                                                               alternateguid = entry.guid.replace('http://', 'https://')
+                                                       elif 'https://' in entry.guid:
+                                                               alternateguid = entry.guid.replace('https://', 'http://')
+                                                       else:
+                                                               alternateguid = None
+                                                       # We check if this entry has been syndicated on any *other* blog as well,
+                                                       # so we don't accidentally post something more than once.
+                                                       if not Post.objects.filter(Q(guid=entry.guid) | Q(guid=alternateguid)).exists():
                                                                self.trace("Saving entry at %s" % entry.link)
                                                                entry.save()
                                                                entry.update_shortlink()