Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

Firebase Functions: hosting rewrite to dynamically generate sitemap.xml with more than 50000 links

I'd like to dynamically generate a sitemap.xml containing all static and dynamic user links (through uids from Firestore) with Cloud Functions when a user or a crawler requests https://www.example.com/sitemap.xml. I already managed to implement a working version using sitemap.js (https://github.com/ekalinin/sitemap.js#generate-a-one-time-sitemap-from-a-list-of-urls) and Firebase Hosting rewrites. However, my current solution (see below) generates one large sitemap.xml and only works for up to 50,000 links, which is not scalable.

Current solution:

Hosting rewrite in firebase.json:

  "hosting": [
      ...
      "rewrites": [
        {
          "source": "/sitemap.xml",
          "function": "generate_sitemap"
        },
      ]
    }
  ],

Function in index.ts:

import * as functions from 'firebase-functions';
import * as admin from 'firebase-admin';
import { SitemapStream, streamToPromise } from 'sitemap';
import { Readable } from 'stream';

export const generateSitemap = functions.region('us-central1').https.onRequest((req, res) => {

  const afStore = admin.firestore();
  const promiseArray: Promise<any>[] = [];

  const stream = new SitemapStream({ hostname: 'https://www.example.com' });

  // Static pages
  const fixedLinks: any[] = [
    { url: `/start/`, changefreq: 'hourly', priority: 1 },
    { url: `/help/`, changefreq: 'weekly', priority: 1 }
  ];

  // Dynamic per-user pages, read from Firestore
  const userLinks: any[] = [];

  promiseArray.push(afStore.collection('users').where('active', '==', true).get().then(querySnapshot => {
    querySnapshot.forEach(doc => {
      if (doc.exists) {
        userLinks.push({ url: `/user/${doc.id}`, changefreq: 'daily', priority: 1 });
      }
    });
  }));

  return Promise.all(promiseArray).then(() => {
    const allLinks = fixedLinks.concat(userLinks);
    return streamToPromise(Readable.from(allLinks).pipe(stream)).then((data: any) => {
      res.set('Content-Type', 'text/xml');
      res.status(200).send(data.toString());
      return;
    });
  });
});

Since this scales only to about 50,000 links, I'd like to do something like https://github.com/ekalinin/sitemap.js#create-sitemap-and-index-files-from-one-large-list. But it seems I'd need to actually create and temporarily store the .xml files somehow.

Does anyone have experience with this issue?

question from:https://stackoverflow.com/questions/66066333/firebase-functions-hosting-rewrite-to-dynamically-generate-sitemap-xml-with-mor


1 Reply


As you noted, this isn't scalable and your costs are going to skyrocket since you pay per read/write on Firestore, so I would recommend rethinking your architecture.

I solved a similar problem several years ago for an App Engine website that needed to generate sitemaps for millions of dynamically created pages and it was so efficient that it never exceeded the free tier's limits.

Step 1: Google Storage instead of Firestore

When a page is created, append that URL, on its own line, to a text file in a Google Storage bucket. If your URLs have a unique ID, you can use that to find and replace existing URLs.

https://www.example.com/foo/some-long-title
https://www.example.com/bar/some-longer-title

It may be helpful to break the URLs into smaller files. If some URLs start with /foo and others start with /bar, I'd create at least two files, called sitemap_foo.txt and sitemap_bar.txt, and store the URLs in their respective files.
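A minimal sketch of that routing step, assuming the /foo-vs-/bar prefix scheme above (the helper name `sitemapFileFor` and the `root` fallback are my own inventions):

```typescript
// Pick the sitemap file for a URL based on its first path segment,
// e.g. /foo/some-long-title -> sitemap_foo.txt, /bar/... -> sitemap_bar.txt.
function sitemapFileFor(url: string): string {
  const path = new URL(url).pathname;            // e.g. "/foo/some-long-title"
  const firstSegment = path.split('/')[1] || 'root'; // fallback for "/"
  return `sitemap_${firstSegment}.txt`;
}
```

The page-creation handler would call this to decide which bucket file to append the URL to.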

Step 2: Dynamically Generate Sitemap Index

Instead of a normal enormous XML sitemap, create a sitemap index that points to your multiple sitemap files.

When /sitemap.xml is visited have the following index generated by looping through the sitemap files in your bucket and listing them like this:

<?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
      <loc>https://storage.google...../sitemap_foo.txt</loc>
    </sitemap>
    <sitemap>
      <loc>https://storage.google...../sitemap_bar.txt</loc>
    </sitemap>
  </sitemapindex>
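An index like the one above can be assembled with a small helper once you've listed the sitemap files in the bucket. This is only a sketch; the function name and the idea of passing in the files' public URLs are assumptions:

```typescript
// Build a sitemap index XML document from the public URLs of the
// per-prefix sitemap files stored in the bucket.
function buildSitemapIndex(fileUrls: string[]): string {
  const entries = fileUrls
    .map(u => `  <sitemap>\n    <loc>${u}</loc>\n  </sitemap>`)
    .join('\n');
  return `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${entries}\n` +
    `</sitemapindex>`;
}
```

The /sitemap.xml handler would list the bucket's sitemap files (e.g. with the Cloud Storage client), feed their URLs to this helper, and send the result with Content-Type: text/xml.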

Step 3: Remove Broken URLs

Update your 404 controller to search your sitemap files for the requested URL and remove it if found.
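The removal itself is a one-line filter over the file's contents, given the one-URL-per-line format from Step 1 (the helper name is hypothetical; reading and rewriting the bucket file around it is omitted):

```typescript
// Remove a dead URL from a sitemap file's contents.
// Assumes one URL per line, as written in Step 1.
function removeUrl(fileContents: string, deadUrl: string): string {
  return fileContents
    .split('\n')
    .filter(line => line.trim() !== deadUrl)
    .join('\n');
}
```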

Summary

With the above system you'll have a scalable, reliable and efficient sitemap generation system that will probably cost you little to nothing to operate.

Answers to your questions

Q: How many URLs can you have in a sitemap?

A: According to Google, 50,000 URLs or 50 MB uncompressed, whichever comes first.

Q: Do I need to update my sitemap every time I add a new user/post/page?

A: Yes.

Q: How do you write to a single text file without collisions?

A: Collisions are possible, but how many new pages/posts/users are being created per second? If more than one per second, I would create a Pub/Sub topic with a function that drains it to update the sitemaps in batches. Otherwise I'd just have it update the file directly.
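The merge a Pub/Sub-drained function would perform can be sketched as a pure function; the de-duplication matters because Pub/Sub may redeliver a message (the helper name is my own):

```typescript
// Fold a batch of newly created URLs into a sitemap file's contents,
// de-duplicating so redelivered Pub/Sub messages don't add duplicate lines.
function mergeBatch(fileContents: string, newUrls: string[]): string {
  const existing = fileContents.split('\n').filter(line => line.trim() !== '');
  const merged = new Set([...existing, ...newUrls]);
  return [...merged].join('\n');
}
```

Draining in batches also means one bucket write per batch instead of one per page creation.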

Q: Let's say I created a sitemap_users.txt for all of my users...

A: Depending on how many users you have, it may be wise to break it up even further and group them into users per month/week/day. So you'd have a sitemap_users_20200214.txt containing all users created that day. That would most likely keep each file under the 50,000-URL limit.
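The per-day naming scheme above is just date formatting; a sketch (helper name assumed, using UTC so the file a user lands in doesn't depend on the server's timezone):

```typescript
// One users sitemap file per day: sitemap_users_YYYYMMDD.txt,
// keyed by the user's creation date in UTC.
function usersSitemapFile(createdAt: Date): string {
  const y = createdAt.getUTCFullYear();
  const m = String(createdAt.getUTCMonth() + 1).padStart(2, '0');
  const d = String(createdAt.getUTCDate()).padStart(2, '0');
  return `sitemap_users_${y}${m}${d}.txt`;
}
```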

