How do I build my own feed generator?
Write a script that reads a website and outputs an RSS feed, with guidance on scheduling, server etiquette, and ethics.
This guide covers the high-level concepts you need to consider when building a system that checks websites for new content on a schedule and publishes it as RSS feeds. It doesn’t include step-by-step build instructions — just the ideas, with a few illustrative sketches, around finding data sources, extracting content, respecting servers, and hosting the output so LightWatch can subscribe to it.
This is for people who are comfortable with code or coding agents. I don’t offer support for any of the concepts here — it’s a starting point for your own DIY project. You can provide this URL to a coding agent to get a project started with these considerations baked in.
Ethics
Before building a feed from a site that doesn’t offer one, think about why they don’t. Some sites simply haven’t gotten around to it, or haven’t considered it because it isn’t an obvious choice. Your approach should differ based on what you’re building and the sites you’re targeting.
Personal use
It is my opinion that, for personal use, a feed scraper is analogous to checking a website regularly for updates. You’re doing what a browser does, just on a schedule. As long as your scripts are polite this is fine.
I also believe that hotlinking images is reasonable for personal use. Your request frequency is similar to what it would be if you were checking via a browser. Cross-site concerns might make this challenging, but ethically, I think it’s fine.
The same goes for User-Agent behaviors. A lot of sites block unrecognized User-Agents, and you can get around this by providing a browser User-Agent instead. In a public context this is sketchy, but for personal use, I think it’s a reasonable way to consume content in your preferred manner.
Redistribution
Redistributing a feed you’ve created is different. The first question to ask yourself is: “Is what I’m doing appropriate?” If you’re trying to provide an ad-free version of ad-supported content — no. Are you creating a curated aggregate feed that canonically points to the original feeds as the source? Yeah! Ultimately, the safe choice is to reach out to the content provider for permission, so be smart and respectful.
Once you share your feed, you’re no longer one person. You’re creating an uncontrolled stream of traffic the source hasn’t accounted for.
If you decide to create a redistribution:
- Attribute the source. Link back to the original page for every post.
- Don’t circumvent the source’s business model. If a site doesn’t offer a feed because they rely on ad revenue, creating a feed that strips out the ads and sharing it takes money from the creator. That’s an easy no.
- Host images yourself. When you’re generating an unknown number of subscribers, you’re pushing bandwidth costs onto the source without their consent. If you can determine it’s legally OK to do so, you should download and host the images yourself.
- Check the robots.txt. This file exists for the server to make its wishes clear on exactly this matter.
- Use an honest User-Agent. A public crawler should be identifiable and blockable. Don’t try to cheat obvious signals that the site doesn’t want you scraping their content.
It seems silly to have to say this, but remember to be a good person in how you approach this. Treat others how you would like to be treated and you’ll probably end up heading in the right direction.
What you’re building
Ok, now that we’re all confident we’re not being a-holes, let’s get into it. You want a system that runs on a schedule, checks websites for new content, and publishes that content as RSS feeds that LightWatch can subscribe to. The high-level pieces are:
- A pipeline that fetches content and turns it into structured post data.
- A feed writer that outputs valid RSS XML from that data.
- A scheduler that runs the whole thing periodically.
- A host that serves the generated feed files somewhere LightWatch can reach.
If you’re only generating feeds from one or two sites, this can be a single script, but you might want to consider a more robust pipeline if you intend to do more than that.
The pipeline
Most content sources follow the same two-level pattern:
- An index — a page or endpoint that lists posts (a blog homepage, a gallery page, an API response).
- Articles — the individual posts linked from the index, which often have richer content than the index itself.
Sometimes the index contains everything you need. Other times the index just has titles and thumbnails, and the good stuff is on the article pages.
Design your pipeline as two independent steps:
- Fetch the index and return a list of post objects (title, URL, date, thumbnail, etc.). How you fetch and parse the index is up to the implementation — it could be an API call, structured data extraction, or HTML parsing. What matters is that it returns the same structured result regardless.
- Enrich each post by optionally following its URL and extracting additional content — higher resolution images, more media, better metadata. This step is additive: it layers data on top of what the index already provided rather than replacing it.
This separation keeps your system flexible. You can swap out the index fetcher for a different source without changing the enrichment step. You can skip enrichment entirely for sources where the index has everything. And you can add new sources without rearchitecting the whole thing.
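As a concrete sketch of that two-step shape (the `Post` fields and function names here are my own choices, not a required API):

```python
from dataclasses import dataclass, field

@dataclass
class Post:
    title: str
    url: str
    date: str = ""
    thumbnail: str = ""
    extras: dict = field(default_factory=dict)

def fetch_index(items: list[dict]) -> list[Post]:
    """Step 1: turn already-fetched index data (API JSON, parsed HTML,
    etc.) into a uniform list of Post objects."""
    return [
        Post(
            title=item["title"],
            url=item["url"],
            date=item.get("date", ""),
            thumbnail=item.get("thumb", ""),
        )
        for item in items
    ]

def enrich(post: Post, article_data: dict) -> Post:
    """Step 2: layer extra data from the article page on top of
    what the index provided. Additive — never replaces index fields."""
    post.extras.update(article_data)
    return post
```

The point is only the shape: however the index is fetched, step 1 always emits the same structure, and step 2 only ever adds to it.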
Caching
Keep a local cache of posts you’ve already processed (a JSON file, a SQLite database, whatever works). On each run, check new posts against your cache and skip anything that’s already been processed. This makes each run faster and avoids putting unnecessary load on the server by re-fetching pages you’ve already seen.
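A minimal version of the JSON-file approach might look like this (the filename and the choice of URL as the dedupe key are assumptions):

```python
import json
import os

CACHE_PATH = "seen_posts.json"

def load_seen(path: str = CACHE_PATH) -> set:
    """Load the set of post URLs processed on previous runs."""
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def filter_new(posts: list[dict], seen: set) -> list[dict]:
    """Keep only posts we haven't processed before."""
    return [p for p in posts if p["url"] not in seen]

def save_seen(seen: set, path: str = CACHE_PATH) -> None:
    with open(path, "w") as f:
        json.dump(sorted(seen), f)
```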
A side benefit: if you keep the cached data around, you can serve feeds with more history than the source itself exposes.
Filtering
You might want a filtering system that lets you exclude posts by keyword, user, tag, or other criteria. This can be handy for situations like “I want this feed but not these annoying users.” Building this in from the start is easier than adding it later. Consider filtering at both the global and the per-source level. Each has its benefits, but you’re more likely to care per-source than globally.
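One way to sketch the global-plus-per-source idea (the rule format here is an assumption for illustration):

```python
def passes(post: dict, rules: dict) -> bool:
    """Return False if the post trips any exclusion rule."""
    text = (post.get("title", "") + " " + post.get("body", "")).lower()
    if any(kw.lower() in text for kw in rules.get("exclude_keywords", [])):
        return False
    if post.get("author") in rules.get("exclude_users", []):
        return False
    return True

def apply_filters(posts: list[dict], global_rules: dict, source_rules: dict) -> list[dict]:
    """A post must pass both the global rules and this source's rules."""
    return [p for p in posts if passes(p, global_rules) and passes(p, source_rules)]
```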
Finding your data source
Scraping is great and all, but — actually no, scraping sucks. It’s slow, hard to do, and fragile. Avoid it if you can.
First check if there’s a better data source:
- Public API — Some sites have a documented API you can use directly. Check for a developer portal or API docs. This is the most reliable option.
- Internal API — Open the site in a browser and inspect the network traffic. Many sites load content from internal APIs that return structured JSON. If you can find one, use it. The data will be cleaner and less likely to break when the site updates its design.
- Structured data in the page — Check the <head> for JSON-LD scripts, Open Graph tags, or microformats. Many sites embed structured metadata for SEO that you can extract without parsing the visual layout.
- HTML parsing — If none of the above work, parse the HTML directly.
Whatever data source you use, the goal is the same: get structured post data into your pipeline.
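As one example of the structured-data route, here’s a sketch that pulls JSON-LD blocks out of a page using only the standard library. Real-world pages are often messy enough to warrant a tolerant parser like BeautifulSoup instead:

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            try:
                self.blocks.append(json.loads(data))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is common; skip it

def extract_jsonld(html: str) -> list:
    parser = JSONLDExtractor()
    parser.feed(html)
    return parser.blocks
```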
HTML-specific concerns
If you’re resorting to HTML parsing, there are some specific things to watch for.
Extracting media
When parsing HTML for images, check for srcset attributes and extract the largest available size. Many sites serve responsive images, and the srcset will have much better versions than the src attribute alone.
Many sites also use lazy loading, which means the real image URLs aren’t in src or srcset at all — they’re usually in data attributes like data-src, data-srcset, data-lazy-src, etc. Build your image extraction with a hierarchy: prefer srcset and src first, but fall back to data attributes that look like they contain URLs.
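The hierarchy described above could be sketched like this, where `attrs` is a dict of an img tag’s attributes however you parsed them (the exact attribute priority order is my own choice):

```python
def largest_from_srcset(srcset: str) -> str:
    """Pick the candidate with the largest width descriptor (e.g. '1600w')."""
    best_url, best_width = "", -1
    for candidate in srcset.split(","):
        parts = candidate.strip().split()
        if not parts:
            continue
        url, width = parts[0], 0
        if len(parts) > 1 and parts[1].endswith("w"):
            try:
                width = int(parts[1][:-1])
            except ValueError:
                pass
        if width > best_width:
            best_url, best_width = url, width
    return best_url

def best_image_url(attrs: dict) -> str:
    """Prefer srcset, then src, then lazy-loading data attributes."""
    for key in ("srcset", "data-srcset"):
        if attrs.get(key):
            return largest_from_srcset(attrs[key])
    for key in ("src", "data-src", "data-lazy-src"):
        if attrs.get(key, "").startswith("http"):
            return attrs[key]
    return ""
```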
For videos, when available, extract the poster image in addition to the video URL. LightWatch uses poster images when they are provided rather than generating its own, which gives a quicker loading experience than a video with no preview. See the feed optimization guide for exactly how to include these in your RSS output.
Getting larger images
Many sites serve thumbnailed or downsized versions of images, but larger sizes are available at a predictable URL. Often it’s a simple string substitution — changing thumb to full, _small to _large, or w=400 to w=1600 in the URL. It’s worth inspecting a few image URLs from a site to see if there’s a pattern.
Some website platforms have standardized URL schemes for image sizes. If you can identify the platform (Shopify, WordPress, Squarespace, etc.), you can often reverse-engineer the URL to request the largest available version. This is worth building into your pipeline as a per-platform transform.
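A per-platform transform can be as simple as a table of string substitutions. The patterns below are illustrative guesses, not real CDN rules — inspect actual image URLs from your target site first:

```python
# Hypothetical substitution rules, keyed by platform identifier.
TRANSFORMS = {
    "example-cdn": [("thumb", "full"), ("_small", "_large"), ("w=400", "w=1600")],
}

def upsize(url: str, platform: str) -> str:
    """Apply the first matching substitution for this platform, if any."""
    for old, new in TRANSFORMS.get(platform, []):
        if old in url:
            return url.replace(old, new)
    return url
```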
Server etiquette
Let’s go back to that not being a-holes thing. We should be respectful to the source servers. Smart ones let us know what they’re cool with us doing. And for the rest, we can make some default choices that keep us polite citizens.
Conditional requests
Every request should send If-None-Match (ETag) and If-Modified-Since headers if the server provided them on the previous response. If the server returns 304 Not Modified, skip parsing entirely. This reduces load for both you and the server.
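A sketch of a conditional fetch using only the standard library. Here `validators` is a dict you persist between runs, mapping URL to the ETag/Last-Modified values from the previous response (the User-Agent string is a placeholder):

```python
import urllib.error
import urllib.request

def conditional_headers(prev: dict) -> dict:
    """Build request headers from the validators saved last run."""
    headers = {"User-Agent": "my-feed-generator/1.0"}
    if prev.get("etag"):
        headers["If-None-Match"] = prev["etag"]
    if prev.get("last_modified"):
        headers["If-Modified-Since"] = prev["last_modified"]
    return headers

def fetch_if_changed(url: str, validators: dict):
    """Return the body, or None if the server says nothing changed (304)."""
    req = urllib.request.Request(url, headers=conditional_headers(validators.get(url, {})))
    try:
        with urllib.request.urlopen(req) as resp:
            validators[url] = {
                "etag": resp.headers.get("ETag", ""),
                "last_modified": resp.headers.get("Last-Modified", ""),
            }
            return resp.read()
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return None  # unchanged — skip parsing entirely
        raise
```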
Rate limiting
If a response comes back with a 429 Too Many Requests status or a Retry-After header, stop and wait the specified time before trying again. Don’t retry immediately.
If the server sends Cache-Control: max-age=3600, it’s telling you the content won’t change for an hour. Don’t fetch more often than that.
Error handling
If the server returns a 5xx error, back off exponentially. Don’t hammer a server that’s struggling.
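Both behaviors — honoring Retry-After on a 429 and backing off exponentially on 5xx — fit in one small retry loop. This is a sketch, where `fetch` is any callable you supply that returns `(status, headers, body)`:

```python
import time

def polite_fetch(fetch, max_tries: int = 4, base_delay: float = 2.0, sleep=time.sleep):
    """Retry politely: wait out Retry-After on 429, back off on 5xx."""
    delay = base_delay
    status, body = None, None
    for _ in range(max_tries):
        status, headers, body = fetch()
        if status == 429:
            # The server told us how long to wait; fall back to our delay.
            sleep(float(headers.get("Retry-After", delay)))
        elif 500 <= status < 600:
            sleep(delay)
            delay *= 2  # exponential backoff — don't hammer a struggling server
        else:
            return status, body
    return status, None
```

Injecting `sleep` as a parameter keeps the loop testable without actually waiting.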
Scope
If you only need the front page, don’t crawl the entire site. Fetch one page, extract what you need, and stop.
Platform coordination
If you’re generating multiple feeds from the same platform (e.g., several Instagram users or Tumblr blogs), build a shared client for that platform that coordinates rate limiting across all feeds. A full check across 20 feeds shouldn’t fire 20 requests simultaneously — stagger them and share rate limit state so one blocked request pauses the whole batch.
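One way to share that state is a single client object per platform that enforces a minimum gap between requests, no matter which feed triggers them (the interval and class shape are assumptions):

```python
import time

class PlatformClient:
    """Shared throttle: every feed on this platform calls wait_turn()
    before fetching, so requests are staggered rather than simultaneous."""

    def __init__(self, min_interval: float = 5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0

    def wait_turn(self) -> None:
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

A 429 handler could also push `next_allowed` further into the future, pausing the whole batch with one blocked request.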
User-Agent and robots.txt
For personal use, using a browser User-Agent string is reasonable. You’re making the same requests a browser would, just on a schedule. A server that serves content to your browser shouldn’t get to decide that the same request from a script is forbidden.
If you’re building something for redistribution, set an honest User-Agent that identifies your tool, and respect robots.txt. A public crawler should be blockable. See the ethics section above.
Hotlinking images
For personal use, hotlinking is fine. You’re one person making the same requests a browser would.
For redistribution, host the images yourself — if it’s legally appropriate to do so. Many subscribers hotlinking through your feed pushes bandwidth costs onto the source without their consent.
Output format
Now the good part. You’ve got your data. It’s time to build the feeds. Actually, this part is pretty easy. LightWatch is pretty robust at figuring out what content is available. If you put the content in the feed using standard HTML conventions, LightWatch will figure it out.
However, to be more specific, LightWatch supports RSS 2.0, Atom, and JSON Feed. I personally recommend RSS. See the feed optimization guide for the exact properties and media formats that LightWatch supports.
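For a sense of how little is required, here’s a minimal RSS 2.0 writer sketch. The post field names are assumptions, and a real feed would add dates, GUIDs with `isPermaLink`, and the media properties from the feed optimization guide:

```python
from xml.sax.saxutils import escape

def write_rss(title: str, link: str, posts: list[dict]) -> str:
    """Render post dicts as a minimal RSS 2.0 document."""
    items = []
    for p in posts:
        items.append(
            "<item>"
            f"<title>{escape(p['title'])}</title>"
            f"<link>{escape(p['url'])}</link>"
            f"<guid>{escape(p['url'])}</guid>"
            f"<description>{escape(p.get('html', ''))}</description>"
            "</item>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<rss version="2.0"><channel>'
        f"<title>{escape(title)}</title><link>{escape(link)}</link>"
        + "".join(items)
        + "</channel></rss>"
    )
```

Escaping every value is the one non-negotiable part — a single raw ampersand in a title will break the feed.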
Running it
Scheduling
Your system needs to run on a schedule. Some options:
- Cron on a server or VPS.
- GitHub Actions with a scheduled workflow.
- AWS Lambda or Cloudflare Workers with a scheduled trigger.
- Shortcuts Automation on a Mac that’s always on.
How often to run depends on how often the source updates. For most sources, every 24 hours is perfectly reasonable. You could tighten that to every 4–6 hours for more frequently updating content. If you find yourself tempted into a frequency closer to 1 hour, consider whether it’s necessary. Sources that update that often are usually better sampled than captured exhaustively anyway.
Hosting
Once your script generates the RSS files, they need to be somewhere LightWatch can fetch them:
- GitHub Pages — Push your generated XML files to a public repo with Pages enabled. Free, reliable, and easy to automate with GitHub Actions.
- S3 or Cloudflare R2 — Upload the files to an object storage bucket configured for public access.
- Any static hosting — Netlify, Vercel, or similar. Anywhere that can serve a static XML file over HTTPS.
- A local server on your network — Run a simple HTTP server on a device at home and point LightWatch at it using a .local address. This works well if you have a Mac or Raspberry Pi that’s always on. The trade-off is that LightWatch can only check the feed when you’re on the same network.
Go forth and build
Let’s be honest, this is the perfect project for a coding agent. It can yield great personal returns, but it’s complicated and low priority. Once it’s up and running, you can even have the agent generate your scraping configs. Give your agent this URL and get it going! Or build it yourself if that’s your thing. Either way, don’t @ me about it.
If there’s one final thing I can leave you with, it’s this: you can make anything a feed. LightWatch is a visual feed watcher. You can follow a lot of different kinds of content this way, and building your own feed generator is the key to unlocking that. I have a feed called Cool Beans that is just new releases from artisanal dry bean brands. You could make a feed from a SaaS company’s about page to see when they add employees. The weirder and more niche your ideas, the more personally satisfying the results will be. Have fun.