Building a Custom Bluesky Feed, Part 3: Polish and Ship

This is Part 3 of 3. Part 1 covers how the feed generator works and the initial strategy to collect possibly related posts. Part 2 covers the labeling tooling, ML classifier, and storage design.

We started with a feed that served the occasional triathlon post. We improved it a lot by labelling data and teaching the feed generator what is actually related to triathlon. But up to that point, the ordering was still purely chronological.

For a niche topic like triathlon, that's a problem: not all recent posts are quality posts, and sometimes it is nice to boost posts that probably bring more substantial content to the community. That is what we meant when we said that being recent doesn't mean being relevant.

A Reddit-style ranker

I ended up with a weighted scoring function combining age and engagement — inspired loosely by how Reddit ranks posts. Recency gets the highest weight. Replies get a small penalty since they're usually less useful out of context. The ranker also respects the Accept-Language header, so posts in languages the user prefers get a small boost.

ageScore := 1 - age/maxAge
engagementScore := math.Log1p(engagement) / math.Log1p(maxEngagement)

score := weightedScores{
	{ageScore, weightAge},
	{engagementScore, weightEngagement},
	{languageScore, weightLanguage},
	{replyScore, weightReply},
}.Value()

The log scale for engagement matters: a post with 100 likes and a post with 10 shouldn't be 10× apart in score — that way popular posts surface without drowning out everything else forever.

The problem was that all of this was hardcoded, and fine-tuning these weights was not practical at all.

Everything in a config file

Up to this point, the filter rules — account DIDs, regex patterns, exclusions — were in Go code. The ranker weights were constants. I decided to extract all of it to a TOML file:

[feed]
did  = "did:plc:3272gdrjsuikiff7qsgokgas"
rkey = "aaahlw3uvkhgq"

[filter]
trusted_accounts = [
  "did:plc:bdg6sni7k7gq7hrgck6h3aky", # triathlete
  "did:plc:leidqgx3be72rmeiwvdzvnes", # world triathlon
  "did:plc:qcbkud2rb5mp3petgcof47ps", # challenge family
]
patterns = [
  '(?i)\btriathlon\b',
  '(?i)\b70\.3\b',
  '(?i)\biron ?man\b',
  # ...
]
exclude = ["douglas0bd0-20", "VesselAlert"]

[ranker]
disabled          = false
candidates        = 400
cutoff_days       = 7
weight_age        = 0.5
weight_engagement = 0.4
weight_language   = 0.1
weight_is_reply   = -0.1
engagement_like   = 1.0
engagement_reply  = 2.0
engagement_repost = 3.0
engagement_quote  = 4.0

This was the step that made the whole thing feel genuinely reusable. Want a cycling feed? Write a cycling TOML file. Want no ranker, just pure chronological? Set disabled = true. A small tweak to the architecture and the same binary can serve multiple feeds — one process, multiple config files.

Going live

Here's my favorite part: remember that old SkyFeed triathlon feed from 2023? It has a record on atproto. Even though the record itself was created by SkyFeed, it was created on my PDS, so it's actually mine.

It was pointing to SkyFeed's servers, but since on atproto I have control over my data, I just updated that record to point to the PC on my shelf.

The twenty-odd followers of that feed were now being served by a completely different backend, seamlessly. And it just… worked. None of them commented or complained or unfollowed, which is not necessarily a good thing, but it is still something. I find the content better, but I am obviously biased.

The whole process described in parts 1, 2 and 3 (this one) took a few hours of coding spread over two months — from that first rough start on a cold Ottawa winter Sunday, to going live with a trained ML model, a configurable ranker, and multi-feed support. And it is still running from a PC at home — although now I run a three-node FoundationDB cluster. I've learned something, eh?

The infra

To be clear, this old computer is a 2018 Intel Core i5 with 6 cores, 16GB of DDR4, and a 256GB SSD: a good old machine. I share these resources among a couple of pet projects, so I use Proxmox to set up a few virtual machines dedicated to the custom feed.

The feed

The feed is the process that consumes the firehose, receives HTTP requests, and serves the post list. It runs in a single VM with:

  • 1 CPU
  • 1GB of RAM
  • 16GB allocated for storage

CPU usage is always under 15% and memory is around 80-90%; storage is insignificant, less than 7GB including the OS (a Debian-based server).

FoundationDB cluster

For the database, I use three VMs with:

  • 1 CPU each
  • 4GB of RAM each
  • 16GB allocated for storage each

Again, this is super overengineered, just because I wanted to play with FoundationDB. It would be just fine with a simple SQLite database.

Anyway, CPU is always under 5%, and memory hardly ever goes over 50% — but 4GB is the minimum required by FoundationDB, so that is what I allocated. Storage is super low too, around 7GB on each node.

Usage

Unfortunately I have not set up good monitoring tools, but from the logs I have learned that this project:

  • saves roughly 200 posts per day
  • serves around 260 HTTP requests per day

The HTTP server numbers do not include requests answered from the cache at the edge (outside my application).

The database now has almost 7,000 posts; 60% of them have been manually labelled, and less than 30% of the total are related to triathlon. The CSV dump with the full data is less than 4MB (or under 1MB compressed):

$ tri-bsky-feed stats
Total posts: 6714
Labeled    : 4011 (59.7%)
Related    : 1778 (26.5%)
Unrelated  : 2233 (33.3%)
{"time":"2026-03-12T11:33:14.020436-04:00","level":"INFO","msg":"stats","elapsed":"47.665137ms"}

$ tri-bsky-feed export data.csv
{"time":"2026-03-12T11:33:20.127103-04:00","level":"INFO","msg":"exported","posts":6714,"file":"data.csv","elapsed":"219.552723ms"}

$ ls -laGh data.csv
-rw-r--r--@ 1 cuducos  staff   3.9M Mar 12 11:33 data.csv

$ xz data.csv

$ ls -laGh data.csv.xz
-rw-r--r--@ 1 cuducos  staff   984K Mar 12 11:33 data.csv.xz

You can find everything triathlon-related going on on Bluesky at the Triathlon feed. And, of course, the code is open source: change a TOML file and you have a different feed, about anything, for any community. Way to go!