Building a Custom Bluesky Feed, Part 2: Iterating Over It

This is Part 2 of 3. Part 1 covers how the feed generator works and the initial strategy to collect possibly related posts.

The regex filter I built in Part 1 has a problem it can't solve by itself: \biron ?man\b matches posts about Ironman, the swim-bike-run race, and posts about Tony Stark's hero persona with equal enthusiasm. The same goes for T100, which could be the professional triathlon series or a fancy camera. Regex has no context; it can't tell the difference.

A possible fix for that is a classifier — something that understands the content well enough to know that a Marvel fan is not (necessarily) my target audience. But classifiers need labeled data. And labeled data means I needed to sit down and tell the computer, one post at a time, which ones were actually triathlon-related.

A schema with two jobs

Before labeling anything, I needed a proper data model. Each post needs to carry two different kinds of information: its content (for labeling and ML training) and its relevance state (for feed serving). The Post struct does both:

type Post struct {
	URI            string   `json:"uri"`
	AuthorDID      string   `json:"author_did"`
	Text           *string  `json:"text"`
	// ... plus other fields with the post content

	IsRelated      *bool    `json:"isRelated"`      // label added manually, by a human
	RelatedScore   *float32 `json:"relatedScore"`   // ML prediction
}

To reuse the existing schema I had in place, I decided to extend it in two ways:

  1. The existing storage now includes these extra fields for human-added labels and the classifier score.
  2. It gains an index of relevant post URIs — that's what the feed skeleton endpoint reads.

No runtime scanning, no per-request classification or sorting: just a sequence of AT URIs that were already classified and ordered, ready to be served to users.
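With the heavy lifting done at ingestion, serving a page of the feed reduces to slicing that precomputed index. The sketch below is illustrative — the names (Skeleton, skeletonPage) and the cursor-as-offset encoding are my assumptions, not the actual implementation — but the response shape (a cursor plus a list of post URIs) is what getFeedSkeleton expects.

```go
package main

import "fmt"

// SkeletonItem mirrors the shape Bluesky expects in getFeedSkeleton responses.
type SkeletonItem struct {
	Post string `json:"post"` // AT URI
}

type Skeleton struct {
	Cursor string         `json:"cursor,omitempty"`
	Feed   []SkeletonItem `json:"feed"`
}

// skeletonPage serves one page of an already-classified, already-ordered URI
// index. The cursor is simply the offset of the next page, as a string.
func skeletonPage(index []string, offset, limit int) Skeleton {
	if offset > len(index) {
		offset = len(index)
	}
	end := offset + limit
	if end > len(index) {
		end = len(index)
	}
	out := Skeleton{}
	for _, uri := range index[offset:end] {
		out.Feed = append(out.Feed, SkeletonItem{Post: uri})
	}
	if end < len(index) {
		out.Cursor = fmt.Sprintf("%d", end)
	}
	return out
}

func main() {
	index := []string{
		"at://did:plc:aaa/app.bsky.feed.post/1",
		"at://did:plc:bbb/app.bsky.feed.post/2",
		"at://did:plc:ccc/app.bsky.feed.post/3",
	}
	page := skeletonPage(index, 0, 2)
	fmt.Println(len(page.Feed), page.Cursor) // 2 2
}
```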

Building the labeling tools

I built a CLI labeler first — a loop that pulled the next unlabeled post, showed it, and moved it to "related" or "unrelated" based on the input. Related? Press y. Not related? Press n. Later on, I added a web UI so I could use it on my phone and share with friends.

That was actually fun. Labeling posts that could be about triathlon, about Iron Man shenanigans, or heavy metal songs became my new go-to doom scrolling activity. After a few weeks I had one thousand posts labeled.

The data loss incident

Around that time I also moved storage to FoundationDB — not because I had to, but because I wanted to try it. Don't judge me, geeks have fun trying different databases, ok?

Turns out running a FoundationDB cluster with a single node on an old home PC is a good way to hit a low quorum issue and lose all your data. Which is exactly what happened.

That's what pushed me to build export and import commands — to back up and restore the entire database. And since I had lost the collected posts, I also built a backfill command, using tap to pull past posts into my database.

From that painful lesson, the workflow became much safer. I can collect posts live from the firehose and past posts with the backfill tool, and I can label them in my terminal or on the go from my phone. On top of that, I can now export all this data and keep it in my backups, just in case.

With some help from friends who were beta testers of the web UI, I had around two thousand labeled posts.

Training the model and integrating it

With that many posts, and labels I could mostly trust, I was ready to use some basic machine learning. A simple random forest classifier trained on those two thousand posts landed at 97% accuracy. Impressive.

Since I used Python for that bit, the classifier was exported to ONNX so it could also be used inside the Go server, without any external runtime.

The classifier does not run at serving time, but at ingestion time. Every post coming through the firehose gets scored immediately before being saved. The classifier only sets RelatedScore; IsRelated is never touched by it, since that field is reserved exclusively for human labels:

func (m *Classifier) ClassifyPost(post *posts.Post) error {
	y, _, err := m.Classify(post.Contents())
	if err != nil {
		return fmt.Errorf("could not classify post: %w", err)
	}
	post.RelatedScore = &y
	return nil
}

When serving the feed, both signals are combined: a post is included if a human labeled it as related, or if it has no human label and its RelatedScore is above the threshold. The code is ugly, but it's something that looks like:

// explicitly labeled as NOT related by a human — skip
if p.IsRelated != nil && !*p.IsRelated {
	continue
}

// no human label and a missing or low ML score — skip
if p.IsRelated == nil && (p.RelatedScore == nil || *p.RelatedScore < relatedThreshold) {
	continue
}

Classification runs only once per post, at ingestion — never on each feed request.

The feed now had high-quality content, served automatically. But there was still one rough edge: the feed was chronological. And triathlon is not the most active topic on Bluesky — long stretches of silence, punctuated by the occasional post that may or may not be all that relevant, even when it is about triathlon.

Being relevant is not just about being recent. But let's save that for our final sprint — stay tuned for Part 3!


You can find everything triathlon-related going on on Bluesky at the Triathlon feed. And, of course, the code is open source. Way to go!