A competitive price-intelligence system has a chicken-and-egg problem before you write a line of the interesting code: it needs a market to watch. Real competitor prices mean scraping — a legally fraught, anti-bot arms race that proves nothing about my engineering. So I drew a hard line: the acquisition is synthetic and narrated; the intelligence — matching, uncertainty, anomaly detection — actually runs.
1. Synthetic data that behaves like a real market
The naive version of fake data is a random walk — prices jittering every second. It looks alive and means nothing. Real grocery prices are the opposite: sticky. A price holds for weeks, then moves once, for a reason — a competitor repositioning, a cost pass-through, a seasonal event. Most of the time, nothing meaningful happens.
So the generator models that: long-stable prices with rare, event-driven moves. And the competitor side is messy on purpose, because that's the real problem:
- titles are noisy and inconsistent: "Doritos Tex Mex 150g envio gratis", "yogur natural 4 uds danone";
- some carry an EAN/barcode, many don't;
- coverage is sparse and ragged — not every competitor stocks every product;
- and occasionally a product's title mutates under a stable URL, simulating a site quietly swapping what sits behind a link.
The generator publishes all of this to an in-process event bus exactly the way a real crawler would. That's the whole point of the seam:
Swap the synthetic generator for real scraper workers and nothing downstream changes — they publish to the same subject. The fake part is a stand-in shaped like production, not a shortcut around it.
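To make the seam concrete, here is a minimal sketch of a sticky-price generator publishing through an interface. Every name in it (`Observation`, `Publisher`, `MemoryBus`, `StickyPrice`, `Simulate`, the subject `prices.observed`) is a hypothetical stand-in for illustration, not the project's actual types, and the move model is deliberately cruder than "event-driven": a small per-tick probability of one bounded move.

```go
package main

import "math/rand"

// Observation is a hypothetical event shape for one scraped (or simulated) price.
type Observation struct {
	Competitor string
	Title      string
	PriceCents int64
	AtMs       int64 // sim-time of the observation
}

// Publisher is the seam. The generator below and a real scraper worker
// would both publish through it, so consumers cannot tell them apart.
type Publisher interface {
	Publish(subject string, obs Observation)
}

// MemoryBus is a toy in-process bus that only records what it receives.
type MemoryBus struct{ Events []Observation }

func (b *MemoryBus) Publish(subject string, obs Observation) {
	b.Events = append(b.Events, obs)
}

// StickyPrice holds still until a rare event moves it.
type StickyPrice struct {
	Cents    int64
	MoveProb float64 // chance of an event per tick (small for a sticky market)
	MaxPct   float64 // largest single move, as a fraction of the price
}

// Tick advances one step and reports whether the price actually changed.
func (p *StickyPrice) Tick(rng *rand.Rand) bool {
	if rng.Float64() >= p.MoveProb {
		return false // the common case: nothing happens
	}
	pct := (rng.Float64()*2 - 1) * p.MaxPct // uniform in [-MaxPct, +MaxPct)
	delta := int64(float64(p.Cents) * pct)
	p.Cents += delta
	return delta != 0
}

// NewRNG returns a seeded generator so runs are reproducible.
func NewRNG(seed int64) *rand.Rand { return rand.New(rand.NewSource(seed)) }

// Simulate publishes one observation per tick and returns the number of moves.
func Simulate(pub Publisher, rng *rand.Rand, p *StickyPrice, ticks int, tickMs int64) int {
	moves := 0
	for i := 0; i < ticks; i++ {
		if p.Tick(rng) {
			moves++
		}
		pub.Publish("prices.observed", Observation{
			Competitor: "rival-a",
			Title:      "Doritos Tex Mex 150g envio gratis",
			PriceCents: p.Cents,
			AtMs:       int64(i) * tickMs,
		})
	}
	return moves
}
```

With `MoveProb` near zero, almost every published observation repeats the previous price, which is the behaviour the downstream code has to cope with. A real scraper would implement nothing more than a caller of `Publish`.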
2. A clock I can fast-forward
A sticky market creates a tension: if nothing moves for weeks, a visitor lands on the page and sees… nothing. A demo of stillness is a demo of nothing.
The fix isn't to make the data lie — it's to move time. The whole system runs on a simulated clock that compresses real seconds into sim-seconds by a configurable factor:
// At factor 120, a sim-day passes in ~12 real minutes, so a market that
// moves "a few times a week" actually moves while you watch. Set the factor
// to 1 and the exact same code is a real-time scraper — the acceleration is
// a knob, not a fiction baked into the logic.
func (c *Clock) NowMs() int64 {
	real := time.Since(c.bootReal)
	return c.bootSimMs + int64(real.Seconds()*c.factor*1000)
}
Two details I'm glad I got right. On boot it seeds ~90 sim-days of backdated history, so there's no cold start — and those historical rows persist and get matched but never fire alerts, so a restart doesn't flood the feed with stale "news." And on a returning boot it resumes the sim clock from the last stored observation, so ages and freshness stay coherent across restarts. The UI only ever shows relative time, so sim-time can climb forever without anything looking wrong.
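A sketch of how those two boot behaviours could hang together, under stated assumptions: the field names mirror the snippet above, but `simElapsedMs`, `BackfillWindow`, `ShouldAlert`, and `ResumeFrom` are hypothetical helpers invented here, and the real system may gate alerts differently.

```go
package main

import "time"

const dayMs = 24 * 60 * 60 * 1000

// Clock maps real elapsed time onto accelerated sim-time.
type Clock struct {
	bootReal  time.Time
	bootSimMs int64
	factor    float64
}

// simElapsedMs converts real seconds to sim-milliseconds at a given factor.
func simElapsedMs(realSeconds, factor float64) int64 {
	return int64(realSeconds * factor * 1000)
}

// NewClock starts sim-time at bootSimMs, anchored to the current real time.
func NewClock(bootSimMs int64, factor float64) *Clock {
	return &Clock{bootReal: time.Now(), bootSimMs: bootSimMs, factor: factor}
}

// ResumeFrom restarts the clock at the last stored observation, so ages
// computed against older rows stay coherent across restarts.
func ResumeFrom(lastObservedMs int64, factor float64) *Clock {
	return NewClock(lastObservedMs, factor)
}

// NowMs returns the current sim-time in milliseconds.
func (c *Clock) NowMs() int64 {
	return c.bootSimMs + simElapsedMs(time.Since(c.bootReal).Seconds(), c.factor)
}

// BackfillWindow returns the sim-time range [from, to) that seeded history
// covers on a cold boot, ending where live sim-time begins.
func BackfillWindow(liveFromMs, days int64) (from, to int64) {
	return liveFromMs - days*dayMs, liveFromMs
}

// ShouldAlert lets backdated rows persist and match without firing alerts:
// only observations at or after the live boundary are eligible.
func ShouldAlert(obsAtMs, liveFromMs int64) bool {
	return obsAtMs >= liveFromMs
}
```

The arithmetic in the comment checks out with these helpers: 720 real seconds at factor 120 is exactly one sim-day (`simElapsedMs(720, 120) == dayMs`).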
3. A vector database, in-process, as a rebuildable projection
The load-bearing problem is entity resolution: is this competitor's messy title the same product as one of ours — and the same size? Get it wrong and every number downstream is a lie.
I used vecdb from the toolbelt library — a pure-Go, in-process vector index. No Pinecone, no sqlite-vec, no network hop. On boot I embed our whole catalog into a flat cosine index, and treat it as a rebuildable projection: it's a pure function of the catalog, so I never persist it — I rebuild on startup and let it grow when a product is promoted. That fits the system's CQRS spine, where an append-only event log is the only source of truth and everything else is derived.
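I won't guess at vecdb's actual API here. The sketch below is a hand-rolled flat cosine index that illustrates the same idea: brute-force search over unit-length vectors, rebuilt from the catalog on startup and appended to when a product is promoted. `FlatIndex`, `Product`, `Hit`, and `Rebuild` are hypothetical names, and the embedding function is passed in as a parameter.

```go
package main

import (
	"math"
	"sort"
)

// Product is a hypothetical catalog row.
type Product struct{ ID, Title string }

// Hit is one search result.
type Hit struct {
	ID         string
	Similarity float64
}

// FlatIndex keeps every vector in memory and scans them all on each query.
type FlatIndex struct {
	ids  []string
	vecs [][]float64 // stored unit-length, so a dot product is the cosine
}

// unit returns a normalized copy of v (all zeros if v has no magnitude).
func unit(v []float64) []float64 {
	var sum float64
	for _, x := range v {
		sum += x * x
	}
	out := make([]float64, len(v))
	if sum == 0 {
		return out
	}
	n := math.Sqrt(sum)
	for i, x := range v {
		out[i] = x / n
	}
	return out
}

// Add appends one vector; this is how the index grows on promotion.
func (ix *FlatIndex) Add(id string, v []float64) {
	ix.ids = append(ix.ids, id)
	ix.vecs = append(ix.vecs, unit(v))
}

// Search returns the k nearest entries by cosine similarity, best first.
func (ix *FlatIndex) Search(k int, q []float64) []Hit {
	q = unit(q)
	hits := make([]Hit, 0, len(ix.ids))
	for i, v := range ix.vecs {
		var dot float64
		for j := 0; j < len(v) && j < len(q); j++ {
			dot += v[j] * q[j]
		}
		hits = append(hits, Hit{ID: ix.ids[i], Similarity: dot})
	}
	sort.Slice(hits, func(a, b int) bool { return hits[a].Similarity > hits[b].Similarity })
	if k < 0 {
		k = 0
	}
	if k < len(hits) {
		hits = hits[:k]
	}
	return hits
}

// Rebuild derives the whole index from the catalog. Nothing is persisted:
// the same catalog and embed function always produce the same index.
func Rebuild(catalog []Product, embed func(string) []float64) *FlatIndex {
	ix := &FlatIndex{}
	for _, p := range catalog {
		ix.Add(p.ID, embed(p.Title))
	}
	return ix
}
```

A linear scan is O(n) per query, which is usually acceptable for a catalog that fits in memory; that trade-off is what buys the absence of any persistence or network dependency.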
Matching is tiered, and the tiers exist to express confidence:
if ean != "" { /* deterministic exact-join — trust it */ }
hits := index.Search(3, embed(title)...) // nearest products by cosine
conf := confirm(hits[0].Similarity, parsed) // adjust by size + brand agreement
switch {
case conf >= 0.80: autoAccept() // price flows onto the board
case conf >= 0.58: sendToReview() // a human decides
default: markAsWhitespace() // a gap, maybe an opportunity — not a guess
}
The confirm step is where the size trap dies: the embeddings for "Coca-Cola 2L" and "Coca-Cola 500ml" can look near-identical, so raw similarity is boosted or slashed by hard attributes parsed from the title. Uncertain matches don't silently pollute the board: they land in a review queue, and a drifted title flags its mapping as suspect instead of comparing two different things forever.
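As an illustration of that adjustment, here is one way a confirm step could be written. The 0.80 and 0.58 gates come from the snippet above; the boost and penalty sizes (+0.10, -0.40, +0.05, -0.25) and the names `Parsed`, `Confirm`, and `Route` are assumptions made up for this sketch, not the values the system actually uses.

```go
package main

// Parsed holds the hard attributes extracted from a title.
type Parsed struct {
	Brand string  // empty means unknown
	Size  float64 // in base units (ml or g); 0 means unknown
	Unit  string  // "ml" or "g"
}

const (
	autoAcceptAt = 0.80 // at or above: price flows onto the board
	reviewAt     = 0.58 // at or above: a human decides
)

// Confirm adjusts raw cosine similarity by size and brand agreement.
// Missing attributes are treated as no evidence either way.
func Confirm(sim float64, ours, theirs Parsed) float64 {
	conf := sim
	switch {
	case ours.Size == 0 || theirs.Size == 0:
		// size unknown on one side: leave the score alone
	case ours.Unit == theirs.Unit && ours.Size == theirs.Size:
		conf += 0.10
	default:
		conf -= 0.40 // same words, different pack: almost certainly a different SKU
	}
	switch {
	case ours.Brand == "" || theirs.Brand == "":
		// brand unknown on one side: leave the score alone
	case ours.Brand == theirs.Brand:
		conf += 0.05
	default:
		conf -= 0.25
	}
	if conf < 0 {
		return 0
	}
	if conf > 1 {
		return 1
	}
	return conf
}

// Route maps a confidence onto the three tiers.
func Route(conf float64) string {
	switch {
	case conf >= autoAcceptAt:
		return "accept"
	case conf >= reviewAt:
		return "review"
	default:
		return "whitespace"
	}
}
```

Under these assumed weights, a 2000 ml versus 500 ml mismatch drags a raw similarity of 0.85 down to about 0.50, below the review gate, while an exact size and brand agreement lifts 0.75 to about 0.90.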
4. The alert brain — what "signal over noise" actually means
A wall of N×M prices drives zero decisions. The product is the noticing, and that runs today as a deterministic engine over the price stream. It distinguishes a few kinds of extraordinary:
- Leadership lost — a competitor newly drops below our price on a product (edge-triggered, so a rival that simply stays cheaper doesn't re-alert forever);
- Magnitude — a single move far larger than that product's own recent behaviour;
- Coordinated cluster — several competitors undercutting us in the same category within a short window, which reads as a repositioning, not noise;
- System alerts — a mapping gone suspect (drift) or a coverage gap. These are deliberately distinct from market alerts: one says "the market moved," the other says "I might be wrong about something."
None of this needs a model — it's the boring, reliable core, and getting it right is what lets everything above it stay trustworthy.
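Two of those rules are small enough to sketch. The detector below is edge-triggered in the sense described above: it fires on the transition to "cheaper than us", not on the state. The magnitude check compares a move against a multiple of the product's own mean absolute recent move. `LeadershipDetector`, `IsMagnitude`, and the `k` and `floorPct` parameters are hypothetical, and the actual engine may define "recent behaviour" differently.

```go
package main

import "math"

// LeadershipDetector remembers, per competitor and product, whether the
// rival was already below our price at the last observation.
type LeadershipDetector struct{ below map[string]bool }

func NewLeadershipDetector() *LeadershipDetector {
	return &LeadershipDetector{below: make(map[string]bool)}
}

// Observe returns true only when the competitor newly drops below our price.
// A rival that simply stays cheaper does not re-alert.
func (d *LeadershipDetector) Observe(competitor, product string, theirCents, ourCents int64) bool {
	key := competitor + "|" + product
	now := theirCents < ourCents
	was := d.below[key]
	d.below[key] = now
	return now && !was
}

// IsMagnitude reports whether movePct is far larger than the product's own
// recent moves: more than k times their mean absolute size, and never less
// than floorPct (which also covers products with no history yet).
func IsMagnitude(recentPct []float64, movePct, k, floorPct float64) bool {
	threshold := floorPct
	if len(recentPct) > 0 {
		var sum float64
		for _, m := range recentPct {
			sum += math.Abs(m)
		}
		if t := k * sum / float64(len(recentPct)); t > threshold {
			threshold = t
		}
	}
	return math.Abs(movePct) > threshold
}
```

Because the detector stores the last state rather than the last alert, a rival that climbs back above our price and later undercuts again produces a second alert, which seems like the desired behaviour for a repositioning signal.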
5. The AI layer 🚧 — under construction
🚧 This is the part I'm actively building. Everything above runs today. The two "AI" surfaces currently run on local, deterministic stand-ins behind clean interfaces — and that's deliberate: I wanted the pipeline, the uncertainty handling, and the projections correct first, with the models as swappable leaves I drop in without touching anything else.
Where it stands and where it's going:
Matching — embeddings. Today's embeddings are lexical (feature-hashing over tokens and character n-grams). They're genuinely good at brand/size/spelling overlap and keep the binary self-contained, but they only match shared tokens, not meaning — "refresco de cola" won't pull "Coca-Cola" the way a semantic model would. Next: swap in a small neural embedding model behind the same interface; the vector index, confidence gating, and review queue don't change.
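For readers unfamiliar with feature hashing, a minimal sketch of a lexical embedding of this kind follows. The dimension (1024), the choice of FNV-1a, the character trigrams, and the `tok:`/`tri:` prefixes are all assumptions for illustration; the real embedder's parameters aren't stated here.

```go
package main

import (
	"hash/fnv"
	"math"
	"strings"
)

const dim = 1024 // assumed vector size; collisions get rarer as it grows

// bucket hashes a feature string to a slot in the vector.
func bucket(feature string) int {
	h := fnv.New32a()
	h.Write([]byte(feature))
	return int(h.Sum32() % dim)
}

// Embed builds a unit-length vector from word tokens and character trigrams.
// Two titles are similar only to the extent that they share these features.
func Embed(title string) []float64 {
	v := make([]float64, dim)
	norm := strings.Join(strings.Fields(strings.ToLower(title)), " ")
	for _, tok := range strings.Fields(norm) {
		v[bucket("tok:"+tok)]++
	}
	r := []rune(norm)
	for i := 0; i+3 <= len(r); i++ {
		v[bucket("tri:"+string(r[i:i+3]))]++
	}
	var sum float64
	for _, x := range v {
		sum += x * x
	}
	if sum > 0 {
		n := math.Sqrt(sum)
		for i := range v {
			v[i] /= n
		}
	}
	return v
}

// Cosine is a plain dot product because Embed returns unit vectors.
func Cosine(a, b []float64) float64 {
	var dot float64
	for i := 0; i < len(a) && i < len(b); i++ {
		dot += a[i] * b[i]
	}
	return dot
}
```

The limitation described above falls straight out of the construction: a title that names the same product in entirely different words shares few or no features, so its cosine stays near zero no matter how close the meaning is.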
Normalization — a fine-tuned small model. Turning a messy title into {brand, variant, size, unit} is, today, regex plus a brand vocabulary. It's the right kind of task to fine-tune a small model on — not to call a big LLM for on every item:
- it's high-volume (every competitor item, every collection run) and bounded (a fixed output shape), so a small local model is cheap, fast, and has no per-call cost;
- it keeps competitor data out of a third party's hands;
- and the training data is easy to distill: use a capable LLM once to label a few thousand titles into structured fields, then fine-tune the small model on that. The big model is the teacher; the small model does the shift work.
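For reference, the regex-plus-vocabulary baseline that such a model would replace might look roughly like this. It is a sketch under assumptions: the `Fields` struct, the three-entry brand list, the unit table, and the noise phrases are invented for illustration. Its output shape is also the shape a teacher LLM would be asked to label titles into.

```go
package main

import (
	"regexp"
	"strconv"
	"strings"
)

// Fields is the fixed output shape: {brand, variant, size, unit}.
type Fields struct {
	Brand   string
	Variant string
	Size    float64 // converted to base units
	Unit    string  // "ml" or "g"; empty if no size was found
}

// sizeRe matches a number followed by a known unit, e.g. "150g" or "1,5 l".
var sizeRe = regexp.MustCompile(`(\d+(?:[.,]\d+)?)\s*(ml|cl|l|kg|g)\b`)

// units converts each recognized unit into a base unit.
var units = map[string]struct {
	factor float64
	base   string
}{
	"ml": {1, "ml"}, "cl": {10, "ml"}, "l": {1000, "ml"},
	"g": {1, "g"}, "kg": {1000, "g"},
}

// brands and noise are tiny hypothetical vocabularies.
var brands = []string{"coca-cola", "doritos", "danone"}
var noise = []string{"envio gratis"}

// Normalize extracts size and brand, then treats what is left as the variant.
func Normalize(title string) Fields {
	t := strings.ToLower(title)
	var f Fields
	if m := sizeRe.FindStringSubmatchIndex(t); m != nil {
		num := strings.Replace(t[m[2]:m[3]], ",", ".", 1)
		if n, err := strconv.ParseFloat(num, 64); err == nil {
			u := units[t[m[4]:m[5]]]
			f.Size, f.Unit = n*u.factor, u.base
		}
		t = t[:m[0]] + " " + t[m[1]:] // cut the size out of the title
	}
	for _, b := range brands {
		if strings.Contains(t, b) {
			f.Brand = b
			t = strings.Replace(t, b, " ", 1)
			break
		}
	}
	for _, n := range noise {
		t = strings.ReplaceAll(t, n, " ")
	}
	f.Variant = strings.Join(strings.Fields(t), " ")
	return f
}
```

The weaknesses are easy to see: pack counts like "4 uds" are left in the variant, unknown brands are missed entirely, and every new noise phrase needs a new rule. Those are the gaps a fine-tuned model is meant to close.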
Briefing — narration, not invention. The home briefing already computes every figure deterministically in SQL; a templated narrator turns them into prose today. The model upgrade is to have a capable LLM narrate those facts — strictly forbidden from inventing a number — and then the real frontier: a decision loop that turns the briefing into a recommendation with a rationale and a guardrail. The "so what / now what," not just "here's what happened."
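One way to enforce "narrate, don't invent" is to check the narration against the facts after the fact. The sketch below shows a templated narrator and a guard that rejects any numeric token not present in the computed figures. `Facts`, `Narrate`, and `InventedNumbers` are hypothetical, the three fields stand in for whatever the SQL actually computes, and a production guard would also need to handle rounding, units, and numbers spelled out in words.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Facts holds figures computed deterministically upstream (for example in SQL).
type Facts struct {
	Category       string
	Undercutters   int
	LargestDropPct int
}

// Narrate is the templated narrator: prose built only from the facts.
func Narrate(f Facts) string {
	return fmt.Sprintf("%d competitors undercut us in %s; the largest drop was %d%%.",
		f.Undercutters, f.Category, f.LargestDropPct)
}

// Numbers lists every numeric token a narration is allowed to contain.
func (f Facts) Numbers() map[string]bool {
	return map[string]bool{
		strconv.Itoa(f.Undercutters):   true,
		strconv.Itoa(f.LargestDropPct): true,
	}
}

var numberRe = regexp.MustCompile(`\d+(?:[.,]\d+)?`)

// InventedNumbers returns the numeric tokens in text that do not appear in
// the facts. A non-empty result means the narration should be rejected.
func InventedNumbers(text string, f Facts) []string {
	allowed := f.Numbers()
	var invented []string
	for _, n := range numberRe.FindAllString(text, -1) {
		if !allowed[n] {
			invented = append(invented, n)
		}
	}
	return invented
}
```

The same guard works unchanged whether the text comes from the template or from a model, so a rejected model narration can simply fall back to the templated sentence.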
The reason it's structured this way is the point I most want to make: the seams were designed before the models exist, so "add the AI" is a swap, not a rewrite.
The result runs as a single Go binary — embedded bus, embedded database, server-rendered UI streaming live updates — deployed by copying one file to a VPS. It's live at pi.manulobato.com, resets nightly, and you're welcome to poke the review queue and break things.