// instruments · laserbrain judge

the judge

A model can't always be benchmarked, but its output can be judged. Paste a task and a response — or two, to compare — pick what you're grading on, and a hosted model returns a scored verdict. It's a rubric run on a model, the way alice is a persona run on a model.
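As a rough illustration of "a rubric run on a model", here is a minimal sketch of how a task, one or two responses, and a list of criteria might be folded into a single judge prompt. The function name `build_judge_prompt`, the prompt wording, and the 1-to-10 scale are all assumptions made for the example, not the prompt this page actually sends.

```python
from typing import List, Optional


def build_judge_prompt(
    task: str,
    response_a: str,
    criteria: List[str],
    response_b: Optional[str] = None,
) -> str:
    """Assemble a rubric prompt for a judge model (hypothetical format)."""
    lines = [
        "You are grading a response to a task.",
        "",
        "TASK:",
        task,
        "",
        "RESPONSE A:",
        response_a,
    ]
    # A second response is optional: include it only in compare mode.
    if response_b is not None:
        lines += ["", "RESPONSE B:", response_b]
    # The 1-to-10 scale is an assumption for this sketch.
    lines += ["", "Score from 1 to 10 on each criterion:"]
    lines += [f"- {criterion}" for criterion in criteria]
    return "\n".join(lines)


single = build_judge_prompt("Add 2 and 2.", "4", ["correctness"])
compare = build_judge_prompt("Add 2 and 2.", "4", ["correctness"], "5")
```

The resulting string would then be sent to whichever hosted model does the judging; that call is left out here because its details are not described on this page.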

mode

// the task they answered

// response A

// response B

// judge on

// what this is, and isn't

This is one hosted model's scored opinion (Meta's llama-4-scout on Cloudflare), not a benchmark and not the last word. LLM-as-judge is a real technique, but a noisy one: judges can be swayed by a response's length, its position in the prompt, and its confident tone, so this one is told to watch for those biases. Pairwise (A vs B) is steadier than scoring one response in a vacuum. Treat the result as a fast second read, not a ruling. The laserbrain oscillator field isn't doing the judging; a language model is, the same honest split as alice.
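One common way to check for order bias in a pairwise judge is to ask twice with the responses swapped and keep the pick only when both runs agree. The sketch below shows that idea with stand-in judges; it is not a description of what this page does internally. The `Judge` signature and the names `pairwise_both_orders`, `always_first`, and `prefers_longer` are hypothetical.

```python
from typing import Callable

# Hypothetical judge signature: (task, first, second) -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def pairwise_both_orders(judge: Judge, task: str, a: str, b: str) -> str:
    """Run the judge in both orders; return "A", "B", "tie", or "inconsistent"."""
    forward = judge(task, a, b)   # A is shown first
    backward = judge(task, b, a)  # B is shown first
    # Translate positional answers back to the A/B labels.
    forward_pick = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_pick = {"first": "B", "second": "A"}.get(backward, "tie")
    if forward_pick == backward_pick:
        return forward_pick
    # The pick flipped when the order flipped, so position likely drove it.
    return "inconsistent"


def always_first(task: str, first: str, second: str) -> str:
    """Stand-in judge with pure position bias."""
    return "first"


def prefers_longer(task: str, first: str, second: str) -> str:
    """Stand-in judge with pure length bias (order-independent)."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


biased = pairwise_both_orders(always_first, "t", "x", "y")
stable = pairwise_both_orders(prefers_longer, "t", "short", "much longer")
```

Note that the swap only catches position bias: the length-biased stand-in passes the check while still being wrong for its own reason, which is part of why a single judge's score is a second read and not a measurement.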