Methodology
Every number on Ref Geek traces to a calculation. This page documents the formulas, sample minimums, and interpretation notes for each stat we publish. Each section has a stable anchor link so you can cite or reference a specific formula directly.
Stats are computed by aggregate refresh jobs that read from event tables (penalties, game assignments, on-ice players). User-facing pages always read from the materialized aggregates; we never query raw event tables on the read path.
Attribution model
Every penalty in a game attributes to both referees assigned to that game. A ref's season stats are computed across every game they worked, regardless of partner. This produces robust per-ref sample sizes (~70 games per active ref per season) and matches how hockey people reason about officials: by name, not by pair.
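A minimal sketch of the attribution rule described above, assuming hypothetical in-memory shapes (a dict of game id to assigned referees, and one game id per penalty). The actual refresh jobs read from the event tables, so treat the names here as illustrative only.

```python
from collections import Counter

def attribute_penalties(assignments, penalty_game_ids):
    """Count calls and games per referee.

    assignments: dict of game_id -> tuple of referee names (hypothetical shape).
    penalty_game_ids: iterable with one game_id per penalty called.
    """
    calls = Counter()
    games = Counter()
    for refs in assignments.values():
        for ref in refs:
            games[ref] += 1
    for game_id in penalty_game_ids:
        # Each penalty counts toward every referee assigned to that game.
        for ref in assignments[game_id]:
            calls[ref] += 1
    return calls, games

assignments = {1: ("Ref A", "Ref B"), 2: ("Ref A", "Ref C")}
calls, games = attribute_penalties(assignments, [1, 1, 2])
```

Here "Ref A" is credited with all three penalties across two games, while each partner is credited only with the penalties from the game they worked.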
When the NHL API or a broadcast specifically identifies which referee made a call, we store that as metadata for incident-level review and AI-writer claims, but it does not change primary stat calculation. Pair-level analysis exists as a small-sample drill-down, never as a primary stat.
Core per-60 rates (spec §2)
Period-rate (P1 / P2 / P3) bars are calls × 3 / games — true per-60 of period time given each regulation period is 20 minutes. Score-state buckets (tied / one-goal / blowout) remain per-game because we don't yet ingest time-in-state granularity.
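The period-rate scaling above can be sketched as a one-line function. The function name and the choice to return None for zero games are assumptions made for illustration.

```python
def period_rate_per_60(period_calls, games):
    """Scale calls in one 20-minute period to a per-60 rate: calls * 3 / games."""
    if games == 0:
        return None  # assumption: no games in scope means no rate
    return period_calls * 3 / games

rate = period_rate_per_60(35, 70)
```

With 35 second-period calls over 70 games, the per-game period count is 0.5, which scales to 1.5 calls per 60 minutes of period time.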
Per-call-type rates (spec §2.5)
descKey values map to canonical codes via penalty_type_nhl_mappings; bench-coded variants like INTERFERENCE_BENCH collapse into the same canonical code as their on-ice counterpart but are preserved in penalty_events.metadata.descKey for downstream stats that need the distinction (Discipline, future PP-creation flag).
The “League pct” column is sample-filtered: refs with fewer than 3 calls of that type (or fewer than 10 games overall) don't contribute to the percentile distribution.
Stripe Score (spec §8.1)
Higher = more conventional / consistent. Tier bands map score ranges to labels (Strong, Stable, Volatile, Outlier) for at-a-glance reading.
Consistency Index (spec §8.2)
Home Cooking Coefficient (spec §4.1 / §8.3)
Leverage Grade (spec §4.9 / §8.4)
Gap Rhythm (spec §8.9)
Discipline Rating (spec §8.6)
penalty_severity. Bench minors are identified by NHL descKey match (BENCH or *_BENCH variants like INTERFERENCE_BENCH, UNSPORTSMANLIKE_CONDUCT_BENCH). Legacy rows without the descKey metadata fall back to a player-null heuristic.
Known limitation: the failed-coach-challenges term is currently stubbed at 0 until coach-challenge ingestion ships. The other two terms are live.
Coach Challenge Record (ref) (spec §6.3)
coach_challenges_total = number of challenges in this ref's games (any type). coach_challenges_upheld + coach_challenges_overturned count only non-league-initiated challenges where outcome has been determined; the sum may be less than total when outcome is still NULL or when a challenge was league-initiated. Reads from coach_challenge_events joined to the ref's assigned games.
Outcome detection runs inline during ingestion via a score-state heuristic (UPHELD if a goal was removed; OVERTURNED if score state held; NULL when the signal isn't clear). Ratings should not over-interpret refs with low total counts; challenges are rare events.
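The counting rules above can be illustrated with a small tally. The record shape (dicts with `league_initiated` and `outcome` keys) is hypothetical; only the counting logic follows the text.

```python
def challenge_record(challenges):
    """Tally a ref's challenge record from a list of challenge dicts.

    Each dict is assumed to carry 'league_initiated' (bool) and
    'outcome' ('UPHELD', 'OVERTURNED', or None when undetermined).
    """
    def decided(outcome):
        # Only non-league-initiated challenges with a known outcome count.
        return sum(
            1 for c in challenges
            if not c["league_initiated"] and c["outcome"] == outcome
        )
    return {
        "coach_challenges_total": len(challenges),  # any type
        "coach_challenges_upheld": decided("UPHELD"),
        "coach_challenges_overturned": decided("OVERTURNED"),
    }

record = challenge_record([
    {"league_initiated": False, "outcome": "UPHELD"},
    {"league_initiated": False, "outcome": "OVERTURNED"},
    {"league_initiated": False, "outcome": None},
    {"league_initiated": True, "outcome": "UPHELD"},
])
```

In this example the total is 4 while upheld plus overturned is 2, which is the gap the text describes for undetermined and league-initiated challenges.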
Make-Up Window Rate (spec §1.3 / §8.8)
Pattern only — never describe as the ref "evening things up." The stat measures how often cross-team penalty pairs cluster, regardless of cause.
First Penalty Timing (spec §1.1)
Gap Between Calls (avg) (spec §1.2)
Penalty Density by Period (spec §1.5)
Late-Period Suppression (spec §1.6)
Power Play Chain Rate (spec §1.11)
penalty_duration_minutes × 60 seconds.
Known limitation: the heuristic infers PP-creation from severity + non-offsetting status. The penalty_events.power_play_resulted column is not yet populated by ingestion, so coincidental majors and similar edge cases may be miscounted. Tightens once power_play_resulted wires up.
First / Final 5-Minute Delta (spec §1.9)
period_time_elapsed_seconds ≤ 300. Closing 5 = P3 with period_time_remaining_seconds ≤ 300. Each window is 5 minutes per game, so the denominator is 5 × games_total. Positive = starts hot, finishes soft. Negative = finishes hot.
OT Call Rate (spec §1.8)
ot_calls / games_with_ot. Per-overtime-game rate, not per-total-game. The ref's overall games count includes every regulation finish, so dividing OT calls by all games understates the true frequency by an order of magnitude. games_with_ot is computed from games.went_to_overtime per ref-game pairing. NULL when the ref has worked no OT games in scope; the period bar chart on the ref profile shows the OT bar empty in that case.
Known limitation: the spec target is calls per 20 minutes of overtime. Regular-season OT is sudden-death 5-min, so most OT periods end short of 5 minutes. Once shift-chart data flows through more reliably we'll switch the denominator to actual OT-minutes-worked × (1 / 20) × 60 for true per-20-min comparability across regular-season and playoff OT.
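A short sketch of the OT call rate as described above, with None standing in for SQL NULL. The sample counts are made up to show why the denominator matters.

```python
def ot_call_rate(ot_calls, games_with_ot):
    """OT calls per overtime game; None when the ref has worked no OT games."""
    if games_with_ot == 0:
        return None
    return ot_calls / games_with_ot

# Illustrative numbers only: 6 OT calls, 12 OT games, 70 games overall.
per_ot_game = ot_call_rate(6, 12)
per_all_games = 6 / 70  # the misleading alternative the text warns about
```

Dividing by all 70 games would report under 0.09 calls per game, while the per-overtime-game rate is 0.5.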
High-TOI vs Low-TOI Call Ratio (spec §7.1)
(high_toi_calls / low_toi_calls) × 0.455. Bucket each penalty by the penalized player's TOI tier classification: STAR + TOP_6 → high_toi_calls; DEPTH + FOURTH_LINE → low_toi_calls. The 0.455 factor is the approximate ratio of average ice time between the two tiers (~10 min/game vs ~22 min/game), so the output is normalized to 1.0 = proportional to ice time. <1.0 means the ref calls fewer penalties on top-tier forwards than expected (star leniency); >1.0 means the inverse. NULL when either bucket has zero calls.
Known limitation: the 0.455 TOI-exposure constant is a fixed approximation. True per-tier ice time varies game to game; a future tightening will use actual TOI distributions from player_toi_tier_classifications. MIDDLE_6 tier is excluded from both numerator and denominator to keep the contrast clean.
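The bucketing and normalization above can be sketched as follows. The input shape (one tier label per penalty) and the function name are assumptions for illustration; the tier names and the 0.455 constant come from the text.

```python
TOI_EXPOSURE = 0.455  # approx. 10 / 22, low-tier vs high-tier minutes per game
HIGH_TIERS = {"STAR", "TOP_6"}
LOW_TIERS = {"DEPTH", "FOURTH_LINE"}

def toi_call_ratio(penalized_tiers):
    """Normalized high-TOI vs low-TOI call ratio; None if either bucket is empty.

    penalized_tiers: one tier label per penalty (hypothetical input shape).
    MIDDLE_6 penalties fall in neither bucket and are ignored.
    """
    high = sum(1 for t in penalized_tiers if t in HIGH_TIERS)
    low = sum(1 for t in penalized_tiers if t in LOW_TIERS)
    if high == 0 or low == 0:
        return None
    return (high / low) * TOI_EXPOSURE

tiers = ["STAR"] * 12 + ["TOP_6"] * 10 + ["DEPTH"] * 6 + ["FOURTH_LINE"] * 4 + ["MIDDLE_6"] * 9
ratio = toi_call_ratio(tiers)
```

With 22 high-tier calls and 10 low-tier calls, the raw ratio of 2.2 normalizes to roughly 1.0, meaning calls are about proportional to ice time.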
Score-State Calling (close vs blowout) (spec §4.6)
tied = score_margin = 0, one-goal = |margin| = 1, blowout = |margin| ≥ 3. Counts come from penalty_events.score_margin (set during ingestion); rates are calls / games_officiated. Same bar pattern as Penalty Density by Period.
Known limitation: these are calls per game, not per time-in-state. A ref's tied-score rate is influenced by how often their games are tied at all. Once scoring-play timestamps land we'll switch to true rate-per-minute-in-state.
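A sketch of the bucket assignment as defined above. The bucket labels are hypothetical strings. Note that under these definitions a two-goal margin matches none of the three buckets, so this sketch returns None for it.

```python
def score_state(score_margin):
    """Map a penalty's score margin to its score-state bucket."""
    margin = abs(score_margin)
    if margin == 0:
        return "tied"
    if margin == 1:
        return "one_goal"
    if margin >= 3:
        return "blowout"
    return None  # |margin| == 2 is outside the three published buckets

def score_state_rate(bucket_calls, games_officiated):
    """Per-game rate for one bucket: calls / games_officiated."""
    return bucket_calls / games_officiated if games_officiated else None
```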
One-Goal Game Suppression (spec §1.7)
1 − (ref_share / league_share) where ref_share = ref's P3 calls with |score_margin| ≤ 1 / ref's P3 calls and the league share is the same ratio computed across every ref in scope (sum-of-sums, so refs with more games weight correctly). Positive = ref calls fewer P3 close-game penalties than the league norm. NULL until both shares populate.
Known limitation: The spec target is a true rate-per-time comparison (P3 calls per 20 min in one-goal games vs overall P3 rate). That requires per-game time-in-state tracking we don't have yet. The share-based proxy is meaningful relative to other refs but doesn't reflect raw exposure ratios. Tightens once scoring-play timestamps land and league_baselines stores time-weighted shares.
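The share-based proxy above works out as in this sketch. Parameter names are illustrative, and returning None when a share cannot be computed is an assumption standing in for the NULL behavior.

```python
def one_goal_suppression(ref_close_p3, ref_p3, league_close_p3, league_p3):
    """1 - (ref_share / league_share), using sum-of-sums league counts."""
    if ref_p3 == 0 or league_p3 == 0 or league_close_p3 == 0:
        return None  # a share is undefined or the ratio would divide by zero
    ref_share = ref_close_p3 / ref_p3
    league_share = league_close_p3 / league_p3
    return 1 - ref_share / league_share

# Illustrative counts: ref share 30/60 = 0.5, league share 600/1000 = 0.6.
suppression = one_goal_suppression(30, 60, 600, 1000)
```

A ref whose close-game share is 0.5 against a league share of 0.6 scores about 0.167, a positive value indicating fewer third-period close-game calls than the league norm.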
Crunch-Time Discipline (player) (spec §7.4)
period ≥ 3 (3rd period or overtime), period_time_remaining_seconds ≤ 300 (last 5 minutes of the period), and |score_margin| ≤ 1 (one-goal game). Drawn penalties don't count; discipline is about avoiding the box, not earning calls. Stored in player_officiating_stats.crunch_time_discipline as a per-game rate (numeric(5,3)). Higher = worse discipline when the result's on the line.
Known limitation: Per-game (not per-60) because we don't track ice time spent in crunch-time situations. A player's rate is partially driven by how often his team plays close games at all. Once time-in-state ingestion lands the rate becomes per-60 with crunch-time TOI as denominator.
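The three crunch-time conditions above translate directly into a predicate. The function names and the tuple shape for penalties taken are hypothetical.

```python
def is_crunch_time(period, period_time_remaining_seconds, score_margin):
    """True when a penalty falls in the crunch-time window defined above."""
    return (
        period >= 3                                 # 3rd period or overtime
        and period_time_remaining_seconds <= 300    # last 5 minutes of the period
        and abs(score_margin) <= 1                  # tied or one-goal game
    )

def crunch_time_rate(penalties_taken, games):
    """Per-game rate of crunch-time penalties taken (drawn penalties excluded upstream).

    penalties_taken: list of (period, seconds_remaining, score_margin) tuples.
    """
    if games == 0:
        return None
    hits = sum(1 for p in penalties_taken if is_crunch_time(*p))
    return hits / games

rate = crunch_time_rate([(3, 240, -1), (4, 100, 0), (2, 100, 0), (3, 100, 2)], 20)
```

Only the first two penalties qualify, so the player's rate over 20 games is 0.1.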
Penalty Differential vs Team (spec §3.1)
penalties_for_team − penalties_against_team, where "for" means penalties called on the team's opponent (giving them a power play) and "against" means penalties called on the team itself. Positive differential means the team gets more PPs than they give up under this ref. Per-game differential normalizes by games worked together.
PP Opportunities Differential (spec §3.2)
pp_opportunities_for_team − pp_opportunities_against_team. Strict subset of Penalty Differential: counts only penalties that actually created a power play (a penalty on the team's opponent for "for", on the team itself for "against"), filtered by the same heuristic as Power Play Chain Rate (severity in MINOR/DOUBLE_MINOR/MAJOR with no offsetting partner). The PP scoreboard view of the same relationship: a +6 PP Diff means the team got 6 more power plays than they gave up under this ref.
Often close to Penalty Differential since most penalties create PPs. The difference shows up most when there's a fight or coincidental call cluster, where penalties offset and don't move the PP scoreboard.
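Both differentials share the same "for minus against" shape, sketched here with made-up counts. The function name and return shape are assumptions.

```python
def differential(for_team, against_team, games_together):
    """Return (raw differential, per-game differential) for a ref-team pairing."""
    raw = for_team - against_team
    per_game = raw / games_together if games_together else None
    return raw, per_game

# Illustrative counts: 40 calls on opponents, 34 on the team, 12 games together.
penalty_diff, penalty_diff_per_game = differential(40, 34, 12)
# PP opportunities count only penalties that created a power play.
pp_diff, _ = differential(33, 29, 12)
```

In this example the penalty differential is +6 (0.5 per game) while the PP differential is +4, the kind of gap that offsetting penalties produce.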
Most Favorable / Unfavorable Refs (spec §3.4 / §3.5)
Framing per spec: "Largest positive / negative differential." Never "biased against," "targets," or "favors." The stat is a pattern; we do not attribute intent.
Home / Road Differential (team) (spec §3.6)
(penalties_against_road / road_games) − (penalties_against_home / home_games) for each team. Positive = team takes more penalties on the road than at home; negative = team takes more at home. Computed in the team-season-officiating refresh from the same per-game stats that populate penalties_against, split by is_home.
Surfaced as the “Home / Road” KPI on the team profile. NULL when the team hasn't played at least one home and one road game in scope.
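A minimal sketch of the home/road formula above, with None standing in for NULL. Parameter names mirror the formula; the sample counts are invented.

```python
def home_road_differential(penalties_against_road, road_games,
                           penalties_against_home, home_games):
    """Road per-game penalties minus home per-game penalties."""
    if road_games == 0 or home_games == 0:
        return None  # needs at least one home and one road game in scope
    return penalties_against_road / road_games - penalties_against_home / home_games

delta = home_road_differential(45, 10, 35, 10)
```

A team taking 4.5 penalties per road game and 3.5 per home game shows a differential of +1.0.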
Team Discipline Index (spec §3.8)
(misconducts + bench_minors + failed_coach_challenges) / games. Per-game composite of self-inflicted disciplinary penalties on this team. Misconducts come from penalty_severity IN (MISCONDUCT, GAME_MISCONDUCT). Bench minors are identified by canonical TOO_MANY_MEN OR descKey-based bench match (BENCH, *_BENCH variants); legacy rows fall back to a player-null heuristic on minor severity.
Known limitation: the failed-coach-challenges term is currently stubbed at 0 until the coach_challenge_events ingestion lands. The other two terms are live.
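The bench-minor match and the composite above can be sketched like this. Function names are hypothetical, and the legacy player-null fallback is deliberately left out.

```python
def is_bench_minor(canonical_code, desc_key):
    """Bench minor by canonical TOO_MANY_MEN or a BENCH / *_BENCH descKey.

    desc_key may be None on legacy rows; the player-null fallback
    described in the text is not modeled in this sketch.
    """
    if canonical_code == "TOO_MANY_MEN":
        return True
    return desc_key is not None and (desc_key == "BENCH" or desc_key.endswith("_BENCH"))

def team_discipline_index(misconducts, bench_minors, games, failed_coach_challenges=0):
    """Per-game composite; the challenge term defaults to 0 while it is stubbed."""
    if games == 0:
        return None
    return (misconducts + bench_minors + failed_coach_challenges) / games

index = team_discipline_index(misconducts=3, bench_minors=5, games=40)
```

Three misconducts and five bench minors over 40 games give an index of 0.2.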
Team Call-Type Profile (spec §3.7)
taken_count (penalties on this team of this type), drawn_count (penalties on the opponent of this type while this team played), and per-game rates taken_per_game = taken_count / games, drawn_per_game = drawn_count / games. League baselines are the AVG of the per-team rates across all teams in scope (per canonical_code). The team profile shows the top types taken and top types drawn, with a vs-league delta column.
Sample sufficiency requires the team to have played at least 5 games AND total count (taken + drawn) of that type ≥ 3. Below threshold the row is still rendered but the league baseline isn't emphasized.
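One row of the call-type profile, including the sufficiency check, might look like the sketch below. The dict keys and the `league_baseline` helper are illustrative names, not the site's schema.

```python
from statistics import mean

def call_type_row(taken_count, drawn_count, games):
    """Per-game rates and sample sufficiency for one team and one call type."""
    return {
        "taken_per_game": taken_count / games if games else None,
        "drawn_per_game": drawn_count / games if games else None,
        # At least 5 games AND at least 3 total calls (taken + drawn) of this type.
        "sample_sufficient": games >= 5 and (taken_count + drawn_count) >= 3,
    }

def league_baseline(per_team_rates):
    """League baseline as the plain average of per-team rates."""
    return mean(per_team_rates)

row = call_type_row(taken_count=8, drawn_count=4, games=16)
baseline = league_baseline([0.5, 0.25, 0.75])
```

Because the baseline is an average of per-team rates, every team weighs equally regardless of how many games it has played.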
Sample minimums
Stats below the sample threshold are still shown but tagged with sample_sufficient = false. We exclude them from leaderboards and from league percentile distributions so a ref with a handful of games doesn't skew the league baseline.
- Per-ref season stats: 10 games.
- Per-(ref, team) pairings: 5 games.
- Per-call-type rates: 3 calls AND ref ≥ 10 games.
- Pair (ref + ref) stats: navigation only; no primary metrics.
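The thresholds listed above can be expressed as a single check. The `kind` labels and the function name are hypothetical; only the numbers come from the list.

```python
# Game minimums from the list above; the keys are illustrative labels.
MIN_GAMES = {"ref_season": 10, "ref_team_pairing": 5}
MIN_CALLS_PER_TYPE = 3

def sample_sufficient(kind, games, calls=None):
    """Return the sample_sufficient flag for a stat of the given kind."""
    if kind == "call_type":
        # Per-call-type rates need 3 calls AND a ref with at least 10 games.
        return (
            calls is not None
            and calls >= MIN_CALLS_PER_TYPE
            and games >= MIN_GAMES["ref_season"]
        )
    return games >= MIN_GAMES[kind]
```

A stat that fails this check would still be displayed, but left out of leaderboards and percentile distributions as described above.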
Framing rules
Every stat description on the site follows three rules:
- Describe a pattern, never imply intent. "Largest negative differential" instead of "biased against."
- Tie controversial labels to a formula. "Phantom Call" only with reviewer confirmation.
- Make small samples explicit. Numbers below threshold ship with a warning, not silently.
AI writer pipeline
Ref Geek's articles — pre-game scouting, post-game recap, weekly column, playoff series preview / review — are produced by a deterministic pipeline that drafts from grounded data, validates the output, and routes for human review at the right altitude. The pipeline below is the same shape across every content type.
- Grounded. Every article is drafted from a structured Data Pack pulled from Ref Geek's aggregate tables: ref season stats, team season officiating, player TOI tier classifications, finalized penalty events. No free-form opinion, no facts not in the pack.
- Validated. Every draft runs through five checks before a publishing decision is made: banned phrases (no “targets,” “biased,” “wants to,” “obviously”), word-count band per content type, tone (no exclamation points, no emoji, low all-caps tolerance), hallucination (every cited number must trace to a Data Pack value or a reasonable rounded variant), and name consistency (every proper noun must appear in the Data Pack's name roster). Validation failures route to human review regardless of importance score.
- Routed by stakes. Pre-game and post-game articles in the AUTO band (importance 0–30) auto-publish. STAGED band (31–60) auto-publishes with a “not yet reviewed” banner that the future review queue can clear retroactively. MANDATORY band (61+), all weekly columns, and all playoff-series pieces require Senior Reviewer signoff before publication.
- Auditable. The public accuracy page tracks every article: count published per content type, reviewer signoff rate, articles in queue, articles rejected, and any logged corrections. Reviewer-edits and corrections are stored on the article row so the audit trail is queryable.
- Voice. Pattern language only. The article generator never asserts referee intent, never uses judgment language, never characterizes a referee as “biased” or “favoring” a side. The system prompt is verbatim identical across every content type so the same voice rules apply uniformly.
- Honest about missing data. When a Data Pack field is null (for example, NHL play-by-play didn't attribute a specific call to a specific referee that week), the prompt is told to skip that section rather than fabricate a plausible-looking number.
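The routing rules in the list above can be sketched as a small decision function. The content-type strings and the returned labels are hypothetical; the band boundaries and precedence follow the text.

```python
# Content types that always need signoff (labels are illustrative).
ALWAYS_REVIEWED = {"weekly_column", "playoff_series"}

def route_article(content_type, importance, validation_passed):
    """Pick a publishing path from content type, importance score, and validation."""
    if not validation_passed:
        # Validation failures go to a human regardless of importance score.
        return "HUMAN_REVIEW"
    if content_type in ALWAYS_REVIEWED or importance >= 61:
        return "SENIOR_REVIEWER_SIGNOFF"    # MANDATORY band
    if importance >= 31:
        return "PUBLISH_WITH_BANNER"        # STAGED band
    return "AUTO_PUBLISH"                   # AUTO band, 0-30
```

Checking validation first keeps a low-importance draft with a banned phrase from slipping through the AUTO band.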
Corrections policy. Articles that ship with a factual error get logged in article_corrections with type (minor / significant / retraction), description, and reviewer attribution. Recent corrections appear on /accuracy.