These graphs are live. This piece has no bundled recording, so the panels below stay empty until audio is flowing. Hit Follow (microphone) and play — chroma, FFT, spectrogram, onsets and the alignment map all come alive from your mic in real time. Or open a piece with a demo recording to watch the AI follow a performance.
Live signal · chroma 12-bin
RMS 0.00 peak 0.00 conf —
The 12 pitch-class energies — how much of each note name (C, C#, D…) is sounding right now, octave-folded. This is the feature the offline DTW and the live OLTW aligner actually match the score against: a bar's chord profile, not raw pitches.
Audio waveform · live
The raw amplitude over the last fraction of a second — the sound pressure itself. Loud attacks spike; sustained notes settle. It's the time-domain view; the chroma and FFT below are what the aligner derives from it.
FFT · log-frequency spectrum
The frequency content of this instant, on a log axis (so each octave is the same width — the way pitch is heard). Peaks are partials of the notes being played; collapsing the octaves of these peaks gives the chroma vector up top.
Spectrogram · scrolling time × log-freq
The FFT above frozen across time. Each new audio frame paints one pixel-wide column on the right edge; older columns scroll left. X = time (most recent on right). Y = log-frequency (low at bottom). Brightness = energy in dB. This is what the FFT panel looks like when you watch its history.
Onset detection · novelty
Spectral novelty — how much the spectrum just changed. Peaks mark note onsets (new attacks). The aligner weights these moments: a fresh attack is a strong anchor for "we just moved to the next note/bar".
Confidence · model trust
How well the live audio currently matches where the model thinks you are in the score, 0–1. Green > 0.75 = locked. Amber 0.45–0.75 = uncertain. Red < 0.45 = lost — that's when to tap a bar or hit the anchor button to re-sync it.
Notes heard · AI vs score
Hit Play — each detected onset that passes the harmonic gate lands here, with the score's expected pitches.
Each cell is one note the aligner heard (the harmonic gate has already filtered out noise, talking and page rustle). Top line = pitch-class(es) played, bottom line = the score's expected pitch-classes at the matched beat, green bar = match. The matcher on the backlog (Puckette / Dannenberg, Theory §7a) will fuse this stream into the DTW cost; for now the strip is a window onto what the gate is letting through.
Alignment map · audio → score
x = seconds of audio · y = position in the score. The curve is the model's mapping of this performance; a wavy path over a heatmap = a real alignment, a straight diagonal = a linear estimate. Hit Play (demo mode) to see it move.
The "neural map": the heatmap is the chroma match cost between every moment of the recording (x) and every bar of the score (y) — the dark valley is the path of best match. The gold line is the model's chosen alignment through it; the dot is where playback sits. With no precise alignment for a piece you get a plain diagonal instead.
Live cost matrix · scrolling DTW landscape
x = score position (0 to end), y = audio time (recent at top). Dark = good match, bright = poor match. Yellow trail = the OLTW warping path. Live as you play.
This is the OLTW cost matrix CONSTRUCTED LIVE as you play, not a precomputed offline render. Each tick (~2s) the aligner sends a slice of cost cells + the warping path it chose; we accumulate them top-to-bottom. A clean yellow ridge tracking the dark valley = the aligner is locked. A wandering yellow = it is lost.
Alignment map · audio → score
x = seconds of audio · y = position in the score. The curve is the model's mapping of this performance; the dot is where playback is on it. A diagonal line = a linear estimate (no precise alignment for this piece).
This is the same "neural map" mirrored in the Signals tab: the heatmap is the chroma match cost between every moment of the recording (x) and every bar of the score (y) — the dark valley is the path of best match — and the gold line is the alignment the model chose through it. With no precise trace for a piece you get a plain diagonal: a constant-rate estimate.
AI tracking · measure over time
m. — page — lag —
Which bar the model thinks is sounding, plotted over time as playback runs — a staircase climbing toward the end of the piece. m. = current measure, page = which page that's on, lag = ms since the last alignment tick (how fresh the estimate is).
state —
feat —
qps —
dtw —
gold Δ —
hes× —
notes —
harm —
The decomposed confidence sources — why conf took its value this chunk. state is the OLTW regime (warmup until the first chunk lands, following in steady state, creep when the harmonic gate freezes the cursor, reacquire on a wide re-search, lost when a chunk fails). feat is the chroma representation the DTW runs against. qps is the live tempo EMA. dtw is the mean per-step cost, normally in the range 0–0.52; anything above 0.52 trips the lost flag. gold Δ is how far the DTW estimate sits from the gold-trace prior when one is loaded. hes× is the Phase D per-region slack multiplier from your previous takes (>1 = the model widened the search band here because you historically diverged in timing). notes is reserved for the Phase C sequence matcher. harm is the harmonic-gate peak/mean ratio; under 1.85, the onset is treated as non-musical.
Tempo curve · across the performance
Beat-by-beat tempo (score-quarters/sec ×60) read off the precomputed alignment trace, smoothed. When a piece carries several recordings they overlay (P1, P2…).
Rubato & rhythm · expressive timing
REFRESH
Pick a piece with an alignment trace, or hit REFRESH, to read its rubato — the tempo elasticity, the ritardandi / accelerandi, and a synthesis of the performance's timing.
Rubato = expressive deviation of played timing from steady/notated timing. We read it as the local slope of the alignment trace (score-quarters per audio-second): >1 = rushing (accel / stringendo), <1 = stretching (ritard / agogic hold). The headline rubato number is std/mean of that local tempo. Because the read comes off the aligner, a mis-track (wrong-bar then snap-back) shows a fake "rush" — so a gesture under low alignment confidence, or a musically implausible swing, is flagged uncertain, and a mostly-unsure trace says so instead of over-claiming. Note resolution: /api/rhythm sharpens this to per-onset timing deviation when heard onsets are available.
The same idea as Fig. in Shengchen Li's QMUL PhD thesis (Dixon group): plot instantaneous tempo against beat index for a whole performance. Two performances of the same piece are not free of each other — the curves share shape (where one slows, the other tends to). For us each curve comes from a trace's (audio second → score quarter) ticks: local tempo = Δquarter ÷ Δaudio ×60 BPM, then a 5-tick moving average. A flat-ish line = a near-metronomic performance (or a linear pseudo-trace); big dips = ritardandi / fermatas.
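The tempo formula above (local tempo = Δquarter ÷ Δaudio × 60, then a 5-tick moving average) and the std/mean rubato number can be sketched as follows. This is a minimal illustration, not the app's code: the function names are hypothetical, the ticks are assumed to be (audio second, score quarter) pairs, the moving average is assumed to shrink its window at the edges, and the population standard deviation is an assumption since the text only says "std".

```python
import statistics

def local_tempo_bpm(ticks):
    """Tempo between consecutive (audio_sec, score_quarter) ticks, in BPM."""
    tempi = []
    for (t0, q0), (t1, q1) in zip(ticks, ticks[1:]):
        dt = t1 - t0
        if dt > 0:  # skip duplicate timestamps rather than divide by zero
            tempi.append((q1 - q0) / dt * 60.0)
    return tempi

def moving_average(values, width=5):
    """Centred moving average; the window shrinks near the ends (assumption)."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def rubato_index(tempi):
    """std/mean of the local tempo: 0 for a metronome, larger for more give-and-take."""
    if not tempi:
        return 0.0
    mean = statistics.fmean(tempi)
    return statistics.pstdev(tempi) / mean if mean else 0.0
```

A linear pseudo-trace such as [(0, 0), (1, 2), (2, 4), (3, 6)] gives a flat 120 BPM curve and a rubato index of 0, which is the "flat-ish line" case described above.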
Drift log · copy & send for analysis
COPY
CLEAR
chroma — 12 pitch-class energies. conf — alignment trust 0–1 (green >0.75 locked · amber 0.45–0.75 uncertain · red <0.45 lost). lag — ms since the last tick. speed — Δscore ÷ Δaudio: 1.00x = on tempo, <1 = the AI is lagging, >1 = racing ahead. Every anchor you set is logged too. COPY and paste it back and I can pinpoint where it drifts and why.
Source: —
Trace: —
FPS: —
Trace metrics · this performance's mapping
REFRESH
Pick a piece with an alignment trace, or hit REFRESH.
What these say about the sound → visual map. Each precomputed trace is a list of (audio second → score position) ticks — the table reads it back.
covers audio — how much of the recording the trace spans. <100% means the cursor falls back to a linear estimate past that point.
monotone — the trace should never move the cursor backwards in the score. 100% = clean; below that, the cursor jumps back somewhere (a bug, or a real repeat the chroma latched onto).
tempo — the overall score-quarters-per-second of this performance (and the equivalent BPM if the beat is a quarter note). rubato = how much the local tempo swings around that average (0 = a metronome, 0.3+ = heavy give-and-take).
vs metronome — the headline number. How far the alignment pulls away from a straight line, in seconds. Near 0 → the trace is basically linear and adds nothing over a clock. A few seconds → it is genuinely following the performer's timing — that's the gap a fixed page-turn timer would get wrong.
method / feature — offline-dtw / chroma_cqt = the chroma warp; offline-notes-dtw / basic-pitch-pianoroll-88 = the note-event front-end (sharper through trills + polyphony).
note-onset F1 — MIREX-style alignment accuracy at ±50/100/200 ms tolerance, scored against an aligned reference. Only shown when one pairs this recording (ASAP .match); our CC0 recordings have no such reference, so it reads “no reference” — everything above is reference-free.
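The reference-free metrics above can be read off a trace with a few lines of arithmetic. The sketch below is an assumption about how they might be computed, not the app's implementation: the function names are hypothetical, a trace is taken to be a list of (audio second, score position) ticks, and "vs metronome" is assumed to be the largest time gap between the trace and the straight line joining its first and last tick.

```python
def covers_audio(ticks, audio_duration):
    """Fraction of the recording the trace spans (1.0 = the whole recording)."""
    span = ticks[-1][0] - ticks[0][0]
    return min(1.0, span / audio_duration)

def monotone_fraction(ticks):
    """Share of consecutive tick pairs that do not move backwards in the score."""
    steps = list(zip(ticks, ticks[1:]))
    if not steps:
        return 1.0
    forward = sum(1 for (_, q0), (_, q1) in steps if q1 >= q0)
    return forward / len(steps)

def vs_metronome(ticks):
    """Largest deviation, in seconds, from a constant-tempo straight line."""
    (t0, q0), (t1, q1) = ticks[0], ticks[-1]
    if q1 == q0:
        return 0.0
    seconds_per_unit = (t1 - t0) / (q1 - q0)
    return max(abs(t - (t0 + (q - q0) * seconds_per_unit)) for t, q in ticks)
```

A perfectly linear trace scores 0 on vs_metronome; a trace whose performer lingers for a second mid-piece scores about 1.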
Score bootleg · notehead matrix
x = score event order · y = grand-staff notehead position. This is the sheet-side bootleg layer used when full OMR is unreliable.
Bootleg score features intentionally throw away durations, key signatures, and most notation detail. They keep only notehead positions through time, which makes audio/PDF matching more robust on raw scans, repeats, and imperfect OMR.
Training bootleg · cues and anchors
Gold = listening cues. Blue = re-anchors. Red = trained page turns. These are the user-trained facts the live follower uses to reject bad jumps.
The tracker stack loads saved recording-training sessions for this piece. During Follow, low-confidence OLTW ticks are checked against these trained cues/reanchors; trained page-turn points can override generic page estimation when confidence is high enough.
Bootleg alignment path · audio → score
Current trace path rendered in bootleg coordinates. A clean monotone curve is safe; flat runs, spikes, or jumps flag reliability risk.
This mirrors the ISMIR bootleg idea in the app: sheet image/MIDI/audio can all be projected into sparse notehead matrices, then aligned with DTW. The full server-side image/audio bootleg extractor is in mvp/bootleg.py; this panel shows the active browser-side layers now.
Heat · DTW cost matrix & warping path
The DTW cost landscape between the audio chroma (rows = performance frames) and the score chroma (cols = score frames). Dark = low cost (good match), bright = high cost. The yellow path is the alignment the aligner chose. A clean ridge = the aligner found it. A broken ridge or a bright cloud = where it got confused.
First principles — the building blocks
Before the follower itself: the handful of ideas every later section leans on — from what sound even is to a computer up to what it means to align two performances. If anything below ever reads as jargon, the plain-language version is almost certainly one of these seven.
1 · Sound becomes a list of numbers
Sound is a pressure wave — air pushed and pulled. A microphone turns that pressure into a fluctuating voltage, and the computer measures that voltage at a fixed rate. Each measurement is a sample; the rate is the sample rate. CD audio is 44,100 samples per second (44.1 kHz). Sheet Turner resamples everything to 16 kHz — by the Nyquist limit that captures frequencies up to 8 kHz, and everything carrying musical pitch sits well below that. So inside the app, "the audio" is just a long array of numbers between −1 and +1.
2 · Frequency, pitch and amplitude
A pure tone is the signal repeating at a steady rate. That rate is its frequency, in hertz (Hz, cycles per second), and the ear hears it as pitch — faster is higher. The orchestral tuning A is 440 Hz. How far the signal swings each cycle is its amplitude, heard as loudness (measured in decibels, a logarithmic scale, because hearing is logarithmic). Doubling a frequency raises the pitch by exactly one octave: 220, 440 and 880 Hz are all the note A — the fact idea 5 is built on.
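The octave rule can be checked numerically. The sketch below uses the standard equal-temperament convention (MIDI note 69 = A at 440 Hz, twelve semitones per doubling); the function names are illustrative and not taken from the app.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_number(freq_hz, a4_hz=440.0):
    """Nearest MIDI note number; each doubling of frequency adds 12."""
    return round(69 + 12 * math.log2(freq_hz / a4_hz))

def pitch_class(freq_hz):
    """Note name with the octave folded away."""
    return NOTE_NAMES[midi_number(freq_hz) % 12]

def amplitude_db(amplitude, reference=1.0):
    """Amplitude ratio expressed in decibels (logarithmic scale)."""
    return 20 * math.log10(amplitude / reference)

# 220, 440 and 880 Hz are one octave apart each and share a pitch class.
for f in (220.0, 440.0, 880.0):
    print(f, midi_number(f), pitch_class(f))
```

Halving an amplitude lowers the level by about 6 dB, which is why the loudness scale is logarithmic rather than linear.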
3 · The Fourier transform and the spectrogram
A real musical sound is many frequencies at once. The Fourier transform is the mathematical tool that decomposes any signal into the pure frequencies that sum to it — the result is its spectrum (how much energy at each frequency). Music changes over time, so we don't transform the whole recording at once: the short-time Fourier transform (STFT) slides a short window — a few tens of milliseconds — along the signal and transforms each one. Stack those spectra side by side and you get a spectrogram: time across, frequency up, brightness for energy. The spectrogram is the raw material every later step reads.
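To make the sliding-window idea concrete, here is a deliberately naive STFT in pure Python. It is a teaching sketch, not the app's code: a real implementation would use an FFT library, and the frame length (512 samples, 32 ms at 16 kHz), hop size and Hann window are assumptions chosen for illustration.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window, tapering each frame to zero at its edges."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 for one frame (naive O(N^2) transform)."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hann(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(windowed))
        spectrum.append(abs(acc))
    return spectrum

def stft(samples, frame_len=512, hop=256):
    """One magnitude spectrum per frame: the columns of a spectrogram."""
    return [magnitude_spectrum(samples[start:start + frame_len])
            for start in range(0, len(samples) - frame_len + 1, hop)]

# A 1 kHz sine sampled at 16 kHz: bin k covers k * 16000 / 512 Hz,
# so the energy should peak in bin 32 of every frame.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(1024)]
spectrogram = stft(tone)
```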
4 · Harmonics and timbre
Pluck a string tuned to 220 Hz and it does not only vibrate at 220 — it also rings at 440, 660, 880, 1100…, whole-number multiples called the harmonic series. The lowest, the fundamental, fixes the pitch you name. The relative loudness of the upper harmonics (overtones) is the timbre — why a piano and a clarinet on the same note are instantly told apart. Timbre is also the follower's adversary: it must recognise pitch through wildly different timbres — your piano, the demo's piano, a phone mic's colouration.
5 · Pitch classes and the chroma vector
Western music names twelve pitch classes — C, C♯, D…B — and (idea 2) any two notes an octave apart are the same class. The chroma vector exploits this: it folds the whole spectrum onto those twelve bins, summing every octave's C-energy into one number, every C♯ into the next, and so on. The result is a compact 12-number fingerprint of "what harmony is sounding now" that barely depends on octave or instrument. That invariance is exactly what lets the follower compare your performance against a reference — see idea 7. Chroma is the workhorse feature of the live tracker.
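The octave folding can be sketched directly on one magnitude spectrum. This is a simplified illustration under stated assumptions: each DFT bin is assigned to its nearest equal-tempered pitch class, energy is taken as magnitude squared, bins below a hypothetical 55 Hz floor are ignored, and the result is normalised to sum to 1. The app's actual chroma extraction may differ.

```python
import math

def chroma_from_spectrum(magnitudes, sample_rate, n_fft, fmin=55.0):
    """Fold the energy of each DFT bin into 12 pitch classes (index 0 = C)."""
    chroma = [0.0] * 12
    for k, mag in enumerate(magnitudes):
        freq = k * sample_rate / n_fft
        if freq < fmin:
            continue  # skip DC and rumble below the lowest note of interest
        midi = round(69 + 12 * math.log2(freq / 440.0))
        chroma[midi % 12] += mag * mag
    total = sum(chroma)
    return [c / total for c in chroma] if total > 0 else chroma

# Three partials near 220, 440 and 880 Hz (three octaves of A) land in one bin.
n_fft, sample_rate = 4096, 16000
spectrum = [0.0] * (n_fft // 2 + 1)
for f in (220.0, 440.0, 880.0):
    spectrum[round(f * n_fft / sample_rate)] = 1.0
chroma = chroma_from_spectrum(spectrum, sample_rate, n_fft)
```

All the energy ends up in index 9 (A), regardless of which octave each partial came from.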
6 · Onsets
An onset is the instant a note begins — the sharp jump in energy when a key is struck. Comparing each spectrogram frame to the one before and summing the increases gives the spectral flux, a curve that spikes at every attack. Picking those peaks gives the onset times; the spacing between them is a rough estimate of tempo. Onsets give the follower a clock to keep advancing through long held chords, where the chroma barely changes.
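Spectral flux and peak picking fit in a few lines. The sketch below assumes a spectrogram given as a list of magnitude frames and uses a fixed threshold for peak picking; the threshold value and function names are illustrative, and production onset detectors usually adapt the threshold to the local signal level.

```python
def spectral_flux(spectrogram):
    """Per-frame sum of the positive bin-wise increases over the previous frame."""
    flux = [0.0]  # the first frame has nothing to be compared with
    for prev, cur in zip(spectrogram, spectrogram[1:]):
        flux.append(sum(max(c - p, 0.0) for p, c in zip(prev, cur)))
    return flux

def pick_onsets(flux, threshold):
    """Frame indices that are local maxima of the flux and exceed the threshold."""
    onsets = []
    for i in range(1, len(flux) - 1):
        if flux[i] > threshold and flux[i] >= flux[i - 1] and flux[i] > flux[i + 1]:
            onsets.append(i)
    return onsets

# Two toy "attacks": energy appears in bin 0 at frame 2 and in bin 1 at frame 5.
frames = [[0, 0], [0, 0], [1, 0], [1, 0], [1, 0], [1, 2], [1, 2], [1, 2]]
flux = spectral_flux(frames)
onsets = pick_onsets(flux, threshold=0.5)
```

The spacing between the two detected onsets (three frames here) is the raw material for the tempo estimate.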
7 · Alignment, dynamic time warping and score following
Two musicians never play a piece identically — different tempi, rubato, pauses, mistakes. Alignment is the task of pairing each moment of one rendition with the matching moment of another (or of the score). The classic algorithm is dynamic time warping (DTW): lay the two sequences on the axes of a grid, score every cell by how dissimilar the two frames are, and find the cheapest forward-only path corner to corner — that path is the alignment, stretching and squeezing time as needed. Score following is alignment done live: the performance arrives one frame at a time and the path must extend immediately, with no sight of the future. That constraint, and the algorithm that meets it (online time warping), is what the rest of this tab is about.
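The grid-and-cheapest-path description corresponds to textbook offline DTW, sketched below. The cosine distance between frames and the three-way step pattern (diagonal, up, right) are common choices assumed here for illustration; they are not claimed to be the app's exact cost or step weights.

```python
import math

def cosine_distance(a, b):
    """0 for vectors pointing the same way, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def dtw_path(query, reference, dist=cosine_distance):
    """Cheapest forward-only path from corner to corner of the cost grid."""
    n, m = len(query), len(reference)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(query[i - 1], reference[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    # Walk back from the far corner, always taking the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    path.reverse()
    return path, acc[n][m]

# A "performance" that lingers on the first and last chord of a 3-chord score.
A, B, C = [1, 0, 0], [0, 1, 0], [0, 0, 1]
path, total_cost = dtw_path([A, A, B, C, C], [A, B, C])
```

The returned path pairs both A frames with score frame 0 and both C frames with score frame 2, which is the "stretching time" the paragraph describes.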
The RUMAA pipeline · where this is heading
[Diagram: MusicXML (notes · dynamics · repeats) → interleaved ABC (bar-patches · ~64 ch/bar) → score encoder M3 (CLaMP2) ❄. Audio (16 kHz mono) → STFT → ResNet pre-encoder → audio encoder (12 fps) ❄. Decoder ×6 (self-attn → audio X-attn → score X-attn → gated FFN) → 3-ch LM head, with three output streams:]
T1 score-aligned performance transcription (what was played, time-stamped)
T2 performance-aligned score conversion (score notes, re-timed)
T3 edit-operation tags · Match / Insert / Delete / Repeat (handles repeat signs)
Our schematic of the architecture in Chang, Dixon & Benetos — RUMAA, WASPAA 2025 (Fig. 2; also arXiv:2507.12175). ❄ = frozen pretrained encoder. RUMAA is the offline trace generator we'd adopt once the authors release weights — see section 6. The live follower today is the chroma / OLTW pipeline below.
How the follower works
1 · From sound to features
The mic (or the demo recording) is a stream of samples. We slide a short window over it (a few tens of ms) and take its short-time Fourier transform — energy per frequency. Folding every octave onto one ring of 12 semitones gives the chroma vector: how much C, C♯, D… is sounding right now, independent of which octave or which instrument. Chroma is the workhorse because the same chord played by different pianos at different registers lands on roughly the same 12-bin shape — exactly what a score-follower needs.
2 · Onsets & tempo
Between two consecutive STFT frames, the rise in energy (spectral flux / novelty) spikes when a note is struck. Peaks in that novelty curve are onset candidates; their spacing is a noisy estimate of the local tempo. The follower uses this to keep moving forward even through a held chord where the chroma barely changes.
3 · Score following = online time warping
We pre-compute the score as a reference sequence (its chroma over score time). The live audio is a query sequence that arrives one frame at a time. On-line Time Warping (Dixon, 2005) keeps a running cost matrix and, each frame, extends the cheapest monotonic path through it — never going backwards, sometimes pausing, sometimes skipping ahead — so the latest audio frame is matched to the score position whose chroma fits best given everything heard so far. The matched score time → bar → page is what turns the page.
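The frame-at-a-time constraint can be illustrated with a much-simplified online follower. This is not Dixon's OLTW and not the code in mvp/align_online.py: it keeps only the latest row of accumulated costs, updates the whole row per frame instead of a bounded band, and picks the cheapest position inside a forward search window. The class name and the search_width parameter are hypothetical.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

class OnlineFollower:
    """Toy forward-only follower: one accumulated-cost row, updated per frame."""

    def __init__(self, reference, search_width=8):
        self.reference = reference        # score chroma frames
        self.search_width = search_width  # how far ahead the cursor may jump
        self.prev_row = None
        self.position = 0

    def step(self, frame):
        """Consume one live frame and return the estimated score index."""
        m = len(self.reference)
        row = [0.0] * m
        for j in range(m):
            cost = cosine_distance(frame, self.reference[j])
            if self.prev_row is None:
                best = 0.0 if j == 0 else row[j - 1]
            else:
                best = self.prev_row[j]  # pause on the same score frame
                if j > 0:
                    best = min(best, self.prev_row[j - 1], row[j - 1])
            row[j] = cost + best
        self.prev_row = row
        lo = self.position  # never move backwards
        hi = min(m, lo + self.search_width)
        self.position = min(range(lo, hi), key=lambda j: row[j])
        return self.position

A, B, C, D = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
follower = OnlineFollower([A, B, C, D])
positions = [follower.step(f) for f in [A, A, B, B, B, C, D]]
```

The cursor pauses while a chord is held and advances as soon as the next chord fits better, without ever seeing future frames.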
4 · Confidence
If the local cost surface has one sharp minimum, the follower is sure (high conf). If it's flat — lots of score positions fit roughly equally — conf drops and the cursor is essentially guessing. That's the number the tracking pill and the drift log report.
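One way to turn "sharp minimum versus flat surface" into a number is to compare the best local cost with the average local cost. The formula below is a hypothetical stand-in chosen for illustration, since the app's actual confidence computation is not described here; only the colour bands (green above 0.75, amber 0.45 to 0.75, red below 0.45) come from the panel text.

```python
def sharpness_confidence(local_costs):
    """Assumed score in [0, 1]: near 1 for one sharp minimum, 0 for a flat surface."""
    if not local_costs:
        return 0.0
    mean = sum(local_costs) / len(local_costs)
    if mean <= 0:
        return 0.0
    return max(0.0, min(1.0, 1.0 - min(local_costs) / mean))

def confidence_band(conf):
    """Map a confidence value to the bands used by the Confidence panel."""
    if conf > 0.75:
        return "locked"     # green
    if conf >= 0.45:
        return "uncertain"  # amber
    return "lost"           # red

sharp = sharpness_confidence([0.05, 0.5, 0.5, 0.5, 0.5])
flat = sharpness_confidence([0.4, 0.4, 0.4, 0.4, 0.4])
```

With these inputs the sharp surface lands in the locked band and the flat one in the lost band.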
5 · The hard problems — polyphony, pedal, voicing, wrong notes
A pianist's recording is not the clean MIDI a follower wishes it were. The things that actually make this hard, and what each one does to the aligner:
Polyphony. Four voices of a fugue sounding at once is not a melody — it's a thick stack of pitches. A 12-bin chroma frame just sums them, so two harmonically-similar passages (a subject and its answer a fifth apart, a sequence repeated up a step) collapse onto nearly the same shape → the cost surface goes flat and the cursor can latch onto the wrong one. Fix: warp note events (an 88-key pianoroll from a transcriber), not 12 folded bins — that keeps register and voice-count, so the answer no longer looks like the subject.
The sustain pedal. Held pedal lets notes ring into each other — the chroma stops being "what's struck now" and becomes "everything still vibrating", and the onset/spectral-flux spikes that mark note attacks get buried under the wash. The cursor slows or stalls right where the music is most legato. Fix: a note-onset front-end (which already separates onset from sustain) keys on the attacks; longer-term, model the pedal explicitly (it's in the MusicXML — RUMAA's score encoder reads pedal markings).
Voicing. Which note should the cursor sit on when a whole chord lands at once — the top voice the listener tracks, the bass, all of them? And in a fugue, the "current bar" is well-defined but the "current note" isn't. We map to bar / score-quarter (not a single note id) precisely so voicing ambiguity doesn't make the cursor flicker; the bouncing dot lands on the bar's notes as a group.
Wrong / extra / missed notes. A student playing along is not playing the score — a fluffed note, a dropped one, an added grace. A pure DTW will happily warp the score onto the mistakes (treating an inserted note as "the music sped up") and drift. Fix: an alignment that allows edit operations — Match / Insert / Delete — instead of forcing every audio frame onto some score position; that's exactly RUMAA's T3 stream (and its mistake-detection output flags the missing/extra notes). For now the human safety net is tap-to-anchor: hit the bar you're really on and the tracker re-bases.
Repeats & skips. A repeat sign means the same audio appears twice; a D.C. or a cut in performance means the score order isn't the play order. A naive warp must go forward only, so it either smears the repeat or jumps. Fix: handle the repeat structure in the score model — RUMAA emits a Repeat edit op rather than needing a manually unfolded score; until then we keep a "lost → re-acquire" heuristic in the live aligner and, again, tap-to-anchor.
Tempo extremes & rubato. A big ritardando or a Lisztian rush stretches the warping path faster than its bounded search window can follow; a held fermata over near-constant chroma gives it nothing to move on. Fix: a tempo prior (the aligner tracks a running beats-per-second and biases the path toward it — Arzt & Widmer's adaptive normalisation), plus the onset spacing as a backup clock through held chords.
Mic & room. Live mic adds room reverb, a noisy room floor, and (on a phone) a low-quality capture — all of which inflate the audio-side noise the chroma/onset features see. Fix: the confidence read-out is honest about it (low conf = "I'm guessing"), and the gating only auto-turns the page when confidence is above threshold.
The common thread: chroma is harmony, not notes, and a plain DTW must warp every frame onto something. So the fix runs in stages.
The first stage (recognising the actual notes, not an approximate position) is a note-level transcription front-end ahead of the aligner: Spotify's basic-pitch (Bittner et al., ICASSP 2022) turns the audio into note onsets, and the warp runs on that note-event sequence against the score's notes instead of 12-bin chroma. That's the "AMT + DTW" recipe that scores ~96–99% alignment F1 in the literature (RUMAA Table 2); basic-pitch is being wired into the offline trace pipeline and the Metrics tab will report which traces use it. Meanwhile the copyable drift log is the fastest feedback — it tells us exactly which bars still drift and why. (Two more rungs after that: a learned alignment cost à la Agrawal & Dixon, then RUMAA-class repeat-aware end-to-end alignment once weights are public — see Sources.)
6 · The end-state — RUMAA
The furthest rung: RUMAA — Chang, Dixon & Benetos, Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection (WASPAA 2025 · arXiv:2507.12175). Instead of warping features it's a transformer that reads the MusicXML score and the audio together and writes the alignment out as text — the pipeline is the diagram at the top of this tab.
The three output streams are proxy tasks that force the model to stay internally consistent — T1 = "what was actually played, time-stamped", T2 = "the score notes, re-timed to the performance", T3 = the explicit edit script lining them up. T3 is the bit that handles repeats: the model emits a Repeat op rather than needing a pre-unfolded score. Result: ~98% note-onset F1 (±50 ms) even on scores with repeat signs, where the DTW / HMM baselines (including the AMT+DTW recipe in step 5) collapse to 12–36%.
Why it isn't the live tracker today: the paper is explicit — RUMAA is offline ("the model struggles with long audio over one minute … online use unexplored"), it's trained on (n)ASAP (~2 GPU-days), and there are no public weights yet. So the plan is to wire it in as the offline trace generator — the thing that builds the precomputed traces this app follows — the day the authors release it. Until then, step 5's basic-pitch note-event DTW is the best offline path we can run, and our own OLTW (Dixon 2005) is the live one. (Figure not reproduced — see the paper.)
7 (bis) · The three paradigms in one picture
[Figure: Score-following paradigms across four decades, on a time axis from 1984 to 2025. Three lanes with their landmark papers. Sequence matching (note events vs expected notes; Vercoe / Dannenberg / Puckette): V '84, D-M '88, P '90, P-L '92, P '95. HMM · probabilistic (state model + Viterbi; Cano-Loscos-Bonada / Cont): C-L-B '99, O-D '01, Cont '07. Signal · chroma DTW (what we run today; Dixon OLTW → Matchmaker / RUMAA): Dixon '05 · OLTW (ours), MATCH '05, Macrae-Dixon '10, Matchmaker '25, RUMAA '25.]
Three parallel traditions of real-time score-following, 1984–2025. The live tracker in this app runs on the signal/DTW lane (Dixon 2005); the upstream sequence-matching and HMM layers above are the queued fusion paths (see § 7a below and the Sources tab for full bibliographic references and links).
7a · Sequence theory — the note-event paradigm
The OLTW path above is the signal paradigm: every audio frame is a 12-bin chroma, every score frame is a 12-bin chroma, and a continuous warping path forces a match. The other paradigm is sequence matching : detect discrete note events from the audio (pitch + onset), then match that sequence against the score's note sequence as a string-matching problem. The history is older than DTW for music:
Vercoe (1984) — The Synthetic Performer in the Context of Live Performance (ICMC 1984). The first published score-follower. Pitch-tracked monophonic input matched against the expected note list with simple recovery heuristics.
Dannenberg & Mukaino (1988) — New Techniques for Enhanced Quality of Computer Accompaniment (ICMC 1988). Adds expectation-based matching: rather than «just match the next note», score the alignment over a small window so wrong / extra / missed notes degrade gracefully.
Puckette (1990) — EXPLODE: A User Interface for Sequencing and Score Following (ICMC 1990). Simplifies it to first-exact-match against the list of upcoming expected events — the cheap baseline every later paper compares to.
Puckette & Lippe (1992) — Score Following in Practice (ICMC 1992). Formalises the two-level approach still used in IRCAM-style live electronics: try a fast exact-match first, fall back to a slower «skip-list» search that tolerates omitted or substituted notes.
Puckette (1995) — Score Following Using the Sung Voice (ICMC 1995). Pairs two pitch trackers in parallel — one fast and imprecise, one slower but more reliable — so the matcher has both quick guesses and well-vetted commits to choose between, instead of trusting a single noisy estimate.
Cano, Loscos & Bonada (1999) — Score-Performance Matching using HMMs (ICMC 1999). Bridges the two paradigms with a left-to-right HMM whose emissions are a vector of low-level audio features (energy, zero-crossing rate, fundamental frequency, plus their first derivatives); self-transitions model note duration, and Viterbi decoding produces the most likely score-position path. The probabilistic ancestor of Cont's ANTESCOFO (2007) line cited below.
Shared weakness, useful warning. Both the Puckette-style sequence matcher and the Cano/Loscos/Bonada HMM ride directly on the robustness of pitch / fundamental-frequency detection — a smeared chord, a pedal wash, or a fast trill that confuses the front-end propagates straight into the matcher's decisions. Our chroma-DTW path is less brittle there (it averages over a 12-bin folded vector instead of committing to single-pitch labels) but pays for it in resolution: chroma cannot tell a C from a C an octave up, and it discards register and voice count, so harmonically similar passages (a fugue subject and its answer, say) fold onto nearly the same shape. The harmonic gate we added in _detect_notes() (Score Wizard pattern: peak/mean ≥ 1.85 on chroma columns) is the cheap counterpart of Puckette 1995's «reliable» pitch tracker — it filters non-musical onsets out of the note-event stream before any matcher gets to see them.
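The peak/mean test behind the harmonic gate is simple enough to sketch. The function below is an illustration of the stated rule (peak/mean ≥ 1.85 on a chroma column), not the body of _detect_notes(); the function name and the handling of a silent frame are assumptions.

```python
HARMONIC_GATE_RATIO = 1.85  # threshold quoted in the text

def passes_harmonic_gate(chroma, ratio=HARMONIC_GATE_RATIO):
    """True when one pitch class clearly stands out from the 12-bin average."""
    mean = sum(chroma) / len(chroma)
    if mean <= 0:
        return False  # silence: nothing musical to let through (assumption)
    return max(chroma) / mean >= ratio

noise = [1.0] * 12                 # flat spectrum, e.g. page rustle
single_note = [0.0] * 9 + [1.0, 0.0, 0.0]  # all energy on A
```

A flat chroma frame has a peak/mean ratio of 1 and is rejected, while a frame dominated by one pitch class has a ratio of 12 and passes.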
The three paradigms are complementary , not competing. The signal path (chroma DTW) is robust to messy onsets but doesn't «know what note was played» — it never feeds back «you played a C, the score expects an E». The sequence path knows exactly that — the matched note pair, the substitution, the omission — but breaks when onset detection is noisy (the chord is smeared, the pedal swallows the attack, the trill is too fast). Layered together: OLTW carries the score position, the note-event matcher mirrors detected pitches against the score's expected pitches at that position, and the two streams correct each other.
What we run today already produces the inputs both layers need: mvp/align_online.py _detect_notes() emits per-chunk detected note events (pitch class + match flag against the score's expected pitches at the current beat), and the harmonic gate (Score Wizard pattern) filters non-musical onsets out of that stream. The missing piece is the Puckette-style matcher itself: take the recent detected-note window, slide it against the expected-note window from the score's pre-engraved timemap (cached in /static/precomputed/<id>.json), and emit a sequence-alignment confidence the OLTW can fuse with its chroma cost. Backlog item, with the sources above; not shipped yet — it needs an A/B against a hard piece (BWV 848 fugue) before going live.
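The "slide the detected window against the expected window" step could look roughly like this. It is a hypothetical sketch of the unshipped backlog item, using a plain exact-match slide rather than the windowed dynamic-programming or skip-list matchers from the papers above; the function name and data layout (pitch classes as integers 0 to 11, one set of expected pitch classes per score event) are assumptions.

```python
def match_window(detected, expected):
    """Best offset of `detected` inside `expected`, with the fraction of hits.

    detected: list of pitch classes heard, oldest first.
    expected: list of sets of pitch classes, one per upcoming score event.
    Returns (offset, confidence) with confidence in [0, 1].
    """
    if not detected or len(detected) > len(expected):
        return 0, 0.0
    best_offset, best_score = 0, -1.0
    for offset in range(len(expected) - len(detected) + 1):
        hits = sum(1 for pc, wanted in zip(detected, expected[offset:])
                   if pc in wanted)
        score = hits / len(detected)
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score

# Score events: C, E, G, then a C+E dyad, then D. The player was heard on G, C, D.
expected = [{0}, {4}, {7}, {0, 4}, {2}]
offset, confidence = match_window([7, 0, 2], expected)
```

The confidence value is the kind of signal that could be fused with the chroma cost, and a low value at every offset would indicate wrong or extra notes.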
7 · Bootleg score layer for raw PDFs
The TISMIR 2021 piano score-following-video paper is now one of our source-of-truth designs for scanned scores. Its core idea: convert both the audio and the sheet image into the same sparse bootleg score representation, then align those matrices. Audio becomes note events first (audio → AMT/MIDI → notehead positions). PDF becomes image features directly (PDF page → staff lines + filled noteheads → notehead positions). This is exactly why the new Bootleg Lab exists.
MIDI boundary: the app still has no user-facing MIDI surface: no MIDI playback requirement, no hardware MIDI dependency, no MIDI feature for the performer to manage. Internally, note events are allowed and necessary. Audio → AMT → note-event matrices is the bridge into bootleg alignment; it is a private recognition layer, not the product surface.
The important architectural consequence: OMR is not the only path . MusicXML/MEI is still the best structured layer when available, but raw scans need a fallback that survives bad recognition, filler pages, repeats, and unknown jumps. The bootleg layer discards durations/accidentals/key signatures on purpose and keeps the geometry that is stable across audio and image: notehead positions in time.
8 · Hierarchical DTW for repeats and unknown jumps
The same paper aligns every sheet line against the whole audio with subsequence DTW, then runs a second segment-level DP over line candidates. That lets repeats and skips happen at line breaks without permitting arbitrary chaos. Our first implementation is now in mvp/bootleg.py: note-event bootlegs, image bootlegs, subsequence line matching, and segment-path selection with penalties for repeats/skips. The next step is feeding real PDF page images and user recordings into that pipeline and showing the resulting line path in this Lab tab.
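The first stage, subsequence DTW, differs from ordinary DTW in that the short query (one sheet line) may start and end anywhere in the long reference (the whole audio). The sketch below shows only that stage under simplified assumptions: scalar features with an absolute-difference cost stand in for bootleg columns, and the second, segment-level DP is omitted. It is not the code in mvp/bootleg.py.

```python
def subsequence_dtw(query, reference, dist):
    """Find where `query` best fits inside `reference`.

    Returns (start, end, cost): the matched reference index range and path cost.
    """
    n, m = len(query), len(reference)
    acc = [[0.0] * m for _ in range(n)]
    for j in range(m):
        acc[0][j] = dist(query[0], reference[j])  # free start: no entry penalty
    for i in range(1, n):
        for j in range(m):
            best = acc[i - 1][j]
            if j > 0:
                best = min(best, acc[i - 1][j - 1], acc[i][j - 1])
            acc[i][j] = dist(query[i], reference[j]) + best
    end = min(range(m), key=lambda j: acc[n - 1][j])  # free end
    # Backtrack to the first query row to recover where the match began.
    i, j = n - 1, end
    while i > 0:
        candidates = [(acc[i - 1][j], i - 1, j)]
        if j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
    return j, end, acc[n - 1][end]

# The query 1, 2, 2, 3 sits inside the reference at indices 1..3.
start, end, cost = subsequence_dtw([1, 2, 2, 3], [9, 1, 2, 3, 9, 9],
                                   dist=lambda a, b: abs(a - b))
```

Running this per sheet line yields candidate (start, end, cost) spans, which is the input a segment-level DP would then chain together while penalising repeats and skips.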
Timed mode is not recognition. It is an explicit opt-in fallback for rehearsal cases where the user knowingly wants a clock estimate. It stays off by default because page turns based only on elapsed time are fragile under rubato, repeats, pauses, and mistakes.
Reading the Signals panels
chroma 12-bin — the live pitch-class energies, brighter = louder.
waveform — raw amplitude over the last instant.
FFT log-frequency — the spectrum on a log axis (one octave = one unit), so harmonics line up.
onset novelty — spectral flux; spikes = note attacks.
confidence — the aligner's running trust, 0–1.
tracking · measure over time (Tracking tab) — the matched bar plotted against audio time; a clean monotone line = locked, a flat run = stuck, a jump = it re-snapped.
The science the follower stands on — what it runs today, and the work it's moving toward. ▸ = a method/library actually in the running code.
Score following · alignment methods
▸ Dixon (2005) — Live Tracking of Musical Performances Using On-Line Time Warping (DAFx 2005). PDF. The OLTW algorithm our live aligner runs (mvp/align_online.py).
Cont (2007) — Realtime Audio to Score Alignment with a Coupled Probabilistic Model (ISMIR 2007). PDF. The probabilistic/state-space camp behind ANTESCOFO-style score following; roadmap layer, not a replacement for OLTW.
Vercoe (1984) — The Synthetic Performer in the Context of Live Performance (ICMC 1984). archive. First published score-follower: pitch-tracked monophonic input matched against the expected note list. Foundational; the sequence-matching ancestor of every later note-event approach.
Dannenberg & Mukaino (1988) — New Techniques for Enhanced Quality of Computer Accompaniment (ICMC 1988). PDF. Expectation-based note-event matching; scores the alignment over a small window so dropped / added / substituted notes degrade gracefully.
Puckette (1990) — EXPLODE: A User Interface for Sequencing and Score Following (ICMC 1990). IRCAM archive. First-exact-match baseline of the sequence-matching paradigm; every later paper compares to this.
Puckette & Lippe (1992) — Score Following in Practice (ICMC 1992). Semantic Scholar. Formalises the two-level approach (fast exact match first, slower skip-list recovery on miss) still used in IRCAM-style live electronics.
Puckette (1995) — Score Following Using the Sung Voice (ICMC 1995). archive. Dual pitch tracker (one fast/imprecise + one slower/reliable) so the matcher can choose between quick guesses and well-vetted commits — the conceptual ancestor of our harmonic gate.
Cano, Loscos & Bonada (1999) — Score-Performance Matching using HMMs (ICMC 1999). Left-to-right HMM with multi-feature emissions (energy, zero-crossing rate, fundamental frequency + derivatives); self-transitions model note length, Viterbi decoding picks the score-position path. The probabilistic ancestor of the ANTESCOFO line.
Orio & Déchelle (2001) — Score Following Using Spectral Analysis and Hidden Markov Models (ICMC 2001). HAL PDF. The HMM-meets-spectral-features synthesis that surveys and combines the earlier tracks (Vercoe / Dannenberg / Puckette / Cano-Loscos-Bonada). Cited as the source of the multi-feature emission and pitch-tracker robustness framing in the new Theory section.
Cont — Anticipatory Score-Following . HAL PDF . Anticipation/prediction layer for concert-grade latency once the base tracker is reliable.
Dixon & Widmer (2005) — MATCH: A Music Alignment Tool Chest (ISMIR 2005). PDF · code . Offline DTW with the forward-path constraint (the shape of our precomputed traces).
Macrae & Dixon (2010) — Accurate Real-time Windowed Time Warping (ISMIR 2010). PDF . Windowed OLTW — the bounded tempo-window trick our online aligner uses.
Arzt & Widmer (2008) — Automatic Page Turning for Musicians via Real-Time Machine Listening (ECAI 2008); Arzt, Widmer & Dixon (2012) — Adaptive Distance Normalization for Real-Time Music Tracking (EUSIPCO 2012). PDF . The page-turn application + the onset-feature normalization on our roadmap.
Agrawal & Dixon (2020/2021) — Learning Frame Similarity for Audio-to-Score Alignment (EUSIPCO 2020) · A Convolutional-Attentional Framework for Structure-Aware Performance-Score Synchronization (ICASSP 2021). EUSIPCO PDF · arXiv:2101.03937 . Learned alignment cost functions (a drop-in replacement for the chroma distance).
Park, Cancino-Chacón, Chiruthapudi & Nam (2025) — Matchmaker: An Open-Source Library for Real-Time Piano Score Following and Systematic Evaluation (ISMIR 2025). arXiv:2510.10087 · code . Reference OLTW-Dixon / OLTW-Arzt + the AR / AE / AAE / MAE metric suite the Metrics tab follows.
Chang, Dixon & Benetos (2025) — RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection (WASPAA 2025). WASPAA PDF · arXiv:2507.12175 . A transformer that aligns human-readable MusicXML with repeats straight to full-length audio, transcribes it, and flags wrong / missing / extra notes — 98.4 alignment F1 even with repeats, where DTW/HMM baselines collapse to 12–36. Offline (≤1-min chunks), no public weights yet; this is where the precomputed-trace generator goes when they release.
Henkel & Widmer — Real-Time Music Following in Score Sheet Images via Multi-Resolution Prediction (a copy is ingested at docs/Real-Time_Music_Following_in_Score_Sheet_Images_vi.pdf); Henkel, Kelz & Widmer (2020) — Learning to Read and Follow Music in Complete Score Sheet Images (ISMIR 2020). arXiv:2007.10736 . Follows audio against a photo of the page — no MusicXML needed. Research track.
▸ Dorfer / Henkel / Widmer line of work, represented here by the TISMIR piano-score-following-video system — Automatic Generation of Piano Score Following Videos . TISMIR article . Source-of-truth for the new Bootleg Lab: convert audio/MIDI and sheet images into sparse notehead-position matrices, then align them with hierarchical DTW.
▸ Bootleg score alignment — MIDI-Sheet Music Alignment Using Bootleg Score Synthesis . arXiv:2004.10345 . The direct model for our raw-PDF fallback: staff/notehead geometry first, full symbolic OMR second.
Müller — Fundamentals of Music Processing . DTW notebook · book resources . The baseline math for DTW, subsequence matching, and alignment-cost thinking.
Feffer (2022) — MeSA: Multi-Score Alignment . project page . Why PDF-in-the-wild still needs a human-verifiable system map and correction loop.
Onset / beat / note-level front-ends
▸ Dixon (2006) — Onset Detection Revisited (DAFx 2006). PDF . Spectral-flux onset novelty (the "onset detection" panel in Signals).
Dixon (2007) — Evaluation of the Audio Beat Tracking System BeatRoot (J. New Music Research). PDF · code . Beat tracking — a soft tempo constraint we could feed the warp path.
Bittner et al. / Spotify Research (2022) — A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation — basic-pitch (ICASSP 2022). project · arXiv:2203.09893 · code. The note-level transcription front-end being wired in ahead of the aligner, so that it warps a sequence of note onsets rather than 12-bin chroma.
Hawthorne et al. (2018) — Onsets and Frames: Dual-Objective Piano Transcription (ISMIR 2018). arXiv:1710.11153 . The piano-transcription baseline.
Hawthorne et al. / Google Research — Onsets and Frames implementation notes. Google Research page . This is the reference shape for audio → piano note events before bootleg or MusicXML alignment.
Li et al. (2023) — MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training . arXiv:2306.00107 · MERT-v1-330M . Candidate embedding layer for robust audio confidence when chroma/onsets disagree.
Pilataki, Mauch & Dixon (2024) — Pitch-aware Generative Pretraining Improves Multi-pitch Estimation with Scarce Data (ACM MM Asia 2024). PDF . Pretraining trick for transcription when annotated data is thin.
Murgul & Heizmann (2025) — Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer (SMC 2025). arXiv:2507.00466 .
Wang, Ewert & Dixon (2016) — Identifying Missing and Extra Notes in Piano Recordings Using Score-Informed Dictionary Learning (ISMIR 2016). PDF . The basis for a "you missed / added a note here" practice layer.
Transcription & OMR engines · symbolic encoders
Chang, Benetos, Kirchhoff & Dixon (2024) — YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures (MLSP 2024). arXiv:2407.04822 . The audio encoder RUMAA reuses; SOTA multi-instrument transcription.
Gardner et al. / Google Magenta (2022) — MT3: Multi-Task Multitrack Music Transcription (ICLR 2022). code .
Zeng et al. (2023) — CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic MIR (ISMIR 2023). arXiv:2304.11029 · code . The M3 score encoder RUMAA builds on.
Ríos-Vila / Calvo-Zaragoza line of work (2024–2025) — Sheet Music Transformer, Practical End-to-End OMR for Pianoform Music, and Sheet Music Transformer++. SMT · Linearized MusicXML · SMT++. Research path for a future GPU OMR backend that emits MusicXML-like sequences directly from piano pages.
IJDAR (2024) — A unified representation framework for the evaluation of Optical Music Recognition systems . article . Why we keep MusicXML/MEI plus bootleg/system maps instead of trusting one opaque OMR output.
▸ Audiveris — github.com/Audiveris/audiveris · handbook . The optical-music-recognition engine behind "Convert PDF → MusicXML" (self-hosted builds only).
Matt Zucker — Page Dewarp . article/code notes . Pre-OMR image correction for camera scans before Audiveris or bootleg extraction.
Datasets, benchmarks & evaluation
Foscarin et al. (2020) — ASAP: a Dataset of Aligned Scores and Performances for Piano Transcription (ISMIR 2020). PDF · code . Aligned piano scores + performances — most of the Piano-solo tab and the note-level alignment ground truth the Metrics tab scores against.
Hawthorne et al. (2019) — Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset (ICLR 2019). arXiv:1810.12247 · dataset . ~200 h of real Yamaha-Disklavier recordings — the source of genuine demo audio.
Cheng et al. (2021) — ACPAS: Aligned Classical Piano Audio and Score (ISMIR 2021). dataset .
MIREX — Real-time Audio to Score Alignment (Score Following) . protocol wiki . Alignment Rate @ 50/100/200/500/1000/2000 ms + mean/median absolute error (ms & beats) — the metric set the Metrics tab uses.
Public-domain corpora: OpenScore Lieder (the Lieder tab) · KernScores · Mutopia · IMSLP (editions for the PDF / OMR scores) · Open Well-Tempered Clavier (Kimiko Ishizaka, CC0 — the Bach fugue recordings).
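The MIREX-style metric set listed above (alignment rate at several tolerances plus mean and median absolute error) can be sketched as follows. This is an illustration under stated assumptions, not the Metrics tab's code: the function name and return shape are hypothetical, times are assumed to be in milliseconds and already paired note by note, and counting an error exactly equal to the tolerance as aligned is an assumption.

```python
from statistics import mean, median


def alignment_metrics(estimated_ms, reference_ms,
                      thresholds_ms=(50, 100, 200, 500, 1000, 2000)):
    """Alignment rate per tolerance plus mean/median absolute error.

    `estimated_ms[i]` and `reference_ms[i]` are the estimated and ground-truth
    times of the same score event. Hypothetical helper for illustration.
    """
    if len(estimated_ms) != len(reference_ms) or not reference_ms:
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [abs(e - r) for e, r in zip(estimated_ms, reference_ms)]
    # Fraction of events whose absolute error is within each tolerance.
    rates = {
        t: sum(err <= t for err in errors) / len(errors)
        for t in thresholds_ms
    }
    return {
        "alignment_rate": rates,
        "mean_abs_error_ms": mean(errors),
        "median_abs_error_ms": median(errors),
    }
```

Reporting the median alongside the mean matters because a single lost passage can dominate the mean while leaving the median almost unchanged.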
Libraries we run on
▸ Verovio (reference book) — the WASM music engraver that renders every score in the browser.
▸ pymatchmaker / matchmaker — real-time score-to-audio alignment (an alternative live path alongside our own mvp/align_online.py).
▸ CPJKU/partitura — parses MusicXML / MEI into the note arrays the aligner and the alignment map need.
▸ librosa — STFT, CQT-chroma, onset novelty, sub-sequence DTW.
▸ FastAPI — the server; audio + scores are served from Cloudflare R2, and the app deploys on Railway.
Learn the fundamentals
Valerio Velardo — Audio Signal Processing for Machine Learning . YouTube series . Waveforms, the Fourier transform, spectrograms, chroma and MFCC — intuition and maths side by side.
Xavier Serra & Julius O. Smith — Audio Signal Processing for Music Applications . Coursera (free to audit). UPF / Stanford CCRMA: the DFT, the STFT and spectral models, hands-on in Python.
Meinard Müller — Fundamentals of Music Processing . FMP Notebooks · textbook . Chroma, onsets, DTW and score following, each explained then implemented in a Python notebook.
musicinformationretrieval.com — Stanford CCRMA's hands-on MIR notebooks: chroma, onset detection, dynamic time warping in runnable Python.
ISMIR — the International Society for Music Information Retrieval. Its openly archived proceedings are the primary record of music-information-retrieval theory — score following, transcription and alignment.
Steven W. Smith — The Scientist and Engineer's Guide to Digital Signal Processing . dspguide.com (free book). Sampling, the FFT and digital filters in plain language.
Graphify · architecture map
An illustrative 3D map of how Sheet Turner's parts fit together — a force-directed architecture diagram, the same engine the ClawCorp atlas uses. Drag to orbit · scroll to zoom · click a node to fly to its neighbourhood · hover to light up what connects to it · type to search.
The Theory tab draws the idea of the pipeline; this is an illustrative map of the architecture — nodes are components, links show what connects to what, colours are clusters the layout found on its own.
Costs · paid-API spend
Every paid API call (Gemini OMR vision, vision-bbox calibration, piece-chat LLM) records its USD cost in the unified ledger (mvp/lib/cost_ledger.py). This dashboard sums it per day and per feature so spend is visible at a glance instead of summing four endpoints by hand. Day boundaries roll at UTC midnight.
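The per-day, per-feature roll-up described above can be sketched like this. It does not reproduce mvp/lib/cost_ledger.py: the entry shape (timestamp, feature, usd), the function name, and the feature labels in the usage note are assumptions. The one detail taken from the text is that days are bucketed at UTC midnight.

```python
from collections import defaultdict
from datetime import timezone


def daily_spend(entries):
    """Sum USD cost per (UTC day, feature).

    `entries` is an iterable of (timestamp, feature, usd) tuples, where
    `timestamp` is a timezone-aware datetime. Hypothetical record shape.
    """
    totals = defaultdict(float)
    for timestamp, feature, usd in entries:
        # Convert to UTC first so day boundaries roll at UTC midnight,
        # regardless of the zone the call was logged in.
        day = timestamp.astimezone(timezone.utc).date().isoformat()
        totals[(day, feature)] += usd
    return dict(totals)
```

With this bucketing, a call logged at 23:30 in a UTC-5 zone on 1 January lands on 2 January, because it is 04:30 UTC by then. A label such as "omr_vision" is a placeholder, not a name taken from the ledger.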
Daily spend