Live signal · chroma 12-bin
RMS 0.00 peak 0.00 conf —
The 12 pitch-class energies — how much of each note name (C, C#, D…) is sounding right now, octave-folded. This is the feature the offline DTW and the live OLTW aligner actually match the score against: a bar's chord profile, not raw pitches.
Audio waveform · live
The raw amplitude over the last fraction of a second — the sound pressure itself. Loud attacks spike; sustained notes settle. It's the time-domain view; the chroma and FFT below are what the aligner derives from it.
FFT · log-frequency spectrum
The frequency content of this instant, on a log axis (so each octave is the same width — the way pitch is heard). Peaks are partials of the notes being played; collapsing the octaves of these peaks gives the chroma vector up top.
Onset detection · novelty
Spectral novelty — how much the spectrum just changed. Peaks mark note onsets (new attacks). The aligner weights these moments: a fresh attack is a strong anchor for "we just moved to the next note/bar".
Confidence · model trust
How well the live audio currently matches where the model thinks you are in the score, 0–1. Green > 0.75 = locked. Amber 0.45–0.75 = uncertain. Red < 0.45 = lost — that's when to tap a bar or hit the anchor button to re-sync it.
Alignment map · audio → score
x = seconds of audio · y = position in the score. The curve is the model's mapping of this performance; a wavy path over a heatmap = a real alignment, a straight diagonal = a linear estimate. Hit Play (demo mode) to see it move.
The "neural map": the heatmap is the chroma match cost between every moment of the recording (x) and every bar of the score (y) — the dark valley is the path of best match. The gold line is the model's chosen alignment through it; the dot is where playback sits. With no precise alignment for a piece you get a plain diagonal instead.
Alignment map · audio → score
x = seconds of audio · y = position in the score. The curve is the model's mapping of this performance; the dot is where playback is on it. A diagonal line = a linear estimate (no precise alignment for this piece).
This is the same "neural map" mirrored in the Signals tab: the heatmap is the chroma match cost between every moment of the recording (x) and every bar of the score (y) — the dark valley is the path of best match — and the gold line is the alignment the model chose through it. With no precise trace for a piece you get a plain diagonal: a constant-rate estimate.
AI tracking · measure over time
m. — page — lag —
Which bar the model thinks is sounding, plotted over time as playback runs — a staircase climbing toward the end of the piece. m. = current measure, page = which page that's on, lag = ms since the last alignment tick (how fresh the estimate is).
Drift log · copy & send for analysis
COPY
CLEAR
chroma — 12 pitch-class energies. conf — alignment trust 0–1 (green >0.75 locked · amber 0.45–0.75 uncertain · red <0.45 lost). lag — ms since the last tick. speed — Δscore ÷ Δaudio: 1.00x = on tempo, <1 = the AI is lagging, >1 = racing ahead. Every anchor you set is logged too. Hit COPY, paste it back, and I can pinpoint where it drifts and why.
Source: —
Trace: —
FPS: —
Trace metrics · this performance's mapping
REFRESH
Pick a piece with an alignment trace, or hit REFRESH.
What these say about the sound → score map. Each precomputed trace is a list of (audio second → score position) ticks — the table reads it back, and the sketch after this list shows how the numbers fall out of one.
covers audio — how much of the recording the trace spans. <100% means the cursor falls back to a linear estimate past that point.
monotone — the trace should never move the cursor backwards in the score. 100% = clean; below that, the cursor jumps back somewhere (a bug, or a real repeat the chroma latched onto).
tempo — the overall score-quarters-per-second of this performance (and the equivalent BPM if the beat is a quarter note). rubato = how much the local tempo swings around that average (0 = a metronome, 0.3+ = heavy give-and-take).
vs metronome — the headline number. How far the alignment pulls away from a straight line, in seconds. Near 0 → the trace is basically linear and adds nothing over a clock. A few seconds → it is genuinely following the performer's timing — that's the gap a fixed page-turn timer would get wrong.
method / feature — offline-dtw / chroma_cqt = the chroma warp; offline-notes-dtw / basic-pitch-pianoroll-88 = the note-event front-end (sharper through trills + polyphony).
note-onset F1 — MIREX-style alignment accuracy at ±50/100/200 ms tolerance, scored against an aligned reference. Only shown when such a reference pairs with this recording (an ASAP .match file); our CC0 recordings have none, so it reads “no reference” — everything above is reference-free.
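How these reference-free numbers could fall out of a trace, as a minimal sketch: it assumes the trace is a plain list of (audio_second, score_quarter) pairs, and it reads "vs metronome" as the worst deviation from a constant-tempo line through the endpoints. The real pipeline's format and exact formulas may differ.

```python
def trace_metrics(trace, audio_duration):
    """trace: list of (audio_second, score_quarter) ticks, sorted by audio time.
    Illustrative formulas only, not the app's actual code."""
    t = [a for a, _ in trace]   # audio seconds
    q = [s for _, s in trace]   # score quarters

    covers = (t[-1] - t[0]) / audio_duration                    # fraction of audio spanned
    monotone = sum(1 for a, b in zip(q, q[1:]) if b >= a) / max(len(q) - 1, 1)

    tempo_qps = (q[-1] - q[0]) / (t[-1] - t[0])                 # score-quarters per second
    bpm = tempo_qps * 60                                        # if the beat is a quarter note

    # local tempo per tick pair; its relative spread is the "rubato" number
    local = [(q1 - q0) / (t1 - t0)
             for (t0, q0), (t1, q1) in zip(trace, trace[1:]) if t1 > t0]
    mean = sum(local) / len(local)
    rubato = (sum((x - mean) ** 2 for x in local) / len(local)) ** 0.5 / mean

    # "vs metronome": max deviation (seconds) from the straight line a
    # constant-tempo clock would draw through the trace's endpoints
    vs_metro = max(abs(ti - (t[0] + (qi - q[0]) / tempo_qps)) for ti, qi in trace)

    return {"covers": covers, "monotone": monotone, "bpm": bpm,
            "rubato": rubato, "vs_metronome_s": vs_metro}
```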
The RUMAA pipeline · where this is heading
MusicXML
notes·dynamics·repeats
interleaved ABC
bar-patches · ~64 ch/bar
Score encoder
M3 (CLaMP2)
❄
audio
16 kHz mono
STFT → ResNet
pre-encoder
Audio encoder
12 fps
❄
Decoder ×6
self-attn
→ audio X-attn
→ score X-attn
→ gated FFN
3-ch LM head
T1 score-aligned performance transcription (what was played, time-stamped)
T2 performance-aligned score conversion (score notes, re-timed)
T3 edit-operation tags · Match / Insert / Delete / Repeat (handles repeat signs)
Our schematic of the architecture in Chang, Dixon & Benetos — RUMAA, WASPAA 2025 (Fig. 2; also arXiv:2507.12175). ❄ = frozen pretrained encoder. RUMAA is the offline trace generator we'd adopt once the authors release weights — see section 6. The live follower today is the chroma / OLTW pipeline below.
How the follower works
1 · From sound to features
The mic (or the demo recording) is a stream of samples. We slide a short window over it (a few tens of ms) and take its short-time Fourier transform — energy per frequency. Folding every octave onto one ring of 12 semitones gives the chroma vector: how much C, C♯, D… is sounding right now, independent of which octave or which instrument. Chroma is the workhorse because the same chord played by different pianos at different registers lands on roughly the same 12-bin shape — exactly what a score-follower needs.
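A minimal sketch of that step with librosa (already in the stack below); the filename, sample rate and hop size are illustrative:

```python
import librosa

# Load mono audio; sample rate and hop size are illustrative, not the app's settings.
y, sr = librosa.load("take.wav", sr=22050, mono=True)

# CQT-based chroma: energy per pitch class (C, C#, ... B) per frame,
# with every octave folded onto the same 12 bins.
chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=512)
print(chroma.shape)  # (12, n_frames): one 12-bin "chord profile" per frame
```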
2 · Onsets & tempo
Between two consecutive STFT frames, the rise in energy (spectral flux / novelty) spikes when a note is struck. Peaks in that novelty curve are onset candidates; their spacing is a noisy estimate of the local tempo. The follower uses this to keep moving forward even through a held chord where the chroma barely changes.
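The same two ideas as librosa calls; again a sketch, with an illustrative hop size and a median-based tempo guess the real follower may not use verbatim:

```python
import numpy as np
import librosa

y, sr = librosa.load("take.wav", sr=22050)

# Spectral-flux novelty: per-frame increase in spectral energy.
novelty = librosa.onset.onset_strength(y=y, sr=sr, hop_length=512)

# Peaks in the novelty curve are onset candidates...
onsets = librosa.onset.onset_detect(onset_envelope=novelty, sr=sr,
                                    hop_length=512, units="time")

# ...and their spacing is a (noisy) local-tempo estimate: the backup
# clock that keeps the follower moving through a held chord.
ioi = np.diff(onsets)                 # inter-onset intervals, seconds
local_bpm = 60.0 / np.median(ioi)
```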
3 · Score following = online time warping
We pre-compute the score as a reference sequence (its chroma over score time). The live audio is a query sequence that arrives one frame at a time. On-line Time Warping (Dixon, 2005) keeps a running cost matrix and, each frame, extends the cheapest monotonic path through it — never going backwards, sometimes pausing, sometimes skipping ahead — so the latest audio frame is matched to the score position whose chroma fits best given everything heard so far. The matched score time → bar → page is what turns the page.
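A toy version of that path extension, heavily simplified: the real mvp/align_online.py adds the bounded search window, slope limits and incremental cost updates from the papers in Sources.

```python
import numpy as np

def cosine_dist(a, b):
    # 1 - cosine similarity between two chroma vectors
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

class ToyOLTW:
    """Toy online time warping: one live chroma frame in, one score frame out.
    Illustrative only: no bounded window, no slope limits, O(n) per frame."""

    def __init__(self, score_chroma, band=20):
        self.ref = score_chroma                        # (n_score_frames, 12)
        self.band = band                               # max frames to jump ahead
        self.pos = 0                                   # current score frame
        self.acc = np.full(len(score_chroma), np.inf)  # accumulated path cost
        self.acc[0] = 0.0                              # the path starts at the top

    def step(self, live_frame):
        local = np.array([cosine_dist(live_frame, r) for r in self.ref])
        # each score frame is reached by pausing on it or advancing from its
        # predecessor: a monotonic, never-backwards path
        advance = np.concatenate(([np.inf], self.acc[:-1]))
        self.acc = local + np.minimum(self.acc, advance)
        # the cursor is the cheapest score frame within the look-ahead band
        lo, hi = self.pos, min(self.pos + self.band, len(self.ref))
        self.pos = lo + int(np.argmin(self.acc[lo:hi]))
        return self.pos
```

Each step() call consumes one audio frame and returns a score frame; mapping that index through the score's frame → bar table is what turns the page.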
4 · Confidence
If the local cost surface has one sharp minimum, the follower is sure (high conf). If it's flat — lots of score positions fit roughly equally — conf drops and the cursor is essentially guessing. That's the number the tracking pill and the drift log report.
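The exact formula isn't spelled out here, so treat this as a hypothetical sketch: one standard recipe scores how much the best minimum beats the runner-up, relative to the surface's overall spread.

```python
import numpy as np

def confidence(costs: np.ndarray) -> float:
    """Hypothetical conf formula, not the app's actual one: margin of the
    best match over the second-best, scaled by the surface's spread.
    One sharp minimum in a high plain -> near 1; a flat surface where
    everything fits roughly equally -> near 0."""
    best, second = np.partition(costs, 1)[:2]   # two smallest costs
    spread = costs.mean() - best + 1e-9
    return float(np.clip((second - best) / spread, 0.0, 1.0))
```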
5 · The hard problems — polyphony, pedal, voicing, wrong notes
A pianist's recording is not the clean MIDI a follower wishes it were. The things that actually make this hard, and what each one does to the aligner:
Polyphony. Four voices of a fugue sounding at once is not a melody — it's a thick stack of pitches. A 12-bin chroma frame just sums them, so two harmonically similar passages (a subject and its answer a fifth apart, a sequence repeated up a step) collapse onto nearly the same shape → the cost surface goes flat and the cursor can latch onto the wrong one. Fix: warp note events (an 88-key pianoroll from a transcriber), not 12 folded bins — that keeps register and voice-count, so the answer no longer looks like the subject.
The sustain pedal. Held pedal lets notes ring into each other — the chroma stops being "what's struck now" and becomes "everything still vibrating", and the onset/spectral-flux spikes that mark note attacks get buried under the wash. The cursor slows or stalls right where the music is most legato. Fix: a note-onset front-end (which already separates onset from sustain) keys on the attacks; longer-term, model the pedal explicitly (it's in the MusicXML — RUMAA's score encoder reads pedal markings).
Voicing. Which note should the cursor sit on when a whole chord lands at once — the top voice the listener tracks, the bass, all of them? And in a fugue, the "current bar" is well-defined but the "current note" isn't. We map to bar / score-quarter (not a single note id) precisely so voicing ambiguity doesn't make the cursor flicker; the bouncing dot lands on the bar's notes as a group.
Wrong / extra / missed notes. A student playing along is not playing the score — a fluffed note, a dropped one, an added grace. A pure DTW will happily warp the score onto the mistakes (treating an inserted note as "the music sped up") and drift. Fix: an alignment that allows edit operations — Match / Insert / Delete — instead of forcing every audio frame onto some score position; that's exactly RUMAA's T3 stream (and its mistake-detection output flags the missing/extra notes). For now the human safety net is tap-to-anchor: hit the bar you're really on and the tracker re-bases.
Repeats & skips. A repeat sign means the same audio appears twice; a D.C. or a cut in performance means the score order isn't the play order. A naive warp must go forward only, so it either smears the repeat or jumps. Fix: handle the repeat structure in the score model — RUMAA emits a Repeat edit op rather than needing a manually unfolded score; until then we keep a "lost → re-acquire" heuristic in the live aligner and, again, tap-to-anchor.
Tempo extremes & rubato. A big ritardando or a Lisztian rush stretches the warping path faster than its bounded search window can follow; a held fermata over near-constant chroma gives it nothing to move on. Fix: a tempo prior (the aligner tracks a running beats-per-second and biases the path toward it — Arzt & Widmer's adaptive normalisation), plus the onset spacing as a backup clock through held chords.
Mic & room. A live mic adds room reverb, a high noise floor, and (on a phone) a low-quality capture — all of which inflate the audio-side noise the chroma/onset features see. Fix: the confidence read-out is honest about it (low conf = "I'm guessing"), and the gating only auto-turns the page when confidence is above threshold.
The common thread: chroma is harmony, not notes, and a plain DTW must warp every frame onto something. So the fix is to recognise the actual notes, not an approximate position:
A note-level transcription front-end ahead of the aligner: Spotify's basic-pitch (Bittner et al., ICASSP 2022) turns the audio into note onsets, and the warp runs on that note-event sequence against the score's notes instead of 12-bin chroma. That's the "AMT + DTW" recipe that scores ~96–99% alignment F1 in the literature (RUMAA Table 2); basic-pitch is being wired into the offline trace pipeline, and the Metrics tab will report which traces use it. Meanwhile the copyable drift log is the fastest feedback: it tells us exactly which bars still drift and why. (Two more rungs after that: a learned alignment cost à la Agrawal & Dixon, then RUMAA-class repeat-aware end-to-end alignment once weights are public — see Sources.)
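What that front-end looks like end to end, sketched with basic-pitch's documented predict() call and librosa's offline DTW; the piano-roll rasterisation is simplified and the score-side roll is a stand-in (the real pipeline would build it from the MusicXML):

```python
import numpy as np
import librosa
from basic_pitch.inference import predict

# 1 · Transcribe the recording into note events.
#    Each event is (start_s, end_s, midi_pitch, amplitude, ...).
model_output, midi_data, note_events = predict("take.wav")  # filename illustrative

# 2 · Rasterise the events into an 88-key piano roll (10 fps, illustrative).
def to_roll(events, fps=10):
    dur = max(end for _, end, *_ in events)
    roll = np.zeros((88, int(dur * fps) + 1))
    for start, end, pitch, amp, *_ in events:
        roll[int(pitch) - 21, int(start * fps):int(end * fps) + 1] = amp  # 21 = A0
    return roll

audio_roll = to_roll(note_events)

# 3 · Warp it against the score's piano roll. Here a stand-in loaded from
#    disk; the real pipeline would build it from MusicXML (e.g. via partitura).
score_roll = np.load("score_roll.npy")   # (88, n_score_frames), hypothetical file
D, wp = librosa.sequence.dtw(X=audio_roll, Y=score_roll)
# wp is the (audio frame, score frame) path: the raw trace, before smoothing
```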
6 · The end-state — RUMAA
The furthest rung: RUMAA — Chang, Dixon & Benetos, Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection (WASPAA 2025 · arXiv:2507.12175). Instead of warping features, it's a transformer that reads the MusicXML score and the audio together and writes the alignment out as text — the pipeline is the diagram at the top of this tab.
The three output streams are proxy tasks that force the model to stay internally consistent — T1 = "what was actually played, time-stamped", T2 = "the score notes, re-timed to the performance", T3 = the explicit edit script lining them up. T3 is the bit that handles repeats: the model emits a Repeat op rather than needing a pre-unfolded score. Result: ~98% note-onset F1 (±50 ms) even on scores with repeat signs, where the DTW / HMM baselines (including the AMT+DTW recipe in step 5) collapse to 12–36%.
Why it isn't the live tracker today: the paper is explicit — RUMAA is offline ("the model struggles with long audio over one minute … online use unexplored"), it's trained on (n)ASAP (~2 GPU-days), and there are no public weights yet. So the plan is to wire it in as the offline trace generator — the thing that builds the precomputed traces this app follows — the day the authors release it. Until then, step 5's basic-pitch note-event DTW is the best offline path we can run, and our own OLTW (Dixon 2005) is the live one. (Figure not reproduced — see the paper.)
Reading the Signals panels
chroma 12-bin — the live pitch-class energies, brighter = louder.
waveform — raw amplitude over the last instant.
FFT log-frequency — the spectrum on a log axis (one octave = one unit), so harmonics line up.
onset novelty — spectral flux; spikes = note attacks.
confidence — the aligner's running trust, 0–1.
tracking · measure over time (Tracking tab) — the matched bar plotted against audio time; a clean monotone line = locked, a flat run = stuck, a jump = it re-snapped.
The science the follower stands on — what it runs today, and the work it's moving toward. ▸ = a method/library actually in the running code.
Score following · alignment methods
▸ Dixon (2005) — Live Tracking of Musical Performances Using On-Line Time Warping (DAFx 2005). PDF. The OLTW algorithm our live aligner runs (mvp/align_online.py).
Dixon & Widmer (2005) — MATCH: A Music Alignment Tool Chest (ISMIR 2005). PDF · code. Offline DTW with the forward-path constraint (the shape of our precomputed traces).
Macrae & Dixon (2010) — Accurate Real-time Windowed Time Warping (ISMIR 2010). PDF. Windowed OLTW — the bounded tempo-window trick our online aligner uses.
Arzt & Widmer (2008) — Automatic Page Turning for Musicians via Real-Time Machine Listening (ECAI 2008); Arzt, Widmer & Dixon (2012) — Adaptive Distance Normalization for Real-Time Music Tracking (EUSIPCO 2012). PDF. The page-turn application + the onset-feature normalization on our roadmap.
Agrawal & Dixon (2020/2021) — Learning Frame Similarity for Audio-to-Score Alignment (EUSIPCO 2020) · A Convolutional-Attentional Framework for Structure-Aware Performance-Score Synchronization (ICASSP 2021). EUSIPCO PDF · arXiv:2101.03937. Learned alignment cost functions (a drop-in replacement for the chroma distance).
Park, Cancino-Chacón, Chiruthapudi & Nam (2025) — Matchmaker: An Open-Source Library for Real-Time Piano Score Following and Systematic Evaluation (ISMIR 2025). arXiv:2510.10087 · code. Reference OLTW-Dixon / OLTW-Arzt + the AR / AE / AAE / MAE metric suite the Metrics tab follows.
Chang, Dixon & Benetos (2025) — RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection (WASPAA 2025). WASPAA PDF · arXiv:2507.12175. A transformer that aligns human-readable MusicXML with repeats straight to full-length audio, transcribes it, and flags wrong / missing / extra notes — 98.4 alignment F1 even with repeats, where DTW/HMM baselines collapse to 12–36. Offline (≤1-min chunks), no public weights yet; this is where the precomputed-trace generator goes when they release.
Henkel & Widmer — Real-Time Music Following in Score Sheet Images via Multi-Resolution Prediction (a copy is ingested at docs/Real-Time_Music_Following_in_Score_Sheet_Images_vi.pdf); Henkel, Kelz & Widmer (2020) — Learning to Read and Follow Music in Complete Score Sheet Images (ISMIR 2020). arXiv:2007.10736. Follows audio against a photo of the page — no MusicXML needed. Research track.
Onset / beat / note-level front-ends
▸ Dixon (2006) — Onset Detection Revisited (DAFx 2006). PDF. Spectral-flux onset novelty (the "onset detection" panel in Signals).
Dixon (2007) — Evaluation of the Audio Beat Tracking System BeatRoot (J. New Music Research). PDF · code. Beat tracking — a soft tempo constraint we could feed the warp path.
Bittner et al. / Spotify Research (2022) — A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation — basic-pitch (ICASSP 2022). arXiv:2203.09893 · code. The note-level transcription front-end being wired in ahead of the aligner — warp a sequence of note onsets, not 12-bin chroma.
Hawthorne et al. (2018) — Onsets and Frames: Dual-Objective Piano Transcription (ISMIR 2018). arXiv:1710.11153. The piano-transcription baseline.
Pilataki, Mauch & Dixon (2024) — Pitch-aware Generative Pretraining Improves Multi-pitch Estimation with Scarce Data (ACM MM Asia 2024). PDF. Pretraining trick for transcription when annotated data is thin.
Murgul & Heizmann (2025) — Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer (SMC 2025). arXiv:2507.00466.
Wang, Ewert & Dixon (2016) — Identifying Missing and Extra Notes in Piano Recordings Using Score-Informed Dictionary Learning (ISMIR 2016). PDF. The basis for a "you missed / added a note here" practice layer.
Transcription & OMR engines · symbolic encoders
Chang, Benetos, Kirchhoff & Dixon (2024) — YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures (MLSP 2024). arXiv:2407.04822. The audio encoder RUMAA reuses; SOTA multi-instrument transcription.
Gardner et al. / Google Magenta (2022) — MT3: Multi-Task Multitrack Music Transcription (ICLR 2022). code.
Zeng et al. (2023) — CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic MIR (ISMIR 2023). arXiv:2304.11029 · code. The M3 score encoder RUMAA builds on.
Audiveris — github.com/Audiveris/audiveris. The optical-music-recognition engine behind "Convert PDF → MusicXML" (self-hosted builds only).
Datasets, benchmarks & evaluation
Foscarin et al. (2020) — ASAP: a Dataset of Aligned Scores and Performances for Piano Transcription (ISMIR 2020). PDF · code. Aligned piano scores + performances — most of the Piano-solo tab and the note-level alignment ground truth the Metrics tab scores against.
Hawthorne et al. (2019) — Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset (ICLR 2019). arXiv:1810.12247 · dataset. ~200 h of real Yamaha-Disklavier recordings — the source of genuine demo audio.
Cheng et al. (2021) — ACPAS: Aligned Classical Piano Audio and Score (ISMIR 2021). dataset.
MIREX — Real-time Audio to Score Alignment (Score Following). protocol wiki. Alignment Rate @ 50/100/200/500/1000/2000 ms + mean/median absolute error (ms & beats) — the metric set the Metrics tab uses.
Public-domain corpora: OpenScore Lieder (the Lieder tab) · KernScores · Mutopia · IMSLP (editions for the PDF / OMR scores) · Open Well-Tempered Clavier (Kimiko Ishizaka, CC0 — the Bach fugue recordings).
Libraries we run on
▸ Verovio (reference book) — the WASM music engraver that renders every score in the browser.
▸ pymatchmaker / matchmaker — real-time score-to-audio alignment (an alternative live path alongside our own mvp/align_online.py).
▸ CPJKU/partitura — parses MusicXML / MEI into the note arrays the aligner and the alignment map need.
▸ librosa — STFT, CQT-chroma, onset novelty, sub-sequence DTW.
▸ FastAPI — the server; audio + scores are served from Cloudflare R2, the app deploys on Railway.
Graphify · how it all connects
The same machinery as the Theory tab, drawn as a graph: who feeds whom. Sound (mic or a real recording) is turned into features — chroma (12 pitch-class energies), an FFT spectrum, an onset novelty curve, and (offline) a basic-pitch 88-key pianoroll. Those drive the aligner: DTW offline (one trace per recording) and OLTW live (real-time). The aligner emits a trace = a list of (audio second → score quarter) ticks; the cursor reads it (binary-search at a small look-ahead — sketched below the diagram) to pick the bar to highlight and decide the page turn. The score (MusicXML/MEI) goes through Verovio to the engraved SVG + timemap, which is how a score-quarter becomes a bar's screen rectangle. Confidence gates the auto-turn; the anchor (tap a bar) re-syncs the live aligner and re-bases the trace. RUMAA (the transformer; offline, no public weights yet) is the future end-state that would replace the DTW+OLTW pair.
Sound mic / recording
Chroma 12-bin
FFT spectrum
Onset novelty
basic-pitch 88-key roll
DTW offline · per take
OLTW live · real-time
Trace audio→quarter
Cursor +look-ahead
Bar highlight
Page turn
Score MusicXML/MEI
Verovio
SVG + timemap bar geometry
Confidence
Anchor
RUMAA transformer · future
inputs
outputs
future
Read it left-to-right: sound → features → aligner → trace → cursor → what you see. The score lane (bottom-left) feeds the cursor the bar geometry. Anchor closes the loop back into the live aligner. RUMAA (dashed) is where this is heading — one model that does the alignment + the edit-op tags end-to-end, when its weights are public.
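The cursor node is the simplest piece to show in code: reading a score position off the trace at a small look-ahead is one bisect plus an interpolation. A minimal sketch, with an illustrative tick format and look-ahead value:

```python
import bisect

LOOKAHEAD_S = 0.1  # illustrative: render slightly ahead of the sound

def score_position(trace, now_s):
    """trace: list of (audio_second, score_quarter) ticks, sorted by audio time.
    Returns the score quarter to highlight at playback time now_s."""
    t = now_s + LOOKAHEAD_S
    times = [a for a, _ in trace]           # precompute once in real code
    i = bisect.bisect_right(times, t) - 1
    if i < 0:
        return trace[0][1]                  # before the first tick
    if i >= len(trace) - 1:
        return trace[-1][1]                 # past the last tick: hold the end
    (t0, q0), (t1, q1) = trace[i], trace[i + 1]
    return q0 + (q1 - q0) * (t - t0) / (t1 - t0)  # interpolate between ticks
```

Verovio's timemap then turns the returned score quarter into a bar's on-screen rectangle; the highlight and the page-turn decision both hang off that lookup.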