Page Turner — pages that turn themselves

Sheet Turner _v0.2

page —/— m. — audio 0.0s tracking —

This web demo is our original two-week mockup • The iPad app has gone far beyond it

Get the beta →

listening mode

Pick a score, hit Play, start playing.

Open the library (L) or upload a PDF or MusicXML. Hit Play, the mic listens, pages turn at the right moment.

Library

Live signal · chroma 12-bin

RMS 0.00 · peak 0.00 · conf —

Audio waveform · live

FFT · log-frequency spectrum

Spectrogram · scrolling time × log-freq

Onset detection · novelty

Confidence · model trust

Notes heard · AI vs score

Hit Play — each detected onset that passes the harmonic gate lands here, with the score's expected pitches.

Alignment map · audio → score

x = seconds of audio · y = position in the score. The curve is the model's mapping of this performance; a wavy path over a heatmap = a real alignment, a straight diagonal = a linear estimate. Hit Play (demo mode) to see it move.

Live cost matrix · scrolling DTW landscape

x = score position (0 to end), y = audio time (recent at top). Dark = good match, bright = poor match. Yellow trail = the OLTW warping path. Live as you play.

Alignment map · audio → score

x = seconds of audio · y = position in the score. The curve is the model's mapping of this performance; the dot is where playback is on it. A diagonal line = a linear estimate (no precise alignment for this piece).

AI tracking · measure over time

m. — · page — · lag —

state — feat — qps — dtw — gold Δ — hes× — notes — harm —

Tempo curve · across the performance

Beat-by-beat tempo (score-quarters/sec ×60) read off the precomputed alignment trace, smoothed. When a piece carries several recordings they overlay (P1, P2…).

Rubato & rhythm · expressive timing

Pick a piece with an alignment trace, or hit REFRESH, to read its rubato — the tempo elasticity, the ritardandi / accelerandi, and a synthesis of the performance's timing.

Rubato = expressive deviation of played timing from steady/notated timing. We read it as the local slope of the alignment trace (score-quarters per audio-second): >1 = rushing (accel / stringendo), <1 = stretching (ritard / agogic hold). The headline rubato number is std/mean of that local tempo. Because the read comes off the aligner, a mis-track (wrong-bar then snap-back) shows a fake "rush" — so a gesture under low alignment confidence, or a musically implausible swing, is flagged uncertain, and a mostly-unsure trace says so instead of over-claiming. Note resolution: /api/rhythm sharpens this to per-onset timing deviation when heard onsets are available.
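A minimal sketch of that read, assuming a trace is a list of (audio second, score-quarter) ticks. The function names are ours for illustration, the smoothing step is left out, and the choice of population standard deviation is an assumption; the text above only says "std/mean".

```python
import statistics

def local_tempo(trace):
    """Slope of each trace segment, in score-quarters per audio-second."""
    slopes = []
    for (t0, q0), (t1, q1) in zip(trace, trace[1:]):
        if t1 > t0:  # skip zero-length segments
            slopes.append((q1 - q0) / (t1 - t0))
    return slopes

def mean_bpm(trace):
    """Overall tempo: score-quarters per second x 60 (beat = quarter note)."""
    (t0, q0), (t1, q1) = trace[0], trace[-1]
    return 60.0 * (q1 - q0) / (t1 - t0)

def rubato_index(trace):
    """Headline rubato number: std / mean of the local tempo."""
    slopes = local_tempo(trace)
    mean = statistics.fmean(slopes)
    return statistics.pstdev(slopes) / mean if mean > 0 else 0.0

steady = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # a metronome
elastic = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 6.0)]  # holds back, then rushes

steady_rubato = rubato_index(steady)    # 0.0
elastic_rubato = rubato_index(elastic)  # about 0.41
```

Both traces average 120 BPM; only the second has any give-and-take, which is what the rubato number isolates.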

Drift log · copy & send for analysis

        Source: —

        Trace: —

        FPS: —

Trace metrics · this performance's mapping

Pick a piece with an alignment trace, or hit REFRESH.

What these say about the sound → visual map. Each precomputed trace is a list of (audio second → score position) ticks — the table reads it back.

covers audio — how much of the recording the trace spans. <100% means the cursor falls back to a linear estimate past that point.
monotone — the trace should never move the cursor backwards in the score. 100% = clean; below that, the cursor jumps back somewhere (a bug, or a real repeat the chroma latched onto).
tempo — the overall score-quarters-per-second of this performance (and the equivalent BPM if the beat is a quarter note). rubato = how much the local tempo swings around that average (0 = a metronome, 0.3+ = heavy give-and-take).
vs metronome — the headline number. How far the alignment pulls away from a straight line, in seconds. Near 0 → the trace is basically linear and adds nothing over a clock. A few seconds → it is genuinely following the performer's timing — that's the gap a fixed page-turn timer would get wrong.
method / feature — offline-dtw / chroma_cqt = the chroma warp; offline-notes-dtw / basic-pitch-pianoroll-88 = the note-event front-end (sharper through trills + polyphony).
note-onset F1 — MIREX-style alignment accuracy at ±50/100/200 ms tolerance, scored against an aligned reference. Only shown when one pairs this recording (ASAP .match); our CC0 recordings have no such reference, so it reads “no reference” — everything above is reference-free.
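Three of the reference-free rows above can be sketched in a few lines, again assuming a trace is a list of (audio second, score position) ticks. The names are hypothetical, and taking "vs metronome" as the largest gap to a straight line is our assumption; the table could equally report a mean.

```python
def covers_audio(trace, audio_duration):
    """Percent of the recording that the trace spans."""
    span = trace[-1][0] - trace[0][0]
    return 100.0 * min(span / audio_duration, 1.0)

def monotone_percent(trace):
    """Percent of steps that do not move the cursor backwards in the score."""
    steps = list(zip(trace, trace[1:]))
    forward = sum(1 for (_, q0), (_, q1) in steps if q1 >= q0)
    return 100.0 * forward / len(steps)

def vs_metronome(trace):
    """Largest gap, in seconds, between the trace and a straight line
    drawn through its first and last tick (assumed definition)."""
    (t0, q0), (t1, q1) = trace[0], trace[-1]
    worst = 0.0
    for t, q in trace:
        linear_t = t0 + (q - q0) / (q1 - q0) * (t1 - t0)
        worst = max(worst, abs(t - linear_t))
    return worst

trace = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 6.0)]
coverage = covers_audio(trace, audio_duration=4.0)   # 75.0
monotone = monotone_percent(trace)                   # 100.0
gap = vs_metronome(trace)                            # 0.5

jumpy = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
jumpy_monotone = monotone_percent(jumpy)             # two of three steps go forward
```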

Compare · OMR-vs-twin structural diff

What is actually different between two engravings? Enter a library id (e.g. schumann-liszt-widmung) or a filename under /scores (e.g. Schumann.Liszt.Widmung.musicxml) for each side. Reads the same voice-aware grader the cascade trusts — so the verdict here can't drift from the pipeline's. MusicXML/MXL only (MEI twins out of scope for v1).

reference candidate

First principles — the building blocks

Before the follower itself: the handful of ideas every later section leans on, from what sound is to a computer up to what it means to align two performances. If anything below ever reads as jargon, the plain-language version is almost certainly one of these seven.

1 · Sound becomes a list of numbers

Sound is a pressure wave — air pushed and pulled. A microphone turns that pressure into a fluctuating voltage, and the computer measures that voltage at a fixed rate. Each measurement is a sample; the rate is the sample rate. CD audio is 44,100 samples per second (44.1 kHz). Sheet Turner resamples everything to 16 kHz — by the Nyquist limit that captures frequencies up to 8 kHz, and everything carrying musical pitch sits well below that. So inside the app, "the audio" is just a long array of numbers between −1 and +1.

2 · Frequency, pitch and amplitude

A pure tone is the signal repeating at a steady rate. That rate is its frequency, in hertz (Hz, cycles per second), and the ear hears it as pitch — faster is higher. The orchestral tuning A is 440 Hz. How far the signal swings each cycle is its amplitude, heard as loudness (measured in decibels, a logarithmic scale, because hearing is logarithmic). Doubling a frequency raises the pitch by exactly one octave: 220, 440 and 880 Hz are all the note A — the fact idea 5 is built on.
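To make the octave rule concrete, here is a small sketch using the standard equal-temperament convention (A = 440 Hz = MIDI note 69). The helper names are illustrative and not taken from the app.

```python
import math

A4_HZ = 440.0   # orchestral tuning A
A4_MIDI = 69    # MIDI note number of that A

def midi_to_hz(note):
    """Each semitone multiplies the frequency by 2**(1/12); 12 of them double it."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)

def hz_to_pitch_class(freq):
    """Pitch class 0..11 (0 = C, 9 = A), ignoring the octave."""
    midi = round(A4_MIDI + 12.0 * math.log2(freq / A4_HZ))
    return midi % 12

def amplitude_to_db(amplitude, reference=1.0):
    """Decibels relative to a reference amplitude (logarithmic scale)."""
    return 20.0 * math.log10(amplitude / reference)

octave_up = midi_to_hz(A4_MIDI + 12)                              # 880.0
classes = [hz_to_pitch_class(f) for f in (220.0, 440.0, 880.0)]   # [9, 9, 9]
half_amplitude_db = amplitude_to_db(0.5)                          # about -6 dB
top_piano_key_hz = midi_to_hz(108)                                # C8, about 4186 Hz
```

The last line is a quick check on idea 1: the highest piano fundamental sits comfortably under the 8 kHz ceiling of 16 kHz audio.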

3 · The Fourier transform and the spectrogram

A real musical sound is many frequencies at once. The Fourier transform is the mathematical tool that decomposes any signal into the pure frequencies that sum to it — the result is its spectrum (how much energy at each frequency). Music changes over time, so we don't transform the whole recording at once: the short-time Fourier transform (STFT) slides a short window — a few tens of milliseconds — along the signal and transforms each one. Stack those spectra side by side and you get a spectrogram: time across, frequency up, brightness for energy. The spectrogram is the raw material every later step reads.
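A toy version of the sliding transform, written with a naive DFT so it needs nothing beyond the standard library. The frame length, hop and Hann window are assumptions chosen for the demo, not the app's settings, and a real implementation would use an FFT.

```python
import cmath
import math

def hann(n):
    """Hann window: tapers each frame's edges to reduce spectral leakage."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (O(N^2), fine for a demo)."""
    n = len(frame)
    x = [s * w for s, w in zip(frame, hann(n))]
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags

def stft(samples, frame_len=256, hop=128):
    """One spectrum per window position; stacked, they form the spectrogram."""
    return [magnitude_spectrum(samples[start:start + frame_len])
            for start in range(0, len(samples) - frame_len + 1, hop)]

SAMPLE_RATE = 16_000
FRAME_LEN = 256  # 16 ms at 16 kHz

# A 1 kHz test tone, 1024 samples long.
tone = [math.sin(2 * math.pi * 1000.0 * t / SAMPLE_RATE) for t in range(1024)]
spectrogram = stft(tone, frame_len=FRAME_LEN, hop=128)

first = spectrogram[0]
peak_bin = max(range(len(first)), key=first.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN  # bin index -> frequency in Hz
```

Each bin is SAMPLE_RATE / FRAME_LEN = 62.5 Hz wide here, so the 1 kHz tone lands exactly on bin 16.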

4 · Harmonics and timbre

Pluck a string tuned to 220 Hz and it does not only vibrate at 220 — it also rings at 440, 660, 880, 1100…, whole-number multiples called the harmonic series. The lowest, the fundamental, fixes the pitch you name. The relative loudness of the upper harmonics (overtones) is the timbre — why a piano and a clarinet on the same note are instantly told apart. Timbre is also the follower's adversary: it must recognise pitch through wildly different timbres — your piano, the demo's piano, a phone mic's colouration.

5 · Pitch classes and the chroma vector

Western music names twelve pitch classes — C, C♯, D…B — and (idea 2) any two notes an octave apart are the same class. The chroma vector exploits this: it folds the whole spectrum onto those twelve bins, summing every octave's C-energy into one number, every C♯ into the next, and so on. The result is a compact 12-number fingerprint of "what harmony is sounding now" that barely depends on octave or instrument. That invariance is exactly what lets the follower compare your performance against a reference — see idea 7. Chroma is the workhorse feature of the live tracker.
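The folding step can be sketched as below: map each spectrum bin to its nearest semitone, take that semitone modulo 12, and sum the energy. The function name, the 55 Hz low cut and the normalisation are illustrative assumptions; the live tracker's own chroma may be computed differently.

```python
import math

def chroma_from_spectrum(mags, sample_rate, n_fft, fmin=55.0):
    """Fold a magnitude spectrum onto 12 pitch classes (0 = C ... 11 = B)."""
    chroma = [0.0] * 12
    for k, mag in enumerate(mags):
        freq = k * sample_rate / n_fft
        if freq < fmin:  # skip DC and rumble (also avoids log2(0))
            continue
        midi = 69 + 12 * math.log2(freq / 440.0)  # A = 440 Hz is MIDI 69
        chroma[round(midi) % 12] += mag * mag     # sum energy across octaves
    total = sum(chroma)
    return [c / total for c in chroma] if total > 0 else chroma

# Synthetic spectrum: the note A in three octaves, nothing else.
SAMPLE_RATE, N_FFT = 16_000, 3_200  # 5 Hz per bin, so each A hits an exact bin
spectrum = [0.0] * (N_FFT // 2 + 1)
for hz, mag in ((220, 1.0), (440, 0.5), (880, 0.25)):
    spectrum[hz * N_FFT // SAMPLE_RATE] = mag

chroma = chroma_from_spectrum(spectrum, SAMPLE_RATE, N_FFT)
loudest_class = chroma.index(max(chroma))  # 9, i.e. A
```

All three octaves pile into one bin, which is the octave invariance the paragraph describes.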

6 · Onsets

An onset is the instant a note begins — the sharp jump in energy when a key is struck. Comparing each spectrogram frame to the one before and summing the increases gives the spectral flux, a curve that spikes at every attack. Picking those peaks gives the onset times; the spacing between them is a rough estimate of tempo. Onsets give the follower a clock to keep advancing through long held chords, where the chroma barely changes.
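A compact sketch of spectral flux and peak picking, assuming the spectrogram is a list of magnitude frames. The fixed threshold is a simplification for illustration; practical detectors usually adapt it to the local signal level.

```python
def spectral_flux(spectrogram):
    """Sum of the per-bin energy increases between consecutive frames."""
    flux = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(spectrogram, spectrogram[1:]):
        flux.append(sum(max(c - p, 0.0) for p, c in zip(prev, cur)))
    return flux

def pick_onsets(flux, threshold):
    """Frame indices where the flux is a local peak above the threshold."""
    onsets = []
    for i in range(1, len(flux) - 1):
        if flux[i] > threshold and flux[i] > flux[i - 1] and flux[i] >= flux[i + 1]:
            onsets.append(i)
    return onsets

# Two-bin toy spectrogram: one note starts at frame 2, a second at frame 4.
frames = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
flux = spectral_flux(frames)                  # [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
onset_frames = pick_onsets(flux, threshold=0.5)

HOP_SECONDS = 0.02                            # assumed hop between frames
onset_times = [i * HOP_SECONDS for i in onset_frames]
inter_onset = onset_times[1] - onset_times[0] # spacing: the rough tempo cue
```

Only increases are counted, so a note that keeps ringing adds nothing to the curve while a fresh attack does.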

7 · Alignment, dynamic time warping and score following

Two musicians never play a piece identically — different tempi, rubato, pauses, mistakes. Alignment is the task of pairing each moment of one rendition with the matching moment of another (or of the score). The classic algorithm is dynamic time warping (DTW): lay the two sequences on the axes of a grid, score every cell by how dissimilar the two frames are, and find the cheapest forward-only path corner to corner — that path is the alignment, stretching and squeezing time as needed. Score following is alignment done live: the performance arrives one frame at a time and the path must extend immediately, with no sight of the future. That constraint, and the algorithm that meets it (online time warping), is what the rest of this tab is about.
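The offline version of that grid search fits in a few lines. This is textbook DTW with the three classic steps (diagonal, down, right) and a cosine distance we picked for the example; it is a sketch, not the app's aligner.

```python
import math

def cosine_distance(a, b):
    """0 when two feature vectors point the same way, 1 when orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)

def dtw(seq_a, seq_b, dist=cosine_distance):
    """Return (total cost, path) of the cheapest forward-only alignment."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    # Walk back from the far corner along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1))
    path.reverse()
    return acc[n][m], path

# The "score" has two frames; the "performance" lingers on the first one.
score = [[1.0, 0.0], [0.0, 1.0]]
performance = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
total_cost, path = dtw(score, performance)
# path pairs score frame 0 with performance frames 0 and 1, then frame 1 with 2
```

The path repeating score frame 0 is the "stretching time" in the paragraph above: one score moment matched to two performance moments.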

The RUMAA pipeline · where this is heading

Our schematic of the architecture in Chang, Dixon & Benetos — RUMAA, WASPAA 2025 (Fig. 2; also arXiv:2507.12175). ❄ = frozen pretrained encoder. RUMAA is the offline trace generator we'd adopt once the authors release weights — see section 6. The live follower today is the chroma / OLTW pipeline below.

How the follower works

1 · From sound to features

The mic (or the demo recording) is a stream of samples. We slide a short window over it (a few tens of ms) and take its short-time Fourier transform — energy per frequency. Folding every octave onto one ring of 12 semitones gives the chroma vector: how much C, C♯, D… is sounding right now, independent of which octave or which instrument. Chroma is the workhorse because the same chord played by different pianos at different registers lands on roughly the same 12-bin shape — exactly what a score-follower needs.

2 · Onsets & tempo

Between two consecutive STFT frames, the rise in energy (spectral flux / novelty) spikes when a note is struck. Peaks in that novelty curve are onset candidates; their spacing is a noisy estimate of the local tempo. The follower uses this to keep moving forward even through a held chord where the chroma barely changes.

3 · Score following = online time warping

We pre-compute the score as a reference sequence (its chroma over score time). The live audio is a query sequence that arrives one frame at a time. On-line Time Warping (Dixon, 2005) keeps a running cost matrix and, each frame, extends the cheapest monotonic path through it — never going backwards, sometimes pausing, sometimes skipping ahead — so the latest audio frame is matched to the score position whose chroma fits best given everything heard so far. The matched score time → bar → page is what turns the page.
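The sketch below shows the flavour of the online update: one new accumulated-cost row per incoming frame, restricted to a forward window from the current position. It is a deliberately simplified forward-only dynamic programme, not Dixon's full algorithm (which also decides when to advance rows versus columns and normalises by path length), and the class name, window size and distance are our assumptions.

```python
INF = float("inf")

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

class OnlineFollower:
    """Toy online aligner: extends a monotonic path one audio frame at a time."""

    def __init__(self, reference, window=8, dist=l1_distance):
        self.reference = reference  # score features, e.g. chroma per score frame
        self.window = window        # how far ahead of the cursor we search
        self.dist = dist
        self.prev = None            # accumulated costs after the previous frame
        self.position = 0           # current score index (never decreases)

    def step(self, frame):
        lo = self.position
        hi = min(len(self.reference), lo + self.window)
        new = [INF] * len(self.reference)
        for j in range(lo, hi):
            candidates = []
            if self.prev is None:
                if j == 0:
                    candidates.append(0.0)             # path starts at the beginning
            else:
                candidates.append(self.prev[j])        # pause on the same score frame
                if j > 0:
                    candidates.append(self.prev[j - 1])  # advance by one
            if j > 0:
                candidates.append(new[j - 1])          # skip ahead within this frame
            new[j] = self.dist(frame, self.reference[j]) + min(candidates)
        self.position = min(range(lo, hi), key=new.__getitem__)
        self.prev = new
        return self.position

A, B, C = [1, 0, 0], [0, 1, 0], [0, 0, 1]
follower = OnlineFollower([A, B, C], window=3)
positions = [follower.step(frame) for frame in (A, A, B, C)]  # [0, 0, 1, 2]
```

The cursor pauses while the performer repeats the first frame and then advances, without ever seeing future audio.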

4 · Confidence

If the local cost surface has one sharp minimum, the follower is sure (high conf). If it's flat — lots of score positions fit roughly equally — conf drops and the cursor is essentially guessing. That's the number the tracking pill and the drift log report.
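One way to turn "sharp minimum versus flat surface" into a number between 0 and 1 is sketched below. This is a hypothetical measure for illustration (a softmax over the local costs, rescaled so that a flat row gives 0); the formula and the temperature are our assumptions, not the app's actual confidence computation.

```python
import math

def confidence(costs, temperature=0.1):
    """0 when every candidate position fits equally well, near 1 when a
    single position is clearly the cheapest. Hypothetical measure."""
    n = len(costs)
    if n < 2:
        return 1.0
    best = min(costs)
    # Softmax weights, shifted so the best candidate has weight exp(0) = 1.
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    p_best = 1.0 / sum(weights)
    # Rescale from [1/n, 1] to [0, 1].
    return (p_best - 1.0 / n) / (1.0 - 1.0 / n)

flat_row = [0.5, 0.5, 0.5, 0.5, 0.5]    # many positions fit equally: guessing
sharp_row = [1.0, 1.0, 0.0, 1.0, 1.0]   # one clear minimum: locked on

flat_conf = confidence(flat_row)        # 0.0
sharp_conf = confidence(sharp_row)      # close to 1

CONF_THRESHOLD = 0.6                    # assumed value, for illustration only
auto_turn_allowed = sharp_conf >= CONF_THRESHOLD
```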

5 · The hard problems — polyphony, pedal, voicing, wrong notes

A pianist's recording is not the clean MIDI a follower wishes it were. The things that actually make this hard, and what each one does to the aligner:

Polyphony. Four voices of a fugue sounding at once is not a melody — it's a thick stack of pitches. A 12-bin chroma frame just sums them, so two harmonically-similar passages (a subject and its answer a fifth apart, a sequence repeated up a step) collapse onto nearly the same shape → the cost surface goes flat and the cursor can latch onto the wrong one. Fix: warp note events (an 88-key pianoroll from a transcriber), not 12 folded bins — that keeps register and voice-count, so the answer no longer looks like the subject.
The sustain pedal. Held pedal lets notes ring into each other — the chroma stops being "what's struck now" and becomes "everything still vibrating", and the onset/spectral-flux spikes that mark note attacks get buried under the wash. The cursor slows or stalls right where the music is most legato. Fix: a note-onset front-end (which already separates onset from sustain) keys on the attacks; longer-term, model the pedal explicitly (it's in the MusicXML — RUMAA's score encoder reads pedal markings).
Voicing. Which note should the cursor sit on when a whole chord lands at once — the top voice the listener tracks, the bass, all of them? And in a fugue, the "current bar" is well-defined but the "current note" isn't. We map to bar / score-quarter (not a single note id) precisely so voicing ambiguity doesn't make the cursor flicker; the bouncing dot lands on the bar's notes as a group.
Wrong / extra / missed notes. A student playing along is not playing the score — a fluffed note, a dropped one, an added grace. A pure DTW will happily warp the score onto the mistakes (treating an inserted note as "the music sped up") and drift. Fix: an alignment that allows edit operations — Match / Insert / Delete — instead of forcing every audio frame onto some score position; that's exactly RUMAA's T3 stream (and its mistake-detection output flags the missing/extra notes). For now the human safety net is tap-to-anchor: hit the bar you're really on and the tracker re-bases.
Repeats & skips. A repeat sign means the same audio appears twice; a D.C. or a cut in performance means the score order isn't the play order. A naive warp must go forward only, so it either smears the repeat or jumps. Fix: handle the repeat structure in the score model — RUMAA emits a Repeat edit op rather than needing a manually unfolded score; until then we keep a "lost → re-acquire" heuristic in the live aligner and, again, tap-to-anchor.
Tempo extremes & rubato. A big ritardando or a Lisztian rush stretches the warping path faster than its bounded search window can follow; a held fermata over near-constant chroma gives it nothing to move on. Fix: a tempo prior (the aligner tracks a running beats-per-second and biases the path toward it — Arzt & Widmer's adaptive normalisation), plus the onset spacing as a backup clock through held chords.
Mic & room. Live mic adds room reverb, a noisy room floor, and (on a phone) a low-quality capture — all of which inflate the audio-side noise the chroma/onset features see. Fix: the confidence read-out is honest about it (low conf = "I'm guessing"), and the gating only auto-turns the page when confidence is above threshold.
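To show what "allowing edit operations" buys over a forced warp, here is a plain edit-distance alignment between score pitches and played pitches that labels each step Match, Insert (an extra note was played) or Delete (a score note was missed). It is a textbook dynamic programme on MIDI pitch numbers, offered as an illustration only; it is not RUMAA's model and it ignores timing.

```python
def edit_script(score, played):
    """Cheapest Match / Insert / Delete script turning `score` into `played`."""
    n, m = len(score), len(played)
    inf = float("inf")
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # every score note missed
    for j in range(1, m + 1):
        cost[0][j] = j  # every played note extra
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = cost[i - 1][j - 1] if score[i - 1] == played[j - 1] else inf
            cost[i][j] = min(match, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrack from the end, preferring Match, then Delete, then Insert.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i - 1] == played[j - 1]
                and cost[i][j] == cost[i - 1][j - 1]):
            ops.append(("Match", score[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("Delete", score[i - 1]))
            i -= 1
        else:
            ops.append(("Insert", played[j - 1]))
            j -= 1
    ops.reverse()
    return ops

score_notes = [60, 64, 67, 72]    # C E G C
played_notes = [60, 64, 66, 67]   # an added F sharp, and the top C dropped
script = edit_script(score_notes, played_notes)
```

The inserted note is labelled as extra instead of being read as "the music sped up", which is the drift failure described in the wrong-notes item.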

The common thread: chroma is harmony, not notes, and a plain DTW must warp every frame onto something.

The fix — recognising the actual notes, not an approximate position — is a note-level transcription front-end ahead of the aligner: Spotify's basic-pitch (Bittner et al., ICASSP 2022) turns the audio into note onsets, and the warp runs on that note-event sequence against the score's notes instead of 12-bin chroma. That's the "AMT + DTW" recipe that scores ~96–99% alignment F1 in the literature (RUMAA Table 2); basic-pitch is being wired into the offline trace pipeline and the Metrics tab will report which traces use it. Meanwhile the copyable drift log is the fastest feedback — it tells us exactly which bars still drift and why. (Two more rungs after that: a learned alignment cost à la Agrawal & Dixon, then RUMAA-class repeat-aware end-to-end alignment once weights are public — see Sources.)

6 · The end-state — RUMAA

The furthest rung: RUMAA — Chang, Dixon & Benetos, Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection (WASPAA 2025 · arXiv:2507.12175). Instead of warping features it's a transformer that reads the MusicXML score and the audio together and writes the alignment out as text — the pipeline is the diagram at the top of this tab.

The three output streams are proxy tasks that force the model to stay internally consistent: T1 = "what was actually played, time-stamped", T2 = "the score notes, re-timed to the performance", T3 = the explicit edit script lining them up. T3 is the bit that handles repeats: the model emits a Repeat op rather than needing a pre-unfolded score. Result: ~98% note-onset F1 (±50 ms) even on scores with repeat signs, where the DTW / HMM baselines (including the AMT+DTW recipe in step 5) collapse to 12–36%.

Why it isn't the live tracker today: the paper is explicit — RUMAA is offline ("the model struggles with long audio over one minute … online use unexplored"), it's trained on (n)ASAP (~2 GPU-days), and there are no public weights yet. So the plan is to wire it in as the offline trace generator — the thing that builds the precomputed traces this app follows — the day the authors release it. Until then, step 5's basic-pitch note-event DTW is the best offline path we can run, and our own OLTW (Dixon 2005) is the live one. (Figure not reproduced — see the paper.)

7 (bis) · The three paradigms in one picture

Three parallel traditions of real-time score-following, 1984–2025. The live tracker in this app runs on the signal/DTW lane (Dixon 2005); the upstream sequence-matching and HMM layers above are the queued fusion paths (see § 7a below and the Sources tab for full bibliographic references and links).

7a · Sequence theory — the note-event paradigm

The OLTW path above is the signal paradigm: every audio frame is a 12-bin chroma, every score frame is a 12-bin chroma, and a continuous warping path forces a match. The other paradigm is sequence matching: detect discrete note events from the audio (pitch + onset), then match that sequence against the score's note sequence as a string-matching problem. The history is older than DTW for music:

Vercoe (1984) — The Synthetic Performer in the Context of Live Performance (ICMC 1984). The first published score-follower. Pitch-tracked monophonic input matched against the expected note list with simple recovery heuristics.
Dannenberg & Mukaino (1988) — New Techniques for Enhanced Quality of Computer Accompaniment (ICMC 1988). Adds expectation-based matching: rather than «just match the next note», score the alignment over a small window so wrong / extra / missed notes degrade gracefully.
Puckette (1990) — EXPLODE: A User Interface for Sequencing and Score Following (ICMC 1990). Simplifies it to first-exact-match against the list of upcoming expected events — the cheap baseline every later paper compares to.
Puckette & Lippe (1992) — Score Following in Practice (ICMC 1992). Formalises the two-level approach still used in IRCAM-style live electronics: try a fast exact-match first, fall back to a slower «skip-list» search that tolerates omitted or substituted notes.
Puckette (1995) — Score Following Using the Sung Voice (ICMC 1995). Pairs two pitch trackers in parallel — one fast and imprecise, one slower but more reliable — so the matcher has both quick guesses and well-vetted commits to choose between, instead of trusting a single noisy estimate.
Cano, Loscos & Bonada (1999) — Score-Performance Matching using HMMs (ICMC 1999). Bridges the two paradigms with a left-to-right HMM whose emissions are a vector of low-level audio features (energy, zero-crossing rate, fundamental frequency, plus their first derivatives); self-transitions model note duration, and Viterbi decoding produces the most likely score-position path. The probabilistic ancestor of Cont's ANTESCOFO (2007) line cited below.

Shared weakness, useful warning. Both the Puckette-style sequence matcher and the Cano/Loscos/Bonada HMM ride directly on the robustness of pitch / fundamental-frequency detection — a smeared chord, a pedal wash, or a fast trill that confuses the front-end propagates straight into the matcher's decisions. Our chroma-DTW path is less brittle there (it averages over a 12-bin folded vector instead of committing to single-pitch labels) but pays for it in resolution: chroma cannot tell a C from a C an octave up, so it can't tell a fugue subject from its answer. The harmonic gate we added in _detect_notes() (Score Wizard pattern: peak/mean ≥ 1.85 on chroma columns) is the cheap counterpart of Puckette 1995's «reliable» pitch tracker — it filters non-musical onsets out of the note-event stream before any matcher gets to see them.
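The gate itself is a one-line test. The sketch below applies the peak/mean ≥ 1.85 rule quoted above to a single chroma column; the function name is ours, and the real _detect_notes() may apply it differently (for example across several columns).

```python
HARMONIC_GATE_RATIO = 1.85  # threshold quoted in the text

def passes_harmonic_gate(chroma, ratio=HARMONIC_GATE_RATIO):
    """True when one pitch class clearly dominates the 12-bin column,
    i.e. the frame looks like pitched music and not broadband noise."""
    mean = sum(chroma) / len(chroma)
    if mean <= 0.0:
        return False  # silence never passes
    return max(chroma) / mean >= ratio

# Broadband noise (a cough, a page rustle): energy spread evenly.
noise = [1.0] * 12
# A C major triad: energy concentrated on C, E and G.
chord = [0.9, 0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.0]

noise_passes = passes_harmonic_gate(noise)  # False: peak/mean = 1.0
chord_passes = passes_harmonic_gate(chord)  # True: peak/mean = 4.5
```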

The three paradigms are complementary, not competing. The signal path (chroma DTW) is robust to messy onsets but doesn't «know what note was played» — it never feeds back «you played a C, the score expects an E». The sequence path knows exactly that — the matched note pair, the substitution, the omission — but breaks when onset detection is noisy (the chord is smeared, the pedal swallows the attack, the trill is too fast). Layered together: OLTW carries the score position, the note-event matcher mirrors detected pitches against the score's expected pitches at that position, and the two streams correct each other.

What we run today already produces the inputs both layers need: mvp/align_online.py _detect_notes() emits per-chunk detected note events (pitch class + match flag against the score's expected pitches at the current beat), and the harmonic gate (Score Wizard pattern) filters non-musical onsets out of that stream. The missing piece is the Puckette-style matcher itself: take the recent detected-note window, slide it against the expected-note window from the score's pre-engraved timemap (cached in /static/precomputed/<id>.json), and emit a sequence-alignment confidence the OLTW can fuse with its chroma cost. Backlog item, with the sources above; not shipped yet, because it needs an A/B against a hard piece (BWV 848 fugue) before going live.

7 · Bootleg score layer for raw PDFs

The TISMIR 2021 piano score-following-video paper is now one of our source-of-truth designs for scanned scores. Its core idea: convert both the audio and the sheet image into the same sparse bootleg score representation, then align those matrices. Audio becomes note events first (audio → AMT/MIDI → notehead positions). PDF becomes image features directly (PDF page → staff lines + filled noteheads → notehead positions). This is exactly why the new Bootleg Lab exists.

MIDI boundary: the app still has no user-facing MIDI surface: no MIDI playback requirement, no hardware MIDI dependency, no MIDI feature for the performer to manage. Internally, note events are allowed and necessary. Audio → AMT → note-event matrices is the bridge into bootleg alignment; it is a private recognition layer, not the product surface.

The important architectural consequence: OMR is not the only path. MusicXML/MEI is still the best structured layer when available, but raw scans need a fallback that survives bad recognition, filler pages, repeats, and unknown jumps. The bootleg layer discards durations/accidentals/key signatures on purpose and keeps the geometry that is stable across audio and image: notehead positions in time.

8 · Hierarchical DTW for repeats and unknown jumps

The same paper aligns every sheet line against the whole audio with subsequence DTW, then runs a second segment-level DP over line candidates. That lets repeats and skips happen at line breaks without permitting arbitrary chaos. Our first implementation is now in mvp/bootleg.py: note-event bootlegs, image bootlegs, subsequence line matching, and segment-path selection with penalties for repeats/skips. The next step is feeding real PDF page images and user recordings into that pipeline and showing the resulting line path in this Lab tab.
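The first stage, matching one sheet line anywhere inside the full audio, is subsequence DTW: the same grid as ordinary DTW, but the path may start and end at any audio frame. The sketch below uses plain numbers and an absolute-difference distance to stay short; it is not the code in mvp/bootleg.py, and the second-stage segment DP over line candidates is not shown.

```python
def subsequence_dtw(query, target, dist):
    """Find where `query` best fits inside `target`.
    Returns (cost, start_index, end_index) in target coordinates."""
    n, m = len(query), len(target)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    for j in range(m):
        acc[0][j] = dist(query[0], target[j])  # free start: no cost to skip ahead
    for i in range(1, n):
        for j in range(m):
            best = acc[i - 1][j]
            if j > 0:
                best = min(best, acc[i - 1][j - 1], acc[i][j - 1])
            acc[i][j] = dist(query[i], target[j]) + best
    # Free end: the match may finish at any target frame.
    end = min(range(m), key=lambda j: acc[n - 1][j])
    # Backtrack until the first query row to recover the start frame.
    i, j = n - 1, end
    while i > 0:
        options = [(acc[i - 1][j], i - 1, j)]
        if j > 0:
            options += [(acc[i - 1][j - 1], i - 1, j - 1), (acc[i][j - 1], i, j - 1)]
        _, i, j = min(options)
    return acc[n - 1][end], j, end

def abs_diff(a, b):
    return abs(a - b)

line = [3, 4, 5]                  # one "sheet line" of features
audio = [9, 9, 3, 4, 4, 5, 9]     # the whole recording; the line sits in the middle
cost, start, end = subsequence_dtw(line, audio, abs_diff)
```

Running this once per line yields a start/end candidate for every line, which is the input the segment-level pass then orders, with penalties for repeats and skips.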

Timed mode is not recognition. It is an explicit opt-in fallback for rehearsal cases where the user knowingly wants a clock estimate. It stays off by default because page turns based only on elapsed time are fragile under rubato, repeats, pauses, and mistakes.

Reading the Signals panels

chroma 12-bin — the live pitch-class energies, brighter = louder.
waveform — raw amplitude over the last instant.
FFT log-frequency — the spectrum on a log axis (one octave = one unit), so harmonics line up.
onset novelty — spectral flux; spikes = note attacks.
confidence — the aligner's running trust, 0–1.
tracking · measure over time (Tracking tab) — the matched bar plotted against audio time; a clean monotone line = locked, a flat run = stuck, a jump = it re-snapped.

The science the follower stands on — what it runs today, and the work it's moving toward. ▸ = a method/library actually in the running code.

Score following · alignment methods

▸ Dixon (2005) — Live Tracking of Musical Performances Using On-Line Time Warping (DAFx 2005). PDF. The OLTW algorithm our live aligner runs (mvp/align_online.py).
Cont (2007) — Realtime Audio to Score Alignment with a Coupled Probabilistic Model (ISMIR 2007). PDF. The probabilistic/state-space camp behind ANTESCOFO-style score following; roadmap layer, not a replacement for OLTW.
Vercoe (1984) — The Synthetic Performer in the Context of Live Performance (ICMC 1984). archive. First published score-follower: pitch-tracked monophonic input matched against the expected note list. Foundational; the sequence-matching ancestor of every later note-event approach.
Dannenberg & Mukaino (1988) — New Techniques for Enhanced Quality of Computer Accompaniment (ICMC 1988). PDF. Expectation-based note-event matching; scores the alignment over a small window so dropped / added / substituted notes degrade gracefully.
Puckette (1990) — EXPLODE: A User Interface for Sequencing and Score Following (ICMC 1990). IRCAM archive. First-exact-match baseline of the sequence-matching paradigm; every later paper compares to this.
Puckette & Lippe (1992) — Score Following in Practice (ICMC 1992). Semantic Scholar. Formalises the two-level approach (fast exact match first, slower skip-list recovery on miss) still used in IRCAM-style live electronics.
Puckette (1995) — Score Following Using the Sung Voice (ICMC 1995). archive. Dual pitch tracker (one fast/imprecise + one slower/reliable) so the matcher can choose between quick guesses and well-vetted commits — the conceptual ancestor of our harmonic gate.
Cano, Loscos & Bonada (1999) — Score-Performance Matching using HMMs (ICMC 1999). Left-to-right HMM with multi-feature emissions (energy, zero-crossing rate, fundamental frequency + derivatives); self-transitions model note length, Viterbi decoding picks the score-position path. The probabilistic ancestor of the ANTESCOFO line.
Orio & Déchelle (2001) — Score Following Using Spectral Analysis and Hidden Markov Models (ICMC 2001). HAL PDF. The HMM-meets-spectral-features synthesis that surveys and combines the earlier tracks (Vercoe / Dannenberg / Puckette / Cano-Loscos-Bonada). Cited as the source of the multi-feature emission and pitch-tracker robustness framing in the new Theory section.
Cont — Anticipatory Score-Following. HAL PDF. Anticipation/prediction layer for concert-grade latency once the base tracker is reliable.
Dixon & Widmer (2005) — MATCH: A Music Alignment Tool Chest (ISMIR 2005). PDF · code. Offline DTW with the forward-path constraint (the shape of our precomputed traces).
Macrae & Dixon (2010) — Accurate Real-time Windowed Time Warping (ISMIR 2010). PDF. Windowed OLTW — the bounded tempo-window trick our online aligner uses.
Arzt & Widmer (2008) — Automatic Page Turning for Musicians via Real-Time Machine Listening (ECAI 2008); Arzt, Widmer & Dixon (2012) — Adaptive Distance Normalization for Real-Time Music Tracking (EUSIPCO 2012). PDF. The page-turn application + the onset-feature normalization on our roadmap.
Agrawal & Dixon (2020/2021) — Learning Frame Similarity for Audio-to-Score Alignment (EUSIPCO 2020) · A Convolutional-Attentional Framework for Structure-Aware Performance-Score Synchronization (ICASSP 2021). EUSIPCO PDF · arXiv:2101.03937. Learned alignment cost functions (a drop-in replacement for the chroma distance).
Park, Cancino-Chacón, Chiruthapudi & Nam (2025) — Matchmaker: An Open-Source Library for Real-Time Piano Score Following and Systematic Evaluation (ISMIR 2025). arXiv:2510.10087 · code. Reference OLTW-Dixon / OLTW-Arzt + the AR / AE / AAE / MAE metric suite the Metrics tab follows.
Chang, Dixon & Benetos (2025) — RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection (WASPAA 2025). WASPAA PDF · arXiv:2507.12175. A transformer that aligns human-readable MusicXML with repeats straight to full-length audio, transcribes it, and flags wrong / missing / extra notes — 98.4 alignment F1 even with repeats, where DTW/HMM baselines collapse to 12–36. Offline (≤1-min chunks), no public weights yet; this is where the precomputed-trace generator goes when they release.
Henkel & Widmer — Real-Time Music Following in Score Sheet Images via Multi-Resolution Prediction (a copy is ingested at docs/Real-Time_Music_Following_in_Score_Sheet_Images_vi.pdf); Henkel, Kelz & Widmer (2020) — Learning to Read and Follow Music in Complete Score Sheet Images (ISMIR 2020). arXiv:2007.10736. Follows audio against a photo of the page — no MusicXML needed. Research track.
▸ Shan & Tsai (2021) — Automatic Generation of Piano Score Following Videos (TISMIR 2021). TISMIR article. Source-of-truth for the new Bootleg Lab: convert audio/MIDI and sheet images into sparse notehead-position matrices, then align them with hierarchical DTW.
▸ Bootleg score alignment — MIDI-Sheet Music Alignment Using Bootleg Score Synthesis. arXiv:2004.10345. The direct model for our raw-PDF fallback: staff/notehead geometry first, full symbolic OMR second.
Müller — Fundamentals of Music Processing. DTW notebook · book resources. The baseline math for DTW, subsequence matching, and alignment-cost thinking.
Feffer (2022) — MeSA: Multi-Score Alignment. project page. Why PDF-in-the-wild still needs a human-verifiable system map and correction loop.

Onset / beat / note-level front-ends

▸ Dixon (2006) — Onset Detection Revisited (DAFx 2006). PDF. Spectral-flux onset novelty (the "onset detection" panel in Signals).
Dixon (2007) — Evaluation of the Audio Beat Tracking System BeatRoot (J. New Music Research). PDF · code. Beat tracking — a soft tempo constraint we could feed the warp path.
Bittner et al. / Spotify Research (2022) — A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation — basic-pitch (ICASSP 2022). project · arXiv:2203.09893 · code. The note-level transcription front-end being wired in ahead of the aligner — warp a sequence of note onsets, not 12-bin chroma.
Hawthorne et al. (2018) — Onsets and Frames: Dual-Objective Piano Transcription (ISMIR 2018). arXiv:1710.11153. The piano-transcription baseline.
Hawthorne et al. / Google Research — Onsets and Frames implementation notes. Google Research page. This is the reference shape for audio → piano note events before bootleg or MusicXML alignment.
Li et al. (2023) — MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training. arXiv:2306.00107 · MERT-v1-330M. Candidate embedding layer for robust audio confidence when chroma/onsets disagree.
Pilataki, Mauch & Dixon (2024) — Pitch-aware Generative Pretraining Improves Multi-pitch Estimation with Scarce Data (ACM MM Asia 2024). PDF. Pretraining trick for transcription when annotated data is thin.
Murgul & Heizmann (2025) — Beat and Downbeat Tracking in Performance MIDI Using an End-to-End Transformer (SMC 2025). arXiv:2507.00466.
Wang, Ewert & Dixon (2016) — Identifying Missing and Extra Notes in Piano Recordings Using Score-Informed Dictionary Learning (ISMIR 2016). PDF. The basis for a "you missed / added a note here" practice layer.
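
The spectral-flux onset novelty mentioned in the first Dixon entry can be sketched in a few lines of NumPy. This is a simplified illustration, not the app's detector: the function name `spectral_flux`, the frame size, and the hop size are our assumptions, and it omits the normalisation and peak-picking a real front-end needs.

```python
import numpy as np

def spectral_flux(y, n_fft=1024, hop=256):
    """Half-wave rectified spectral flux, one value per STFT frame.

    For each frame, sum the positive increases in magnitude over all
    frequency bins relative to the previous frame. Energy that appears
    (a note onset) raises the curve; energy that decays does not.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[k * hop : k * hop + n_fft] * window for k in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))       # shape: (frames, bins)
    increase = np.maximum(np.diff(mag, axis=0), 0)  # keep only rising energy
    flux = increase.sum(axis=1)
    return np.concatenate([[0.0], flux])            # first frame has no predecessor

# Toy signal: half a second of silence, then a 440 Hz tone.
sr = 8000
t = np.arange(sr) / sr
y = np.where(t >= 0.5, np.sin(2 * np.pi * 440 * t), 0.0)
novelty = spectral_flux(y)
onset_frame = int(np.argmax(novelty))
onset_time_s = onset_frame * 256 / sr  # start time of the frame with the largest flux
```

The curve stays at zero during the silence and peaks in the frames where the tone enters the analysis window, which is the behaviour the onset-detection panel plots.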

Transcription & OMR engines · symbolic encoders

Chang, Benetos, Kirchhoff & Dixon (2024) — YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures (MLSP 2024). arXiv:2407.04822. The audio encoder RUMAA reuses; SOTA multi-instrument transcription.
Gardner et al. / Google Magenta (2022) — MT3: Multi-Task Multitrack Music Transcription (ICLR 2022). code.
Wu et al. (2023) — CLaMP: Contrastive Language-Music Pre-training for Cross-Modal Symbolic MIR (ISMIR 2023). arXiv:2304.11029 · code. The M3 score encoder RUMAA builds on.
Ríos-Vila / Calvo-Zaragoza line of work (2024–2025) — Sheet Music Transformer, Practical End-to-End OMR for Pianoform Music, and Sheet Music Transformer++. SMT · Linearized MusicXML · SMT++. Research path for a future GPU OMR backend that emits MusicXML-like sequences directly from piano pages.
IJDAR (2024) — A unified representation framework for the evaluation of Optical Music Recognition systems. article. Why we keep MusicXML/MEI plus bootleg/system maps instead of trusting one opaque OMR output.
▸ Audiveris — github.com/Audiveris/audiveris · handbook. The optical-music-recognition engine behind "Convert PDF → MusicXML" (self-hosted builds only).
Matt Zucker — Page Dewarp. article/code notes. Pre-OMR image correction for camera scans before Audiveris or bootleg extraction.

Datasets, benchmarks & evaluation

Foscarin et al. (2020) — ASAP: a Dataset of Aligned Scores and Performances for Piano Transcription (ISMIR 2020). PDF · code. Aligned piano scores + performances — most of the Piano-solo tab and the note-level alignment ground truth the Metrics tab scores against.
Hawthorne et al. (2019) — Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset (ICLR 2019). arXiv:1810.12247 · dataset. ~200 h of real Yamaha-Disklavier recordings — the source of genuine demo audio.
Liu, Morfi & Benetos (2021) — ACPAS: Aligned Classical Piano Audio and Score (ISMIR 2021). dataset.
MIREX — Real-time Audio to Score Alignment (Score Following). protocol wiki. Alignment Rate @ 50/100/200/500/1000/2000 ms + mean/median absolute error (ms & beats) — the metric set the Metrics tab uses.
Public-domain corpora: OpenScore Lieder (the Lieder tab) · KernScores · Mutopia · IMSLP (editions for the PDF / OMR scores) · Open Well-Tempered Clavier (Kimiko Ishizaka, CC0 — the Bach fugue recordings).
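
The MIREX-style metric set listed above reduces to a few array operations once reference and estimated note times are paired. The sketch below is a simplified reading of those metrics, not the Metrics tab's code: `alignment_metrics` is a hypothetical name, every note is assumed to have both a reference and an estimated time (no missed notes), and the error in beats is left out because it needs tempo information.

```python
import numpy as np

THRESHOLDS_MS = (50, 100, 200, 500, 1000, 2000)

def alignment_metrics(reference_s, estimated_s, thresholds_ms=THRESHOLDS_MS):
    """Alignment rate at each threshold plus mean/median absolute error.

    reference_s: ground-truth note times in seconds.
    estimated_s: the follower's reported times for the same notes, in seconds.
    """
    ref = np.asarray(reference_s, dtype=float)
    est = np.asarray(estimated_s, dtype=float)
    if ref.shape != est.shape:
        raise ValueError("reference and estimate must pair up note for note")
    abs_err_ms = np.abs(est - ref) * 1000.0
    # Alignment rate: fraction of notes whose error is within the threshold.
    report = {f"rate@{t}ms": float(np.mean(abs_err_ms <= t)) for t in thresholds_ms}
    report["mean_abs_err_ms"] = float(abs_err_ms.mean())
    report["median_abs_err_ms"] = float(np.median(abs_err_ms))
    return report

# Four notes with errors of roughly 20, 80, 300 and 0 ms.
report = alignment_metrics([0.0, 1.0, 2.0, 3.0], [0.02, 1.08, 2.30, 3.0])
```

Reporting the rate at several thresholds matters because a follower can have a poor mean error from a few large misses while still landing most notes within 100 ms.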

Libraries we run on

▸ Verovio (reference book) — the WASM music engraver that renders every score in the browser.
▸ pymatchmaker / matchmaker — real-time score-to-audio alignment (an alternative live path alongside our own mvp/align_online.py).
▸ CPJKU/partitura — parses MusicXML / MEI into the note arrays the aligner and the alignment map need.
▸ librosa — STFT, CQT-chroma, onset novelty, sub-sequence DTW.
▸ FastAPI — the server; audio and scores are served from Cloudflare R2, and the app deploys on Railway.

Learn the fundamentals

Valerio Velardo — Audio Signal Processing for Machine Learning. YouTube series. Waveforms, the Fourier transform, spectrograms, chroma and MFCC — intuition and maths side by side.
Xavier Serra & Julius O. Smith — Audio Signal Processing for Music Applications. Coursera (free to audit). UPF / Stanford CCRMA: the DFT, the STFT and spectral models, hands-on in Python.
Meinard Müller — Fundamentals of Music Processing. FMP Notebooks · textbook. Chroma, onsets, DTW and score following, each explained then implemented in a Python notebook.
musicinformationretrieval.com — Stanford CCRMA's hands-on MIR notebooks: chroma, onset detection, dynamic time warping in runnable Python.
ISMIR — the International Society for Music Information Retrieval. Its openly archived proceedings are the primary record of music-information-retrieval theory — score following, transcription and alignment.
Steven W. Smith — The Scientist and Engineer's Guide to Digital Signal Processing. dspguide.com (free book). Sampling, the FFT and digital filters in plain language.

Console · web app log

Every console.log / info / warn / error the web app emits, plus uncaught errors and unhandled promise rejections, is captured into a ring buffer that keeps the last 400 entries. This is the web-side log; the iPad app keeps its own native detector / alignment log under iOS Settings → Debug console.
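
A ring buffer with a fixed capacity is the whole trick behind "last 400". The web app's own capture code runs in the browser and is not shown here; the Python sketch below only illustrates the data structure, and the class name `LogRing` and its methods are hypothetical.

```python
from collections import deque

class LogRing:
    """Keep only the most recent `capacity` log entries."""

    def __init__(self, capacity=400):
        # A deque with maxlen silently drops the oldest item when full.
        self._entries = deque(maxlen=capacity)

    def push(self, level, message):
        """Record one entry, e.g. push("warn", "mic permission denied")."""
        self._entries.append((level, message))

    def tail(self, n=None):
        """Return the newest n entries in chronological order (all if n is None)."""
        items = list(self._entries)
        if n is None:
            return items
        return items[max(0, len(items) - n):]

    def __len__(self):
        return len(self._entries)

# Push more entries than the buffer holds; the oldest 50 fall off the front.
ring = LogRing()
for k in range(450):
    ring.push("info", f"msg {k}")
```

Memory use stays bounded however long a session runs, at the cost of losing anything older than the last 400 entries.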


The score that turns itself.


Smart Annotations

Sheet Turner — 3 steps

A score follower in your browser

Pick a score

Watch the alignment

Two ways to use it

Step by step

What the icons do

Privacy

Inspector legend

Limits — read this

Calibrate this piece

Train Listening

Setup Systems

Correct OMR

Listening Points

Refine Cursor Sync

1 · Record

2 · Your take

3 · Train tracker on this recording

Published recordings