Speech is continuous. Text is discrete. Aligning audio frames to characters would require precisely annotating where each sound begins and ends. CTC (Connectionist Temporal Classification) eliminates this need, enabling end-to-end training from audio to transcription without frame-level labels.
Why Sutskever Included This
Deep Speech 2 used CTC to achieve state-of-the-art speech recognition. The technique enables sequence-to-sequence learning when input and output lengths differ and alignment is unknown. CTC powers voice assistants, transcription services, and accessibility tools.
The Alignment Problem
Given 200 audio frames and the target "hello", traditional approaches need labels for each frame: frames 1-20 produce 'h', frames 21-50 produce 'e', and so on. Creating such alignments is expensive and error-prone.
CTC sidesteps this. Given audio and transcript, it computes the probability of producing that transcript from the audio, summing over all possible alignments.
The Blank Token
CTC adds a special blank token (ε) to the output vocabulary. The network outputs a probability distribution over characters plus blank at each timestep.
Blank handles silence, transitions, and variable-length mappings. A network can output "ε ε h h ε e ε l ε l ε o ε ε" for "hello," with blanks filling the gaps.
Collapsing Rules
CTC converts raw outputs to final transcriptions through two rules:
1. Merge consecutive repeated characters
2. Remove all blank tokens
"h h ε e ε l ε l o" → "h ε e ε l ε l o" → "hello"
The order matters: repeats merge before blanks are removed, so the blank between the two l's is what preserves the double letter.
Different frame-level outputs can collapse to the same transcription, and CTC counts every such path toward that transcription's probability.
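A minimal sketch of the collapse step in plain Python (ε here stands in for whatever blank index a real network would use):

```python
BLANK = "ε"

def ctc_collapse(path):
    """Apply the CTC collapsing rules to a frame-level output path."""
    # Rule 1: merge consecutive repeated characters.
    merged = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    # Rule 2: remove all blank tokens.
    return "".join(c for c in merged if c != BLANK)

print(ctc_collapse(list("hhεeεlεlo")))  # -> hello
```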
The Forward Algorithm
Computing the probability of a transcription requires summing over all valid alignments, which can be exponentially many. Dynamic programming makes this tractable.
The forward algorithm computes partial sums efficiently: what is the probability of producing the first s symbols of the target using the first t frames? In practice the recursion runs over an extended target with blanks inserted between and around the characters, building up from small subproblems to the full probability.
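Here is a sketch of that recursion in NumPy, checked against brute-force enumeration on a tiny example (the frame count, vocabulary, and random distributions are made up for illustration):

```python
import itertools
import numpy as np

BLANK = 0  # index of the blank token

def ctc_forward(probs, target):
    """P(target | audio), summed over all valid alignments.

    probs:  (T, V) array; probs[t, v] = P(symbol v at frame t)
    target: label indices without blanks, e.g. [1, 2]
    """
    T = probs.shape[0]
    # Extended target: blanks between characters and at both ends.
    ext = [BLANK]
    for c in target:
        ext += [c, BLANK]
    S = len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, BLANK]   # a path starts with a blank...
    alpha[0, 1] = probs[0, ext[1]]  # ...or with the first character

    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]              # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1, s - 1]     # advance one symbol
            # Skipping a blank is allowed only between distinct characters.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]

    # A valid path ends on the last character or the trailing blank.
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]

# Sanity check against brute-force enumeration over all V**T paths.
rng = np.random.default_rng(0)
T, V = 4, 3                               # 4 frames, vocab {blank, a, b}
probs = rng.random((T, V))
probs /= probs.sum(axis=1, keepdims=True)

def collapse(path):
    merged = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return tuple(c for c in merged if c != BLANK)

brute = sum(np.prod([probs[t, c] for t, c in enumerate(p)])
            for p in itertools.product(range(V), repeat=T)
            if collapse(p) == (1, 2))
print(ctc_forward(probs, [1, 2]), brute)  # the two numbers match
```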
Loss Function
CTC loss is the negative log probability of the correct transcription:
L = -log P(target | audio) = -log Σ P(path), where the sum runs over every path that collapses to the target
Gradient computation uses the forward-backward algorithm. The network learns to increase probability for all paths that produce the correct output.
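In practice you rarely implement this yourself; frameworks ship it. A minimal sketch using PyTorch's built-in torch.nn.CTCLoss (the shapes, label indices, and random logits below are placeholders, not values from any real model):

```python
import torch
import torch.nn.functional as F

T, N, C = 50, 1, 28                      # frames, batch size, vocab (incl. blank)
logits = torch.randn(T, N, C, requires_grad=True)  # stand-in for network output
log_probs = F.log_softmax(logits, dim=-1)          # CTCLoss expects log-probs

targets = torch.tensor([[8, 5, 12, 12, 15]])       # "hello" as label indices
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.tensor([5])

ctc = torch.nn.CTCLoss(blank=0)          # blank conventionally at index 0
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients computed via the forward-backward algorithm
```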
Practical Considerations
Decoding: At inference, find the most probable transcription. Greedy decoding takes the highest-probability character at each frame. Beam search explores multiple hypotheses for better results.
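A sketch of greedy decoding, assuming a (T, V) matrix of per-frame probabilities with the blank at index 0:

```python
import numpy as np

BLANK = 0

def greedy_decode(probs):
    """Pick the best symbol per frame, then apply the collapsing rules."""
    best = probs.argmax(axis=1)       # most probable symbol at each frame
    merged = [c for i, c in enumerate(best) if i == 0 or c != best[i - 1]]
    return [int(c) for c in merged if c != BLANK]  # label indices, no blanks
```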
Language models: CTC models audio-to-character mapping. Combining with language models improves word-level accuracy by favoring likely word sequences.
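One common combination, often called shallow fusion, scores each candidate transcription W as (the exact form varies by system; α and β are tuned on held-out data):

score(W) = log P_CTC(W | audio) + α · log P_LM(W) + β · |W|

where |W| is the word count, α weights the language model, and β offsets its bias toward short outputs.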
Peaky outputs: CTC networks often produce confident spikes at character positions with blanks elsewhere. This makes decoding easier, but the spike positions are unreliable timestamps, which limits applications that need precise timing or duration information.
Beyond Speech
CTC applies wherever input and output sequences have different lengths with unknown alignment: handwriting recognition (stroke sequences to text), video captioning (frames to description), and music transcription (audio to notes).
More in This Series
Part of a series on Ilya Sutskever's recommended 30 papers.