The fastest free way to get a YouTube video's transcript in 2026 is yt-dlp, the free, open-source command-line downloader. It pulls the caption track straight from YouTube without downloading the video, and a one-line sed strips the timestamps so you are left with plain prose. That plain-text transcript is exactly what you paste into ChatGPT, Claude, or any AI summarizer to get a summary, key points, or a searchable record of a long talk without watching it. Paste a URL below and copy the command:
yt-dlp --skip-download --write-subs --write-auto-subs --sub-langs en --convert-subs srt "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
The builder gives you the download. The rest of this page is the detail: which caption types exist, listing what a video actually has, getting the words without the megabytes of video, cleaning the file up to plain text, and the fallback for videos that have no captions at all.
Pull transcripts you have the right to use. Reading a transcript for your own notes or feeding it to a summarizer is one thing; republishing someone else's captions as your own content is another, and bulk scraping can violate YouTube's Terms of Service. This guide is for the legitimate cases. What you do with the text is on you.
Two kinds of captions: uploaded vs auto-generated
Before you download anything, know what you are pulling. YouTube has two distinct caption tracks, and yt-dlp treats them with two different flags:
- Creator-uploaded captions (--write-subs). The video's owner uploaded or hand-corrected these. They are accurate, properly punctuated, and the ones you want when they exist. Many videos do not have them.
- Auto-generated captions (--write-auto-subs). YouTube's own speech-to-text. Almost every video with spoken English has them, so they are the reliable fallback, but they have no punctuation to speak of and they mangle names, jargon, and homophones. Good enough to feed an AI summarizer that fixes the grammar for you; not good enough to publish verbatim.
The practical move is to ask for both and let yt-dlp grab whichever is present. Uploaded wins when available; auto fills the gap.
List what a video actually has
Do not guess. List the caption tracks (both manual and automatic) first:
yt-dlp --list-subs "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
That prints two tables: the available subtitles (creator-uploaded) and the available automatic captions, each with a language code (en, en-US, es, and so on) and the formats on offer (vtt, srt, ttml). If the subtitles table is empty but the automatic-captions table lists en, you will be relying on --write-auto-subs. If both tables are empty, there are no captions at all and you skip to the Whisper fallback below.
Download the transcript only, no video
This is the whole point: get the words without fetching hundreds of megabytes of video. --skip-download tells yt-dlp to skip the media stream and pull only the subtitle file:
# Uploaded + auto captions, English, written out as SRT, no video
yt-dlp --skip-download --write-subs --write-auto-subs --sub-langs en --convert-subs srt "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
Breaking that down:
- --skip-download skips the video entirely; you only get the caption file.
- --write-subs requests creator-uploaded captions.
- --write-auto-subs requests YouTube's auto captions, so you still get something when there are no uploaded ones.
- --sub-langs en limits it to English. Use en.* to catch en-US/en-GB variants, or a comma list like en,es for several languages.
- --convert-subs srt normalizes whatever YouTube serves into SubRip (.srt).
You get a file named something like Video Title [VIDEO_ID].en.srt. Prefer WebVTT? Swap the format:
# Same, but WebVTT output instead of SRT
yt-dlp --skip-download --write-subs --write-auto-subs --sub-langs en --convert-subs vtt "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
SRT and VTT carry the same words; they differ only in the timecode syntax and a header line. Pick SRT if a downstream tool expects it, VTT if you are embedding in HTML5 video. For feeding an AI model, neither matters once you strip the timing in the next step.
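To make that difference concrete, the sketch below hand-converts a two-cue SRT sample into WebVTT with sed. It only illustrates the syntax gap and is not how yt-dlp performs the conversion internally; the file names sample.srt and sample.vtt and the cue text are placeholders made up for this example.

```shell
# Write a tiny two-cue SRT sample (placeholder content).
cat > sample.srt <<'EOF'
1
00:00:01,200 --> 00:00:03,800
Hello and welcome.

2
00:00:03,800 --> 00:00:06,000
Today we look at captions.
EOF

# WebVTT is a "WEBVTT" header line, a blank line, then the same cues with
# "." instead of "," as the millisecond separator on the timing lines.
{
  printf 'WEBVTT\n\n'
  sed -E 's/^([0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3}) --> ([0-9]{2}:[0-9]{2}:[0-9]{2}),([0-9]{3})/\1.\2 --> \3.\4/' sample.srt
} > sample.vtt

cat sample.vtt
```

The numeric index lines are left in place because WebVTT permits an optional identifier line before each cue. The caption text itself is untouched, which is why either format works equally well once the timing is stripped.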
Convert SRT/VTT to clean plain text
Here is the gap people hit: yt-dlp converts between subtitle formats, but it does not output a plain .txt. An SRT file is interleaved with sequence numbers, 00:00:01,200 --> 00:00:03,800 timestamp lines, and blank separators, which is noise when all you want is the prose. One sed removes all three:
# Strip indices, timestamp lines, and blanks down to plain prose
sed -E '/^[0-9]+$/d; /-->/d; /^$/d' "Video Title [VIDEO_ID].en.srt" > transcript.txt
That deletes any line that is only a number (the sequence index), any line containing --> (the timestamp), and any empty line, leaving the spoken text. The result is one line per caption cue, which most summarizers handle fine; if you want it reflowed into paragraphs, pipe it through fmt or paste it into your editor and reflow there. The same sed works on a .vtt file too, since VTT timestamp lines also contain --> (you may want to also drop the leading WEBVTT header line, which the number/-->/blank filter leaves behind).
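The sketch below runs the same filter on a small hand-written SRT sample so you can see what survives. It adds three optional steps that are assumptions of this example rather than part of the command above: tr strips Windows carriage returns, awk drops consecutive repeated lines (auto-generated tracks often repeat text across neighboring cues), and fmt reflows the result. The file names and cue text are placeholders.

```shell
# Placeholder sample: neighboring cues repeat a line, the way
# auto-generated caption tracks often do.
cat > sample.en.srt <<'EOF'
1
00:00:01,200 --> 00:00:03,800
so today we are looking at

2
00:00:03,800 --> 00:00:06,000
so today we are looking at
how captions are stored

3
00:00:06,000 --> 00:00:08,500
how captions are stored
EOF

# 1. tr removes carriage returns so CRLF files still match ^[0-9]+$ and ^$.
# 2. sed drops index lines, timestamp lines, and blank lines.
# 3. awk prints a line only when it differs from the previous one.
tr -d '\r' < sample.en.srt \
  | sed -E '/^[0-9]+$/d; /-->/d; /^$/d' \
  | awk '$0 != prev { print } { prev = $0 }' \
  > transcript.txt

cat transcript.txt
# so today we are looking at
# how captions are stored

# Optional: reflow the one-line-per-cue text into 72-column paragraphs.
fmt -w 72 transcript.txt > transcript.wrapped.txt
```

The awk step only removes a line when it is identical to the one directly before it, so a phrase the speaker genuinely repeats later in the video is kept. One limitation carries over from the sed filter itself: a caption line consisting solely of digits would be deleted along with the index lines.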
For a one-shot pipeline that downloads and cleans in a single command:
# Download the auto/uploaded EN captions, then immediately flatten to text
yt-dlp --skip-download --write-subs --write-auto-subs --sub-langs en --convert-subs srt -o "%(id)s.%(ext)s" "URL" \
&& sed -E '/^[0-9]+$/d; /-->/d; /^$/d' *.en.srt > transcript.txt
The -o "%(id)s.%(ext)s" keeps the filename predictable (just the video ID) so the sed glob is easy. For more on output templates and every other flag, the yt-dlp cheat sheet is the reference.
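If you run this often, the two steps can be wrapped in a small shell function. The sketch below is one way to do it, not a yt-dlp feature: the function name yt_transcript and its -n (dry run) switch are made up for this example. With -n it only prints the yt-dlp argument list it would run, which is a cheap way to check the flags before touching the network.

```shell
# yt_transcript [-n] URL [LANG]
# Hypothetical helper: download captions for URL (default language: en)
# and flatten them to transcript.txt. With -n, print the yt-dlp
# arguments instead of running anything.
yt_transcript() {
  dry_run=0
  if [ "${1:-}" = "-n" ]; then
    dry_run=1
    shift
  fi
  url=${1:-}
  lang=${2:-en}
  if [ -z "$url" ]; then
    echo "usage: yt_transcript [-n] URL [LANG]" >&2
    return 2
  fi

  # Build the command once so the dry run and the real run cannot drift apart.
  set -- yt-dlp --skip-download --write-subs --write-auto-subs \
    --sub-langs "$lang" --convert-subs srt -o "%(id)s.%(ext)s" "$url"

  if [ "$dry_run" -eq 1 ]; then
    # Display only: the arguments are printed unquoted.
    printf '%s ' "$@"
    printf '\n'
    return 0
  fi

  # Real run: fetch the captions, then strip indices, timestamps, blanks.
  "$@" && sed -E '/^[0-9]+$/d; /-->/d; /^$/d' ./*."$lang".srt > transcript.txt
}

# Show what would run for a Spanish track, without downloading anything.
yt_transcript -n "https://www.youtube.com/watch?v=dQw4w9WgXcQ" es
```

Run it in an empty working directory: like the one-liner above, the glob picks up every matching .srt file it finds, so leftovers from an earlier video would be concatenated into the same transcript.txt.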
No captions at all? Transcribe locally with Whisper
Some videos have neither uploaded nor auto captions: music, very new uploads, languages YouTube does not auto-caption well. When --list-subs comes back empty, the fallback is to download the audio and transcribe it yourself with OpenAI Whisper, a free, open-source speech-to-text model that runs entirely on your own machine, no API key, no per-minute cost:
# Grab the audio, then transcribe it to a .txt locally
yt-dlp -x --audio-format m4a -o "audio.%(ext)s" "URL" && whisper audio.m4a --model small --output_format txt
yt-dlp -x extracts the audio (see download YouTube audio for the full -x flow); the %(ext)s placeholder lets yt-dlp name the intermediate download by its real extension before converting it to audio.m4a. Then whisper transcribes audio.m4a and --output_format txt writes a clean audio.txt with no timestamps to strip, so there is no sed step here. The --model small is a good speed/accuracy balance on a laptop; tiny/base are faster, medium/large are more accurate but want a GPU. Install it with pip install -U openai-whisper (it needs ffmpeg, which you likely already have from yt-dlp).
If you want faster transcription on CPU, whisper.cpp is a C/C++ port of the same model; it is noticeably faster on a machine with no GPU and is also free and open source. Either one turns a captionless video into text without sending your audio to a third-party service.
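The caption path and the Whisper path can be joined into one decision. The sketch below assumes you have already attempted the caption download with -o "%(id)s.%(ext)s", so a successful run leaves ID.en.srt in the current directory; the function name to_text is hypothetical. It is a minimal illustration of the fallback logic, not a hardened script.

```shell
# to_text ID URL
# Hypothetical helper: if yt-dlp already wrote ID.en.srt, flatten it to
# ID.txt; otherwise download the audio and let Whisper write ID.txt.
to_text() {
  id=$1
  url=$2
  if [ -s "./$id.en.srt" ]; then
    # Captions exist: strip indices, timestamp lines, and blank lines.
    sed -E '/^[0-9]+$/d; /-->/d; /^$/d' "./$id.en.srt" > "./$id.txt"
  else
    # No captions: extract the audio as ID.m4a, then transcribe locally.
    yt-dlp -x --audio-format m4a -o "$id.%(ext)s" "$url" \
      && whisper "./$id.m4a" --model small --output_format txt
  fi
}

# Example call (commented out because the fallback branch needs the network):
# to_text dQw4w9WgXcQ "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
```

Either branch ends with ID.txt in the current directory, so whatever consumes the transcript does not need to know which path produced it.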
See also
- Download a YouTube video (any quality): the full video workflow, resolution selection, and the "Sign in to confirm you're not a bot" fix.
- Download YouTube audio: extract audio directly to M4A, Opus, or MP3 with -x --audio-format.
- yt-dlp cheat sheet: every flag worth knowing, including the subtitle options, in one reference.
Sources
Authoritative references this article was fact-checked against.
- yt-dlp README (official), github.com
- yt-dlp subtitle options (official docs), github.com
- OpenAI Whisper README (official), github.com
- whisper.cpp (official repository), github.com





