How to Scrape YouTube Transcripts With Puppeteer

Jonathan Geiger
web-scraping · puppeteer · youtube · tutorial

YouTube transcripts are invaluable for content creators, researchers, and developers who need to extract spoken content from videos. While YouTube provides transcripts through its interface, automating this process can save significant time when dealing with multiple videos or building applications that require transcript data.

In this comprehensive guide, we'll explore how to scrape YouTube transcripts using Puppeteer, a powerful Node.js library that controls headless Chrome browsers. We'll cover everything from basic setup to advanced error handling and provide a complete working solution.

Prerequisites

Before diving into the code, ensure you have the following:

  • Node.js installed on your machine (version 18 or higher for current Puppeteer releases)
  • Basic knowledge of JavaScript and async/await
  • Understanding of DOM manipulation and CSS selectors
  • Familiarity with browser automation concepts

Understanding YouTube's Transcript Structure

YouTube's transcript feature isn't always visible by default. The transcript panel appears after clicking the "Show transcript" button, usually located in the video description area. The transcript content is dynamically loaded and structured as follows:

  • Transcript Button: Located in ytd-video-description-transcript-section-renderer button
  • Transcript Container: Found in #segments-container
  • Individual Segments: Each transcript line is in yt-formatted-string elements

Setting Up the Project

First, create a new project directory and initialize it:

mkdir youtube-transcript-scraper
cd youtube-transcript-scraper
npm init -y

Install the required dependencies:

npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth

The stealth plugin helps avoid detection by making our automated browser appear more like a real user.

Basic Implementation

Let's start with a basic implementation that extracts transcripts from a YouTube video:

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Use stealth plugin to avoid detection
puppeteer.use(StealthPlugin());

const scrapeYouTubeTranscript = async (url) => {
  const browser = await puppeteer.launch({
    headless: "new",
    ignoreDefaultArgs: ["--enable-automation"]
  });

  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });

    // Navigate to YouTube video
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
    // Give the page a moment to settle (page.waitForTimeout was removed in
    // Puppeteer v22; a Promise-wrapped setTimeout works across versions)
    await new Promise(resolve => setTimeout(resolve, 2000));

    // Handle cookie banner (EU compliance)
    try {
      await page.evaluate(() => {
        const cookieButton = document.querySelector('button[aria-label*="cookies"]');
        if (cookieButton) {
          cookieButton.click();
          console.log('Closed cookie banner');
        }
      });
    } catch (e) {
      console.log('No cookie banner found');
    }

    // Click transcript button
    await page.waitForSelector('ytd-video-description-transcript-section-renderer button', { timeout: 10000 });
    await page.evaluate(() => {
      const transcriptButton = document.querySelector('ytd-video-description-transcript-section-renderer button');
      if (transcriptButton) {
        transcriptButton.click();
        console.log('Clicked transcript button');
      }
    });

    await new Promise(resolve => setTimeout(resolve, 2000));

    // Extract transcript text
    const transcriptText = await page.evaluate(() => {
      const segments = Array.from(document.querySelectorAll('#segments-container yt-formatted-string'));
      return segments.map(element => element.textContent?.trim()).filter(text => text && text.length > 0);
    });

    return transcriptText.join(' ');

  } catch (error) {
    console.error('Error scraping transcript:', error);
    throw error;
  } finally {
    await browser.close();
  }
};

// Usage
const videoUrl = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ';
scrapeYouTubeTranscript(videoUrl)
  .then(transcript => console.log('Transcript:', transcript))
  .catch(error => console.error('Failed to scrape transcript:', error));

Advanced Implementation with Error Handling

The basic implementation works for most videos, but YouTube's interface can vary. Here's a more robust version with comprehensive error handling:

const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

const formatTimestamp = (seconds) => {
  const mins = Math.floor(seconds / 60);
  const secs = seconds % 60;
  return `${mins.toString().padStart(2, '0')}:${secs.toString().padStart(2, '0')}`;
};

const scrapeYouTubeTranscriptAdvanced = async (url) => {
  const browser = await puppeteer.launch({
    headless: "new",
    ignoreDefaultArgs: ["--enable-automation"]
  });

  try {
    const page = await browser.newPage();
    await page.setViewport({
      width: 1280,
      height: 1024,
      deviceScaleFactor: 1,
    });

    // Navigate to YouTube video
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
    // page.waitForTimeout was removed in Puppeteer v22; use a Promise-wrapped setTimeout
    await new Promise(resolve => setTimeout(resolve, 2000));

    // Try to close cookie banner if present
    try {
      await page.evaluate(() => {
        const cookieButton = document.querySelector('button[aria-label*="cookies"]');
        if (cookieButton) {
          cookieButton.click();
          console.log('Closed cookie banner');
        }
      });
    } catch (e) {
      console.log('No cookie banner found');
    }

    // Click transcript button with multiple fallback selectors
    try {
      await page.waitForSelector('ytd-video-description-transcript-section-renderer button', { timeout: 10000 });
      await page.evaluate(() => {
        const transcriptButton = document.querySelector('ytd-video-description-transcript-section-renderer button');
        if (transcriptButton) {
          transcriptButton.click();
          console.log('Clicked transcript button');
        }
      });
      await new Promise(resolve => setTimeout(resolve, 2000));
    } catch (e) {
      console.log('Transcript button not found, trying alternative selectors');
      
      // Try alternative selectors
      try {
        await page.evaluate(() => {
          const selectors = [
            'button[aria-label*="transcript"]',
            'button[aria-label*="Transcript"]',
            '[data-target-id="engagement-panel-transcript"] button',
            '#transcript-button',
            'button[aria-label*="Show transcript"]'
          ];
          
          for (const selector of selectors) {
            const button = document.querySelector(selector);
            if (button) {
              button.click();
              console.log('Clicked transcript button with selector:', selector);
              return;
            }
          }
        });
        await new Promise(resolve => setTimeout(resolve, 2000));
      } catch (e2) {
        console.log('Alternative transcript button selectors failed');
      }
    }

    // Extract transcript text with multiple fallback selectors
    let transcriptText = await page.evaluate(() => {
      const segments = Array.from(document.querySelectorAll('#segments-container yt-formatted-string'));
      return segments.map(element => element.textContent?.trim()).filter(text => text && text.length > 0);
    });

    if (transcriptText.length === 0) {
      // Try alternative transcript selectors
      const alternativeTranscript = await page.evaluate(() => {
        const selectors = [
          '#segments-container span',
          '#segments-container div',
          '[data-target-id="engagement-panel-transcript"] span',
          '[data-target-id="engagement-panel-transcript"] div',
          '.ytd-transcript-segment-renderer span',
          '.ytd-transcript-segment-renderer div'
        ];
        
        for (const selector of selectors) {
          const elements = document.querySelectorAll(selector);
          if (elements.length > 0) {
            const texts = Array.from(elements).map(el => el.textContent?.trim()).filter(text => text && text.length > 0);
            if (texts.length > 0) {
              return texts;
            }
          }
        }
        return [];
      });

      if (alternativeTranscript.length === 0) {
        throw new Error('No transcript available for this video.');
      }
      
      transcriptText = alternativeTranscript;
    }

    // Format transcript with estimated timestamps (the scraped text carries
    // no timing data, so we approximate five seconds per segment)
    const transcript = transcriptText.map((text, index) => ({
      text,
      start: index * 5,
      duration: 5,
      timestamp: formatTimestamp(index * 5),
    }));

    const fullText = transcript.map(entry => entry.text).join(' ');
    
    return {
      url,
      transcript,
      fullText,
      wordCount: fullText.split(' ').length,
      segments: transcript.length,
    };

  } catch (error) {
    console.error('Error scraping transcript:', error);
    throw error;
  } finally {
    await browser.close();
  }
};

// Usage
const videoUrl = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ';
scrapeYouTubeTranscriptAdvanced(videoUrl)
  .then(result => {
    console.log('Full transcript:', result.fullText);
    console.log('Word count:', result.wordCount);
    console.log('Segments:', result.segments);
    console.log('First few segments:', result.transcript.slice(0, 5));
  })
  .catch(error => console.error('Failed to scrape transcript:', error));

Handling Common Issues

1. Cookie Consent Banners

YouTube shows cookie consent banners in EU regions. Our code handles this by looking for buttons with "cookies" in their aria-label and clicking them automatically.

2. Transcript Button Variations

YouTube's interface changes frequently. We use multiple selectors to find the transcript button, ensuring compatibility across different layouts.

3. Dynamic Content Loading

Transcripts are loaded dynamically. We pause briefly after each interaction to give the panel time to load; for better reliability, wait on a concrete condition (for example, a waitForSelector call on the segments container) rather than a fixed delay, since a fixed delay can fire too early on slow connections and wastes time on fast ones.
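The fixed delays used above either waste time or fire too early. A more reliable pattern is to poll for the content you actually need. Here is a minimal sketch; waitFor is a hypothetical helper name, not part of Puppeteer's API:

```javascript
// Poll a check function until it returns a truthy value or the timeout
// elapses. Works with any async check, e.g. a page.evaluate() call that
// counts transcript segments.
const waitFor = async (check, { timeout = 10000, interval = 250 } = {}) => {
  const deadline = Date.now() + timeout;
  while (Date.now() < deadline) {
    const result = await check();
    if (result) return result;
    await new Promise(resolve => setTimeout(resolve, interval));
  }
  throw new Error(`Condition not met within ${timeout} ms`);
};

// With Puppeteer, instead of a fixed two-second pause:
// await waitFor(() => page.evaluate(
//   () => document.querySelectorAll('#segments-container yt-formatted-string').length
// ));
```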

4. Rate Limiting

To avoid being blocked, consider:

  • Adding random delays between requests
  • Using residential proxies
  • Implementing retry logic with exponential backoff
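The retry idea from the list above can be sketched as a small wrapper with exponential backoff and a little jitter; withRetry is a hypothetical helper name:

```javascript
// Retry an async operation with exponential backoff: delays grow as
// baseDelayMs * 2^attempt, plus a small random jitter so parallel
// scrapers don't retry in lockstep.
const withRetry = async (fn, { retries = 3, baseDelayMs = 1000 } = {}) => {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= retries) throw error;
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
      console.log(`Attempt ${attempt + 1} failed, retrying in ${Math.round(delay)} ms`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
};

// Usage:
// withRetry(() => scrapeYouTubeTranscript(videoUrl), { retries: 3 });
```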

Best Practices

  1. Respect robots.txt: Always check YouTube's robots.txt file
  2. Implement caching: Store transcripts locally to avoid repeated requests
  3. Error handling: Always wrap your scraping code in try-catch blocks
  4. User agent rotation: Use different user agents to appear more natural
  5. Respect rate limits: Don't overwhelm YouTube's servers

Alternative: Using SocialKit YouTube Transcript API

If you're looking for a more reliable solution without managing browser instances and handling YouTube's constantly changing interface, consider using SocialKit's YouTube Transcript API:

curl "https://api.socialkit.dev/youtube/transcript?access_key=YOUR_ACCESS_KEY&url=https://youtube.com/watch?v=dQw4w9WgXcQ"

Example Response

{
  "success": true,
  "data": {
    "url": "https://youtube.com/watch?v=dQw4w9WgXcQ",
    "transcript": "[♪♪♪] ♪ We're no strangers to love ♪ ♪ You know the rules\nand so do I ♪ ♪ A full commitment's\nwhat I'm thinking of ♪ ♪ You wouldn't get this\nfrom any other guy ♪ ♪ I just wanna tell you\nhow I'm feeling ♪ ♪ Gotta make you understand ♪ ♪ Never gonna give you up ♪",
    "transcriptSegments": [
      {
        "text": "[♪♪♪]",
        "start": 0,
        "duration": 5,
        "timestamp": "00:00"
      },
      {
        "text": "♪ We're no strangers to love ♪",
        "start": 5,
        "duration": 5,
        "timestamp": "00:05"
      }
    ],
    "wordCount": 458,
    "segments": 61
  }
}
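Once a response arrives in this shape, pulling out the useful fields takes a few lines. A minimal sketch, assuming the success/data envelope shown in the example above; summarizeTranscriptResponse is a hypothetical helper name:

```javascript
// Extract the commonly used fields from the documented response envelope:
// the full transcript text, the first timestamped segment, and the word count.
const summarizeTranscriptResponse = (response) => {
  if (!response.success) throw new Error('Transcript request failed');
  const { transcript, transcriptSegments, wordCount } = response.data;
  return {
    preview: transcript.slice(0, 80),
    firstSegment: transcriptSegments[0]?.text,
    wordCount,
  };
};
```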

Benefits of using SocialKit:

  • Reliable extraction: Built-in handling of YouTube's interface changes
  • No browser management: No need to manage Puppeteer instances
  • Consistent results: Tested across thousands of videos
  • Automatic retries: Built-in retry logic for failed requests
  • Structured data: Returns both full text and timestamped segments
  • Scale-ready: Handle multiple videos without resource constraints

Conclusion

Scraping YouTube transcripts with Puppeteer is a powerful technique for automating content extraction. While the implementation requires careful handling of YouTube's dynamic interface and various edge cases, the results can be extremely valuable for content analysis, accessibility, and automation workflows.

For production applications or when dealing with large volumes of videos, consider using a dedicated API service like SocialKit's YouTube Transcript API to ensure reliability and save development time.

Remember to always respect YouTube's terms of service and implement appropriate rate limiting to avoid being blocked. Happy scraping!