How to Scrape YouTube Videos: Complete Guide
YouTube video data extraction is a common requirement for developers building analytics tools, content management systems, or research applications. This guide shows you how to scrape YouTube video metadata, comments, and transcripts using Puppeteer with practical code examples.
We'll cover three main data types you can extract from YouTube videos: video details (title, views, channel info), comments (with sorting and metadata), and transcripts (timestamped captions). Each section includes working code and links to comprehensive tutorials for advanced implementations.
Prerequisites
Before starting, ensure you have:
- Node.js installed (version 18 or higher; current Puppeteer releases require it)
- Basic knowledge of JavaScript and async/await
- Understanding of DOM manipulation and CSS selectors
- Familiarity with browser automation concepts
Setting Up the Project
Create a new project and install dependencies:
mkdir youtube-scraper
cd youtube-scraper
npm init -y
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
The stealth plugin helps avoid detection by making automated browser behavior appear more natural.
Extracting Video Details
Video details include metadata like title, view count, upload date, and channel information. Here's a basic implementation:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

async function scrapeVideoDetails(url) {
  const browser = await puppeteer.launch({
    headless: "new",
    ignoreDefaultArgs: ["--enable-automation"]
  });

  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });

    // Give YouTube's client-side rendering time to populate the page.
    // (page.waitForTimeout was removed in newer Puppeteer versions,
    // so use a plain delay instead.)
    await new Promise((resolve) => setTimeout(resolve, 3000));

    const videoDetails = await page.evaluate(() => {
      // Extract video title
      const titleElement = document.querySelector('h1.ytd-video-primary-info-renderer yt-formatted-string');
      const title = titleElement ? titleElement.textContent.trim() : '';

      // Extract view count
      const viewsElement = document.querySelector('yt-view-count-renderer .view-count');
      const viewsText = viewsElement ? viewsElement.textContent : '';

      // Extract channel name
      const channelElement = document.querySelector('ytd-video-owner-renderer .ytd-channel-name a');
      const channelName = channelElement ? channelElement.textContent.trim() : '';

      // Extract upload date
      const dateElement = document.querySelector('#info-strings yt-formatted-string');
      const uploadDate = dateElement ? dateElement.textContent : '';

      return {
        title,
        views: viewsText,
        channelName,
        uploadDate,
        url: window.location.href
      };
    });

    return videoDetails;
  } finally {
    await browser.close();
  }
}

// Usage
// scrapeVideoDetails('https://www.youtube.com/watch?v=dQw4w9WgXcQ').then(console.log);
This basic implementation extracts core video metadata. For production use, you'll need more robust selectors, error handling, and numeric parsing for view counts.
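For example, the view count arrives as display text like "1,234,567 views" or "1.2M views". Here is a minimal parsing sketch, assuming English-locale formatting; the parseViewCount helper is illustrative, not part of any library:

// Minimal sketch: convert view-count display text into a number.
// Assumes English-locale strings like "1,234,567 views" or "1.2M views".
function parseViewCount(viewsText) {
  if (!viewsText) return 0;
  const match = viewsText.replace(/,/g, '').match(/([\d.]+)\s*([KMB])?/i);
  if (!match) return 0;
  const multipliers = { K: 1e3, M: 1e6, B: 1e9 };
  const suffix = match[2] ? match[2].toUpperCase() : '';
  return Math.round(parseFloat(match[1]) * (multipliers[suffix] || 1));
}

// parseViewCount('1.2M views');      // 1200000
// parseViewCount('1,234,567 views'); // 1234567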
For advanced video details extraction, see How to Extract YouTube Video Details Using Puppeteer, which covers advanced selectors, like/dislike parsing, duration extraction, and comprehensive error handling.
Extracting Comments
YouTube loads comments dynamically, so the scraper has to scroll down and wait for them to render. Here's a simple implementation:
async function scrapeComments(url, limit = 10) {
  const browser = await puppeteer.launch({
    headless: "new",
    ignoreDefaultArgs: ["--enable-automation"]
  });

  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 1024 });
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });

    // Scroll toward the comments section to trigger lazy loading
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight / 3);
    });

    // Wait for comments to load
    await page.waitForSelector('#comments', { timeout: 15000 });
    await new Promise((resolve) => setTimeout(resolve, 3000));

    const comments = await page.evaluate((maxComments) => {
      const commentElements = document.querySelectorAll('#comments #contents > ytd-comment-thread-renderer');
      const results = [];

      for (let i = 0; i < Math.min(commentElements.length, maxComments); i++) {
        const element = commentElements[i];
        const authorElement = element.querySelector('#author-text span');
        const textElement = element.querySelector('#content-text');
        const likesElement = element.querySelector('#vote-count-middle');
        const timeElement = element.querySelector('#published-time-text');

        const comment = {
          author: authorElement ? authorElement.textContent.trim() : 'Unknown',
          text: textElement ? textElement.textContent.trim() : '',
          likes: likesElement ? parseInt(likesElement.textContent.replace(/[^\d]/g, ''), 10) || 0 : 0,
          time: timeElement ? timeElement.textContent.trim() : '',
          position: i + 1
        };

        if (comment.text) {
          results.push(comment);
        }
      }

      return results;
    }, limit);

    return comments;
  } finally {
    await browser.close();
  }
}

// Usage
// scrapeComments('https://www.youtube.com/watch?v=dQw4w9WgXcQ', 15).then(console.log);
This extracts the first set of loaded comments. For more comments, you need infinite scrolling and deduplication.
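To give a flavor of what that involves, here is a minimal sketch using the same selectors as above: scroll to the bottom, wait, stop once the thread count reaches a target or stops growing, then key each comment on author plus text to drop duplicates. The helper names are ours; the linked tutorial covers the production version:

// Minimal sketch: keep scrolling until enough comment threads have
// rendered, or the count stalls (no more comments are loading).
async function loadMoreComments(page, target = 50, maxScrolls = 20) {
  let previousCount = 0;
  for (let i = 0; i < maxScrolls; i++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise((resolve) => setTimeout(resolve, 2000));
    const count = await page.evaluate(
      () => document.querySelectorAll('ytd-comment-thread-renderer').length
    );
    if (count >= target || count === previousCount) break;
    previousCount = count;
  }
}

// Minimal sketch: drop duplicates, keyed on author plus text,
// since re-renders can surface the same comment twice.
function dedupeComments(comments) {
  const seen = new Set();
  return comments.filter((comment) => {
    const key = `${comment.author}::${comment.text}`;
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}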
For advanced comment scraping, see How to Scrape YouTube Comments With Puppeteer, which covers infinite scrolling, comment sorting by top/new, reply counts, and creator hearts detection.
Extracting Transcripts
YouTube transcripts require opening the transcript panel (the "Show transcript" button, typically in the expanded video description) and parsing the timestamped segments:
async function scrapeTranscript(url) {
  const browser = await puppeteer.launch({
    headless: "new",
    ignoreDefaultArgs: ["--enable-automation"]
  });

  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
    await new Promise((resolve) => setTimeout(resolve, 3000));

    // Look for the transcript button and click it
    await page.evaluate(() => {
      const buttons = Array.from(document.querySelectorAll('button'));
      const transcriptButton = buttons.find((button) =>
        button.textContent && button.textContent.toLowerCase().includes('transcript')
      );
      if (transcriptButton) {
        transcriptButton.click();
      }
    });

    // Wait for the transcript panel to load
    await new Promise((resolve) => setTimeout(resolve, 2000));

    const transcript = await page.evaluate(() => {
      const transcriptContainer = document.querySelector('#segments-container');
      if (!transcriptContainer) return null; // transcript unavailable or panel not open

      const segments = transcriptContainer.querySelectorAll('ytd-transcript-segment-renderer');
      const results = [];

      segments.forEach((segment, index) => {
        const timestampElement = segment.querySelector('.ytd-transcript-segment-renderer[role="button"] .timestamp');
        const textElement = segment.querySelector('.ytd-transcript-segment-renderer[role="button"] .segment-text');

        if (timestampElement && textElement) {
          results.push({
            timestamp: timestampElement.textContent.trim(),
            text: textElement.textContent.trim(),
            index: index + 1
          });
        }
      });

      return results;
    });

    return transcript;
  } finally {
    await browser.close();
  }
}

// Usage
// scrapeTranscript('https://www.youtube.com/watch?v=dQw4w9WgXcQ').then(console.log);
This basic transcript extraction handles the common case where transcripts are available.
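A common post-processing step is converting the panel's "m:ss" or "h:mm:ss" timestamps into seconds so segments can be sorted or aligned with playback. A minimal sketch, assuming that timestamp format:

// Minimal sketch: convert "1:23" or "1:02:03" into total seconds
function timestampToSeconds(timestamp) {
  const parts = timestamp.split(':').map(Number);
  return parts.reduce((total, part) => total * 60 + part, 0);
}

// timestampToSeconds('1:23');    // 83
// timestampToSeconds('1:02:03'); // 3723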
For advanced transcript scraping, see How to Scrape YouTube Transcripts With Puppeteer, which covers multiple language support, auto-generated vs manual detection, and robust error handling.
API Alternative for Production Use
While Puppeteer gives you complete control, scraping at scale requires handling rate limits, browser management, and YouTube's frequent interface changes. For production applications, consider using the SocialKit YouTube APIs:
// Simple API calls - no browser management needed
const statsRes = await fetch('https://api.socialkit.dev/youtube/stats?url=VIDEO_URL&access_key=KEY');
const videoDetails = await statsRes.json();

const commentsRes = await fetch('https://api.socialkit.dev/youtube/comments?url=VIDEO_URL&limit=100&access_key=KEY');
const comments = await commentsRes.json();

const transcriptRes = await fetch('https://api.socialkit.dev/youtube/transcript?url=VIDEO_URL&access_key=KEY');
const transcript = await transcriptRes.json();
API benefits: reliable infrastructure, automatic updates when YouTube's interface changes, rate limiting handled for you, and 20 free requests monthly.
Conclusion
This guide provides working examples for scraping YouTube video data with Puppeteer. Start with these basic implementations and refer to the detailed tutorials for production-ready solutions with advanced features.
For learning and experimentation, Puppeteer gives you complete control. For production applications requiring reliability and scale, consider using specialized APIs that handle the complexity of YouTube scraping automatically.