Introduction to Bilibili Subtitle Extraction
Bilibili is one of the largest video-sharing platforms, featuring a vast array of content ranging from anime and gaming to educational tutorials and documentaries. As the platform grows globally, the need for Bilibili subtitle extraction has skyrocketed. Whether you are a content creator looking to repurpose video materials, a language learner needing transcripts, or a developer building AI summarization tools, extracting subtitles (Closed Captions or AI-generated text) is a highly valuable skill.
Unlike YouTube, Bilibili does not offer a native, one-click button to download subtitles. The platform utilizes dynamic loading and specific API endpoints to serve subtitle files, usually in JSON format. This comprehensive guide will walk you through the technical principles of Bilibili subtitle extraction, manual methods, browser userscripts, Python automation, and advanced AI-powered extraction techniques.
Understanding Bilibili Subtitle Types and Formats
Before diving into extraction methods, it is crucial to understand how Bilibili stores and displays text on its videos. Generally, subtitles on Bilibili fall into three categories:
- Uploader-Provided Subtitles (CC): These are subtitle files uploaded by the video creator (often prepared in SRT or ASS formats). Because they are written or reviewed by hand, they are usually well synced and accurate.
- Official AI-Generated Subtitles: Bilibili employs advanced speech recognition algorithms to auto-generate subtitles for videos lacking manual captions. While highly convenient, they may occasionally misunderstand complex jargon or heavy accents.
- Hardcoded Subtitles (Hardsubs): This is text permanently burned into the video frames. It cannot be extracted via network requests and requires Optical Character Recognition (OCR) or Automatic Speech Recognition (ASR) to transcribe.
When you extract soft subtitles (CC or AI) from Bilibili's servers, the data is typically returned in a structured JSON format containing timestamps and text segments. Converting this to standard formats like TXT, SRT, or VTT is usually the next step for most users.
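As a rough illustration of that conversion step, the sketch below turns a parsed subtitle object into SRT text. It assumes the layout used elsewhere in this guide (a `body` list whose entries carry `from`, `to`, and `content`) and that `from`/`to` are seconds expressed as floats. The function names and the sample data are hypothetical, not part of any Bilibili tooling.

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def bilibili_json_to_srt(subtitle: dict) -> str:
    """Convert a parsed subtitle JSON object into SRT text."""
    blocks = []
    for index, item in enumerate(subtitle.get("body", []), start=1):
        start = to_srt_timestamp(item["from"])
        end = to_srt_timestamp(item["to"])
        # Each SRT cue: sequence number, time range, text, blank line
        blocks.append(f"{index}\n{start} --> {end}\n{item['content']}\n")
    return "\n".join(blocks)


# Illustrative sample data (assumed shape, not taken from a real video)
sample = {
    "body": [
        {"from": 0.5, "to": 2.75, "content": "Hello everyone"},
        {"from": 3661.2, "to": 3663.0, "content": "Thanks for watching"},
    ]
}
srt_text = bilibili_json_to_srt(sample)
print(srt_text)
```

Working in whole milliseconds before splitting into hours, minutes, and seconds avoids rounding artifacts such as a cue ending at `59,1000`.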
Method 1: Manual Extraction via Browser Developer Tools
If you only need to download subtitles for a single video, using your browser's Developer Tools is the most straightforward method. It requires no third-party software installation.
Step-by-Step Guide:
- Open your web browser (Chrome, Edge, or Firefox) and navigate to the target Bilibili video page.
- Right-click anywhere on the page and select Inspect (or press `F12`) to open the Developer Tools.
- Navigate to the Network tab.
- In the filter box, type `json` to narrow down the network requests.
- Refresh the web page (press `F5`) and start playing the video. Ensure you click the "CC" button on the Bilibili video player to activate the subtitles.
- Look for a network request containing keywords like `subtitles` or ending in a `.json` extension. Click on it and inspect the Preview or Response tab to verify it contains the subtitle text.
- Right-click the request URL, open it in a new tab, and save the JSON file to your local computer.
Parsing the JSON File
Once downloaded, the JSON file will look like a structured dictionary. You can use a simple Python script to parse this file and extract just the text content:
import json

json_path = 'subtitle.json'

# Read the Bilibili JSON subtitle file
with open(json_path, 'r', encoding='utf-8') as f:
    content = json.load(f)

extracted_text = ''

# Iterate through the body containing the text elements
for data in content.get('body', []):
    extracted_text += data['content'] + '\n'

print(extracted_text)
Method 2: Browser Extensions and Userscripts
For users who frequently need to extract Bilibili subtitles, manual extraction becomes tedious. Fortunately, the open-source community has developed powerful userscripts that add direct download buttons to the Bilibili interface.
Using Tampermonkey Scripts
One of the most popular tools is the Bilibili AI Subtitle Exporter available on GreasyFork. This script injects a floating window onto the Bilibili video page, allowing you to download subtitles in TXT, SRT, or JSON formats with a single click.
- Installation: First, install the Tampermonkey extension for your browser. Then, search for "Bilibili AI Subtitle Exporter" on GreasyFork and install the script.
- Usage: When you open a Bilibili video, a floating icon will appear. If the video has subtitles, the icon turns blue. Clicking it reveals options to copy the text directly (ideal for pasting into AI summarization tools like ChatGPT) or download it as an SRT file for video editing.
Method 3: Python Web Scraping for Automated Extraction
If you are building an application or need to batch-download subtitles, interacting directly with Bilibili's API using Python is the best approach. The process involves three requests: fetching the video's CID (the ID of an individual video part) using the BVID, requesting the subtitle metadata, and then downloading the subtitle file itself.
Python Implementation Example
Below is a conceptual example of how to fetch Bilibili subtitles programmatically using the requests library:
import requests


def get_bilibili_subtitles(bvid):
    # Step 1: Get the CID (Video Part ID) of the first part
    pagelist_url = f"https://api.bilibili.com/x/player/pagelist?bvid={bvid}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(pagelist_url, headers=headers, timeout=10).json()
    cid = response['data'][0]['cid']

    # Step 2: Get Subtitle Metadata
    info_url = f"https://api.bilibili.com/x/web-interface/view?bvid={bvid}&cid={cid}"
    info_resp = requests.get(info_url, headers=headers, timeout=10).json()

    # Step 3: Extract Subtitle URL and Download
    subtitles = info_resp['data']['subtitle']['list']
    if subtitles:
        sub_url = subtitles[0]['subtitle_url']
        # Subtitle URLs may be protocol-relative (//host/path), so add a scheme
        if sub_url.startswith('//'):
            sub_url = 'https:' + sub_url
        sub_data = requests.get(sub_url, headers=headers, timeout=10).json()
        for item in sub_data['body']:
            print(f"[{item['from']} - {item['to']}] {item['content']}")
    else:
        print("No soft subtitles found for this video.")
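The example above always takes the first entry in the subtitle list. When a video offers several tracks, you may want to choose one by language instead. The following is a minimal sketch under stated assumptions: each list entry is assumed to carry a `lan` language code next to `subtitle_url`, and the codes shown (`zh-CN`, `ai-zh`, `en-US`) and the `example.invalid` URLs are placeholders for illustration. Both helper names are hypothetical.

```python
from typing import Optional


def normalize_subtitle_url(url: str) -> str:
    """Add a scheme to protocol-relative URLs such as //host/path.json."""
    return "https:" + url if url.startswith("//") else url


def pick_subtitle_track(tracks: list, preferred: tuple = ("zh-CN", "ai-zh")) -> Optional[dict]:
    """Return the first track matching a preferred language code, else the first track."""
    if not tracks:
        return None
    # Assumption: every track dict exposes its language code under "lan"
    by_lang = {track.get("lan"): track for track in tracks}
    for code in preferred:
        if code in by_lang:
            return by_lang[code]
    return tracks[0]


# Illustrative track list (assumed shape; URLs are placeholders)
tracks = [
    {"lan": "en-US", "subtitle_url": "//example.invalid/sub/en.json"},
    {"lan": "ai-zh", "subtitle_url": "//example.invalid/sub/ai-zh.json"},
]
chosen = pick_subtitle_track(tracks)
print(chosen["lan"], normalize_subtitle_url(chosen["subtitle_url"]))
```

Falling back to the first track keeps the behavior of the original example when none of the preferred languages is available.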
Method 4: Handling Hardcoded Subtitles with AI and OCR
What happens if the Bilibili video does not have soft CCs or AI-generated captions, but instead has text burned directly into the video? In this scenario, network scraping will fail. You must rely on Automatic Speech Recognition (ASR) or Optical Character Recognition (OCR).
Using Whisper AI for Transcription
Advanced open-source projects like lvusyy/biliSub utilize OpenAI's Whisper model to generate subtitles locally. By downloading the video's audio track and passing it through models like Whisper (Tiny, Base, Small, Medium, or Large), you can achieve highly accurate transcriptions.
This method is highly recommended for creating bilingual subtitles or processing older videos. However, it requires significant computational power, especially when using the "Large" models for maximum accuracy.
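For orientation, local transcription with the open-source `openai-whisper` Python package can be sketched as follows. This is not the biliSub implementation, only a minimal illustration. It assumes the package and `ffmpeg` are installed and that the audio track has already been saved to a local file. The helper names are hypothetical, and the output mirrors the `from`/`to`/`content` layout described earlier so the same downstream conversion code can be reused.

```python
def segments_to_bilibili_body(segments):
    """Map Whisper segments (start/end/text) to Bilibili-style entries (from/to/content)."""
    body = []
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # skip segments that contain only whitespace
        body.append({
            "from": round(seg["start"], 2),
            "to": round(seg["end"], 2),
            "content": text,
        })
    return {"body": body}


def transcribe_audio(audio_path, model_name="base"):
    """Transcribe a local audio file with Whisper and return Bilibili-style JSON."""
    # Imported lazily so the helper above works without Whisper installed.
    # Assumes `pip install openai-whisper` and ffmpeg available on PATH.
    import whisper

    model = whisper.load_model(model_name)  # e.g. "tiny", "base", "small", "medium", "large"
    result = model.transcribe(audio_path)
    return segments_to_bilibili_body(result["segments"])
```

A call such as `transcribe_audio("audio.m4a", model_name="small")` would return a dictionary with a `body` list; the file name here is a placeholder. Smaller models run faster on modest hardware at the cost of accuracy.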
Overcoming Anti-Scraping Mechanisms and Rate Limits
Bilibili employs robust anti-bot and anti-scraping mechanisms to protect its server resources and copyright content. If you are scraping subtitles at scale, you will quickly run into API rate limits, IP bans, and browser fingerprinting challenges.
Essential Tools for Large-Scale Scraping
To ensure your extraction scripts run smoothly without triggering CAPTCHAs or permanent bans, you need professional infrastructure:
- AntidetectBrowser: When you interact with Bilibili's web interface or APIs, the platform may check your browser fingerprint (Canvas, WebGL, User-Agent). AntidetectBrowser allows you to create hundreds of isolated, unique browser profiles, so automated sessions look like distinct users and are less likely to trigger fingerprint-based blocks.
- IPOCTO: Bilibili monitors request frequency per IP address. By integrating IPOCTO's residential proxy network into your scraping scripts or AntidetectBrowser profiles, you can rotate your IP address between requests, which reduces the risk of IP-based rate limiting.
Additionally, some high-quality subtitles (especially for premium content) require a valid login session (Cookies/SESSDATA). Using AntidetectBrowser, you can safely log into multiple Bilibili accounts without linking them, ensuring uninterrupted access to authenticated API endpoints.
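Whatever infrastructure you use, spacing out requests and backing off after failures helps a script stay under rate limits. The sketch below shows one common pattern, exponential backoff with random jitter. The function name, its parameters, and the default values are illustrative assumptions, not limits published by Bilibili.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Wait 1x, 2x, 4x ... the base delay, plus up to 50% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The `fetch` argument is any callable that raises on failure. With the `requests` library used earlier, that could be a small function that calls `requests.get(url, headers=headers, timeout=10)`, then `raise_for_status()`, and returns the parsed JSON. Passing `sleep` as a parameter makes the retry logic easy to exercise without real waiting.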
Repurposing Extracted Content for SEO
Extracting subtitles is often just the first step. Many marketers and bloggers use these transcripts to create comprehensive blog posts, video summaries, or localized content for international audiences.
If your goal is to publish this extracted content on your own website, simply pasting raw transcripts won't rank well on search engines. You need to optimize the text. We highly recommend using SEONIB, an advanced SEO optimization platform. SEONIB can analyze your extracted Bilibili transcripts, restructure them into highly readable articles, inject relevant keywords, and significantly boost your website's organic traffic.
Conclusion
Bilibili subtitle extraction ranges from simple manual JSON downloads via browser developer tools to sophisticated Python scraping and AI-driven speech recognition. By leveraging userscripts, API endpoints, and powerful tools like AntidetectBrowser and IPOCTO for bypassing restrictions, you can efficiently gather the text data you need. Always remember to respect copyright laws and use extracted content responsibly.
FAQ Section
Can I download Bilibili subtitles without an account?
Yes, for most public videos, you can extract subtitles without logging in. However, for members-only (premium) videos or certain restricted content, you must provide valid Bilibili account cookies (like SESSDATA) in your API requests to access the subtitle files.
How do I convert Bilibili JSON subtitles to SRT format?
Bilibili's native subtitle format is JSON. You can convert it to SRT using online subtitle converters, browser userscripts that offer direct SRT downloads, or by writing a simple Python script to map the JSON timestamps (from/to) into the standard SRT sequential format.
What should I do if a Bilibili video only has hardcoded subtitles?
If the subtitles are burned into the video frames, network extraction will not work. You must use Automatic Speech Recognition (ASR) tools like OpenAI's Whisper to transcribe the audio, or Optical Character Recognition (OCR) software to scan the video frames and read the text.
Why is my Python scraper getting blocked by Bilibili?
Bilibili uses strict anti-scraping measures including IP rate limits and browser fingerprinting. To avoid getting blocked, ensure you set legitimate User-Agent headers, add delays between requests, and utilize tools like AntidetectBrowser and IPOCTO proxies to distribute your traffic.
Is it legal to extract subtitles from Bilibili?
Extracting subtitles for personal use, language learning, or accessibility is generally acceptable. However, redistributing copyrighted subtitle content, using it for commercial gain, or plagiarizing a creator's work violates Bilibili's terms of service and copyright laws. Always obtain permission from the original uploader if you plan to republish.