xsukax ReadClean PDF

A privacy-focused, client-side web application that extracts clean, readable content from any webpage and converts it to PDF format. Built with pure HTML, CSS, and JavaScript—no backend required, no tracking, complete privacy.

Github Repo: https://github.com/xsukax/xsukax-ReadClean-PDF

Demo: https://xsukax.github.io/xsukax-ReadClean-PDF

License: GPL v3

🎯 Project Overview

xsukax ReadClean PDF is a lightweight, browser-based tool designed to transform cluttered web content into clean, distraction-free PDFs optimized for reading and archival. The application strips away advertisements, navigation elements, and other extraneous content while preserving the core article or document structure.

Primary Purpose

  • Content Extraction: Intelligently identifies and extracts main content from web pages
  • Distraction Removal: Eliminates ads, scripts, sidebars, navigation menus, and other non-essential elements
  • PDF Generation: Leverages native browser print functionality for high-quality PDF output
  • Universal Compatibility: Works with any website through multiple fetching methods

Core Functionalities

  1. URL-Based Fetching: Retrieve content directly from web URLs using CORS proxies
  2. HTML Paste Processing: Process raw HTML content pasted directly into the application
  3. Bookmarklet Integration: One-click content extraction from any webpage via browser bookmarklet
  4. Intelligent Content Cleaning: Automated removal of ads, scripts, images, links, and navigation elements
  5. RTL/LTR Language Support: Automatic detection and proper rendering of right-to-left and left-to-right text
  6. Responsive Design: Optimized interface for desktop and mobile devices

🔒 Security and Privacy Benefits

xsukax ReadClean PDF is architected with privacy and security as foundational principles. All processing occurs entirely within your browser, ensuring complete data sovereignty and protection.

Privacy-Centric Architecture

Client-Side Processing Only

All HTML parsing, content extraction, and PDF generation occur exclusively in your browser’s JavaScript environment. No data is transmitted to external servers for processing, eliminating concerns about data interception, logging, or unauthorized access.

No Data Collection or Tracking

  • Zero Analytics: The application contains no analytics scripts, tracking pixels, or telemetry
  • No Cookies: Does not set or read cookies for user identification or behavior tracking
  • No External Dependencies: Core functionality operates without loading third-party libraries from CDNs
  • No User Accounts: Fully functional without registration, login, or profile creation

Note: When using the URL Fetch method, the application routes requests through public CORS proxy services to work around browser same-origin restrictions. These proxies can see the URLs you fetch and relay the raw page back to your browser, but extraction and cleaning still happen locally. For maximum privacy, use the Bookmarklet or Paste HTML methods, which make no additional network requests.

Security Features

Content Sanitization

  • Script Removal: Automatically strips all <script> tags to prevent execution of potentially malicious code
  • Style Isolation: Removes external stylesheets and inline styles that could contain tracking mechanisms
  • Iframe Elimination: Blocks embedded iframes that might load third-party content or trackers

Safe HTML Processing

  • Utilizes browser’s native DOMParser for secure HTML parsing
  • Reduces the risk of XSS (Cross-Site Scripting) by stripping scripts before the content is displayed
  • Sandboxed execution environment ensures no persistent storage or state
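
The sanitization steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the application's actual code: the `UNSAFE_SELECTORS` list and the `stripUnsafe` function name are hypothetical. It relies on the standard behavior that a document produced by `DOMParser` does not execute the scripts it contains.

```javascript
// Assumed selector list for illustration; the project's real list may differ.
const UNSAFE_SELECTORS = ['script', 'noscript', 'style', 'link[rel="stylesheet"]', 'iframe'];

// Removes unsafe elements, inline styles, and inline event handlers (on*)
// from a parsed document. Returns the number of elements removed.
function stripUnsafe(doc, selectors = UNSAFE_SELECTORS) {
  let removed = 0;
  for (const selector of selectors) {
    for (const el of Array.from(doc.querySelectorAll(selector))) {
      el.remove();
      removed += 1;
    }
  }
  for (const el of Array.from(doc.querySelectorAll('*'))) {
    // Copy the attribute list first, because it changes while we remove items.
    for (const attr of Array.from(el.attributes)) {
      const name = attr.name.toLowerCase();
      if (name === 'style' || name.startsWith('on')) {
        el.removeAttribute(attr.name);
      }
    }
  }
  return removed;
}

// Browser usage:
//   const doc = new DOMParser().parseFromString(rawHtml, 'text/html');
//   stripUnsafe(doc);
//   const safeHtml = doc.body.innerHTML;
```

Working on a parsed, detached document rather than the live page means nothing is rendered or run until the cleaned markup is inserted.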

Bookmarklet Security Model

The bookmarklet operates with the same security context as the page you’re visiting, ensuring:

  • No data leaves your browser
  • Execution occurs only when explicitly triggered by user action
  • Content extraction happens in an isolated window context

Data Sovereignty Guarantees

| Aspect | Implementation |
| --- | --- |
| Data Storage | None; all processing is ephemeral and session-based |
| Network Requests | Only when using URL Fetch; Bookmarklet mode is 100% offline |
| Third-Party Access | Zero; no external services process your content |
| Data Retention | None; content is discarded when you close the browser tab |
| Audit Trail | Open source codebase; verify behavior independently |

✨ Features and Advantages

Key Benefits

🚀 Zero Installation Required

  • Single HTML file—download and open in any modern browser
  • No npm packages, dependencies, or build processes
  • Portable—run from USB drive, local filesystem, or web server

🎨 Intelligent Content Extraction

  • Prioritizes main content areas using semantic HTML selectors (such as `[role="main"]`)
  • Adaptive fallback mechanism for non-standard page structures
  • Preserves document hierarchy and formatting
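
The selector-priority approach with a fallback, as described above, might look something like this. It is a sketch only: the `CONTENT_SELECTORS` order, the `minTextLength` threshold, and the `findMainContent` name are assumptions made for illustration and are not taken from the project's source.

```javascript
// Assumed priority order; the app's actual selector list is not shown here.
const CONTENT_SELECTORS = ['article', 'main', '[role="main"]'];

// Returns the first matching element that holds enough text,
// or falls back to the whole document body.
function findMainContent(doc, selectors = CONTENT_SELECTORS, minTextLength = 200) {
  for (const selector of selectors) {
    const el = doc.querySelector(selector);
    if (el && (el.textContent || '').trim().length >= minTextLength) {
      return el;
    }
  }
  return doc.body;
}

// Browser usage:
//   const doc = new DOMParser().parseFromString(rawHtml, 'text/html');
//   const content = findMainContent(doc);
```

The length check guards against pages that include a semantic element that is nearly empty, in which case the body is a safer starting point.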

🧹 Comprehensive Cleaning Options

  • Remove Ads & Scripts: Eliminates advertisements, tracking scripts, and promotional content
  • Remove Images: Strips all images for text-only output (reduces PDF size)
  • Remove Links & Buttons: Converts hyperlinks to plain text, removes interactive elements
  • Expand All Sections: Automatically opens collapsible elements and hidden content
  • Auto-detect RTL/LTR: Recognizes Hebrew, Arabic, and other right-to-left scripts
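
Two of the options above, Expand All Sections and Remove Links & Buttons, can be illustrated with standard DOM calls. The function names `expandSections` and `unwrapLinks` are hypothetical, and this sketch only handles native `details` elements and the `hidden` attribute; content hidden through CSS rules would need separate handling.

```javascript
// Opens every collapsed <details> element and un-hides elements that carry
// the "hidden" attribute. Returns how many elements were changed.
function expandSections(doc) {
  let changed = 0;
  for (const el of Array.from(doc.querySelectorAll('details:not([open])'))) {
    el.setAttribute('open', '');
    changed += 1;
  }
  for (const el of Array.from(doc.querySelectorAll('[hidden]'))) {
    el.removeAttribute('hidden');
    changed += 1;
  }
  return changed;
}

// Replaces each hyperlink with a plain text node holding the link text.
function unwrapLinks(doc) {
  const links = Array.from(doc.querySelectorAll('a'));
  for (const a of links) {
    a.replaceWith(doc.createTextNode(a.textContent));
  }
  return links.length;
}
```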

🌐 Multi-Language Support

  • Unicode-aware text processing
  • Automatic bidirectional text handling
  • Proper rendering of mixed-direction content
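
One simple way to auto-detect text direction is to measure what share of the letters belongs to a right-to-left script. The sketch below is an assumption about how such detection could work, not the project's actual algorithm; the `detectDirection` name, the script list, and the 0.3 threshold are illustrative choices.

```javascript
// Matches any letter in any script.
const LETTER = /\p{L}/gu;
// Matches a character from common right-to-left scripts (Persian uses Arabic script).
const RTL_SCRIPT = /[\p{Script=Hebrew}\p{Script=Arabic}\p{Script=Syriac}\p{Script=Thaana}]/u;

// Returns 'rtl' when at least `threshold` of the letters are RTL, else 'ltr'.
function detectDirection(text, threshold = 0.3) {
  const letters = text.match(LETTER) || [];
  if (letters.length === 0) return 'ltr'; // no letters: keep the default direction
  const rtlCount = letters.filter((ch) => RTL_SCRIPT.test(ch)).length;
  return rtlCount / letters.length >= threshold ? 'rtl' : 'ltr';
}

// Browser usage:
//   container.dir = detectDirection(container.textContent);
```

Using a ratio rather than a single-character check keeps a mostly English article from flipping direction because of one quoted Arabic or Hebrew word.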

📱 Responsive Interface

  • Mobile-optimized design with touch-friendly controls
  • Tablet and desktop layouts
  • Dark theme for reduced eye strain

Three Fetching Methods

  1. URL Fetch: Direct content retrieval with automatic proxy fallback
  2. Paste HTML: Process saved HTML files or copied source code
  3. Bookmarklet: One-click extraction from any webpage (recommended for maximum privacy)

Unique Selling Points

| Feature | xsukax ReadClean PDF | Traditional Tools |
| --- | --- | --- |
| Privacy | 100% client-side, zero tracking | Often cloud-based with data collection |
| Cost | Free and open source | Frequently requires subscriptions |
| Installation | None (single HTML file) | Browser extensions or desktop apps |
| Cross-Platform | Any device with a modern browser | Platform-specific builds |
| Offline Capable | Yes (Bookmarklet and Paste modes) | Usually requires internet connection |
| Customization | Open source, modify freely | Closed-source, limited options |
| CORS Workaround | Built-in proxy fallback | Manual configuration or paid services |

Comparison with Alternatives

vs. Browser Extensions

  • No permission requirements or security warnings
  • Works in private/incognito mode without special settings
  • No browser-specific compatibility issues

vs. Online Services

  • Complete privacy—no data leaves your device (Bookmarklet mode)
  • No rate limits, captchas, or service outages
  • Unlimited usage without accounts or paywalls

vs. Print Stylesheets

  • More aggressive content cleaning
  • Works on sites without print-optimized CSS
  • Consistent output across all websites

🛠️ Installation Instructions

Method 1: Direct Download (Recommended)

  1. Download the HTML file:

    # Clone the repository
    git clone https://github.com/xsukax/xsukax-ReadClean-PDF.git
    
    # Navigate to the directory
    cd xsukax-ReadClean-PDF
  2. Open in browser:

    • Windows: Double-click index.html or right-click → Open with → [Your Browser]
    • macOS: Double-click index.html or drag to browser icon
    • Linux: xdg-open index.html or use your file manager

Method 2: Web Server Deployment

For sharing within your organization or hosting publicly:

    # Using Python's built-in HTTP server
    python3 -m http.server 8000

    # Using Node.js http-server
    npx http-server -p 8000

    # Using PHP's built-in server
    php -S localhost:8000

Access at: http://localhost:8000

Method 3: GitHub Pages (Public Hosting)

  1. Fork this repository
  2. Go to Settings → Pages
  3. Select main branch as source
  4. Your app will be available at: https://[username].github.io/xsukax-ReadClean-PDF/

Bookmarklet Installation

  1. Open index.html in your browser
  2. Navigate to the Bookmarklet tab
  3. Drag the “📄 xsukax ReadClean PDF” button to your bookmarks bar
  4. Use on any webpage by clicking the bookmark

Manual Bookmarklet Creation (if drag-and-drop doesn’t work):

  1. Create a new bookmark in your browser
  2. Set the name to “xsukax ReadClean PDF”
  3. Copy the JavaScript code from the bookmarklet button
  4. Paste into the URL field
  5. Save
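
For context, a bookmarklet is just a `javascript:` URL stored as a bookmark. The helper below shows one common way to turn a script into such a URL; the `toBookmarklet` name and the sample script are hypothetical and do not reproduce the code behind the application's bookmarklet button.

```javascript
// Wraps a script in an immediately invoked function and encodes it as a
// javascript: URL that can be pasted into a bookmark's URL field.
function toBookmarklet(source) {
  // Newlines keep a trailing "//" comment in `source` from swallowing the wrapper.
  const wrapped = `(function(){\n${source}\n})();`;
  return 'javascript:' + encodeURIComponent(wrapped);
}

// Example with a placeholder script:
const exampleUrl = toBookmarklet('alert(document.title);');
```

Wrapping the script in a function keeps its variables out of the visited page's global scope.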

📖 Usage Guide

Detailed Workflows

Workflow 1: URL Fetch Method

Best for: Websites that allow CORS or when you want quick extraction

  1. Open the application in your browser
  2. Enter the URL in the “Website URL” field
  3. Configure cleaning options:
    • ✅ Remove Ads & Scripts (recommended)
    • ✅ Remove Images (optional—for smaller PDFs)
    • ✅ Remove Links & Buttons (optional—for cleaner text)
    • ✅ Expand All Sections (recommended—reveals hidden content)
    • ✅ Auto-detect RTL/LTR (recommended—for proper text direction)
  4. Click “🔥 Fetch Content”
  5. Wait for processing (the app tries multiple proxies automatically)
  6. Review the extracted content in the preview area
  7. Click “📄 Save as PDF (Print)”
  8. In the print dialog:
    • Choose “Save as PDF” as destination
    • Adjust margins if needed (recommend “Default” or “Minimum”)
    • Click “Save”

Troubleshooting: If all proxies fail, use the Bookmarklet or Paste HTML method instead.
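
The automatic proxy fallback in step 5 can be pictured as a loop that tries each proxy in turn until one returns the page. The following is a rough sketch, not the application's code: the proxy hosts, the `{url}` placeholder, and the names `buildProxyUrls` and `fetchViaProxies` are all hypothetical.

```javascript
// Hypothetical proxy URL templates; "{url}" is replaced by the encoded target URL.
const PROXY_TEMPLATES = [
  'https://proxy-one.example/?url={url}',
  'https://proxy-two.example/raw?target={url}',
];

// Builds one request URL per proxy template.
function buildProxyUrls(targetUrl, templates = PROXY_TEMPLATES) {
  const encoded = encodeURIComponent(targetUrl);
  return templates.map((template) => template.replace('{url}', () => encoded));
}

// Tries each proxy in order and returns the first successful response body.
// Throws an error listing every failure if none of the proxies succeed.
async function fetchViaProxies(targetUrl, { templates = PROXY_TEMPLATES, fetchImpl = fetch } = {}) {
  const errors = [];
  for (const proxyUrl of buildProxyUrls(targetUrl, templates)) {
    try {
      const response = await fetchImpl(proxyUrl);
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return await response.text();
    } catch (err) {
      errors.push(`${proxyUrl}: ${err.message}`);
    }
  }
  throw new Error(`All proxies failed:\n${errors.join('\n')}`);
}
```

When every proxy fails, the thrown error is the point at which the Bookmarklet or Paste HTML method becomes the practical alternative.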

Workflow 2: Paste HTML Method

Best for: When you have the HTML source code or saved webpage

  1. Get the HTML source:
    • Open target webpage
    • Press Ctrl+U (Windows/Linux) or Cmd+Option+U (macOS)
    • Or right-click → “View Page Source”
    • Select all (Ctrl+A or Cmd+A) and copy (Ctrl+C or Cmd+C)
  2. Switch to “Paste HTML” tab
  3. Paste the HTML into the text area
  4. Configure cleaning options (same as URL Fetch)
  5. Click “🔄 Process HTML”
  6. Review and save (steps 6-8 from Workflow 1)

Alternative: You can also paste HTML from “Save As → Web Page, Complete” files

Workflow 3: Bookmarklet Method (Recommended)

Best for: Maximum privacy, one-click extraction, bypassing CORS restrictions

Setup (one-time):

  1. Open the application
  2. Navigate to the “Bookmarklet” tab
  3. Drag the “📄 xsukax ReadClean PDF” button to your bookmarks bar

Usage:

  1. Navigate to any webpage you want to convert
  2. Click the bookmarklet from your bookmarks bar
  3. A new window opens with cleaned content
  4. Print dialog appears automatically after 500ms
  5. Save as PDF:
    • Choose “Save as PDF” as destination
    • Click “Save”
  6. Done! The original page remains unchanged
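
The open-window-then-print sequence in steps 3 and 4 can be approximated with standard browser APIs. This is a simplified sketch with hypothetical names (`printInNewWindow`, `openWindow`, `schedule`); only the 500 ms delay comes from the description above.

```javascript
// Writes cleaned HTML into a new window and opens the print dialog after a
// short delay so the content has time to render. Returns false if the
// window could not be opened (for example, when a popup blocker intervenes).
function printInNewWindow(html, {
  openWindow = () => window.open('', '_blank'),
  schedule = setTimeout,
  delayMs = 500,
} = {}) {
  const win = openWindow();
  if (!win) return false;
  win.document.open();
  win.document.write(html);
  win.document.close();
  schedule(() => win.print(), delayMs);
  return true;
}

// Browser usage:
//   printInNewWindow('<article><h1>Title</h1><p>Cleaned text</p></article>');
```

Because the cleaned content is written into a separate window, the original page's DOM is never modified.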

Configuration Options Explained

| Option | Effect | Recommendation |
| --- | --- | --- |
| Remove Ads & Scripts | Strips advertising, tracking scripts, navigation menus, sidebars | ✅ Always enable |
| Remove Images | Eliminates all images and figures | 🟡 Enable for text-only PDFs; disable to preserve diagrams |
| Remove Links & Buttons | Converts hyperlinks to plain text, removes interactive elements | 🟡 Enable for print-optimized output; disable to preserve URLs |
| Expand All Sections | Opens collapsible `<details>` elements and hidden content | ✅ Always enable |
| Auto-detect RTL/LTR | Analyzes content for Hebrew, Arabic, Persian scripts and adjusts text direction | ✅ Always enable (harmless for LTR content) |

📞 Support and Contact
