HinTextify

HinTextify - Hindi OCR Extractor

![Python](https://img.shields.io/badge/Python-3.10%2B-3776ab?style=for-the-badge&logo=python&logoColor=white) ![OpenCV](https://img.shields.io/badge/OpenCV-4.0%2B-5C3EE8?style=for-the-badge&logo=opencv&logoColor=white) ![Tesseract](https://img.shields.io/badge/Tesseract-OCR-FF6B35?style=for-the-badge&logo=googlefonts&logoColor=white) ![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)

**Advanced Python toolkit for extracting Hindi text from scanned book images**

*Leveraging Tesseract OCR with intelligent OpenCV preprocessing for superior accuracy*

---

Overview

HinTextify is an OCR tool designed specifically for scanned Hindi books and manuscripts. It preprocesses each image with OpenCV before running Tesseract, and writes the extracted text as Unicode (UTF-8) files.

Key Highlights

**Multi-Language Support**
- Native Hindi (Devanagari) OCR
- Mixed Hindi+English processing
- UTF-8 compliant output

**Smart Image Processing**
- Adaptive noise reduction
- Intelligent binarization
- Contrast optimization

**Batch Processing**
- Process entire folders
- Preserved file naming
- Automated workflow

**High Accuracy**
- OpenCV preprocessing pipeline
- Optimized for 300-600 DPI
- Production-ready results

Project Architecture

hindi-book-ocr/
├── Book/                 # Input directory
│   ├── page001.jpg
│   ├── page002.png
│   └── ...
├── Book_text/            # Output directory
│   ├── page001.txt
│   ├── page002.txt
│   └── ...
├── main.py               # Core OCR engine
├── requirements.txt      # Dependencies (pip)
├── pyproject.toml        # Dependencies (uv)
├── README.md             # Documentation
└── config.py             # Configuration (optional)

Installation Guide

Prerequisites

| Component | Version | Platform |
|-----------|---------|----------|
| Python | 3.10+ | Cross-platform |
| Tesseract | 5.0+ | Windows/Linux/macOS |
| Hindi Language Pack | Latest | Required |

Step 1: Install Tesseract OCR

Windows Installation

  1. Download & Install
    # Download from official repository
    https://github.com/UB-Mannheim/tesseract/wiki
    
  2. Default Installation Path
    C:\Users\{USERNAME}\AppData\Local\Programs\Tesseract-OCR\
    
  3. Verify Installation
    tesseract --version
    tesseract --list-langs
    

🐧 Linux Installation

# Ubuntu/Debian
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-hin

# CentOS/RHEL/Fedora
sudo dnf install tesseract tesseract-langpack-hin

🍎 macOS Installation

# Using Homebrew
brew install tesseract tesseract-lang
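
Whichever platform you install on, it is worth confirming that the Hindi language data (`hin`) is actually available before running the OCR. The sketch below parses the output of `tesseract --list-langs` from Python. The helper names `parse_langs` and `installed_langs` are illustrative and not part of this project, and the sample output is an assumption about what a typical Tesseract 5 build prints.

```python
import re
import shutil
import subprocess

# Language codes look like "hin", "chi_sim" or "script/Devanagari";
# header and warning lines contain spaces or punctuation and are skipped.
LANG_CODE = re.compile(r"[A-Za-z0-9_/\-]+")


def parse_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        if line and LANG_CODE.fullmatch(line):
            codes.add(line)
    return codes


def installed_langs() -> set[str]:
    """Run the Tesseract CLI, if it is on PATH, and return its language codes."""
    exe = shutil.which("tesseract")
    if exe is None:
        return set()  # Tesseract was not found on PATH
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Read both streams in case a build prints the list to stderr.
    return parse_langs(result.stdout + "\n" + result.stderr)


# Hypothetical sample output, used only to illustrate the parsing.
sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nhin\nosd\n'
print("hin" in parse_langs(sample))
```

If `"hin" in installed_langs()` is false on your machine, the Hindi language pack from the prerequisites table is most likely missing.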

Step 2: Python Environment Setup

Option A: Using pip (Traditional)

# Clone or download the project
git clone https://github.com/sahilkhan117/HinTextify.git
cd HinTextify

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

Option B: Using uv (Modern & Faster)

# Install uv if not already installed
pip install uv

# Setup project
uv sync

Quick Start

Basic Usage

  1. Prepare Your Images
    # Place scanned images in the Book folder
    cp /path/to/your/scans/* ./Book/
    
  2. Execute OCR Processing
    # Standard execution
    python main.py
       
    # With uv (recommended)
    uv run main.py
    
  3. Retrieve Results
    # Check extracted text files
    ls -la Book_text/
    

Advanced Configuration

# main.py - Customization Options

# Input/Output Directories
INPUT_FOLDER = "Book"
OUTPUT_FOLDER = "Book_text"

# Tesseract Configuration
TESSERACT_PATH = r"C:\Users\{USERNAME}\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"
LANGUAGE_CONFIG = 'hin+eng'  # Hindi + English

# Image Processing Parameters
DPI_SETTING = 300
PREPROCESSING_ENABLED = True
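
The `TESSERACT_PATH` shown above is a Windows location and will not exist on Linux or macOS. One way to handle that is to fall back to whatever `tesseract` is on `PATH`. This is a minimal sketch under that assumption; `resolve_tesseract` is a hypothetical helper, and nothing here is confirmed to match how `main.py` locates the executable.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def resolve_tesseract(
    configured: Optional[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a usable Tesseract path, or None if none is found.

    The explicitly configured path wins when that file exists;
    otherwise the lookup falls back to `tesseract` on PATH.
    """
    if configured and Path(configured).is_file():
        return configured
    return which("tesseract")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use, call `resolve_tesseract(TESSERACT_PATH)` and treat a `None` result as "Tesseract is not installed".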

Performance Optimization

Image Quality Guidelines

| Parameter | Recommended | Notes |
|-----------|-------------|-------|
| Resolution | 300-600 DPI | Higher DPI = better accuracy |
| Format | PNG, TIFF | Lossless compression preferred |
| Color Mode | Grayscale | Reduces processing time |
| File Size | < 10MB | For optimal memory usage |
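
Two of these guidelines, format and file size, can be checked before OCR without opening the image. The following is a rough sketch using only the standard library. `check_scan` is a hypothetical helper, and treating "10MB" as 10 × 1024 × 1024 bytes is an assumption. Resolution and color mode are not checked because they require an image library.

```python
from pathlib import Path

LOSSLESS_EXTENSIONS = {".png", ".tif", ".tiff"}  # formats recommended above
MAX_BYTES = 10 * 1024 * 1024                      # assumed meaning of "10MB"


def check_scan(path: Path) -> list[str]:
    """Return human-readable warnings for one scanned image file."""
    warnings = []
    if path.suffix.lower() not in LOSSLESS_EXTENSIONS:
        warnings.append(f"{path.name}: not PNG or TIFF; lossless formats are preferred")
    if path.stat().st_size >= MAX_BYTES:
        warnings.append(f"{path.name}: file is 10 MB or larger")
    return warnings
```

An empty list means the file meets both guidelines; a JPEG scan is still usable, it only triggers a warning.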

Usage Examples

Example 1: Single Language Processing

# For pure Hindi content
config = {
    'lang': 'hin',
    'psm': 6,  # Uniform block of text
    'oem': 1   # Neural nets LSTM engine
}

Example 2: Mixed Language Content

# For Hindi + English mixed content
config = {
    'lang': 'hin+eng',
    'psm': 3,  # Fully automatic page segmentation
    'oem': 1
}
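
To make the meaning of `lang`, `psm`, and `oem` concrete, here is a sketch of how such a dictionary could be turned into arguments for the Tesseract command-line tool, which accepts `-l`, `--psm`, and `--oem`. The function `build_tesseract_command` is hypothetical; `main.py` may pass these settings differently, for example through a Python wrapper.

```python
def build_tesseract_command(image_path: str, output_base: str, config: dict) -> list[str]:
    """Build an argument list for the Tesseract command-line tool.

    Tesseract appends ".txt" to output_base itself, so the path is
    given without an extension.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-l", config["lang"],
        "--psm", str(config["psm"]),
        "--oem", str(config["oem"]),
    ]


hindi_only = {"lang": "hin", "psm": 6, "oem": 1}
print(build_tesseract_command("Book/page001.jpg", "Book_text/page001", hindi_only))
```

The resulting list can be handed to `subprocess.run(...)` unchanged, which avoids shell quoting problems with file names that contain spaces.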

Example 3: Batch Processing Results

Input Structure:

Book/
├── chapter01_page001.jpg
├── chapter01_page002.jpg
├── chapter02_page001.png
└── manuscript_page045.tiff

Output Results:

Book_text/
├── chapter01_page001.txt  2.3KB extracted
├── chapter01_page002.txt  1.8KB extracted
├── chapter02_page001.txt  2.1KB extracted
└── manuscript_page045.txt 2.7KB extracted
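
The name-preserving mapping shown above, where each image yields a `.txt` file with the same stem, can be sketched as follows. `plan_outputs` and the `IMAGE_EXTENSIONS` set are assumptions made for illustration; the extensions that `main.py` actually accepts are not stated in this README.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}  # assumed set


def plan_outputs(input_dir: Path, output_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each image in input_dir with a same-stem .txt path in output_dir."""
    pairs = []
    for image in sorted(input_dir.iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_EXTENSIONS:
            pairs.append((image, output_dir / (image.stem + ".txt")))
    return pairs
```

Note that two inputs sharing a stem, such as `page001.jpg` and `page001.png`, would map to the same output file under this scheme, so scans are best named uniquely.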

🀝 Contributing

We welcome contributions from the community! Here’s how you can help:

| πŸ› **Bug Reports** | πŸ†• **Feature Requests** | πŸ“– **Documentation** | πŸ§ͺ **Testing** | |-------------------|------------------------|---------------------|---------------|

License & Credits

**MIT License** © 2024 Hindi Book OCR Extractor

*Built with ❤️ for the Hindi digitization community*

### Acknowledgments

- **[Tesseract OCR](https://github.com/tesseract-ocr/tesseract)** - Google's OCR Engine
- **[OpenCV](https://opencv.org/)** - Computer Vision Library
- **[Python Community](https://www.python.org/)** - Programming Language

---

### Connect & Support

[![⭐ Star on GitHub](https://img.shields.io/badge/⭐-Star%20on%20GitHub-yellow?style=for-the-badge)](../../stargazers) [![🐛 Report Bug](https://img.shields.io/badge/🐛-Report%20Bug-red?style=for-the-badge)](../../issues) [![💡 Request Feature](https://img.shields.io/badge/💡-Request%20Feature-blue?style=for-the-badge)](../../issues)

**Made with 🧠 and ☕ | Happy OCR Processing!**