HinTextify

HinTextify - Hindi OCR Extractor

![Python](https://img.shields.io/badge/Python-3.10%2B-3776ab?style=for-the-badge&logo=python&logoColor=white) ![OpenCV](https://img.shields.io/badge/OpenCV-4.0%2B-5C3EE8?style=for-the-badge&logo=opencv&logoColor=white) ![Tesseract](https://img.shields.io/badge/Tesseract-OCR-FF6B35?style=for-the-badge&logo=googlefonts&logoColor=white) ![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)

**🚀 Advanced Python toolkit for extracting Hindi text from scanned book images**

*Leveraging Tesseract OCR with intelligent OpenCV preprocessing for superior accuracy*

---

🎯 Overview

HinTextify is an OCR tool built for scanned Hindi books and manuscripts. It runs each page image through an OpenCV preprocessing pipeline (noise reduction, binarization, contrast adjustment) before passing it to Tesseract, then writes the recognized text to UTF-8 encoded text files.

🏆 Key Highlights

🔤 Multi-Language Support
- Native Hindi (Devanagari) OCR
- Mixed Hindi+English processing
- UTF-8 compliant output

🖼️ Smart Image Processing
- Adaptive noise reduction
- Intelligent binarization
- Contrast optimization

⚡ Batch Processing
- Process entire folders
- Preserved file naming
- Automated workflow

🎯 High Accuracy
- OpenCV preprocessing pipeline
- Optimized for 300-600 DPI
- Production-ready results

📁 Project Architecture

📦 hindi-book-ocr/
├── 📂 Book/                 # 📥 Input Directory
│   ├── 🖼️ page001.jpg
│   ├── 🖼️ page002.png
│   └── 🖼️ ...
├── 📂 Book_text/            # 📤 Output Directory  
│   ├── 📄 page001.txt
│   ├── 📄 page002.txt
│   └── 📄 ...
├── 🐍 main.py              # 🚀 Core OCR Engine
├── 📋 requirements.txt     # 📦 Dependencies (pip)
├── ⚙️ pyproject.toml       # 📦 Dependencies (uv)
├── 📖 README.md            # 📚 Documentation
└── 🔧 config.py           # ⚙️ Configuration (optional)

🛠️ Installation Guide

Prerequisites

| Component | Version | Platform |
|-----------|---------|----------|
| 🐍 Python | 3.10+ | Cross-platform |
| 🔍 Tesseract | 5.0+ | Windows/Linux/macOS |
| 📚 Hindi Language Pack | Latest | Required |

Step 1: Install Tesseract OCR

🪟 Windows Installation

Download & Install

# Download the Windows installer from the UB Mannheim Tesseract page
https://github.com/UB-Mannheim/tesseract/wiki

Default Installation Path

C:\Users\{USERNAME}\AppData\Local\Programs\Tesseract-OCR\

Verify Installation

tesseract --version
tesseract --list-langs

🐧 Linux Installation

# Ubuntu/Debian
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-hin

# CentOS/RHEL/Fedora
sudo dnf install tesseract tesseract-langpack-hin

🍎 macOS Installation

# Using Homebrew
brew install tesseract tesseract-lang
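
On any platform, the check done by `tesseract --list-langs` can also be scripted, which is handy before a long batch run. The sketch below is an assumption-laden helper, not part of the project: `parse_langs` and `hindi_available` are hypothetical names, and reading both output streams is a defensive choice in case a given Tesseract build prints the listing to stderr.

```python
import shutil
import subprocess


def parse_langs(listing: str) -> list[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # The listing starts with a header line such as
    # 'List of available languages in "/usr/share/tessdata/" (3):'.
    return [line for line in lines if not line.startswith("List of")]


def hindi_available() -> bool:
    """Return True if a tesseract binary is on PATH and lists 'hin'."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Check both streams; see the note above about stderr.
    return "hin" in parse_langs(result.stdout + "\n" + result.stderr)
```

If `hindi_available()` returns `False`, either Tesseract is not on `PATH` or the Hindi language pack is missing.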

Step 2: Python Environment Setup

Option A: Using pip (Traditional)

# Clone or download the project
git clone https://github.com/sahilkhan117/HinTextify.git
cd HinTextify

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

Option B: Using uv (Modern & Faster)

# Install uv if not already installed
pip install uv

# Setup project
uv sync

🚀 Quick Start

Basic Usage

Prepare Your Images

# Place scanned images in the Book folder
cp /path/to/your/scans/* ./Book/

Execute OCR Processing

# Standard execution
python main.py
   
# With uv (recommended)
uv run main.py

Retrieve Results

# Check extracted text files
ls -la Book_text/

Advanced Configuration

# main.py - Customization Options

# Input/Output Directories
INPUT_FOLDER = "Book"
OUTPUT_FOLDER = "Book_text"

# Tesseract Configuration
TESSERACT_PATH = r"C:\Users\{USERNAME}\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"
LANGUAGE_CONFIG = 'hin+eng'  # Hindi + English

# Image Processing Parameters
DPI_SETTING = 300
PREPROCESSING_ENABLED = True
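
Because `TESSERACT_PATH` above is a Windows-specific path containing a `{USERNAME}` placeholder, it may help to resolve the executable at runtime instead. The following is a minimal sketch, assuming a hypothetical helper `resolve_tesseract_path` and a hypothetical `TESSERACT_CMD` environment variable; neither is part of the project.

```python
import os
import shutil
from typing import Optional


def resolve_tesseract_path(configured: Optional[str] = None) -> Optional[str]:
    """Pick a Tesseract executable: explicit path, env override, then PATH."""
    # TESSERACT_CMD is a made-up variable name used only in this sketch.
    candidates = [configured, os.environ.get("TESSERACT_CMD")]
    for candidate in candidates:
        if candidate and os.path.isfile(candidate):
            return candidate
    # Fall back to whatever `tesseract` resolves to on PATH (None if missing).
    return shutil.which("tesseract")
```

If `main.py` uses the pytesseract wrapper, the result could be assigned to `pytesseract.pytesseract.tesseract_cmd` before running OCR.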

📊 Performance Optimization

Image Quality Guidelines

| Parameter | Recommended | Notes |
|-----------|-------------|-------|
| Resolution | 300-600 DPI | Higher DPI within this range generally improves accuracy |
| Format | PNG, TIFF | Lossless compression preferred |
| Color Mode | Grayscale | Reduces processing time |
| File Size | < 10MB | For optimal memory usage |
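
Two of these guidelines, format and file size, can be checked with the standard library before OCR starts. The sketch below is illustrative only: `quality_warnings` is a hypothetical helper, and the 10 MB limit is read here as 10 MiB. Checking DPI or color mode would need an imaging library and is left out.

```python
from pathlib import Path

LOSSLESS_SUFFIXES = {".png", ".tif", ".tiff"}
MAX_BYTES = 10 * 1024 * 1024  # the "< 10MB" guideline, read as 10 MiB


def quality_warnings(image_path: Path) -> list[str]:
    """Return human-readable warnings for one scan; an empty list means OK."""
    warnings = []
    if image_path.suffix.lower() not in LOSSLESS_SUFFIXES:
        warnings.append(f"{image_path.name}: lossy or unknown format")
    if image_path.stat().st_size >= MAX_BYTES:
        warnings.append(f"{image_path.name}: file is 10 MB or larger")
    return warnings
```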

📈 Usage Examples

Example 1: Single Language Processing

# For pure Hindi content
config = {
    'lang': 'hin',
    'psm': 6,  # Uniform block of text
    'oem': 1   # Neural nets LSTM engine
}

Example 2: Mixed Language Content

# For Hindi + English mixed content
config = {
    'lang': 'hin+eng',
    'psm': 3,  # Fully automatic page segmentation
    'oem': 1
}
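
The `psm` and `oem` keys in these dictionaries correspond to Tesseract's `--psm` and `--oem` command-line options. How `main.py` consumes the dictionary is not shown here, so the conversion below is only a sketch, and `to_tesseract_args` is a hypothetical helper name.

```python
def to_tesseract_args(config: dict) -> tuple[str, str]:
    """Split a config dict into a language string and a CLI flag string."""
    lang = config.get("lang", "hin")  # defaulting to Hindi is an assumption
    flags = []
    # --psm and --oem are standard Tesseract command-line options.
    if "psm" in config:
        flags.append(f"--psm {config['psm']}")
    if "oem" in config:
        flags.append(f"--oem {config['oem']}")
    return lang, " ".join(flags)


lang, flags = to_tesseract_args({"lang": "hin+eng", "psm": 3, "oem": 1})
print(lang, "|", flags)  # hin+eng | --psm 3 --oem 1
# With pytesseract installed, these would be passed as:
#   pytesseract.image_to_string(image, lang=lang, config=flags)
```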

Example 3: Batch Processing Results

Input Structure:

Book/
├── chapter01_page001.jpg
├── chapter01_page002.jpg
├── chapter02_page001.png
└── manuscript_page045.tiff

Output Results:

Book_text/
├── chapter01_page001.txt  ✅ 2.3KB extracted
├── chapter01_page002.txt  ✅ 1.8KB extracted  
├── chapter02_page001.txt  ✅ 2.1KB extracted
└── manuscript_page045.txt ✅ 2.7KB extracted
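
The input-to-output mapping above (same file stem, `.txt` extension) can be sketched as a small loop. This is not the code in `main.py`: `process_folder` and `IMAGE_SUFFIXES` are hypothetical names, and the OCR step is passed in as a callable so the loop itself does not depend on Tesseract.

```python
from pathlib import Path
from typing import Callable

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}  # assumed set


def process_folder(
    input_dir: Path, output_dir: Path, ocr: Callable[[Path], str]
) -> list[Path]:
    """Run `ocr` on each image in input_dir and write <stem>.txt files."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(input_dir.iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        text_path = output_dir / f"{image_path.stem}.txt"
        # UTF-8 keeps Devanagari characters intact on every platform.
        text_path.write_text(ocr(image_path), encoding="utf-8")
        written.append(text_path)
    return written
```

A call such as `process_folder(Path("Book"), Path("Book_text"), my_ocr)` would then produce one text file per page, where `my_ocr` stands for whatever function performs recognition on a single image.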

🤝 Contributing

We welcome contributions from the community! Here’s how you can help:

- 🐛 **Bug Reports**
- 🆕 **Feature Requests**
- 📖 **Documentation**
- 🧪 **Testing**

📄 License & Credits

**MIT License** © 2024 Hindi Book OCR Extractor

*Built with ❤️ for the Hindi digitization community*

🙏 Acknowledgments

- **[Tesseract OCR](https://github.com/tesseract-ocr/tesseract)** - Google's OCR Engine
- **[OpenCV](https://opencv.org/)** - Computer Vision Library
- **[Python Community](https://www.python.org/)** - Programming Language

---

🔗 Connect & Support

[![⭐ Star on GitHub](https://img.shields.io/badge/⭐-Star%20on%20GitHub-yellow?style=for-the-badge)](../../stargazers) [![🐛 Report Bug](https://img.shields.io/badge/🐛-Report%20Bug-red?style=for-the-badge)](../../issues) [![💡 Request Feature](https://img.shields.io/badge/💡-Request%20Feature-blue?style=for-the-badge)](../../issues)

**Made with 🧠 and ☕ | Happy OCR Processing! 📚➡️📝**
