This tool extracts text from scanned Hindi manuscripts and books. It preprocesses each page image with OpenCV, runs Tesseract OCR on it, and writes the recognized text as Unicode (UTF-8) files, one per scanned page.
| Feature | Details |
|---|---|
| **Multi-Language Support** | Native Hindi (Devanagari) OCR; mixed Hindi+English processing; UTF-8 compliant output |
| **Smart Image Processing** | Adaptive noise reduction; intelligent binarization; contrast optimization |
| **Batch Processing** | Process entire folders; preserved file naming; automated workflow |
| **High Accuracy** | OpenCV preprocessing pipeline; optimized for 300-600 DPI; production-ready results |
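
The README does not spell out how "intelligent binarization" is implemented. As a rough illustration only, the sketch below assumes Otsu's method, a common way to pick a black/white threshold automatically from the grayscale histogram. It is written in plain Python for readability; `otsu_threshold` and `binarize` are hypothetical helper names, not functions from `main.py`. In an OpenCV pipeline the same idea is usually a single call to `cv2.threshold` with the `cv2.THRESH_OTSU` flag.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 0-255 threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the intensities of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue  # no pixels on the dark side yet
        if w_fg == 0:
            break     # every pixel is already on the dark side
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map each grayscale value to 0 (ink) or 255 (paper)."""
    return [255 if p > threshold else 0 for p in pixels]
```

Here `pixels` is a flat sequence of 8-bit grayscale values; a real page would be flattened row by row before being passed in.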
hindi-book-ocr/
├── Book/                # Input directory
│   ├── page001.jpg
│   ├── page002.png
│   └── ...
├── Book_text/           # Output directory
│   ├── page001.txt
│   ├── page002.txt
│   └── ...
├── main.py              # Core OCR engine
├── requirements.txt     # Dependencies (pip)
├── pyproject.toml       # Dependencies (uv)
├── README.md            # Documentation
└── config.py            # Configuration (optional)
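# Windows: download the installer from the UB Mannheim Tesseract wiki
https://github.com/UB-Mannheim/tesseract/wiki
# Typical per-user install location
C:\Users\{USERNAME}\AppData\Local\Programs\Tesseract-OCR\
# Verify the installation and confirm that hin appears in the language list
tesseract --version
tesseract --list-langs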
# Ubuntu/Debian
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-hin
# CentOS/RHEL/Fedora
sudo dnf install tesseract tesseract-langpack-hin
# macOS, using Homebrew (tesseract-lang installs all language data, including Hindi)
brew install tesseract tesseract-lang
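
Whichever platform you installed on, it can help to confirm from Python that the Hindi language data is actually available before running a batch. The following is a minimal sketch using only the standard library; `parse_langs` and `installed_langs` are hypothetical helper names, and the parsing assumes the listing starts with a "List of available languages" header line followed by one language code per line.

```python
import shutil
import subprocess
from typing import List, Optional


def parse_langs(listing: str) -> List[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in listing.splitlines() if line.strip()]
    # Drop the header line, e.g. 'List of available languages in "..." (3):'
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs() -> Optional[List[str]]:
    """Return the installed language codes, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Older builds may print the listing to stderr, so read both streams.
    return parse_langs(result.stdout + "\n" + result.stderr)
```

Calling `installed_langs()` and checking that `"hin"` is in the returned list is enough to catch a missing language pack early.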
# Clone or download the project
git clone https://github.com/sahilkhan117/HinTextify.git
cd HinTextify
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Linux/macOS
# or
venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Install uv if not already installed
pip install uv
# Setup project
uv sync
# Place scanned images in the Book folder
cp /path/to/your/scans/* ./Book/
# Standard execution
python main.py
# With uv (recommended)
uv run main.py
# Check extracted text files
ls -la Book_text/
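
Listing the files only shows that something was written. A quick way to sanity-check the content is to measure how much of each file is Devanagari text. The sketch below is an illustrative helper, not part of `main.py`; `devanagari_ratio` and `report` are hypothetical names, and the check only covers the basic Devanagari Unicode block (U+0900 to U+097F).

```python
from pathlib import Path


def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097F")
    return hits / len(chars)


def report(output_dir: str = "Book_text") -> None:
    """Print size and Devanagari share for every .txt file in output_dir."""
    for txt in sorted(Path(output_dir).glob("*.txt")):
        text = txt.read_text(encoding="utf-8")
        print(f"{txt.name}: {len(text)} chars, {devanagari_ratio(text):.0%} Devanagari")
```

A page that is mostly Hindi but reports a very low share is a hint that the wrong language setting was used or that the scan quality was too poor.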
# main.py - Customization Options
# Input/Output Directories
INPUT_FOLDER = "Book"
OUTPUT_FOLDER = "Book_text"
# Tesseract Configuration
TESSERACT_PATH = r"C:\Users\{USERNAME}\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"
LANGUAGE_CONFIG = 'hin+eng' # Hindi + English
# Image Processing Parameters
DPI_SETTING = 300
PREPROCESSING_ENABLED = True
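
The `{USERNAME}` placeholder in `TESSERACT_PATH` has to be replaced with a real account name before the path can be used. One possible way to avoid hard-coding it is sketched below; `fill_username` and `resolve_tesseract_path` are hypothetical helpers, and preferring a binary found on `PATH` is an assumption about what is convenient, not something `main.py` is known to do.

```python
import getpass
import shutil
from typing import Optional

WINDOWS_TEMPLATE = (
    r"C:\Users\{USERNAME}\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"
)


def fill_username(template: str, username: str) -> str:
    """Replace the {USERNAME} placeholder with a concrete account name."""
    return template.replace("{USERNAME}", username)


def resolve_tesseract_path(
    template: str = WINDOWS_TEMPLATE, username: Optional[str] = None
) -> str:
    """Prefer a tesseract binary on PATH; otherwise fall back to the template."""
    on_path = shutil.which("tesseract")
    if on_path is not None:
        return on_path
    return fill_username(template, username or getpass.getuser())
```

If `main.py` relies on pytesseract (an assumption; the dependency list is not shown here), the resolved value would typically be assigned to `pytesseract.pytesseract.tesseract_cmd`.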
| Parameter | Recommended | Notes |
|---|---|---|
| Resolution | 300-600 DPI | Higher DPI = better accuracy |
| Format | PNG, TIFF | Lossless compression preferred |
| Color Mode | Grayscale | Reduces processing time |
| File Size | < 10 MB | For optimal memory usage |
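
Two of the recommendations above (lossless format and file size) can be checked without opening the image. The sketch below shows one way to flag scans that fall outside them; `scan_warnings` is a hypothetical helper, and the resolution and color-mode rows are deliberately not covered because they would need an imaging library.

```python
from pathlib import Path
from typing import List

MAX_BYTES = 10 * 1024 * 1024                   # the "< 10 MB" guideline
LOSSLESS_SUFFIXES = {".png", ".tif", ".tiff"}  # PNG and TIFF


def scan_warnings(path: Path) -> List[str]:
    """Return human-readable warnings for a scan that misses the guidelines."""
    warnings = []
    if path.suffix.lower() not in LOSSLESS_SUFFIXES:
        warnings.append(f"{path.name}: not a PNG/TIFF file, compression may be lossy")
    if path.stat().st_size >= MAX_BYTES:
        warnings.append(f"{path.name}: file is 10 MB or larger")
    return warnings
```

An empty list means the file passes both checks; JPEG input still works with the tool, so these are warnings rather than errors.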
# For pure Hindi content
config = {
'lang': 'hin',
'psm': 6, # Uniform block of text
'oem': 1 # Neural nets LSTM engine
}
# For Hindi + English mixed content
config = {
'lang': 'hin+eng',
'psm': 3, # Fully automatic page segmentation
'oem': 1
}
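
The `psm` and `oem` values correspond to Tesseract's `--psm` and `--oem` command-line options, and `lang` to `-l`. How `main.py` consumes these dictionaries is not shown, so the following is only a sketch of one plausible translation; `to_tesseract_args` is a hypothetical helper name.

```python
from typing import Dict, Tuple, Union


def to_tesseract_args(cfg: Dict[str, Union[str, int]]) -> Tuple[str, str]:
    """Split a config dict into a language string and a CLI-style option string."""
    lang = str(cfg["lang"])
    options = f"--oem {cfg['oem']} --psm {cfg['psm']}"
    return lang, options


# Example: the "pure Hindi" configuration
lang, options = to_tesseract_args({"lang": "hin", "psm": 6, "oem": 1})
print(lang, options)
```

With pytesseract, the two values would be passed as `pytesseract.image_to_string(image, lang=lang, config=options)`; on the command line the equivalent is `tesseract page001.png out -l hin --oem 1 --psm 6`.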
Input Structure:
Book/
├── chapter01_page001.jpg
├── chapter01_page002.jpg
├── chapter02_page001.png
└── manuscript_page045.tiff
Output Results:
Book_text/
├── chapter01_page001.txt ✅ 2.3KB extracted
├── chapter01_page002.txt ✅ 1.8KB extracted
├── chapter02_page001.txt ✅ 2.1KB extracted
└── manuscript_page045.txt ✅ 2.7KB extracted
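
The input-to-output mapping shown above (same file stem, `.txt` extension, UTF-8 content) can be sketched as a small loop. This is an illustration under assumptions, not the code in `main.py`: `run_batch` is a hypothetical name, the set of accepted extensions is guessed from the examples, and the OCR step is passed in as a function so the loop itself stays independent of any OCR library.

```python
from pathlib import Path
from typing import Callable, List

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}  # assumed set


def run_batch(input_dir: str, output_dir: str, ocr: Callable[[Path], str]) -> List[Path]:
    """OCR every image in input_dir and write <stem>.txt files into output_dir."""
    src, dst = Path(input_dir), Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(src.iterdir()):
        if not image.is_file() or image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip sub-folders and non-image files
        target = dst / f"{image.stem}.txt"
        target.write_text(ocr(image), encoding="utf-8")
        written.append(target)
    return written
```

In a real run, `ocr` could be a small wrapper around `pytesseract.image_to_string` with `lang="hin+eng"`; for a dry run, any function that returns a string will do.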
We welcome contributions from the community! Here's how you can help: