Home

version: 2.2.7

Welcome to the Document Redaction App documentation. This site provides comprehensive documentation for the Document Redaction App.

Navigate through the sections to learn how to install, use, and manage the application. Below is a brief introduction to the app, followed by complete installation instructions (PyPI, source, and Docker overview).

Document redaction

Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the User Guide for a full walkthrough of all the features in the app.

Handwriting and signatures redacted example

To identify text in documents, the ‘Local’ text extraction uses PikePDF, and OCR image analysis uses Tesseract, and works well only for documents with typed text or scanned PDFs with clear text. Use AWS Textract to extract more complex elements e.g. handwriting, signatures, or unclear text. For PII identification, ‘Local’ (based on spaCy) gives good results if you are looking for common names or terms, or a custom list of terms to redact (see Redaction settings). AWS Comprehend gives better results at a small cost.

Additional options include, choosing the type of information to redact (e.g. people, places), custom terms to include/ exclude from redaction, fuzzy matching, language settings, and whole page redaction. After redaction is complete, you can view and modify suggested redactions on the ‘Review redactions’ tab to quickly create a final redacted document.

NOTE: The app is not 100% accurate, and it will miss some personal information. It is essential that all outputs are reviewed by a human before using the final outputs.


🚀 Quick Start - Installation and first run

Follow these instructions to get the document redaction application running on your local machine.

1. Package installation

Option 2 - Install from PyPI

Create a virtual environment (recommended) and install doc_redaction.

python -m venv venv
# Windows:
.\venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate

The package is published on PyPI as doc-redaction (import name doc_redaction):

pip install doc_redaction

Optional extras (same as in pyproject.toml). For installing paddleOCR:

pip install "doc_redaction[paddle]"

For running VLMs / LLMs with the transformers package:

pip install "doc_redaction[vlm]"

For programmatic use (CLI-first API matching Gradio api_name routes), see Python Package usage (Python). The console script cli_redact is available after install.

Web UI from a PyPI install: You can start the Gradio UI after pip install doc_redaction by running (note that the prerequisites tesseract and poppler will need to be correctly installed following step 2 below):

python -m app

Important: your working directory matters. When you run python -m app, the app treats your current folder as the “app folder”:

  • It will look for configuration at config/app_config.env relative to the folder you run it from (and python -m doc_redaction.install_deps will also write config/app_config.env there).
  • It may create new folders in that location (for example config/, output/, input/, logs/, usage/, feedback/, and temporary/cache folders depending on your settings).
  • The UI example files and bundled assets are packaged with the PyPI install (they live inside the installed doc_redaction package). If you run from a “random” directory after a PyPI install, the app can still locate its packaged examples; your working directory mainly affects where config/, input/, output/, logs, and temp folders are created.

In practice, the smoothest UI experience (examples, bundled assets, docs links, predictable relative paths) is still usually via a repository checkout or Docker, but PyPI install is sufficient to launch the UI as long as you run it from a suitable working folder and have the system dependencies available (or run python -m doc_redaction.install_deps first).

Option 3 - Docker installation

The doc_redaction Redaction app can be installed by using the Dockerfile or Docker compose files (llama.cpp, vLLM) provided in the repo.

With Llama.cpp / vLLM inference server

The project now has Docker and Docker compose files available to pair running the Redaction app with local inference servers powered by llama.cpp, or vLLM. Llama.cpp is more flexible than vLLM for low VRAM systems, as Llama.cpp will offload to cpu/system RAM automatically rather than failing as vLLM tends to do.

For Llama.cpp, you can use the docker-compose_llama.yml file, and for vLLM, you can use the docker-compose_vllm.yml file. To run, Docker / Docker Desktop should be installed, and then you can run the commands suggested in the top of the files to run the servers.

You will need ~40 GB of disk space to run everything depending on the model chosen from the compose file. For the vLLM server, you will need 24 GB VRAM. For the Llama.cpp server, 24 GB VRAM is needed to run at full speed, but the n-gpu-layers and n-cpu-moe parameters in the Docker compose file can be adjusted to fit into your system. I would suggest that 8 GB VRAM is needed as a bare minimum for decent inference speed. See the Unsloth guide for more details on working with GGUF files for Qwen 3.5.

Without Llama.cpp / vLLM inference server

If you want a working Docker installation without GPU support, you can install from the Dockerfile in the repo. A working example of this, with the CPU version of PaddleOCR, can be found on Hugging Face. You can adjust the INSTALL_PADDLEOCR, PADDLE_GPU_ENABLED, INSTALL_VLM, and TORCH_GPU_ENABLED config variables to adjust for PaddleOCR and Transformers packages for local VLM support. Note that GPU-enabled PaddleOCR, and GPU-enabled Transformers/Torch often don’t work well together, which is one reason why a Llama.cpp/vLLM inference server Docker installation option is provided below.

2. Install prerequisites: Tesseract and Poppler

This application relies on two external tools for OCR (Tesseract) and PDF processing (Poppler). Please install them on your system before proceeding.


On Windows

If you don’t use the automated setup above, you can install the dependencies manually by downloading installers and adding the programs to your system’s PATH.

  1. Install Tesseract OCR:
    • Download the installer from the official Tesseract at UB Mannheim page (e.g., tesseract-ocr-w64-setup-v5.X.X...exe).
    • Run the installer.
    • IMPORTANT: During installation, ensure you select the option to “Add Tesseract to system PATH for all users” or a similar option. This is crucial for the application to find the Tesseract executable.
  2. Install Poppler:
    • Download the latest Poppler binary for Windows. A common source is the Poppler for Windows GitHub releases page. Download the .zip file (e.g., poppler-25.07.0-win.zip).
    • Extract the contents of the zip file to a permanent location on your computer, for example, C:\Program Files\poppler\.
    • You must add the bin folder from your Poppler installation to your system’s PATH environment variable.
      • Search for “Edit the system environment variables” in the Windows Start Menu and open it.
      • Click the “Environment Variables…” button.
      • In the “System variables” section, find and select the Path variable, then click “Edit…”.
      • Click “New” and add the full path to the bin directory inside your Poppler folder (e.g., C:\Program Files\poppler\poppler-24.02.0\bin).
      • Click OK on all windows to save the changes.

    To verify, open a new Command Prompt and run tesseract --version and pdftoppm -v. If they both return version information, you have successfully installed the prerequisites.

On Linux (Debian/Ubuntu)

Open your terminal and run the following command to install Tesseract and Poppler:

sudo apt-get update && sudo apt-get install -y tesseract-ocr poppler-utils

On Linux (Fedora/CentOS/RHEL)

Open your terminal and use the dnf or yum package manager:

sudo dnf install -y tesseract poppler-utils

3. Run the Application

With all dependencies installed, you can now start the Gradio application.

python app.py

After running the command, the application will start, and you will see a local URL in your terminal (usually http://127.0.0.1:7860).

Open this URL in your web browser to use the document redaction tool

Command line interface

If you installed from PyPI, use the installed console script:

cli_redact --help

From a repository checkout, you can also run:

python cli_redact.py --help

For Python examples that mirror each Gradio api_name, see Python Package usage (Python).


3. Run the Application

With all dependencies installed, you can now start the Gradio application GUI. For a guide on how to use this, please go here.

python app.py

After running the command, the application will start, and you will see a local URL in your terminal (usually http://127.0.0.1:7860).

Open this URL in your web browser to use the document redaction tool

Command line interface

For example CLI commands, please refer to this guide or the examples in cli_redact.py

If you installed from PyPI, use the installed console script:

cli_redact --help

From a repository checkout, you can also run:

python cli_redact.py --help

Python package commands

For Python examples in using the Python package, please see Python Package usage (Python).


Docker and cloud deployment

For container-based installs, AWS CDK, and advanced deployment, see the App installation guide (with CDK) and the Dockerfile in the repository. You can also use one of the Docker compose files and docker-compose_vllm.yml in the repository to run the application with a Llama.cpp or vLLM inference server.

Configuration

Application settings are typically loaded from config/app_config.env. A full variable reference is on the site under App settings management guide and in the GitHub repository documentation.


More documentation

Back to top