True redaction, not overlays
Approved findings are permanently removed from the PDF.
Detect PII in finished PDFs, route low-confidence findings to review, and permanently remove approved content with an audit trail.
Built for developers shipping redaction workflows in legal, clinical, and regulated document pipelines. Use the cloud API or deploy on-prem when data residency matters.
Entity type, confidence score, page location, and review outcome.
Six common PII categories, each benchmarked for precision, recall, and F1.
Hosted API or documents kept inside your own environment.
Choose your SDK
Run finished PDFs through detection, keep humans in the loop where confidence drops, and ship clean files without sending reviewers back to manual cleanup.
The model works on document structure and context instead of pattern matching alone, so names, addresses, and mixed-format identifiers are detected and surfaced for review with a confidence score.
High-confidence findings can move fast. Lower-confidence findings stay in your review workflow with labeled entities, page locations, and a record of every decision.
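A minimal sketch of what one entry in that decision record might look like, assuming it carries the fields named above (entity label, confidence, page location, review outcome). The `Finding` and `ReviewDecision` names, the field layout, the sample values, and the JSON Lines format are illustrative assumptions, not PDFDancer's actual schema.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one detected entity."""
    label: str                                # e.g. "PERSON"
    confidence: float                         # 0.0 to 1.0
    page: int                                 # 1-based page number
    bbox: tuple[float, float, float, float]   # x0, y0, x1, y1 in PDF points


@dataclass(frozen=True)
class ReviewDecision:
    """Hypothetical audit entry: the finding plus what was decided about it."""
    finding: Finding
    outcome: str      # "approved" or "rejected"
    reviewer: str     # reviewer id, or "auto" for unattended approval
    decided_at: str   # ISO 8601 timestamp in UTC


def audit_line(decision: ReviewDecision) -> str:
    """Serialize one decision as a single JSON line for an append-only log."""
    return json.dumps(asdict(decision), sort_keys=True)


# Sample values are made up for illustration.
finding = Finding("PERSON", 0.62, page=3, bbox=(72.0, 540.5, 180.0, 552.5))
decision = ReviewDecision(
    finding,
    outcome="approved",
    reviewer="reviewer-17",
    decided_at=datetime(2025, 1, 15, 9, 30, tzinfo=timezone.utc).isoformat(),
)
line = audit_line(decision)
```

Writing one line per decision keeps the log easy to append to and to replay when someone later asks why a given span was or was not removed.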
Redaction happens at the PDF level. The underlying content is deleted from the file instead of being visually masked with overlays.
Feed in PDFs: scanned, digital, or mixed.
Extract text from scanned pages, images, and non-standard text layers.
Semantic engine parses document structure, context, and entity relationships.
ML engine labels every detected entity with a confidence score.
Your logic sets the rules: which labels, what threshold, what action.
Binary-level removal. Clean at the file level, not cosmetically masked.
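The rule step in the pipeline above can be sketched as a small policy function. This is an illustration only, assuming findings arrive as (label, confidence) pairs; the `Policy` class, the label names, the action names, and the threshold values are hypothetical choices, not SDK defaults.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical rule set: which labels matter, what threshold, what action."""
    labels: set[str]                       # labels this workflow cares about
    auto_threshold: float = 0.90           # assumed default cut-off
    per_label: dict[str, float] = field(default_factory=dict)  # optional overrides

    def route(self, label: str, confidence: float) -> str:
        """Return "ignore", "redact", or "review" for one finding."""
        if label not in self.labels:
            return "ignore"
        threshold = self.per_label.get(label, self.auto_threshold)
        return "redact" if confidence >= threshold else "review"


policy = Policy(labels={"PERSON", "SSN", "EMAIL"}, per_label={"SSN": 0.97})
findings = [("PERSON", 0.98), ("SSN", 0.93), ("EMAIL", 0.71), ("LOGO", 0.99)]
actions = [policy.route(label, conf) for label, conf in findings]
```

In this sketch the SSN label gets a stricter threshold than the rest, which is one way to handle a category whose precision is lower than the others. Anything below its threshold goes to a reviewer instead of being removed automatically.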
Match the redaction workflow to the document set, reviewer, and compliance pressure you are working under.
Automated PII detection for discovery, contract redaction, and FOIA compliance. Confidence-scored findings with audit trails for court.
See the legal solution →
CSR redaction for EMA Policy 0070, patient de-identification for HIPAA, and TMF batch processing with recall-optimized detection.
See the clinical trials solution →
Redact PII from loan applications, KYC files, and audit trails. Confidence scoring tuned to your risk tolerance.
Coming soon →
FOIA-ready document preparation with automated PII removal. Deployable on-prem for strict data residency requirements.
Coming soon →
Before you automate redaction, you need benchmark data, review logs, and deployment controls you can defend.
By HIPAA entity category, measured on our English-language benchmark dataset:
| Category | Recall | Precision | F1 Score |
|---|---|---|---|
| Person | 96.28% | 97.43% | 0.969 |
| Dates of Birth | 92.57% | 100% | 0.961 |
| Account Number / SSN | 93.93% | 85.27% | 0.894 |
| Addresses | 91.22% | 99.43% | 0.951 |
| Phone / Fax Numbers | 96.3% | 94.12% | 0.952 |
| Email Addresses | 99.98% | 99.58% | 0.998 |
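F1 is the harmonic mean of precision and recall, so the last column can be recomputed from the other two. The sketch below does that for the rows above; the function and variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall %, precision %) copied from the table above.
rows = {
    "Person": (96.28, 97.43),
    "Dates of Birth": (92.57, 100.0),
    "Account Number / SSN": (93.93, 85.27),
    "Addresses": (91.22, 99.43),
    "Phone / Fax Numbers": (96.3, 94.12),
    "Email Addresses": (99.98, 99.58),
}

# Convert percentages to fractions and round to the table's three decimals.
f1 = {
    name: round(f1_score(precision / 100, recall / 100), 3)
    for name, (recall, precision) in rows.items()
}
```

When setting review thresholds, the recall column matters most for missed PII and the precision column for over-redaction; F1 alone hides which of the two is driving a lower score.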
In our current pilot with a legal services provider, the SDK processes thousands of pages per month with high accuracy on first pass. Manual review time dropped significantly compared to their previous workflow.
Keep documents in the hosted API or inside your own environment. The redaction workflow stays under your operational controls instead of being routed through external LLM services.
Permanent removal, audit trails, and deployment control support regulated document workflows. Your review policy, infrastructure, and retention rules determine the final compliance posture.
Image redaction, handwritten signatures, and entities that span page breaks still need extra handling.
The engine processes text. It does not detect or redact faces in photographs, visible handwritten signatures, logos, or other graphical elements.
We support English, German, Spanish, French, and Italian. Additional languages are on our roadmap — talk to us if your use case requires others.
The engine analyzes each page independently. If an entity (such as a name or address) starts on one page and continues on the next, we may miss part of it. This is a known gap for documents with dense, flowing text across page breaks.
Get the SDK plus the deployment help, threshold tuning, and operational support needed to keep the workflow reliable in production.
We help you scope the integration — document types, entity categories, confidence thresholds, edge cases specific to your domain.
Hands-on support for deployment, whether you're calling our cloud API or installing on-prem in a locked-down environment.
SLAs, model updates, new language and entity support as we ship it, and a dedicated account contact for enterprise customers.
Detection, redaction, and audit trail included. Requires the Pro plan ($199/month). No minimum volume.
Example: 10,000 pages/month = $199 plan fee + $2,000 page usage = $2,199/month
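The example is consistent with a flat $0.20 per page on top of the plan fee ($2,000 / 10,000 pages). Below is a small estimator under that assumption. The per-page rate is inferred from the example rather than quoted from a price list, it may not hold at other volumes, and the names are illustrative.

```python
from decimal import Decimal

PRO_PLAN_MONTHLY = Decimal("199")  # plan fee stated above
PER_PAGE = Decimal("0.20")         # assumption: inferred from $2,000 / 10,000 pages


def monthly_cost(pages: int) -> Decimal:
    """Estimate the monthly bill in dollars for a given page volume."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRO_PLAN_MONTHLY + PER_PAGE * pages


estimate = monthly_cost(10_000)  # reproduces the example above
```

`Decimal` is used instead of `float` so that currency arithmetic stays exact.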
For teams that need data residency, high-volume pricing, custom SLAs, or dedicated support.
Choose the implementation guide that matches your backend stack. Each page covers detection, review thresholds, and permanent removal.
Bring a discovery file, patient packet, or public-records release. We will show where detection, human review, and permanent removal fit, and tell you plainly whether PDFDancer is the right tool.