User FAQ
General Advice:
- Read the User Guide: Many common questions are addressed in the detailed User Guide sections.
- Start Simple: If you’re new, try redacting with default options first before customising extensively.
- Human Review is Key: Always manually review the ...redacted.pdf or use the ‘Review redactions’ tab. No automated system is perfect.
- Save Incrementally: When working on the ‘Review redactions’ tab, use the ‘Save changes on current page to file’ button periodically, especially for large documents.
General questions
What is document redaction and what does this app do?
Document redaction is the process of removing sensitive or personally identifiable information (PII) from documents. This application is a tool that automates this process for various document types, including PDFs, images, open text, and tabular data (XLSX/CSV/Parquet). It identifies potential PII using different methods and allows users to review, modify, and export the suggested redactions.
What types of documents and data can be redacted?
The app can handle a variety of formats. For documents, it supports PDFs and images (JPG, PNG). For tabular data, it works with XLSX, CSV, and Parquet files. Additionally, it can redact open text that is copied and pasted directly into the application interface.
How does the app identify text and PII for redaction?
The app employs several methods for text extraction and PII identification. Text can be read directly from PDFs with selectable text, extracted from image-based content with a local Optical Character Recognition (OCR) model, or extracted through the AWS Textract service, which handles more complex documents, handwriting, and signatures (if available). For PII identification, the app can use a local model based on the spaCy package or, for more accurate results, the AWS Comprehend service (if available).
Can I customise what information is redacted?
Yes, the app offers extensive customisation options. You can define terms that should never be redacted (an ‘allow list’), terms that should always be redacted (a ‘deny list’), and specify entire pages to be fully redacted using CSV files. You can also select specific types of entities to redact, such as dates, or remove default entity types that are not relevant to your needs.
How can I review and modify the suggested redactions?
The app provides a dedicated ‘Review redactions’ tab with a visual interface. You can upload the original document and the generated review file (CSV) to see the suggested redactions overlaid on the document. Here, you can move, resize, delete, and add new redaction boxes. You can also filter suggested redactions based on criteria and exclude them individually or in groups.
Can I work with tabular data or copied and pasted text?
Yes, the app has a dedicated tab for redacting tabular data files (XLSX/CSV) and open text. For tabular data, you can upload your file and select which columns to redact. For open text, you can simply paste the text into a box. You can then choose the redaction method and the desired output format for the anonymised data.
What are the options for the anonymisation format of redacted text?
When redacting tabular data or open text, you have several options for how the redacted information is replaced. The default is to replace the text with ‘REDACTED’. Other options include replacing it with the entity type (e.g., ‘PERSON’), redacting completely (removing the text), replacing it with a consistent hash value, or masking it with stars (’*’).
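The replacement options above can be pictured with a small sketch. This is not the app's own code: the function name, the mode names, and the choice of SHA-256 for hashing are all assumptions made for illustration.

```python
import hashlib


def anonymise(value: str, entity_type: str, mode: str = "redact") -> str:
    """Return the replacement for one detected PII value (hypothetical helper)."""
    if mode == "redact":
        return "REDACTED"
    if mode == "entity_type":
        return entity_type
    if mode == "remove":
        return ""
    if mode == "hash":
        # The same input always gives the same digest, so repeated values
        # stay linkable without exposing the original text.
        # SHA-256 is an assumption; the app's hash function is not stated here.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if mode == "mask":
        return "*" * len(value)
    raise ValueError(f"unknown mode: {mode}")
```

For example, `anonymise("Jane Doe", "PERSON", "mask")` returns eight stars, while the "hash" mode returns the same 64-character string every time "Jane Doe" appears.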
Can I export or import redactions to/from other software like Adobe Acrobat?
Yes, the app supports exporting and importing redaction data using the Adobe Acrobat comment file format (.xfdf). You can export suggested redactions from the app to an .xfdf file that can be opened in Adobe. Conversely, you can import an .xfdf file created in Adobe into the app to generate a review file (CSV) for further work within the application.
Is there a way to try the app without uploading my own documents first?
Yes. The app includes built-in examples on several tabs so you can see how it works before using your own files.
- On the ‘Redact PDFs/images’ tab, look for the “Try an example” section. Click any example to load it with pre-configured settings, then click ‘Extract text and redact document’ to run it. Examples include selectable-text PDFs, image OCR, custom entity selection, and deny list / whole-page redaction scenarios.
- On the ‘Word or Excel/CSV files’ tab, you will find examples for CSV redaction, Word document redaction, and Excel duplicate detection. Click an example, then click ‘Redact text/data files’ to process it.
What is the ...redactions_for_review.pdf file, and how is it different from ...redacted.pdf?
The app produces two different PDF outputs after redaction:
- ...redacted.pdf: the final output, with redacted text permanently removed and replaced by black boxes. This is the document you would share externally.
- ...redactions_for_review.pdf: the original document with redaction boxes overlaid but the underlying text still visible. This is a working file intended for review. It can be opened in Adobe Acrobat to inspect suggested redactions, and it can be re-uploaded to the app’s ‘Review redactions’ tab to continue working on redactions in a later session.
Does the app support Word (.docx) documents?
Yes. In addition to PDFs, images, CSV, and XLSX files, the app can also redact Word (.docx) documents. Go to the ‘Word or Excel/CSV files’ tab and upload your .docx file. The redaction method and anonymisation output format options available for tabular data apply equally to Word documents.
What do the ‘Extract text only’ and ‘Redact selected terms’ options do?
Under ‘Redaction settings’ on the ‘Redact PDFs/images’ tab, the ‘Choose redaction method’ radio button has three options:
- ‘Extract text only’: runs text extraction (OCR) without applying any redactions. Useful when you only need the ocr_output.csv text output or want to inspect what was extracted before deciding on redactions.
- ‘Redact all PII’ (the default): uses the chosen PII detection method to find and redact personal information across all selected entity types.
- ‘Redact selected terms’: focuses redaction only on the specific terms in your custom deny list. No automatic PII detection is run; only the terms you have listed will be redacted.
Troubleshooting
Q1: The app missed some personal information or redacted things it shouldn’t have. Is it broken?
A: Not necessarily. The app is not 100% accurate and is designed as an aid. The README explicitly states: “NOTE: The app is not 100% accurate, and it will miss some personal information. It is essential that all outputs are reviewed by a human before using the final outputs.”
- Solution: Always use the ‘Review redactions’ tab to manually inspect, add, remove, or modify redactions.
Q2: I uploaded a PDF, but no text was found, or redactions are very poor using the ‘Local model - selectable text’ option.
A: This option only works if your PDF has actual selectable text. If your PDF is an image scan (even if it looks like text), this method won’t work well.
- Solution:
  - Try the ‘Local OCR model - PDFs without selectable text’ option. This uses Tesseract OCR to “read” the text from images.
  - For best results, especially with complex documents, handwriting, or signatures, use the ‘AWS Textract service - all PDF types’ if available.
Q3: Handwriting or signatures are not being redacted properly.
A: The ‘Local’ text/OCR methods (selectable text or Tesseract) struggle with handwriting and signatures.
- Solution:
  - Use the ‘AWS Textract service’ for text extraction.
  - Ensure that on the main ‘Redact PDFs/images’ tab, under “Optional - select signature extraction” (when AWS Textract is chosen), you have enabled handwriting and/or signature detection. Note that signature detection has higher cost implications.
Q4: The options for ‘AWS Textract service’ or ‘AWS Comprehend’ are missing or greyed out.
A: These services are typically only available when the app is running in an AWS environment or has been specifically configured by your system admin to access these services (e.g., via API keys).
- Solution:
  - Check if your instance of the app is supposed to have AWS services enabled.
  - If running outside AWS, see the “Using AWS Textract and Comprehend when not running in an AWS environment” section in the advanced guide. This involves configuring AWS access keys, which should be done with IT and data security approval.
Q5: I re-processed the same document, and it seems to be taking a long time and potentially costing more with AWS services. Can I avoid this?
A: Yes. If you have previously processed a document with AWS Textract or the Local OCR model, the app generates a .json output file (..._textract.json or ..._ocr_results_with_words.json).
- Solution: When re-uploading your original document for redaction, also upload the corresponding .json file. The app should detect this (the “Existing Textract output file found” box may be checked), skipping the expensive text extraction step.
Q6: My app crashed, or I reloaded the page. Are my output files lost?
A: If you are logged in via AWS Cognito and the server hasn’t been shut down, you might be able to recover them.
- Solution: Go to the ‘Settings’ tab and open ‘View and download all output files from this session’. Click ‘Refresh files in output folder’ to load the list, then tick the box next to a file to display and download it.
Q7: My custom allow list (terms to never redact) or deny list (terms to always redact) isn’t working.
A: There are a few common reasons:
- File Format: Ensure your list is a .csv file with terms in the first column only, with no column header.
- Case Sensitivity: Terms in the allow/deny list are case sensitive.
- Deny List & ‘CUSTOM’ Entity: For a deny list to work, you must select the ‘CUSTOM’ entity type in ‘Redaction settings’ under ‘Entities to redact’.
- Manual Additions: If you manually added terms in the app interface (under ‘Manually modify custom allow…’), ensure you pressed Enter after typing each term in its cell.
- Fuzzy Search for Deny List: If you intend to use fuzzy matching for your deny list, ensure ‘CUSTOM_FUZZY’ is selected as an entity type, and you’ve configured the “maximum number of spelling mistakes allowed.”
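The file format described above (one term per row, first column only, no header) can be produced with a short script instead of a spreadsheet editor. The following is a minimal sketch using Python's standard csv module; the function name and the example terms are hypothetical.

```python
import csv
import io


def term_list_csv(terms):
    """Build CSV text with one term per row in the first column and no header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for term in terms:
        cleaned = term.strip()
        if cleaned:  # skip blank entries
            writer.writerow([cleaned])
    return buffer.getvalue()


# Example terms are made up; matching is case sensitive, so keep the exact
# capitalisation you expect to see in the document.
deny_list_text = term_list_csv(["Jane Doe", "Project Falcon"])
```

Writing `deny_list_text` to a file ending in .csv (opened with `newline=""`) gives a list in the layout the answer above describes.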
Q8: I’m trying to review redactions, but the PDF in the viewer looks like it’s already redacted with black boxes.
A: You likely uploaded the ...redacted.pdf file (the final output with text permanently removed) instead of the correct file.
- Solution: On the ‘Review redactions’ tab, the first upload box (1.) accepts either:
  - The original, unredacted PDF, if you are starting a fresh review and want to see all suggested redactions overlaid, or
  - The ...redactions_for_review.pdf, if you want to reload a previous set of redactions (this file shows the original text with redaction boxes overlaid but the underlying text still visible).
- The second upload box (2.) is for an ...ocr_results_with_words.csv or ...ocr_output.csv file, which enables the ‘Search text and redact’ and ‘View text’ features. Do not upload the ...redacted.pdf (the version with black boxes) to either box.
Q9: I can’t move or pan the document in the ‘Review redactions’ viewer when zoomed in.
A: You are likely in “add redaction boxes” mode.
- Solution: Scroll to the bottom of the document viewer pane and click the hand icon. This switches to “modify mode,” allowing you to pan the document by clicking and dragging, and also to move/resize existing redaction boxes.
Q10: I accidentally clicked “Exclude all items in table from redactions” on the ‘Review redactions’ tab without filtering, and now all my redactions are gone!
A: This can happen if you don’t apply a filter first.
- Solution: Click the ‘Undo last element removal’ button immediately. This should restore the redactions. Always ensure you have clicked the blue tick icon next to the search box to apply your filter before using “Exclude all items…”.
Q11: Redaction of my CSV or XLSX file isn’t working correctly.
A: The app expects a specific format for tabular data.
- Solution: Ensure your data file has a simple table format, with the table starting in the first cell (A1). There should be no other information or multiple tables within the sheet you intend to redact. For XLSX files, each sheet to be redacted must follow this format.
Q12: The “Identify duplicate pages” feature isn’t finding duplicates I expect, or it’s flagging too many pages.
A: This feature uses text similarity based on the ocr_output.csv files. The default similarity threshold is 0.95 (95%), which may be too strict or too lenient for your documents.
- Solution:
  - Ensure you’ve uploaded the correct ocr_output.csv files for all documents you’re comparing (these are generated every time you run a redaction task).
  - On the ‘Identify duplicate pages’ tab, open the ‘Duplicate matching parameters’ accordion to adjust:
    - Similarity threshold (0–1): Lower this to catch more (potentially looser) matches; raise it to require more exact matches.
    - Minimum word count: Pages with fewer words than this are ignored, which is useful for skipping near-blank pages.
    - Duplicate matching mode: Choose between ‘Find duplicates by page’ (compares full-page text) and ‘Find duplicates by text line’ (compares individual lines).
  - Review the page_similarity_results.csv output to inspect the similarity scores and verify matched text side-by-side in the interactive preview.
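To get a feel for how the similarity threshold and minimum word count interact, here is an illustrative sketch. This FAQ does not say which similarity measure the app uses, so the example substitutes `difflib.SequenceMatcher` from the Python standard library as a stand-in. The function name and the default minimum word count of 10 are assumptions.

```python
from difflib import SequenceMatcher


def is_duplicate_page(text_a, text_b, threshold=0.95, min_words=10):
    """Return True if two page texts are similar enough to count as duplicates."""
    # Pages with fewer words than min_words are ignored (e.g. near-blank pages).
    if len(text_a.split()) < min_words or len(text_b.split()) < min_words:
        return False
    # ratio() is 1.0 for identical strings and falls towards 0.0 as they diverge.
    score = SequenceMatcher(None, text_a, text_b).ratio()
    return score >= threshold


page = "The quick brown fox jumps over the lazy dog near the quiet river bank today"
near_copy = page.replace("dog", "dig")  # a single OCR-style character error
```

With these inputs, `is_duplicate_page(page, near_copy)` is True at the 0.95 threshold but False if the threshold is raised to 0.99, which mirrors the advice above: lower the threshold to catch looser matches, raise it to require closer ones.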
Q13: I exported a review file to Adobe (.xfdf), but when I open it in Adobe Acrobat, it can’t find the PDF or shows no redactions.
A: When Adobe Acrobat prompts you, it needs to be pointed to the exact original PDF.
- Solution: Ensure you select the original, unredacted PDF file that was used to generate the ..._review_file.csv (and subsequently the .xfdf file) when Adobe Acrobat asks for the associated document.
Q14: My AWS Textract API job (submitted via “Submit whole document to AWS Textract API…”) is taking a long time, or I don’t know if it’s finished.
A: Large documents can take time. As a rough guide, allow about five seconds per page.
- Solution:
  - After submitting, a Job ID will appear.
  - Periodically click the ‘Check status of Textract job and download’ button. Processing continues in the background.
  - Once ready, the _textract.json output will appear in the output area.
Q15: I’m trying to redact specific terms from my deny list, but they are not being picked up, even though the ‘CUSTOM’ entity is selected.
A: The deny list matches whole words with exact spelling by default.
- Solution:
  - Double-check the spelling and case in your deny list.
  - If you expect misspellings to be caught, you need to use the ‘CUSTOM_FUZZY’ entity type and configure the “maximum number of spelling mistakes allowed” under ‘Redaction settings’. Then, upload your deny list.
Q16: I set the “Lowest page to redact” and “Highest page to redact” in ‘Redaction settings’, but the app still seems to process or show redactions outside this range.
A: The page range setting primarily controls which pages have redactions applied in the final ...redacted.pdf. The underlying text extraction (especially with OCR/Textract) might still process the whole document to generate the ...ocr_results.csv or ..._textract.json. When reviewing, the review_file.csv might initially contain all potential redactions found across the document.
- Solution:
  - Ensure the ...redacted.pdf correctly reflects the page range.
  - When reviewing, use the page navigation and filters on the ‘Review redactions’ tab to focus on your desired page range. The final application of redactions from the review tab should also respect the range if it’s still set, but primarily it works off the review_file.csv.
Q17: My “Full page redaction list” isn’t working. I uploaded a CSV with page numbers, but those pages aren’t blacked out.
A: Common issues include:
- File Format: Ensure your list is a .csv file with page numbers in the first column only, with no column header. Each page number should be on a new row.
- Redaction Task: Simply uploading the list doesn’t automatically redact. You need to:
  1. Upload the PDF you want to redact.
  2. Upload the full page redaction CSV in ‘Redaction settings’.
  3. It’s often best to deselect all other entity types in ‘Redaction settings’ if you only want to redact these full pages.
  4. Run the ‘Redact document’ process. The output ...redacted.pdf should show the full pages redacted, and the ...review_file.csv will list these pages.
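Before uploading a full page redaction list, it can help to check that the file really contains only page numbers. The sketch below is a hypothetical pre-upload check, not part of the app. It assumes page numbers are positive whole numbers counted from 1.

```python
import csv
import io


def parse_page_list(csv_text):
    """Read page numbers from the first column; fail on headers or stray text."""
    pages = []
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row or not row[0].strip():
            continue  # ignore blank rows
        cell = row[0].strip()
        # A header such as "page" or a value like "3a" is rejected here.
        if not cell.isdigit() or int(cell) < 1:
            raise ValueError(f"row {row_number}: expected a page number, got {cell!r}")
        pages.append(int(cell))
    return pages
```

Calling `parse_page_list("2\n5\n7\n")` returns `[2, 5, 7]`, while a file that starts with a header row such as "page" raises a `ValueError` naming the offending row.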
Q18: I merged multiple ...review_file.csv files, but the output seems to have duplicate redaction boxes or some are missing.
A: The merge feature simply combines all rows from the input review files.
- Solution:
  - Duplicates: If the same redaction (same location, text, label) was present in multiple input files, it will appear multiple times in the merged file. You’ll need to manually remove these duplicates on the ‘Review redactions’ tab or by editing the merged ...review_file.csv in a spreadsheet editor before review.
  - Missing: Double-check that all intended ...review_file.csv files were correctly uploaded for the merge. Ensure the files themselves contained the expected redactions.
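If a merged review file contains many repeated rows, removing them by script may be quicker than doing it by hand. The following is a rough sketch using only the Python standard library. The column names `page`, `label`, and `text` are assumptions about the review file layout (the coordinate columns `xmin`, `ymin`, `xmax`, `ymax` are the ones named elsewhere in this FAQ), so check them against your own file first.

```python
import csv
import io

# Columns that together identify "the same redaction" (partly assumed names).
KEY_COLUMNS = ("page", "label", "text", "xmin", "ymin", "xmax", "ymax")


def drop_duplicate_rows(csv_text):
    """Keep the first copy of each redaction and preserve the column order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    unique_rows = []
    for row in reader:
        key = tuple(row[column] for column in KEY_COLUMNS)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(unique_rows)
    return output.getvalue()
```

The header and column order are written back unchanged, since the review file should keep its original column order.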
Q19: I imported an .xfdf Adobe comment file, but the review_file.csv generated doesn’t accurately reflect the highlights or comments I made in Adobe Acrobat.
A: The app converts Adobe’s comment/highlight information into its review_file format. Discrepancies can occur if:
- Comment Types: The app primarily looks for highlight-style annotations that it can interpret as redaction areas. Other Adobe comment types (e.g., sticky notes without highlights, text strike-throughs not intended as redactions) might not translate.
- Complexity: Very complex or unusually shaped Adobe annotations might not convert perfectly.
- PDF Version: Ensure the PDF uploaded alongside the .xfdf is the exact same original, unredacted PDF that the comments were made on in Adobe.
- Solution: After import, always open the generated review_file.csv (with the original PDF) on the ‘Review redactions’ tab to verify and adjust as needed.
Q20: The Textract API job status table (under “Submit whole document to AWS Textract API…”) only shows recent jobs, or I can’t find an older Job ID I submitted.
A: The table showing Textract job statuses might have a limit, or might only show jobs from the current session or within a certain timeframe (the limit mentioned is “up to seven days old”).
- Solution:
  - It’s good practice to note down the Job ID immediately after submission if you plan to check it much later.
  - If the _textract.json file was successfully created from a previous job, you can re-upload that .json file with your original PDF to bypass the API call and proceed directly to redaction or OCR conversion.
Q21: I edited a ...review_file.csv in Excel (e.g., changed coordinates, labels, colors), but when I upload it to the ‘Review redactions’ tab, the boxes are misplaced, the wrong color, or it causes errors.
A: The review_file.csv has specific columns and data formats (e.g., coordinates, RGB color tuples like (0,0,255)).
- Solution:
  - Coordinates (xmin, ymin, xmax, ymax): Ensure these are numeric and make sense for PDF coordinates. Drastic incorrect changes can misplace boxes.
  - Colors: Ensure the color column uses the (R,G,B) format, e.g., (0,0,255) for blue. Avoid hex codes or color names; the guide only mentions the RGB format.
  - CSV Integrity: Ensure you save the file strictly as a CSV. Excel sometimes adds extra formatting or changes delimiters if not saved carefully.
  - Column Order: Do not change the order of columns in the review_file.csv.
  - Test Small Changes: Modify one or two rows/values first to see the effect before making bulk changes.
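The checks listed above can be automated before re-uploading an edited file. Here is a minimal sketch of a row validator, written as a hypothetical helper rather than anything the app provides. It assumes each row is a dictionary keyed by column name (as produced by `csv.DictReader`), that the colour column is called `color`, and that a valid box has xmin below xmax and ymin below ymax.

```python
import re

# Matches an RGB tuple such as (0,0,255), with optional spaces.
COLOR_PATTERN = re.compile(r"\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)")


def check_row(row):
    """Return a list of problems found in one review-file row (empty if none)."""
    problems = []
    coords = {}
    for name in ("xmin", "ymin", "xmax", "ymax"):
        try:
            coords[name] = float(row[name])
        except (KeyError, TypeError, ValueError):
            problems.append(f"{name} is missing or not numeric")
    if len(coords) == 4:
        if coords["xmin"] >= coords["xmax"]:
            problems.append("xmin should be less than xmax")
        if coords["ymin"] >= coords["ymax"]:
            problems.append("ymin should be less than ymax")
    match = COLOR_PATTERN.fullmatch(str(row.get("color", "")).strip())
    if match is None or any(int(part) > 255 for part in match.groups()):
        problems.append("color should look like (R,G,B) with values 0-255")
    return problems
```

Running this over one or two edited rows first fits the "test small changes" advice: an empty list means the row passed these basic checks, although it does not guarantee the box lands where you intended.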
Q22: The cost and time estimation feature isn’t showing up, or it’s giving unexpected results.
A: This feature depends on admin configuration and certain conditions.
- Solution:
  - Admin Enabled: Confirm with your system admin that the cost/time estimation feature is enabled in the app’s configuration.
  - AWS Services: Estimation is typically most relevant when using AWS Textract or Comprehend. If you’re only using ‘Local’ models, the estimation might be simpler or not show AWS-related costs.
  - Existing Output: If “Existing Textract output file found” is checked (because you uploaded a pre-existing _textract.json), the estimated cost and time should be significantly lower for the Textract part of the process.
Q23: I’m prompted for a “cost code,” but I don’t know what to enter, or my search isn’t finding it.
A: Cost code selection is an optional feature enabled by system admins for tracking AWS usage.
- Solution:
  - Contact Admin/Team: If you’re unsure which cost code to use, consult your team lead or the system administrator who manages the redaction app. They should provide the correct code or guidance.
  - Search Tips: Try searching by project name, department, or any known identifiers for your cost center. The search might be case-sensitive or require exact phrasing.
Q24: I selected “hash” as the anonymisation output format for my tabular data, but the output still shows “REDACTED” or something else.
A: Ensure the selection was correctly registered before redacting.
- Solution:
  - Double-check on the ‘Word or Excel/CSV files’ tab, under ‘Anonymisation output format,’ that “hash” (or your desired format) is indeed selected.
  - Try re-selecting it and then click ‘Redact text/data files’ again.
  - If the issue persists, it might be a bug or a specific interaction with your data type that prevents hashing. Report this to your app administrator. “Hash” should replace PII with a consistent unique ID for each unique piece of PII.
Q25: I’m using ‘CUSTOM_FUZZY’ for my deny list. I have “Should fuzzy search match on entire phrases in deny list” checked, but it’s still matching individual words within my phrases or matching things I don’t expect.
A: Fuzzy matching on entire phrases can be complex. The “maximum number of spelling mistakes allowed” applies to the entire phrase.
- Solution:
  - Mistake Count: If your phrase is long and the allowed mistakes are few, it might not find matches if the errors are distributed. Conversely, too many allowed mistakes on a short phrase can lead to over-matching. Experiment with the mistake count.
  - Specificity: If “match on entire phrases” is unchecked, it will fuzzy match each individual word (excluding stop words) in your deny list phrases. This can be very broad. Ensure this option is set according to your needs.
  - Test with Simple Phrases: Try a very simple phrase with a known, small number of errors to see if the core fuzzy logic is working as you expect, then build up complexity.
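The effect of the mistake count is easier to see with a worked example. The sketch below uses plain Levenshtein edit distance (insertions, deletions, and substitutions each count as one mistake) as a stand-in; the app's actual fuzzy-matching algorithm is not described in this FAQ, and both function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(candidate, deny_term, max_mistakes=1):
    """True if candidate is within max_mistakes edits of the whole deny term."""
    return edit_distance(candidate, deny_term) <= max_mistakes
```

Under this measure, "Jhon Smith" is two edits away from "John Smith", so it only matches once two mistakes are allowed across the whole phrase. By contrast, a short word such as "cat" already matches "car" with a single allowed mistake, which illustrates the over-matching risk on short terms.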
Q26: I “locked in” a new redaction box format on the ‘Review redactions’ tab (label, colour), but now I want to change it or go back to the pop-up for each new box.
A: When a format is locked, a new icon (described as looking like a “gift tag”) appears at the bottom of the document viewer.
- Solution:
  - Click the “gift tag” icon at the bottom of the document viewer pane.
  - This will allow you to change the default locked format.
  - To go back to the pop-up appearing for each new box, click the lock icon within that “gift tag” menu again to “unlock” it (it should turn from blue to its original state).
Q27: I clicked “Redact document,” processing seemed to complete (e.g., progress bar finished, “complete” message shown), but no output files (...redacted.pdf, ...review_file.csv) appeared in the output area.
A: This could be due to various reasons:
- No PII Found: If absolutely no PII was detected according to your settings (entities, allow/deny lists), the app might not generate a ...redacted.pdf if there’s nothing to redact, though a review_file.csv (potentially empty) and ocr_results.csv should still ideally appear.
- Error During File Generation: An unhandled error might have occurred silently during the final file creation step.
- Browser/UI Issue: The UI might not have refreshed to show the files.
- Permissions: In rare cases, if running locally, there might be file system permission issues preventing the app from writing outputs.
- Solution:
  - Try refreshing the browser page (if feasible without losing input data, or after re-uploading).
  - Check the ‘Settings’ tab for ‘View and download all output files from this session’ (if logged in via Cognito); they might be listed there.
  - Try a very simple document with obvious PII and default settings to see if any output is generated.
  - Check the browser developer console (F12) for any error messages.
Q28: When reviewing, I click on a row in the ‘Search suggested redactions’ table. The page changes, but the specific redaction box isn’t highlighted, or the view doesn’t scroll to it.
A: The highlighting feature (“should change the colour of redaction box to blue”) is an aid.
- Solution:
  - Ensure you are on the correct page. The table click should take you there.
  - The highlighting might be subtle or conflict with other UI elements. Manually scan the page for the text/label mentioned in the table row.
  - Scrolling to the exact box isn’t explicitly guaranteed, especially on very dense pages. The main function is page navigation.
Reviewing redactions
How do I manually search for and redact text that the automatic detection missed?
Use the ‘Search text and redact’ tab within the ‘Review redactions’ tab (found to the right of the document viewer, next to the ‘Apply redactions to PDF’ and ‘Save changes on current page’ buttons). This tab shows all the word-level text extracted from your document and allows you to:
1. Type a word or phrase in the ‘Multi-word text search’ box. Tick ‘Enable regex pattern matching’ if you want to use regular expressions. Click ‘Search’ (or press Enter).
2. The table updates to show only matching rows. Click any row to jump to that page in the document viewer.
3. Choose how to redact:
   - ‘Redact specific text row’: redacts only the exact instance on the row you clicked.
   - ‘Redact all words with same text as selected row’: redacts every occurrence of that word/phrase throughout the document.
   - ‘Redact all text in table’: redacts everything currently shown in the filtered table in one go.
4. If you make a mistake, click ‘Undo latest redaction’ to reverse the last redaction action (one level of undo only).
5. Before redacting, you can customise the label and colour for new boxes under the ‘Search options’ accordion.
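To illustrate what the ‘Enable regex pattern matching’ option changes, the sketch below contrasts a literal search with a regular-expression search over a list of extracted words. It is not the app's code: the `(page, text)` layout, the function name, and the sample words are assumptions for illustration.

```python
import re


def search_words(words, query, use_regex=False):
    """Return the (page, text) pairs whose text matches the query."""
    # Without regex mode, special characters in the query are treated literally.
    pattern = re.compile(query if use_regex else re.escape(query))
    return [(page, text) for page, text in words if pattern.search(text)]


# Made-up sample of word-level OCR output.
words = [(1, "Invoice"), (1, "AB-1234"), (2, "AB-9876"), (2, "a.b"), (3, "aXb")]
reference_hits = search_words(words, r"AB-\d{4}", use_regex=True)
```

Here `reference_hits` contains both "AB-1234" and "AB-9876", which a single literal search could not find together. Note also that the query "a.b" matches only "a.b" in literal mode, but matches "aXb" as well in regex mode, because the dot then means "any character".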
What is the ‘View text’ tab on the ‘Review redactions’ page?
The ‘View text’ tab (below the ‘Search text and redact’ tab on the ‘Review redactions’ page) displays the line-by-line text extracted from the document. This lets you verify the accuracy of the OCR output. You can search the table using the search bar above it, or filter individual columns by clicking the three dots next to a column header. Clicking a row navigates the document viewer to that page. Click ‘Reset OCR output table filter’ to clear any active filters. This table is populated automatically after a redaction run, or when you upload an ...ocr_output.csv file to the second upload box on the ‘Review redactions’ tab.
After reviewing and modifying redactions, how do I produce the final redacted PDF?
Once you are happy with all the redactions in the document viewer, click the ‘Apply revised redactions to PDF’ button (found above the ‘Save changes on current page to file’ button on the ‘Review redactions’ tab). This generates a new ...redacted.pdf (with text permanently removed) and an updated ...redactions_for_review.pdf (with redaction boxes overlaid on the original text). Both files will appear in the output area.
Can I remove individual redactions or all instances of a specific text without using the filter?
Yes. On the ‘Review redactions’ tab, to the right of the document viewer, under the ‘Modify redactions’ heading:
- ‘Exclude specific redaction row’: removes only the single redaction from the last row you clicked in the table. The currently selected row is shown below the button.
- ‘Exclude all redactions with the same text as selected row’: removes every redaction in the document that has exactly the same underlying text as your selected row.
- ‘Exclude all redactions in table’: removes all redactions currently visible in the table. Always apply a filter first (using the dropdowns or the filter box, then clicking the blue tick icon) before using this option, otherwise all redactions in the document will be removed.
- After any of these actions, click ‘Reset filters’ to return the table to showing all remaining redactions.
- If you remove redactions by mistake, click ‘Undo last element removal’ immediately to restore them (one level of undo only).
Working with previous redaction results
How do I return to a document I previously redacted to add or change redactions?
You do not need to re-run the full redaction process. Instead:
1. On the ‘Review redactions’ tab, upload the ...redactions_for_review.pdf file (produced during the original redaction run) into the first upload box (1.). This file contains all the previous redaction boxes embedded in it.
2. Upload the ...ocr_results_with_words.csv file into the second upload box (2.) if you want to use the ‘Search text and redact’ feature to find additional terms to redact.
3. The document viewer and redaction table will be populated with the previous redactions, and you can modify them as usual before applying the final redactions to PDF.
Can I combine redactions from multiple separate redaction runs of the same document?
Yes. If you have run several redaction tasks on the same document (for example, using different settings each time) and want to merge all the suggested redaction boxes together:
1. Go to the ‘Settings’ tab and find the ‘Combine multiple review PDFs or CSV files’ section.
2. Upload all the ...redactions_for_review.pdf files you want to merge.
3. Click ‘Combine multiple review PDFs into one’. A combined file will be produced containing all the redaction boxes from the uploaded files.
4. Upload this combined file into the ‘Review redactions’ tab to inspect, modify, and finalise the merged redactions.
How do I skip re-running OCR or AWS Textract when I redact the same document a second time?
Every time you redact a document, a .json output file is produced (either ..._textract.json for AWS Textract, or ..._ocr_results_with_words.json for the local OCR model). To skip the text extraction step on future runs:
1. When uploading your document on the ‘Redact PDFs/images’ tab, select both the original PDF and the .json file at the same time from the upload area.
2. The app will detect the .json file and automatically tick the ‘Existing Textract output file found’ or ‘Existing local OCR output file found’ checkbox, indicating the text extraction step will be skipped.

This saves time and, where AWS Textract is used, avoids incurring the extraction cost again.
Additional features
Can the app summarise documents?
Yes, if document summarisation is enabled in your deployment, a ‘Document summarisation’ tab will be visible. To summarise a document:
1. Upload one or more PDF files, or one or more ...ocr_output.csv files (from a previous redaction run) using the upload boxes on the tab.
2. Open the ‘Summarisation settings’ accordion to choose:
   - LLM inference method: the language model to use.
   - Max pages per page-group summary: how many pages are summarised together at a time.
   - Summary format: Concise (key themes) or Detailed.
   - Additional summary instructions (optional): e.g. “Focus on key obligations.”
3. Click ‘Generate summary’. When finished, the summary appears below and summary files are available for download.
What does the ‘Redact duplicate pages’ checkbox do on the ‘Redact PDFs/images’ tab?
When this checkbox is ticked (found alongside the PII identification options under ‘Redaction settings’), the app will automatically detect pages with near-identical text within the document and apply whole-page redaction to any duplicates found, as part of the same redaction run. This is a quick way to handle documents that contain repeated pages. For more control over duplicate detection, such as adjusting the similarity threshold, comparing across multiple documents, or finding duplicate lines of text, use the dedicated ‘Identify duplicate pages’ tab as described in the advanced user guide.