Supported and Unsupported File Formats

Supported File Formats

Text Documents

PDF (.pdf)
Microsoft Word (.doc, .docx)
Text files (.txt)
Rich Text Format (.rtf)
Markdown (.md)
OpenDocument Text (.odt)
HTML (.html, .htm)

Spreadsheets

Microsoft Excel (.xls, .xlsx)
CSV (Comma Separated Values) (.csv)
TSV (Tab Separated Values) (.tsv)
OpenDocument Spreadsheet (.ods)

Presentations

Microsoft PowerPoint (.ppt, .pptx)
OpenDocument Presentation (.odp)

Data Formats

JSON (.json)
XML (.xml)
YAML (.yaml, .yml)

Email

Outlook Messages (.msg)
Email Messages (.eml)

Image-Based Documents (with OCR processing)

Scanned PDFs
Image files containing text (.jpg, .png)

While we can process image-based documents, the accuracy of text extraction depends on image quality.
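
The supported list above can be expressed as a lookup from file extension to category. The following is a minimal sketch, not a definitive implementation: `classifyFormat`, `FormatCategory`, and the lookup table are hypothetical names introduced for illustration, and only the extensions themselves come from the list. Note that a scanned PDF cannot be recognized from its extension alone, so the `needsOcr` flag here is set only for image files.

```typescript
// Categories mirror the headings of the supported-formats list.
type FormatCategory =
  | "text-document"
  | "spreadsheet"
  | "presentation"
  | "data"
  | "email"
  | "image";

// Extension -> category. The table is an illustrative assumption; the
// extensions are the ones named in the list above.
const SUPPORTED_EXTENSIONS = new Map<string, FormatCategory>([
  ["pdf", "text-document"], ["doc", "text-document"], ["docx", "text-document"],
  ["txt", "text-document"], ["rtf", "text-document"], ["md", "text-document"],
  ["odt", "text-document"], ["html", "text-document"], ["htm", "text-document"],
  ["xls", "spreadsheet"], ["xlsx", "spreadsheet"], ["csv", "spreadsheet"],
  ["tsv", "spreadsheet"], ["ods", "spreadsheet"],
  ["ppt", "presentation"], ["pptx", "presentation"], ["odp", "presentation"],
  ["json", "data"], ["xml", "data"], ["yaml", "data"], ["yml", "data"],
  ["msg", "email"], ["eml", "email"],
  ["jpg", "image"], ["png", "image"],
]);

interface FormatCheck {
  supported: boolean;
  category?: FormatCategory;
  // True when text can only be recovered through OCR (image files here;
  // detecting a scanned PDF would require inspecting its content).
  needsOcr: boolean;
}

function classifyFormat(filename: string): FormatCheck {
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : "";
  const category = SUPPORTED_EXTENSIONS.get(ext);
  if (category === undefined) {
    return { supported: false, needsOcr: false };
  }
  return { supported: true, category, needsOcr: category === "image" };
}

// Usage: extensions are matched case-insensitively.
const reportCheck = classifyFormat("Report.PDF");
const drawingCheck = classifyFormat("drawing.dwg");
```

A `Map` is used instead of a plain object so that an odd extension such as `.constructor` cannot accidentally match an inherited property.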

Supporting these formats would require integrating text-extraction libraries, such as:

pdf.js for PDFs
mammoth.js for Word documents
SheetJS for Excel files
An OCR library such as Tesseract.js for image-based documents
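
One way to wire the libraries listed above into a pipeline is a small routing step that decides which library should handle a given file. The sketch below only returns the library name and does not call any library API. `pickExtractor` and the `scanned` flag are hypothetical; sending scanned PDFs to OCR is an assumption about how such a pipeline might be arranged, and legacy .doc files are left unmapped because mammoth.js targets .docx.

```typescript
// The four libraries named in the list above.
type Extractor = "pdf.js" | "mammoth.js" | "SheetJS" | "Tesseract.js";

// Extension -> library. Formats with no library named in this document
// (for example .csv or .json) are intentionally absent.
const EXTRACTOR_BY_EXTENSION = new Map<string, Extractor>([
  ["pdf", "pdf.js"],
  ["docx", "mammoth.js"], // assumption: .doc would need a different tool
  ["xls", "SheetJS"],
  ["xlsx", "SheetJS"],
  ["jpg", "Tesseract.js"],
  ["png", "Tesseract.js"],
]);

// `scanned` is a hypothetical flag the caller sets once it knows a PDF has
// no text layer; such a file is routed to OCR instead of pdf.js.
function pickExtractor(filename: string, scanned = false): Extractor | null {
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : "";
  if (ext === "pdf" && scanned) {
    return "Tesseract.js";
  }
  return EXTRACTOR_BY_EXTENSION.get(ext) ?? null;
}

// Usage: a null result means this document names no library for the format.
const forContract = pickExtractor("contract.docx");
const forScan = pickExtractor("scan.pdf", true);
```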

Size Limitations

Maximum file size: 25 MB per document
Maximum total project size: 100 MB
Maximum recommended content: 50 pages of text or 10 spreadsheets per project
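
The two hard size limits above could be enforced before any extraction work starts. This is a minimal sketch under stated assumptions: `checkProjectSize` and `UploadFile` are hypothetical names, and a megabyte is taken to be 1024 × 1024 bytes, which the document does not specify.

```typescript
const MB = 1024 * 1024; // assumption: binary megabytes
const MAX_FILE_BYTES = 25 * MB; // per-document limit
const MAX_PROJECT_BYTES = 100 * MB; // whole-project limit

interface UploadFile {
  name: string;
  sizeBytes: number;
}

// Returns a list of human-readable problems; an empty list means the
// project is within both limits.
function checkProjectSize(files: UploadFile[]): string[] {
  const problems: string[] = [];
  let totalBytes = 0;
  for (const file of files) {
    totalBytes += file.sizeBytes;
    if (file.sizeBytes > MAX_FILE_BYTES) {
      problems.push(`${file.name} exceeds the 25 MB per-document limit`);
    }
  }
  if (totalBytes > MAX_PROJECT_BYTES) {
    problems.push("project exceeds the 100 MB total limit");
  }
  return problems;
}

// Usage: one 26 MB file breaks the per-document limit only.
const sizeProblems = checkProjectSize([{ name: "big.pdf", sizeBytes: 26 * MB }]);
```

Returning every problem at once, rather than stopping at the first, lets a caller report all oversized files in a single pass.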

Formats We Cannot Process

Image-Based Documents Without Quality OCR

Scanned documents with poor quality or resolution
Handwritten documents
Faxed documents with artifacts
Documents with watermarks that obscure text
Images with text embedded in complex backgrounds
Documents with non-standard fonts or stylized text

Protected/Secured Documents

Password-protected PDFs
DRM-protected documents
Encrypted files of any format
Documents with editing restrictions
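
For PDFs, protection can often be spotted before extraction is attempted, because an encrypted PDF carries an `/Encrypt` entry in its trailer. The sketch below is a rough heuristic and not a parser: `looksLikeEncryptedPdf` is a hypothetical helper, it can report a false positive if the text `/Encrypt` happens to appear elsewhere in the file, and it does not cover DRM schemes or other file formats. PDFs with editing restrictions typically use the same encryption entry, so they would likely be flagged as well.

```typescript
// Returns true when the byte sequence `needle` occurs anywhere in `haystack`.
function containsBytes(haystack: Uint8Array, needle: number[]): boolean {
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return true;
  }
  return false;
}

// ASCII bytes of the markers we search for.
const PDF_HEADER = Array.from("%PDF-", (c) => c.charCodeAt(0));
const ENCRYPT_KEY = Array.from("/Encrypt", (c) => c.charCodeAt(0));

// Heuristic: the file has a PDF header near its start and mentions
// /Encrypt somewhere in its raw bytes.
function looksLikeEncryptedPdf(bytes: Uint8Array): boolean {
  const isPdf = containsBytes(bytes.subarray(0, 1024), PDF_HEADER);
  return isPdf && containsBytes(bytes, ENCRYPT_KEY);
}
```

A production check would more reliably come from the PDF library itself, which usually reports a password or permission error when it opens a protected file.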

Complex Formats

PDF forms with fillable fields (content may not extract properly)
Documents with complex layouts (multiple columns, text boxes, etc.)
PDFs created as image compilations without OCR
Scanned books with curved page surfaces

Specialized Formats

CAD files (.dwg, .dxf)
GIS/Mapping files
Proprietary financial software exports
Database files (.mdb, .accdb)
Compressed archives (.zip, .rar), which must be extracted first
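
Archives are sometimes uploaded under a misleading extension, so a check on the file's leading bytes can complement the extension list above. The following sketch assumes the standard ZIP and RAR signatures. `detectArchive` is a hypothetical helper. Because .docx, .xlsx, .pptx, and the OpenDocument formats are themselves ZIP containers, the extension is checked first so that those supported files are not rejected.

```typescript
// Leading-byte signatures.
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04"
const RAR_MAGIC = [0x52, 0x61, 0x72, 0x21, 0x1a, 0x07]; // "Rar!\x1A\x07"

// Supported document formats that are ZIP containers internally.
const ZIP_BASED_DOCS = new Set(["docx", "xlsx", "pptx", "odt", "ods", "odp"]);

function startsWithBytes(bytes: Uint8Array, magic: number[]): boolean {
  if (bytes.length < magic.length) return false;
  return magic.every((b, i) => bytes[i] === b);
}

type ArchiveKind = "zip" | "rar" | null;

// Returns the archive type, or null when the file does not look like a
// bare archive. Empty ZIP files use a different signature and are not
// detected by this sketch.
function detectArchive(filename: string, bytes: Uint8Array): ArchiveKind {
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : "";
  if (ZIP_BASED_DOCS.has(ext)) return null;
  if (startsWithBytes(bytes, ZIP_MAGIC)) return "zip";
  if (startsWithBytes(bytes, RAR_MAGIC)) return "rar";
  return null;
}
```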

Media Files

Audio files (.mp3, .wav)
Video files (.mp4, .mov)
Pure image files without text content

Size/Complexity Issues

Documents exceeding 100 pages
Spreadsheets with more than 10,000 rows
Files larger than 25 MB
Documents with hundreds of embedded images
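
The numeric thresholds in this list lend themselves to a simple pre-flight check once basic statistics about a document are known. This is a sketch only: `complexityIssues` and `DocumentStats` are hypothetical names, a megabyte is assumed to be 1024 × 1024 bytes, and the embedded-image condition is left out because the list gives no exact number for it.

```typescript
// Statistics a caller would gather from the parsed document. Fields that
// do not apply to a format (rows for a PDF, pages for a CSV) are omitted.
interface DocumentStats {
  sizeBytes: number;
  pageCount?: number;
  rowCount?: number;
}

// Thresholds taken from the list above.
const MAX_PAGES = 100;
const MAX_ROWS = 10000;
const MAX_BYTES = 25 * 1024 * 1024; // assumption: binary megabytes

// Returns one message per exceeded threshold; empty means no issue found.
function complexityIssues(stats: DocumentStats): string[] {
  const issues: string[] = [];
  if (stats.pageCount !== undefined && stats.pageCount > MAX_PAGES) {
    issues.push(`document has ${stats.pageCount} pages (limit ${MAX_PAGES})`);
  }
  if (stats.rowCount !== undefined && stats.rowCount > MAX_ROWS) {
    issues.push(`spreadsheet has ${stats.rowCount} rows (limit ${MAX_ROWS})`);
  }
  if (stats.sizeBytes > MAX_BYTES) {
    issues.push("file is larger than 25 MB");
  }
  return issues;
}

// Usage: a small 120-page document trips only the page limit.
const longReportIssues = complexityIssues({ sizeBytes: 2048, pageCount: 120 });
```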