Text Documents
- PDF (.pdf)
- Microsoft Word (.doc, .docx)
- Text files (.txt)
- Rich Text Format (.rtf)
- Markdown (.md)
- OpenDocument Text (.odt)
- HTML (.html, .htm)
Spreadsheets
- Microsoft Excel (.xls, .xlsx)
- CSV (Comma Separated Values) (.csv)
- TSV (Tab Separated Values) (.tsv)
- OpenDocument Spreadsheet (.ods)
Presentations
- Microsoft PowerPoint (.ppt, .pptx)
- OpenDocument Presentation (.odp)
Data Formats
- JSON (.json)
- XML (.xml)
- YAML (.yaml, .yml)
Email
- Outlook Messages (.msg)
- Email Messages (.eml)
Image-Based Documents (with OCR processing)
- Scanned PDFs
- Image files containing text (.jpg, .png)
While we can process image-based documents, the accuracy of text extraction depends on image quality.
Supporting these formats would require integrating appropriate text-extraction libraries, such as:
- pdf.js for PDFs
- mammoth.js for Word documents
- SheetJS for Excel files
- An OCR library such as Tesseract.js for image-based documents
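One way to wire these libraries together is a small routing table keyed on file extension. The sketch below is illustrative only: the `Extractor` type, the `EXTRACTORS` map, and `pickExtractor` are hypothetical names, the "plain-text" bucket is an assumption for formats that can be read directly, and no library is actually called. Note that mammoth.js handles .docx only, so legacy .doc files are left unrouted here.

```typescript
// Hypothetical routing sketch: maps a file extension to the extraction
// approach that might handle it. Only the four library names come from the
// list above; the grouping itself is an assumption.
type Extractor =
  | "pdf.js"
  | "mammoth.js"
  | "SheetJS"
  | "Tesseract.js"
  | "plain-text"
  | "unsupported";

const EXTRACTORS = new Map<string, Extractor>([
  [".pdf", "pdf.js"],
  [".docx", "mammoth.js"], // mammoth.js does not read legacy .doc
  [".xls", "SheetJS"],
  [".xlsx", "SheetJS"],
  [".jpg", "Tesseract.js"],
  [".png", "Tesseract.js"],
  // Assumption: these formats are already text and need no extraction library.
  [".txt", "plain-text"],
  [".md", "plain-text"],
  [".csv", "plain-text"],
  [".tsv", "plain-text"],
  [".json", "plain-text"],
  [".xml", "plain-text"],
  [".yaml", "plain-text"],
  [".yml", "plain-text"],
]);

function pickExtractor(fileName: string): Extractor {
  const dot = fileName.lastIndexOf(".");
  // No extension, or a dotfile such as ".env": nothing to route on.
  if (dot <= 0) return "unsupported";
  const ext = fileName.slice(dot).toLowerCase();
  return EXTRACTORS.get(ext) ?? "unsupported";
}
```

A scanned PDF shares the .pdf extension with a text PDF, so a real pipeline would likely need to fall back to OCR when text extraction returns nothing; this sketch does not attempt that.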
Size Limitations
- Maximum file size: 25 MB per document
- Maximum total project size: 100 MB
- Recommended maximum: 50 pages of text or 10 spreadsheets per project
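The two hard size limits above could be enforced with a check along these lines. This is a minimal sketch under stated assumptions: `FileInfo` and `checkProjectLimits` are hypothetical names, and "MB" is taken to mean 1024 × 1024 bytes, which the list does not specify. The page and spreadsheet counts are recommendations, so they are not checked.

```typescript
// Assumption: "MB" means 2^20 bytes; the source does not say which convention applies.
const MB = 1024 * 1024;
const MAX_FILE_BYTES = 25 * MB;
const MAX_PROJECT_BYTES = 100 * MB;

// Hypothetical shape for an uploaded file's metadata.
interface FileInfo {
  name: string;
  sizeBytes: number;
}

// Returns a list of human-readable problems; an empty list means the
// project is within both the per-document and the total size limits.
function checkProjectLimits(files: FileInfo[]): string[] {
  const problems: string[] = [];
  let totalBytes = 0;
  for (const file of files) {
    totalBytes += file.sizeBytes;
    if (file.sizeBytes > MAX_FILE_BYTES) {
      problems.push(`${file.name} exceeds the 25 MB per-document limit`);
    }
  }
  if (totalBytes > MAX_PROJECT_BYTES) {
    problems.push("project exceeds the 100 MB total limit");
  }
  return problems;
}
```

Returning a list of problems rather than throwing on the first one lets a caller report every oversized file in a single pass.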