Apache Tika
Content analysis toolkit that detects and extracts metadata and text from over a thousand different file types
About
Apache Tika is an open-source toolkit that detects and extracts metadata and structured text content from over a thousand file formats. It simplifies content analysis by providing a single interface for parsing documents such as PDFs, Microsoft Office files, ima…