Release Category: Alpha
txt2phrases is a Python library and command-line tool for processing and analyzing textual data. It offers a streamlined workflow to convert documents (HTML and PDF) into plain text, extract keywords using AI-based models, and classify them into specific and general categories using TF-IDF techniques.
Role: Pipeline from documents (PDF/HTML) to plain text and then to keyphrases; can consume pygetpapers output.
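The specific/general split can be illustrated with a small TF-IDF sketch. This is not the txt2phrases implementation: the function names, the input shape (a dict of document id to phrase list), and the threshold rule are all assumptions made for illustration, using only the Python standard library.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return {doc_id: {phrase: tf-idf}} for docs given as {doc_id: [phrase, ...]}."""
    n_docs = len(docs)
    # Document frequency: the number of documents each phrase occurs in.
    df = Counter()
    for phrases in docs.values():
        df.update(set(phrases))
    scores = {}
    for doc_id, phrases in docs.items():
        counts = Counter(phrases)
        total = len(phrases) or 1  # avoid division by zero for empty documents
        scores[doc_id] = {
            p: (c / total) * math.log(n_docs / df[p])
            for p, c in counts.items()
        }
    return scores


def split_specific_general(docs, threshold=0.0):
    """Hypothetical rule: a phrase is 'specific' if its best tf-idf exceeds threshold."""
    best = {}
    for per_doc in tfidf_scores(docs).values():
        for phrase, score in per_doc.items():
            best[phrase] = max(best.get(phrase, 0.0), score)
    specific = sorted(p for p, s in best.items() if s > threshold)
    general = sorted(p for p, s in best.items() if s <= threshold)
    return specific, general
```

With this rule, a phrase that occurs in every document has an inverse document frequency of zero and lands in the general group, while a phrase confined to a few documents scores above zero and is treated as specific.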
Primary functionality:
Primary inputs:
auto: pygetpapers-style directory (e.g. {paper_id}/fulltext.pdf, fulltext.html).
Primary outputs:
*_keywords.csv).
Main file types for transfer: .pdf, .html, .txt, .csv.
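As a rough illustration of the pygetpapers-style input layout, the sketch below walks a root directory and collects each paper's fulltext.pdf and fulltext.html. The helper name find_fulltexts is hypothetical and is not part of txt2phrases; it only shows the directory convention described above.

```python
from pathlib import Path


def find_fulltexts(root):
    """Map each paper directory name to the fulltext files found inside it."""
    found = {}
    for paper_dir in sorted(Path(root).iterdir()):
        if not paper_dir.is_dir():
            continue
        # Only the two file names from the layout above are considered.
        candidates = [paper_dir / name for name in ("fulltext.pdf", "fulltext.html")]
        present = [f for f in candidates if f.is_file()]
        if present:
            found[paper_dir.name] = present
    return found
```

Directories that contain neither file are skipped, so the result lists only papers that could be converted to plain text.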
Install the package with pip. Running pip install txt2phrases installs the latest txt2phrases release.
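To confirm the install from Python, the standard library's importlib.metadata can report the installed version. The helper name installed_version is an assumption for illustration; the only fact taken from the text above is the distribution name txt2phrases.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="txt2phrases"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None
```

A return value of None means pip has not installed the distribution into the current environment.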
GitHub Repository - txt2phrases