### Install Project Dependencies

Source: https://github.com/lenkazuma/pdfreader/blob/main/README.md

Clone the repository from GitHub, change into the project directory, and install the Python dependencies listed in `requirements.txt` with pip.

```Shell
git clone https://github.com/lenkazuma/PDFReader.git
```

```Shell
cd PDFReader
```

```Shell
pip install -r requirements.txt
```

--------------------------------

### Project Dependencies (requirements.txt)

Source: https://github.com/lenkazuma/pdfreader/blob/main/requirements.txt

Lists the Python packages required to run the project. Several are pinned to exact versions (for example `PyPDF2==3.0.1`), `altair` is capped below version 5, and the rest are unpinned. Install them with `pip install -r requirements.txt`.

```requirements.txt
langchain
PyPDF2==3.0.1
python-dotenv
streamlit
faiss-cpu==1.7.4
altair<5
openai==0.27.8
tiktoken==0.4.0
numpy==1.21.6
python-docx==0.8.11
pycryptodome==3.18.0
streamlit-extras
docx2txt
chromadb
pysqlite3-binary
qianfan
```

--------------------------------

### Initialize Langchain Components with OpenAI

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Initializes the OpenAI LLM instance and four Langchain chains: a summarization chain and a QA chain, each built with both the 'stuff' and 'map_reduce' chain types.

```Python
llm = OpenAI(temperature=0.7, model=st.session_state.model)
#llmchat = OpenAI(temperature=0.7, model_name='gpt-3.5-turbo')
chain = load_summarize_chain(llm, chain_type="stuff")
chain_large = load_summarize_chain(llm, chain_type="map_reduce")
chain_qa = load_qa_chain(llm, chain_type="stuff")
chain_large_qa = load_qa_chain(llm, chain_type="map_reduce")
```

--------------------------------

### Configure Streamlit Sidebar

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Sets up the sidebar in the Streamlit application: a title, an 'About' section written in Markdown, and a radio button for selecting the OpenAI model. The selection is stored in session state under the key `model`.

```Python
with st.sidebar:
    st.title('🤗💬 LLM PDFReader App')
    st.markdown("""
    ## About
    This app is an LLM-powered chatbot built using:
    - [Streamlit](https://streamlit.io/)
    - [Langchain](https://python.langchain.com/)
    - [OpenAI](https://platform.openai.com/docs/models) LLM model
    """)
    st.radio(
        "Model 👉",
        key="model",
        options=["text-ada-001", "text-davinci-002", "text-davinci-003"],
    )
    add_vertical_space(5)
```

--------------------------------

### Handle User Questions and Generate Responses

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Adds a text input for user questions, runs a similarity search against the knowledge base for the question, and generates a response with the 'stuff' QA chain, falling back to the 'map_reduce' QA chain if the first call raises an exception. Token usage from the OpenAI callback is printed to the console.

```Python
# User input for questions
user_question = st.text_input("Ask a question about your file:")
if user_question:
    docs = knowledge_base.similarity_search(user_question)
    with st.spinner('Wait for it...'):
        with get_openai_callback() as cb:
            try:
                response = chain_qa.run(input_documents=docs, question=user_question)
            except Exception as maxtoken_error:
                print(maxtoken_error)
                response = chain_large_qa.run(input_documents=docs, question=user_question)
            print(cb)
    # Show/hide section using st.beta_expander
    #with st.expander("Used Tokens", expanded=False):
        #st.write(cb)
```

--------------------------------

### Importing Libraries for PDF Reader (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Imports the libraries used for loading environment variables, handling PDF/DOCX files, text splitting, embeddings, vector stores, language models, QA/summarization chains, the Streamlit UI, and system operations.

```Python
from dotenv import load_dotenv
from langchain.vectorstores import Chroma
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.summarize import load_summarize_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback
from docx import Document
from docx.table import _Cell
from streamlit_extras.add_vertical_space import add_vertical_space
import sys
```

--------------------------------

### Defining Main Execution Entry Point (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Standard Python construct that ensures the `main()` function is called only when the script is executed directly, not when it is imported as a module.

```Python
if __name__ == '__main__':
    main()
```

--------------------------------

### Run Streamlit Application Locally

Source: https://github.com/lenkazuma/pdfreader/blob/main/README.md

Runs the application's main script through the Streamlit command-line interface, typically for local development or testing.

```Shell
streamlit run your-app-name.py
```

--------------------------------

### Create Embeddings and Knowledge Base

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Generates embeddings for the text chunks with `OpenAIEmbeddings` and builds a FAISS vector store from them, so that relevant document sections can be retrieved by similarity search.

```Python
# Create embeddings
embeddings = OpenAIEmbeddings(disallowed_special=())
knowledge_base = FAISS.from_texts(chunks, embeddings)
```

--------------------------------

### Generate and Display Document Summary

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Searches the knowledge base for chunks relevant to a summary prompt, generates a summary with the 'stuff' summarization chain (falling back to the 'map_reduce' chain if the first call raises an exception), caches the result in session state, and displays it in the Streamlit app.

```Python
st.header("Here's a brief summary of your file:")
pdf_summary = "Give me a concise summary, use the language that the file is in. "
docs = knowledge_base.similarity_search(pdf_summary)
if 'summary' not in st.session_state or st.session_state.summary is None:
    with st.spinner('Wait for it...'):
        with get_openai_callback() as scb:
            try:
                st.session_state.summary = chain.run(input_documents=docs, question=pdf_summary)
            except Exception as maxtoken_error:
                # Fall back to the map_reduce chain if the context length is exceeded
                print(maxtoken_error)
                st.session_state.summary = chain_large.run(input_documents=docs, question=pdf_summary)
            print(scb)
st.write(st.session_state.summary)
```

--------------------------------

### Load Environment Variables

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Loads environment variables from a `.env` file, which is typically where API keys and other configuration settings are kept out of the source code.

```Python
# Load environment variables
load_dotenv()
```

--------------------------------

### Initialize Streamlit Session State

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Initializes the Streamlit session state used to keep values across reruns, setting a default LLM model if none is present yet.

```Python
def main():
    if "model" not in st.session_state:
        st.session_state.model = "text-davinci-003"
```

--------------------------------

### Configure Streamlit Page Settings

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Sets the basic configuration for the Streamlit page: the page title shown in the browser tab and the main title shown on the page itself.

```Python
# Configure Streamlit page settings
st.set_page_config(page_title="PDFReader")
st.title("PDF & Word Reader ✨")
```

--------------------------------

### Implement File Uploader

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Adds a file uploader widget to the Streamlit app that accepts PDF and DOCX files.

```Python
# Upload file
uploaded_file = st.file_uploader("Upload your file", type=["pdf", "docx"])
```

--------------------------------

### Formatting Chat History for Display (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Defines a function that takes a list of question-answer tuples and formats them into a single string with 'Question:' and 'Answer:' labels, separating entries with a blank line for readability.

```Python
def format_chat_history(chat_history):
    formatted_history = ""
    for entry in chat_history:
        question, answer = entry
        # Added an extra '\n' for the blank line
        formatted_history += f"Question: {question}\nAnswer: {answer}\n\n"
    return formatted_history
```

--------------------------------

### Initialize File-Related Session State

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Initializes the `pdf_name` session state variable, intended to hold the uploaded file's name, so that it persists across reruns.

```Python
# Initialize session state
if 'pdf_name' not in st.session_state:
    st.session_state.pdf_name = None
```

--------------------------------

### Handling General Exceptions in Streamlit (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Catches any other `Exception` raised during processing and displays an error message that includes the exception details using `st.error`. The fragment is the final `except` clause of the file-processing `try` block.

```Python
except Exception as e:
    st.error(f"An error occurred: {str(e)}")
```

--------------------------------

### Process Uploaded File and Extract Text

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Handles the uploaded file: clears the cached summary from session state if a different file was uploaded, checks the MIME type (PDF or DOCX), and extracts the text content, including text inside DOCX tables.

```Python
# Extract the text
if uploaded_file is not None:
    file_type = uploaded_file.type
    # Clear summary if a new file is uploaded
    if 'summary' in st.session_state and st.session_state.file_name != uploaded_file.name:
        st.session_state.summary = None
    st.session_state.file_name = uploaded_file.name
    try:
        if file_type == "application/pdf":
            # Handle PDF files
            pdf_reader = PdfReader(uploaded_file)
            text = ""
            for page in pdf_reader.pages:
                text += page.extract_text()
        elif file_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
            # Handle Word documents
            doc = Document(uploaded_file)
            paragraphs = [p.text for p in doc.paragraphs]
            text = "\n".join(paragraphs)
            # Extract text from tables
            for table in doc.tables:
                table_text = extract_text_from_table(table)
                if table_text:
                    text += "\n" + table_text
        else:
            st.error("Unsupported file format. Please upload a PDF or DOCX file.")
            return
```

--------------------------------

### Handling Empty PDF Error in Streamlit (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Catches an `IndexError`, which typically indicates that no text was extracted from the PDF, and shows an informative error message to the user with `st.error`.

```Python
except IndexError:
    #st.caption("Well, Seems like your PDF doesn't contain any text, try another one.🆖")
    st.error("Please upload another PDF. It seems like this PDF doesn't contain any text.")
```

--------------------------------

### Clearing Streamlit Session History (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Defines a function that clears the chat history stored in the Streamlit session state. It checks whether the 'history' key exists and deletes it if present.

```Python
def clear_history():
    if "history" in st.session_state:
        del st.session_state["history"]
```

--------------------------------

### Split Text into Chunks

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Splits the extracted text into chunks of up to 1000 characters with a 200-character overlap using `RecursiveCharacterTextSplitter`. Chunking is needed because large documents exceed the LLM's context window.

```Python
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_text(text)
```

--------------------------------

### Extracting Text from DOCX Table (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Defines a function that iterates through the rows and cells of a `docx.table.Table` object and collects the text of each cell, one cell per line. It returns the combined text with leading and trailing whitespace stripped.

```Python
def extract_text_from_table(table):
    text = ""
    for row in table.rows:
        for cell in row.cells:
            if isinstance(cell, _Cell):
                text += cell.text + "\n"
    return text.strip()
```
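
The behavior of the table-extraction helper can be illustrated without a real `.docx` file. The following is a minimal sketch, not the project's code: `FakeCell`, `FakeRow`, `FakeTable`, and `extract_text_from_table_sketch` are hypothetical names, and the `isinstance(cell, _Cell)` check is omitted on the assumption that every cell exposes a `text` attribute, so that the stand-ins work without python-docx installed.

```Python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins that mimic only the attributes the helper reads.
@dataclass
class FakeCell:
    text: str


@dataclass
class FakeRow:
    cells: List[FakeCell] = field(default_factory=list)


@dataclass
class FakeTable:
    rows: List[FakeRow] = field(default_factory=list)


def extract_text_from_table_sketch(table):
    """Collect cell text row by row, one cell per line (assumed variant)."""
    lines = []
    for row in table.rows:
        for cell in row.cells:
            lines.append(cell.text)
    # Joining then stripping gives the same result as appending "\n" after
    # every cell and stripping the final string.
    return "\n".join(lines).strip()


table = FakeTable(rows=[
    FakeRow([FakeCell("Name"), FakeCell("Qty")]),
    FakeRow([FakeCell("Widget"), FakeCell("3")]),
])
table_text = extract_text_from_table_sketch(table)
print(table_text)
```

Because cells are flattened one per line, the row and column structure of the table is not preserved in the text handed to the splitter. An empty table yields an empty string, which is why the caller checks `if table_text:` before appending.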