### Install Project Dependencies

Source: https://github.com/lenkazuma/pdfreader/blob/main/README.md

Steps to clone the repository from GitHub, navigate into the project directory, and install the required Python dependencies listed in the requirements.txt file using pip.

```Shell
git clone https://github.com/lenkazuma/PDFReader.git
```

```Shell
cd PDFReader
```

```Shell
pip install -r requirements.txt
```

--------------------------------

### Project Dependencies (requirements.txt)

Source: https://github.com/lenkazuma/pdfreader/blob/main/requirements.txt

This list specifies the Python packages required to run the project. Some are pinned to exact versions (for example `PyPDF2==3.0.1`), `altair` is capped below version 5, and the rest are unpinned. Install them with `pip install -r requirements.txt`.

```requirements.txt
langchain
PyPDF2==3.0.1
python-dotenv
streamlit
faiss-cpu==1.7.4
altair<5
openai==0.27.8
tiktoken==0.4.0
numpy==1.21.6
python-docx==0.8.11
pycryptodome==3.18.0
streamlit-extras
docx2txt
chromadb
pysqlite3-binary
qianfan
```

--------------------------------

### Initialize Langchain Components with OpenAI

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Initializes the OpenAI Large Language Model instance and various Langchain chains (summarization and QA) using different chain types ('stuff' and 'map_reduce') for processing documents.

```Python
    llm = OpenAI(temperature=0.7, model=st.session_state.model)
    #llmchat = OpenAI(temperature=0.7, model_name='gpt-3.5-turbo')
    chain = load_summarize_chain(llm, chain_type="stuff")
    chain_large = load_summarize_chain(llm, chain_type="map_reduce")
    chain_qa = load_qa_chain(llm, chain_type="stuff")
    chain_large_qa = load_qa_chain(llm, chain_type="map_reduce")
```
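The two chain types differ in how documents reach the model: 'stuff' places every document into a single prompt, while 'map_reduce' handles each document separately and then combines the partial results. The sketch below only illustrates that control flow. `fake_llm` and both helper functions are hypothetical stand-ins, not LangChain's API or implementation.

```Python
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: returns the first 20 characters."""
    return prompt[:20]


def stuff_summarize(docs: List[str], llm: Callable[[str], str]) -> str:
    # "stuff": concatenate every document into one prompt, one model call.
    return llm("\n".join(docs))


def map_reduce_summarize(docs: List[str], llm: Callable[[str], str]) -> str:
    # "map": one model call per document.
    partials = [llm(doc) for doc in docs]
    # "reduce": one more call over the combined partial results.
    return llm("\n".join(partials))


calls: List[int] = []


def counting_llm(prompt: str) -> str:
    # Records the prompt length of each call so the two strategies can be compared.
    calls.append(len(prompt))
    return fake_llm(prompt)


docs = ["alpha " * 10, "beta " * 10]

stuff_summarize(docs, counting_llm)
stuff_calls = len(calls)        # a single call that carries the full text

calls.clear()
map_reduce_summarize(docs, counting_llm)
map_reduce_calls = len(calls)   # one call per document plus one to combine
```

The trade-off this suggests: 'stuff' uses fewer calls but its prompt grows with the input, whereas 'map_reduce' keeps each prompt small at the cost of extra calls.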

--------------------------------

### Configure Streamlit Sidebar

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Sets up the sidebar in the Streamlit application, adding a title, an 'About' section with markdown text, and a radio button for selecting the OpenAI LLM model.

```Python
    with st.sidebar:
        st.title('🤗💬 LLM PDFReader App')
        st.markdown("""
        ## About
        This app is an LLM-powered chatbot built using:
        - [Streamlit](https://streamlit.io/)
        - [Langchain](https://python.langchain.com/)
        - [OpenAI](https://platform.openai.com/docs/models) LLM model        
        """)
        st.radio(
        "Model 👉",
        key="model",
        options=["text-ada-001", "text-davinci-002", "text-davinci-003"],
        )
        add_vertical_space(5)
```

--------------------------------

### Handle User Questions and Generate Responses

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

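Adds a text input for user questions, runs a similarity search against the knowledge base, and generates a response with the 'stuff' QA chain. If that call raises an exception (for example because the prompt exceeds the model's context window), it retries with the 'map_reduce' QA chain. This excerpt ends before the response is displayed.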

```Python
            # User input for questions
            user_question = st.text_input("Ask a question about your file:")
            if user_question:
                docs = knowledge_base.similarity_search(user_question)
                with st.spinner('Wait for it...'):
                  with get_openai_callback() as cb:
                    try:
                        response = chain_qa.run(input_documents=docs, question=user_question)
                    except Exception as maxtoken_error:
                        print(maxtoken_error)
                        response = chain_large_qa.run(input_documents=docs, question=user_question) 
                    print(cb)
                    # Optional: show token usage in a collapsible st.expander section
                    #with st.expander("Used Tokens", expanded=False):
                       #st.write(cb)
```
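The try/except fallback used above can be factored into a small helper. This is a minimal sketch under stated assumptions: `run_with_fallback` and `too_long` are hypothetical names, and the callables stand in for the `chain_qa.run(...)` and `chain_large_qa.run(...)` calls.

```Python
from typing import Callable


def run_with_fallback(primary: Callable[[], str], fallback: Callable[[], str]) -> str:
    """Return primary(); if it raises, log the error and return fallback()."""
    try:
        return primary()
    except Exception as error:  # broad on purpose, mirroring the snippet above
        print(f"primary chain failed: {error}")
        return fallback()


def too_long() -> str:
    # Hypothetical stand-in for a chain call that exceeds the context window.
    raise ValueError("maximum context length exceeded")


answer = run_with_fallback(too_long, lambda: "answer from fallback chain")
```

Catching a narrower exception type would avoid retrying on unrelated failures such as network errors. Which type to catch depends on the installed client library, so it is left broad here.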

--------------------------------

### Importing Libraries for PDF Reader (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Imports necessary libraries for loading environment variables, handling PDF/DOCX files, text splitting, embeddings, vector stores, language models, QA/summarization chains, Streamlit UI, and system operations.

```Python
from dotenv import load_dotenv
from langchain.vectorstores import Chroma
import streamlit as st
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.summarize import load_summarize_chain
from langchain.llms import OpenAI
from langchain.callbacks import get_openai_callback
from docx import Document
from docx.table import _Cell
from streamlit_extras.add_vertical_space import add_vertical_space
import sys
```

--------------------------------

### Defining Main Execution Entry Point (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Standard Python construct that ensures the `main()` function is called only when the script is executed directly, not when imported as a module.

```Python
if __name__ == '__main__':
    main()
```

--------------------------------

### Run Streamlit Application Locally

Source: https://github.com/lenkazuma/pdfreader/blob/main/README.md

Command to execute the Streamlit application's main script using the Streamlit command-line interface, typically used for local development or testing.

```Shell
streamlit run your-app-name.py
```

--------------------------------

### Create Embeddings and Knowledge Base

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Generates embeddings for the text chunks using OpenAIEmbeddings and creates a FAISS knowledge base from these embeddings, enabling efficient similarity search for relevant document sections.

```Python
            # Create embeddings
            embeddings = OpenAIEmbeddings(disallowed_special=())
            knowledge_base = FAISS.from_texts(chunks, embeddings)
```
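To make the idea of similarity search concrete, here is a toy brute-force version in plain Python. It is an illustration only: the three-dimensional vectors are made up rather than produced by `OpenAIEmbeddings`, cosine similarity is chosen for simplicity (a FAISS index may use a different metric), and `cosine` and `top_k` are hypothetical helpers.

```Python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          index: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the texts of the k entries most similar to the query vector."""
    scored = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]


# Made-up vectors standing in for chunk embeddings.
index = [
    ("chunk about invoices", [0.9, 0.1, 0.0]),
    ("chunk about shipping", [0.1, 0.9, 0.0]),
    ("chunk about refunds", [0.7, 0.2, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

In the application, `knowledge_base.similarity_search(query)` plays the role of `top_k`, with the query embedded by the same model as the chunks.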

--------------------------------

### Generate and Display Document Summary

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Searches the knowledge base for chunks relevant to a summary prompt, generates a summary with the 'stuff' summarization chain (falling back to the 'map_reduce' chain if that call raises an exception), caches the result in session state, and displays it in the Streamlit app.

```Python
            st.header("Here's a brief summary of your file:")
            pdf_summary = "Give me a concise summary, use the language that the file is in. "

            docs = knowledge_base.similarity_search(pdf_summary)
            
            
            if 'summary' not in st.session_state or st.session_state.summary is None:
              with st.spinner('Wait for it...'):
                    with get_openai_callback() as scb:
                        try:
                            st.session_state.summary = chain.run(input_documents=docs, question=pdf_summary)    
                        except Exception as maxtoken_error:
                            # Fall back to the map_reduce chain if the 'stuff' call fails (e.g. context length exceeded)
                            print(maxtoken_error)
                            st.session_state.summary = chain_large.run(input_documents=docs, question=pdf_summary)
                        print(scb)    
                            
            st.write(st.session_state.summary)
```

--------------------------------

### Load Environment Variables

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Loads environment variables from a .env file, typically used to securely store API keys or other configuration settings required by the application.

```Python
    # Load environment variables
    load_dotenv()
```
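Once the environment is loaded, it can help to fail early when a required setting is missing. The helper below is a sketch using only the standard library. `require_env` and the variable name `DEMO_SETTING` are hypothetical and do not come from the project.

```Python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value


# Simulate a variable that load_dotenv() would have read from a .env file.
os.environ["DEMO_SETTING"] = "value"
demo = require_env("DEMO_SETTING")
```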

--------------------------------

### Initialize Streamlit Session State

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Initializes the Streamlit session state variables used to maintain state across user interactions, specifically setting a default LLM model if not already present.

```Python
def main():
    if "model" not in st.session_state:
        st.session_state.model = "text-davinci-003"
```
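The "set a default only if the key is absent" pattern can be generalized to several keys at once. In this sketch a plain dict stands in for `st.session_state`, which supports the same `in` check and item assignment. `DEFAULTS` and `init_defaults` are hypothetical names.

```Python
from typing import Any, Dict, MutableMapping

# Hypothetical defaults; the keys mirror ones used elsewhere in the app.
DEFAULTS: Dict[str, Any] = {
    "model": "text-davinci-003",
    "pdf_name": None,
    "summary": None,
}


def init_defaults(state: MutableMapping[str, Any], defaults: Dict[str, Any]) -> None:
    """Set each default only when the key is missing, preserving existing values."""
    for key, value in defaults.items():
        if key not in state:
            state[key] = value


# Simulates a rerun where the user has already picked a model.
state: Dict[str, Any] = {"model": "text-ada-001"}
init_defaults(state, DEFAULTS)
```

Because existing keys are left alone, the user's selection survives each rerun while missing keys receive their defaults.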

--------------------------------

### Configure Streamlit Page Settings

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Sets the basic configuration for the Streamlit application page, including the page title displayed in the browser tab and the main title displayed on the page itself.

```Python
# Configure Streamlit page settings
st.set_page_config(page_title="PDFReader")
st.title("PDF & Word Reader ✨")
```

--------------------------------

### Implement File Uploader

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Adds a file uploader widget to the Streamlit app, allowing users to upload files of specified types (PDF and DOCX).

```Python
    # Upload file
    uploaded_file  = st.file_uploader("Upload your file", type=["pdf", "docx"])
```

--------------------------------

### Formatting Chat History for Display (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Defines a function that takes a list of question-answer tuples and formats them into a single string with 'Question:' and 'Answer:' labels, adding blank lines between entries for readability.

```Python
def format_chat_history(chat_history):
    formatted_history = ""
    for entry in chat_history:
        question, answer = entry
        # Added an extra '\n' for the blank line
        formatted_history += f"Question: {question}\nAnswer: {answer}\n\n"
    return formatted_history
```
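A short usage example shows the output shape. The function is repeated so the block runs on its own, and the sample history is made up for illustration.

```Python
def format_chat_history(chat_history):
    # Same logic as the function above, repeated so the example is self-contained.
    formatted_history = ""
    for question, answer in chat_history:
        formatted_history += f"Question: {question}\nAnswer: {answer}\n\n"
    return formatted_history


history = [
    ("What is the file about?", "A sales report."),
    ("Which year?", "2022."),
]
formatted = format_chat_history(history)
print(formatted)
```

Each entry ends with a blank line, so the returned string also ends with one. Call `.rstrip()` on the result if trailing whitespace is unwanted.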

--------------------------------

### Initialize File-Related Session State

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Initializes session state variables to keep track of the uploaded file's name, ensuring state persistence across reruns.

```Python
    # Initialize session state
    if 'pdf_name' not in st.session_state:
        st.session_state.pdf_name = None
```

--------------------------------

### Handling General Exceptions in Streamlit (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Catches any other `Exception` raised during processing and uses `st.error` to display an error message that includes the exception details.

```Python
except Exception as e:
    st.error(f"An error occurred: {str(e)}")
```

--------------------------------

### Process Uploaded File and Extract Text

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Handles the uploaded file, checks its type (PDF or DOCX), extracts text content, and clears the summary from session state if a new file is uploaded.

```Python
    # Extract the text
    if uploaded_file  is not None :
        file_type = uploaded_file.type

        # Clear summary if a new file is uploaded
        if 'summary' in st.session_state and st.session_state.file_name != uploaded_file.name:
            st.session_state.summary = None

        st.session_state.file_name = uploaded_file.name

        try:
            if file_type == "application/pdf":
                # Handle PDF files
                pdf_reader = PdfReader(uploaded_file)
                text = ""
                for page in pdf_reader.pages:
                    text += page.extract_text()

            elif file_type == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
                # Handle Word documents
                doc = Document(uploaded_file)
                paragraphs = [p.text for p in doc.paragraphs]
                text = "\n".join(paragraphs)

                # Extract text from tables
                for table in doc.tables:
                    table_text = extract_text_from_table(table)
                    if table_text:
                        text += "\n" + table_text

            else:
                st.error("Unsupported file format. Please upload a PDF or DOCX file.")
                return
```
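The branching above keys off the MIME type reported for the upload. One way to keep that check in a single place is a lookup table, sketched here with hypothetical names (`PDF_MIME`, `DOCX_MIME`, `classify_upload`). The MIME strings themselves are the ones used in the snippet.

```Python
PDF_MIME = "application/pdf"
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

# Maps each supported MIME type to a short label for the extraction path.
SUPPORTED_TYPES = {PDF_MIME: "pdf", DOCX_MIME: "docx"}


def classify_upload(mime_type: str) -> str:
    """Return 'pdf' or 'docx' for supported MIME types, else raise ValueError."""
    try:
        return SUPPORTED_TYPES[mime_type]
    except KeyError:
        raise ValueError(
            "Unsupported file format. Please upload a PDF or DOCX file."
        ) from None


kind = classify_upload(PDF_MIME)
```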

--------------------------------

### Handling Empty PDF Error in Streamlit (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Catches an `IndexError`, typically indicating that no text was extracted from the PDF, and displays an informative error message to the user using `st.error`.

```Python
except IndexError:
    #st.caption("Well, Seems like your PDF doesn't contain any text, try another one.🆖")
    st.error("Please upload another PDF. It seems like this PDF doesn't contain any text.")
```

--------------------------------

### Clearing Streamlit Session History (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Defines a function to clear the chat history stored in the Streamlit session state. It checks if the 'history' key exists and deletes it if present.

```Python
def clear_history():
    if "history" in st.session_state:
        del st.session_state["history"]
```

--------------------------------

### Split Text into Chunks

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Splits the extracted text into smaller, manageable chunks using a RecursiveCharacterTextSplitter, which is necessary for processing large documents with LLMs due to context window limitations.

```Python
            # Split text into chunks
            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=1000,
                chunk_overlap=200,
                length_function=len
            )
            chunks = text_splitter.split_text(text)
```
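To show what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses, since that class also tries to break on separators such as paragraphs and sentences. `sliding_chunks` is a hypothetical helper for illustration only.

```Python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, which helps keep a sentence that straddles a boundary retrievable from at least one chunk.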

--------------------------------

### Extracting Text from DOCX Table (Python)

Source: https://github.com/lenkazuma/pdfreader/blob/main/backupWorkVersion.txt

Defines a function that iterates over the rows and cells of a python-docx table (`docx.table.Table`), appends the text of each `_Cell` followed by a newline, and returns the combined text with leading and trailing whitespace stripped.

```Python
def extract_text_from_table(table):
    text = ""
    for row in table.rows:
        for cell in row.cells:
            if isinstance(cell, _Cell):
                text += cell.text + "\n"
    return text.strip()
```
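The output shape is easiest to see on a small table. The sketch below uses `SimpleNamespace` objects as stand-ins for python-docx tables, rows, and cells so that it runs without a `.docx` file. It therefore omits the `isinstance(cell, _Cell)` check, and `extract_text_duck_typed` is a hypothetical variant rather than the project's function.

```Python
from types import SimpleNamespace


def make_row(*texts: str) -> SimpleNamespace:
    # Stand-in for a table row: an object with a .cells list of objects with .text.
    return SimpleNamespace(cells=[SimpleNamespace(text=t) for t in texts])


def extract_text_duck_typed(table) -> str:
    """Collect cell text in row-major order, one cell per line."""
    lines = [cell.text for row in table.rows for cell in row.cells]
    return "\n".join(lines).strip()


table = SimpleNamespace(rows=[make_row("Name", "Qty"), make_row("Widget", "3")])
extracted = extract_text_duck_typed(table)
```

Cells are emitted one per line in reading order, so column structure is not preserved in the resulting text.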
