### Server Startup Output

Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md

Example output indicating the server has started successfully.

```text
[2025-07-26 16:09:07] INFO: Started server process [80269]
[2025-07-26 16:09:07] INFO: Waiting for application startup.
[2025-07-26 16:09:07] INFO: Application startup complete.
[2025-07-26 16:09:07] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[2025-07-26 16:09:08] INFO: 127.0.0.1:57722 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-26 16:09:11] INFO: 127.0.0.1:57732 - "POST /generate HTTP/1.1" 200 OK
[2025-07-26 16:09:11] The server is fired up and ready to roll!
```

--------------------------------

### Install SGLang

Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md

Install the SGLang package on the server.

```bash
pip install sglang
```

--------------------------------

### Install and Run Claude Code Router with GLM Backend

Source: https://context7.com/zai-org/glm-4.5/llms.txt

These shell commands install Claude Code and its router, point the router at a GLM backend, and start a session. Save the router configuration file to `~/.claude-code-router/config.json` before restarting the router.

```shell
# Install Claude Code and Router
npm install -g @anthropic-ai/claude-code
npm install -g @musistudio/claude-code-router

# Save config to ~/.claude-code-router/config.json

# Restart the router
ccr restart

# Start Claude Code with GLM backend
ccr code

# Example session:
# > how can I improve this function's performance?
# ⏺ I'll analyze the function and suggest optimizations...
```

--------------------------------

### Install Claude Code and Router

Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md

Install the required CLI tools globally using npm.
```bash
npm install -g @anthropic-ai/claude-code
npm install -g @musistudio/claude-code-router
```

--------------------------------

### Launch GLM-4.5 Server

Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md

Start the model service with the specified configuration parameters.

```bash
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.5 \
  --tp-size 16 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.7 \
  --served-model-name glm-4.5 \
  --port 8000 \
  --host 0.0.0.0 # Or your server's internal/public IP address
```

--------------------------------

### Hugging Face Login and Install Prerequisites

Source: https://github.com/zai-org/glm-4.5/blob/main/example/AMD_GPU/README.md

Log in to Hugging Face and install the Python packages and dependencies that vLLM needs on AMD GPUs. Run the commands from the directories shown and set the `PYTORCH_ROCM_ARCH` environment variable before building.

```shell
huggingface-cli login
```

```shell
pip uninstall vllm
pip install --upgrade pip
cd GLM-4.5/example/AMD_GPU/
pip install -r rocm-requirements.txt
git clone https://github.com/vllm-project/vllm.git
cd vllm
export PYTORCH_ROCM_ARCH="gfx942"
python3 setup.py develop
```

--------------------------------

### Example Usage of Model Response Parser

Source: https://context7.com/zai-org/glm-4.5/llms.txt

This example runs the `parse_model_response` function on a sample raw model output. It prints the extracted reasoning, the content, and the number of tool calls found.

```python
# The sample below uses GLM's tag format: <think> for reasoning and
# <tool_call> with <arg_key>/<arg_value> pairs for tool arguments.
raw_response = """
<think>The user wants to calculate the 1000th Fibonacci number. I should use the Python interpreter.</think>
I'll calculate the 1000th Fibonacci number for you.
<tool_call>python
<arg_key>code</arg_key>
<arg_value>def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    result = a</arg_value>
</tool_call>
"""

parsed = parse_model_response(raw_response, [])
print(f"Reasoning: {parsed.get('reasoning_content', 'None')[:50]}...")
print(f"Content: {parsed.get('content', 'None')}")
print(f"Tool calls: {len(parsed.get('tool_calls', []))}")
```

--------------------------------

### Deploy GLM on AMD GPUs with ROCm-vLLM

Source: https://context7.com/zai-org/glm-4.5/llms.txt

This sequence of shell commands deploys GLM models on AMD MI300X GPUs using a ROCm-enabled vLLM container. It launches the Docker container, installs dependencies, builds vLLM from source for ROCm, and starts the vLLM server.

```shell
# Step 1: Launch ROCm-vLLM Docker container
docker run -it --rm \
  --cap-add=SYS_PTRACE \
  -e SHELL=/bin/bash \
  --network=host \
  --security-opt seccomp=unconfined \
  --device=/dev/kfd \
  --device=/dev/dri \
  -v /:/workspace \
  --group-add video \
  --ipc=host \
  --name vllm_GLM \
  rocm/vllm-dev:nightly

# Step 2: Inside the container, install dependencies
huggingface-cli login
pip uninstall vllm
pip install --upgrade pip
pip install -r rocm-requirements.txt

# Build vLLM from source for ROCm
git clone https://github.com/vllm-project/vllm.git
cd vllm
export PYTORCH_ROCM_ARCH="gfx942"
python3 setup.py develop

# Step 3: Start vLLM server
VLLM_USE_V1=1 vllm serve zai-org/GLM-4.5 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --disable-log-requests \
  --no-enable-prefix-caching \
  --trust-remote-code
```

--------------------------------

### Run vLLM Online Serving for GLM-4.5

Source: https://github.com/zai-org/glm-4.5/blob/main/example/AMD_GPU/README.md

Command to start vLLM online serving for the zai-org/GLM-4.5 model on AMD GPUs. Adjust `--tensor-parallel-size` and `--gpu-memory-utilization` as needed.
```shell
VLLM_USE_V1=1 vllm serve zai-org/GLM-4.5 --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --disable-log-requests --no-enable-prefix-caching --trust-remote-code
```

--------------------------------

### SGLang PD-Disaggregation Prefill Server

Source: https://github.com/zai-org/glm-4.5/blob/main/README.md

Starts an SGLang server for the GLM-4.5-Air model in prefill disaggregation mode. Requires specifying the model path, disaggregation mode, IB device, and tensor parallel size.

```shell
python -m sglang.launch_server --model-path zai-org/GLM-4.5-Air --disaggregation-mode prefill --disaggregation-ib-device mlx5_0 --tp-size 4
```

--------------------------------

### SGLang PD-Disaggregation Router

Source: https://github.com/zai-org/glm-4.5/blob/main/README.md

Starts the SGLang router for PD-Disaggregation, connecting to the prefill and decode servers. This command configures the router to listen on all interfaces and a specified port.

```shell
python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
```

--------------------------------

### Claude Code Interface Output

Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md

Example terminal output when running the Claude Code command.

```text
zr@MacBook GLM-4.5 % ccr code
Service not running, starting service...
╭───────────────────────────────────────────────────╮
│ ✻ Welcome to Claude Code!                         │
│                                                   │
│   /help for help, /status for your current setup  │
│                                                   │
│   cwd: /Users/zr/Code/GLM-4.5                     │
│                                                   │
│   ─────────────────────────────────────────────── │
│                                                   │
│   Overrides (via env):                            │
│                                                   │
│   • API timeout: 600000ms                         │
│   • API Base URL: http://127.0.0.1:3456           │
╰───────────────────────────────────────────────────╯

 ※ Tip: Press Esc twice to edit your previous messages
```

--------------------------------

### Enable Preserved Thinking for Agentic Tasks

Source: https://context7.com/zai-org/glm-4.5/llms.txt

Use this Python code to enable preserved thinking mode for agentic tasks with GLM-4.7. This retains all thinking blocks across turns for complex, long-horizon tasks. Ensure the server is started with the appropriate chat template kwargs.

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1")

# For agentic tasks with GLM-4.7, enable preserved thinking
# This retains all thinking blocks across turns for complex, long-horizon tasks

# When starting the server with SGLang, add these chat template kwargs:
# python3 -m sglang.launch_server \
#   --model-path zai-org/GLM-4.7 \
#   ... other args ...

# Then in your requests, enable preserved thinking:
completion = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {"role": "system", "content": "You are a coding agent working on a complex refactoring task."},
        {"role": "user", "content": "Analyze this codebase and create a plan to improve its architecture."},
    ],
    max_tokens=8192,
    temperature=0.7,
    extra_body={
        "chat_template_kwargs": {
            "enable_thinking": True,  # Enable thinking mode
            "clear_thinking": False  # Preserve thinking across turns
        }
    }
)

# The model will maintain reasoning context across multiple turns,
# reducing information loss and improving consistency in complex tasks
print(completion.choices[0].message.content)
```

--------------------------------

### Implement Tool Logic

Source: https://github.com/zai-org/glm-4.5/blob/main/resources/glm_4.6_tir_guide.md

Provide the actual execution logic for the tools and map them to their names.

```python
def python_tool(code: str) -> str:
    # Your implementation of the python tool
    return "Result of python execution."

def search_tool(query: str, num: int = 10) -> str:
    # Your implementation of the search tool
    return "Search results for query."

def open_tool(id) -> str:
    # Your implementation of the open tool
    return "Opened result."

def find_tool(pattern: str) -> str:
    # Your implementation of the find tool
    return "Found pattern."

# Keys must match the tool names declared in the schemas sent to the model
tool_map = {
    "python": python_tool,
    "browser.search": search_tool,
    "browser.open": open_tool,
    "browser.find": find_tool,
}
```

--------------------------------

### Configure PD-Disaggregation with SGLang

Source: https://context7.com/zai-org/glm-4.5/llms.txt

Split prefill and decode phases across separate GPU sets for improved throughput.
```shell
# Start prefill server (uses GPUs 0-3)
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.5-Air \
  --disaggregation-mode prefill \
  --disaggregation-ib-device mlx5_0 \
  --tp-size 4

# Start decode server (uses GPUs 4-7)
python -m sglang.launch_server \
  --model-path zai-org/GLM-4.5-Air \
  --disaggregation-mode decode \
  --port 30001 \
  --disaggregation-ib-device mlx5_0 \
  --tp-size 4 \
  --base-gpu-id 4

# Start router to manage prefill/decode traffic
python -m sglang_router.launch_router \
  --pd-disaggregation \
  --prefill http://127.0.0.1:30000 \
  --decode http://127.0.0.1:30001 \
  --host 0.0.0.0 \
  --port 8000
```

--------------------------------

### Launch GLM-4.6 with SGLang

Source: https://github.com/zai-org/glm-4.5/blob/main/resources/glm_4.6_tir_guide.md

Configure the inference engine with specific parsers for reasoning and tool calls.

```bash
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.6 \
  --tp-size 8 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.7 \
  --served-model-name glm-4.6 \
  --host 0.0.0.0 \
  --port 8000
```

--------------------------------

### Run Claude Code

Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md

Command to initiate the Claude Code interface.

```bash
ccr code
```

--------------------------------

### Serve Models with vLLM

Source: https://context7.com/zai-org/glm-4.5/llms.txt

Deploy GLM models as an OpenAI-compatible API server using vLLM with speculative decoding.
```shell
# Start vLLM server with GLM-4.7 FP8 model
vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-fp8

# For GLM-4.5 with full BF16 precision
vllm serve zai-org/GLM-4.5 \
  --tensor-parallel-size 16 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.5
```

--------------------------------

### Basic Chat Completion with Thinking Enabled

Source: https://context7.com/zai-org/glm-4.5/llms.txt

Use this for general chat completions where the model can 'think' before answering to give more detailed responses. Set the request parameters `model`, `messages`, `max_tokens`, and `temperature` to match your deployment.

```python
# `client` is the OpenAI client connected to the local GLM server
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Explain how quicksort works and provide Python code"},
]

completion = client.chat.completions.create(
    model="glm-4.7-fp8",
    messages=messages,
    max_tokens=4096,
    temperature=0.7,
)
print(completion.choices[0].message.content)
```

--------------------------------

### Run Basic Inference with Transformers

Source: https://context7.com/zai-org/glm-4.5/llms.txt

Load and run GLM models using Hugging Face Transformers. Requires the `torch` and `transformers` libraries.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "zai-org/GLM-4.7"  # or "zai-org/GLM-4.5", "zai-org/GLM-4.6"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Prepare messages using chat template
messages = [{"role": "user", "content": "Write a Python function to calculate factorial"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate response and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
print(output_text)
# Output: def factorial(n):
#     if n < 0:
#         raise ValueError("Factorial is not defined for negative numbers")
#     if n == 0 or n == 1:
#         return 1
#     return n * factorial(n - 1)
```

--------------------------------

### Serve Models with SGLang

Source: https://context7.com/zai-org/glm-4.5/llms.txt

Deploy GLM models using SGLang with EAGLE speculative decoding for high-performance inference.

```shell
# Start SGLang server with GLM-4.7 FP8 model
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-FP8 \
  --tp-size 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.8 \
  --served-model-name glm-4.7-fp8 \
  --host 0.0.0.0 \
  --port 8000

# Server output when ready:
# [INFO] The server is fired up and ready to roll!
# [INFO] Uvicorn running on http://0.0.0.0:8000
```

--------------------------------

### vLLM Compilation Output

Source: https://github.com/zai-org/glm-4.5/blob/main/example/AMD_GPU/README.md

This output shows the compilation process for vLLM, including HIP compiler detection and extension building. It confirms successful configuration and generation of build files.

```shell
-- Found Torch: /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch.so
-- The HIP compiler identification is Clang 19.0.0
-- Detecting HIP compiler ABI info
-- Detecting HIP compiler ABI info - done
-- Check for working HIP compiler: /opt/rocm/lib/llvm/bin/clang++ - skipped
-- Detecting HIP compile features
-- Detecting HIP compile features - done
-- HIP supported arches: gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201
-- FetchContent base directory: /app/GLM-4.5/vllm/.deps
-- Enabling C extension.
-- Enabling moe extension.
-- Configuring done (19.7s)
-- Generating done (0.0s)
-- Build files have been written to: /app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312
[1/35] Running hipify on _C extension source files.
/app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/selective_scan.h -> /app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/selective_scan_hip.h [ok]
/app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/static_switch.h -> /app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/static_switch.h [skipped, no changes]
/app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/selective_scan_fwd.cu -> /app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/selective_scan_fwd.hip [ok]
/app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/cuda_utils.h -> /app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/hip_utils.h [ok]
...
...
...
Using /usr/local/lib/python3.12/dist-packages
Finished processing dependencies for vllm==0.10.1.dev343+g54de71d0d.rocm641
```

--------------------------------

### vLLM Inference Server Configuration

Source: https://github.com/zai-org/glm-4.5/blob/main/README.md

Use this command to serve the GLM-4.7-FP8 model with vLLM, with tensor parallelism, MTP speculative decoding, and the GLM tool-call and reasoning parsers enabled.

```shell
vllm serve zai-org/GLM-4.7-FP8 \
  --tensor-parallel-size 4 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.7-fp8
```

--------------------------------

### Make OpenAI-Compatible API Requests

Source: https://context7.com/zai-org/glm-4.5/llms.txt

Connect to a locally deployed GLM server using the OpenAI Python client.

```python
from openai import OpenAI
import json

# Connect to local GLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8000/v1",
)
```

--------------------------------

### Launch ROCm-vLLM Docker Container

Source: https://github.com/zai-org/glm-4.5/blob/main/example/AMD_GPU/README.md

Use this command to launch the ROCm-vLLM development container with the necessary device mappings and volume mounts.

```shell
docker run -it --rm \
  --cap-add=SYS_PTRACE \
  -e SHELL=/bin/bash \
  --network=host \
  --security-opt seccomp=unconfined \
  --device=/dev/kfd \
  --device=/dev/dri \
  -v /:/workspace \
  --group-add video \
  --ipc=host \
  --name vllm_GLM \
  rocm/vllm-dev:nightly
```

--------------------------------

### Make GLM-4.6 Model Request with Tool Calls

Source: https://github.com/zai-org/glm-4.5/blob/main/resources/glm_4.6_tir_guide.md

This code sends a request to the chat completions API of a served GLM-4.6 model and handles tool calls in a loop until the model returns a final response. Ensure `tools` and `tool_map` are defined before execution, and replace the `api_base` placeholder with your server address.
```python
import requests
import json

api_base = "http://ip:port/v1"

messages = [
    {"role": "user", "content": "Calculate the 1000th term of the Fibonacci sequence."},
]

finish_reason = None
while finish_reason is None or finish_reason == "tool_calls":
    request_data = {
        "model": "glm-4.6",
        "messages": messages,
        "max_tokens": 2048,
        "temperature": 1.0,
        "stop": ["<|user|>", "<|endoftext|>", "<|observation|>", "<|assistant|>"],
        "tools": tools,
        "stream": False,
    }
    response = requests.post(
        f"{api_base}/chat/completions",
        headers={"Content-Type": "application/json"},
        json=request_data
    )
    response.raise_for_status()
    response_json = response.json()
    choice = response_json["choices"][0]
    finish_reason = choice["finish_reason"]

    if finish_reason == "tool_calls":
        # Keep the assistant turn, then answer each tool call it requested
        messages.append(choice["message"])
        tool_calls = choice["message"]["tool_calls"]
        for tool_call in tool_calls:
            tool_name = tool_call["function"]["name"]
            tool_arguments = json.loads(tool_call["function"]["arguments"])
            tool_call_id = tool_call["id"]
            tool_output = tool_map[tool_name](**tool_arguments)
            messages.append({
                "role": "tool",
                "name": tool_name,
                "content": tool_output,
                "tool_call_id": tool_call_id,
            })

print(choice["message"]["content"])
```

--------------------------------

### SGLang Inference Server Configuration

Source: https://github.com/zai-org/glm-4.5/blob/main/README.md

Launch the SGLang server for GLM-4.7-FP8 with specified tensor parallelism, speculative decoding parameters, and host/port settings. Includes configurations for tool and reasoning parsers.
```shell
python3 -m sglang.launch_server \
  --model-path zai-org/GLM-4.7-FP8 \
  --tp-size 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.8 \
  --served-model-name glm-4.7-fp8 \
  --host 0.0.0.0 \
  --port 8000
```

--------------------------------

### Perform Web Search

Source: https://context7.com/zai-org/glm-4.5/llms.txt

A placeholder function for performing web searches. It returns a JSON-formatted string with dummy results. Replace the implementation with actual search logic.

```python
import json

def search_tool(query: str, num: int = 10) -> str:
    # Your search implementation; `num` is unused in this placeholder
    return json.dumps({"results": [{"title": f"Result for {query}", "url": "https://example.com"}]})
```

--------------------------------

### Define Tool Schemas

Source: https://github.com/zai-org/glm-4.5/blob/main/resources/glm_4.6_tir_guide.md

Define the function signatures for built-in tools to be provided to the model.
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Interpreter for executing python code",
            "parameters": {
                "type": "object",
                "properties": {"code": {"description": "Code to execute", "type": "string"}},
            },
        }
    },
    {
        "type": "function",
        "function": {
            "name": "browser.search",
            "description": "Search in browser",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"description": "Search query", "type": "string"},
                    "num": {"description": "Number of results to return", "type": "integer", "default": 10},
                },
                "required": ["query"],
            },
        }
    },
    {
        "type": "function",
        "function": {
            "name": "browser.open",
            "description": "Open browser link",
            "parameters": {
                "type": "object",
                "properties": {
                    "id": {"description": "ID or URL of the link to open", "type": ["integer", "string"]},
                },
                "required": ["id"],
            },
        }
    },
    {
        "type": "function",
        "function": {
            "name": "browser.find",
            "description": "Find pattern in the opened browser content",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {"description": "Pattern to find", "type": "string"},
                },
                "required": ["pattern"],
            },
        }
    },
]
```

--------------------------------

### Agentic Loop for Tool Use

Source: https://context7.com/zai-org/glm-4.5/llms.txt

This code runs an agentic loop that keeps calling the model and processing its responses, including tool calls. It appends each tool output to the message history so the model can continue the conversation.
```python
tool_map = {"python": python_tool, "browser.search": search_tool}

messages = [
    {"role": "user", "content": "Calculate the 1000th term of the Fibonacci sequence."},
]

# Agentic loop - continue until model stops calling tools
finish_reason = None
while finish_reason is None or finish_reason == "tool_calls":
    response = requests.post(
        f"{api_base}/chat/completions",
        headers={"Content-Type": "application/json"},
        json={
            "model": "glm-4.6",
            "messages": messages,
            "max_tokens": 2048,
            "temperature": 1.0,
            "stop": ["<|user|>", "<|endoftext|>", "<|observation|>", "<|assistant|>"],
            "tools": tools,
            "stream": False,
        }
    )
    response.raise_for_status()
    choice = response.json()["choices"][0]
    finish_reason = choice["finish_reason"]

    if finish_reason == "tool_calls":
        messages.append(choice["message"])
        for tool_call in choice["message"]["tool_calls"]:
            tool_name = tool_call["function"]["name"]
            tool_args = json.loads(tool_call["function"]["arguments"])
            tool_output = tool_map[tool_name](**tool_args)
            messages.append({
                "role": "tool",
                "name": tool_name,
                "content": tool_output,
                "tool_call_id": tool_call["id"],
            })

print(choice["message"]["content"])
```

--------------------------------

### Tool Calling with Function Definitions

Source: https://context7.com/zai-org/glm-4.5/llms.txt

Enable GLM to call external functions by providing tool definitions in OpenAI-compatible format. Each definition gives the tool's name, description, and parameters. The model then decides when to call these tools based on the user's request.

```python
from openai import OpenAI
import json

client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1")

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current temperature for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country e.g. Tokyo, Japan",
                    }
                },
                "required": ["location"],
                "additionalProperties": False,
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform mathematical calculations",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Mathematical expression to evaluate, e.g. '2 * 3 + 4'",
                    }
                },
                "required": ["expression"],
                "additionalProperties": False,
            },
        },
    },
]

messages = [
    {"role": "system", "content": "You are a helpful assistant. Use tools when needed."},
    {"role": "user", "content": "What's the weather in Beijing?"},
]

# First request - model decides to call tool
completion = client.chat.completions.create(
    model="glm-4.7-fp8",
    messages=messages,
    tools=tools,
    max_tokens=4096,
    temperature=0.0,
)

tool_call = completion.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(f"Tool: {tool_call.function.name}, Args: {args}")

# Simulate tool execution and continue conversation
messages.append(completion.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": '{"city": "Beijing", "temperature": "26C", "weather": "Sunny"}',
})

# Second request - model processes tool result
completion_2 = client.chat.completions.create(
    model="glm-4.7-fp8",
    messages=messages,
    tools=tools,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(completion_2.choices[0].message.content)
```

--------------------------------

### Load GLM-4.5 with Transformers

Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md

Python snippet for loading the model using the Hugging Face transformers library.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "zai-org/GLM-4.5"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
```

--------------------------------

### Configure Thinking Mode for SGLang

Source: https://github.com/zai-org/glm-4.5/blob/main/README.md

Add this configuration to enable Preserved Thinking mode in SGLang for agentic tasks.

```json
"chat_template_kwargs": {
    "enable_thinking": true,
    "clear_thinking": false
}
```

--------------------------------

### SGLang PD-Disaggregation Decode Server

Source: https://github.com/zai-org/glm-4.5/blob/main/README.md

Launches an SGLang server for the GLM-4.5-Air model in decode disaggregation mode. This command specifies the port, IB device, tensor parallel size, and the base GPU ID for decoding.

```shell
python -m sglang.launch_server --model-path zai-org/GLM-4.5-Air --disaggregation-mode decode --port 30001 --disaggregation-ib-device mlx5_0 --tp-size 4 --base-gpu-id 4
```

--------------------------------

### Make GLM-4.6 Model Request with Tools

Source: https://github.com/zai-org/glm-4.5/blob/main/resources/glm_4.6_tir_guide.md

Use this code to interact with a served GLM-4.6 model through the raw completions endpoint. It defines the available tools, renders the prompt locally with the tokenizer's chat template, parses the model's text output, and executes any tool calls until the model stops requesting them.
```python
from transformers import AutoTokenizer
import json
import requests

# `parse_model_response` and `tool_map` must be defined before running this loop
api_base = "http://ip:port/v1"

tools = [
    {
        "name": "python",
        "description": "Interpreter for executing python code",
        "parameters": {
            "type": "object",
            "properties": {"code": {"description": "Code to execute", "type": "string"}},
        },
    },
    {
        "name": "browser.search",
        "description": "Search in browser",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"description": "Search query", "type": "string"},
                "num": {"description": "Number of results to return", "type": "integer", "default": 10},
            },
            "required": ["query"],
        },
    },
    {
        "name": "browser.open",
        "description": "Open browser link",
        "parameters": {
            "type": "object",
            "properties": {
                "id": {"description": "ID or URL of the link to open", "type": ["integer", "string"]},
            },
            "required": ["id"],
        },
    },
    {
        "name": "browser.find",
        "description": "Find pattern in the opened browser content",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"description": "Pattern to find", "type": "string"},
            },
            "required": ["pattern"],
        },
    },
]

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.6", trust_remote_code=True)

messages = [
    {"role": "user", "content": "Calculate the 1000th term of the Fibonacci sequence."},
]

while True:
    # Render the full conversation (including tool schemas) into a raw prompt
    prompt = tokenizer.apply_chat_template(messages, tools, add_generation_prompt=True, tokenize=False)
    request_data = {
        "model": "glm-4.6",
        "prompt": prompt,
        "max_tokens": 2048,
        "temperature": 1.0,
        "stop": ["<|user|>", "<|endoftext|>", "<|observation|>", "<|assistant|>"],
        "stream": False,
    }
    response = requests.post(
        f"{api_base}/completions",
        headers={"Content-Type": "application/json"},
        json=request_data,
        timeout=360
    )
    response_json = response.json()
    text = response_json["choices"][0]["text"]

    assistant_message = parse_model_response(text, tools)
    messages.append(assistant_message)

    tool_calls = assistant_message.get("tool_calls", [])
    if not tool_calls:
        break

    for tool_call in tool_calls:
        tool_name = tool_call["name"]
        tool_arguments = tool_call["arguments"]
        tool_output = tool_map[tool_name](**tool_arguments)
        messages.append({
            "role": "tool",
            "name": tool_name,
            "content": tool_output,
            "tool_call_id": tool_call["tool_call_id"],
        })

print(assistant_message)
```

--------------------------------

### Tool-Integrated Reasoning with Built-in Tools

Source: https://context7.com/zai-org/glm-4.5/llms.txt

Utilize GLM's built-in tools for Python code execution, web search, and browser operations. Define these tools in a similar format to custom tools, specifying their names, descriptions, and parameters.

```python
import requests
import json

api_base = "http://127.0.0.1:8000/v1"

# Define built-in tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Interpreter for executing python code",
            "parameters": {
                "type": "object",
                "properties": {"code": {"description": "Code to execute", "type": "string"}},
            },
        }
    },
    {
        "type": "function",
        "function": {
            "name": "browser.search",
            "description": "Search in browser",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"description": "Search query", "type": "string"},
                    "num": {"description": "Number of results to return", "type": "integer", "default": 10},
                },
                "required": ["query"],
            },
        }
    },
]
```

--------------------------------

### Configure GLM Models with Claude Code Router

Source: https://context7.com/zai-org/glm-4.5/llms.txt

This JSON configuration sets up GLM models to work with the claude-code-router. It defines the provider details and routing rules for different task types.
```json
{
  "LOG": true,
  "Providers": [
    {
      "name": "glm-4.7-sglang",
      "api_base_url": "http://your-server-ip:8000/v1/chat/completions",
      "api_key": "EMPTY",
      "models": [
        "glm-4.7"
      ]
    }
  ],
  "Router": {
    "default": "glm-4.7-sglang,glm-4.7",
    "background": "glm-4.7-sglang,glm-4.7",
    "think": "glm-4.7-sglang,glm-4.7",
    "longContext": "glm-4.7-sglang,glm-4.7",
    "webSearch": "glm-4.7-sglang,glm-4.7"
  }
}
```

--------------------------------

### Chat Completion with Thinking Disabled

Source: https://context7.com/zai-org/glm-4.5/llms.txt

Disable thinking mode for faster responses on simple tasks. Set `enable_thinking` to `False` within `chat_template_kwargs` in the `extra_body` parameter. This is useful for quick, direct answers.

```python
completion_fast = client.chat.completions.create(
    model="glm-4.7-fp8",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=100,
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)
print(completion_fast.choices[0].message.content)
```

--------------------------------

### Restart Claude Code Router

Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md

Output displayed after restarting the router service.

```text
Service was not running or failed to stop.
Starting claude code router service...
✅ Service started successfully in the background.
```

--------------------------------

### Execute Python Code Safely

Source: https://context7.com/zai-org/glm-4.5/llms.txt

This function runs Python code with `exec` in a fresh globals dictionary and catches any exception the code raises. It is useful for running code generated by the model, but `exec` is not a sandbox: the code still has full access to the interpreter and the host, so run untrusted code in an isolated environment. The code should assign its answer to a `result` variable, which is returned as the tool output.
```python
def python_tool(code: str) -> str:
    # Run the code in a fresh namespace; errors are caught, but this is not a sandbox
    try:
        exec_globals = {}
        exec(code, exec_globals)
        return str(exec_globals.get('result', 'Code executed successfully'))
    except Exception as e:
        return f"Error: {str(e)}"
```

--------------------------------

### Parse Model Response with Reasoning and Tool Calls

Source: https://context7.com/zai-org/glm-4.5/llms.txt

This Python function manually parses a raw string response from a GLM model, extracting reasoning, regular content, and structured tool calls. It uses regular expressions to identify and parse content enclosed in specific tags like `<think>` and `<tool_call>`.

```python
import re
import json
import uuid

def parse_model_response(response: str, defined_tools: list):
    """Parse GLM response to extract reasoning and tool calls."""
    text = response.strip()
    reasoning_content = None
    content = None
    tool_calls = []

    # Extract reasoning content wrapped in <think>...</think>
    if text.startswith('<think>'):
        if '</think>' in text:
            reasoning_content, text = text.rsplit('</think>', 1)
            reasoning_content = reasoning_content.removeprefix('<think>').strip()
            text = text.strip()
        else:
            reasoning_content = text.removeprefix('<think>').strip()
            text = ""

    # Extract regular content before tool calls
    if '<tool_call>' in text:
        index = text.find('<tool_call>')
        content = text[:index].strip()
        text = text[index:].strip()
    else:
        content = text.strip()
        text = ""

    # Parse tool calls wrapped in <tool_call>...</tool_call>
    tool_call_strs = re.findall(r'<tool_call>(.*?)</tool_call>', text, re.DOTALL)
    for call in tool_call_strs:
        # The function name is the first token inside the tool call
        func_name_match = re.match(r'([^ <]+)', call.strip())
        func_name = func_name_match.group(1).strip() if func_name_match else None
        if func_name:
            pairs = re.findall(r'<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>', call, re.DOTALL)
            arguments = {}
            for arg_key, arg_value in pairs:
                arg_key = arg_key.strip()
                arg_value = arg_value.strip()
                try:
                    # Values that are valid JSON (numbers, lists, ...) are decoded
                    arguments[arg_key] = json.loads(arg_value)
                except json.JSONDecodeError:
                    # Anything else (for example source code) stays a plain string
                    arguments[arg_key] = arg_value
            tool_calls.append({
                'tool_call_id': "tool-call-" + str(uuid.uuid4()),
                'name': func_name,
                'arguments': arguments
            })

    message = {'role': 'assistant'}
    if reasoning_content:
        message['reasoning_content'] = reasoning_content
    if content:
        message['content'] = content
    if tool_calls:
        message['tool_calls'] = tool_calls
    return message
```
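
The `tool_calls` list returned by `parse_model_response` still has to be executed and turned into `tool` role messages, and the loops in this document do that with a direct `tool_map[tool_name](...)` lookup that raises if the model names a tool that is not registered. The following is a minimal sketch of a more forgiving dispatcher, not part of the GLM repository: the helper name `run_tool_calls` and the demo tools are hypothetical, and it assumes each call is a dict with the `tool_call_id`, `name`, and `arguments` keys produced by the parser.

```python
import json

def run_tool_calls(tool_calls, tool_map):
    """Execute parsed tool calls and return one `tool` role message per call.

    Hypothetical helper: unknown tools and tool exceptions are reported back
    to the model as text instead of crashing the agent loop.
    """
    messages = []
    for call in tool_calls:
        name = call["name"]
        tool = tool_map.get(name)
        if tool is None:
            output = f"Error: unknown tool '{name}'"
        else:
            try:
                output = tool(**call["arguments"])
            except Exception as exc:
                output = f"Error: {exc}"
        # Message content must be a string, so serialize other return types
        if not isinstance(output, str):
            output = json.dumps(output)
        messages.append({
            "role": "tool",
            "name": name,
            "content": output,
            "tool_call_id": call["tool_call_id"],
        })
    return messages

# Demo with stand-in tools (assumed for illustration only)
demo_tool_map = {
    "echo": lambda text: text.upper(),
    "add": lambda a, b: a + b,
}
demo_calls = [
    {"tool_call_id": "tool-call-1", "name": "echo", "arguments": {"text": "hi"}},
    {"tool_call_id": "tool-call-2", "name": "add", "arguments": {"a": 1, "b": 2}},
    {"tool_call_id": "tool-call-3", "name": "missing", "arguments": {}},
]
demo_messages = run_tool_calls(demo_calls, demo_tool_map)
for message in demo_messages:
    print(message["tool_call_id"], message["name"], "->", message["content"])
```

Returning the error text as the tool output lets the model see what went wrong and retry with a different tool or arguments, which is usually preferable to aborting a long agentic run.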