### Server Startup Output
Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md
Example output indicating the server has started successfully.
```text
[2025-07-26 16:09:07] INFO: Started server process [80269]
[2025-07-26 16:09:07] INFO: Waiting for application startup.
[2025-07-26 16:09:07] INFO: Application startup complete.
[2025-07-26 16:09:07] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[2025-07-26 16:09:08] INFO: 127.0.0.1:57722 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-07-26 16:09:11] INFO: 127.0.0.1:57732 - "POST /generate HTTP/1.1" 200 OK
[2025-07-26 16:09:11] The server is fired up and ready to roll!
```
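A deployment script can wait for readiness by scanning the captured log for the final line of the sample output. This is a minimal sketch that assumes the log text is already available as a string. The function name `server_ready` is hypothetical, and the marker string is copied from the output shown here.

```python
READY_MARKER = "The server is fired up and ready to roll!"


def server_ready(log_text: str) -> bool:
    """Return True once the readiness line appears anywhere in the log."""
    return any(READY_MARKER in line for line in log_text.splitlines())


sample_log = """\
[2025-07-26 16:09:07] INFO: Application startup complete.
[2025-07-26 16:09:11] The server is fired up and ready to roll!
"""
print(server_ready(sample_log))
```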
--------------------------------
### Install SGLang
Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md
Install the SGLang package on the server.
```bash
pip install sglang
```
--------------------------------
### Install and Run Claude Code Router with GLM Backend
Source: https://context7.com/zai-org/glm-4.5/llms.txt
These shell commands demonstrate how to install Claude Code and its router, configure it to use a GLM backend, and start a session. Ensure the router configuration file is saved to ~/.claude-code-router/config.json.
```shell
# Install Claude Code and Router
npm install -g @anthropic-ai/claude-code
npm install -g @musistudio/claude-code-router
# Save config to ~/.claude-code-router/config.json
# Restart the router
ccr restart
# Start Claude Code with GLM backend
ccr code
# Example session:
# > how can I improve this function's performance?
# ⏺ I'll analyze the function and suggest optimizations...
```
--------------------------------
### Install Claude Code and Router
Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md
Install the required CLI tools globally using npm.
```bash
npm install -g @anthropic-ai/claude-code
npm install -g @musistudio/claude-code-router
```
--------------------------------
### Launch GLM-4.5 Server
Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md
Start the model service with the specified configuration parameters.
```bash
python3 -m sglang.launch_server \
--model-path zai-org/GLM-4.5 \
--tp-size 16 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.7 \
--served-model-name glm-4.5 \
--port 8000 \
--host 0.0.0.0 # Or your server's internal/public IP address
```
--------------------------------
### Hugging Face Login and Install Prerequisites
Source: https://github.com/zai-org/glm-4.5/blob/main/example/AMD_GPU/README.md
Log in to Hugging Face and install the Python packages and dependencies that vLLM needs on AMD GPUs. Run the commands from the directories shown, and set the `PYTORCH_ROCM_ARCH` environment variable before building (the example uses `gfx942`).
```shell
huggingface-cli login
```
```shell
pip uninstall vllm
pip install --upgrade pip
cd GLM-4.5/example/AMD_GPU/
pip install -r rocm-requirements.txt
git clone https://github.com/vllm-project/vllm.git
cd vllm
export PYTORCH_ROCM_ARCH="gfx942"
python3 setup.py develop
```
--------------------------------
### Example Usage of Model Response Parser
Source: https://context7.com/zai-org/glm-4.5/llms.txt
This example demonstrates how to use the `parse_model_response` function with a sample raw model output. It prints the extracted reasoning, content, and the number of tool calls found.
```python
raw_response = """
<think>
The user wants to calculate the 1000th Fibonacci number. I should use the Python interpreter.
</think>
I'll calculate the 1000th Fibonacci number for you.
<tool_call>python
<arg_key>code</arg_key>
<arg_value>def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
result = fib(1000)</arg_value>
</tool_call>
"""
parsed = parse_model_response(raw_response, [])
print(f"Reasoning: {parsed.get('reasoning_content', 'None')[:50]}...")
print(f"Content: {parsed.get('content', 'None')}")
print(f"Tool calls: {len(parsed.get('tool_calls', []))}")
```
--------------------------------
### Deploy GLM on AMD GPUs with ROCm-vLLM
Source: https://context7.com/zai-org/glm-4.5/llms.txt
This sequence of shell commands outlines the steps to deploy GLM models on AMD MI300X GPUs using a ROCm-enabled vLLM container. It includes launching the Docker container, installing dependencies, building vLLM from source for ROCm, and starting the vLLM server.
```shell
# Step 1: Launch ROCm-vLLM Docker container
docker run -it --rm \
--cap-add=SYS_PTRACE \
-e SHELL=/bin/bash \
--network=host \
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
-v /:/workspace \
--group-add video \
--ipc=host \
--name vllm_GLM \
rocm/vllm-dev:nightly
# Step 2: Inside the container, install dependencies
huggingface-cli login
pip uninstall vllm
pip install --upgrade pip
pip install -r rocm-requirements.txt
# Build vLLM from source for ROCm
git clone https://github.com/vllm-project/vllm.git
cd vllm
export PYTORCH_ROCM_ARCH="gfx942"
python3 setup.py develop
# Step 3: Start vLLM server
VLLM_USE_V1=1 vllm serve zai-org/GLM-4.5 \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.95 \
--disable-log-requests \
--no-enable-prefix-caching \
--trust-remote-code
```
--------------------------------
### Run vLLM Online Serving for GLM-4.5
Source: https://github.com/zai-org/glm-4.5/blob/main/example/AMD_GPU/README.md
Command to start vLLM online serving for the zai-org/GLM-4.5 model on AMD GPUs. Adjust tensor-parallel-size and gpu-memory-utilization as needed.
```shell
VLLM_USE_V1=1 vllm serve zai-org/GLM-4.5 --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --disable-log-requests --no-enable-prefix-caching --trust-remote-code
```
--------------------------------
### SGLang PD-Disaggregation Prefill Server
Source: https://github.com/zai-org/glm-4.5/blob/main/README.md
Starts an SGLang server for the GLM-4.5-Air model in prefill disaggregation mode. Requires specifying the model path, disaggregation mode, IB device, and tensor parallel size.
```shell
python -m sglang.launch_server --model-path zai-org/GLM-4.5-Air --disaggregation-mode prefill --disaggregation-ib-device mlx5_0 --tp-size 4
```
--------------------------------
### SGLang PD-Disaggregation Router
Source: https://github.com/zai-org/glm-4.5/blob/main/README.md
Starts the SGLang router for PD-Disaggregation, connecting to the prefill and decode servers. This command configures the router to listen on all interfaces and a specified port.
```shell
python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
```
--------------------------------
### Claude Code Interface Output
Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md
Example terminal output when running the Claude Code command.
```text
zr@MacBook GLM-4.5 % ccr code
Service not running, starting service...
╭───────────────────────────────────────────────────╮
│ ✻ Welcome to Claude Code! │
│ │
│ /help for help, /status for your current setup │
│ │
│ cwd: /Users/zr/Code/GLM-4.5 │
│ │
│ ─────────────────────────────────────────────── │
│ │
│ Overrides (via env): │
│ │
│ • API timeout: 600000ms │
│ • API Base URL: http://127.0.0.1:3456 │
╰───────────────────────────────────────────────────╯
※ Tip: Press Esc twice to edit your previous messages
```
--------------------------------
### Enable Preserved Thinking for Agentic Tasks
Source: https://context7.com/zai-org/glm-4.5/llms.txt
Use this Python code to enable preserved thinking mode for agentic tasks with GLM-4.7. This retains all thinking blocks across turns for complex, long-horizon tasks. Ensure the server is started with the appropriate chat template kwargs.
```python
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1")
# For agentic tasks with GLM-4.7, enable preserved thinking
# This retains all thinking blocks across turns for complex, long-horizon tasks
# When starting the server with SGLang, add these chat template kwargs:
# python3 -m sglang.launch_server \
# --model-path zai-org/GLM-4.7 \
# ... other args ...
# Then in your requests, enable preserved thinking:
completion = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {"role": "system", "content": "You are a coding agent working on a complex refactoring task."},
        {"role": "user", "content": "Analyze this codebase and create a plan to improve its architecture."},
    ],
    max_tokens=8192,
    temperature=0.7,
    extra_body={
        "chat_template_kwargs": {
            "enable_thinking": True,  # Enable thinking mode
            "clear_thinking": False,  # Preserve thinking across turns
        }
    },
)
# The model will maintain reasoning context across multiple turns,
# reducing information loss and improving consistency in complex tasks
print(completion.choices[0].message.content)
```
--------------------------------
### Implement Tool Logic
Source: https://github.com/zai-org/glm-4.5/blob/main/resources/glm_4.6_tir_guide.md
Provide the actual execution logic for the tools and map them to their names.
```python
def python_tool(code: str) -> str:
    # Your implementation of the python tool
    return "Result of python execution."


def search_tool(query: str, num: int = 10) -> str:
    # Your implementation of the search tool
    return "Search results for query."


def open_tool(id) -> str:
    # Your implementation of the open tool
    return "Opened result."


def find_tool(pattern: str) -> str:
    # Your implementation of the find tool
    return "Found pattern."


# Map the tool names exposed to the model to their implementations
tool_map = {
    "python": python_tool,
    "browser.search": search_tool,
    "browser.open": open_tool,
    "browser.find": find_tool,
}
```
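Looking a tool up directly in `tool_map` raises `KeyError` when the model asks for a name that was never registered, and malformed arguments raise as well. The sketch below shows one way to turn both cases into an error string the model can read. The helper `dispatch_tool_call` and the `echo_tool` stand-in are hypothetical names, not part of the source.

```python
import json


def dispatch_tool_call(tool_map: dict, name: str, arguments_json: str) -> str:
    """Look up a tool by name, run it, and always return a string."""
    tool = tool_map.get(name)
    if tool is None:
        return f"Error: unknown tool '{name}'"
    try:
        arguments = json.loads(arguments_json)
        return str(tool(**arguments))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or arguments that do not match the tool signature
        return f"Error: bad arguments for '{name}': {exc}"


def echo_tool(text: str) -> str:  # hypothetical stand-in tool
    return f"echo: {text}"


demo_map = {"echo": echo_tool}
print(dispatch_tool_call(demo_map, "echo", '{"text": "hi"}'))
print(dispatch_tool_call(demo_map, "missing", "{}"))
```

Returning the error as the tool output, rather than raising, lets the conversation continue so the model can retry with corrected arguments.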
--------------------------------
### Configure PD-Disaggregation with SGLang
Source: https://context7.com/zai-org/glm-4.5/llms.txt
Split prefill and decode phases across separate GPU sets for improved throughput.
```shell
# Start prefill server (uses GPUs 0-3)
python -m sglang.launch_server \
--model-path zai-org/GLM-4.5-Air \
--disaggregation-mode prefill \
--disaggregation-ib-device mlx5_0 \
--tp-size 4
# Start decode server (uses GPUs 4-7)
python -m sglang.launch_server \
--model-path zai-org/GLM-4.5-Air \
--disaggregation-mode decode \
--port 30001 \
--disaggregation-ib-device mlx5_0 \
--tp-size 4 \
--base-gpu-id 4
# Start router to manage prefill/decode traffic
python -m sglang_router.launch_router \
--pd-disaggregation \
--prefill http://127.0.0.1:30000 \
--decode http://127.0.0.1:30001 \
--host 0.0.0.0 \
--port 8000
```
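The comments in the commands above state that the prefill server uses GPUs 0-3 and the decode server uses GPUs 4-7. The sketch below checks such a layout before launch. It assumes each server occupies `--tp-size` consecutive GPU indices starting at `--base-gpu-id`, which matches those comments. The helper names `gpu_ids` and `overlapping` are hypothetical.

```python
def gpu_ids(tp_size: int, base_gpu_id: int = 0) -> list:
    """GPU indices one server would occupy, assuming consecutive allocation."""
    return list(range(base_gpu_id, base_gpu_id + tp_size))


def overlapping(*allocations: list) -> set:
    """Return GPU indices claimed by more than one server."""
    seen, clashes = set(), set()
    for allocation in allocations:
        clashes |= seen & set(allocation)
        seen |= set(allocation)
    return clashes


prefill = gpu_ids(tp_size=4)                # --tp-size 4
decode = gpu_ids(tp_size=4, base_gpu_id=4)  # --tp-size 4 --base-gpu-id 4
print(prefill, decode, overlapping(prefill, decode))
```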
--------------------------------
### Launch GLM-4.6 with SGLang
Source: https://github.com/zai-org/glm-4.5/blob/main/resources/glm_4.6_tir_guide.md
Configure the inference engine with specific parsers for reasoning and tool calls.
```bash
python3 -m sglang.launch_server \
--model-path zai-org/GLM-4.6 \
--tp-size 8 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.7 \
--served-model-name glm-4.6 \
--host 0.0.0.0 \
--port 8000
```
--------------------------------
### Run Claude Code
Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md
Command to initiate the Claude Code interface.
```bash
ccr code
```
--------------------------------
### Serve Models with vLLM
Source: https://context7.com/zai-org/glm-4.5/llms.txt
Deploy GLM models as an OpenAI-compatible API server using vLLM with speculative decoding.
```shell
# Start vLLM server with GLM-4.7 FP8 model
vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 4 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-fp8
# For GLM-4.5 with full BF16 precision
vllm serve zai-org/GLM-4.5 \
--tensor-parallel-size 16 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.5
```
--------------------------------
### Basic Chat Completion with Thinking Enabled
Source: https://context7.com/zai-org/glm-4.5/llms.txt
Use this for general chat completions. The request sets no `chat_template_kwargs`, so the model's default thinking behaviour applies and the model can reason before answering. The snippet assumes `client` is an OpenAI client already connected to the server and that `model` matches the `--served-model-name` used at launch.
```python
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Explain how quicksort works and provide Python code"},
]
completion = client.chat.completions.create(
    model="glm-4.7-fp8",
    messages=messages,
    max_tokens=4096,
    temperature=0.7,
)
print(completion.choices[0].message.content)
```
--------------------------------
### Run Basic Inference with Transformers
Source: https://context7.com/zai-org/glm-4.5/llms.txt
Load and run GLM models using Hugging Face Transformers. Requires torch and transformers libraries.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "zai-org/GLM-4.7"  # or "zai-org/GLM-4.5", "zai-org/GLM-4.6"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Prepare messages using chat template
messages = [{"role": "user", "content": "Write a Python function to calculate factorial"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate response and decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
print(output_text)
# Output: def factorial(n):
#             if n < 0:
#                 raise ValueError("Factorial is not defined for negative numbers")
#             if n == 0 or n == 1:
#                 return 1
#             return n * factorial(n - 1)
```
--------------------------------
### Serve Models with SGLang
Source: https://context7.com/zai-org/glm-4.5/llms.txt
Deploy GLM models using SGLang with EAGLE speculative decoding for high-performance inference.
```shell
# Start SGLang server with GLM-4.7 FP8 model
python3 -m sglang.launch_server \
--model-path zai-org/GLM-4.7-FP8 \
--tp-size 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.8 \
--served-model-name glm-4.7-fp8 \
--host 0.0.0.0 \
--port 8000
# Server output when ready:
# [INFO] The server is fired up and ready to roll!
# [INFO] Uvicorn running on http://0.0.0.0:8000
```
--------------------------------
### vLLM Compilation Output
Source: https://github.com/zai-org/glm-4.5/blob/main/example/AMD_GPU/README.md
This output shows the compilation process for vLLM, including HIP compiler detection and extension building. It confirms successful configuration and generation of build files.
```shell
-- Found Torch: /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch.so
-- The HIP compiler identification is Clang 19.0.0
-- Detecting HIP compiler ABI info
-- Detecting HIP compiler ABI info - done
-- Check for working HIP compiler: /opt/rocm/lib/llvm/bin/clang++ - skipped
-- Detecting HIP compile features
-- Detecting HIP compile features - done
-- HIP supported arches: gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201
-- FetchContent base directory: /app/GLM-4.5/vllm/.deps
-- Enabling C extension.
-- Enabling moe extension.
-- Configuring done (19.7s)
-- Generating done (0.0s)
-- Build files have been written to: /app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312
[1/35] Running hipify on _C extension source files.
/app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/selective_scan.h -> /app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/selective_scan_hip.h [ok]
/app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/static_switch.h -> /app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/static_switch.h [skipped, no changes]
/app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/selective_scan_fwd.cu -> /app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/mamba/mamba_ssm/selective_scan_fwd.hip [ok]
/app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/cuda_utils.h -> /app/GLM-4.5/vllm/build/temp.linux-x86_64-cpython-312/csrc/hip_utils.h [ok]
...
...
...
Using /usr/local/lib/python3.12/dist-packages
Finished processing dependencies for vllm==0.10.1.dev343+g54de71d0d.rocm641
```
--------------------------------
### vLLM Inference Server Configuration
Source: https://github.com/zai-org/glm-4.5/blob/main/README.md
Use this command to serve the GLM-4.7-FP8 model with vLLM, enabling tensor parallelism and specific tool/reasoning parsers for enhanced functionality.
```shell
vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 4 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice \
--served-model-name glm-4.7-fp8
```
--------------------------------
### Make OpenAI-Compatible API Requests
Source: https://context7.com/zai-org/glm-4.5/llms.txt
Connect to a locally deployed GLM server using the OpenAI Python client.
```python
from openai import OpenAI
import json

# Connect to local GLM server
client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8000/v1",
)
```
--------------------------------
### Launch ROCm-vLLM Docker Container
Source: https://github.com/zai-org/glm-4.5/blob/main/example/AMD_GPU/README.md
Use this command to launch the ROCm-vLLM development container with the necessary device mappings and volume mounts.
```shell
docker run -it --rm \
--cap-add=SYS_PTRACE \
-e SHELL=/bin/bash \
--network=host \
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
-v /:/workspace \
--group-add video \
--ipc=host \
--name vllm_GLM \
rocm/vllm-dev:nightly
```
--------------------------------
### Make GLM-4.5 Model Request with Tool Calls
Source: https://github.com/zai-org/glm-4.5/blob/main/resources/glm_4.6_tir_guide.md
This code sends requests to the chat completions API of a GLM server (the example targets `glm-4.6`) and handles tool calls in a loop until the model returns a final response. Ensure `tools` and `tool_map` are defined before running it, and replace the `ip:port` placeholder in `api_base` with your server address.
```python
import requests
import json

api_base = "http://ip:port/v1"
messages = [
    {"role": "user", "content": "Calculate the 1000th term of the Fibonacci sequence."},
]

finish_reason = None
while finish_reason is None or finish_reason == "tool_calls":
    request_data = {
        "model": "glm-4.6",
        "messages": messages,
        "max_tokens": 2048,
        "temperature": 1.0,
        "stop": ["<|user|>", "<|endoftext|>", "<|observation|>", "<|assistant|>"],
        "tools": tools,
        "stream": False,
    }
    response = requests.post(
        f"{api_base}/chat/completions",
        headers={"Content-Type": "application/json"},
        json=request_data,
    )
    response.raise_for_status()
    response_json = response.json()
    choice = response_json["choices"][0]
    finish_reason = choice["finish_reason"]
    if finish_reason == "tool_calls":
        # Record the assistant turn, then run each requested tool
        messages.append(choice["message"])
        tool_calls = choice["message"]["tool_calls"]
        for tool_call in tool_calls:
            tool_name = tool_call["function"]["name"]
            tool_arguments = json.loads(tool_call["function"]["arguments"])
            tool_call_id = tool_call["id"]
            tool_output = tool_map[tool_name](**tool_arguments)
            messages.append({
                "role": "tool",
                "name": tool_name,
                "content": tool_output,
                "tool_call_id": tool_call_id,
            })

print(choice["message"]["content"])
```
--------------------------------
### SGLang Inference Server Configuration
Source: https://github.com/zai-org/glm-4.5/blob/main/README.md
Launch the SGLang server for GLM-4.7-FP8 with specified tensor parallelism, speculative decoding parameters, and host/port settings. Includes configurations for tool and reasoning parsers.
```shell
python3 -m sglang.launch_server \
--model-path zai-org/GLM-4.7-FP8 \
--tp-size 8 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.8 \
--served-model-name glm-4.7-fp8 \
--host 0.0.0.0 \
--port 8000
```
--------------------------------
### Perform Web Search
Source: https://context7.com/zai-org/glm-4.5/llms.txt
A placeholder function for performing web searches. It returns a JSON-formatted string with dummy results. Replace the implementation with actual search logic.
```python
import json


def search_tool(query: str, num: int = 10) -> str:
    # Your search implementation; this placeholder ignores `num`
    return json.dumps({"results": [{"title": f"Result for {query}", "url": "https://example.com"}]})
```
--------------------------------
### Define Tool Schemas
Source: https://github.com/zai-org/glm-4.5/blob/main/resources/glm_4.6_tir_guide.md
Define the function signatures for built-in tools to be provided to the model.
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Interpreter for executing python code",
            "parameters": {
                "type": "object",
                "properties": {"code": {"description": "Code to execute", "type": "string"}},
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "browser.search",
            "description": "Search in browser",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"description": "Search query", "type": "string"},
                    "num": {"description": "Number of results to return", "type": "integer", "default": 10},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "browser.open",
            "description": "Open browser link",
            "parameters": {
                "type": "object",
                "properties": {
                    "id": {"description": "ID or URL of the link to open", "type": ["integer", "string"]},
                },
                "required": ["id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "browser.find",
            "description": "Find pattern in the opened browser content",
            "parameters": {
                "type": "object",
                "properties": {
                    "pattern": {"description": "Pattern to find", "type": "string"},
                },
                "required": ["pattern"],
            },
        },
    },
]
```
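Each schema lists its mandatory parameters under `required`, so the arguments a model produces can be checked before a tool runs. The sketch below assumes the schema layout shown here (`function` → `parameters` → `required`). The helper name `missing_required` is hypothetical.

```python
def missing_required(tool_schema: dict, arguments: dict) -> list:
    """Return the required parameter names that are absent from arguments."""
    params = tool_schema["function"].get("parameters", {})
    return [name for name in params.get("required", []) if name not in arguments]


search_schema = {
    "type": "function",
    "function": {
        "name": "browser.search",
        "description": "Search in browser",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"description": "Search query", "type": "string"},
                "num": {"description": "Number of results to return", "type": "integer", "default": 10},
            },
            "required": ["query"],
        },
    },
}

print(missing_required(search_schema, {"num": 5}))
print(missing_required(search_schema, {"query": "glm"}))
```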
--------------------------------
### Agentic Loop for Tool Use
Source: https://context7.com/zai-org/glm-4.5/llms.txt
This code demonstrates an agentic loop that continuously calls an AI model and processes its responses, including tool calls. It appends tool outputs back to the message history for the model to continue the conversation.
```python
import json
import requests

# Assumes `api_base`, `tools`, `python_tool`, and `search_tool` are already defined
tool_map = {"python": python_tool, "browser.search": search_tool}
messages = [
    {"role": "user", "content": "Calculate the 1000th term of the Fibonacci sequence."},
]

# Agentic loop - continue until model stops calling tools
finish_reason = None
while finish_reason is None or finish_reason == "tool_calls":
    response = requests.post(
        f"{api_base}/chat/completions",
        headers={"Content-Type": "application/json"},
        json={
            "model": "glm-4.6",
            "messages": messages,
            "max_tokens": 2048,
            "temperature": 1.0,
            "stop": ["<|user|>", "<|endoftext|>", "<|observation|>", "<|assistant|>"],
            "tools": tools,
            "stream": False,
        },
    )
    response.raise_for_status()
    choice = response.json()["choices"][0]
    finish_reason = choice["finish_reason"]
    if finish_reason == "tool_calls":
        messages.append(choice["message"])
        for tool_call in choice["message"]["tool_calls"]:
            tool_name = tool_call["function"]["name"]
            tool_args = json.loads(tool_call["function"]["arguments"])
            tool_output = tool_map[tool_name](**tool_args)
            messages.append({
                "role": "tool",
                "name": tool_name,
                "content": tool_output,
                "tool_call_id": tool_call["id"],
            })

print(choice["message"]["content"])
```
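The loop above runs for as long as the model keeps requesting tools. The sketch below adds a turn limit and takes the model call as a parameter, so the control flow can be exercised without a server. `run_agent_loop`, `fake_model`, and the `add` tool are hypothetical names introduced for illustration. The message shapes follow the loop above.

```python
import json
from typing import Callable


def run_agent_loop(call_model: Callable[[list], dict], tool_map: dict,
                   messages: list, max_turns: int = 8) -> dict:
    """Call the model until it stops requesting tools or max_turns is reached."""
    choice = {}
    for _ in range(max_turns):
        choice = call_model(messages)
        if choice["finish_reason"] != "tool_calls":
            break
        messages.append(choice["message"])
        for tool_call in choice["message"]["tool_calls"]:
            name = tool_call["function"]["name"]
            args = json.loads(tool_call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "name": name,
                "content": tool_map[name](**args),
                "tool_call_id": tool_call["id"],
            })
    return choice


# Hypothetical stub replacing the HTTP request: asks for a tool once, then answers.
def fake_model(messages: list) -> dict:
    if not any(m["role"] == "tool" for m in messages):
        call = {"id": "call-1", "function": {"name": "add", "arguments": '{"a": 2, "b": 3}'}}
        return {"finish_reason": "tool_calls",
                "message": {"role": "assistant", "content": None, "tool_calls": [call]}}
    return {"finish_reason": "stop",
            "message": {"role": "assistant", "content": f"The sum is {messages[-1]['content']}."}}


history = [{"role": "user", "content": "What is 2 + 3?"}]
final = run_agent_loop(fake_model, {"add": lambda a, b: str(a + b)}, history)
print(final["message"]["content"])
```

If the limit is reached, the returned choice still has `finish_reason == "tool_calls"`, which the caller can treat as an incomplete run.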
--------------------------------
### Tool Calling with Function Definitions
Source: https://context7.com/zai-org/glm-4.5/llms.txt
Enable GLM to call external functions by providing tool definitions in OpenAI-compatible format. This involves defining the tool's name, description, and parameters. The model will then decide when to call these tools based on the user's request.
```python
from openai import OpenAI
import json

client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8000/v1")

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current temperature for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country e.g. Tokyo, Japan",
                    }
                },
                "required": ["location"],
                "additionalProperties": False,
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform mathematical calculations",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "Mathematical expression to evaluate, e.g. '2 * 3 + 4'",
                    }
                },
                "required": ["expression"],
                "additionalProperties": False,
            },
        },
    },
]

messages = [
    {"role": "system", "content": "You are a helpful assistant. Use tools when needed."},
    {"role": "user", "content": "What's the weather in Beijing?"},
]

# First request - model decides to call tool
completion = client.chat.completions.create(
    model="glm-4.7-fp8",
    messages=messages,
    tools=tools,
    max_tokens=4096,
    temperature=0.0,
)
tool_call = completion.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
print(f"Tool: {tool_call.function.name}, Args: {args}")

# Simulate tool execution and continue conversation
messages.append(completion.choices[0].message)
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": '{"city": "Beijing", "temperature": "26C", "weather": "Sunny"}',
})

# Second request - model processes tool result
completion_2 = client.chat.completions.create(
    model="glm-4.7-fp8",
    messages=messages,
    tools=tools,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(completion_2.choices[0].message.content)
```
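The `calculate` tool is declared above but the source gives no implementation. One possible implementation, offered as a sketch rather than the project's own code, walks the parsed expression tree and accepts only numeric literals and basic arithmetic, so that model-written input never reaches `eval`. The operator table and the function body are assumptions.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str) -> float:
    """Evaluate +, -, *, / on numeric literals and reject everything else."""
    def evaluate(node: ast.AST) -> float:
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("unsupported expression")

    return evaluate(ast.parse(expression, mode="eval").body)


print(calculate("2 * 3 + 4"))
```

Anything outside the whitelist, such as a function call or attribute access, raises `ValueError`, which the caller can report back to the model as the tool result.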
--------------------------------
### Load GLM-4.5 with Transformers
Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md
Python snippet for loading the model using the Hugging Face transformers library.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "zai-org/GLM-4.5"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
--------------------------------
### Configure Thinking Mode for SGLang
Source: https://github.com/zai-org/glm-4.5/blob/main/README.md
Add this fragment to the request body to enable Preserved Thinking mode in SGLang for agentic tasks. With the OpenAI Python client it can be passed through `extra_body`.
```json
"chat_template_kwargs": {
"enable_thinking": true,
"clear_thinking": false
}
```
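When the same client code switches between thinking modes, the request fragment can be built by a small helper. This sketch uses the key names from the fragment above. The function name `thinking_extra_body` and its parameters are hypothetical, and it only sets `clear_thinking` when preservation is requested, because the default value is not stated in the source.

```python
import json


def thinking_extra_body(enable_thinking: bool = True, preserve: bool = False) -> dict:
    """Build the extra request-body fragment for the chosen thinking mode."""
    kwargs = {"enable_thinking": enable_thinking}
    if enable_thinking and preserve:
        # clear_thinking=False keeps earlier thinking blocks across turns
        kwargs["clear_thinking"] = False
    return {"chat_template_kwargs": kwargs}


print(json.dumps(thinking_extra_body(preserve=True), indent=2))
print(json.dumps(thinking_extra_body(enable_thinking=False)))
```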
--------------------------------
### SGLang PD-Disaggregation Decode Server
Source: https://github.com/zai-org/glm-4.5/blob/main/README.md
Launches an SGLang server for the GLM-4.5-Air model in decode disaggregation mode. This command specifies the port, IB device, tensor parallel size, and the base GPU ID for decoding.
```shell
python -m sglang.launch_server --model-path zai-org/GLM-4.5-Air --disaggregation-mode decode --port 30001 --disaggregation-ib-device mlx5_0 --tp-size 4 --base-gpu-id 4
```
--------------------------------
### Make GLM-4.5 Model Request with Tools
Source: https://github.com/zai-org/glm-4.5/blob/main/resources/glm_4.6_tir_guide.md
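Use this code to drive a GLM server (the example targets `glm-4.6`) through the raw completions endpoint. It defines the available tools, renders the prompt with the tokenizer's chat template, and parses the model's text output, executing tool calls until none remain. It assumes `parse_model_response` and `tool_map` are already defined.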
```python
from transformers import AutoTokenizer
import json
import requests

api_base = "http://ip:port/v1"
tools = [
    {
        "name": "python",
        "description": "Interpreter for executing python code",
        "parameters": {
            "type": "object",
            "properties": {"code": {"description": "Code to execute", "type": "string"}},
        },
    },
    {
        "name": "browser.search",
        "description": "Search in browser",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"description": "Search query", "type": "string"},
                "num": {"description": "Number of results to return", "type": "integer", "default": 10},
            },
            "required": ["query"],
        },
    },
    {
        "name": "browser.open",
        "description": "Open browser link",
        "parameters": {
            "type": "object",
            "properties": {
                "id": {"description": "ID or URL of the link to open", "type": ["integer", "string"]},
            },
            "required": ["id"],
        },
    },
    {
        "name": "browser.find",
        "description": "Find pattern in the opened browser content",
        "parameters": {
            "type": "object",
            "properties": {
                "pattern": {"description": "Pattern to find", "type": "string"},
            },
            "required": ["pattern"],
        },
    },
]

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.6", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Calculate the 1000th term of the Fibonacci sequence."},
]

while True:
    # Render the whole conversation, including tool definitions, into one prompt
    prompt = tokenizer.apply_chat_template(messages, tools, add_generation_prompt=True, tokenize=False)
    request_data = {
        "model": "glm-4.6",
        "prompt": prompt,
        "max_tokens": 2048,
        "temperature": 1.0,
        "stop": ["<|user|>", "<|endoftext|>", "<|observation|>", "<|assistant|>"],
        "stream": False,
    }
    response = requests.post(
        f"{api_base}/completions",
        headers={"Content-Type": "application/json"},
        json=request_data,
        timeout=360,
    )
    response_json = response.json()
    text = response_json["choices"][0]["text"]
    assistant_message = parse_model_response(text, tools)
    messages.append(assistant_message)
    tool_calls = assistant_message.get("tool_calls", [])
    if not tool_calls:
        break
    for tool_call in tool_calls:
        tool_name = tool_call["name"]
        tool_arguments = tool_call["arguments"]
        tool_output = tool_map[tool_name](**tool_arguments)
        messages.append({
            "role": "tool",
            "name": tool_name,
            "content": tool_output,
            "tool_call_id": tool_call["tool_call_id"],
        })

print(assistant_message)
```
--------------------------------
### Tool-Integrated Reasoning with Built-in Tools
Source: https://context7.com/zai-org/glm-4.5/llms.txt
Utilize GLM's built-in tools for Python code execution, web search, and browser operations. Define these tools in a similar format to custom tools, specifying their names, descriptions, and parameters.
```python
import requests
import json

api_base = "http://127.0.0.1:8000/v1"

# Define built-in tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "python",
            "description": "Interpreter for executing python code",
            "parameters": {
                "type": "object",
                "properties": {"code": {"description": "Code to execute", "type": "string"}},
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "browser.search",
            "description": "Search in browser",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"description": "Search query", "type": "string"},
                    "num": {"description": "Number of results to return", "type": "integer", "default": 10},
                },
                "required": ["query"],
            },
        },
    },
]
```
--------------------------------
### Configure GLM Models with Claude Code Router
Source: https://context7.com/zai-org/glm-4.5/llms.txt
This JSON configuration sets up GLM models to work with the claude-code-router. It defines the provider details and routing rules for different task types.
```json
{
  "LOG": true,
  "Providers": [
    {
      "name": "glm-4.7-sglang",
      "api_base_url": "http://your-server-ip:8000/v1/chat/completions",
      "api_key": "EMPTY",
      "models": [
        "glm-4.7"
      ]
    }
  ],
  "Router": {
    "default": "glm-4.7-sglang,glm-4.7",
    "background": "glm-4.7-sglang,glm-4.7",
    "think": "glm-4.7-sglang,glm-4.7",
    "longContext": "glm-4.7-sglang,glm-4.7",
    "webSearch": "glm-4.7-sglang,glm-4.7"
  }
}
```
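Every route in the configuration above uses the same `provider,model` pair, so the file can be generated instead of edited by hand. The sketch below reproduces that structure. The function name `build_router_config` is hypothetical, and the file is written to a temporary directory here rather than to `~/.claude-code-router/config.json`.

```python
import json
import tempfile
from pathlib import Path

ROUTES = ("default", "background", "think", "longContext", "webSearch")


def build_router_config(provider: str, base_url: str, model: str) -> dict:
    """Return a router config that sends every route to one provider and model."""
    route = f"{provider},{model}"
    return {
        "LOG": True,
        "Providers": [{
            "name": provider,
            "api_base_url": f"{base_url}/v1/chat/completions",
            "api_key": "EMPTY",
            "models": [model],
        }],
        "Router": {task: route for task in ROUTES},
    }


config = build_router_config("glm-4.7-sglang", "http://your-server-ip:8000", "glm-4.7")
# Temporary location for the demo; copy the result to the real config path when ready
target = Path(tempfile.mkdtemp()) / "config.json"
target.write_text(json.dumps(config, indent=2), encoding="utf-8")
print(target.read_text(encoding="utf-8"))
```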
--------------------------------
### Chat Completion with Thinking Disabled
Source: https://context7.com/zai-org/glm-4.5/llms.txt
Disable thinking mode for faster responses on simple tasks. Set `enable_thinking` to `False` within `chat_template_kwargs` in the `extra_body` parameter. This is useful for quick, direct answers.
```python
completion_fast = client.chat.completions.create(
    model="glm-4.7-fp8",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    max_tokens=100,
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(completion_fast.choices[0].message.content)
```
--------------------------------
### Restart Claude Code Router
Source: https://github.com/zai-org/glm-4.5/blob/main/example/claude_code/README.md
Output displayed after restarting the router service.
```text
Service was not running or failed to stop.
Starting claude code router service...
✅ Service started successfully in the background.
```
--------------------------------
### Execute Python Code Safely
Source: https://context7.com/zai-org/glm-4.5/llms.txt
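This function runs Python code with `exec` in a fresh globals dictionary, which is useful for running code generated by the model. A fresh namespace is not a sandbox: the code can still use builtins, import modules, and reach the file system, so run only trusted code or add real isolation. The code should assign its output to a `result` variable; otherwise a generic success message is returned.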
```python
def python_tool(code: str) -> str:
    # Run the code with exec() in a fresh globals dict (this is not a sandbox)
    try:
        exec_globals = {}
        exec(code, exec_globals)
        # The executed code reports its output through a `result` variable
        return str(exec_globals.get('result', 'Code executed successfully'))
    except Exception as e:
        return f"Error: {str(e)}"
```
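Running the code in a separate interpreter process keeps a crash or an endless loop from taking down the caller. The sketch below is an alternative to the `exec` version above, not the source's method. The name `python_tool_subprocess` and the timeout default are assumptions, and a child process with a timeout is still not a security sandbox.

```python
import subprocess
import sys


def python_tool_subprocess(code: str, timeout: float = 10.0) -> str:
    """Run code in a separate interpreter; return its stdout or an error string."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"Error: timed out after {timeout} seconds"
    if completed.returncode != 0:
        return f"Error: {completed.stderr.strip()}"
    return completed.stdout.strip()


print(python_tool_subprocess("print(sum(range(5)))"))
```

Unlike the `exec` version, this one reads the output from standard output, so the executed code has to `print` its answer instead of assigning `result`.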
--------------------------------
### Parse Model Response with Reasoning and Tool Calls
Source: https://context7.com/zai-org/glm-4.5/llms.txt
This Python function parses the raw text of a GLM response and extracts the reasoning, the regular content, and structured tool calls. It uses string operations and regular expressions to pick out content enclosed in tags such as `<think>` and `<tool_call>`.
```python
import re
import json
import uuid


def parse_model_response(response: str, defined_tools: list):
    """Parse GLM response to extract reasoning and tool calls."""
    text = response.strip()
    reasoning_content = None
    content = None
    tool_calls = []

    # Extract reasoning content wrapped in <think>...</think>
    if text.startswith('<think>'):
        if '</think>' in text:
            reasoning_content, text = text.rsplit('</think>', 1)
            reasoning_content = reasoning_content.removeprefix('<think>').strip()
            text = text.strip()
        else:
            reasoning_content = text.removeprefix('<think>').strip()
            text = ""

    # Extract regular content before tool calls
    if '<tool_call>' in text:
        index = text.find('<tool_call>')
        content = text[:index].strip()
        text = text[index:].strip()
    else:
        content = text.strip()
        text = ""

    # Parse tool calls wrapped in <tool_call>...</tool_call>
    tool_call_strs = re.findall(r'<tool_call>(.*?)</tool_call>', text, re.DOTALL)
    for call in tool_call_strs:
        # The function name is the first line inside the tool_call block
        func_name_match = re.match(r'([^\n<]+)', call.strip())
        func_name = func_name_match.group(1).strip() if func_name_match else None
        if func_name:
            pairs = re.findall(r'<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>', call, re.DOTALL)
            arguments = {}
            for arg_key, arg_value in pairs:
                arg_key = arg_key.strip()
                arg_value = arg_value.strip()
                try:
                    arguments[arg_key] = json.loads(arg_value)
                except json.JSONDecodeError:
                    # Not valid JSON: keep the raw string (for example, source code)
                    arguments[arg_key] = arg_value
            tool_calls.append({
                'tool_call_id': "tool-call-" + str(uuid.uuid4()),
                'name': func_name,
                'arguments': arguments
            })

    message = {'role': 'assistant'}
    if reasoning_content:
        message['reasoning_content'] = reasoning_content
    if content:
        message['content'] = content
    if tool_calls:
        message['tool_calls'] = tool_calls
    return message
```
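The parser returns tool calls as flat dictionaries with `tool_call_id`, `name`, and `arguments` keys, whereas OpenAI-compatible chat endpoints expect a nested `function` object with JSON-encoded arguments. If the parsed message needs to be passed to such an endpoint, a conversion along these lines could be used. The helper name `to_openai_tool_calls` is hypothetical.

```python
import json


def to_openai_tool_calls(parsed_calls: list) -> list:
    """Convert the parser's flat tool calls to the OpenAI chat format."""
    return [
        {
            "id": call["tool_call_id"],
            "type": "function",
            "function": {
                "name": call["name"],
                # The chat format carries arguments as a JSON string, not a dict
                "arguments": json.dumps(call["arguments"]),
            },
        }
        for call in parsed_calls
    ]


parsed = [{"tool_call_id": "tool-call-1", "name": "python", "arguments": {"code": "result = 1 + 1"}}]
print(to_openai_tool_calls(parsed))
```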