Blueprint Chapter - CllamaServer (llama.cpp server)
Overview
CllamaServer is implemented based on the Server mode of llama.cpp. It allows you to locally spin up a server compatible with the OpenAI API, offering support for multiple features:
- Local Inference Service: Launch an AI inference server locally without relying on external APIs.
- OpenAI API Compatible: Supports OpenAI-compatible API formats for easier migration and integration
- Multi-session Support: Supports multiple concurrent requests
- Tool Calling: Supports function calling capabilities
- Speech-to-Text: Supports Speech-to-Text functionality
- Visual Management: Built-in Server management interface in the editor
Differences from Cllama: * Cllama: Directly loads the model within the process for inference, capable of handling only one request at a time. * CllamaServer: Launches a standalone HTTP server capable of handling multiple concurrent requests, with API compatibility in OpenAI format.
Preparation work
Since it's running locally, you'll need to prepare the offline model files first, such as downloading Qwen1.5-1.8B-Chat-Q8_0.gguf
Place the model in a certain folder, for example, under the game project directory Content/LLAMA.
Create CllamaServer
Creating a Server Using Blueprints
Right-click in the blueprint to create the node Create Cllama Server In World
Configure Server Parameters
Create the Cllama Server Param node and configure essential parameters:
- Model: Path to the model file (required)
- Port: Server port (0 indicates automatic allocation)
- Host: Listening address, defaults to
127.0.0.1 - NGpuLayers: Number of GPU layers (-1 means to use all GPUs)
Bind callback events
Binding callback events for Server:
- On Started: Triggered when the server starts successfully
- On Stopped: Triggered when the server stops
- On Failed: Triggered when Server startup fails
Complete Creation Blueprint
The complete Server creation blueprint is as follows:
After running the blueprint, the Server will trigger the On Started event upon successful startup.
Detailed Explanation of Server Parameters
The FAIChatPlus_CllamaServerParam struct contains the following parameters:
Common Parameters
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| Model | FString | - | Model file path (required) |
| Port | int32 | 0 | Listening port, 0 indicates automatic allocation |
| Host | FString | 127.0.0.1 | Listening Address |
| NGpuLayers | int32 | -1 | Number of GPU layers, -1 means all |
| bUseJinja | bool | false | Use Jinja template |
| MMProj | FString | - | Multimodal projection file path |
| Temperature | float | 0.8 | Sampling temperature |
Reasoning Argument
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| CtxSize | int32 | 4096 | Context Size |
| NPredict | int32 | -1 | Number of tokens to predict, -1 denotes unlimited |
| Threads | int32 | -1 | CPU thread count, -1 denotes automatic |
| BatchSize | int32 | 2048 | Batch size |
Sampling Parameters
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| TopK | int32 | 40 | Top-K sampling |
| TopP | float | 0.9 | Top-P Sampling |
| MinP | float | 0.1 | Min-P sampling |
| RepeatPenalty | float | 1.0 | Repetition penalty |
Server Parameters
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| ApiKey | FString | - | API Key (optional) |
| Timeout | int32 | 600 | Timeout duration (seconds) |
| Parallel | int32 | 1 | Number of parallel sequences |
| bNoWebUI | bool | false | Disable Web UI |
| bVerbose | bool | false | Verbose Logging |
Chat using CllamaServer
Create Chat Request
Once the server starts successfully, you can use the Send CllamaServer Chat Request node to send chat requests.
Configure Chat Options
Create a CllamaServer Chat Request Options node and set the BaseUrl to the Server address.
You can obtain Server information through the Get Server Info By ID node.
Create Messages
Create Messages array, add System Message and User Message.
Bind callback handlers to process responses
Bind the On Message or On Message Finished events to receive model responses.
Complete Chat Blueprint
The complete chat blueprint is as follows:
Execution Result
Run the blueprint, and you'll see the message returned by the model displayed on the screen.
Server Management
Get Server Information
Use the Get Server Info node to fetch detailed information about the Server.
Server Info includes the following information: * ServerID: Unique server identifier * Host: Listening Address * Port: Listening Port * Address: Complete address (host:port) * HttpAddress: HTTP Address (http://host:port) * bIsRunning: Whether it is running * Param: Server parameter
Stop Server
Use the Stop Server By ID node to stop the current Server.
Static Management Functions
AIChatPlus provides a set of static functions for managing all Servers.
| Function | Description |
|---|---|
Is Server Valid (Static) |
Check if the Server is Valid |
Is Server Running (Static) |
Check if the server is running |
Stop Server By ID |
Stop the specified Server by ID |
Stop All Servers |
Stop All Servers |
Get Server Info By ID |
Get Server Info By ID |
Get All Server IDs |
Retrieve All Server IDs |
Get Server By ID |
Retrieve Server instance by ID |
Multimodal Support
CllamaServer supports multimodal models (such as Moondream, Qwen2-VL, etc.).
Configure Multimodal Parameters
Set MMProj (multimodal projection file path) in the Server parameters:
Send image message
Add images in Messages:
Execution Results
Tool Calling
CllamaServer supports Tool Calling (function calling) functionality, with usage similar to OpenAI.
For detailed usage, please refer to Tool CallDocument.
When using CllamaServer for Tool Call, the following are required:
1. Set bUseJinja = true in the Server parameters
2. Define tools in the Tools field of Chat Options
Editor Server Management
AIChatPlus offers a visual CllamaServer management interface within the editor tool, simplifying the creation, monitoring, and administration of multiple Servers.
Open the editor tool by navigating to: Tools -> AIChatPlus -> AIChat, then open the Cllama Server Manager tab.
In the editor, you can: * Create a new Server * Check the status of the running Server * Stop the specified Server * Configure Server Parameters * Server configurations are automatically saved.
Relationship to Other APIs
Since CllamaServer is compatible with the OpenAI API format, you can also use OpenAI’s Chat Request node to communicate with CllamaServer—just set the BaseUrl to the address of CllamaServer.
Original: https://wiki.disenone.site/en
This post is protected by CC BY-NC-SA 4.0 agreement, should be reproduced with attribution.
Visitors. Total Visits. Page Visits.
This post was translated using ChatGPT; please provide feedback at FeedbackPoint out any omissions therein.

















