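Setting Up the Development Environment: A Hands-On Guide

The previous articles covered a lot of concepts: what large language models are, how they work, which ones are available, and where they fall short. This one finally gets hands-on: we set up a development environment and write code that calls a large-model API.

By the end of this article, you will:

Understand the characteristics of the mainstream development platforms and how to choose among them
Understand the request and response format of the OpenAI protocol
Hold a conversation with a large model from Java code
Know both the non-streaming and the streaming way of calling the API
Implement the complete logic of a multi-turn conversation
Deploy and run a model locally with Ollama
Know the common pitfalls and how to avoid them

Let's start with choosing a platform.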

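In-Depth Comparison of Mainstream Development Platforms

There are two main ways to use a large model: calling a cloud API, or deploying the model locally.

Cloud API Platforms in Detail

SiliconFlow (硅基流动)

SiliconFlow is one of the leading large-model API aggregation platforms in China. It hosts models from multiple vendors behind a single, unified interface.

URL: https://www.siliconflow.cn/

Key advantages:

Wide model selection: supports mainstream models such as Qwen, DeepSeek, Llama, and Mistral
Unified protocol: everything is OpenAI-compatible, so switching models only means changing the model name
Low price: overall pricing sits at the low end of the industry
Hosted in China: the servers are in mainland China, so access is stable and needs no VPN
New-user perks: registration comes with free credit, enough for learning and experiments

Good for: learning and experiments, quick prototypes, small and mid-sized projects, and scenarios that need several different models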

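Alibaba Cloud Bailian (阿里云百炼)

Alibaba Cloud's official large-model service platform, and the first-party provider of the Qwen model family.

URL: https://www.aliyun.com/product/qwen

Key advantages:

Qwen first: new Qwen versions go live here first
Enterprise-grade support: SLA guarantees, technical support, compliance certifications
Alibaba ecosystem integration: works seamlessly with other Alibaba Cloud services (OSS, ECS, databases)
Security and compliance: suits enterprises with data-security requirements

Good for: enterprise applications, teams that need the newest Qwen versions, and workloads with SLA requirements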

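DeepSeek Official API

DeepSeek's own API service, offering the full DeepSeek model lineup.

URL: https://www.deepseek.com/

Key advantages:

Low price: you connect directly to the vendor, so there is no reseller markup
Full lineup: V3, R1, and Coder
Performance: requests go straight to DeepSeek's own server clusters

Good for: scenarios that only use DeepSeek models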

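OpenAI

A leading global large-model provider and the company behind the GPT model family.

URL: https://openai.com/

Key advantages:

Strong all-round capability: OpenAI's flagship GPT models are widely regarded as among the strongest general-purpose models
Most mature ecosystem: frameworks and tools tend to support it best
New features first: features such as Function Calling and Vision usually appear here first

Main limitations:

Not directly reachable from mainland China; you need a VPN or a proxy
Registration requires an overseas phone number, and payment requires an overseas credit card
Relatively expensive

Good for: scenarios that need GPT models and can solve the network access problem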

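Comparing Local Deployment Tools

Ollama

The simplest local deployment option, and the one recommended here.

URL: https://ollama.com/

Key advantages:

Minimal setup: one command to install, one command to run a model
Cross-platform: supports Mac, Linux, and Windows
OpenAI-compatible: existing client code needs almost no changes
Rich model library: most mainstream open-source models are available
Automatic resource management: loads and unloads models and manages GPU memory for you

Good for: learning and experiments, offline use, sensitive data that must not leave the machine, and quick prototypes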

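Decision Tree for Choosing an Option

How do you want to use a large model?
    ↓
├─ Call a cloud API
│   ↓
│   Can you reach overseas services?
│   ├─ Yes → Do you want the strongest capability?
│   │       ├─ Yes → OpenAI GPT-5
│   │       └─ No → DeepSeek official API
│   └─ No →
│       ├─ Only one model family → that vendor's official platform
│       └─ Want to switch freely → SiliconFlow
│
└─ Deploy locally
    ↓
    Do you have a GPU?
    ├─ Yes → Ollama (simple) or vLLM (high performance)
    └─ No → Ollama (falls back to CPU automatically) or llama.cpp

Platform Recommendations

This article's choice: SiliconFlow for demonstrating cloud API calls, and Ollama for demonstrating local deployment.

While learning, start with SiliconFlow (reachable from China, free credit, OpenAI-compatible); for running models locally, start with Ollama (one-command install).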

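The OpenAI Protocol in Detail

The OpenAI protocol is the industry standard

Before writing code, you need a solid understanding of OpenAI's Chat Completions API protocol. It is the de facto industry standard, and almost every platform is compatible with it.

Request Format Explained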

{
  "model": "Qwen/Qwen3.5-plus",
  "messages": [
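    {"role": "system", "content": "You are a professional technical consultant"},
    {"role": "user", "content": "What is a microservices architecture?"},
    {"role": "assistant", "content": "A microservices architecture splits an application into..."},
    {"role": "user", "content": "How does it differ from a monolithic architecture?"}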
  ],
  "temperature": 0.7,
  "top_p": 1.0,
  "max_tokens": 1000,
  "stream": false,
  "stop": ["\n\n"],
  "presence_penalty": 0,
  "frequency_penalty": 0
}

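Field reference:

Field	Type	Required	Description
model	string	yes	Model ID; the format differs between platforms
messages	array	yes	Array of conversation messages
temperature	float	no	Randomness, 0 to 2, default 1
top_p	float	no	Nucleus sampling, 0 to 1, default 1
max_tokens	int	no	Maximum number of output tokens
stream	bool	no	Whether to stream the response, default false
stop	array	no	List of strings that stop generation
presence_penalty	float	no	Presence penalty, -2 to 2
frequency_penalty	float	no	Frequency penalty, -2 to 2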
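The ranges in the table can be checked on the client before a request is sent, which gives a clearer error than a rejected request. The following is a minimal sketch under that assumption; `ChatParamValidator` and its method names are hypothetical, and individual platforms may accept narrower or wider ranges than the ones listed above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks sampling parameters against the ranges in the table above. */
final class ChatParamValidator {
    private ChatParamValidator() {}

    /** Returns a list of human-readable problems; an empty list means all values are in range. */
    static List<String> validate(double temperature, double topP, int maxTokens,
                                 double presencePenalty, double frequencyPenalty) {
        List<String> problems = new ArrayList<>();
        checkRange(problems, "temperature", temperature, 0.0, 2.0);
        checkRange(problems, "top_p", topP, 0.0, 1.0);
        checkRange(problems, "presence_penalty", presencePenalty, -2.0, 2.0);
        checkRange(problems, "frequency_penalty", frequencyPenalty, -2.0, 2.0);
        if (maxTokens <= 0) {
            problems.add("max_tokens must be positive, got " + maxTokens);
        }
        return problems;
    }

    private static void checkRange(List<String> problems, String name,
                                   double value, double min, double max) {
        // NaN fails every comparison, so it is rejected explicitly.
        if (Double.isNaN(value) || value < min || value > max) {
            problems.add(name + " must be in [" + min + ", " + max + "], got " + value);
        }
    }
}
```

Calling `ChatParamValidator.validate(0.7, 1.0, 1000, 0, 0)` with the values from the request example returns an empty list.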

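Role types in the messages array:

role	Meaning	Purpose
system	System message	Defines the model's role, behavior, and constraints
user	User message	The user's input or question
assistant	Assistant message	The model's earlier replies (used in multi-turn conversations)

Response Format Explained

Non-streaming response: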

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1699000000,
  "model": "Qwen/Qwen3.5-plus",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The main difference between a microservices and a monolithic architecture is..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 156,
    "completion_tokens": 423,
    "total_tokens": 579
  }
}

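Key fields:

choices[0].message.content: the model's answer
choices[0].finish_reason: why generation ended
- stop: finished normally
- length: truncated because the max_tokens limit was reached
- content_filter: stopped by content filtering
usage: token usage statistics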
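Client code usually branches on finish_reason, for example to warn the user when an answer was cut off. The sketch below shows one way to do that with the three values listed above; `FinishReasons` is a hypothetical name, and platforms may return additional values, which is why the default branch passes unknown values through.

```java
/** Hypothetical helper that interprets the finish_reason values listed above. */
final class FinishReasons {
    private FinishReasons() {}

    /** True when the answer was cut off by the max_tokens limit. */
    static boolean isTruncated(String finishReason) {
        return "length".equals(finishReason);
    }

    /** Maps a finish_reason to a short description; null means it has not been reported yet. */
    static String describe(String finishReason) {
        if (finishReason == null) {
            return "not reported";
        }
        switch (finishReason) {
            case "stop":
                return "completed normally";
            case "length":
                return "truncated by max_tokens";
            case "content_filter":
                return "stopped by content filtering";
            default:
                // Platforms may define extra values; surface them instead of failing.
                return "other: " + finishReason;
        }
    }
}
```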

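Streaming response:

About the streaming response format

A streaming response uses the SSE (Server-Sent Events) format. Each chunk arrives as a single data: line, and chunks are separated by blank lines: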

data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"微"},"index":0}]}

data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"服务"},"index":0}]}

data: {"id":"chatcmpl-abc123","choices":[{"delta":{"content":"架构"},"index":0}]}

data: [DONE]

Note:

Every data line starts with data:
The text is in choices[0].delta.content (not in message)
The last line is data: [DONE]
Blank lines may appear between chunks
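The four rules above translate into a small line filter. The following sketch uses only the JDK and extracts the JSON payload of each chunk; `SseLines` is a hypothetical name, and parsing each payload (to read choices[0].delta.content) is left to a JSON library such as Gson.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: extracts the JSON payloads from the lines of an SSE stream. */
final class SseLines {
    private static final String PREFIX = "data:";
    private static final String DONE = "[DONE]";

    private SseLines() {}

    /** Returns the payload of every data line up to, but not including, the [DONE] marker. */
    static List<String> payloads(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (!line.startsWith(PREFIX)) {
                continue; // blank separator lines and any non-data lines are skipped
            }
            String data = line.substring(PREFIX.length()).trim();
            if (DONE.equals(data)) {
                break; // end-of-stream marker: ignore anything after it
            }
            out.add(data);
        }
        return out;
    }
}
```

When reading from a network stream, the same loop can run line by line over a BufferedReader instead of a list, so that each payload is handled as soon as it arrives.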

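Calling the SiliconFlow API in Practice

Example project

Project URL: https://github.com/java-up-up/super-agent
Module: ai-example-one

Preparation

1. Register and get an API Key

See the referenced tutorial.

2. Maven dependency configuration

Create a Maven project and add the following dependencies: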

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>org.javaup</groupId>
    <artifactId>ai-example</artifactId>
    <version>${revision}</version>
  </parent>

  <artifactId>ai-example-one</artifactId>

  <name>ai-example-one</name>
  <description>ai示例1</description>

  <dependencies>
    <!-- macOS DNS 原生解析库（解决 Netty 在 macOS 上的 DNS 解析警告） -->
    <dependency>
      <groupId>io.netty</groupId>
      <artifactId>netty-resolver-dns-native-macos</artifactId>
      <classifier>osx-aarch_64</classifier>
    </dependency>
    <!-- OkHttp：HTTP 客户端 -->
    <dependency>
      <groupId>com.squareup.okhttp3</groupId>
      <artifactId>okhttp</artifactId>
      <version>4.12.0</version>
    </dependency>

    <!-- OkHttp SSE 支持（流式调用需要）-->
    <dependency>
      <groupId>com.squareup.okhttp3</groupId>
      <artifactId>okhttp-sse</artifactId>
      <version>4.12.0</version>
    </dependency>

    <!-- Gson：JSON 处理 -->
    <dependency>
      <groupId>com.google.code.gson</groupId>
      <artifactId>gson</artifactId>
      <version>2.10.1</version>
    </dependency>

    <!-- SLF4J + Logback：日志 -->
    <dependency>
      <groupId>ch.qos.logback</groupId>
      <artifactId>logback-classic</artifactId>
      <version>1.4.14</version>
    </dependency>
  </dependencies>
</project>

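Complete Non-Streaming Implementation

A non-streaming call is the most intuitive: send the request → wait → receive the complete result.

Example code: org.javaup.example.LLMClient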

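package org.javaup.example;

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import okhttp3.MediaType;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeUnit;

/**
 * Non-streaming large-model API call example.
 * Shows how to call the SiliconFlow API to hold a conversation.
 */
public class LLMClient {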

    private static final MediaType JSON = MediaType.get("application/json; charset=utf-8");

    /**
     * API 配置
     * */
    private static final String API_URL = "https://api.siliconflow.cn/v1/chat/completions";
    /**
     * 替换成你的 API Key
     * */
    private static final String API_KEY = "设置成你的apiKey";

    private static final String MODEL = "Qwen/Qwen3.5-122B-A10B";
    /**
     * 硅基流动文档中该模型支持 thinking 模式，默认开启时非流式调用可能等待较久。
     * 这里示例默认关闭，以减少直接运行 demo 时的超时概率。
     */
    private static final boolean ENABLE_THINKING = false;
    private static final int MAX_TOKENS = 512;
    private static final int CONNECT_TIMEOUT_SECONDS = 30;
    private static final int READ_TIMEOUT_SECONDS = 180;
    private static final int WRITE_TIMEOUT_SECONDS = 30;
    private static final int CALL_TIMEOUT_SECONDS = 180;

    /**
     * HTTP 客户端（复用以提高性能）
     * */
    private final OkHttpClient httpClient;
    private final Gson gson;

    public LLMClient() {
        // 配置 HTTP 客户端
        this.httpClient = new OkHttpClient.Builder()
            .connectTimeout(CONNECT_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .readTimeout(READ_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .writeTimeout(WRITE_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .callTimeout(CALL_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .retryOnConnectionFailure(true)
            .build();

        this.gson = new GsonBuilder()
            .setPrettyPrinting()
            .create();
    }

    /**
     * 发送对话请求
     * 
     * @param systemPrompt 系统提示词，定义模型角色
     * @param userMessage  用户消息
     * @return 模型的回复
     */
    public String chat(String systemPrompt, String userMessage) throws IOException {
        // 1. 构建请求体
        JsonObject requestBody = new JsonObject();
        requestBody.addProperty("model", MODEL);
        requestBody.addProperty("temperature", 0.7);
        requestBody.addProperty("max_tokens", MAX_TOKENS);
        requestBody.addProperty("stream", false);
        requestBody.addProperty("enable_thinking", ENABLE_THINKING);

        // 构建 messages 数组
        JsonArray messages = new JsonArray();

        // 添加 system 消息
        if (systemPrompt != null && !systemPrompt.isEmpty()) {
            JsonObject systemMsg = new JsonObject();
            systemMsg.addProperty("role", "system");
            systemMsg.addProperty("content", systemPrompt);
            messages.add(systemMsg);
        }

        // 添加 user 消息
        JsonObject userMsg = new JsonObject();
        userMsg.addProperty("role", "user");
        userMsg.addProperty("content", userMessage);
        messages.add(userMsg);

        requestBody.add("messages", messages);

        // 2. 构建 HTTP 请求
        Request request = new Request.Builder()
            .url(API_URL)
            .addHeader("Authorization", "Bearer " + API_KEY)
            .addHeader("Content-Type", "application/json")
            .post(RequestBody.create(
                requestBody.toString(),
                JSON
            ))
            .build();

        // 3. 发送请求并处理响应
        long requestStart = System.nanoTime();
        try (Response response = httpClient.newCall(request).execute()) {
            String traceId = response.header("x-siliconcloud-trace-id");
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - requestStart);

            // 检查响应状态
            if (!response.isSuccessful()) {
                String errorBody = response.body() != null ? response.body().string() : "No error body";
                throw new IOException(String.format(
                    "API 请求失败: %d, traceId=%s, body=%s",
                    response.code(),
                    traceId != null ? traceId : "N/A",
                    errorBody
                ));
            }

            // 解析响应 JSON
            if (response.body() == null) {
                throw new IOException("API 响应体为空");
            }
            String responseBody = response.body().string();
            JsonObject json = gson.fromJson(responseBody, JsonObject.class);

            // 提取回答内容
            JsonArray choices = json.getAsJsonArray("choices");
            if (choices == null || choices.size() == 0) {
                throw new IOException("响应中没有 choices");
            }

            if (traceId != null) {
                System.out.printf("[请求耗时] %d ms, traceId=%s%n", elapsedMs, traceId);
            } else {
                System.out.printf("[请求耗时] %d ms%n", elapsedMs);
            }

            String answer = choices.get(0).getAsJsonObject()
                .getAsJsonObject("message")
                .get("content").getAsString();

            // 打印 Token 使用情况
            if (json.has("usage")) {
                JsonObject usage = json.getAsJsonObject("usage");
                System.out.printf("[Token 使用] 输入: %d, 输出: %d, 总计: %d%n",
                    usage.get("prompt_tokens").getAsInt(),
                    usage.get("completion_tokens").getAsInt(),
                    usage.get("total_tokens").getAsInt());
            }

            if (choices.get(0).getAsJsonObject().has("finish_reason")) {
                String finishReason = choices.get(0).getAsJsonObject()
                    .get("finish_reason").getAsString();
                if ("length".equals(finishReason)) {
                    System.out.printf("[提示] 输出触发 max_tokens=%d 上限，如需更完整回答可调大该值，同时建议同步调大 readTimeout/callTimeout。%n",
                        MAX_TOKENS);
                }
            }

            return answer;
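        } catch (java.io.InterruptedIOException e) {
            // OkHttp reports a callTimeout as InterruptedIOException; SocketTimeoutException
            // (connect/read timeout) is a subclass, so this catch covers both cases.
            java.io.InterruptedIOException wrapped = new java.io.InterruptedIOException(String.format(
                "Call to SiliconFlow timed out (call timeout: %d s). This example disables enable_thinking by default; if you turn it back on or switch to a slower reasoning model, consider stream=true or a larger readTimeout/callTimeout. Original error: %s",
                CALL_TIMEOUT_SECONDS,
                e.getMessage()
            ));
            wrapped.initCause(e);
            throw wrapped;
        }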
    }

    public static void main(String[] args) {
        LLMClient client = new LLMClient();

        String systemPrompt = """
            你是一个专业的 Java 技术顾问，擅长解答 Java 相关的技术问题。
            回答要准确、简洁，如果涉及代码，请给出示例。
            如果不确定答案，请明确说明。
            """;

        String userMessage = "请解释一下 Java 中的 volatile 关键字有什么作用？";

        try {
            System.out.println("发送问题: " + userMessage);
            System.out.printf("等待回复... [model=%s, enableThinking=%s]%n%n", MODEL, ENABLE_THINKING);

            String answer = client.chat(systemPrompt, userMessage);

            System.out.println("=== 模型回答 ===");
            System.out.println(answer);
        } catch (IOException e) {
            System.err.println("调用失败: " + e.getMessage());
        }
    }
}

Sample output:

发送问题: 请解释一下 Java 中的 volatile 关键字有什么作用？
等待回复... [model=Qwen/Qwen3.5-122B-A10B, enableThinking=false]

[请求耗时] 61716 ms, traceId=ti_gowxb2k8nuoapfql24
[Token 使用] 输入: 67, 输出: 512, 总计: 579
[提示] 输出触发 max_tokens=512 上限，如需更完整回答可调大该值，同时建议同步调大 readTimeout/callTimeout。
=== 模型回答 ===
`volatile` 是 Java 中用于**保证变量的可见性**和**禁止指令重排序**的关键字，但它**不保证原子性**。

### 核心作用

1. **保证可见性（Visibility）**
   - 当一个线程修改了 `volatile` 变量，新值会立即刷新到主内存中。
   - 其他线程读取该变量时，会直接从主内存中获取最新值，而不是使用 CPU 缓存中的旧值。
   - 解决了多线程环境下，线程工作内存（CPU缓存）与主内存数据不一致的问题。

2. **禁止指令重排序（Ordering）**
   - 编译器和处理器为了优化性能，可能会重排指令执行顺序。
   - `volatile` 通过插入**内存屏障（Memory Barrier）**，禁止重排序操作，确保代码按照程序规定的顺序执行。
   - 典型应用是**双重检查锁（DCL）单例模式**中的 `instance` 变量。

### 关键限制：不保证原子性

`volatile` 只能保证“读”和“写”操作本身的可见性，无法保证复合操作的原子性（例如 `i++`）。

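Complete Streaming Implementation

With a streaming call, users see the model's output in real time, which makes for a better experience.

Example code: org.javaup.example.LLMStreamClient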

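package org.javaup.example;

import com.google.gson.Gson;
import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParseException;
import okhttp3.Call;
import okhttp3.Callback;
import okhttp3.MediaType;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeUnit;

/**
 * Streaming large-model API call example.
 * Produces a ChatGPT-style typewriter effect.
 */
public class LLMStreamClient {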

    private static final MediaType JSON = MediaType.get("application/json; charset=utf-8");
    private static final String API_URL = "https://api.siliconflow.cn/v1/chat/completions";
    private static final String API_KEY = "替换成你的apiKey";
    private static final String MODEL = "Qwen/Qwen3.5-122B-A10B";
    /**
     * 该模型支持 thinking 模式；流式返回时 reasoning_content 可能出现，而 content 为 null。
     * 示例默认关闭 thinking，避免首次体验时既慢又需要额外解析思考流。
     */
    private static final boolean ENABLE_THINKING = false;
    private static final int MAX_TOKENS = 512;
    private static final int CONNECT_TIMEOUT_SECONDS = 30;
    private static final int READ_TIMEOUT_SECONDS = 300;
    private static final int WRITE_TIMEOUT_SECONDS = 30;
    private static final int CALL_TIMEOUT_SECONDS = 300;

    private final OkHttpClient httpClient;
    private final Gson gson;

    public LLMStreamClient() {
        this.httpClient = new OkHttpClient.Builder()
            .connectTimeout(CONNECT_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .readTimeout(READ_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .writeTimeout(WRITE_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .callTimeout(CALL_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .retryOnConnectionFailure(true)
            .build();

        this.gson = new Gson();
    }

    /**
     * 回调接口，用于处理流式数据
     */
    public interface StreamCallback {
        /**
         * 收到新的文本片段
         */
        void onContent(String content);

        /**
         * 流式传输完成
         */
        void onComplete(String fullContent);

        /**
         * 发生错误
         */
        void onError(Exception e);
    }

    /**
     * 流式对话
     */
    public void chatStream(String systemPrompt, String userMessage, StreamCallback callback) {
        // 1. 构建请求体
        JsonObject requestBody = new JsonObject();
        requestBody.addProperty("model", MODEL);
        requestBody.addProperty("temperature", 0.7);
        requestBody.addProperty("max_tokens", MAX_TOKENS);
        requestBody.addProperty("stream", true);
        requestBody.addProperty("enable_thinking", ENABLE_THINKING);

        JsonArray messages = new JsonArray();

        if (systemPrompt != null && !systemPrompt.isEmpty()) {
            JsonObject systemMsg = new JsonObject();
            systemMsg.addProperty("role", "system");
            systemMsg.addProperty("content", systemPrompt);
            messages.add(systemMsg);
        }

        JsonObject userMsg = new JsonObject();
        userMsg.addProperty("role", "user");
        userMsg.addProperty("content", userMessage);
        messages.add(userMsg);

        requestBody.add("messages", messages);

        // 2. 构建请求
        Request request = new Request.Builder()
            .url(API_URL)
            .addHeader("Authorization", "Bearer " + API_KEY)
            .addHeader("Content-Type", "application/json")
            .addHeader("Accept", "text/event-stream")
            .post(RequestBody.create(
                requestBody.toString(),
                JSON
            ))
            .build();

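        // 3. Execute the request asynchronously. The timer starts before enqueue so that
        //    the reported time includes the wait for the first byte, not just the body read.
        long requestStart = System.nanoTime();
        httpClient.newCall(request).enqueue(new Callback() {
            @Override
            public void onFailure(Call call, IOException e) {
                callback.onError(e);
            }

            @Override
            public void onResponse(Call call, Response response) {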
                try (response) {
                    String traceId = response.header("x-siliconcloud-trace-id");
                    if (!response.isSuccessful()) {
                        String errorBody = response.body() != null ? response.body().string() : "No error body";
                        callback.onError(new IOException(String.format(
                            "API 请求失败: %d, traceId=%s, body=%s",
                            response.code(),
                            traceId != null ? traceId : "N/A",
                            errorBody
                        )));
                        return;
                    }

                    if (response.body() == null) {
                        callback.onError(new IOException("API 响应体为空"));
                        return;
                    }

                    StringBuilder fullContent = new StringBuilder();
                    String finishReason = null;

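                    // Decode explicitly as UTF-8; the platform default charset can garble Chinese text.
                    try (BufferedReader reader = new BufferedReader(
                            new InputStreamReader(response.body().byteStream(),
                                java.nio.charset.StandardCharsets.UTF_8))) {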

                        String line;
                        while ((line = reader.readLine()) != null) {
                            if (line.isBlank() || !line.startsWith("data:")) {
                                continue;
                            }

                            String data = line.substring(5).trim();
                            if ("[DONE]".equals(data)) {
                                break;
                            }

                            try {
                                JsonObject chunk = gson.fromJson(data, JsonObject.class);
                                JsonArray choices = chunk.getAsJsonArray("choices");
                                if (choices == null || choices.size() == 0) {
                                    continue;
                                }

                                JsonObject choice = choices.get(0).getAsJsonObject();
                                JsonObject delta = getAsJsonObject(choice, "delta");
                                if (delta == null) {
                                    continue;
                                }

                                String content = getNullableString(delta, "content");
                                if (content != null && !content.isEmpty()) {
                                    fullContent.append(content);
                                    callback.onContent(content);
                                }

                                String chunkFinishReason = getNullableString(choice, "finish_reason");
                                if (chunkFinishReason != null) {
                                    finishReason = chunkFinishReason;
                                }
                            } catch (JsonParseException | IllegalStateException e) {
                                // SSE 中可能夹杂非 JSON 行，或字段结构与普通 completion 不同，跳过即可。
                            }
                        }

                        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - requestStart);
                        System.out.printf("%n[流式请求耗时] %d ms, traceId=%s%n",
                            elapsedMs,
                            traceId != null ? traceId : "N/A");
                        if ("length".equals(finishReason)) {
                            System.out.printf("[提示] 输出触发 max_tokens=%d 上限，如需更完整回答可调大该值。%n",
                                MAX_TOKENS);
                        }
                        callback.onComplete(fullContent.toString());
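                    } catch (java.io.InterruptedIOException e) {
                        // Covers both OkHttp's callTimeout (InterruptedIOException) and
                        // read timeouts (SocketTimeoutException, a subclass).
                        callback.onError(new java.io.InterruptedIOException(String.format(
                            "Streaming call timed out (call timeout: %d s). If you switch back to thinking mode or raise max_tokens, raise readTimeout/callTimeout as well. Original error: %s",
                            CALL_TIMEOUT_SECONDS,
                            e.getMessage()
                        )));
                    }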
                } catch (IOException e) {
                    callback.onError(e);
                } catch (RuntimeException e) {
                    callback.onError(new IOException("解析流式响应失败: " + e.getMessage(), e));
                }
            }
        });
    }

    private static JsonObject getAsJsonObject(JsonObject parent, String memberName) {
        if (parent == null || !parent.has(memberName)) {
            return null;
        }
        JsonElement element = parent.get(memberName);
        if (element == null || element.isJsonNull() || !element.isJsonObject()) {
            return null;
        }
        return element.getAsJsonObject();
    }

    private static String getNullableString(JsonObject parent, String memberName) {
        if (parent == null || !parent.has(memberName)) {
            return null;
        }
        JsonElement element = parent.get(memberName);
        if (element == null || element.isJsonNull()) {
            return null;
        }
        return element.getAsString();
    }

    public static void main(String[] args) throws InterruptedException {
        LLMStreamClient client = new LLMStreamClient();

        String systemPrompt = "你是一个编程助手，用简洁清晰的方式解答问题。";
        String userMessage = "用 Java 写一个单例模式的示例";

        System.out.println("发送问题: " + userMessage);
        System.out.printf("模型: %s, enableThinking=%s%n", MODEL, ENABLE_THINKING);
        System.out.println("\n=== 模型回答（流式）===\n");

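        // Used to wait for the asynchronous call to finish. A latch avoids the lost-notification
        // and spurious-wakeup problems of a bare wait/notify pair.
        java.util.concurrent.CountDownLatch done = new java.util.concurrent.CountDownLatch(1);

        client.chatStream(systemPrompt, userMessage, new LLMStreamClient.StreamCallback() {
            @Override
            public void onContent(String content) {
                // Print each fragment as it arrives (typewriter effect)
                System.out.print(content);
                System.out.flush();
            }

            @Override
            public void onComplete(String fullContent) {
                System.out.println("\n\n=== Answer complete ===");
                System.out.println("Total length: " + fullContent.length() + " characters");
                done.countDown();
            }

            @Override
            public void onError(Exception e) {
                System.err.println("\nError: " + e.getMessage());
                done.countDown();
            }
        });

        // Wait for the asynchronous call to finish, for at most 5 minutes
        if (!done.await(5, TimeUnit.MINUTES)) {
            System.err.println("\nTimed out waiting for the stream to finish");
        }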
    }
}

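What it looks like when run:

The text is "typed" out piece by piece instead of appearing as one big block after a long wait. This is the typewriter effect familiar from ChatGPT.

Implementing Multi-Turn Conversation

The core principle

A real chat application has to support multi-turn conversation. The API itself keeps no state between requests, so the key is to send the conversation history along with every request.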
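Because the history grows with every turn, a client normally sends only a recent window of it. The sketch below shows one way to assemble the messages for a request: the system prompt, then the most recent history, then the new user message. `HistoryWindow`, `Msg`, and `maxHistory` are hypothetical names, and skipping a leading assistant reply so the window starts on a user message is a design choice of this sketch rather than a requirement of the protocol.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that assembles the messages array for one turn of a conversation. */
final class HistoryWindow {
    /** Minimal role/content pair, mirroring one entry of the messages array. */
    static final class Msg {
        final String role;
        final String content;

        Msg(String role, String content) {
            this.role = role;
            this.content = content;
        }
    }

    private HistoryWindow() {}

    /** Builds: system prompt + at most maxHistory recent messages + the new user message. */
    static List<Msg> build(String systemPrompt, List<Msg> history, String userMessage, int maxHistory) {
        List<Msg> out = new ArrayList<>();
        out.add(new Msg("system", systemPrompt));

        int start = Math.max(0, history.size() - maxHistory);
        // Avoid opening the window with an assistant reply whose question was cut off.
        if (start < history.size() && "assistant".equals(history.get(start).role)) {
            start++;
        }
        out.addAll(history.subList(start, history.size()));

        out.add(new Msg("user", userMessage));
        return out;
    }
}
```

Counting messages is only a rough proxy for context length; a stricter variant would budget by token count instead.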

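Example code: org.javaup.example.ConversationClient

package org.javaup.example;

import com.google.gson.Gson;
import com.google.gson.JsonArray;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import okhttp3.MediaType;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;

import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.TimeUnit;

public class ConversationClient {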

    private static final MediaType JSON = MediaType.get("application/json; charset=utf-8");
    private static final String API_URL = "https://api.siliconflow.cn/v1/chat/completions";
    private static final String API_KEY = "设置你的apiKey";
    private static final String MODEL = "Qwen/Qwen3.5-122B-A10B";
    private static final boolean ENABLE_THINKING = false;
    private static final int MAX_TOKENS = 256;
    private static final int CONNECT_TIMEOUT_SECONDS = 30;
    private static final int READ_TIMEOUT_SECONDS = 180;
    private static final int CALL_TIMEOUT_SECONDS = 180;

    /**
     * 最大历史消息数（防止上下文过长）
     * */
    private static final int MAX_HISTORY_SIZE = 20;

    private final OkHttpClient httpClient;
    private final Gson gson;
    private final String systemPrompt;
    private final List<Message> conversationHistory;

    /**
     * 消息类
     */
    public static class Message {
        public String role;
        public String content;

        public Message(String role, String content) {
            this.role = role;
            this.content = content;
        }
    }

    public ConversationClient(String systemPrompt) {
        this.httpClient = new OkHttpClient.Builder()
            .connectTimeout(CONNECT_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .readTimeout(READ_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .callTimeout(CALL_TIMEOUT_SECONDS, TimeUnit.SECONDS)
            .retryOnConnectionFailure(true)
            .build();

        this.gson = new Gson();
        this.systemPrompt = systemPrompt;
        this.conversationHistory = new ArrayList<>();
    }

    /**
     * 发送消息并获取回复
     */
    public String send(String userMessage) throws IOException {
        List<Message> requestMessages = buildRequestMessages(userMessage);

        // 1. 构建请求
        JsonObject requestBody = new JsonObject();
        requestBody.addProperty("model", MODEL);
        requestBody.addProperty("temperature", 0.7);
        requestBody.addProperty("max_tokens", MAX_TOKENS);
        requestBody.addProperty("stream", false);
        requestBody.addProperty("enable_thinking", ENABLE_THINKING);

        // 构建 messages：system + 历史对话
        JsonArray messages = new JsonArray();

        // 添加 system 消息
        JsonObject systemMsg = new JsonObject();
        systemMsg.addProperty("role", "system");
        systemMsg.addProperty("content", systemPrompt);
        messages.add(systemMsg);

        // 添加历史对话（最近的 MAX_HISTORY_SIZE 条）
        for (Message msg : requestMessages) {
            JsonObject msgJson = new JsonObject();
            msgJson.addProperty("role", msg.role);
            msgJson.addProperty("content", msg.content);
            messages.add(msgJson);
        }

        requestBody.add("messages", messages);

        // 3. 发送请求
        Request request = new Request.Builder()
            .url(API_URL)
            .addHeader("Authorization", "Bearer " + API_KEY)
            .addHeader("Content-Type", "application/json")
            .post(RequestBody.create(
                requestBody.toString(),
                JSON
            ))
            .build();

        long requestStart = System.nanoTime();
        try (Response response = httpClient.newCall(request).execute()) {
            String traceId = response.header("x-siliconcloud-trace-id");
            if (!response.isSuccessful()) {
                String errorBody = response.body() != null ? response.body().string() : "No error body";
                throw new IOException(String.format(
                    "请求失败: %d, traceId=%s, body=%s",
                    response.code(),
                    traceId != null ? traceId : "N/A",
                    errorBody
                ));
            }

            if (response.body() == null) {
                throw new IOException("API 响应体为空");
            }

            String body = response.body().string();
            JsonObject json = gson.fromJson(body, JsonObject.class);

            JsonArray choices = json.getAsJsonArray("choices");
            if (choices == null || choices.size() == 0) {
                throw new IOException("响应中没有 choices");
            }

            String answer = json.getAsJsonArray("choices")
                .get(0).getAsJsonObject()
                .getAsJsonObject("message")
                .get("content").getAsString();

            // 4. 请求成功后再提交本轮历史，避免失败重试造成重复上下文
            conversationHistory.add(new Message("user", userMessage));
            conversationHistory.add(new Message("assistant", answer));

            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - requestStart);
            System.out.printf("[请求耗时] %d ms, traceId=%s%n",
                elapsedMs,
                traceId != null ? traceId : "N/A");

            // 打印 Token 使用情况
            if (json.has("usage")) {
                JsonObject usage = json.getAsJsonObject("usage");
                System.out.printf("[Token] 本轮: %d, 历史消息数: %d%n",
                    usage.get("total_tokens").getAsInt(),
                    conversationHistory.size());
            }

            String finishReason = getNullableString(choices.get(0).getAsJsonObject(), "finish_reason");
            if ("length".equals(finishReason)) {
                System.out.printf("[提示] 输出触发 max_tokens=%d 上限，如需更完整回答可调大该值，同时建议同步调大 readTimeout/callTimeout。%n",
                    MAX_TOKENS);
            }

            return answer;
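        } catch (java.io.InterruptedIOException e) {
            // OkHttp reports a callTimeout as InterruptedIOException; SocketTimeoutException
            // (connect/read timeout) is a subclass, so this catch covers both cases.
            java.io.InterruptedIOException wrapped = new java.io.InterruptedIOException(String.format(
                "Multi-turn call timed out (call timeout: %d s). This example disables enable_thinking by default; if you turn it back on or raise max_tokens, raise readTimeout/callTimeout as well, or switch to streaming output. Original error: %s",
                CALL_TIMEOUT_SECONDS,
                e.getMessage()
            ));
            wrapped.initCause(e);
            throw wrapped;
        }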
    }

    private List<Message> buildRequestMessages(String userMessage) {
        int startIndex = Math.max(0, conversationHistory.size() - MAX_HISTORY_SIZE);
        List<Message> requestMessages = new ArrayList<>(conversationHistory.subList(startIndex, conversationHistory.size()));
        requestMessages.add(new Message("user", userMessage));
        return requestMessages;
    }

    private static String getNullableString(JsonObject parent, String memberName) {
        if (parent == null || !parent.has(memberName)) {
            return null;
        }
        JsonElement element = parent.get(memberName);
        if (element == null || element.isJsonNull()) {
            return null;
        }
        return element.getAsString();
    }

    /**
     * 清空对话历史
     */
    public void clearHistory() {
        conversationHistory.clear();
        System.out.println("对话历史已清空");
    }

    /**
     * 获取对话历史
     */
    public List<Message> getHistory() {
        return new ArrayList<>(conversationHistory);
    }

    public static void main(String[] args) throws IOException {
        String systemPrompt = """
            你是一个友好的编程助手。
            记住用户之前说过的话，在对话中保持连贯性。
            回答要简洁明了。
            """;

        ConversationClient client = new ConversationClient(systemPrompt);
        Scanner scanner = new Scanner(System.in);

        System.out.println("=== 多轮对话演示 ===");
        System.out.println("输入 'quit' 退出，输入 'clear' 清空历史\n");

        while (true) {
            System.out.print("你: ");
            String input = scanner.nextLine().trim();

            if ("quit".equalsIgnoreCase(input)) {
                System.out.println("再见！");
                break;
            }

            if ("clear".equalsIgnoreCase(input)) {
                client.clearHistory();
                continue;
            }

            if (input.isEmpty()) {
                continue;
            }

            try {
                System.out.printf("助手: 正在思考... [model=%s, enableThinking=%s]%n",
                    MODEL,
                    ENABLE_THINKING);
                String answer = client.send(input);
                System.out.println("\n助手: " + answer + "\n");
            } catch (IOException e) {
                System.err.println("发生错误: " + e.getMessage());
            }
        }

        scanner.close();
    }
}

Sample conversation:

=== 多轮对话演示 ===
输入 'quit' 退出，输入 'clear' 清空历史

你: 你好，我叫小明
助手: 正在思考... [model=Qwen/Qwen3.5-122B-A10B, enableThinking=false]
[请求耗时] 2513 ms, traceId=ti_9e853c5mjbmufqudfl
[Token] 本轮: 64, 历史消息数: 2

助手: 你好，小明！很高兴认识你。有什么编程问题需要我帮忙吗？

你: 我想学习 Spring Boot
助手: 正在思考... [model=Qwen/Qwen3.5-122B-A10B, enableThinking=false]
[请求耗时] 13881 ms, traceId=ti_sxx4ah96dm296l9lai
[Token] 本轮: 194, 历史消息数: 4

助手: 太棒了，Spring Boot 是 Java 开发中最流行的框架之一，上手快且功能强大！

我们可以从以下几个步骤开始：
1. **基础准备**：确保你熟悉 Java 基础（如注解、接口）和 Maven/Gradle。
2. **第一个项目**：用 Spring Initializr 快速创建一个"Hello World"项目。
3. **核心概念**：了解自动配置、依赖注入和 RESTful API 开发。

你想先了解**如何创建第一个项目**，还是想直接聊聊**核心概念**？

你: 从基础开始吧
助手: 正在思考... [model=Qwen/Qwen3.5-122B-A10B, enableThinking=false]
[请求耗时] 23357 ms, traceId=ti_o2vojopycyiyp7bk8e
[Token] 本轮: 408, 历史消息数: 6

助手: 没问题，小明！我们一步步来。

首先，**创建第一个项目**是最快的上手方式：

1.  **访问官网工具**：打开 [Spring Initializr](https://start.spring.io)。
2.  **填写配置**：
    *   **Project**: Maven
    *   **Language**: Java
    *   **Spring Boot**: 选最新稳定版（如 3.x）
    *   **Dependencies**: 点击"Add dependencies"，搜索并添加 **Spring Web**（这是开发网页和 API 的核心）。
3.  **生成代码**：点击"Generate"下载压缩包，解压后用 IDE（如 IntelliJ IDEA 或 VS Code）打开。

**下一步**：
项目里通常会有一个 `DemoApplication.java` 文件和一个 `Controller.java`（或者你需要自己新建一个）。

你需要我演示一下**如何写第一个"Hello World"接口**，让你能在浏览器里看到结果吗？

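Ollama Local Deployment in Practice

Ollama is the simplest way to run large models locally.

Installing Ollama

Option 1: download from the official site (recommended)

Go to ollama.com/download and download the installer for your system:

Mac: download the .dmg file and drag it into the Applications folder
Windows: download the .exe installer
Linux: run the install script

Option 2: command-line install (Linux; on Mac, use the installer from Option 1)

curl -fsSL https://ollama.com/install.sh | sh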

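Option 3: Docker install

# Basic setup (CPU only); the named volume keeps downloaded models when the container is recreated
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# With GPU support (requires nvidia-container-toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Verify the installation:

ollama --version
# Prints something like: ollama version 0.1.xxx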

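Downloading and Running Models

Ollama's model library is at ollama.com/library.

Pull a model:

# Pull the Qwen 2.5 7B model (about 4.5GB)
ollama pull qwen2.5:7b

# Pull a smaller model (for machines with limited memory)
ollama pull qwen2.5:3b

# Pull a larger model (better results)
ollama pull qwen2.5:14b

Run directly (interactive mode):

ollama run qwen2.5:7b

Once in interactive mode, just start chatting: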

>>> 你好，介绍一下你自己
我是通义千问，一个由阿里云开发的大语言模型...

>>> 用 Python 写个快速排序
好的，这是一个 Python 实现的快速排序算法：

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + middle + quicksort(right)

>>> /bye  （退出）

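Common commands:

# List downloaded models
ollama list

# Show information about a model
ollama show qwen2.5:7b

# Delete a model
ollama rm qwen2.5:7b

# Stop a running model and unload it from memory (this does not stop the Ollama service itself)
ollama stop qwen2.5:7b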

Calling the Local Model Through the API

Once Ollama is running, it serves an API on localhost:11434 that is compatible with the OpenAI protocol.

Test from the command line:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [
      {"role": "user", "content": "你好"}
    ]
  }'
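The same request can be built in Java without extra dependencies. The sketch below swaps in the JDK's built-in java.net.http client (Java 11+) in place of the OkHttp client used earlier; `OllamaRequests` and its hand-written `escape` helper are hypothetical, and a real project would normally build the body with a JSON library such as Gson.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/** Hypothetical helper: builds the Java equivalent of the curl request above. */
final class OllamaRequests {
    static final String URL = "http://localhost:11434/v1/chat/completions";

    private OllamaRequests() {}

    /** Builds a POST request with a single user message; nothing is sent here. */
    static HttpRequest chat(String model, String userMessage) {
        String body = "{\"model\":\"" + escape(model) + "\","
            + "\"messages\":[{\"role\":\"user\",\"content\":\"" + escape(userMessage) + "\"}]}";
        return HttpRequest.newBuilder(URI.create(URL))
            .timeout(Duration.ofSeconds(180)) // local models can be slow on CPU
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
            .build();
    }

    /** Minimal JSON string escaping for quotes, backslashes, and control characters. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
}
```

To send it, pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; the response body has the non-streaming format described earlier.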

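Calling from Java:

Ollama is compatible with the OpenAI protocol

Only the API_URL, API_KEY, and MODEL constants from the earlier code need to change:

// Cloud
private static final String API_URL = "https://api.siliconflow.cn/v1/chat/completions";
private static final String API_KEY = "YOUR_API_KEY";
private static final String MODEL = "Qwen/Qwen3.5-122B-A10B";

// Local
private static final String API_URL = "http://localhost:11434/v1/chat/completions";
private static final String API_KEY = "ollama";  // any value works; the local server does not check it
private static final String MODEL = "qwen2.5:7b";  // must match a model you pulled with ollama pull

The request-building and response-parsing code stays the same. That is the benefit of OpenAI compatibility. One caveat: enable_thinking is a platform-specific extension field rather than part of the OpenAI protocol, so other servers may ignore or reject it; remove it if the local call fails.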
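Instead of editing constants, the endpoint can also be chosen at startup. The sketch below groups the three values that differ between cloud and local into one object; `Endpoint`, `select`, and the `LLM_PROVIDER` environment variable mentioned afterwards are hypothetical names, while the URLs and model names are the ones used in this article.

```java
/** Hypothetical value object grouping the settings that differ between cloud and local calls. */
final class Endpoint {
    final String url;
    final String apiKey;
    final String model;

    private Endpoint(String url, String apiKey, String model) {
        this.url = url;
        this.apiKey = apiKey;
        this.model = model;
    }

    /** "ollama" (case-insensitive) selects the local server; any other value, including null, selects the cloud. */
    static Endpoint select(String provider, String cloudApiKey) {
        if ("ollama".equalsIgnoreCase(provider)) {
            // The local server does not validate the key, so a placeholder is enough.
            return new Endpoint("http://localhost:11434/v1/chat/completions", "ollama", "qwen2.5:7b");
        }
        return new Endpoint("https://api.siliconflow.cn/v1/chat/completions",
            cloudApiKey, "Qwen/Qwen3.5-122B-A10B");
    }
}
```

A client could then call `Endpoint.select(System.getenv("LLM_PROVIDER"), apiKey)` once and use the returned url, apiKey, and model when building requests.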

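Summary

This article took us from theory to practice:

1. Choosing a platform

Cloud: SiliconFlow is recommended (many models, low price, unified protocol)
Local: Ollama is recommended (simple, cross-platform, OpenAI-compatible)

2. The OpenAI protocol

Request essentials: model, messages, temperature, max_tokens, stream
Response essentials: choices[0].message.content, usage

3. Non-streaming calls

Simple and direct: send the request → wait → receive the complete result
Suited to scenarios that do not need real-time feedback

4. Streaming calls

Typewriter effect: output is returned while it is being generated
Requires handling the SSE format
Better user experience

5. Multi-turn conversation

Core idea: send the history with every request
Keep the history length under control so it does not exceed the context window

6. Local deployment

Ollama: one-step install, one-step run
API-compatible: change the URL and model name and it works

7. Common pitfalls

Set timeouts generously
Never leak your API Key
Parse streaming responses correctly
Keep the history length under control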
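One simple way to avoid leaking the API Key is to keep it out of the source code and read it from the environment at startup. The sketch below illustrates this; `ApiKeys` and the `SILICONFLOW_API_KEY` variable name mentioned afterwards are hypothetical, not names defined by the platform.

```java
import java.util.Map;

/** Hypothetical helper for loading an API key from the environment and logging it safely. */
final class ApiKeys {
    private ApiKeys() {}

    /** Returns the trimmed value of the variable, or fails fast with a clear message if it is missing. */
    static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Missing environment variable " + name);
        }
        return value.trim();
    }

    /** Masks a key for log output, keeping only the first and last four characters. */
    static String mask(String key) {
        if (key.length() <= 8) {
            return "****";
        }
        return key.substring(0, 4) + "****" + key.substring(key.length() - 4);
    }
}
```

In the example clients, the hard-coded API_KEY constant could be replaced with `ApiKeys.require(System.getenv(), "SILICONFLOW_API_KEY")`.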

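That wraps up the fundamentals of large models. You now have:

A systematic picture of large models (what they are, how they work, where they fall short)
An understanding of the core concepts (tokens, context, temperature, and so on)
The ability to choose a model
The practical skill of calling the API

From here we can move on to deeper topics: prompt engineering, RAG, agents, and more. All of them build on what was covered today, and with this foundation in place they will be much easier to learn.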
