Running DeepSeek R1 on Google Colab
I've recently been testing DeepSeek R1 locally, but my CPU was running far too hot. A friend asked: why not just use Google Colab? It gives you a free GPU.

Recently I tried to run the Qwen 7B distill of DeepSeek R1 locally, without any GPU. Every CPU core and thread was pushed to its limit, peaking at 90°C (Ryzen 5 7600).
My friend asked why I don't just use Google Colab, since it gives you a GPU for free (for roughly 3-4 hours at a time). He has been using it to parse PDFs of 80+ pages and chain LLM calls, because we can still abuse (well, use) Google Colab.
I did try the T4 (20-series generation), with some caveats I'll explain later (TL;DR: it's free). So I've been using it to test vLLM in Google Colab, exposing the API to the public with FastAPI and ngrok (for testing purposes, why not?).
Alright, time to explain everything and why I did it this way.
(Warning: this is for testing purposes only.)
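Before anything else, it helps to confirm that the Colab runtime actually has a GPU attached. A minimal sketch; it assumes you already switched the runtime type to a GPU accelerator such as the T4, and relies on the torch build that Colab ships with:
# Show which GPU Colab assigned to this runtime (usually a T4 on the free tier)
!nvidia-smi --query-gpu=name,memory.total --format=csv

import torch  # preinstalled on Colab GPU runtimes
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))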
1. PIP (Pip Installs Packages)
pip lets you install and manage libraries and dependencies that aren't part of the Python standard library. We'll install everything from the CLI by prefixing the command with ! in a Jupyter Notebook code cell.
!pip install fastapi nest-asyncio pyngrok uvicorn
!pip install vllm
We install FastAPI, nest-asyncio, pyngrok, and Uvicorn as the Python service that will handle requests from outside. vLLM is the library that does the actual LLM inference and serving. Ollama is an option too, but I believe this approach is more efficient.
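If you want a quick sanity check that everything installed before moving on, something like this works (a minimal sketch):
# Print the installed versions of the packages we just pulled in
from importlib.metadata import version

for pkg in ("fastapi", "nest-asyncio", "pyngrok", "uvicorn", "vllm"):
    print(pkg, version(pkg))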
2. Interacting with vLLM
# Load and run the model:
import subprocess
import time
import os

# Start vllm server in the background
vllm_process = subprocess.Popen([
    'vllm',
    'serve',  # Subcommand must follow vllm
    'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B',
    '--trust-remote-code',
    '--dtype', 'half',
    '--max-model-len', '16384',  # Max combined token input and output that you send and retrieve
    '--enable-chunked-prefill', 'true',
    '--tensor-parallel-size', '1'
], stdout=subprocess.PIPE, stderr=subprocess.PIPE, start_new_session=True)
OK, this is how I load the model: by starting the vLLM server in the background. If you run vllm directly in a Jupyter Notebook, the cell stays stuck on that process and we can't expose it (I think we could, but this is just how I did it). The flags:
- --trust-remote-code: trust the custom code that ships with the model repository.
- --dtype half: half precision, to reduce memory usage.
- --max-model-len: the maximum combined input + output tokens you want to send and retrieve.
- --enable-chunked-prefill: prefill (loading the prompt tokens into the model before generation starts) is done in chunks.
- --tensor-parallel-size: split the model across multiple GPUs to speed up inference.
With these settings we try to stay within the T4's limits, because:
- watch out for CUDA out-of-memory errors (our VRAM is capped at about 15 GB; a quick check follows below);
- Colab's GPU memory limits may require tuning these parameters;
- there is 12 GB of system RAM, which should be enough... I think.
Now run it.
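To check the first point above, i.e. how much of the T4's roughly 15 GB of VRAM the model and its KV cache actually take, you can peek at free memory once the server is up. A minimal sketch, using the torch that ships with Colab:
import torch

# Free vs. total VRAM on the current device, in GB
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"GPU memory: {free_bytes / 1e9:.1f} GB free of {total_bytes / 1e9:.1f} GB total")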
3. Subprocess
OK, because we use subprocess with start_new_session set to True, we can't normally pipe the output as it happens, and if something goes wrong we won't see it until the process actually fails.
import requests

def check_vllm_status():
    """Return True if the vLLM server answers on its health endpoint."""
    try:
        response = requests.get("http://localhost:8000/health")
        if response.status_code == 200:
            print("vllm server is running")
            return True
        return False
    except requests.exceptions.ConnectionError:
        print("vllm server is not running")
        return False

try:
    # Monitor the process until the server is ready (or the process dies)
    while True:
        if check_vllm_status():
            print("The vllm server is ready to serve.")
            break
        elif vllm_process.poll() is not None:
            # The background process exited: dump its output to see why
            print("The vllm server has stopped.")
            stdout, stderr = vllm_process.communicate(timeout=10)
            print(f"STDOUT: {stdout.decode('utf-8')}")
            print(f"STDERR: {stderr.decode('utf-8')}")
            break
        time.sleep(5)  # Check every 5 seconds
except KeyboardInterrupt:
    print("Stopping the check of vllm...")
It checks every 5 seconds; if something goes wrong, it tries to grab the piped output and prints vLLM's stdout and stderr. If vLLM is healthy, you can move on to the next code block.
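Because both pipes stay buffered until the process exits, an alternative I'd sketch for easier debugging is to point vLLM's output at a log file instead of PIPE, so you can read it at any time while the server is still loading. Same launch arguments as above; this is just a variation, not what the rest of the notebook uses:
# Alternative launch: write stdout/stderr to vllm.log so the logs are readable
# while the server is running (e.g. with `!tail -n 50 vllm.log`)
log_file = open("vllm.log", "w")
vllm_process = subprocess.Popen([
    'vllm', 'serve', 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B',
    '--trust-remote-code', '--dtype', 'half', '--max-model-len', '16384',
    '--enable-chunked-prefill', 'true', '--tensor-parallel-size', '1'
], stdout=log_file, stderr=subprocess.STDOUT, start_new_session=True)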
4. Creating functions that call vLLM
import requests
import json
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from fastapi.responses import StreamingResponse

# Request schema for input
class QuestionRequest(BaseModel):
    question: str
    model: str = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # Default model

def ask_model(question: str, model: str):
    """
    Sends a request to the model server and fetches a response.
    """
    url = "http://localhost:8000/v1/chat/completions"  # Adjust the URL if different
    headers = {"Content-Type": "application/json"}
    data = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": question
            }
        ]
    }
    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()  # Raise exception for HTTP errors
    return response.json()

# Usage:
result = ask_model("What is the capital of France?", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
print(json.dumps(result, indent=2))

def stream_llm_response(question: str, model: str):
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": True  # 🔥 Enable streaming
    }
    with requests.post(url, headers=headers, json=data, stream=True) as response:
        for line in response.iter_lines():
            if line:
                # OpenAI-style streaming responses are prefixed with "data: "
                decoded_line = line.decode("utf-8").replace("data: ", "")
                yield decoded_line + "\n"
We have two APIs to test.

The ask_model function
Purpose: sends a request to the vLLM server and waits for the complete response.
How it works:
- Builds a POST request to http://localhost:8000/v1/chat/completions.
- Sends a JSON payload containing the model name and the user's question (as a message).
- Waits for the response and returns it as JSON.
Key characteristics:
- Blocking call (waits until the full response is generated).
- Raises an exception if the request fails.

The stream_llm_response function
Purpose: streams the response from vLLM instead of waiting for the full output (a consumer sketch follows this list).
How it works:
- Sends a POST request with stream: True, enabling chunked responses.
- Uses response.iter_lines() to process response chunks in real time.
- Each received chunk is decoded and yielded as a stream.
Key characteristics:
- Non-blocking streaming (well suited to chatbots and interactive applications).
- Data comes back in small pieces, reducing perceived latency.
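To consume those streamed chunks on the client side, each data: line has to be parsed as JSON and the token delta pulled out. A minimal sketch, assuming vLLM's OpenAI-compatible chunk format with a choices[0].delta field and a final [DONE] marker:
import json

for chunk in stream_llm_response("What is the capital of France?",
                                 "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"):
    chunk = chunk.strip()
    if not chunk or chunk == "[DONE]":  # [DONE] marks the end of the stream
        continue
    payload = json.loads(chunk)
    delta = payload["choices"][0]["delta"]  # incremental piece of the message
    print(delta.get("content", ""), end="", flush=True)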
We tested it, and it output something like this:
{
"id": "chatcmpl-680bc07cd6de42e7a00a50dfbd99e833",
"object": "chat.completion",
"created": 1738129381,
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "<think>\nOkay, so I'm trying to find out what the capital of France is. Hmm, I remember hearing a few cities named after the myths or something. Let me think. I think Neuch portfolio is where the comma was named. Yeah, that's right, until sometimes they changed it, but I think it's still there now. Then there's Charles-de-Lorraine. I've seen that name written before in various contexts, maybe managers or something. And then I think there's Saint Mal\u25e6e as a significant city in France. Wait, I'm a bit confused about the last one. Is that the capital or somewhere else? I think the capital blew my mind once, and I still don't recall it. Let me think of the names that come to mind. Maybe Paris? But is there something else? I've heard about places likequalification, Guiness, and Agoura also named after mythological figures, but are they capitals? I don't think so. So among the prominent ones, maybe Neuch portfolio, Charles-de-Lorraine, and Saint Mal\u25e6e are the names intended for the capital, but I'm unsure which one it is. Wait, I think I might have confused some of them. Let me try to look up the actual capital. The capital of France is a city in the eastern department of\u5c55\u51fa. Oh, right, there's a special place called Place de la Confluense. Maybe that's where the capital is. So I think the capital is Place de la Confluense, not the city name. So the capital isn't the town; it's quite a vein-shaped area. But I'm a bit confused because some people might refer to just the town as the capital, but in reality, it's a larger area. So to answer the question, the capital of France is Place de la Confluense, and its formal name is la Confluense. I'm not entirely certain if there are any other significant cities or names, but from what I know, the others I listed might be historical places but not exactly capitals. Maybe the\u6bebot\u00e9 family name is still sometimes used for the capital, but I think it's not the actual name. So putting it all together, the capital is Place de la Confluense, and the correct name is \"la Confluense.\" The other names like Neuch portfolio are places, not capitals. So, overall, my answer would be the capital is la Confluense named at Place de la Confluense.\n</think>\n\nThe capital of France is called Place de la Confluense. Its official name is \"la Confluense.\"",
"tool_calls": []
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 10,
"total_tokens": 550,
"completion_tokens": 540,
"prompt_tokens_details": null
},
"prompt_logprobs": null
}
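As the output shows, R1-style distills wrap their chain of thought in <think>...</think> before the final answer. If you only want the answer part, you can strip the reasoning block out (a minimal sketch over the JSON that ask_model returns):
def extract_answer(result: dict) -> str:
    """Drop the <think>...</think> reasoning block and keep only the final answer."""
    content = result["choices"][0]["message"]["content"]
    if "</think>" in content:
        content = content.split("</think>", 1)[1]
    return content.strip()

print(extract_answer(result))  # `result` comes from the ask_model() call above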
5. API Routing
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
import nest_asyncio
from pyngrok import ngrok, conf
import uvicorn
import getpass

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=['*'],
    allow_credentials=True,
    allow_methods=['*'],
    allow_headers=['*'],
)

@app.get('/')
async def root():
    return {'hello': 'world'}

@app.post("/api/v1/generate-response")
def generate_response(request: QuestionRequest):
    """
    API endpoint to generate a response from the model.
    """
    try:
        response = ask_model(request.question, request.model)
        return {"response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/api/v1/generate-response-stream")
def stream_response(request: QuestionRequest):
    try:
        response = stream_llm_response(request.question, request.model)
        return StreamingResponse(response, media_type="text/event-stream")
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
OK, now we create an API route for each function we wrote. Each endpoint uses a different function: one streams the response, the other returns the full response but blocks. We only need something quick here, so CORS is wide open and allows everything. If an error occurs, we simply return an internal server error with the details.
6. ngrok -> public test
! ngrok config add-authtoken ${your-ngrok-token}
Now we add the auth token to the ngrok config; just copy-paste your token from the ngrok dashboard.
After that, we expose the service.
port = 8081
# Open a ngrok tunnel to the HTTP server
public_url = ngrok.connect(port).public_url
print(f" * ngrok tunnel \"{public_url}\" -> \"http://127.0.0.1:{port}\"")

Take the public tunnel URL it prints and hit it with curl or Postman.
nest_asyncio.apply()
uvicorn.run(app, port=port)
Finally, run the service, and wow, it works perfectly... I think.

Now you can access it like this:
curl --location 'https://6ee6-34-125-245-24.ngrok-free.app/api/v1/generate-response-stream' \
--header 'Content-Type: application/json' \
--data '{
"question": "where is paris?",
"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
}'
If you hit the per-token streaming endpoint instead, the response comes back chunk by chunk rather than as a single JSON body.

And honestly it's pretty good: the responses are reasonably fast... I think, and the answers themselves are solid (especially if you want a more concise answer, say about code rather than creative writing). For an open-source model following in the wake of Facebook's Llama, that's quite decent.
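When you're done testing, it's worth tearing things down so the Colab session isn't left holding the tunnel and the GPU. A minimal sketch; public_url and vllm_process come from the earlier cells, and you'd interrupt the uvicorn cell first:
# Tear down: close the ngrok tunnel and stop the background vLLM server
ngrok.disconnect(public_url)  # close this specific tunnel
ngrok.kill()                  # stop the local ngrok agent process
vllm_process.terminate()      # ask the vLLM server to exit
vllm_process.wait(timeout=30)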
The full code can be accessed here.
7. Closing thoughts
After all this, my advice: if you have the money and want to run it locally, buy a GPU. You'll need a fairly decent card and you'll pay for the electricity, but in exchange you keep ownership of your own data. Otherwise, use chat.deepseek.com if you want the very fast DeepSeek LLM (the full 671B-parameter model).

Original article: Trying out VLLM + DeepSeek R1 in Google Colab: A Quick Guide
Translated and compiled by 汇智网; please credit the source when reposting.