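Many AI-plus-search apps and plugins are on the market, and I have long wanted to understand how they work under the hood. In this article we study the idea and build our own AI search engine from scratch. The end result is a chat page with a sidebar of search results; clicking a result asks the model to summarize that page.
Without further ado, let's get started.
The code in this article is based on the API of mp.weixin.qq.com/s/6F22Mls7z…
0. Framework
Let's set up the framework first.
The server side uses Python with the Flask framework, and the front end is plain HTML. The page is rendered with Flask's render_template function, which renders Jinja2 templates. Jinja2 is a template engine for Python that lets you use Python variables and expressions inside an HTML file.
The code is as follows: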
from flask import Flask, render_template, request, jsonify

app = Flask(__name__)
history = []  # chat history shared by the routes

@app.route('/', methods=['GET'])
def index():
    chat_history = history
    return render_template('ai_search.html', history=chat_history)
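In the code, the HTML page is named "ai_search.html".
Note that with this rendering approach the HTML file must be placed in a folder named templates; otherwise Flask cannot find the file and raises an error.
In other words, the project directory should look like this:
(project root)
    ai_search.py
    templates/
        ai_search.html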
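1. Server side (Python + Flask)
The server side simply wraps each piece of functionality in a Flask endpoint and handles the request accordingly.
1.1 The Search endpoint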
@app.route('/search', methods=['GET', 'POST'])
def search():
    if request.method == 'POST':
        keyword = request.form['keyword']
    elif request.method == 'GET':
        keyword = request.args.get('keyword', '')
    else:
        keyword = ''
    results = crawl_pages(keyword)
    output = ""
    for result in results:
        output += f"<li><a id='myID' href='javascript:void(0);' onclick='handleLinkClick(\"{result['url']}\")'>{result['title']}</a></li><br>"
    return output
The Search endpoint receives the keyword entered by the user and then calls crawl_pages to fetch the search results.
1.1.1 The crawl_pages function
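The endpoint above interpolates scraped titles and URLs directly into HTML and into an inline onclick handler, so a title containing `<` or a URL containing a quote could break the markup. Below is a minimal sketch of one safer way to assemble the same list items using only the standard library. The helper name `render_results` is hypothetical and not part of the original code.

```python
import html
import json


def render_results(results):
    """Build the sidebar <li> items from a list of {'title', 'url'} dicts."""
    items = []
    for result in results:
        # json.dumps yields a valid JavaScript string literal for the URL;
        # html.escape then makes it safe inside the single-quoted attribute.
        js_url = html.escape(json.dumps(result["url"]), quote=True)
        title = html.escape(result["title"])
        items.append(
            "<li><a href='javascript:void(0);' "
            f"onclick='handleLinkClick({js_url})'>{title}</a></li>"
        )
    return "".join(items)


if __name__ == "__main__":
    sample = [{"title": "A <b>bold</b> title", "url": "https://example.com/?q='x'"}]
    print(render_results(sample))
```

In the Flask route, this would replace the string concatenation loop: `return render_results(crawl_pages(keyword))`.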
def crawl_pages(query_text, page_num=2):
    browser = mechanicalsoup.Browser()
    # Percent-encode the keyword; Chinese characters, for example, must be encoded before they can be used as a URL parameter
    query_text_encoded = quote(query_text)
    results = []
    for page_index in range(1, page_num+1):
        url = f"https://search.cctv.com/search.php?qtext={query_text_encoded}&type=web&page={page_index}"
        page = browser.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        # Result links are the <a> tags whose id starts with "web_content_"
        web_content_links = soup.find_all('a', id=lambda x: x and x.startswith('web_content_'))
        for link in web_content_links:
            # The real page address is carried in the "targetpage" query parameter
            target_page = parse_qs(urlparse(link['href']).query).get('targetpage', [None])[0]
            results.append({'title': link.text, 'url': target_page})
    return results
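This function searches a fixed website for the keyword, fetches the first two pages of search results, and scrapes the title and URL of each result.
(1) url = f"https://search.cctv.com/search.php?qtext={query_text_encoded}&type=web&page={page_index}" specifies which site to search. Requesting this link is equivalent to typing the keyword into the search box on the CCTV website.
(2) A simple crawler then extracts the URL and title of every result on the results page. For example, this statement extracts the URL: target_page = parse_qs(urlparse(link['href']).query).get('targetpage', [None])[0]
(3) This yields a list of URLs. Once the list is returned to the Search endpoint, output += f"<li><a id='myID' href='javascript:void(0);' onclick='handleLinkClick(\"{result['url']}\")'>{result['title']}</a></li><br>" assembles the results into HTML that is inserted into the page, which produces the sidebar.
1.2 The generate-text endpoint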
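The two URL manipulations in steps (1) and (2) can be tried in isolation with the standard library. The sketch below is only an illustration: the helper names `build_search_url` and `extract_target_page` are hypothetical, and the sample href is an invented redirect-style link shaped like the ones the crawler expects, not a real CCTV result.

```python
from urllib.parse import parse_qs, quote, urlparse


def build_search_url(query_text, page_index):
    """Percent-encode the keyword and place it in the search URL."""
    encoded = quote(query_text)
    return (
        "https://search.cctv.com/search.php"
        f"?qtext={encoded}&type=web&page={page_index}"
    )


def extract_target_page(href):
    """Return the 'targetpage' query parameter of a result link, or None."""
    query = urlparse(href).query
    return parse_qs(query).get("targetpage", [None])[0]


if __name__ == "__main__":
    print(build_search_url("人工智能", 1))
    # Hypothetical link; parse_qs also decodes the percent-encoded value.
    href = (
        "https://example.com/link"
        "?targetpage=https%3A%2F%2Fnews.example.com%2Fa.html&pos=1"
    )
    print(extract_target_page(href))
```

Note that `extract_target_page` returns None when the parameter is absent, so a caller may want to skip such results before rendering them.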
@app.route('/generate-text', methods=['POST'])
def generate_text_api():
    prompt = request.json['prompt']
    result = generate_text(prompt)
    return jsonify(result)
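This endpoint passes the user's keyword to the large language model as the prompt and returns whatever the model replies. There is no special processing in between. The one thing worth noting is history.append({"user": prompt, "bot": generated_text}), which adds the exchange to the chat history.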
def get_openai_chat_completion(messages, temperature, model="gpt-3.5-turbo-1106"):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
    )
    return response

def generate_text(prompt, temperature=0.5):
    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]
    response = get_openai_chat_completion(messages=messages, temperature=temperature)
    generated_text = response.choices[0].message.content
    history.append({"user": prompt, "bot": generated_text})  # add the user input and the model output to the history
    return {"status": "success", "response": generated_text}
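The reply produced in this step comes purely from the model and has nothing to do with the search results.
1.3 The page_content endpoint
This endpoint fetches the content of a web page by URL. It is a simple crawler that extracts the text and the image links from the page.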
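Note that generate_text sends only the current prompt, so the model never sees earlier turns even though they are stored in history. If multi-turn context were wanted, one possible approach (an assumption on my part, not something the original code does) is to convert the stored pairs into chat messages. The helper name `build_messages` and the `max_turns` cap are hypothetical.

```python
def build_messages(history, prompt, max_turns=5):
    """Turn stored {'user', 'bot'} pairs plus a new prompt into chat messages."""
    messages = []
    # Keep only the most recent turns so the request stays small.
    recent = history[-max_turns:] if max_turns > 0 else []
    for turn in recent:
        messages.append({"role": "user", "content": turn["user"]})
        messages.append({"role": "assistant", "content": turn["bot"]})
    messages.append({"role": "user", "content": prompt})
    return messages


if __name__ == "__main__":
    past = [{"user": "What is Flask?", "bot": "A Python web framework."}]
    for message in build_messages(past, "And Jinja2?"):
        print(message)
```

The returned list has the same shape as the messages list built inside generate_text, so it could be passed to get_openai_chat_completion unchanged.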
@app.route('/page_content')
def page_content():
    url = request.args.get('url', '')
    if not url:
        return 'Missing url parameter'
    browser = mechanicalsoup.Browser()
    page = browser.get(url)
    page.encoding = 'utf-8'  # force the page encoding to UTF-8
    soup = BeautifulSoup(page.text, 'html.parser')
    all_text = ''
    all_images = []
    # Collect the text of all paragraph, heading, and span elements
    for element in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'span']):
        all_text += element.get_text() + ' '
    # Collect all image links
    for img in soup.find_all('img'):
        img_src = img.get('src')
        if img_src:
            # Assumes protocol-relative src values such as //host/img.jpg
            all_images.append("https:" + img_src)
    return f"Text content: {all_text}<br>Image links: {', '.join(all_images)}"
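Prefixing every src with "https:" only produces a valid link when the page uses protocol-relative URLs (`//host/path`). A more general option, sketched here as a suggestion rather than as the original author's method, is to resolve each src against the page URL with `urllib.parse.urljoin`. The helper name `absolute_image_urls` is hypothetical.

```python
from urllib.parse import urljoin


def absolute_image_urls(page_url, srcs):
    """Resolve each <img> src against the page URL; skip empty values."""
    return [urljoin(page_url, src) for src in srcs if src]


if __name__ == "__main__":
    page = "https://news.example.com/2024/a.html"
    for link in absolute_image_urls(page, ["//img.example.com/1.jpg", "/static/2.png", "3.gif"]):
        print(link)
```

Inside page_content this would amount to calling `urljoin(url, img_src)` in place of the string concatenation, with `urljoin` added to the `urllib.parse` import.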
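2. Front end (HTML)
2.1 What happens after the user enters a keyword
First, look at what the front-end code does when the user clicks the submit button. The key lines are: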
inputForm.addEventListener('submit', async (event) => {
    ......
    const aa = document.getElementById('listView');
    aa.innerHTML = await getA(userInput);
    const response = await generateText(userInput);
    hideTypingAnimation(userMessage);
    ......
});
As you can see, when the user clicks the submit button, the getA function is called first:
async function getA(prompt) {
    // Encode the keyword so characters such as & or spaces do not break the query string
    const response = await fetch(SERVER_URL + `/search?keyword=${encodeURIComponent(prompt)}`, {
        method: 'GET'
    });
    return await response.text();
}
getA calls the server's Search endpoint, which searches the fixed website for the keyword and returns the list of URLs and titles.
Right after that, generateText is called:
async function generateText(prompt) {
    const response = await fetch(SERVER_URL + '/generate-text', {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json'
        },
        body: JSON.stringify({
            prompt
        })
    });
    return await response.json();
}
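generateText calls the server's generate-text endpoint and has the model produce a reply.
2.2 What happens after the user clicks a title in the sidebar
When the user clicks a title in the sidebar, the following runs: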
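The same request can be reproduced outside the browser, which is handy for testing the endpoint on its own. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the base URL assumes the Flask development server is running locally on its default port.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt):
    """Create (but do not send) a JSON POST request for /generate-text."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/generate-text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_prompt(base_url, prompt):
    """Send the request and decode the JSON reply (needs a running server)."""
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `post_prompt("http://127.0.0.1:5000", "Hello")` should return the same dictionary the front end receives, with the keys `status` and `response`.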
async function handleLinkClick(link) {
    const content = await getPageContent(link);
    ......
    const response = await generateText("Summarize the following content: " + content);
    ......
}
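First, getPageContent is called; it uses the server's page_content endpoint to scrape all the text and image links at that URL.
Then generateText calls the server's generate-text endpoint so that the model summarizes that text, and the summary appears in the chat as the bot's reply.
3. Full code
3.1 ai_search.html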
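A scraped page can be far longer than what is sensible to send to the model in one prompt, and the original code sends it unmodified. One way to keep the prompt bounded is sketched below. The helper name `build_summary_prompt`, the default prefix, and the `max_chars` value are all assumptions chosen for illustration, not limits taken from any model's documentation.

```python
def build_summary_prompt(page_text, max_chars=4000,
                         prefix="Summarize the following content: "):
    """Collapse whitespace, cap the length, and prepend the instruction."""
    cleaned = " ".join(page_text.split())
    if len(cleaned) > max_chars:
        cleaned = cleaned[:max_chars]
    return prefix + cleaned


if __name__ == "__main__":
    print(build_summary_prompt("Title\n\n  First   paragraph.\nSecond paragraph.", max_chars=30))
```

The cut is a plain character slice, so it may end mid-sentence; that is usually acceptable for a summary request but is worth keeping in mind.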
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Chat with AI</title>
<style>
body {
display: flex;
flex-direction: column;
height: 100vh;
margin: 0;
font-family: Arial, sans-serif;
}
.website-container {
position: fixed;
top: 0;
right: 0;
width: 350px;
height: 100%;
border: 1px solid #ccc;
overflow-y: auto;
background-color: #f9f9f9;
padding: 10px;
}
.chat-container {
height: 100%;
width: 85%;
overflow: hidden;
overflow-y: auto;
padding: 10px;
margin-right: 220px;
/* Leave room for the right sidebar */
}
.chat-container::-webkit-scrollbar {
display: none;
}
.avatar-user {
width: 40px;
height: 40px;
background-color: #7fb8e7;
/* User avatar color */
border-radius: 50%;
/* Make the avatar round */
margin-left: 10px;
/* Spacing between the avatar and the message */
}
.avatar-bot {
width: 40px;
height: 40px;
right: 0;
background-color: #28a745;
/* Bot avatar color */
border-radius: 50%;
/* Make the avatar round */
margin-right: 10px;
/* Spacing between the avatar and the message */
object-fit: cover;
/* Keep the avatar from being distorted */
}
.message {
display: flex;
align-items: center;
/* Vertically center the message and the avatar */
margin-bottom: 1rem;
}
.message-text {
padding: 10px;
word-wrap: break-word;
border-radius: 6px;
max-width: 70%;
margin:100px;
}
.message-text-user {
padding: 10px;
border-radius: 6px;
max-width: 70%;
margin:100px;
word-wrap: break-word;
background-color: #ececec;
}
.user-message {
display: flex;
justify-content: flex-end;
}
.bot-message .message-text {
background-color: #2ea44f;
color: white;
}
.input-container {
position: fixed;
bottom: 0;
left: 0;
width: calc(100% - 220px);
/* Account for the width of the right sidebar */
display: flex;
align-items: center;
background-color: #f9f9f9;
padding: 10px;
}
.input-field {
flex-grow: 1;
padding: 0.75rem;
border: 1px solid #d1d5da;
border-radius: 6px;
margin-right: 1rem;
}
.send-button {
padding: 0.75rem 1rem;
background-color: #2ea44f;
color: white;
border: none;
border-radius: 6px;
cursor: pointer;
}
.del-button {
padding: 0.75rem 1rem;
background-color: #aeaeae;
color: white;
border: none;
margin-right: 10px;
border-radius: 6px;
cursor: pointer;
}
.send-button:disabled {
opacity: 0.5;
cursor: not-allowed;
}
.typing-indicator {
position: absolute;
margin-bottom: 50px;
font-size: 0.8rem;
color: #586069;
}
.typing:before,
.typing:after {
content: '';
display: inline-block;
width: 0.75rem;
height: 0.75rem;
border-radius: 50%;
margin-right: 0.25rem;
animation: typing 1s infinite;
}
@keyframes typing {
0% {
transform: scale(0);
}
50% {
transform: scale(1);
}
100% {
transform: scale(0);
}
}
/* List styles */
.listView {
list-style-type: none;
margin: 0;
padding: 0;
}
.listView li {
background-color: #f4f4f4;
padding: 10px;
margin-bottom: 5px;
box-shadow: 2px 2px 5px rgba(0, 0, 0, 0.1);
transition: box-shadow 0.3s ease;
}
.listView li:hover {
box-shadow: 2px 2px 10px rgba(0, 0, 0, 0.2);
}
.listView li a {
text-decoration: none;
color: #333;
display: block;
transition: color 0.3s ease;
}
.listView li a:hover {
color: #ff6600;
}
</style>
</head>
<body style="display: flex; flex-direction: column; height: 100vh;">
<div id="website-container" class="website-container">
<ul class="listView" id="listView"></ul>
</div>
<div style="height: 90%; width:80%; overflow-y: auto; display: flex; flex-direction: column;">
<ul class="chat-container" id="chat-container">
{% for item in history %}
{% if loop.index == 1 %}
<!-- Hook for any special handling of the first message -->
<li class="message user-message">
<div class="message-text-user">{{ item.user }}</div> <!-- user message goes here -->
<div class="avatar-user"></div>
</li>
<li class="message bot-message">
<div class="avatar-bot"></div>
<div class="message-text">{{ item.bot }}</div> <!-- bot message goes here -->
</li>
{% else %}
<!-- All other messages are rendered normally -->
<li class="message user-message">
<div class="message-text-user">{{ item.user }}</div>
<div class="avatar-user"></div>
</li>
<li class="message bot-message">
<div class="avatar-bot"></div>
<div class="message-text">{{ item.bot }}</div>
</li>
{% endif %}
{% endfor %}
</ul>
</div>
<form class="input-container" id="input-form" method="POST"
style="position: fixed; bottom: 0; left: 0; width: 65%;">
<button type="button" class="del-button" id="del-button" style="width: 100px;" onclick='del()'>Clear</button>
<input type="text" placeholder="You ask, I search" class="input-field" id="input-field" name="prompt" autocomplete="off"
style="width: calc(100% - 100px);">
<button type="submit" class="send-button" id="send-button" disabled style="width: 100px;">Search</button>
</form>
<script>
const SERVER_URL = '';
const inputForm = document.getElementById('input-form');
const inputField = document.getElementById('input-field');
const chatContainer = document.getElementById('chat-container');
inputField.addEventListener('input', () => {
const userInput = inputField.value.trim();
document.getElementById('send-button').disabled = !userInput;
});
inputForm.addEventListener('submit', async (event) => {
event.preventDefault();
const userInput = inputField.value.trim();
const chatContainer = document.getElementById('chat-container');
if (!userInput) {
return;
}
const userMessage = createMessageElement(userInput, 'user-message', "message-text-user", "avatar-user");
chatContainer.appendChild(userMessage);
inputField.value = '';
chatContainer.scrollTop = chatContainer.scrollHeight;
inputField.disabled = true;
document.getElementById('send-button').disabled = true;
showTypingAnimation(userMessage);
const aa = document.getElementById('listView');
aa.innerHTML = await getA(userInput);
const response = await generateText(userInput);
hideTypingAnimation(userMessage);
if (response.status === 'success') {
const botResponse = createMessageElement(response.response, 'bot-message', "message-text", "avatar-bot");
chatContainer.appendChild(botResponse);
printMessageText(botResponse);
} else {
alert(response.message);
}
inputField.disabled = false;
inputField.focus();
});
async function generateText(prompt) {
const response = await fetch(SERVER_URL + '/generate-text', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
prompt
})
});
return await response.json();
}
async function getA(prompt) {
// Encode the keyword so characters such as & or spaces do not break the query string
const response = await fetch(SERVER_URL + `/search?keyword=${encodeURIComponent(prompt)}`, {
method: 'GET'
});
return await response.text();
}
function createMessageElement(text, className, name, bot) {
const message = document.createElement('li');
message.classList.add('message', className, 'typing');
if (bot == "avatar-bot") {
message.innerHTML = `
<div class=${bot}></div>
<div class=${name}>${text}</div>
`;
} else {
message.innerHTML = `
<div class=${name}>${text}</div>
<div class=${bot}></div>
`;
}
return message;
}
function showTypingAnimation(element) {
const chatContainer = document.getElementById('chat-container');
chatContainer.scrollTop = chatContainer.scrollHeight + 10;
const rect = element.getBoundingClientRect();
const topPosition = rect.top + window.scrollY + rect.height;
const leftPosition = rect.left + window.scrollX;
const typingIndicator = document.createElement('div');
typingIndicator.classList.add('typing-indicator');
typingIndicator.style.top = `${topPosition}px`;
typingIndicator.style.left = `${leftPosition}px`;
typingIndicator.innerHTML = 'Thinking...';
document.body.appendChild(typingIndicator);
}
function hideTypingAnimation(element) {
const typingIndicator = document.querySelector('.typing-indicator');
if (typingIndicator) {
typingIndicator.remove();
}
element.classList.remove('typing');
}
// Print the message one character at a time
function printMessageText(message) {
const chatContainer = document.getElementById('chat-container');
const text = message.querySelector('.message-text');
const textContent = text.textContent;
text.textContent = '';
for (let i = 0; i < textContent.length; i++) {
setTimeout(() => {
text.textContent += textContent.charAt(i);
chatContainer.scrollTop = chatContainer.scrollHeight;
}, i * 10); // controls the printing speed
}
}
async function handleLinkClick(link) {
const content = await getPageContent(link);
console.log(link);
console.log(content);
const userMessage = createMessageElement("Summarizing: " + link, 'user-message', "message-text-user", "avatar-user");
const chatContainer = document.getElementById('chat-container');
// Append first so the typing indicator can be positioned from the element's real coordinates
chatContainer.appendChild(userMessage);
showTypingAnimation(userMessage);
const response = await generateText("Summarize the following content: " + content);
hideTypingAnimation(userMessage);
if (response.status === 'success') {
const botResponse = createMessageElement(response.response, 'bot-message', "message-text", "avatar-bot");
chatContainer.appendChild(botResponse);
printMessageText(botResponse);
} else {
alert(response.message);
}
}
async function del() {
// Wait for the server to clear the history before reloading the page
await fetch(SERVER_URL + `/clear`, {
method: 'POST'
});
location.replace("/");
}
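// Fetch the content of a page through the server
async function getPageContent(url) {
// Encode the target URL so its own query string is not mixed into this request's parameters
const response = await fetch(SERVER_URL + `/page_content?url=${encodeURIComponent(url)}`, {
method: 'GET'
});
return await response.text();
}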
</script>
</body>
</html>
3.2 ai_search.py
from flask import Flask, render_template, request, jsonify
from http import HTTPStatus
from openai import OpenAI
import mechanicalsoup
from bs4 import BeautifulSoup
from flask_cors import CORS
from urllib.parse import urlparse, parse_qs, quote

app = Flask(__name__)
client = OpenAI()
CORS(app)
history = []

def crawl_pages(query_text, page_num=2):
    browser = mechanicalsoup.Browser()
    query_text_encoded = quote(query_text)
    results = []
    for page_index in range(1, page_num+1):
        url = f"https://search.cctv.com/search.php?qtext={query_text_encoded}&type=web&page={page_index}"
        page = browser.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        web_content_links = soup.find_all('a', id=lambda x: x and x.startswith('web_content_'))
        for link in web_content_links:
            target_page = parse_qs(urlparse(link['href']).query).get('targetpage', [None])[0]
            results.append({'title': link.text, 'url': target_page})
    return results

def get_openai_chat_completion(messages, temperature, model="gpt-3.5-turbo-1106"):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
    )
    return response

def generate_text(prompt, temperature=0.5):
    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]
    response = get_openai_chat_completion(messages=messages, temperature=temperature)
    generated_text = response.choices[0].message.content
    history.append({"user": prompt, "bot": generated_text})  # add the user input and the model output to the history
    return {"status": "success", "response": generated_text}

@app.route('/', methods=['GET'])
def index():
    chat_history = history
    return render_template('ai_search.html', history=chat_history)

@app.route('/generate-text', methods=['POST'])
def generate_text_api():
    prompt = request.json['prompt']
    result = generate_text(prompt)
    return jsonify(result)

@app.route('/clear', methods=['POST'])
def clear():
    global history
    history = []
    return '', HTTPStatus.NO_CONTENT

@app.route('/search', methods=['GET', 'POST'])
def search():
    if request.method == 'POST':
        keyword = request.form['keyword']
    elif request.method == 'GET':
        keyword = request.args.get('keyword', '')
    else:
        keyword = ''
    results = crawl_pages(keyword)
    output = ""
    for result in results:
        output += f"<li><a id='myID' href='javascript:void(0);' onclick='handleLinkClick(\"{result['url']}\")'>{result['title']}</a></li><br>"
    return output

@app.route('/page_content')
def page_content():
    url = request.args.get('url', '')
    if not url:
        return 'Missing url parameter'
    browser = mechanicalsoup.Browser()
    page = browser.get(url)
    page.encoding = 'utf-8'  # force the page encoding to UTF-8
    soup = BeautifulSoup(page.text, 'html.parser')
    all_text = ''
    all_images = []
    # Collect the text of all paragraph, heading, and span elements
    for element in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'span']):
        all_text += element.get_text() + ' '
    # Collect all image links
    for img in soup.find_all('img'):
        img_src = img.get('src')
        if img_src:
            # Assumes protocol-relative src values such as //host/img.jpg
            all_images.append("https:" + img_src)
    return f"Text content: {all_text}<br>Image links: {', '.join(all_images)}"

if __name__ == '__main__':
    app.run(debug=True)
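3.3 Running
Run ai_search.py and open the link printed in the console. The OpenAI client reads the API key from the OPENAI_API_KEY environment variable, so set it before starting the server.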
3.4 Dependencies you may need to install
pip install Flask -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install mechanicalsoup -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install Jinja2
pip install openai flask-cors
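3.5 Load the HTML through Jinja2, not by opening the file directly
Opening the HTML file directly in a browser does not display correctly, because template tags such as {% for item in history %} are only processed when Flask renders the page.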
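4. Summary
In this article we built an AI-plus-search engine from scratch. The principle is fairly simple: search is just a fixed URL plus the keyword, followed by scraping the titles and URLs from the results page, and those become the results. Text summarization needs even less explanation, since earlier articles covered it in detail with hands-on practice.
The example is simple but fairly complete, so it can serve as a quick start for similar projects and as a base for quickly building your own prototype.
Try running it yourself; along the way you will likely come up with ideas for improving it.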