[8기 랭체인] LLM + Langchain으로 자연어로 DB 조회하기

안녕하세요 황인준입니다.

저는 이번 8기 스터디 주제로 LLM + Langchain으로 자연어로 DB 조회하기라는 주제를 선택했습니다.

배경

저는 소규모 스타트업에 오래 다녔었는데, 조직 규모가 작은 탓에 다양한 직군의 여러 팀원분들께서

저희 서비스에 관한 데이터를 많이 요청하는 편입니다.

예를 들면 “11월에 가입된 유저 정보” , “특정 이벤트에 참여한 회원” , “10월 11월에 가장 많이 판매되었던 상품“ 등 다양한 정보를 요청을 받습니다.

사실 이 부분은 백오피스에서 보고서로 보여주면 좋겠지만 매번 개발을 할 수 없어서 좋은 방법이 없을까 항상 고민 했었습니다.

그래서 이번 8기 주제를 LangChain 유틸리티 중 SQLDatabase 를 사용해서 자연어로 데이터 조회해서 사내의 비개발자 직군분들께서 좀 더 쉽게 데이터에 접근 하실 수 있도록 간단한 서비스를 만들어보려고 합니다.

우선 LangChain 공식 가이드에 있는 Querying a SQL DB 자료를 참고하여 만들었습니다.

다른 cookbook 예제와 다르게 Chinook.db 를 사용합니다.

*Chinook SQL Server, Oracle, MySQL 등에서 사용할 수 있는 샘플 데이터베이스입니다. 단일 SQL 스크립트를 실행하여 생성할 수 있습니다. 치누크 데이터베이스는 노스윈드 데이터베이스의 대안으로, 단일 및 다중 데이터베이스 서버를 대상으로 하는 데모 및 ORM 도구 테스트에 이상적입니다.

Chinook DB는

Chinook https://database.guide/2-sample-databases-sqlite/ 이곳에서 다운로드 받을 수 있으며 터미널에서 아래와 같은 명령어를 통해 sqlite3 로 Chinook의 샘플 데이터를 조회할 수 있습니다. 사용할 수 있습니다. (* )

sqlite3 Chinook.db
.read Chinook_Sqlite.sql
SELECT * FROM Artist LIMIT 10;

(이러한 Table을 갖고 있습니다.)

코드 설명

본격적으로 Cookbook 예시와 함께 코드 설명 간단히 하겠습니다.

from langchain.prompts import ChatPromptTemplate

template = """Based on the table schema below, write a SQL query that would answer the user's question:
{schema}

Question: {question}
SQL Query:"""

prompt = ChatPromptTemplate.from_template(template)

from langchain.utilities import SQLDatabase
db = SQLDatabase.from_uri("sqlite:////Users/hwang-injun/development/side-project/Chinook.db")

템플릿 및 데이터베이스 설정:
- 사용자 질문을 바탕으로 SQL 쿼리를 생성하기 위한 템플릿을 정의합니다.
- SQLDatabase: SQLite 데이터베이스에 연결하기 위한 설정을 합니다.

def get_schema(_):
    return db.get_table_info()

def run_query(query):
    return db.run(query)

from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

model = ChatOpenAI(openai_api_key='sk-0pfxOgraYtsManEakYDgT3Bl*******')

sql_response = (
    RunnablePassthrough.assign(schema=get_schema)
    | prompt
    | model.bind(stop=["\nSQLResult:"])
    | StrOutputParser()
)

sql_result = sql_response.invoke({"question": user_question})
print(sql_result)

스키마 정보 가져오기:
- get_schema: 데이터베이스의 테이블 스키마 정보를 가져옵니다. 이 정보는 SQL 쿼리를 생성하는 데 사용됩니다.
SQL 쿼리 실행:
- run_query: 생성된 SQL 쿼리를 실행하고 결과를 반환합니다.
첫번째 체인 만들기
- 만들어준 템플릿으로 사용자의 질문과 , get_schema 로 Sql 쿼리문을 만들어줍니다.

template = """Write a natural language response based on the table schema, question, SQL query, and SQL response below. And present it in a table with markdown syntax And present it in a table with markdown syntax:
{schema}

Question: {question}
SQL Query: {query}
SQL Response: {response}"""

prompt_response = ChatPromptTemplate.from_template(template)

full_chain = (
    RunnablePassthrough.assign(query=sql_response) 
    | RunnablePassthrough.assign(
        schema=get_schema,
        response=lambda x: db.run(x["query"])
    )
    | prompt_response
    | model
)
result = full_chain.invoke({"question": user_question})

print(result.content)

SQL 쿼리 생성 및 실행:
- sql_response: 사용자 질문을 바탕으로 SQL 쿼리를 생성하고 실행합니다. 이 과정에서 OpenAI 모델과 템플릿이 사용됩니다.
최종 결과 생성:
- full_chain: SQL 쿼리와 그 결과를 바탕으로 사용자가 이해하기 쉬운 자연어 응답을 생성합니다.

실행 결과

아래는 위의 코드의 실행 결과 입니다.

1.How many pieces of data among composers in Trak are of unknown origin?
-> Query : SELECT COUNT(EmployeeId) AS TotalEmployees FROM Employee
-> Response : The LastName and FirstName of the person with EmployeeId number 1 are Adams and Andrew, respectively

2.What are the LastName and FirstName of the person with EmployeeId number 1
-> Query : SELECT COUNT(EmployeeId) AS TotalEmployees FROM Employee
-> Response : The LastName and FirstName of the person with EmployeeId number 1 are Adams and Andrew, respectively

3. Give me the Name of songs whose composer is Roy Z.
-> Query : 
SELECT Name
FROM Track
WHERE Composer = 'Roy Z'

-> Response : The songs whose composer is Roy Z are: "King In Crimson", "Chemical Wedding", "The Tower", "Gates Of Urizen", "Jerusalem", "Trupets Of Jericho", "The Alchemist", and "Realword".

4.Give me the title of songs whose composer is Roy Z.
-> Query : 
SELECT Title FROM Album 
JOIN Track ON Album.AlbumId = Track.AlbumId 
WHERE Track.Composer = 'Roy Z'

-> Response : The title of the songs whose composer is Roy Z is "Chemical Wedding".

5.Tell me the name of the customer with the largest [Total] value in the Invoice table
-> Query : 
SELECT c.FirstName, c.LastName
FROM Customer c
JOIN Invoice i ON c.CustomerId = i.CustomerId
ORDER BY i.Total DESC
LIMIT 1;

-> Response : The customer with the largest total value in the Invoice table is Helena Holý.

6.Group and count by [BillingCountry] in the Invoice table
-> Query : 
SELECT BillingCountry, COUNT(*) AS Count
FROM Invoice
GROUP BY BillingCountry

-> Response : The query was used to group and count the number of invoices by billing country in the Invoice table. The response shows the results, indicating the billing country and the corresponding count of invoices for each country. For example, there are 7 invoices from Argentina, 7 from Australia, 7 from Austria, and so on.

7. Group and count by [BillingCountry] in the Invoice table and show the three BillingCountries and Count values in order of highest to lowest.
-> Query : 
SELECT BillingCountry, COUNT(*) as Count
FROM Invoice
GROUP BY BillingCountry
ORDER BY Count DESC

-> Response : The three Billing Countries with the highest counts in the Invoice table are as follows:\n1. USA - 91 invoices\n2. Canada - 56 invoices\n3. France - 35 invoices

1~4 번 질문은 단순 테이블 조회였습니다.

5번 질문은 두개 이상의 테이블(Customer , Invoice)을 조인해야 결과를 얻을 수 있는 데이터였고

6번째 질문은 Group by 구문을 이용해야지 얻을 수 있는 데이터였습니다.

7번째의 경우 위에서 prompt를 약간 수정하여 마크다운 형식으로 테이블 형태로 보여달라고 했던 결과 였습니다.

가볍게 테스트 해봤는데 정말 개인적으로는 정확성에 대해서 놀라웠습니다.

정말 가볍게 사용할 수 있는 뿐만 아니라 조금은 복잡한 쿼리를 써야하는 일도 쉽게 해줄 수 있음을 확인했습니다.

한계점으로는 일부 데이터를 검색할 시에는 테이블명과 컬럼명을 조금 정확하게 또는 구분자를 추가해주어야지 원하는 결과를 얻을 수 있는 점이였습니다.

저는 이를 가지고 단순 질의를 하고 응답하는 형식에서 벗어나 사용자가 원하는 결과를 한번에 찾아내지 못할 시에는 조금 더 대화를 이어나서 정확한 응답 결과를 받을 수 있도록 개발해보겠습니다

감사합니다!

⏰ 가장 빠르게 AI를 배우는 곳 | 지피터스 AI스터디 17기 🚀

[8기 랭체인] LLM + Langchain으로 자연어로 DB 조회하기

배경

코드 설명

실행 결과

👉 이 게시글도 읽어보세요