[Dev A] Building a PDF Translator with the GPT-3.5 16k API

Warning 1: Most of this post is Python code. If you don't know Python, it will be hard to follow; skip to the end of the page.
Warning 2: To a Python developer, this code will look like a mess.

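You can translate a PDF with DeepL, Google Translate, or Papago. In specialized documents, however, some words mean something different from their everyday sense. (For example, select normally means "to choose," but in C code it is a function name.) General-purpose translators translate these words too. The result is still readable, but it often costs a lot of concentration. This project started from a simple thought: what if I built a translator tailored to whatever I need at the moment?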

To give the result up front: apart from image handling, I'm quite satisfied.

Extracting text from the PDF

import PyPDF2
import os

# Extract text from the PDF
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page_num in range(len(reader.pages)):
            text += reader.pages[page_num].extract_text()
    print("\n[INFO] Extracted text from PDF:")
    print(text[:500])  # print only the first 500 characters
    return text

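If you just scrape all the text and ask the GPT API to translate it, there's a good chance the request exceeds the token limit and never even starts.
Splitting by a plain character count can run into the same problem, because character count and token count are not the same thing.
So the first thing needed was a preprocessing step that counts tokens in advance.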

import re
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo-16k-0613")

def split_text_to_fit_token_limit(text, token_limit=7000):
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    current_text = ""
    texts = []

    for sentence in sentences:
        potential_text = f"{current_text} {sentence}".strip()
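        if len(enc.encode(potential_text)) > token_limit:
            if current_text:  # skip the empty chunk that appears when the very first sentence is over the limit
                texts.append(current_text.strip())
            current_text = sentence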
        else:
            current_text = potential_text

    if current_text:
        texts.append(current_text.strip())
    # print("\n[INFO] Splitted texts for translation (showing first section):")
    # print(texts[0][:100])  # print only the first 100 characters of the first section
    return texts

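This step controls the number of input tokens. I assumed that, for the same content, the Korean output would not need as much as twice the tokens of the English input, and I split the text by token count on that basis.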
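The budgeting described above can be sketched as a small calculation. This is only a rough illustration: the 16,000-token window, the 500-token allowance for the system prompt, and the output-to-input ratios are assumed figures, not values measured in this project, and `max_input_tokens` is a hypothetical helper.

```python
def max_input_tokens(context_window, system_tokens, output_ratio):
    """Largest input chunk that still leaves room for the translated output.

    The request must satisfy:
        system_tokens + input + input * output_ratio <= context_window
    """
    available = context_window - system_tokens
    return int(available / (1 + output_ratio))


# Assumed figures: ~16,000-token window, ~500 tokens of system prompt.
print(max_input_tokens(16000, 500, 1.2))  # 7045
print(max_input_tokens(16000, 500, 2.0))  # 5166
```

Under these assumptions, an output ratio of about 1.2 leads to a chunk size close to the 7,000-token default used in `split_text_to_fit_token_limit`; a more pessimistic ratio calls for smaller chunks.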

Translating in token-sized chunks works, but a problem shows up.

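Because I never told the model what the document was about, the translation occasionally veered off in a strange direction.
To fix this, I had the model first read up to the first 12,000 tokens of the PDF and infer from that what the document is trying to say.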

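The first one to three pages of a PDF usually hold the introduction, the table of contents, and similar material, so they made it easy to work out what the document would go on to discuss.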
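The main code in this post takes its summary input from `pages[0]` only. If you want the input to cover the first few pages, as described above, one possible approach is to gather whole pages until a token budget is reached. This is a sketch, not the code I ran: `take_leading_pages` is a hypothetical helper, and the word-based counter below is a crude stand-in for `len(enc.encode(text))`.

```python
def take_leading_pages(pages, token_budget, count_tokens):
    """Join whole pages from the start until the next page would exceed the budget.

    The first page is always included, even if it alone is over the budget;
    the caller can trim it afterwards.
    """
    picked, used = [], 0
    for page in pages:
        cost = count_tokens(page)
        if picked and used + cost > token_budget:
            break
        picked.append(page)
        used += cost
    return "\n".join(picked)


# Stand-in counter for the demo: one "token" per whitespace-separated word.
def count_words(text):
    return len(text.split())


demo_pages = ["a b c", "d e", "f g h i"]
print(take_leading_pages(demo_pages, 5, count_words))  # prints "a b c" and "d e" on two lines
```

With tiktoken available, passing `lambda t: len(enc.encode(t))` as `count_tokens` and 12000 as the budget would match the numbers used in this post.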

def get_summary(text):
    openai.api_key = "Your-key"
    prompt = "This text is the first page of a technical document. Based on this text, briefly describe what this document is about. [example : This document is part of RFC 9110. RFC 9110 provides the latest specification for HTTP. This document describes the overall architecture of HTTP, a stateless application-level protocol for distributed, collaborative hypertext information systems, establishes common terminology, and defines aspects of the protocol that are shared across all versions.]"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k-0613",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text}
        ]
    )
    return response['choices'][0]['message']['content']

I added the resulting summary to the prompt that requests the translation.
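The prompt in this post is written as an indented triple-quoted f-string, so the leading spaces of its second and third lines end up inside the prompt. A small builder avoids that and also lets the language pair vary. This is just a sketch; `build_system_prompt` is a hypothetical helper, not part of the original script.

```python
def build_system_prompt(summary, source_language="English", target_language="Korean"):
    """Assemble the translation system prompt without stray indentation."""
    lines = [
        f"**Translate from {source_language} to {target_language}.**",
        "For jargon, use the original language.",
        f"Based on the initial summary: {summary.strip()}",
    ]
    return "\n".join(lines)


print(build_system_prompt("This document is part of RFC 9110."))
```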

if __name__ == "__main__":
    ...
    initial_text = split_text_to_fit_token_limit(pages[0], 12000)[0]  # at most the first 12,000 tokens of the first page
    summary = get_summary(initial_text)
    # Update the prompt
    system_prompt = f"""**Translate from English to Korean.**
    For jargon, use the original language.
    Based on the initial summary: {summary}
"""
    print(system_prompt)
    threads = []
    for idx, page in enumerate(pages):
        api_key = API_KEYS[idx % len(API_KEYS)]
        thread = threading.Thread(target=threaded_translation, args=(page, idx, system_prompt, api_key, folder_name))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()

    print("\n[INFO] Translation process completed!")

This is what was added to the system prompt.

system_prompt = f"""**Translate from English to Korean.**
    For jargon, use the original language.
    Based on the initial summary: {summary}
"""

Now let's hand the translation over to GPT.

def translate_text_with_limit(text, system_prompt, api_key, source_language="English", target_language="Korean"):
    openai.api_key = api_key
    global token_count_in_current_minute, request_count_in_current_minute, last_request_time

    estimated_tokens = len(enc.encode(text))

    while estimated_tokens + token_count_in_current_minute > TOKEN_LIMIT_PER_MINUTE:
        time.sleep(0.5)
        elapsed_time = time.time() - last_request_time
        if elapsed_time >= SECONDS_IN_A_MINUTE:
            token_count_in_current_minute = 0

    while request_count_in_current_minute >= REQUEST_LIMIT_PER_MINUTE:
        time.sleep(0.5)
        elapsed_time = time.time() - last_request_time
        if elapsed_time >= SECONDS_IN_A_MINUTE:
            request_count_in_current_minute = 0

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k-0613",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text}
        ]
    )

    token_count_in_current_minute += estimated_tokens
    request_count_in_current_minute += 1
    last_request_time = time.time()
    print("\n[INFO] Translated text (showing beginning of translation):")
    return response['choices'][0]['message']['content'] + "\n\n ------------------- \n\n" + text

I processed the document page by page, and when a single page held too much text, I used the token counter from earlier to split it before processing.
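The per-page splitting can be sketched as a thin wrapper around the splitter and the translation call. `translate_page` and `split_by_words` are hypothetical names for this illustration, and the word-based splitter and upper-casing "translator" are stand-ins so the sketch runs without the API.

```python
def translate_page(page_text, split, translate, token_limit=7000):
    """Translate one page, splitting it first if it is too long for one request."""
    chunks = [chunk for chunk in split(page_text, token_limit) if chunk]
    return "\n".join(translate(chunk) for chunk in chunks)


# Stand-in splitter for the demo: at most `limit` words per chunk.
def split_by_words(text, limit):
    words = text.split()
    return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]


# Stand-in "translation": upper-case each chunk.
print(translate_page("one two three four five", split_by_words, str.upper, token_limit=2))
```

In the real script, `split` would be `split_text_to_fit_token_limit` and `translate` would be a function that calls `translate_text_with_limit` with the system prompt and an API key.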

# Rate-limit settings for the translation function
TOKEN_LIMIT_PER_MINUTE = 60000
REQUEST_LIMIT_PER_MINUTE = 60
SECONDS_IN_A_MINUTE = 60
token_count_in_current_minute = 0
request_count_in_current_minute = 0
last_request_time = 0

The GPT-3.5 API has a fixed number of requests per minute and a maximum number of tokens per minute. To deal with this, I also added logic that throttles the requests.
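One caveat: the counters above are plain globals, and once several threads call `translate_text_with_limit` they are updated without any lock. A thread-safe variant could look roughly like the sketch below. `MinuteRateLimiter` is a hypothetical class written for this post, not something provided by the openai library, and the injectable `clock` and `sleep` arguments exist only to make it easy to test.

```python
import threading
import time
from collections import deque


class MinuteRateLimiter:
    """Sliding-window limit on requests and tokens, safe to share across threads."""

    def __init__(self, max_requests, max_tokens, window=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window = window
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._events = deque()  # (timestamp, tokens) of each admitted request

    def try_acquire(self, tokens):
        """Admit the request and return True if it fits in the current window."""
        with self._lock:
            now = self._clock()
            # Drop requests that have left the window.
            while self._events and now - self._events[0][0] >= self.window:
                self._events.popleft()
            used = sum(t for _, t in self._events)
            if len(self._events) < self.max_requests and used + tokens <= self.max_tokens:
                self._events.append((now, tokens))
                return True
            return False

    def acquire(self, tokens, poll=0.5):
        """Block until the request fits, checking again every `poll` seconds."""
        if tokens > self.max_tokens:
            raise ValueError("request is larger than the per-minute token limit")
        while not self.try_acquire(tokens):
            self._sleep(poll)
```

A single `MinuteRateLimiter(60, 60000)` shared by all threads, with `acquire(estimated_tokens)` called right before each request, would play the role of the two `while` loops. Note that real limits apply per API key, so with several keys you would likely want one limiter per key.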

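The translation is good, but...

The translation is good, but it is far too slow. I chose the GPT-3.5 16k model because it is fast and can take a large amount of text in one instruction, but filling all 16,000 tokens to the brim still makes it quite slow.

To solve this, I keep several keys in a list and process pages concurrently with multiple threads.
Solving that problem, however, creates a new one.

This is a bit of a CS topic, but problems can arise when several threads read and write a single file. The easiest way around it was to give each thread one page and have it create a new file for that page.
- Put simply, it's like several people trying to write on the same sheet of A4 paper: unless you set an order, things go wrong. So I gave each of them their own sheet to write on.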

def create_folder_from_pdf_name(pdf_path):
    folder_name = os.path.splitext(pdf_path)[0]
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)
    return folder_name

# Save the translated text
def save_translated_text_to_folder(translated_text, folder_name, page_num):
    file_name = f"page_{page_num + 1}.txt"  # page numbers start at 0, so add 1
    file_path = os.path.join(folder_name, file_name)
    
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(translated_text)

def threaded_translation(page, idx, system_prompt, api_key, folder_name):
    translated_text = translate_text_with_limit(page, system_prompt, api_key)
    save_translated_text_to_folder(translated_text, folder_name, idx)
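The script starts one thread per page, so a long PDF starts a lot of threads at once. A bounded pool keeps the same "one file per page" idea while limiting concurrency. This is a sketch under assumptions: `translate_pages_to_files` is a hypothetical helper, `translate` is any callable that takes a page and a key, and setting the pool size to the number of keys is just one possible choice.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def translate_pages_to_files(pages, api_keys, translate, folder):
    """Translate pages concurrently; each task writes only its own page file."""
    os.makedirs(folder, exist_ok=True)

    def work(idx, page):
        key = api_keys[idx % len(api_keys)]  # round-robin over the keys
        path = os.path.join(folder, f"page_{idx + 1}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(translate(page, key))
        return path

    # At most len(api_keys) translations run at the same time.
    with ThreadPoolExecutor(max_workers=len(api_keys)) as pool:
        futures = [pool.submit(work, i, p) for i, p in enumerate(pages)]
        return [f.result() for f in futures]  # results in page order


if __name__ == "__main__":
    # Demo with a fake translator so no API call is made.
    with tempfile.TemporaryDirectory() as folder:
        paths = translate_pages_to_files(
            ["first page", "second page"], ["key1"], lambda page, key: page.upper(), folder
        )
        print([os.path.basename(p) for p in paths])
```

Because no two tasks ever share a file, no locking is needed around the writes, which is the same reasoning as the A4 paper analogy above.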

Full code

import PyPDF2
import openai
import time
import re
import tiktoken
import os
import threading

API_KEYS = ["key1", "key2", ...]  # keep several API keys in a list

def get_summary(text):
    openai.api_key = "key"
    prompt = "This text is the first page of a technical document. Based on this text, briefly describe what this document is about. [example : This document is part of RFC 9110. RFC 9110 provides the latest specification for HTTP. This document describes the overall architecture of HTTP, a stateless application-level protocol for distributed, collaborative hypertext information systems, establishes common terminology, and defines aspects of the protocol that are shared across all versions.]"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k-0613",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": text}
        ]
    )
    return response['choices'][0]['message']['content']

def create_folder_from_pdf_name(pdf_path):
    folder_name = os.path.splitext(pdf_path)[0]
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)
    return folder_name

# Extract text from the PDF
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page_num in range(len(reader.pages)):
            text += reader.pages[page_num].extract_text()
    print("\n[INFO] Extracted text from PDF:")
    print(text[:500])  # print only the first 500 characters
    return text


# Split the text at sentence boundaries so each piece fits the token limit
enc = tiktoken.encoding_for_model("gpt-3.5-turbo-16k-0613")

def split_text_to_fit_token_limit(text, token_limit=7000):
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    current_text = ""
    texts = []

    for sentence in sentences:
        potential_text = f"{current_text} {sentence}".strip()
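        if len(enc.encode(potential_text)) > token_limit:
            if current_text:  # skip the empty chunk that appears when the very first sentence is over the limit
                texts.append(current_text.strip())
            current_text = sentence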
        else:
            current_text = potential_text

    if current_text:
        texts.append(current_text.strip())
    print("\n[INFO] Splitted texts for translation (showing first section):")
    # print(texts[0][:100])  # print only the first 100 characters of the first section
    return texts



# Translation function and its rate-limit settings
TOKEN_LIMIT_PER_MINUTE = 60000
REQUEST_LIMIT_PER_MINUTE = 60
SECONDS_IN_A_MINUTE = 60
token_count_in_current_minute = 0
request_count_in_current_minute = 0
last_request_time = 0


def translate_text_with_limit(text, system_prompt, api_key, source_language="English", target_language="Korean"):
    openai.api_key = api_key
    global token_count_in_current_minute, request_count_in_current_minute, last_request_time

    estimated_tokens = len(enc.encode(text))

    while estimated_tokens + token_count_in_current_minute > TOKEN_LIMIT_PER_MINUTE:
        time.sleep(0.5)
        elapsed_time = time.time() - last_request_time
        if elapsed_time >= SECONDS_IN_A_MINUTE:
            token_count_in_current_minute = 0

    while request_count_in_current_minute >= REQUEST_LIMIT_PER_MINUTE:
        time.sleep(0.5)
        elapsed_time = time.time() - last_request_time
        if elapsed_time >= SECONDS_IN_A_MINUTE:
            request_count_in_current_minute = 0

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k-0613",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text}
        ]
    )

    token_count_in_current_minute += estimated_tokens
    request_count_in_current_minute += 1
    last_request_time = time.time()
    print("\n[INFO] Translated text (showing beginning of translation):")
    return response['choices'][0]['message']['content'] + "\n\n ------------------- \n\n" + text

def threaded_translation(page, idx, system_prompt, api_key, folder_name):
    translated_text = translate_text_with_limit(page, system_prompt, api_key)
    save_translated_text_to_folder(translated_text, folder_name, idx)


# Save the translated text
def save_translated_text_to_folder(translated_text, folder_name, page_num):
    file_name = f"page_{page_num + 1}.txt"  # page numbers start at 0, so add 1
    file_path = os.path.join(folder_name, file_name)
    
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(translated_text)


def extract_pages_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        pages = [reader.pages[page_num].extract_text() for page_num in range(len(reader.pages))]
    return pages


# Main code
if __name__ == "__main__":
    pdf_path = input("Enter the path to the PDF file: ")
    folder_name = create_folder_from_pdf_name(pdf_path)

    pages = extract_pages_from_pdf(pdf_path)
    
    # Take the leading text and summarize it
    initial_text = split_text_to_fit_token_limit(pages[0], 12000)[0]  # at most the first 12,000 tokens of the first page
    summary = get_summary(initial_text)
    # Update the prompt
    system_prompt = f"""**Translate from English to Korean.**
    For jargon, use the original language.
    Based on the initial summary: {summary}
"""
    print(system_prompt)
    threads = []
    for idx, page in enumerate(pages):
        api_key = API_KEYS[idx % len(API_KEYS)]
        thread = threading.Thread(target=threaded_translation, args=(page, idx, system_prompt, api_key, folder_name))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()

    print("\n[INFO] Translation process completed!")

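This was the most successful project I worked on during this 6th cohort. It was very useful, and it took very little time to build.

The 3.5-16k version isn't available in ChatGPT, so this project also showed me how useful it is when you need to feed in a lot of content at once.

The 16k version accepts four times as many input tokens as the standard model. That makes it a good fit for tasks like translation, where seeing more text at once helps the model grasp the context.

As an aside, the project also got me thinking about building a paid translator.

On top of that, I also wrote a script that collects the definitions and concepts of functions and terms and organizes them into Markdown.
It was overwhelmingly more convenient than doing it in plain ChatGPT, so I expect to keep working this way.

There was a fair amount of trial and error, but I skipped all of it and showed only the final result.
As always, ChatGPT did most of the coding. I'm not very good at Python...

The prompts, including the trial and error, are linked below.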

https://chat.openai.com/share/695dada4-3602-478a-8c52-031fad7baa11

https://chat.openai.com/share/d7f6d2d1-b8c6-498c-b60b-a8148de1272e

https://chat.openai.com/share/324d3434-d564-4ddc-9b5f-54bd0db03f71
