[문과생도 AI] Building a Research Assistant, Part 2: A Text Collation Bot

#11기문과생도AI

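Hello, this is 바호. For the 11th cohort I'm joining [문과생도 AI] as an auditing participant.

Continuing from the last camp, I'm still working on building my own personal research assistant.

Previous attempt, for reference: #개인 연구비서 만들기! - 검색해서 데이터 긁어와...

바호's research assistant project, part 2!!

Building a bot that collates texts!!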

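I mainly work with classical Chinese (hanmun) texts, and one task I do often is comparing multiple editions of the same book. The technical term for this is 'collation' (교감).
For a given passage of a book, I check how that passage appears in each of the book's editions.
Comparing edition A with edition B, I note things like "same sentence, but this character is written differently" or "these two characters have swapped places." Do that long enough and your eyes give out…
And because it's human work, there are plenty of differences I still miss no matter how carefully I look.

The work compares the text in [원고 내 원문] (the source text as quoted in my manuscript) against the same text in different editions: [ctext], [사고전서] (Siku Quanshu), and [사부비요] (Sibu Beiyao). You can see the differing characters marked in red… this kind of work, repeated endlessly~~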

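What I wanted the collation bot to do

I upload an Excel file and specify two columns (for example, [원고 내 원문] and [사고전서]).
For every row, the bot compares the strings held in the two cells of those columns.
If there is no difference it leaves the row alone; if even one character differs, it colors that row yellow!!
Then I only need to look closely at the yellow rows to see what exactly is different.

For example, it compares red cell with red cell and blue cell with blue cell, and if even one character differs, it colors the cell yellow.
Spaces, punctuation, and the like should be ignored here. For my purposes, only differences in the characters themselves and in their order matter.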
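The "ignore spaces and punctuation" rule can be sketched in a few lines. This is only a minimal illustration, assuming that "characters that matter" means what Python's `\w` pattern matches (letters, digits, CJK ideographs, and the underscore). The two sample cells are made up for the example: the characters come from a row quoted later in this post, and the punctuation was added by hand.

```python
import re


def clean_string(text: str) -> str:
    """Keep only word characters: letters, digits, CJK ideographs, and '_'."""
    return "".join(re.findall(r"\w", text))


# Two hypothetical cells: same characters, different punctuation and spacing.
manuscript = "行法至堅，不以私欲亂所聞。"
other_edition = "行法至堅 不以私欲亂所聞"

# After cleaning, only the characters and their order are compared.
same = clean_string(manuscript) == clean_string(other_edition)
print(same)
```

Because the comparison runs on the cleaned strings, a row is only flagged when the characters themselves or their order differ.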

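Writing the Python code with Grimoire

I took the Python class in #10기문과생도AI, so I had a rough idea of what a 'string' is. So I asked it to turn the sentence in each row into a string and compare the strings.
I tried to spell out the input and output precisely,
and I also gave examples.

After getting a first version of the code from this request, I asked for two more things.
- 1) The columns may be given as letters rather than numbers, e.g. column C and column E.
- 2) The characters to compare are not necessarily alphabetic. I'll mostly be working with classical Chinese or Korean.
It then wrote the code below.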

import pandas as pd
import openpyxl
from openpyxl.styles import PatternFill, Font
import re

def clean_string(s):
    return ''.join(re.findall(r'\w', s))

def column_letter_to_index(col):
    col = col.upper()
    index = 0
    for char in col:
        index = index * 26 + (ord(char) - ord('A')) + 1
    return index - 1  # Convert to 0-based index

def highlight_differences(file_path, x_col, y_col, output_path):
    # Load the Excel file
    df = pd.read_excel(file_path)
    
    # Convert column names to indices if necessary
    if isinstance(x_col, str):
        x_col = column_letter_to_index(x_col)
    if isinstance(y_col, str):
        y_col = column_letter_to_index(y_col)
    
    # Load the workbook and the active worksheet
    wb = openpyxl.load_workbook(file_path)
    ws = wb.active
    
    # Iterate over the rows
    for index, row in df.iterrows():
        x_str = clean_string(str(row.iloc[x_col]))
        y_str = clean_string(str(row.iloc[y_col]))
        
        # Compare characters in the strings
        if x_str != y_str:
            # Highlight the row in y_col
            cell = ws.cell(row=index+2, column=y_col+1)
            cell.fill = PatternFill(start_color="FFFF00", end_color="FFFF00", fill_type="solid")
            
            # Highlight differences in red
            for i, (x_char, y_char) in enumerate(zip(x_str, y_str)):
                if x_char != y_char:
                    if i < len(cell.value):
                        new_value = list(cell.value)
                        new_value[i] = f"{y_char}"
                        cell.value = ''.join(new_value)
                    else:
                        cell.value += f"{y_char}"
                    cell.font = Font(color="FF0000")
                    
    # Save the modified workbook
    wb.save(output_path)

# Example usage
input_file_path = 'input.xlsx'  # Replace with your input file path
output_file_path = 'output.xlsx'  # Replace with your output file path
x_column = 'A'  # Replace with the index or letter of the x column
y_column = 'B'  # Replace with the index or letter of the y column

highlight_differences(input_file_path, x_column, y_column, output_file_path)

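Running this produced the file below.

Accuracy was better than a human's, so I was happy with that. (Occasionally it flagged characters as different that looked the same to me; my guess is that they look alike to the human eye but are subtly different character forms. In my random spot checks I have not yet seen a single case where different characters were reported as the same.)
I had asked it to compare columns x and y and, where the contents differ, paint a yellow background on the y column only; that worked as requested.
However, I had asked for ONLY the characters that differ from column x to be shown in red, and as you can see it turned every character in the cell red. This was problem (1). I'll describe how I solved it in detail later.
Also, some rows have an entirely empty cell in either column x or column y (when that edition lacks the passage). Those rows were still compared and painted yellow as if the strings differed. I wanted rows with an empty cell on either side to be skipped altogether. This was problem (2); I'll also cover how I solved it later.
In any case, I set aside the red-marking and empty-cell-skipping features for the time being and built a Streamlit app on top of what worked so far.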
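On the occasional "same character flagged as different" cases mentioned above: one possible cause (an assumption, not checked against the actual spreadsheet) is that the two cells hold visually identical characters stored under different Unicode code points. Korean-sourced hanja text, for instance, can contain CJK compatibility ideographs. The sketch below uses only the standard library; the helper names `code_points` and `same_after_normalization` are made up for illustration.

```python
import unicodedata


def code_points(text):
    """List each character's code point and Unicode name for inspection."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


def same_after_normalization(x, y, form="NFKC"):
    """Compare two strings after applying a Unicode normalization form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)


compat = "\uF900"   # CJK compatibility ideograph that renders like 豈
unified = "\u8C48"  # the ordinary unified ideograph 豈

print(compat == unified)      # plain string comparison sees two different characters
print(code_points(compat))    # shows the compatibility code point
print(code_points(unified))   # shows the unified code point
print(same_after_normalization(compat, unified))
```

Printing the code points of a flagged pair shows whether this is what happened. Normalizing before comparison would silence such cases, but genuine variant characters such as 爲 and 為 are separate characters and would still be reported, which is probably what a collation wants.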

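Building the 'collation bot' with Streamlit

Specifying the input file path and the output path every time was tedious, so I decided to use Streamlit, which I learned from partner 박정기 at the last camp.
The flow I wanted was this: upload a file in Streamlit, enter which columns to compare, and download the file processed by the code above.

When I ran Streamlit with the first version of the code, my Excel file had several sheets but only the first one was processed and returned. (~~the other sheets vanished..~~)
So I asked for the code to be changed so that the same processing is applied to every sheet.
I hit errors three or four times; each time I pasted the error message to Grimoire and asked it to fix the problem. The multi-sheet code seems to have been somewhat unstable, but Grimoire got there in the end. (~~I wouldn't understand how even if it explained, so I'll skip that..~~)
Here is the resulting code. With it, the Streamlit app ran fine.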

import streamlit as st
import pandas as pd
import openpyxl
from openpyxl.styles import PatternFill
import re
from io import BytesIO
from openpyxl.utils.dataframe import dataframe_to_rows
from copy import copy

def clean_string(s):
    return ''.join(re.findall(r'\w', s))

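def column_letter_to_index(col):
    # st.text_input always returns a string, so a typed number such as "3"
    # would otherwise go through the letter arithmetic and give a wrong index.
    # Assumption: typed numbers are 1-based column positions, as in Excel.
    col = col.strip().upper()
    if col.isdigit():
        return int(col) - 1
    index = 0
    for char in col:
        index = index * 26 + (ord(char) - ord('A')) + 1
    return index - 1  # Convert to 0-based index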

def highlight_differences(df, x_col, y_col):
    # Convert column names to indices if necessary
    if isinstance(x_col, str):
        x_col = column_letter_to_index(x_col)
    if isinstance(y_col, str):
        y_col = column_letter_to_index(y_col)
    
    # Create a new worksheet
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.title = "Sheet1"
    
    # Copy dataframe to worksheet
    for r in dataframe_to_rows(df, index=False, header=True):
        ws.append(r)
    
    # Iterate over the rows
    for index, row in df.iterrows():
        x_str = clean_string(str(row.iloc[x_col]))
        y_str = clean_string(str(row.iloc[y_col]))
        
        # Compare characters in the strings
        if x_str != y_str:
            # Highlight the cell in y_col with yellow background
            cell = ws.cell(row=index+2, column=y_col+1)  # Excel rows are 1-based, pandas rows are 0-based
            cell.fill = PatternFill(start_color="FFFF00", end_color="FFFF00", fill_type="solid")
                    
    return ws

def process_all_sheets(file, x_column, y_column):
    wb = openpyxl.load_workbook(file)
    new_wb = openpyxl.Workbook()
    new_wb.remove(new_wb.active)  # Remove the default sheet

    for sheet_name in wb.sheetnames:
        df = pd.read_excel(file, sheet_name=sheet_name)
        ws = highlight_differences(df, x_column, y_column)
        new_ws = new_wb.create_sheet(title=sheet_name)

        for row in ws.iter_rows():
            for cell in row:
                new_cell = new_ws[cell.coordinate]
                new_cell.value = cell.value
                if cell.has_style:
                    new_cell.font = copy(cell.font)
                    new_cell.border = copy(cell.border)
                    new_cell.fill = copy(cell.fill)
                    new_cell.number_format = copy(cell.number_format)
                    new_cell.protection = copy(cell.protection)
                    new_cell.alignment = copy(cell.alignment)

    output = BytesIO()
    new_wb.save(output)
    output.seek(0)
    return output

# Streamlit UI
st.title("Excel Column Comparison Tool")

uploaded_file = st.file_uploader("Choose an Excel file", type="xlsx")
x_column = st.text_input("Enter the X column (number or letter)")
y_column = st.text_input("Enter the Y column (number or letter)")

if st.button("Process File"):
    if uploaded_file and x_column and y_column:
        # Process all sheets in the uploaded file
        output = process_all_sheets(uploaded_file, x_column, y_column)
        
        # Provide the download link
        st.download_button(
            label="Download Processed File",
            data=output,
            file_name="processed_output.xlsx",
            mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
        )

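Success so far! Even this much takes a lot of the drudgery out of the work, so I was satisfied.
One thing I learned at the last camp is that when you build something, you have to break it into steps!!
- So rather than trying to implement every feature at once, I focused on getting the core feature working, even if that meant giving up a few others for now.
- After that, I added the extra things I wanted one at a time as separate pieces of code and had Grimoire review them again. That is how the further development(?) went.
- Splitting the process up like this meant that even when errors appeared and the steps got tangled, I never ended up giving up because I didn't know where to start. Getting output along the way gave me a sense of achievement, which kept me going through the repeated errors.
Below I'll explain how I solved the two remaining problems (marking differing characters in red, and skipping rows with an empty cell).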

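Problem (1): I asked for only the differing characters to be marked, but every character in the cell turned red

I asked again, very specifically, with examples.
Despite such a considerate request, the result was the same…. I asked it to explain the code, and I copied the part that was supposed to color the differing characters red and pressed(?) it on why this wasn't working. The answer, in the end, was that the library this code uses does not offer any way to color individual characters within a cell. ~~(it could have told me that earlier..)~~

I tried the new approach Grimoire suggested, but it still didn't quite work and got too complicated. After some despair, I changed my way of thinking(?!).
- Forget colors. What I need is to find the differing characters, so let's just ask it to put brackets around them. Surely it can do that much?
- And it did, nicely!!! (screenshot at the very end)
I wanted this to be an optional feature in Streamlit: unless I tick a box, I get a file with only the yellow background, and on request the differing characters are also marked with brackets. So I first got the Streamlit code for the basic feature, then made the additional request shown below.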

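Problem (2): I asked it not to compare rows with an empty cell, but it kept comparing them and applying the yellow fill / brackets

I asked for columns 3 and 6 to be compared, and as you can see the cell in column 6 is empty. The row should simply be skipped, but it kept inserting odd characters like that and coloring the cell.
I wrestled with this for a very long time. To show how the struggle went, I'll paste the questions I put to Grimoire, in order…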

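- When comparing a given row of columns x and y, if either cell is completely empty, don't do the comparison; and since nothing was compared, don't apply the yellow background or the [] wrapping either. Please change the code accordingly.

- It's still running the comparison on rows where one of columns x and y has no string?

- "The function has been modified so that when a row of column x or column y is empty, no comparison is made and no yellow background or [ ] wrapping is applied" -> it has not been modified

- I keep getting errors. Let's change it like this. First build the strings for columns x and y, using clean_string. Then, when comparing the strings, if either one is blank with no characters in it at all, rewrite the code so that no processing happens.

- How is clean_string handling the case where a cell is empty? How is that value stored?

- What is going wrong? Give me this code with detailed comments

- When there is only blank space, is it by any chance being treated as the string 'a' or 'n'?

- Apply clean_string to each row, then print and return the values

- When the cell is empty the string comes out as 'nan'? : Row 125: x_str='行法至堅不以私欲亂所聞如是則可謂勁士矣', y_str='nan'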

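Can you see the struggle….😂 It was hard going, but I feel I got to experience a 'debugging' process of my own, so I'm sharing it. To see what wasn't working I asked for comments on the code, and to see what was happening at the point of failure I asked it to print intermediate values. Then I handed those printed values to Grimoire and had it diagnose the problem.
And then, at. long. last., it diagnosed the problem. (~~if that's all it was, why not say so from the start.. so smart and yet utterly clueless..~~)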
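The printed row above (`y_str='nan'`) points at the cause: pandas represents an empty cell as a floating-point NaN, and calling `str()` on it produces the three-letter text `nan`, so the "empty" side was never actually empty. The sketch below reproduces this with the standard library only; `cell_to_text` is a hypothetical helper that plays the same role as the `pd.notna(...)` guard in the final code.

```python
import math


def cell_to_text(value):
    """Return '' for an empty cell (None or float NaN), else the stripped text."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value).strip()


empty_cell = float("nan")          # what an empty Excel cell becomes in a DataFrame

naive = str(empty_cell)            # the text 'nan', which looks like real content
guarded = cell_to_text(empty_cell) # '' so the row can be skipped

print(repr(naive), repr(guarded))
```

With the guard in place, a check such as `if x_str == '' or y_str == '': continue` finally has something to catch.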

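After hardship comes the reward!

Anyway, with both problems solved, I got the final code below and built the final Streamlit app!!
I also added a title and a description to the Streamlit app, because I'd like to polish it a bit more later and share it with fellow researchers!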

import streamlit as st
import pandas as pd
import openpyxl
from openpyxl.styles import PatternFill
import re
from io import BytesIO
from openpyxl.utils.dataframe import dataframe_to_rows
from copy import copy


def clean_string(s):
    # Keep only word characters (letters, digits, CJK ideographs, underscore); spaces and punctuation are dropped
    return ''.join(re.findall(r'\w', s))


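def column_letter_to_index(col):
    # Convert an Excel column letter ("C") or a typed number ("3") to a 0-based index.
    # st.text_input always returns a string, so digits need their own branch.
    # Assumption: typed numbers are 1-based column positions, as in Excel.
    col = col.strip().upper()
    if col.isdigit():
        return int(col) - 1
    index = 0
    for char in col:
        index = index * 26 + (ord(char) - ord('A')) + 1
    return index - 1  # Convert to 0-based index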


def highlight_differences(df, x_col, y_col, add_brackets):
    # Convert column names to indices if necessary
    if isinstance(x_col, str):
        x_col = column_letter_to_index(x_col)
    if isinstance(y_col, str):
        y_col = column_letter_to_index(y_col)

    # Create a new worksheet
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.title = "Sheet1"

    # Copy dataframe to worksheet
    for r in dataframe_to_rows(df, index=False, header=True):
        ws.append(r)

    # Create a list to store the cleaned strings for debugging purposes
    cleaned_strings = []

    # Iterate over the rows
    for index, row in df.iterrows():
        x_cell = str(row.iloc[x_col]).strip() if pd.notna(row.iloc[x_col]) else ''  # Handle NaN
        y_cell = str(row.iloc[y_col]).strip() if pd.notna(row.iloc[y_col]) else ''  # Handle NaN

        x_str = clean_string(x_cell)  # Remove non-alphanumeric characters from x cell
        y_str = clean_string(y_cell)  # Remove non-alphanumeric characters from y cell

        # Add the cleaned strings to the list for debugging
        cleaned_strings.append((x_str, y_str))

        # Skip comparison if either cleaned string is empty
        if x_str == '' or y_str == '':
            continue

        # Compare characters in the strings
        if x_str != y_str:
            # Highlight the cell in y_col with yellow background
            cell = ws.cell(row=index + 2, column=y_col + 1)  # Excel rows are 1-based, pandas rows are 0-based
            cell.fill = PatternFill(start_color="FFFF00", end_color="FFFF00", fill_type="solid")

            if add_brackets:
                new_x_str = ""
                new_y_str = ""
                # Iterate through the characters of both strings
                for i in range(max(len(x_str), len(y_str))):
                    if i < len(x_str) and i < len(y_str) and x_str[i] != y_str[i]:
                        # If characters differ, add brackets around them
                        new_x_str += f'[{x_str[i]}]'
                        new_y_str += f'[{y_str[i]}]'
                    else:
                        # If characters are the same, add them without brackets
                        if i < len(x_str):
                            new_x_str += x_str[i]
                        if i < len(y_str):
                            new_y_str += y_str[i]

                # Update the worksheet with the new strings
                ws.cell(row=index + 2, column=x_col + 1).value = new_x_str
                ws.cell(row=index + 2, column=y_col + 1).value = new_y_str

    # Print cleaned strings for debugging
    for idx, (x_clean, y_clean) in enumerate(cleaned_strings):
        print(f"Row {idx + 1}: x_str='{x_clean}', y_str='{y_clean}'")

    return ws


def process_all_sheets(file, x_column, y_column, add_brackets):
    wb = openpyxl.load_workbook(file)  # Load the input workbook
    new_wb = openpyxl.Workbook()  # Create a new workbook
    new_wb.remove(new_wb.active)  # Remove the default sheet

    for sheet_name in wb.sheetnames:
        df = pd.read_excel(file, sheet_name=sheet_name)  # Load each sheet into a dataframe
        ws = highlight_differences(df, x_column, y_column, add_brackets)  # Process the dataframe
        new_ws = new_wb.create_sheet(title=sheet_name)  # Create a new sheet in the new workbook

        for row in ws.iter_rows():
            for cell in row:
                new_cell = new_ws[cell.coordinate]
                new_cell.value = cell.value
                if cell.has_style:
                    # Copy cell style
                    new_cell.font = copy(cell.font)
                    new_cell.border = copy(cell.border)
                    new_cell.fill = copy(cell.fill)
                    new_cell.number_format = copy(cell.number_format)
                    new_cell.protection = copy(cell.protection)
                    new_cell.alignment = copy(cell.alignment)

    output = BytesIO()  # Save the new workbook to a BytesIO object
    new_wb.save(output)
    output.seek(0)
    return output


# Streamlit UI
st.title("교감봇 (Collation Bot)")

# Add an expandable section for the program description
with st.expander("What is 교감봇?"):
    st.write("""
        - 교감봇 checks whether the text in two columns is identical, across every sheet of an Excel file.
        - Columns can be specified by letter or by number.
        - For the specified columns, it checks each row and, where the text differs, highlights the cell in the second column with a yellow background.
        - Differences in spacing, punctuation, and the like are ignored; only differences in the characters and their order are checked. (The underscore '_' is an exception and is included in the comparison.)
        - Optionally, differing characters can be marked with square brackets.
    """)

uploaded_file = st.file_uploader("Upload an Excel file", type="xlsx")
x_column = st.text_input("Enter the first column to compare (number or letter)")
y_column = st.text_input("Enter the second column to compare (number or letter)")
add_brackets = st.checkbox("Mark differing characters with square brackets")

if st.button("Run collation"):
    if uploaded_file and x_column and y_column:
        # Process all sheets in the uploaded file
        output = process_all_sheets(uploaded_file, x_column, y_column, add_brackets)

        # Provide the download link
        st.download_button(
            label="Download collated file",
            data=output,
            file_name="processed_output.xlsx",
            mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
        )
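One limitation worth noting: the bracket option in `highlight_differences` compares the two strings position by position, so a single inserted or missing character shifts everything after it and most of the tail ends up bracketed. A possible alternative, sketched here under the assumption that an alignment-based diff is acceptable, is the standard library's `difflib.SequenceMatcher`. The function name `bracket_differences` and the variant reading with an extra 而 are invented for illustration.

```python
from difflib import SequenceMatcher


def bracket_differences(x, y):
    """Return (x_marked, y_marked) with each differing run wrapped in brackets."""
    x_out, y_out = [], []
    matcher = SequenceMatcher(None, x, y, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        x_part, y_part = x[i1:i2], y[j1:j2]
        if tag == "equal":
            x_out.append(x_part)
            y_out.append(y_part)
        else:
            # 'replace', 'delete' or 'insert': bracket whichever side has text
            if x_part:
                x_out.append(f"[{x_part}]")
            if y_part:
                y_out.append(f"[{y_part}]")
    return "".join(x_out), "".join(y_out)


x_str = "行法至堅不以私欲亂所聞"
y_str = "行法至堅而不以私欲亂所聞"  # hypothetical variant with one extra character

x_marked, y_marked = bracket_differences(x_str, y_str)
print(x_marked)
print(y_marked)
```

Here only the extra character is bracketed in the second string, while the first string stays unmarked. Swapping this in would mean replacing the character loop under `if add_brackets:` with one call to the function.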

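Next steps

Add the ability to compare several column pairs at once! For example, compare column A with column C and column A with column E in a single run. I'll need to think about how to tell the brackets apart in that case.
If possible, take the time to look for a way to color the differing characters instead of using brackets.
Find a way to guide fellow researchers through running the Python file containing the Streamlit code (교감봇.py) from a terminal on their own computers. If I tell them to install this and that, the barrier to using the bot will be high. (For what it's worth, I run it from the terminal inside a PyCharm project. For some reason it doesn't work when I run it directly in a terminal outside PyCharm.. I'm not computer-savvy enough to have figured out why.)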
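On the second item above (coloring only the differing characters): recent openpyxl releases (3.1 and later, as far as I know) provide rich-text cells through `CellRichText`, `TextBlock`, and `InlineFont`, which may make per-character color possible after all. The sketch below is untested against the actual spreadsheets and rests on that version assumption. `diff_segments` and `to_rich_text` are hypothetical helper names, and openpyxl is imported lazily so the diffing part runs on its own.

```python
from difflib import SequenceMatcher


def diff_segments(reference, target):
    """Split `target` into (text, differs) runs relative to `reference`."""
    segments = []
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if j2 > j1:  # skip pure deletions: nothing in `target` to color
            segments.append((target[j1:j2], tag != "equal"))
    return segments


def to_rich_text(segments, color="FF0000"):
    """Build a rich-text cell value in which differing runs are colored."""
    # Imported here on purpose; assumes openpyxl 3.1 or newer is installed.
    from openpyxl.cell.rich_text import CellRichText, TextBlock
    from openpyxl.cell.text import InlineFont

    font = InlineFont(color=color)
    parts = [TextBlock(font, text) if differs else text for text, differs in segments]
    return CellRichText(*parts)


segments = diff_segments("abcd", "abXd")
print(segments)
```

In the app this would be used roughly as `ws.cell(row=index + 2, column=y_col + 1).value = to_rich_text(diff_segments(x_str, y_str))`. Characters that exist only in the reference column leave nothing to color in the target cell, so they would need a separate marker.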

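Any advice on these next steps would be hugely appreciated!!

Thank you for reading this long write-up~~

Hoping to see you again in the not-too-distant future with part 3 of 바호's research assistant project…😍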
