Testing MusicGen after training it on a specific genre

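Excellent music LLM services such as SunoV3 and Udio have come out,

and I was happy with them, but a few things left me wanting.

Producing a coherent four-minute song with a verse-chorus structure is still hard for them.

The chorus does not hit its peak the way a human performance does.

I was curious what would come out if I took a model and trained it myself, so I tried it.

Ideally, I imagined training the same model on different dataset chunks, once as a verse model and once as a chorus model, and then having a main model take the verse and chorus and somehow stitch them together.

With the time I had, I only trained on the verse part. Even that took an enormous amount of trial and error and time, probably because it was my first attempt.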

0) Choosing a model

Between suno-bark and MusicGen, the MusicGen demos sounded better to me, so I picked MusicGen.

(Later on, MusicLM sounded even better to me, and I regret not choosing it.)

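1) Preparing the data

I picked one long ballad compilation video from YouTube.

https://www.youtube.com/watch?v=-sTA2LkPwmU

Extracting only the WAV

Download a program called 4K YouTube to MP3.
https://www.4kdownload.com/products/youtubetomp3-72

Other free online services will not handle long videos unless you pay.

The free services also throw up viruses and dozens of ads.

4K YouTube to MP3 was the best option. It was free for up to 6 videos a day.

The extracted WAV is over three hours long and contains many songs.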

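Splitting into tracks

Doing this by hand is technically possible, but it is better to treat it as impossible.

Googling turns up many similar questions, and several answers recommend Audacity,

so download and install Audacity.
https://www.audacityteam.org/

Load the extracted WAV,

press Ctrl+A to select everything,

and open the Label Sounds tool (사운드 레이블).
When the dialog appears, apply it with the default settings. If the tracks are not cut well,

look up a YouTube tutorial on splitting into multiple tracks.

The idea is to cut wherever a region with a very low dB level lasts for at least a few seconds.

This gave me about 90 pieces. After discarding the ones that were too short

and keeping only the reasonable ones, about 80 remained.

Export with the multiple-files option (Export Multiple),

and the .wav files are written out one after another.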
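
The cutting rule above (split where a very quiet region lasts a few seconds) can be sketched in a few lines of Python. This is only an illustration of the idea, not what Audacity runs internally: the function name `find_split_points`, the -40 dB threshold, the 100 ms frame size, and the 2-second minimum are all assumed values, and the input is assumed to be a list of per-frame loudness readings in dB.

```python
def find_split_points(levels_db, frame_ms=100, threshold_db=-40.0, min_silence_ms=2000):
    """Return split times (ms) at the midpoint of each long-enough quiet run.

    levels_db holds one loudness value (dB) per frame of `frame_ms` milliseconds.
    All default values are illustrative assumptions, not Audacity's settings.
    """
    split_points = []
    run_start = None  # index of the first frame of the current quiet run
    for i, level in enumerate(levels_db):
        if level < threshold_db:
            if run_start is None:
                run_start = i
            continue
        if run_start is not None:
            run_ms = (i - run_start) * frame_ms
            # Ignore leading silence (run_start == 0) and runs that are too short.
            if run_start > 0 and run_ms >= min_silence_ms:
                split_points.append((run_start + i) * frame_ms // 2)
            run_start = None
    # A quiet run that reaches the end of the audio is ignored as trailing silence.
    return split_points


# Two loud passages separated by 3 s of near-silence, then a short 0.5 s dip.
levels = [-10.0] * 5 + [-60.0] * 30 + [-12.0] * 5 + [-60.0] * 5 + [-10.0] * 5
print(find_split_points(levels))
```

Only the 3-second gap produces a cut; the 0.5-second dip is shorter than the minimum and is treated as part of the song, which is why raising the minimum duration helps when quiet passages inside a song get split by mistake.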

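2) Preprocessing the dataset

For training in this setup, MusicGen needs every input to be at least 30 seconds long.

So, from the split files,

I set aside everything shorter than 30 seconds

and collected only the files of 30 seconds or more.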
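
Sorting the files by length does not have to be done by hand. A minimal sketch using only the standard library is shown here; the helper names `wav_duration_seconds` and `split_by_min_duration` are made up for illustration, and the code assumes the exported files are plain PCM WAV (the `wave` module does not read 32-bit float WAV).

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, computed from frame count and sample rate."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def split_by_min_duration(wav_dir, min_seconds=30.0):
    """Return (long_enough, too_short) lists of the .wav paths found in wav_dir."""
    long_enough, too_short = [], []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        if wav_duration_seconds(path) >= min_seconds:
            long_enough.append(path)
        else:
            too_short.append(path)
    return long_enough, too_short
```

Calling `split_by_min_duration("raw")` would return the files to keep and the files to move aside, assuming the tracks were exported into a folder named raw.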

git clone https://github.com/chavinlo/musicgen_trainer.git

Get train_musicGen.ipynb from it, or just use the link I used below.

Google Colaboratory

I modified the part of that code that chops up the WAV files to suit my needs.


import os

from pydub import AudioSegment


def process_audio(file_path, output_dir, global_count, segment_length=30):
    print("global_count : ", global_count)
    # Load audio file
    audio = AudioSegment.from_file(file_path)

    # Get file name without extension for caption
    file_name = os.path.splitext(os.path.basename(file_path))[0]

    # Convert segment length to milliseconds
    segment_length_ms = segment_length * 1000

    # Set the sample rate to 32000 Hz
    audio = audio.set_frame_rate(32000)

    # Calculate the number of segments
    num_segments = (len(audio) + segment_length_ms - 1) // segment_length_ms
    print(num_segments)
    
    # Loop through segments
    # First 30 sec
    first_30sec = True
    for i in range(num_segments):
        # Get start time for the segment
        start_time = i * segment_length_ms

        # If this is the last segment, shift it back so it is a full window
        # (clamped at 0 in case the clip is shorter than one segment)
        if i == num_segments - 1:
            start_time = max(0, len(audio) - segment_length_ms)

        # Get end time for the segment
        end_time = start_time + segment_length_ms

        # Extract the segment
        segment = audio[start_time:end_time]

        # Save the segment
        # segment.export(os.path.join(output_dir, f'segment_{i:03d}.wav'), format='wav')
        if first_30sec == True:
            segment.export(os.path.join(output_dir, f'verse1_{global_count}.wav'), format='wav')
        else:
            segment.export(os.path.join(output_dir, f'segment_{global_count}.wav'), format='wav')

        # Save the caption
        # with open(os.path.join(output_dir, f'segment_{i:03d}.txt'), 'w') as f:
        if first_30sec == True:
            with open(os.path.join(output_dir, f'verse1_{global_count}.txt'), 'w') as f:  
                f.write(file_name)
        else:
            with open(os.path.join(output_dir, f'segment_{global_count}.txt'), 'w') as f:  
                f.write(file_name)

        global_count += 1
        first_30sec = False
    
    return global_count

# Directory setup
output_directory = 'output'
samples_directory = 'raw'

# Check if the output directory exists, if not, create it
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

# Iterate through the files in the "raw" directory
global_count = 0
for file_name in os.listdir(samples_directory):
    if file_name.endswith('.wav') or file_name.endswith('.mp3'):
        file_path = os.path.join(samples_directory, file_name)
        global_count = process_audio(file_path, output_directory, global_count, segment_length=30)

This loads every WAV and cuts it every 30 seconds,

writing the verse (the first 30 seconds) and everything else under separate file names.
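
To make the cutting behavior concrete, here is a small sketch that reproduces only the window arithmetic from the loop, without touching any audio. The function name `segment_bounds` is made up for this illustration, and the clamping with `max` and `min` is an added safety assumption for clips shorter than one window.

```python
def segment_bounds(total_ms, segment_ms=30_000):
    """Return the (start_ms, end_ms) windows that the splitting loop would cut."""
    num_segments = (total_ms + segment_ms - 1) // segment_ms  # ceiling division
    bounds = []
    for i in range(num_segments):
        start = i * segment_ms
        if i == num_segments - 1:
            # The last window is shifted back so that it is still a full segment.
            start = max(0, total_ms - segment_ms)
        bounds.append((start, min(start + segment_ms, total_ms)))
    return bounds


# A 70-second track yields three 30-second windows.
print(segment_bounds(70_000))
```

For a 70-second track the last window starts at 40 seconds, so it repeats 20 seconds of the second window. Only the first window of each track is saved with the verse1 name.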

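Each 30-second WAV gets a companion txt file, which is used as its caption. At this point it holds only the source file name.

Filling in every caption by hand to match the music would be best, but for lack of time I filled them in with code, using the rules below.

Verse captions must contain the keyword verse1, plus 5 words that suit a ballad. The remaining segments get the keyword Chorus instead.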

import os
import random


# Function to read caption files and rewrite them with random keywords
def process_files(prefix):
    file_list = [file for file in os.listdir('.')
                 if file.startswith(prefix) and file.endswith('.txt')]

    # Earlier Korean word list, left commented out:
    # random_strings = ["물결", "이별", "그리움", "사랑", "추억", "눈물", "아픔", "끝", "떠나다", "기억",
    #     "바람", "가슴", "슬픔", "비", "울다", "행복", "빛", "눈부심", "느낌", "서글픔",
    #     "그림자", "소중함", "희망", "달콤함", "아침", "저녁", "세월", "설렘", "미소", "온기",
    #     "노래", "향기", "기다림", "무게", "허전함", "한잔", "밤", "삶", "무한", "첫사랑",
    #     "용기", "숨", "동화", "영원", "가을", "겨울", "봄", "여름", "빗속", "길",
    #     "헤매다", "발걸음", "너머", "바닷가", "섬", "별빛", "새벽", "구름", "창가", "달",
    #     "약속", "붉은색", "마음", "소리", "햇살", "추억", "금지", "느낌", "사람", "희망",
    #     "사랑스럽다", "기다리다", "손", "달콤하다", "자유", "아픈", "어둠", "미련", "지나다", "미래",
    #     "혼자", "안녕", "마주치다", "눈빛", "소리", "얼굴", "시간", "부드럽다", "떠나가다", "생각",
    #     "잊다", "가깝다", "가슴속", "아프다", "편안하다", "속삭임", "그리다", "분명하다", "사랑받다", "잊혀지다",
    #     "가장", "부드럽다", "돌아가다", "다시", "함께", "이뤄지다", "떠오르다", "나타나다", "무릎", "가슴뛰다",
    #     "눈부시다", "따뜻하다"]
    random_strings = [
        "Wave", "separation", "longing", "love", "memories", "tears", "pain", "end", "depart", "memory",
        "wind", "chest", "sadness", "rain", "cry", "happiness", "light", "brightness", "feeling", "melancholy",
        "shadow", "preciousness", "hope", "sweetness", "morning", "evening", "time", "excitement", "smile", "warmth",
        "song", "fragrance", "waiting", "weight", "emptiness", "drink", "night", "life", "infinity", "first love",
        "courage", "breath", "fairy tale", "eternity", "autumn", "winter", "spring", "summer", "rainstorm", "road",
        "wander", "step", "beyond", "beach", "island", "starlight", "dawn", "cloud", "window", "moon",
        "promise", "red", "heart", "sound", "sunshine", "people", "farewell", "lovely", "waiting", "hand",
        "sweet", "freedom", "pain", "darkness", "regret", "pass", "future", "alone", "goodbye", "encounter",
        "gaze", "face", "time", "gentle", "leave", "thought", "forget", "close", "within", "hurt",
        "whisper", "draw", "clear", "beloved", "forgotten", "most", "soft", "return", "again", "together",
        "fulfill", "arise", "appear", "knee", "heartbeat", "dazzling", "warm",
    ]

    for file_name in file_list:
        # Read the raw bytes once so the fallback decoding sees the same data
        with open(file_name, 'rb') as file:
            raw = file.read()
        try:
            content = raw.decode('utf-8')
        except UnicodeDecodeError:
            try:
                content = raw.decode('cp949')  # Try a different encoding (e.g., cp949)
            except UnicodeDecodeError:
                print(f"Error decoding file '{file_name}'")
                continue

        content = content.strip()
        random_selection = random.sample(random_strings, 5)
        # First line: keyword plus 5 random words; second line: the original content
        if prefix == "verse1":
            output = f"{prefix} {' '.join(random_selection)}\n{content}\n"
        else:
            output = f"Chorus {' '.join(random_selection)}\n{content}\n"

        with open(file_name, 'w', encoding='utf-8') as output_file:
            output_file.write(output)


# Process verse files
process_files("verse1")
# Process segment files
process_files("segment")

At last, preprocessing is done.

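3) Environment

I lost more than 4 hours to compatibility issues between PyTorch, CUDA, and audiocraft.

audiocraft is required, and it only works with PyTorch 2.1.0.

So you need to find and install the PyTorch 2.1.0 build that matches your CUDA version.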

https://pytorch.org/get-started/previous-versions/

# CUDA 11.8
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# CUDA 12.1
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
# CPU Only
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 cpuonly -c pytorch

Install whichever one fits your setup.
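
After installing, it can save time to confirm what actually ended up in the environment before starting a long run. The following is a rough sketch: the helper names `matches_required_torch` and `describe_environment` are made up for illustration, and the check assumes that version strings may carry a build tag such as `+cu118` that should be ignored when comparing.

```python
def matches_required_torch(version_string, required="2.1.0"):
    """True if a torch version string such as '2.1.0+cu118' is the required release."""
    release = version_string.split("+", 1)[0]  # drop a local build tag like +cu118
    return release == required


def describe_environment():
    """Print the installed torch version and whether CUDA is usable."""
    import torch  # imported here so the helper above also works without torch

    print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
    print("version ok:", matches_required_torch(torch.__version__))
    print("CUDA available:", torch.cuda.is_available())
```

Running `describe_environment()` in the training environment shows at a glance whether the version pin held and whether the GPU is visible to PyTorch.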

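4) Training

Even though MusicGen small is a small model of about 300M parameters, it would not run on my GTX 1660 SUPER with 16GB RAM.

It died with a CUDA out-of-memory error.

It was deflating, but I went straight to paid Colab.

I really do not like Colab.

An A100 is almost never available, and when you settle for another GPU,

the session disconnects if you step away even briefly, and anything not saved to Google Drive is gone.

You burn compute hours and end up with nothing.

Training on 350 WAV files takes several hours. I stepped away for a few hours, the session dropped, and everything was lost 🤬

So I trained on just 50 verse clips.

Training does work on the L4 GPU with 22GB.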
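
One way to limit the damage from a dropped session is to copy checkpoint files to Google Drive as soon as they are written. The sketch below only shows the copy step. The helper name `back_up_checkpoint` and the backup folder name are assumptions for illustration, and whether the trainer writes intermediate checkpoints that could be backed up mid-run is not something verified here.

```python
import shutil
from pathlib import Path


def back_up_checkpoint(checkpoint_path, backup_dir):
    """Copy a checkpoint file into backup_dir and return the destination path."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    destination = backup_dir / Path(checkpoint_path).name
    shutil.copy2(checkpoint_path, destination)
    return destination


# In Colab, Drive would be mounted first, for example:
#   from google.colab import drive
#   drive.mount('/content/drive')
#   back_up_checkpoint('models/lm_final.pt', '/content/drive/MyDrive/musicgen_backup')
```

Files under the mounted Drive folder survive a session reset, unlike files on the Colab VM's local disk.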

After many twists and turns, training finally finished. The base model is loaded and configured like this:

model = musicgen.MusicGen.get_pretrained('small', device='cuda')

model.set_generation_params(duration=8)

5) Listening

I loaded the trained weights into the model as shown below

model.lm.load_state_dict(torch.load('models/lm_final.pt'))

and generated music with the trained LM.
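
Putting the pieces together, an end-to-end generation step might look roughly like the sketch below. It is not the exact code used for this post: `build_prompt` and `generate_clip` are hypothetical helper names, the output file stem is made up, and the prompt simply mirrors the first line of the training captions (a keyword followed by a few words). The audiocraft calls are the ones shown in this post plus `generate` and `audio_write` from the audiocraft library.

```python
def build_prompt(prefix, keywords):
    """Mirror the first line of a training caption: a prefix followed by keywords."""
    return f"{prefix} {' '.join(keywords)}"


def generate_clip(prompt, weights_path="models/lm_final.pt",
                  out_stem="verse_sample", duration=8):
    """Generate one clip with the fine-tuned weights and save it as <out_stem>.wav."""
    # Heavy imports live inside the function so build_prompt works without a GPU stack.
    import torch
    from audiocraft.data.audio import audio_write
    from audiocraft.models import musicgen

    model = musicgen.MusicGen.get_pretrained("small", device="cuda")
    model.lm.load_state_dict(torch.load(weights_path))
    model.set_generation_params(duration=duration)

    wav = model.generate([prompt])  # tensor shaped [batch, channels, samples]
    audio_write(out_stem, wav[0].cpu(), model.sample_rate, strategy="loudness")


# Example (needs a CUDA GPU and the trained weights file):
#   generate_clip(build_prompt("verse1", ["love", "rain", "night", "tears", "memory"]))
```

Using a prompt in the same shape as the training captions gives the fine-tuned keyword (verse1) the best chance of having any effect.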

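Models like this are presumably trained on tens of thousands of songs,

so the result of training on 50 clips was dismal. There was no big change, and no lyrics came out either.

The value was in the trial-and-error experience of going through one full training cycle.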
