테크매니아

Optimizing FakeFace

SciomageLAB 2024. 10. 20. 16:47

Overview

GAN

A GAN model that generates faces

GAN (Generative Adversarial Network)

FakeFace

GitHub Repo
Style-Based GAN in PyTorch
128px resolution. The stylegan-256px-celeba-550000.model file is 686MB.

Built on top of style-based-gan-pytorch.

The Generator outputs randomly generated faces like this.

Model analysis

torch.save(
    {
        'generator': generator.module.state_dict(),
        'discriminator': discriminator.module.state_dict(),
        'g_optimizer': g_optimizer.state_dict(),
        'd_optimizer': d_optimizer.state_dict(),
        'g_running': g_running.state_dict(),
    },
    f'checkpoint/{str(i + 1).zfill(6)}.model',
)

This is how train.py saves a checkpoint.
It stores the state_dict of the generator and the discriminator, plus the state of each optimizer.
All of it is saved because all of it is needed for transfer learning or for resuming training.
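
As a rough illustration of why each entry matters, the sketch below saves and restores a checkpoint with the same layout. The tiny nn.Linear models, the Adam optimizers, and the in-memory buffer are stand-ins assumed for this example; they are not the networks or settings used in train.py.

```python
import io

import torch
from torch import nn, optim

# Stand-ins (assumptions): tiny layers instead of the real generator/discriminator.
generator = nn.Linear(4, 4)
discriminator = nn.Linear(4, 1)
g_optimizer = optim.Adam(generator.parameters(), lr=1e-3)
d_optimizer = optim.Adam(discriminator.parameters(), lr=1e-3)

# Take one optimizer step so the optimizer holds internal state worth saving.
loss = discriminator(generator(torch.randn(2, 4))).mean()
loss.backward()
g_optimizer.step()

# Save to an in-memory buffer instead of a file under checkpoint/.
buffer = io.BytesIO()
torch.save(
    {
        'generator': generator.state_dict(),
        'discriminator': discriminator.state_dict(),
        'g_optimizer': g_optimizer.state_dict(),
        'd_optimizer': d_optimizer.state_dict(),
    },
    buffer,
)

# Resume: rebuild the objects first, then restore weights and optimizer state.
buffer.seek(0)
checkpoint = torch.load(buffer)
resumed_generator = nn.Linear(4, 4)
resumed_generator.load_state_dict(checkpoint['generator'])
resumed_g_optimizer = optim.Adam(resumed_generator.parameters(), lr=1e-3)
resumed_g_optimizer.load_state_dict(checkpoint['g_optimizer'])

weights_match = torch.equal(resumed_generator.weight, generator.weight)
optimizer_state_restored = len(resumed_g_optimizer.state_dict()['state']) > 0
print(weights_match, optimizer_state_restored)
```

Without the optimizer entries, training could still restart from the saved weights, but the optimizer's accumulated statistics would start from scratch.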

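generator = StyledGenerator(512).to(device)
checkpoint = torch.load(args.path, map_location=device)
# A training checkpoint holds several state_dicts; a generator-only file is the state_dict itself
state_dict = checkpoint["g_running"] if "g_running" in checkpoint else checkpoint
generator.load_state_dict(state_dict)
generator.eval()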

Only the generator is used to produce faces, so I looked closely at generate.py.
generate.py loads the model as shown above. To visualize the model, I saved it again with torch.save, which gives the following.

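Saving only the generator to generator.pth produces a 100MB file.
The structure shows up as blocks like this. When only the .pth is saved, just the values in the state_dict are available, so the blocks float around unconnected and the connections between them cannot be determined.

To visualize the structure more clearly, I converted the model to ONNX format and inspected it with Netron.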
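
To make the point about missing connections concrete, here is a minimal sketch of what a state_dict contains. The two-layer model is hypothetical and chosen only for illustration; it is unrelated to the FakeFace generator.

```python
from torch import nn

# Hypothetical two-layer model, only for illustration.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

state = model.state_dict()
# A flat mapping of names to tensors: the ReLU at index 1 has no entry,
# and nothing records which layer feeds which.
for name, tensor in state.items():
    print(name, tuple(tensor.shape))

# Rough size estimate: each float32 parameter takes 4 bytes.
param_count = sum(t.numel() for t in state.values())
print(param_count, "parameters,", param_count * 4, "bytes")
```

The same counting approach, applied to the real generator, gives a quick estimate of how large a weights-only file should be.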

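# Save the whole generator module (not just its state_dict)
torch.save(generator, "generator.pth")
# Dummy latent vector used to trace the graph; randn avoids uninitialized memory
dummy_data = torch.randn(1, 512, dtype=torch.float32, device=device)
torch.onnx.export(generator, dummy_data, "generator.onnx")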

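Saving as ONNX shows the structure in more detail, including how the nodes are connected.

It is complicated (...)

Going through it piece by piece:

Gemm shows up a lot. It stands for "General Matrix-Matrix Multiplication", which is plain matrix multiplication.
To learn more, I profiled the model.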
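
For reference, the ONNX Gemm operator computes Y = alpha * A * B + beta * C (with optional transposes), which is what a fully connected layer with a bias does. The sketch below is a plain-Python version of that formula; the function name gemm and the example values are made up for illustration, and it is not how ONNX runtimes implement it.

```python
def gemm(a, b, c, alpha=1.0, beta=1.0):
    """Return alpha * (a @ b) + beta * c for matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [
            alpha * sum(a[i][k] * b[k][j] for k in range(inner)) + beta * c[i][j]
            for j in range(cols)
        ]
        for i in range(rows)
    ]


x = [[1.0, 2.0]]              # one input row with 2 features
w = [[3.0, 4.0, 5.0],
     [6.0, 7.0, 8.0]]         # weights mapping 2 features to 3 outputs
bias = [[0.5, 0.5, 0.5]]      # bias added to every output
y = gemm(x, w, bias)
print(y)
```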

Benchmarking

I benchmarked the code with torch.utils.benchmark.
Pass the call to be timed as a string in stmt, pass the variables it uses through globals, and then call timeit. timeit(n) runs the statement n times and reports the average time.

import torch.utils.benchmark as benchmark

# Benchmark: every name used in stmt is looked up in globals
t0 = benchmark.Timer(
    stmt='sample(generator, step, mean_style, size, device)',
    globals={'sample': sample,
             'generator': generator,
             'step': step,
             'mean_style': mean_style,
             'size': args.n_row * args.n_col,
             'device': device,
             })
print(t0.timeit(10))  # run stmt 10 times and report the time per run
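
The same pattern can be tried without the FakeFace model. In the sketch below, make_batch is a hypothetical stand-in for sample() that does a single matrix multiply on the CPU; only the Timer usage mirrors the snippet above.

```python
import torch
import torch.utils.benchmark as benchmark


def make_batch(size):
    # Hypothetical stand-in for sample(): one matrix multiply per call.
    latent = torch.randn(size, 512)
    weight = torch.randn(512, 512)
    return latent @ weight


timer = benchmark.Timer(
    stmt='make_batch(size)',
    globals={'make_batch': make_batch, 'size': 4},
)
measurement = timer.timeit(10)  # run the statement 10 times
print(measurement)
print(f"{measurement.mean * 1e3:.3f} ms per run")  # mean is in seconds
```

Compared with Python's built-in timeit, torch.utils.benchmark is designed for PyTorch workloads and, as far as I know, synchronizes CUDA when needed, which is why it is a reasonable choice for GPU timings like the ones in the next section.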

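GPU (CUDA)

Tested with the device set to cuda.
The number of generated images is set by the row and col parameters. The time grows with the number of images, but not in direct proportion: 5 images take about 2.8 times as long as 1, and 60 images take about 24 times as long.

1 x 1 => 6.38 ms
1 x 5 => 17.72 ms
3 x 5 => 44.68 ms
6 x 10 => 151.13 ms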

CPU

CUDA is fast, as expected, so I also checked performance on the CPU.

1 x 1 => 371.93 ms
1 x 5 => 1.85 s
3 x 5 => 5.72 s
6 x 10 => 23.38 s

Images (rows x cols)	GPU (ms)	CPU (ms)
1 x 1	6.38	372
1 x 5	17.72	1850
3 x 5	44.68	5720
6 x 10	151.13	23380

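For a single image, the GPU takes 6.38 ms and the CPU about 372 ms, a difference of roughly 58 times.
The CPU run used only one core, and 372 ms is about 0.37 s per image, or roughly 2.7 FPS, so it is not completely unusable.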
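
The ratios can be recomputed directly from the measurements. The short script below only does arithmetic on the timings listed in the table; the variable names are my own.

```python
# Timings in milliseconds, copied from the table above: (GPU, CPU).
timings = {
    (1, 1): (6.38, 371.93),
    (1, 5): (17.72, 1850.0),
    (3, 5): (44.68, 5720.0),
    (6, 10): (151.13, 23380.0),
}

summary = {}
for (rows, cols), (gpu_ms, cpu_ms) in timings.items():
    images = rows * cols
    summary[(rows, cols)] = {
        'gpu_ms_per_image': gpu_ms / images,
        'cpu_ms_per_image': cpu_ms / images,
        'cpu_over_gpu': cpu_ms / gpu_ms,
    }
    print(f"{rows} x {cols}: "
          f"GPU {gpu_ms / images:.2f} ms/image, "
          f"CPU {cpu_ms / images:.2f} ms/image, "
          f"CPU/GPU {cpu_ms / gpu_ms:.1f}x")

# Frames per second on the CPU when generating one image at a time.
single_image_fps_cpu = 1000.0 / timings[(1, 1)][1]
print(f"CPU single-image rate: {single_image_fps_cpu:.2f} FPS")
```

Per image, the GPU cost drops from 6.38 ms to about 2.5 ms as the batch grows, while the CPU cost stays between roughly 370 and 390 ms, so the gap between the two widens with larger batches.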

Profiling

Look in more detail at what this model (code) does
Analyze it layer by layer and function by function

PyTorch profiler

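import torch.autograd.profiler as profiler  # assumed import; it provides the calls used below

# Record call stacks and memory usage while generating one batch of images
with profiler.profile(with_stack=True, profile_memory=True) as prof:
    img = sample(generator, step, mean_style, args.n_row * args.n_col, device)

# Show the 5 most expensive entries, grouped by the top 5 stack frames
print(prof.key_averages(group_by_stack_n=5).table(sort_by='cpu_time_total', row_limit=5))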

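Running the profiler with the code above gives results like the following.

Because of how the model is structured, it is hard to tell at a glance which layer each entry belongs to.

Analyzing model.py

The conv part alone can be profiled by modifying forward in the Generator class of model.py as shown below. I profiled the model part by part this way.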

class Generator(nn.Module):
    def __init__(self, ...):
        ...
        self.progression = nn.ModuleList(
            [
                StyledConvBlock(512, 512, 3, 1, initial=True),  # 4
                StyledConvBlock(512, 512, 3, 1, upsample=True),  # 8
                StyledConvBlock(512, 512, 3, 1, upsample=True),  # 16
                StyledConvBlock(512, 512, 3, 1, upsample=True),  # 32
                StyledConvBlock(512, 256, 3, 1, upsample=True),  # 64

                ...
            ]
        )
        ...

    def forward(self, ...):
        ...
        for i, (conv, to_rgb) in enumerate(zip(self.progression, self.to_rgb)):
            ...
            with profiler.record_function(f"CONV FORWARD {i}"):
                out = conv(out, style_step, noise[i])  # <- here
            ...

The Generator class keeps several StyledConvBlock instances in self.progression and runs them in a for loop inside forward.

Wrapping each iteration in profiler.record_function gives per-layer results like the following.
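
The labeling technique can be tried on a small model first. TinyGenerator below is a hypothetical stand-in with three plain conv layers, not the StyledConvBlock stack from model.py; the sketch assumes profiler refers to torch.autograd.profiler.

```python
import torch
from torch import nn
import torch.autograd.profiler as profiler


class TinyGenerator(nn.Module):
    """Hypothetical stand-in: a few conv layers run in a loop."""

    def __init__(self):
        super().__init__()
        self.progression = nn.ModuleList(
            [nn.Conv2d(8, 8, 3, padding=1) for _ in range(3)]
        )

    def forward(self, out):
        for i, conv in enumerate(self.progression):
            # Everything inside this block is reported under one label.
            with profiler.record_function(f"CONV FORWARD {i}"):
                out = conv(out)
        return out


model = TinyGenerator().eval()
with torch.no_grad(), profiler.profile() as prof:
    model(torch.randn(1, 8, 16, 16))

# Collect the custom labels that the profiler recorded.
labels = sorted(
    e.key for e in prof.key_averages() if e.key.startswith("CONV FORWARD")
)
print(labels)
```

Each label then appears as its own row in prof.key_averages().table(...), which makes it easier to compare the cost of individual blocks.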

export_chrome_trace

This is a way to visualize the results in a more structured form.

with profiler.profile(with_stack=True, profile_memory=True) as prof:
    img = sample(generator, step, mean_style, args.n_row * args.n_col, device)

print(prof.key_averages(group_by_stack_n=5).table(sort_by='cpu_time_total', row_limit=5))
prof.export_chrome_trace("trace.json")  # <-- added

export_chrome_trace exports the profiling results as JSON.
Loading the file in chrome://tracing/ lets you inspect the profiling results structurally.

Looking at layer 5 alone in more detail gives the following.
