무료 데이터셋으로 리텐션 분석 연습하기(SQL)

데이터 분석

무료 데이터셋으로 리텐션 분석 연습하기(SQL)

JOHAWK 2026. 1. 23. 16:47

안녕하세요, SQL로 리텐션 분석을 연습하고 싶은 분께 추천드리고 싶은 데이터셋이 있어서 소개드립니다.

바로 Iranian Cosmetics & Personal Care Sales Dataset 입니다. (클릭 시 kaggle 사이트로 이동)

실제 이란의 2025년 화장품 및 퍼스널 케어 구매 데이터(트랜잭션)를 참고한 합성 데이터셋으로 Kaggle에 공개되어있습니다. 다운로드를 받으면 csv 파일로 구할 수 있습니다.

1. 데이터셋 살펴보기

데이터는 아래와 같이 구성되어 있습니다.

select * from invoices limit 1

컬럼명	의미
InvoiceId	송장 고유번호(INT)
OrderDate	주문일자(YYYY-MM-DD)
ProductId	제품 고유번호(INT)
CustomerId	고객 고유번호(INT)
ShipMode	배송 방식(Standard, Second, First, SameDay)
Quantity	구매한 제품 수량
Sales	수량을 반영한 판매 금액(단가 * Quantity)
Profit	판매자 기준 예상 이익
Discount	할인율(0, 0.05, 0.1)
TotalAmount	할인까지 적용된 실제 판매 금액(Sales*(1-Discount))

총 14,586건의 거래(트랜잭션) 데이터가 있고, 2496명의 고객, 고유 제품 500개, 4가지의 배송 방식과 3가지 할인율이 있습니다.

select
  count(*) as total_transactions,
  count(CustomerId) as total_customers,
  count(distinct CustomerId) as unique_customers,
  count(distinct ProductId) as unique_products,
  count(distinct ShipMode) as cat_shipmode,
  count(distinct Discount) as cat_discount
from invoices

이 데이터가 합성데이터셋인 것을 어떻게 알 수 있을까요?
월간 활성 유저 수와 거래 수, 그리고 배송 방식별 트랜잭션의 특징을 보면 알 수 있습니다.

1. 월간 활성 유저 수(MAU)와 주문 수가 모두 비슷합니다. 실제 미용 관련 커머스에서 이뤄지기 어려운 형태입니다.

with add_month as (
    select
        *,
        date_format(OrderDate, '%Y-%m-01') as OrderMonth
    from invoices
)
select OrderMonth,
    count(distinct CustomerId) as "MAU",
    count(InvoiceId) as cnt_transactions
from add_month
group by OrderMonth

2. 배송 방식에 따라 주문 수, 평균 이익, 평균 양, 평균 금액 등을 비교했을 때 큰 차이가 없어 합성 데이터셋임을 알 수 있습니다.

select
    ShipMode,
    count(InvoiceId) as cnt_orders,
    avg(Sales) as avg_price,
    avg(Quantity) as avg_quantity,
    avg(TotalAmount) as avg_amount
from invoices
group by ShipMode
order by avg_amount desc

3. 하나의 송장에 여러 제품들이 동시에 들어있는 경우는 없었습니다.

select
    count(InvoiceId) as total_invoice,
    count(distinct InvoiceId) as unique_invoice
from invoices

2. 리텐션 분석 쿼리 연습

리텐션을 추출하기 위해 필요한 컬럼은 CustomerId, OrderDate 두 개입니다.

1) '첫 주문 월'을 기준으로 코호트 구성하기

첫 주문 월을 기준으로 코호트를 만들기 위해 각 주문마다 주문 월과 해당 고객의 첫 주문 월을 넣어줍니다.

with add_month as (
    select
        CustomerId,
        OrderDate,
        date_format(OrderDate, '%Y-%m-01') as OrderMonth
    from invoices
), -- date_format 함수로 구매 연월 컬럼 생성
first_order as (
    select
        CustomerId,
        min(OrderMonth) as first_order_month
    from add_month group by CustomerId
) -- 코호트 구별을 위해 고객별 첫 주문 월 추출

select a.CustomerId, a.OrderMonth, f.first_order_month
from add_month a
left join first_order f
    on a.CustomerId = f.CustomerId

2) 리텐션 계산

case when과 date_add 함수를 이용해 첫 주문 이후 n개월 차에도 구매를 했는 지 확인하는 것이 중요합니다.

with add_month as (
select CustomerId, OrderDate, date_format(OrderDate, '%Y-%m-01') as OrderMonth from invoices
), -- date_format 함수로 구매 연월 컬럼 생성
first_order as (
select CustomerId, min(OrderMonth) as first_order_month from add_month group by CustomerId
) -- 코호트 구별을 위해 고객별 첫 주문 월 추출

select
    f.first_order_month as months, -- 첫 주문 기준 코호트
    count(distinct a.CustomerId) as month0, -- 첫 주문을 한 고객 수(고유 고객 수를 추출해야하므로 distinct를 사용)
    round((count(distinct case when OrderMonth = date_add(f.first_order_month, interval 1 month) then a.CustomerId end) / count(distinct a.CustomerId)*100), 2) as month1, -- 첫 주문 고객 수 대비 1개월이 지난 후 주문을 한 고객의 비율(%)
    round((count(distinct case when OrderMonth = date_add(f.first_order_month, interval 2 month) then a.CustomerId end) / count(distinct a.CustomerId)*100), 2) as month2, -- 첫 주문 고객 수 대비 2개월이 지난 후 주문을 한 고객의 비율(%)
    round((count(distinct case when OrderMonth = date_add(f.first_order_month, interval 3 month) then a.CustomerId end) / count(distinct a.CustomerId)*100), 2) as month3,
    round((count(distinct case when OrderMonth = date_add(f.first_order_month, interval 4 month) then a.CustomerId end) / count(distinct a.CustomerId)*100), 2) as month4,
    round((count(distinct case when OrderMonth = date_add(f.first_order_month, interval 5 month) then a.CustomerId end) / count(distinct a.CustomerId)*100), 2) as month5,
    round((count(distinct case when OrderMonth = date_add(f.first_order_month, interval 6 month) then a.CustomerId end) / count(distinct a.CustomerId)*100), 2) as month6,
    round((count(distinct case when OrderMonth = date_add(f.first_order_month, interval 7 month) then a.CustomerId end) / count(distinct a.CustomerId)*100), 2) as month7,
    round((count(distinct case when OrderMonth = date_add(f.first_order_month, interval 8 month) then a.CustomerId end) / count(distinct a.CustomerId)*100), 2) as month8,
    round((count(distinct case when OrderMonth = date_add(f.first_order_month, interval 9 month) then a.CustomerId end) / count(distinct a.CustomerId)*100), 2) as month9,
    round((count(distinct case when OrderMonth = date_add(f.first_order_month, interval 10 month) then a.CustomerId end) / count(distinct a.CustomerId)*100), 2) as month10,
    round((count(distinct case when OrderMonth = date_add(f.first_order_month, interval 11 month) then a.CustomerId end) / count(distinct a.CustomerId)*100), 2) as month11 -- 첫 주문 고객 수 대비 11개월이 지난 후 주문을 한 고객의 비율(%)
    from add_month a
        left join first_order f
        on a.CustomerId = f.CustomerId
    group by f.first_order_month -- 첫 주문 월을 기준으로 코호트를 나누기 위해 group by 사용

이렇게 각 월마다 첫 주문을 한 고객 수(명)와 n개월이 지날 때마다 결제를 한 고객의 비율(%)을 구할 수 있습니다.

mysql에서는 반복문이 없기 때문에 각 컬럼마다 반복해서 작성해야합니다 ㅠㅠ

3) 해석

sql로 추출한 리텐션 데이터를 보다 쉽게 해석하기 위해 엑셀(구글스프레드 시트)의 조건부 서식 기능을 이용해서 히트맵을 만들었습니다.

month0은 고객수를 기준으로 막대 그래프를 넣었습니다. 2025년 1월의 첫 주문 고객 수가 975명으로 가장 많고 12월에 첫 주문을 한 고객 수는 3명으로 가장 적습니다.
month1부터 month11까지 리텐션은 녹색이 진할수록 높습니다.

리텐션은 가로, 세로, 대각선(↗) 방향으로 볼 수 있습니다.

1) 가로: 첫 주문 코호트 별 리텐션의 변화
- 25년 1월 코호트(첫 주문 고객)는 11개월이 지나도 37% 이상의 리텐션을 꾸준히 기록하고 있습니다.
- 25년 8월 코호트의 리텐션은 3개월 후(11월) 34%로 높았으나 12월에 21%로 크게 감소했습니다.

2) 세로: 각 코호트의 n개월 차 리텐션 비교
- 25년 8월 코호트는 다른 코호트에 비해 리텐션이 낮습니다.

3) 대각선(↗): 특정 월의 리텐션
- 25년 12월 리텐션을 비교했을 때, 10월 및 11월 코호트가 60%로 가장 높았고 8월 코호트가 21.88%로 가장 낮았습니다.

만약 이 데이터가 실제 데이터였으면, 저는 8월에 첫 주문을 한 고객들의 리텐션이 낮다는 것에 주목해 리서치를 통해 그 원인을 찾아보고자 했을 것입니다.

하지만 여기서 가장 중요한 것은 시간이 지날수록 첫 주문을 하는 고객 수가 급격히 줄어드는 것이죠. 리텐션을 걱정하기보다 신규 고객이 감소하는 원인을 찾는 것이 더 중요합니다.

지금까지 Kaggle의 데이터셋을 활용해서 연습하는 예시를 만들어봤습니다.

SQL을 배웠는데 실제로 연습할 데이터셋이 부족한 분들께 도움이 되셨길 바랍니다.

감사합니다.

이 글은 데이터리안 ‘SQL 데이터 분석 캠프 실전반’을 수강하며 작성한 내용입니다.

'데이터 분석' 카테고리의 다른 글

가장 간단한 방법으로 문제 해결하기: 결제일 예측 프로세스 개선 (0)	2025.10.29
데이터 분석에서 B2B와 B2C의 차이 (0)	2025.10.29

현재글무료 데이터셋으로 리텐션 분석 연습하기(SQL)

JOHAWK_DATABASE

심리학 전공자의 데이터 분석가 도전기

리텐션, 데이터리안, 스프린트데이터분석2기, 프로젝트, EDA, 데이터분석, 취업까지달린다, 회고, 코드잇스프린트, AARRR, 스마일커브, 그로스해킹, 연관분석, 데이터 분석, sql 실전, PCA, 파이썬, SQL, 클러스터링, 스프린트 회고,

Today :
Yesterday :

JOHAWK_DATABASE