데이터 분석가:Applied Data Analytics/판다스 데이터분석

신용거래 이상탐지 데이터 다루기

데이터분석 2025. 2. 19. 14:30
320x100
728x90

이번내용은 모두연 프로덕트데이터분석가1기 수료과정 중 과제를 참고하였습니다.

 

신용카드 이상감지 모델을 만들기 위한 데이터이며, is_fraud가 사기거래 여부를 나타내는 변수이다.

최종적으로 사기거래 여부를 예측하기 위한 데이터를 준비해보는 실습입니다.

  1. 사기거래 여부를 예측하는 데 불필요한 컬럼을 제거합니다.
  2. 최소 2개 이상의 새로운 피처(변수)를 생성해봅시다.
    • [힌트] 구매 금액, 시간 등의 변수를 이용할 수 있습니다.
    1. 데이터의 위도/경도 정보를 활용해볼 수 있을까?
    2. 통계적 관점으로 접근하여 유용한 변수를 만들어낼 수 있을까?
평가문항
상세기준
1. 다수의 컬럼이름을 변경하고, 다양한 방법으로 결측치를 처리할 수 있는가?

컬럼 이름의 변경이 명확하게 반영되어 있고, 다양한 방법으로 효과적인 결측치 처리를 진행할수 있는가?
2. 데이터 안의 이상치를 감지하고, 이를 적절하게 수정할 수 있는가? 데이터의 이상치를 감지하여 수정하고 수정후 데이터의 일관성이 유지되며 품질 또한 향상 되었는가?
3. 주어진 변수에서 추가적인 변수를 생성해 낼 수 있는가? 추가적인 변수를 창의적으로 생성하고 유용하게 데이터 분석에 기여 하였는가?

 

주어진 데이터.csv를 가지고 실습하는 내용입니다. 진행하는 프로세스만 참고하시기 바랍니다.

import pandas as pd
import numpy as np
import seaborn as sns

dating_df = pd.read_csv('data/fraud.csv')

# 컬럼명 변경
dating_df.rename(columns={"dob": "birthday", "trans_date_trans_time": "trans_time"}, inplace=True)

dating_df.head(10)

dating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 491134 entries, 0 to 491133
Data columns (total 18 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   trans_time  491134 non-null  object 
 1   merchant    491134 non-null  object 
 2   category    491134 non-null  object 
 3   amt         491134 non-null  float64
 4   gender      491134 non-null  object 
 5   street      491134 non-null  object 
 6   city        491134 non-null  object 
 7   state       491134 non-null  object 
 8   zip         491134 non-null  int64  
 9   lat         491134 non-null  float64
 10  long        491134 non-null  float64
 11  city_pop    491134 non-null  int64  
 12  job         491134 non-null  object 
 13  birthday    491134 non-null  object 
 14  unix_time   491134 non-null  int64  
 15  merch_lat   491134 non-null  float64
 16  merch_long  491134 non-null  float64
 17  is_fraud    491134 non-null  int64  
dtypes: float64(5), int64(4), object(9)
memory usage: 67.4+ MB

dating_df.describe()

dating_df.columns

Index(['trans_time', 'merchant', 'category', 'amt', 'gender', 'street', 'city',
       'state', 'zip', 'lat', 'long', 'city_pop', 'job', 'birthday',
       'unix_time', 'merch_lat', 'merch_long', 'is_fraud'],
      dtype='object')

for i in dating_df.columns:
    print(i)

trans_time
merchant
category
amt
gender
street
city
state
zip
lat
long
city_pop
job
birthday
unix_time
merch_lat
merch_long
is_fraud

# 데이터 전처리 (날짜변환, 불필요한 컬럼을 제거, 최소 2개 이상의 새로운 피처(변수)

# 날짜 변환 (datetime 형식으로 변경)
dating_df["trans_time"] = pd.to_datetime(dating_df["trans_time"])
dating_df["birthday"] = pd.to_datetime(dating_df["birthday"])

print(dating_df.dtypes)

trans_time    datetime64[ns]
merchant              object
category              object
amt                  float64
gender                object
street                object
city                  object
state                 object
zip                    int64
lat                  float64
long                 float64
city_pop               int64
job                   object
birthday      datetime64[ns]
unix_time              int64
merch_lat            float64
merch_long           float64
is_fraud               int64
dtype: object

print(dating_df.dtypes)

trans_time    datetime64[ns]
merchant              object
category              object
amt                  float64
gender                object
street                object
city                  object
state                 object
zip                    int64
lat                  float64
long                 float64
city_pop               int64
job                   object
birthday      datetime64[ns]
unix_time              int64
merch_lat            float64
merch_long           float64
is_fraud               int64
dtype: object

# 삭제할 컬럼 리스트 : 개인 정보이거나 거래번호로 의미없음삭제후 중간에 데이터를 다시 로드 하니 이미 컬럼이 삭제되있음.
columns_to_drop = ["zip", "street"]  #추가 컬럼 삭제  zip, stree는 위도, 경도와 중복 데이터 
# 새로운 데이터프레임으로 저장
dating_df = dating_df.drop(columns=columns_to_drop)

print(dating_df["state"].value_counts())

state
TX    49621
MI    28451
NY    23368
PA    20458
WA    19691
AL    18967
NC    18957
AR    17518
SC    16062
IL    16054
CA    16050
MD    15339
FL    15336
WV    15309
MO    12413
IN    12408
KS    11670
LA    11653
NM    10960
AZ     8749
MS     8735
NJ     8039
MN     8037
OH     8034
VT     8030
KY     8024
TN     7305
WI     7304
VA     7297
NV     4386
UT     4386
CT     4382
ND     4381
NE     4378
IA     4375
CO     4375
ME     4374
MT     4365
SD     3660
MA     3652
OK     3651
GA     3646
WY     3643
ID     3641
Name: count, dtype: int64

pd.set_option('display.max_columns', 50)
dating_df.head(10)

# A. 데이터 로드 및 구조확인 및 결측치 확인 필요한 컬럼명변경 및 생성, 이상치 처리
# B. 데이터 전처리(개인정보, 중복데이터 정리, 날짜데이터 변환, 거리 변수 생성(고객 과 가맹점거리)
# C. 범주형 데이터 인코딩
# D. 수치형 데이터 표준화
# E. category, state는 One-Hot Encoding적용
# F. 금액(amt)과 거리는 정규화
# G. 클래스 불균형 해결
# ===== C1조원들 회의결과 =======
# zip, lat, long, street 는 lat, long로 가능해서 zip, stree는 삭제하는 의견
# lat, long 와 merch_lat, merch_long의 두 곳의 위치로 거리가 나오므로 상관관계 파악 가능
# dob 생년월일이므로 최신거래년도와의 차이로 대략적인 나이를 개산하여 age로 새로운 컬럼으로 생성
# State는 위도,경도로 가능하지만, 주별로 사기 발생빈도를 알수 있으므로 제거 안함.또한 주별 사기방지 정책이 다를수 있으므로 제거 안함.
# 직업별, 성별, 연령별 파악가능
# 머신러닝 만드는 1.목적을 생각하고 2최적에 데이터를 준비하는것이 목적 

max_time = dating_df["trans_time"].max() #birthday와의 날자계산으로 age컬럼 생성하기 위해
print("가장 최근 거래 시간:", max_time)

가장 최근 거래 시간: 2020-12-31 23:59:34

# 기준 날짜 설정
reference_date = pd.to_datetime("2020-12-31")
# birthday 컬럼을 datetime 형식으로 변환
dating_df["birthday"] = pd.to_datetime(dating_df["birthday"])
# 나이 계산 (연도 차이 계산)
dating_df["age"] = reference_date.year - dating_df["birthday"].dt.year
# 결과 확인
print(dating_df.head())

           trans_time                           merchant     category     amt  \
0 2019-01-01 00:00:44    fraud_Heller, Gutmann and Zieme  grocery_pos  107.23   
1 2019-01-01 00:12:34  fraud_Schultz, Simonis and Little  grocery_pos   44.71   
2 2019-01-01 00:17:16                  fraud_Kling-Grant  grocery_net   46.28   
3 2019-01-01 00:20:15              fraud_Deckow-O'Conner  grocery_pos   64.09   
4 2019-01-01 00:23:41             fraud_Balistreri-Nader     misc_pos   25.58   

  gender         city state      lat      long  city_pop  \
0      F       Orient    WA  48.8878 -118.2105       149   
1      M    Elizabeth    NJ  40.6747  -74.2239    124967   
2      F   Plainfield    NJ  40.6152  -74.4150     71485   
3      M      Romulus    MI  42.2203  -83.3583     31515   
4      F  Baton Rouge    LA  30.4066  -91.1468    378909   

                                 job   birthday   unix_time  merch_lat  \
0  Special educational needs teacher 1978-06-21  1325376044  49.159047   
1             Operational researcher 1980-12-21  1325376754  40.079588   
2             Leisure centre manager 1974-07-19  1325377036  40.021888   
3                     Police officer 1971-11-05  1325377215  42.360426   
4                Designer, furniture 1977-02-22  1325377421  29.737426   

   merch_long  is_fraud  age  
0 -118.186462         0   42  
1  -74.848087         0   40  
2  -74.228188         0   46  
3  -83.552316         0   49  
4  -90.853194         0   43  

#거리 컬럼 생성(고객과 가맹정 거리)
# geopy.distance의 geodesic 함수 임포트
from geopy.distance import geodesic
dating_df["distance"] = dating_df.apply(lambda row: geodesic((row["lat"], row["long"]), (row["merch_lat"], row["merch_long"])).kilometers, axis=1)

print(dating_df.head())

           trans_time                           merchant     category     amt  \
0 2019-01-01 00:00:44    fraud_Heller, Gutmann and Zieme  grocery_pos  107.23   
1 2019-01-01 00:12:34  fraud_Schultz, Simonis and Little  grocery_pos   44.71   
2 2019-01-01 00:17:16                  fraud_Kling-Grant  grocery_net   46.28   
3 2019-01-01 00:20:15              fraud_Deckow-O'Conner  grocery_pos   64.09   
4 2019-01-01 00:23:41             fraud_Balistreri-Nader     misc_pos   25.58   

  gender         city state      lat      long  city_pop  \
0      F       Orient    WA  48.8878 -118.2105       149   
1      M    Elizabeth    NJ  40.6747  -74.2239    124967   
2      F   Plainfield    NJ  40.6152  -74.4150     71485   
3      M      Romulus    MI  42.2203  -83.3583     31515   
4      F  Baton Rouge    LA  30.4066  -91.1468    378909   

                                 job   birthday   unix_time  merch_lat  \
0  Special educational needs teacher 1978-06-21  1325376044  49.159047   
1             Operational researcher 1980-12-21  1325376754  40.079588   
2             Leisure centre manager 1974-07-19  1325377036  40.021888   
3                     Police officer 1971-11-05  1325377215  42.360426   
4                Designer, furniture 1977-02-22  1325377421  29.737426   

   merch_long  is_fraud  age   distance  
0 -118.186462         0   42  30.216618  
1  -74.848087         0   40  84.714605  
2  -74.228188         0   46  67.768167  
3  -83.552316         0   49  22.322745  
4  -90.853194         0   43  79.398244  

print(dating_df.dtypes)

trans_time    datetime64[ns]
merchant              object
category              object
amt                  float64
gender                object
city                  object
state                 object
lat                  float64
long                 float64
city_pop               int64
job                   object
birthday      datetime64[ns]
unix_time              int64
merch_lat            float64
merch_long           float64
is_fraud               int64
age                    int32
distance             float64
dtype: object

dating_df.head(10)

dating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 491134 entries, 0 to 491133
Data columns (total 18 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   trans_time  491134 non-null  datetime64[ns]
 1   merchant    491134 non-null  object        
 2   category    491134 non-null  object        
 3   amt         491134 non-null  float64       
 4   gender      491134 non-null  object        
 5   city        491134 non-null  object        
 6   state       491134 non-null  object        
 7   lat         491134 non-null  float64       
 8   long        491134 non-null  float64       
 9   city_pop    491134 non-null  int64         
 10  job         491134 non-null  object        
 11  birthday    491134 non-null  datetime64[ns]
 12  unix_time   491134 non-null  int64         
 13  merch_lat   491134 non-null  float64       
 14  merch_long  491134 non-null  float64       
 15  is_fraud    491134 non-null  int64         
 16  age         491134 non-null  int32         
 17  distance    491134 non-null  float64       
dtypes: datetime64[ns](2), float64(6), int32(1), int64(3), object(6)
memory usage: 65.6+ MB

print(dating_df['merchant'].value_counts())

merchant
fraud_Kilback LLC                      1602
fraud_Cormier LLC                      1455
fraud_Kuhn LLC                         1446
fraud_Schumm PLC                       1392
fraud_Boyer PLC                        1375
                                       ... 
fraud_Little-Gleichner                  228
fraud_Douglas, DuBuque and McKenzie     228
fraud_Satterfield-Lowe                  227
fraud_Reichert-Weissnat                 226
fraud_Jerde-Hermann                     226
Name: count, Length: 693, dtype: int64

# gender 컬럼을 수치형으로 변환 (0: Male, 1: Female)
dating_df["gender_encoded"] = dating_df["gender"].replace({"M": 0, "F": 1}).infer_objects(copy=False)

# 결과 확인 (Jupyter Notebook 환경에서 사용)
import matplotlib.pyplot as plt
import seaborn as sns

# gender_encoded 분포 시각화
sns.countplot(x=dating_df["gender_encoded"])
plt.xlabel("Gender Encoded (0: Male, 1: Female)")
plt.ylabel("Count")
plt.title("Gender Encoding Distribution")
plt.show()

# 데이터프레임 미리보기
dating_df.head()

# gender 컬럼 삭제
dating_df.drop(columns=["gender"], inplace=True)
# 데이터프레임 확인
dating_df.head()

# is_fraud를 기준으로 그래프 시각화

import matplotlib.pyplot as plt

# 거래 금액(amt)과 거리(distance)를 기준으로 사기 거래 여부(is_fraud) 시각화
plt.figure(figsize=(6, 6))
plt.scatter(dating_df["amt"], dating_df["distance"], c=dating_df["is_fraud"], cmap="coolwarm", alpha=0.5)
plt.xlabel("Transaction Amount (amt)")
plt.ylabel("Transaction Distance (km)")
plt.title("Fraud Detection: Amount vs. Distance")
plt.colorbar(label="Fraud (0: Normal, 1: Fraud)")
plt.show()

# 거래 금액(amt)과 연령(age)를 기준으로 사기 거래 여부(is_fraud) 시각화
plt.figure(figsize=(6, 6))
plt.scatter(dating_df["amt"], dating_df["age"], c=dating_df["is_fraud"], cmap="coolwarm", alpha=0.5)
plt.xlabel("Transaction Amount (amt)")
plt.ylabel("Age")
plt.title("Fraud Detection: Amount vs. Age")
plt.colorbar(label="Fraud (0: Normal, 1: Fraud)")
plt.show()

 

# pandas hist 시각화
dating_df.hist(figsize=(15, 12));  # hist() 데이터프레임에 포함된 모든 숫자형열에 대한 히스트그램을 그린다.

# 오류 원인 찾기 위해 is_fraud컬럼 체크
print(dating_df['is_fraud'].isnull().sum())  # NaN 개수 확인

0

print(dating_df['is_fraud'].unique())  # 고유한 값 출력

[0 1]

# 컬럼 확인
missing_cols = [col for col in ['is_fraud', 'amt', 'age'] if col not in dating_df.columns]
print("존재하지 않는 컬럼:", missing_cols)

존재하지 않는 컬럼: []

if all(col in dating_df.columns for col in ['is_fraud', 'amt', 'age']):
   dating_df_subset = dating_df[['is_fraud', 'amt', 'age']]
else:
    print("일부 컬럼이 데이터프레임에 존재하지 않습니다.")

dating_df.columns = dating_df.columns.str.strip().str.lower()  # 공백 제거 & 소문자 변환

dating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 491134 entries, 0 to 491133
Data columns (total 18 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   trans_time      491134 non-null  datetime64[ns]
 1   merchant        491134 non-null  object        
 2   category        491134 non-null  object        
 3   amt             491134 non-null  float64       
 4   city            491134 non-null  object        
 5   state           491134 non-null  object        
 6   lat             491134 non-null  float64       
 7   long            491134 non-null  float64       
 8   city_pop        491134 non-null  int64         
 9   job             491134 non-null  object        
 10  birthday        491134 non-null  datetime64[ns]
 11  unix_time       491134 non-null  int64         
 12  merch_lat       491134 non-null  float64       
 13  merch_long      491134 non-null  float64       
 14  is_fraud        491134 non-null  int64         
 15  age             491134 non-null  int32         
 16  distance        491134 non-null  float64       
 17  gender_encoded  491134 non-null  int64         
dtypes: datetime64[ns](2), float64(6), int32(1), int64(4), object(5)
memory usage: 65.6+ MB

# 'trans_time' 컬럼이 datetime64[ns] 타입이여서 문제소지가 있어 제외하고 실행

# seaborn pairplot 시각화
import seaborn as sns
import matplotlib.pyplot as plt
vis_cols = ['amt', 'gender_encoded', 'city_pop', 'unix_time', 'age', 'distance', 'is_fraud'] # is_fraud컬럼추가
sns.pairplot(data=dating_df[vis_cols], hue='is_fraud');

 

# 현재 데이터프레임 컬럼 확인
print("현재 데이터프레임의 컬럼 목록:", dating_df.columns)
# 'is_fraud' 컬럼이 존재하는지 확인
if 'is_fraud' in dating_df.columns:
    # 수치형 변수만 선택
    vis_cols = ['trans_time', 'amt', 'gender_encoded', 'city_pop', 'unix_time', 'age', 'distance']
  
    # pairplot 실행
    sns.pairplot(data=dating_df[vis_cols + ['is_fraud']], hue='is_fraud')
    
    # 그래프 출력
    plt.show()
else:
    print("'is_fraud' 컬럼이 데이터프레임에 존재하지 않습니다. 컬럼명을 확인하세요.")

현재 데이터프레임의 컬럼 목록: Index(['trans_time', 'merchant', 'category', 'amt', 'city', 'state', 'lat',
       'long', 'city_pop', 'job', 'birthday', 'unix_time', 'merch_lat',
       'merch_long', 'is_fraud', 'age', 'distance', 'gender_encoded'],

 

# 이상 머신러닝을 위한 데이터준비과정이였습니다.