Pandas Basics 1

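1. What Is Pandas?

- Pandas is an open-source Python library: a powerful, flexible, and easy-to-use tool for data analysis and manipulation (source: Wikipedia).

- Official site: https://pandas.pydata.org/

- The name "Pandas" is said to come either from "Python Data Analysis Library" or from "panel data".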

2. Pandas Data Structures

1) Series

A Series consists of an index and values. It resembles an array, but its index can be specified explicitly.

import pandas as pd

data = pd.Series([1,2,3,4])

print(data)

0    1
1    2
2    3
3    4
dtype: int64

The index can be specified separately.

data = pd.Series([1,2,3,4,5],index=['a','b','c','d','e'])

print(data)

a    1
b    2
c    3
d    4
e    5
dtype: int64
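
With a custom index, values can be looked up by label. The following is a minimal sketch that rebuilds the data Series above so it runs on its own; the names third and picked are only for illustration.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# A single label returns one value
third = data['c']

# A list of labels returns a smaller Series
picked = data[['a', 'e']]

print(third)
print(picked.tolist())
print(data.index.tolist())
```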

A Series can also be created from a dictionary. The dictionary keys become the Series index.

fruit_price_dict = {
    'apple': 100,
    'banana': 300,
    'grape': 400,
    'orange': 150
}

print(fruit_price_dict)

{'apple': 100, 'banana': 300, 'grape': 400, 'orange': 150}

fruit_price = pd.Series(fruit_price_dict)

print(fruit_price)

apple     100
banana    300
grape     400
orange    150
dtype: int64

print(fruit_price.values)

[100 300 400 150]
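
Because the dictionary keys become the index, a Series built this way can be queried much like the dictionary itself. A short sketch under that assumption (the variable names are illustrative); note that the in operator checks the index, not the values.

```python
import pandas as pd

fruit_price = pd.Series({'apple': 100, 'banana': 300, 'grape': 400, 'orange': 150})

# The dictionary keys are available as the index
labels = fruit_price.index.tolist()

# Look up one value by key, as with a dictionary
banana_price = fruit_price['banana']

# Membership tests look at the index (the keys), not the values
has_grape = 'grape' in fruit_price
has_100 = 100 in fruit_price

print(labels)
print(banana_price)
print(has_grape, has_100)
```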

2) DataFrame

Combining several Series produces a DataFrame, which has both rows and columns.

fruit_stock_dict = {
    'apple': 2000,
    'banana': 1000,
    'grape': 3100,
    'orange': 4150
}

fruit_stock = pd.Series(fruit_stock_dict)

fruit_dataframe = pd.DataFrame({
    'fruit price': fruit_price,
    'fruit stock': fruit_stock
})

print(fruit_dataframe)

        fruit price  fruit stock
apple           100         2000
banana          300         1000
grape           400         3100
orange          150         4150

print(fruit_dataframe.index)

Index(['apple', 'banana', 'grape', 'orange'], dtype='object')

print(fruit_dataframe.columns)

Index(['fruit price', 'fruit stock'], dtype='object')

print(fruit_dataframe['fruit stock'])

apple     2000
banana    1000
grape     3100
orange    4150
Name: fruit stock, dtype: int64
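
The two Series above share the same labels. When they do not, the DataFrame constructor aligns them by label and fills the gaps with NaN. A small sketch of that behavior; the 'kiwi' entry and the names price, stock, and frame are made up for illustration.

```python
import pandas as pd

# 'kiwi' is a made-up entry that exists only in the price Series
price = pd.Series({'apple': 100, 'banana': 300, 'kiwi': 250})
stock = pd.Series({'apple': 2000, 'banana': 1000})

frame = pd.DataFrame({'fruit price': price, 'fruit stock': stock})

# The index is the union of the labels of both Series
print(frame.shape)

# 'kiwi' has no stock value, so that cell is NaN
print(frame.loc['kiwi', 'fruit stock'])

# A numeric column that holds NaN is stored as float
print(frame['fruit stock'].dtype)
```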

3) DataFrame, Series, and list

print(fruit_dataframe['fruit stock']) # a single column name returns a Series

apple     2000
banana    1000
grape     3100
orange    4150
Name: fruit stock, dtype: int64

#print(fruit_dataframe['fruit price','fruit stock']) # KeyError: the pair is treated as a single column label

print(fruit_dataframe[['fruit price','fruit stock']]) # a list of column names returns a DataFrame

        fruit price  fruit stock
apple           100         2000
banana          300         1000
grape           400         3100
orange          150         4150
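
The difference between the two forms is easy to confirm by checking the returned types. A minimal sketch, rebuilding fruit_dataframe from plain lists so it runs on its own:

```python
import pandas as pd

fruit_dataframe = pd.DataFrame(
    {'fruit price': [100, 300, 400, 150], 'fruit stock': [2000, 1000, 3100, 4150]},
    index=['apple', 'banana', 'grape', 'orange'],
)

# A column name on its own returns a Series
one_column = fruit_dataframe['fruit stock']

# A list returns a DataFrame, even when it holds only one name
one_column_frame = fruit_dataframe[['fruit stock']]

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(one_column_frame.shape)
```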

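3. Data Manipulation

1) Saving and Loading a DataFrame

Writing and reading .xlsx files requires the openpyxl library. In recent pandas versions, xlrd is only needed for legacy .xls files.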

fruit_dataframe.to_csv("./fruit.csv")

fruit_dataframe.to_excel("./fruit.xlsx")

fruit2 = pd.read_csv("./fruit.csv")

fruit3 = pd.read_excel("./fruit.xlsx")
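
By default to_csv writes the index as the first column, and read_csv reads it back as an ordinary column. The sketch below shows the round trip. It uses an in-memory io.StringIO buffer instead of a file so that nothing is written to disk, and passes index_col=0 so that read_csv uses the first column as the index again. The names default_read and restored are illustrative.

```python
import io

import pandas as pd

fruit_dataframe = pd.DataFrame(
    {'fruit price': [100, 300, 400, 150], 'fruit stock': [2000, 1000, 3100, 4150]},
    index=['apple', 'banana', 'grape', 'orange'],
)

# Write the CSV text into an in-memory buffer
buffer = io.StringIO()
fruit_dataframe.to_csv(buffer)

# Without index_col, the saved index comes back as a regular column
buffer.seek(0)
default_read = pd.read_csv(buffer)

# With index_col=0, the first column becomes the index again
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(default_read.columns.tolist())
print(restored.index.tolist())
```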

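2) Indexing and Slicing a DataFrame

- loc : selects by explicit index label; a label slice includes its end label

- iloc : selects by integer position; a position slice excludes its end position

The outputs below include the 'total stock price' column, which is added in section 4, 1) below.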

print(fruit_dataframe.loc['apple'])

fruit price             100
fruit stock            2000
total stock price    200000
Name: apple, dtype: int64

print(fruit_dataframe.loc['banana':'grape', 'fruit stock']) # the 'fruit stock' column only

banana    1000
grape     3100
Name: fruit stock, dtype: int64

print(fruit_dataframe.loc['banana':'grape', :'fruit stock']) # columns up to and including 'fruit stock'

        fruit price  fruit stock
banana          300         1000
grape           400         3100

print(fruit_dataframe.iloc[0])

fruit price             100
fruit stock            2000
total stock price    200000
Name: apple, dtype: int64

print(fruit_dataframe.iloc[1:3, 2]) # the column at position 2 only ('total stock price')

banana     300000
grape     1240000
Name: total stock price, dtype: int64

print(fruit_dataframe.iloc[1:3, :2]) # columns at positions 0 and 1 (through 'fruit stock')

        fruit price  fruit stock
banana          300         1000
grape           400         3100
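
The loc and iloc calls above return the same rows because a label slice includes its end label while a position slice stops before its end position. A minimal sketch that checks this on a two-column copy of fruit_dataframe (rebuilt here without 'total stock price' so the block is self-contained):

```python
import pandas as pd

fruit_dataframe = pd.DataFrame(
    {'fruit price': [100, 300, 400, 150], 'fruit stock': [2000, 1000, 3100, 4150]},
    index=['apple', 'banana', 'grape', 'orange'],
)

# Label slice: 'grape' (the end label) is included
by_label = fruit_dataframe.loc['banana':'grape']

# Position slice: position 3 is excluded, so rows 1 and 2 are returned
by_position = fruit_dataframe.iloc[1:3]

# The same cell addressed by label and by position
cell_by_label = fruit_dataframe.loc['grape', 'fruit stock']
cell_by_position = fruit_dataframe.iloc[2, 1]

print(by_label.index.tolist())
print(by_position.index.tolist())
print(cell_by_label, cell_by_position)
```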

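3) Modifying a DataFrame

# Column names: 학번 = student ID, 성명 = name, 국어 = Korean, 영어 = English
score_frame = pd.DataFrame(columns=['학번', '성명', '국어', '영어'])

score_frame.loc[0] = [1, '김철수', 80 , 90]

score_frame.loc[1] = {'학번': 2, '성명':'김영이', '국어':95, '영어':80}

score_frame.loc[1,'영어'] = 85 # change 김영이's English score

score_frame.loc[2,'학번'] = 3 # create the third row (index 2); its other columns become NaN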

print(score_frame)

  학번   성명   국어   영어
0  1  김철수   80   90
1  2  김영이   95   85
2  3  NaN  NaN  NaN

# Add a column ('수학' = math)

import numpy as np # needed for np.nan

score_frame['수학'] = np.nan # NaN = Not a Number

score_frame.loc[0, '수학'] = 100

print(score_frame)

  학번   성명   국어   영어     수학
0  1  김철수   80   90  100.0
1  2  김영이   95   85    NaN
2  3  NaN  NaN  NaN    NaN

4) Checking for Missing Data

print(score_frame.isnull())

      학번     성명     국어     영어     수학
0  False  False  False  False  False
1  False  False  False  False   True
2  False   True   True   True   True

print(score_frame.notnull())

     학번     성명     국어     영어     수학
0  True   True   True   True   True
1  True   True   True   True  False
2  True  False  False  False  False

print(score_frame.dropna()) # returns a new DataFrame; score_frame itself is not modified

  학번   성명  국어  영어     수학
0  1  김철수  80  90  100.0
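
Since dropna returns a new object, the result has to be assigned to keep it. The sketch below rebuilds an equivalent score_frame from a dictionary (an assumption made so the block runs on its own) and shows a few related calls: counting missing values per column, requiring only some columns to be present, and filling gaps with a default value.

```python
import numpy as np
import pandas as pd

# Same contents as score_frame above, built in one step
score_frame = pd.DataFrame({
    '학번': [1, 2, 3],
    '성명': ['김철수', '김영이', np.nan],
    '국어': [80, 95, np.nan],
    '영어': [90, 85, np.nan],
    '수학': [100, np.nan, np.nan],
})

# Keep only rows without any missing value; the original is unchanged
complete_rows = score_frame.dropna()

# Keep rows that have a name, even if other columns are missing
has_name = score_frame.dropna(subset=['성명'])

# Count the missing values in each column
missing_per_column = score_frame.isnull().sum()

# Replace missing math scores with 0 in a new DataFrame
filled = score_frame.fillna({'수학': 0})

print(len(score_frame), len(complete_rows), len(has_name))
print(missing_per_column.tolist())
print(filled['수학'].tolist())
```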

4. DataFrame Operations

1) Series Operations

This section continues with the fruit_dataframe example from above.

total_stock_price = fruit_dataframe['fruit price'] * fruit_dataframe['fruit stock']

fruit_dataframe['total stock price'] = total_stock_price

print(fruit_dataframe)

        fruit price  fruit stock  total stock price
apple           100         2000             200000
banana          300         1000             300000
grape           400         3100            1240000
orange          150         4150             622500

2) Operations on Series with Different Indexes

seriesA = pd.Series([2,4,6], index=[0,1,2])

seriesB = pd.Series([1,3,5], index=[1,2,3])

print(seriesA + seriesB) # NaN wherever a label is not present in both Series

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

print(seriesA.add(seriesB, fill_value=0)) # a label missing from one Series is treated as 0 before adding

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
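
The other arithmetic methods accept fill_value in the same way. A short sketch with sub and mul on the same two Series; using 1 as the fill value for multiplication is a choice made here so that a missing side leaves the other value unchanged.

```python
import pandas as pd

seriesA = pd.Series([2, 4, 6], index=[0, 1, 2])
seriesB = pd.Series([1, 3, 5], index=[1, 2, 3])

# Subtraction: a missing side counts as 0
difference = seriesA.sub(seriesB, fill_value=0)

# Multiplication: a missing side counts as 1
product = seriesA.mul(seriesB, fill_value=1)

# Alternatively, keep only the labels present in both Series
common_only = (seriesA + seriesB).dropna()

print(difference.tolist())
print(product.tolist())
print(common_only.tolist())
```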

3) Aggregation

data = {
    'A': [i + 5 for i in range(5)],
    'B': [i ** 2 for i in range(5)]
}

df = pd.DataFrame(data)

print(df)

print(df['A'].sum()) # call the method; without the parentheses, the bound method itself is printed

35

print(df.sum())

A    35
B    30
dtype: int64

print(df.mean())

A    7.0
B    6.0
dtype: float64
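
The aggregations above work column by column. As a sketch of two common variations, axis=1 aggregates across each row instead, and agg applies several functions at once; the names row_totals and summary are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [i + 5 for i in range(5)],
    'B': [i ** 2 for i in range(5)],
})

# Sum across the columns of each row
row_totals = df.sum(axis=1)

# Apply several aggregations at once; each function becomes a row
summary = df.agg(['min', 'max'])

print(row_totals.tolist())
print(summary)
```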

4) Sorting

df = pd.DataFrame({
    'col1': [3, 1, 9, 6, 7, 4],
    'col2': ['AA', 'AB', 'BF', np.nan, 'C', 'C'],
    'col3': [10, 35, 71, 41, 26, 3]
})

print(df)

   col1 col2  col3
0     3   AA    10
1     1   AB    35
2     9   BF    71
3     6  NaN    41
4     7    C    26
5     4    C     3

print(df.sort_values('col1'))

   col1 col2  col3
1     1   AB    35
0     3   AA    10
5     4    C     3
3     6  NaN    41
4     7    C    26
2     9   BF    71

print(df.sort_values('col1',ascending=False))

   col1 col2  col3
2     9   BF    71
4     7    C    26
3     6  NaN    41
5     4    C     3
0     3   AA    10
1     1   AB    35

print(df.sort_values(['col2','col1']))

   col1 col2  col3
0     3   AA    10
1     1   AB    35
2     9   BF    71
5     4    C     3
4     7    C    26
3     6  NaN    41

print(df.sort_values(['col2','col1'],ascending=[True,False]))

   col1 col2  col3
0     3   AA    10
1     1   AB    35
2     9   BF    71
4     7    C    26
5     4    C     3
3     6  NaN    41
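
In every example above the NaN row is sorted last, and the original df keeps its order because sort_values returns a new DataFrame. A minimal sketch of the related options, using the same data: na_position moves missing values to the front, sort_index restores the original order, and reset_index renumbers the sorted rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [3, 1, 9, 6, 7, 4],
    'col2': ['AA', 'AB', 'BF', np.nan, 'C', 'C'],
    'col3': [10, 35, 71, 41, 26, 3],
})

# Put rows with a missing col2 first instead of last
nan_first = df.sort_values('col2', na_position='first')

# sort_values returns a new DataFrame; df keeps its original order
by_col1 = df.sort_values('col1')

# Sorting by the index restores the original row order
restored = by_col1.sort_index()

# Renumber the sorted rows from 0 and discard the old index
renumbered = by_col1.reset_index(drop=True)

print(nan_first.index[0])
print(by_col1.index.tolist())
print(renumbered['col1'].tolist())
```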

License: Attribution, NonCommercial, NoDerivatives
