Python Pandas - Tutorial (Part N ° 2)

Share this post

This article is the thenext part of the first part of my tutorial on the Python Pandas library . The idea here is to go further in the discovery of this library and in particular to give the keys to the set manipulations it offers. To put it simply, you have DataFrame but you will have to combine them together in order to work on your characteristics. In short, you will want to do “SQL like” on your data! Good news, this bookstore allows you to do it in the most elegant and simple way.

Index

Let’s prepare our dataset

For this tutorial we will create two matrices A and B:

import pandas as pd
A = pd.DataFrame({'Col1': [1, 2, 3],
                  'Col2': [4, 4, 6]},
                 index=['rowA1', 'rowA2', 'rowA3'])
print("Matrice A\n", A)
B = pd.DataFrame({'Col1': [1, 3],
                  'Col3': [7, 8],
                  'Col4': [9, 10]},
                 index=['rowB1', 'rowB2'])
print("\nMatrice B\n", B)

Here they are :

Matrice A
        Col1  Col2
rowA1     1     4
rowA2     2     4
rowA3     3     6

Matrice B
        Col1  Col3  Col4
rowB1     1     7     9
rowB2     3     8    10

Count the values of a column

The value_counts () method appended to a column of a DataFrame is used to list the values of this column with their number of occurrences:

A["Col1"].value_counts()

3    1
2    1
1    1
Name: Col1, dtype: int64

Browse the rows of a DataFrame

Sometimes it can be useful to step through a DataFrame line by line. for this Pandas offers two iterrows () and itertuples () iterators . Both of these methods have their subtlety as itertuples()which is supposed to be faster than iterrows(), or iterrows() might not match from row to row. In short, I advise you to take a good look at the Pandas documentation before embarking on this path.

To iterate therefore, you can therefore use:

for index, ligne in monDataFrame.iterrows():
    print (ligne["Colonne X"])

Filter the rows of a DataFrame

Filtering is an essential action when working with datasets. For this we have two methods, one Python and the other via Pandas.

B.loc[B['Col3'] == 7]

Or just

B[B['Col3'] == 7]

In both cases we filter the DataFrame on the column ‘Col3’ which has the value 7.

We can of course combine several conditions but be careful not to use the boolean operators and and or in this case but & and | in place:

B.loc[(B['Col3'] == 7) | (B['Col3'] > 1)]['Col1']
B.loc[(B['Col3'] == 7) & (B['Col3'] > 1)]['Col1']

Now we can use the Filter method of the DataFrame Pandas. It does not provide exactly the same services but can be very practical to like for example:

A.filter(like='A3', axis=0)

Joins

To do this, nothing could be simpler, use the merge method and specify the join mode:

INNER : strict join
LEFT : left joint
RIGHT : right joint
OUTER : full outer

pd.merge(A, B, how='inner', on='Col1')

Union

To do this, nothing could be simpler, use the concat method:

pd.concat([A, B])

Group By

The groupby method is used to group rows on a criterion and to perform operations on the columns. This is exactly what SQL Group By does.

print ("Group By / Count:\n ", A.groupby("Col2")['Col1'].size(), "\n")
print ("Group By / Moyenne:\n ", A.groupby("Col2")['Col1'].mean(), "\n")

It is possible to specify several transformations on the same group:

# Multiple opérations
import numpy as np
A.groupby('Col2').agg({'Col1': np.mean, 'Col2': np.size})

Download or check out the Jupyter notebook with the above examples on GitHub .

Share this post

Python Pandas – Tutorial (Part N ° 2)

Let’s prepare our dataset

Count the values of a column

Browse the rows of a DataFrame

Filter the rows of a DataFrame

Joins

Union

Group By

Benoit Cayla

Leave a Reply Cancel reply

Let’s prepare our dataset

Count the values ​​of a column

Browse the rows of a DataFrame

Filter the rows of a DataFrame

Joins

Union

Group By

Benoit Cayla

You might also like

SQLite… SQL in a file!

Publish your Machine Learning models with Flask!

The persistence of machine learning models

Leave a Reply Cancel reply

Count the values of a column