Five Pandas Must-Know Techniques for Machine Learning Projects

. 5 min read

In the same way that the journey of a thousand miles begins with a single step, the journey of a machine learning project begins with loading your data.

Load your data right, and you will lay a strong foundation for your model.

Data for a machine learning project may take many forms like tabular, image, html or text. Files may be loaded from different formats such as pickle (.pkl), standard excel formats (such as .csv or .xlsx), stata files (.dta), HDF (.hd5) and many more. While getting started, you’ll usually encounter csv or excel file formats. These can be loaded through several libraries, including Pandas, NumPy and python’s csv module. Pandas is a popular and user-friendly option for most machine learning practitioners, and it provides a great set of functionalities to correctly parse your data.

This article will familiarize you with five useful Pandas data loading and preprocessing capabilities for csv or excel files.

Getting started

We will use the open source Metro Interstate Traffic Volume data. Download this from the UCI Machine Learning Repository, and make sure the file is stored in the same directory as your python notebook or script. Load the Pandas library and import your data using read_csv. The corresponding method for excel files is read_excel. Both these methods return the data as a DataFrame, which is a Pandas data structure for 2-dimensional data. We’ll assign the DataFrame to a variable called df.

For excel files:

df = pd.read_excel(‘file_path/file_name.xlsx’, sheet_name = ‘Sheet_1’)

The parameter sheet_name in read_excel tells Pandas which sheet to load your data from. You can specify this either as a string or a 0-indexed sheet number. The default value is 0, which means the first sheet will be parsed.  To read all sheets, just set this parameter to None.

1. Getting your column names right

Pandas usually assigns column names by interpreting the first lines of the file as the header row. However if you want to replace these with your own custom column names, you can specify this with the header and names parameters. Below are some commonly encountered scenarios:

Scenario 1: Your file doesn’t have headers, and data starts from the first row
In this scenario, you want to tell Pandas that your first row is actually data, and is not to be used as column names. Setting the header parameter to None will achieve this.  In the absence of headers, Pandas will assign generic column names like col_0, col_1 etc. To avoid this, you can pass your own column names to the data. For instance,within read_csv you can pass:

header=None, names = [‘My first column’, ‘My second column’, ..]

Scenario 2: Your data has some blank rows, and headers start thereafter

Use the skiprows parameter in read_csv or read_excel to specify how many lines of data to ignore. The next available line will automatically get assigned as your data header.

Scenario 3: Your headers are in the first row, but you want to use some other column names

Use the header and names parameters together to explicitly read the current headers, and replace them with your own. Let’s try this with the traffic dataset, by changing the column names to make them more descriptive and convert them to sentence case.

2.Getting your row and column selection right

You often don’t want to read all the columns in your data. Subsetting columns right at the data loading stage will reduce your memory usage right away. The usecols parameter in your read_csv or read_excel will achieve this. For instance, to read only the holiday, temperature and datetime columns:

Alternatively, you can also use a list of strings as the argument to usecols:

Similarly, the nrows parameter specifies how many rows we want to read from the file, starting from the top. This is a useful way to read only a  sample of your data if your file is very large. Let’s try reading only the first 100 rows out of 48,204 rows in the traffic data:

3.Getting your row indices right

Pandas assigns a default range of 0-indexed integers as labels for each row. However in machine learning projects, we often use more meaningful variables such as dates, unique transaction id, store id etc as the index column.  Specify this in your read_csv or read_excel function using the index_col parameter:

You can also use index_col = 7, which is the column index number of the datetime column.

If you’re using a subset of your columns with the usecols parameter, make sure you use the new index number for the date_time column based on the new column list. For instance:

What changed? We used 2 as the index number for date_time instead of 7!

4.Getting the right data types

Pandas infers the data type for all columns, but sometimes you may wish to specify your own datatypes for certain columns. To do this, you can use the dtype parameter with a dictionary of columns and their respective data types.

Let’s look at the current datatypes in the traffic data:

The date_time column has been parsed as string. We’d like to set this to the datetime format, so let’s do this using the parse_dates parameter. We’ll pass the index number of the date_time column as the argument to this.

Voila, we parsed the date column right at the data loading stage!

5.Handling specific missing values

Some datasets may not explicitly have missing values as NA, #N/A etc. They may instead have strings like ‘Unknown’ to indicate that the value is missing. If you know beforehand what strings you’d like to treat as missing values, you can pass these as a list to the na_values parameter.

For instance,we have 4 entries of the string 'squalls' in the weather_description column. Suppose you knew that the weather was actually pleasant on those day, and these entries are actually missing values. You want to set this as NA right away. You can do this as below:

You can see that the 4 rows with ‘squall’ in the weather description now have missing values. This makes it easier to handle missing values at later stages of your project.

Conclusion:
By now, you should be confident about using Pandas to read your next machine learning data set. This article gives you the basics you need to get started in your next machine project.



Get Started - Future proof your career

Join 150,000 aspirants. Learn Today - Apply Today. Try Free Programs

Learn Data Science Free with GLabs