In the same way that the journey of a thousand miles begins with a single step, the journey of a machine learning project begins with loading your data.
Load your data right, and you will lay a strong foundation for your model.
Data for a machine learning project may take many forms, such as tabular, image, HTML, or text. Files come in many different formats, such as pickle (.pkl), spreadsheet formats (.csv or .xlsx), Stata files (.dta), HDF (.h5), and many more. When getting started, you'll usually encounter CSV or Excel files. These can be loaded through several libraries, including Pandas, NumPy, and Python's csv module. Pandas is a popular and user-friendly option for most machine learning practitioners, and it provides a great set of functionalities for correctly parsing your data.
This article will familiarize you with five useful Pandas data loading and preprocessing capabilities for CSV or Excel files.
We will use the open source Metro Interstate Traffic Volume data. Download it from the UCI Machine Learning Repository, and make sure the file is stored in the same directory as your Python notebook or script. Import the Pandas library and load your data using read_csv. The corresponding function for Excel files is read_excel. Both return the data as a DataFrame, which is the Pandas data structure for 2-dimensional data. We'll assign the DataFrame to a variable called df.
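A minimal sketch of this loading step. In practice you would pass the downloaded file's path to read_csv; here a tiny inline sample with the same columns as the traffic dataset stands in for the file so the snippet runs on its own:

```python
from io import StringIO

import pandas as pd

# In practice, point read_csv at the downloaded file, e.g.:
#   df = pd.read_csv('Metro_Interstate_Traffic_Volume.csv')
# Here, a two-row inline sample with the same columns stands in for the file.
sample = (
    "holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,"
    "weather_description,date_time,traffic_volume\n"
    "None,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545\n"
    "None,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516\n"
)
df = pd.read_csv(StringIO(sample))
print(df.shape)  # (2, 9)
```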
For Excel files:
df = pd.read_excel('file_path/file_name.xlsx', sheet_name='Sheet_1')
The sheet_name parameter in read_excel tells Pandas which sheet to load your data from. You can specify this either as a string (the sheet name) or as a 0-indexed sheet number. The default value is 0, which means the first sheet will be parsed. To read all sheets, set this parameter to None.
1. Getting your column names right
Pandas usually assigns column names by interpreting the first line of the file as the header row. However, if you want to replace these with your own custom column names, you can do so with the header and names parameters. Below are some commonly encountered scenarios:
Scenario 1: Your file doesn’t have headers, and data starts from the first row
In this scenario, you want to tell Pandas that your first row is actually data and is not to be used as column names. Setting the header parameter to None achieves this. In the absence of headers, Pandas assigns the integers 0, 1, 2, and so on as generic column names. To avoid this, you can pass your own column names. For instance, within read_csv you can pass:
header=None, names=['My first column', 'My second column', ...]
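A sketch of Scenario 1 on a headerless inline sample (the column names passed to names are illustrative):

```python
from io import StringIO

import pandas as pd

# The first row is data, so tell Pandas there is no header
# and supply our own column names.
raw = (
    "None,288.28,2012-10-02 09:00:00,5545\n"
    "None,289.36,2012-10-02 10:00:00,4516\n"
)
df = pd.read_csv(
    StringIO(raw),
    header=None,
    names=['holiday', 'temp', 'date_time', 'traffic_volume'],
)
print(df.columns.tolist())  # ['holiday', 'temp', 'date_time', 'traffic_volume']
print(len(df))              # 2 -- the first row was kept as data
```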
Scenario 2: Your data has some blank rows, and headers start thereafter
Use the skiprows parameter in read_csv or read_excel to specify how many lines to ignore from the top of the file. The next available line will automatically be used as your data header.
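A sketch of Scenario 2, assuming two blank lines precede the real header:

```python
from io import StringIO

import pandas as pd

# Two blank lines precede the header row.
raw = (
    "\n"
    "\n"
    "holiday,temp\n"
    "None,288.28\n"
    "None,289.36\n"
)
# Skip the first two lines; the next line becomes the header.
df = pd.read_csv(StringIO(raw), skiprows=2)
print(df.columns.tolist())  # ['holiday', 'temp']
```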
Scenario 3: Your headers are in the first row, but you want to use some other column names
Use the header and names parameters together to explicitly read the current headers and replace them with your own. Let's try this with the traffic dataset by changing the column names to make them more descriptive and converting them to sentence case.
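A sketch of this on a few of the traffic columns (the replacement names below are illustrative): header=0 marks the first row as the existing header, which names then overrides.

```python
from io import StringIO

import pandas as pd

raw = (
    "holiday,temp,date_time\n"
    "None,288.28,2012-10-02 09:00:00\n"
)
# header=0 consumes the original header row; names replaces it.
df = pd.read_csv(
    StringIO(raw),
    header=0,
    names=['Holiday flag', 'Temperature in kelvin', 'Date and time'],
)
print(df.columns.tolist())
print(len(df))  # 1 -- only the data row remains
```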
2. Getting your row and column selection right
You often don't want to read all the columns in your data. Subsetting columns right at the data loading stage reduces your memory usage right away. The usecols parameter in read_csv or read_excel achieves this. For instance, to read only the holiday, temperature, and datetime columns:
Alternatively, you can also use a list of column-name strings as the argument to usecols.
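Both forms, sketched on an inline sample of the traffic columns: positions refer to the original column order, names to the header strings.

```python
from io import StringIO

import pandas as pd

raw = (
    "holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,"
    "weather_description,date_time,traffic_volume\n"
    "None,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545\n"
)
# Select columns by position...
df_pos = pd.read_csv(StringIO(raw), usecols=[0, 1, 7])
# ...or by column name.
df_name = pd.read_csv(StringIO(raw), usecols=['holiday', 'temp', 'date_time'])
print(df_pos.columns.tolist())   # ['holiday', 'temp', 'date_time']
print(df_name.columns.tolist())  # ['holiday', 'temp', 'date_time']
```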
The nrows parameter specifies how many rows to read from the file, starting from the top. This is a useful way to read only a sample of your data if the file is very large. Let's try reading only the first 100 of the 48,204 rows in the traffic data:
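The same call on an inline stand-in file (the real traffic CSV has 48,204 data rows; the stand-in here has 1,000):

```python
from io import StringIO

import pandas as pd

# A stand-in file with 1,000 data rows; the real traffic CSV has 48,204.
raw = "traffic_volume\n" + "".join(f"{i}\n" for i in range(1000))
# Read only the first 100 rows after the header.
df = pd.read_csv(StringIO(raw), nrows=100)
print(len(df))  # 100
```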
3. Getting your row indices right
Pandas assigns a default range of 0-indexed integers as labels for each row. However, in machine learning projects we often want a more meaningful index, such as a date, a unique transaction ID, or a store ID. Specify this in read_csv or read_excel using the index_col parameter. You can pass either the column name or its position: for the traffic data, index_col='date_time' or index_col=7, since date_time is the column at index 7.
If you're reading a subset of your columns with the usecols parameter, make sure you use the new index number for the date_time column, based on the new column list. For instance:
What changed? We used 2 as the index number for date_time instead of 7, because date_time is the third column in the subset.
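Both cases, sketched on an inline sample: with the full column list date_time sits at position 7, but once usecols narrows the columns, index_col counts positions within the selected subset.

```python
from io import StringIO

import pandas as pd

raw = (
    "holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,"
    "weather_description,date_time,traffic_volume\n"
    "None,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545\n"
)
# Full column list: date_time sits at position 7.
df_full = pd.read_csv(StringIO(raw), index_col=7)
# With usecols, index_col counts positions within the *selected* columns,
# so date_time is now at position 2.
df_sub = pd.read_csv(
    StringIO(raw),
    usecols=['holiday', 'temp', 'date_time'],
    index_col=2,
)
print(df_full.index.name)  # date_time
print(df_sub.index.name)   # date_time
```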
4. Getting the right data types
Pandas infers the data type of each column, but sometimes you may wish to specify your own data types for certain columns. To do this, use the dtype parameter with a dictionary mapping column names to their respective data types.
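A sketch on two of the traffic columns (the chosen dtypes here are illustrative):

```python
from io import StringIO

import pandas as pd

raw = (
    "clouds_all,traffic_volume\n"
    "40,5545\n"
    "75,4516\n"
)
# Force clouds_all to float and traffic_volume to a smaller integer type.
df = pd.read_csv(
    StringIO(raw),
    dtype={'clouds_all': 'float64', 'traffic_volume': 'int32'},
)
print(df.dtypes)
```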
Let’s look at the current datatypes in the traffic data:
The date_time column has been parsed as a string (dtype object). We'd like to set this to the datetime format, so let's do this using the parse_dates parameter. We'll pass a list containing the name (or index number) of the date_time column as the argument.
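A sketch of this on an inline sample:

```python
from io import StringIO

import pandas as pd

raw = (
    "date_time,traffic_volume\n"
    "2012-10-02 09:00:00,5545\n"
    "2012-10-02 10:00:00,4516\n"
)
# parse_dates takes a list of column names (or positions) to parse as dates.
df = pd.read_csv(StringIO(raw), parse_dates=['date_time'])
print(df['date_time'].dtype)  # datetime64[ns]
```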
Voila, we parsed the date column right at the data loading stage!
5. Handling specific missing values
Some datasets may not explicitly mark missing values as NA, #N/A, etc. They may instead use strings like 'Unknown' to indicate that a value is missing. If you know beforehand which strings you'd like to treat as missing values, you can pass these as a list to the na_values parameter.
For instance, we have 4 entries of the string 'squalls' in the
weather_description column. Suppose you knew that the weather was actually pleasant on those days, and that these entries represent missing values you want to set as NA right away. You can do this as below:
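A sketch of this on an inline sample with two 'squalls' entries standing in for the dataset's four:

```python
from io import StringIO

import pandas as pd

raw = (
    "weather_description,traffic_volume\n"
    "squalls,5545\n"
    "sky is clear,4516\n"
    "squalls,3000\n"
)
# Treat the string 'squalls' as a missing value at load time.
df = pd.read_csv(StringIO(raw), na_values=['squalls'])
print(df['weather_description'].isna().sum())  # 2
```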
You can see that the 4 rows with 'squalls' in the weather description now have missing values. This makes them easier to handle at later stages of your project.
By now, you should be confident about using Pandas to read your next machine learning dataset. This article gives you the basics you need to get started on your next machine learning project.