Pandas is a third-party Python library that provides fast, flexible data structures for manipulating and analyzing tabular, multi-dimensional and time-series data.
In most machine learning challenges, the dataset is provided as .csv files that need to be loaded into the workspace. The loaded dataset is used for prediction, and the predicted values must also be submitted as a .csv file as per the submission guidelines.
In this article, I will explain a few ways of loading the dataset into your workspace and creating an output submission file.
The first way is to load the dataset directly without passing any arguments, which is useful when all the columns are used in model building.
import pandas as pd
data = pd.read_csv('train.csv')
data.head() # To view the first five rows of the dataset
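As a quick check of what a plain read_csv call produces, here is a minimal sketch that reads a small in-memory CSV instead of a real file. The column names Index, Feature and Target are made up for illustration and are not from any particular challenge.

```python
import io
import pandas as pd

# Hypothetical data standing in for train.csv
csv_text = "Index,Feature,Target\n101,0.5,1\n102,0.7,0\n"

# Without extra arguments every column is kept as a regular column
# and pandas generates a default integer index starting at 0
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)          # (rows, columns)
print(list(data.columns))  # all three columns are still present
print(data.index)          # the automatically generated index
```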
The second way is to load the dataset using the index_col argument when one of the columns is a unique identifier and will not be used in model building.
data = pd.read_csv('train.csv', index_col='Index')
data.head() # To view the first five rows of the dataset
Note: The index_col argument is a string (the column name) when only one column needs to be used as the index; it is a list of strings when two or more columns need to be used as the index.
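To see the difference between the two forms, here is a minimal sketch that reads a small in-memory CSV once with a single index column and once with two. The column names Index, Region and Score are assumptions made up for this example.

```python
import io
import pandas as pd

# Hypothetical data standing in for train.csv
csv_text = "Index,Region,Score\n1,north,0.5\n2,south,0.7\n"

# A single column name gives a regular, single-level index
single = pd.read_csv(io.StringIO(csv_text), index_col='Index')

# A list of column names gives a multi-level index (MultiIndex)
multi = pd.read_csv(io.StringIO(csv_text), index_col=['Index', 'Region'])

print(list(single.columns))  # the index column is no longer a regular column
print(multi.index.names)     # both columns now form the index
```

In both cases the columns moved into the index drop out of the regular columns, so they will not be picked up as features by accident.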
The third way is to load the dataset using the parse_dates argument when one or more columns in the dataset contain dates and should have the datatype 'datetime64[ns]'.
data = pd.read_csv('train.csv', parse_dates=['Date'])
data.head() # To view the first five rows of the dataset
Note: The parse_dates argument takes a list of column names, even when only one column needs to be parsed; passing a bare string such as 'Date' raises an error. Add more names to the list when two or more columns need to be parsed.
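The effect of parse_dates is easiest to see by comparing column types. The sketch below uses a small in-memory CSV with made-up Date and Value columns; the exact dtype names printed may vary between pandas versions.

```python
import io
import pandas as pd

# Hypothetical data standing in for train.csv
csv_text = "Date,Value\n2021-01-01,10\n2021-01-02,12\n"

# Without parse_dates the Date column is read as plain text
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the listed columns are converted to datetimes
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(plain.dtypes)
print(parsed.dtypes)

# Datetime columns expose the .dt accessor, e.g. to extract the year
print(parsed['Date'].dt.year.tolist())
```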
Finally, after all the model building, model tuning and prediction, you need to save the predictions in the format specified in the submission guidelines.
In most cases, the submission file contains two or more columns, one of which holds the predicted value.
For example, if the submission guidelines say that the output should have two columns named filename and value, then the code looks similar to the following.
Here the variable filename contains the filename values, and the variable prediction contains the predicted values.
output = pd.DataFrame({'filename': filename, 'value': prediction})
output.to_csv('output.csv', index=False)
Note: The index argument should be set to False so that to_csv does not add an extra index column to the output file.
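To make the effect of the index argument visible, here is a minimal sketch that writes the same DataFrame twice. The file names and prediction values are made up, and to_csv is called without a path so that it returns the CSV text as a string instead of writing a file.

```python
import pandas as pd

# Hypothetical values standing in for real file names and model predictions
filename = ['a.png', 'b.png']
prediction = [0.1, 0.9]

output = pd.DataFrame({'filename': filename, 'value': prediction})

# Default behaviour (index=True): the row index is written as an extra, unnamed first column
with_index = output.to_csv()

# index=False: only the two requested columns are written
without_index = output.to_csv(index=False)

print(with_index.splitlines()[0])     # header line with a leading empty column name
print(without_index.splitlines()[0])  # header line with just the two columns
```

A submission checker that expects exactly two columns would likely reject the first version, which is why index=False matters here.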