Load data

Pandas features a number of functions for reading tabular or dataframe data.

When you already have a source of data you want to use for your analysis, it will probably be in a format other than a Pandas DataFrame. Pandas has you covered: it can import existing files in many formats.

Not every file is a comma-separated file. Pandas has a number of functions for reading tabular data into a DataFrame object, and a reader is available for most common formats:

    read_csv
    read_fwf
    read_clipboard
    read_excel
    read_hdf
    read_html
    read_json
    read_msgpack (removed in pandas 1.0)
    read_pickle
    read_sas
    read_sql
    read_stata
    read_feather
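The readers all follow the same pattern: pass a path or a file-like object and get a DataFrame back. The sketch below illustrates this with two of them. The sample data and variable names are made up for illustration, and the data is kept in memory so that no files are needed.

```python
import io
import pandas as pd

# Hypothetical sample data, held in memory so the example needs no files on disk.
csv_text = "name,score\nalice,1\nbob,2\n"
json_text = '[{"name": "alice", "score": 1}, {"name": "bob", "score": 2}]'

# Each reader accepts a path or a file-like object and returns a DataFrame.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

In real use you would normally pass a file path instead of a StringIO object.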

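These functions read the data directly into a Pandas DataFrame. Many of them, read_csv in particular, accept options to declare which values should be treated as missing (NaN), to set the separator and encoding, to read only part of a file by giving a number of rows or a chunk size, or to skip lines at the start or end of the file: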

    na_values
    skiprows
    sep
    nrows
    chunksize
    skipfooter
    encoding
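As a minimal sketch of how several of these options combine in read_csv, the example below parses a small made-up file held in memory. The file contents, the ";" separator, and the "missing" marker are assumptions chosen for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: a comment line, a header, four data rows, a footer line.
raw = (
    "# exported by some tool\n"
    "id;value\n"
    "1;10\n"
    "2;missing\n"
    "3;30\n"
    "4;40\n"
    "total;80\n"
)

# Skip the leading comment line and the trailing footer line, and treat
# the string "missing" as NaN. skipfooter requires the Python parsing engine.
df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    skiprows=1,
    skipfooter=1,
    na_values=["missing"],
    engine="python",
)
print(df)

# nrows limits how many data rows are read.
head = pd.read_csv(io.StringIO(raw), sep=";", skiprows=1, nrows=2)

# chunksize returns an iterator of DataFrames instead of a single DataFrame.
clean = "id;value\n1;10\n2;20\n3;30\n"
sizes = [len(chunk) for chunk in pd.read_csv(io.StringIO(clean), sep=";", chunksize=2)]
print(sizes)
```

Reading in chunks is mostly useful for files that are too large to load into memory at once.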

As an example, the following code reads a tab-separated CSV file that has no header row, skips the first 10 lines, and assigns the column names explicitly.

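    import sys
    import pandas as pd

    args = sys.argv  # assumption: args holds the command-line arguments

    df = pd.read_csv(args[1],                        # path to the input file
                     header=None,                    # the file has no header row
                     delimiter="\t",                 # tab-separated (alias of sep)
                     encoding='utf-8',
                     skiprows=10,                    # skip the first 10 lines
                     names=['doc', 'line', 'text'])  # column names to assign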
    

Note that Pandas is pretty smart about inferring the correct data types, but you should always check that columns have the types you think they should have. Use the ".info()" method and the ".dtypes" attribute for this. (See the "Data Inspection" chapter for more on this.)
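A short sketch of why this check matters, using made-up data in which one column holds zero-padded identifiers. The "dtype" argument of read_csv is not covered in the text above; it is shown here as one possible way to override the inferred type.

```python
import io
import pandas as pd

# Hypothetical data: the "code" column holds zero-padded identifiers.
raw = "code,amount\n007,1.5\n042,2.0\n"

# With default type inference, "code" is parsed as an integer
# and the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(raw))
print(inferred.dtypes)

# Passing dtype for that column keeps the values as text.
explicit = pd.read_csv(io.StringIO(raw), dtype={"code": str})
print(explicit["code"].tolist())

# info() prints column names, non-null counts, and dtypes in one summary.
explicit.info()
```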
