Load data
Pandas features a number of functions for reading tabular or dataframe data.
When you already have a source of data you want to analyze, it is probably stored in some format other than a Pandas DataFrame. Pandas has you covered: it can import existing files in a wide variety of formats.
Not every file is comma-separated. Pandas has a number of functions for reading tabular data into a DataFrame object, and one is available for most common formats:
read_csv
read_fwf
read_clipboard
read_excel
read_hdf
read_html
read_json
read_msgpack (deprecated in Pandas 0.25 and removed in 1.0)
read_pickle
read_sas
read_sql
read_stata
read_feather
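As a rough illustration of how these readers share one interface, the sketch below loads the same small table from CSV, JSON and fixed-width text. The sample data and variable names are made up for this example, and io.StringIO stands in for real files on disk.

```python
import io
import pandas as pd

# Hypothetical sample data: the same two-row table in three formats.
csv_text = "name,score\nada,1\nbob,2\n"
json_text = '[{"name": "ada", "score": 1}, {"name": "bob", "score": 2}]'
fwf_text = "ada  1\nbob  2\n"  # fixed-width: 5 characters, then 1 character

# Each reader returns a DataFrame, whatever the input format.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))
df_fwf = pd.read_fwf(io.StringIO(fwf_text),
                     widths=[5, 1],           # column widths in characters
                     header=None,             # no header line in this text
                     names=["name", "score"])

print(df_csv)
print(df_json)
print(df_fwf)
```

With real data you would pass a file path (or URL) instead of the StringIO object.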
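These functions read the data directly into a Pandas DataFrame. Many of them, read_csv in particular, accept options to declare which values should be treated as missing (NaN), to read only part of the file by limiting the number of rows or setting a chunk size, or to skip lines at the start or end of the file. Commonly used options are: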
na_values
skiprows
sep
nrows
chunksize
skipfooter (formerly skip_footer)
encoding
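The options above can be sketched on a small in-memory file. This is a minimal example under stated assumptions: the semicolon-separated sample text, its comment line and its "total" footer are invented for illustration, and io.StringIO replaces a real file.

```python
import io
import pandas as pd

# Hypothetical export: one junk line on top, a summary line at the bottom.
raw = (
    "# exported by a hypothetical tool\n"
    "id;value\n"
    "1;10\n"
    "2;missing\n"
    "3;30\n"
    "4;40\n"
    "total;80\n"
)

# Skip the first line and the footer, and treat "missing" as NaN.
df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    skiprows=1,
    na_values=["missing"],
    skipfooter=1,
    engine="python",  # skipfooter is only supported by the python engine
)

# Read only the first two data rows.
head = pd.read_csv(io.StringIO(raw), sep=";", skiprows=1, nrows=2)

# Read the file in chunks of three rows; each chunk is a DataFrame.
chunk_sizes = [
    len(chunk)
    for chunk in pd.read_csv(io.StringIO(raw), sep=";", skiprows=1, chunksize=3)
]

print(df)
print(head)
print(chunk_sizes)
```

Reading in chunks is mainly useful when a file is too large to fit comfortably in memory.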
As an example, the following code reads a tab-separated CSV file that has no header row, skips the first 10 lines, and assigns column names.
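import sys
import pandas as pd

# Assumption: the input path is the first command-line argument.
args = sys.argv
df = pd.read_csv(args[1],
                 header=None,        # the file has no header row
                 delimiter="\t",     # tab-separated (alias of sep)
                 encoding='utf-8',
                 skiprows=10,        # skip the first 10 lines
                 names=['doc', 'line', 'text'])  # column names to assign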
Note that Pandas is pretty smart about inferring the correct datatypes, but you should always check that columns have the types you think they should have. Use the ".info()" method and the ".dtypes" attribute for this. (See the "Data Inspection" chapter for more info on this.)
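A small sketch of such a check follows. The tab-separated sample text is made up, and the "unknown" entry is an assumed example of a stray value that stops Pandas from reading a numeric column as integers.

```python
import io
import pandas as pd

# Hypothetical data: the 'line' column contains one non-numeric entry.
raw = "A\t1\thello\nB\t2\tworld\nC\tunknown\tagain\n"

df = pd.read_csv(io.StringIO(raw),
                 header=None,
                 sep="\t",
                 names=["doc", "line", "text"])

# Inspect what Pandas inferred for each column.
df.info()
print(df.dtypes)

# 'line' was not read as an integer column because of the stray value.
line_is_integer = pd.api.types.is_integer_dtype(df["line"])

# One possible fix: convert explicitly, turning unparseable entries into NaN.
df["line"] = pd.to_numeric(df["line"], errors="coerce")
print(df.dtypes)
```

After the conversion the column is numeric, with NaN where the original value could not be parsed.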