A little data science advent calendar for the cold days.
Most of the gifts are for Python or pandas 🎅:
01: Remove outliers
The first door is a one-liner:
df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]
# Keep only the rows of column 'Data' that lie within 3 standard deviations of the mean.
Source
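To see the one-liner at work, here is a minimal sketch on made-up data. The values and the column contents are my own assumptions, chosen so that a single extreme value falls outside the 3-sigma band.

```python
import numpy as np
import pandas as pd

# Hypothetical data: twenty values near 10 plus one extreme value.
df = pd.DataFrame({"Data": [10.0] * 10 + [11.0] * 10 + [1000.0]})

# Keep rows whose distance from the mean is at most 3 standard deviations.
mask = np.abs(df["Data"] - df["Data"].mean()) <= 3 * df["Data"].std()
cleaned = df[mask]

print(len(df), len(cleaned))
```

Keep in mind that the outlier itself inflates the mean and the standard deviation, so on very small samples this filter may not remove anything.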
Software used:
- Python
- Pandas
- Numpy
02: Global Shark Attacks
The second door is a small EDA:
How dangerous are sharks?
Link to Jupyter Notebook
Software used:
- Python
- Pandas
- Matplotlib
03: RegEx E-Mail-Address matching
The third door is a common regex for emails:
import re
# A raw string keeps the backslashes in the pattern from being read as string escapes
email = re.compile(r"([a-z0-9!#$%&'*+\/=?^_`{|.}~-]+@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)", re.IGNORECASE)
print(email.match("test@mail.com"))
Source: The common regex library
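A small sketch of how the pattern could be used in practice. The helper name `is_email` and the sample strings are my own assumptions. Note that `match` only anchors at the start of the string, so `fullmatch` is the safer choice for validating a whole value.

```python
import re

EMAIL = re.compile(
    r"([a-z0-9!#$%&'*+\/=?^_`{|.}~-]+@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+"
    r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)",
    re.IGNORECASE,
)


def is_email(value: str) -> bool:
    """Return True only if the whole string looks like an e-mail address."""
    return EMAIL.fullmatch(value) is not None


# Validate single values.
print(is_email("test@mail.com"))
print(is_email("test@mail.com, and more"))

# Extract all addresses from free text (findall returns the captured group).
found = EMAIL.findall("Contact alice@example.com or bob@test.org.")
print(found)
```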
Software used:
- Python
- RegEx
04: Fill NA Values
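The fourth door shows how to fill missing values in a bitcoin trend analysis by carrying the last known value forward:
btc = btc.replace(0, np.nan).ffill()
It replaces all zeros with np.nan (NaN, "not a number") and then forward-fills each gap with the last valid value before it. This is not the average; to fill with the column mean you would use btc.fillna(btc.mean()) instead.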
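Forward fill and mean fill give different results, so here is a minimal sketch that compares the two. The price series is made up, and I assume that a zero marks a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical prices; 0 stands for "no data".
prices = pd.Series([100.0, 0.0, 0.0, 130.0, 0.0])

with_nan = prices.replace(0, np.nan)

# Forward fill: repeat the last valid value.
forward_filled = with_nan.ffill()

# Mean fill: replace every gap with the mean of the valid values.
mean_filled = with_nan.fillna(with_nan.mean())

print(forward_filled.tolist())
print(mean_filled.tolist())
```

For time series such as prices, forward fill is often the more natural choice because it never uses information from the future.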
Link to the Jupyter Notebook
Further reading:
Pandas Documentation
Software used:
- Python
- Pandas
- Quandl
05: Find similarity between two strings
The fifth door shows how to get a ratio between 0 and 1 that expresses the similarity between two strings:
>>> import difflib
>>> a='John Doe'
>>> b='John Door'
>>> seq=difflib.SequenceMatcher(None, a,b)
>>> d=seq.ratio()
>>> print (d)
0.8235294117647058
Source
Software used:
- Python
- difflib
06: On what day is St. Nicholas celebrated
The sixth door shows how to do a rapid prototype with geodata.
On what day is St. Nicholas celebrated
In less than 20 minutes I could take the data from Wikipedia and edit the HTML.
A Google Maps API-Key may be needed.
Software used:
07: Discover emerging trends
The seventh door shows how to discover emerging trends.
Here is the Jupyter Notebook for it.
df.groupby([pd.Grouper(key='creationdate', freq='QS'), 'tagname']) \
.size() \
.unstack('tagname', fill_value=0) \
.pipe(lambda x: x.div(x.sum(1), axis=0)) \
.plot(kind='area', figsize=(12,6), title='Percentage of New Stack Overflow Questions') \
.legend(loc='upper left')
Many thanks to Theodore Petrou, the author of the Pandas Cookbook, for sending me this line of code :)
Software used:
- Python
- Pandas
- glob
08: Remove accents in Python
The eighth door shows how to remove accents in Python. This can be very useful for cleaning data:
import unidecode
# The example string contains an accented character
accented_string = 'Málaga'
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string now contains 'Malaga' (a plain ASCII str)
Source: Christian Oudard
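If installing unidecode is not an option, a standard-library alternative is to decompose the characters and drop the combining marks. This is a different technique from the one above, and the function name `strip_accents` is my own choice.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining accent marks using only the standard library."""
    # NFKD splits 'á' into the base letter 'a' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Málaga"))
print(strip_accents("Crème brûlée"))
```

Unlike unidecode, this sketch does not transliterate characters that have no decomposition (for example 'ø'), so those pass through unchanged.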
Software used:
- Python
- unidecode
09: Import csv with timeseries
The ninth door shows how to import CSV files and turn the date column into datetime objects.
Here is the documentation, but it does not say much about this.
The example:
import pandas as pd
# date_format needs pandas 2.0 or later; older versions took a date_parser function instead
pd.read_csv('date_example.csv', parse_dates=['date'], date_format='%Y-%m-%d')
Many thanks to Theodore Petrou, the author of the Pandas Cookbook, for sending me this line of code :)
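A self-contained sketch that should behave the same across pandas versions: read the file first and convert the column afterwards with `pd.to_datetime`. The CSV content is made up and read from memory, so no file is needed.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'date_example.csv'.
csv_text = "date,value\n2017-12-01,1\n2017-12-02,2\n"

df = pd.read_csv(io.StringIO(csv_text))

# Convert the text column to datetimes with an explicit format.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

print(df.dtypes)
```

An explicit format is stricter than letting pandas guess, and a value that does not fit raises an error instead of being silently misread.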
Software used:
- Python
- Pandas
10: Bigrams, bigrams
The tenth door shows how to count the frequencies of bigrams (pairs of adjacent words).
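import nltk
# word_tokenize needs the NLTK tokenizer data to be downloaded once
with open('a_text_file') as f:
    raw = f.read()
tokens = nltk.word_tokenize(raw)
# Pair every token with its successor and count the pairs
bgs = nltk.bigrams(tokens)
nltk.FreqDist(bgs)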
Source:
Code and Image from the great EDA of horror books
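To make clear what a bigram count actually is, here is a minimal sketch with the standard library only. It swaps NLTK's tokenizer for a plain whitespace split, which is an assumption that only holds for simple text, and the sample sentence is made up.

```python
from collections import Counter

# Hypothetical sample text.
text = "the cat sat on the mat and the cat slept"

# Naive tokenization: split on whitespace.
tokens = text.split()

# A bigram is a pair of neighbouring tokens.
bigrams = list(zip(tokens, tokens[1:]))

freq = Counter(bigrams)
print(freq.most_common(3))
```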
Software used:
- Python
- Pandas
- NLTK
11: Correlation heatmap
The eleventh door shows how to do a simple correlation heatmap.
I like the immediate visual experience of formal data.
import seaborn as sns
corr = dataframe.corr()
sns.heatmap(corr,
xticklabels=corr.columns.values,
yticklabels=corr.columns.values)
Source: Stack Overflow user Rafael Valle; image from this post
Software used:
- Python
- Pandas
- Seaborn
12: Chart: Radioactive elements in crunchy peanut butter
The twelfth door shows which radioactive elements can be found in peanut butter. Yes, in peanut butter. I found this during a research session and had to share it :)
Data Source and sheet with chart
Software used:
- Hamburg Transparenz Portal
- Google Sheets
13: Groupby Statistics
The thirteenth door shows how easily you can group data in pandas and get summary statistics per group:
df['preTestScore'].groupby(df['regiment']).describe()
Source
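A minimal sketch of what this returns, on made-up data. The regiment names and scores are my own assumptions.

```python
import pandas as pd

# Hypothetical scores for two regiments.
df = pd.DataFrame({
    "regiment": ["Nighthawks", "Nighthawks", "Dragoons", "Dragoons"],
    "preTestScore": [4, 24, 31, 3],
})

# One row per regiment, one column per statistic.
stats = df["preTestScore"].groupby(df["regiment"]).describe()
print(stats)
```

The result has one row per group and the columns count, mean, std, min, 25%, 50%, 75% and max.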
Software used:
- Python
- Pandas
15: Formatting Floats
When working with floats, the default output is sometimes hard to read, so you can tell pandas how many decimal places to display (here two, with a dollar sign and thousands separators):
pd.options.display.float_format = '${:,.2f}'.format
Source: Unutbu
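A short sketch of the effect, using a made-up DataFrame. The option is global, so I reset it at the end.

```python
import pandas as pd

# The formatter is just the bound str.format method of the template.
formatter = "${:,.2f}".format
print(formatter(1234.5678))

pd.options.display.float_format = formatter

# Hypothetical prices.
df = pd.DataFrame({"price": [1234.5678, 0.000123]})
text = df.to_string()
print(text)

# Restore the default so other output is not affected.
pd.reset_option("display.float_format")
```

This only changes how the numbers are displayed; the stored values keep their full precision.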
Software used:
- Python
- Pandas
16: Short PCAs
The sixteenth door shows how to do a PCA in one line:
from sklearn.decomposition import PCA
PCA(n_components = 4).fit_transform(data)
Join the discussion
Software used:
- Python
- sklearn
17: Timeseries. Get rows that were created in the last hour
The seventeenth door shows how to get rows that were created in the last hour:
import datetime
df[df['datetime'] > datetime.datetime.now() - datetime.timedelta(hours=1)]
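Here is a reproducible sketch of the same filter. I fix "now" to a made-up timestamp so the result does not depend on when you run it; the timestamps and the `value` column are assumptions as well.

```python
import datetime

import pandas as pd

# A fixed stand-in for datetime.datetime.now().
now = datetime.datetime(2017, 12, 17, 12, 0, 0)

# Hypothetical rows with their creation times.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2017-12-17 10:30",
        "2017-12-17 11:15",
        "2017-12-17 11:59",
    ]),
    "value": [1, 2, 3],
})

# Keep rows created after the cutoff, one hour before "now".
cutoff = now - datetime.timedelta(hours=1)
recent = df[df["datetime"] > cutoff]
print(recent)
```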
Software used:
18: Sort by minimum or maximum
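The eighteenth door shows a simple way of combining groupby and describe, and then sorting the result by the minimum (or maximum):
# 'score' is a hypothetical numeric column; selecting one column gives flat 'min'/'max' columns to sort by
df.groupby(by='platform')['score'].describe().sort_values(by='min')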
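A minimal sketch on made-up data. I assume a single numeric column, here called `score`, is selected before `describe()`, so that the result has flat column names such as 'min' and 'max' that `sort_values` can use directly.

```python
import pandas as pd

# Hypothetical review scores per platform.
df = pd.DataFrame({
    "platform": ["PC", "PC", "Wii", "Wii", "PS4", "PS4"],
    "score": [5.0, 9.0, 2.0, 8.0, 7.0, 7.5],
})

stats = df.groupby("platform")["score"].describe()

# Platforms ordered by their lowest score.
by_min = stats.sort_values(by="min")

# Platforms ordered by their highest score, best first.
by_max = stats.sort_values(by="max", ascending=False)

print(by_min[["min", "max"]])
```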
Software used:
- Python
- Pandas
19: Rolling average in Holoviews
The nineteenth door shows how to apply rolling averages in holoviews to smooth out curves.
import holoviews as hv
from holoviews.operation.timeseries import rolling
hv.extension('bokeh')
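# x and y hold the names of the columns to plot
xs = df[x]
ys = df[y]
curve = hv.Curve((xs, ys))
# Smooth the curve defined above with a rolling window
avg_curve = rolling(curve, rolling_window=1200).relabel('Average')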
avg_curve
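If you just want to see what a rolling average does to the numbers, here is a sketch in plain pandas, independent of HoloViews. The series is made up, and I use pandas' default trailing window, which may be aligned differently from the HoloViews operation.

```python
import pandas as pd

# Hypothetical measurements.
ys = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each value becomes the mean of itself and the two values before it.
smoothed = ys.rolling(window=3).mean()
print(smoothed.tolist())
```

The first two entries are NaN because a full window of three values is not available yet.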
Software used:
- Python
- Holoviews
- Pandas
20: Display simple tables as a web app using Flask
The twentieth door shows how to display a dataframe as an HTML table.
Source: Sarah Lee Jane
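The pandas half of this can be sketched without a web server: `DataFrame.to_html` returns the table as an HTML string, which a Flask view could then return or pass to a template. The data below is made up, and the Flask part is left out on purpose.

```python
import pandas as pd

# Hypothetical data to display.
df = pd.DataFrame({"name": ["Ada", "Grace"], "score": [95, 88]})

# Render the dataframe as an HTML table without the index column.
html = df.to_html(index=False)
print(html)
```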
Software used:
- Python
- Flask
- Pandas
21: Book recommendation
This door is a book recommendation: my favorite book,
written by the creator of pandas.
Books used:
- Python for Data Analysis
22: Simple Scraper in Google Sheets
With Google Apps Script it is also possible to archive data.
Link:
Google Sheets
Software used:
- Google Sheets
- Google Apps Script
24: Find similar repos by looking at other stargazers' stars
This is my special present for all the explorers
Github Repo Recommender
Software used:
- Python
- Pandas
- Github API