- 1. Introduction
- 2. Reading in the dataset
- 3. Getting an overview
- 4. Finding the TOP 10 contributors
- 5. Wrangling the data
- 6. Treating wrong timestamps
- 7. Grouping commits per year
- 8. Visualizing the history of Linux
- 9. Conclusion
1. Introduction
Version control repositories like CVS, Subversion or Git can be a real gold mine for software developers. They contain every change to the source code including the date (the "when"), the responsible developer (the "who"), as well as a little message that describes the intention (the "what") of a change.
In this notebook, we will analyze the evolution of a very famous open-source project – the Linux kernel. The Linux kernel is the heart of some Linux distributions like Debian, Ubuntu or CentOS. Our dataset contains the history of kernel development over almost 13 years (early 2005 - late 2017). We get some first insights into the work of the development team by
- identifying the TOP 10 contributors and
- visualizing the commits over the years.
import pandas as pd
# Printing a small excerpt of the Git log to see what the raw data looks like
data = pd.read_csv('datasets/git_log_excerpt.csv')
print(data)
2. Reading in the dataset
The dataset was created by using the command `git log --encoding=latin-1 --pretty="%at#%aN"` in late 2017. The `latin-1` encoded text output was saved in a header-less CSV file. In this file, each row is a commit entry with the following information:

- `timestamp`: the time of the commit as a UNIX timestamp in seconds since 1970-01-01 00:00:00 (Git log placeholder `%at`)
- `author`: the name of the author that performed the commit (Git log placeholder `%aN`)

The columns are separated by the number sign `#`. The complete dataset is in the `datasets/` directory. It is a `gz`-compressed CSV file named `git_log.gz`.
import pandas as pd
# Reading in the full, gz-compressed log file
git_log = pd.read_csv('datasets/git_log.gz',
                      sep='#',
                      encoding='latin-1',
                      compression='gzip',
                      header=None,
                      names=['timestamp', 'author'])
# Printing out the first 5 rows
print(git_log.head())
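Before counting commits and authors, it is worth checking whether any rows lack an author name, since such rows should not count towards the number of distinct authors. A minimal check, assuming git_log was loaded as above:
# Counting commits whose author name is missing
missing_authors = git_log['author'].isna().sum()
print("Commits with a missing author name: %s" % missing_authors)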
# counting the total number of commits
number_of_commits = len(git_log)
# counting the number of distinct authors (missing author names are ignored)
number_of_authors = git_log['author'].nunique()
# printing out the results
print("%s authors committed %s code changes." % (number_of_authors, number_of_commits))
# Identifying the TOP 10 authors by number of commits
top_10_authors = (git_log.groupby('author')
                         .count()
                         .sort_values('timestamp', ascending=False)
                         .head(10))
# Listing contents of 'top_10_authors'
top_10_authors
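To put the listing into perspective, the counts can be renamed and expressed as a share of all commits. A small optional sketch, assuming top_10_authors and number_of_commits from the cells above:
# Renaming the count column and adding each author's share of all commits
top_10_share = top_10_authors.rename(columns={'timestamp': 'commits'})
top_10_share['share'] = top_10_share['commits'] / number_of_commits
print(top_10_share)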
# converting the UNIX epoch timestamps (seconds) to native datetimes
git_log['timestamp'] = pd.to_datetime(git_log['timestamp'], unit='s')
# summarizing the converted timestamp column
git_log[['timestamp']].describe()
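The summary above usually hints at implausible values, e.g. commits dated before the kernel repository existed or far in the future. A quick sketch to count them; the 2005/2018 cut-offs are assumptions derived from the dataset description, not part of the original task:
# Counting commits whose timestamps fall outside the plausible 2005-2017 window
too_early = (git_log['timestamp'] < pd.to_datetime('2005-01-01')).sum()
too_late = (git_log['timestamp'] > pd.to_datetime('2018-01-01')).sum()
print("Suspicious timestamps: %s before 2005, %s after 2017." % (too_early, too_late))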
# determining the first sensible commit timestamp (Linus Torvalds' first commit)
first_commit_timestamp = git_log[git_log['author'] == 'Linus Torvalds']['timestamp'].min()
# determining the last sensible commit timestamp
last_commit_timestamp = pd.to_datetime('today')
# filtering out wrong timestamps
corrected_log = git_log[(git_log['timestamp'] >= first_commit_timestamp)
                        & (git_log['timestamp'] <= last_commit_timestamp)]
# summarizing the corrected timestamp column
corrected_log['timestamp'].describe()
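To see how much data the filter removed, the row counts before and after can be compared; a minimal check using the frames defined above:
# Comparing the number of commits before and after filtering
print("Dropped %s of %s commits with implausible timestamps."
      % (len(git_log) - len(corrected_log), len(git_log)))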
# Counting the number of commits per year (calendar-year bins)
commits_per_year = corrected_log.groupby(pd.Grouper(key='timestamp', freq='AS')).count()
# Listing the first rows
commits_per_year.head(5)
%matplotlib inline
# Setting up inline plotting in Jupyter and plotting the commits per year
commits_per_year.plot(kind='line', title='Commits per year (Linux kernel)', legend=False)
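A bar chart with plain year labels can be easier to read than the line plot. An optional variant; the axis labels and title wording here are illustrative choices, not part of the original task:
# Alternative: bar chart of commits per year with readable year labels
yearly_commits = commits_per_year['author'].copy()
yearly_commits.index = yearly_commits.index.year
ax = yearly_commits.plot(kind='bar', legend=False, title='Commits per year (Linux kernel)')
ax.set_xlabel('Year')
ax.set_ylabel('Number of commits')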
# Determining the year with the most commits
year_with_most_commits = commits_per_year[commits_per_year['author'] == commits_per_year['author'].max()]
year_with_most_commits.index[0]
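Finally, the result can be reported as a single readable sentence; a small sketch assuming the variables defined above:
# Printing the busiest year and its commit count
busiest_year = year_with_most_commits.index[0].year
busiest_count = year_with_most_commits['author'].iloc[0]
print("The year with the most commits was %s with %s commits." % (busiest_year, busiest_count))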