1. Introduction

[Image: Tux, the Linux mascot]

Version control repositories like CVS, Subversion or Git can be a real gold mine for software developers. They contain every change to the source code including the date (the "when"), the responsible developer (the "who"), as well as a little message that describes the intention (the "what") of a change.

In this notebook, we will analyze the evolution of a very famous open-source project: the Linux kernel. The Linux kernel is the heart of Linux distributions such as Debian, Ubuntu, and CentOS. Our dataset contains almost 13 years of kernel development history (early 2005 to late 2017). We will gain some insight into the development effort by

  • identifying the TOP 10 contributors and
  • visualizing the commits over the years.
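
import pandas as pd

# Taking a first look at the raw excerpt. With default settings, read_csv treats
# the first row as the header, so the first commit shows up as the column name.
data = pd.read_csv('datasets/git_log_excerpt.csv')
print(data)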
      1502382966#Linus Torvalds
0       1501368308#Max Gurtovoy
1        1501625560#James Smart
2        1501625559#James Smart
3       1500568442#Martin Wilck
4           1502273719#Xin Long
5    1502278684#Nikolay Borisov
6  1502238384#Girish Moodalbail
7   1502228709#Florian Fainelli
8     1502223836#Jon Paul Maloy
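
Note that in the preview above the first commit (Linus Torvalds) appears as the column header, because read_csv assumes a header row by default. The following is a minimal sketch of that difference. It uses a three-line in-memory sample copied from the preview instead of the real file, and the variable names are illustrative only.

```python
import io
import pandas as pd

# Three rows copied from the preview above, in the raw "timestamp#author" format.
sample = (
    "1502382966#Linus Torvalds\n"
    "1501368308#Max Gurtovoy\n"
    "1501625560#James Smart\n"
)

# Default behaviour: the first row is consumed as the header, leaving 2 data rows.
with_default_header = pd.read_csv(io.StringIO(sample))

# Header-less read with an explicit separator and column names: all 3 rows are data.
with_names = pd.read_csv(io.StringIO(sample), sep='#', header=None,
                         names=['timestamp', 'author'])

print(len(with_default_header), len(with_names))
print(with_names)
```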

2. Reading in the dataset

The dataset was created by using the command git log --encoding=latin-1 --pretty="%at#%aN" in late 2017. The latin-1 encoded text output was saved in a header-less CSV file. In this file, each row is a commit entry with the following information:

  • timestamp: the time of the commit as a UNIX timestamp in seconds since 1970-01-01 00:00:00 UTC (Git log placeholder "%at")
  • author: the name of the author that performed the commit (Git log placeholder "%aN")

The columns are separated by the number sign #. The complete dataset is in the datasets/ directory. It is a gz-compressed csv file named git_log.gz.
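
To make the row format concrete, here is a small sketch that parses two such lines by hand using only the Python standard library. The two sample lines are copied from the excerpt shown earlier. The helper name parse_line is hypothetical and not part of the notebook.

```python
from datetime import datetime, timezone

# Two raw lines copied from the excerpt shown earlier.
raw_lines = ["1502382966#Linus Torvalds", "1501368308#Max Gurtovoy"]

def parse_line(line):
    """Split one 'timestamp#author' line into a (datetime, author) pair."""
    # Split only on the first '#', so a '#' inside an author name would be kept.
    seconds, author = line.split('#', 1)
    # The timestamp counts seconds since the UNIX epoch; interpret it as UTC.
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc), author

parsed = [parse_line(line) for line in raw_lines]
for when, who in parsed:
    print(when.isoformat(), who)
```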

import pandas as pd

# Reading in the log file
git_log = pd.read_csv('datasets/git_log.gz', sep='#', encoding='latin-1', compression='gzip', header=None, names=['timestamp', 'author'])

# Printing out the first 5 rows
print(git_log.head())
    timestamp          author
0  1502826583  Linus Torvalds
1  1501749089   Adrian Hunter
2  1501749088   Adrian Hunter
3  1501882480       Kees Cook
4  1497271395       Rob Clark

3. Getting an overview

The dataset contains the information about every single code contribution (a "commit") to the Linux kernel over the last 13 years. We'll first take a look at the number of authors and their commits to the repository.

# calculating number of commits
number_of_commits = len(git_log)

# calculating number of authors (nunique() ignores missing values)
number_of_authors = git_log['author'].nunique()

# printing out the results
print("%s authors committed %s code changes." % (number_of_authors, number_of_commits))
17385 authors committed 699071 code changes.

4. Finding the TOP 10 contributors

A few people have changed the Linux kernel especially often. To see if there are any bottlenecks, we take a look at the TOP 10 authors with the most commits.

# Identifying the top 10 authors by number of commits
top_10_authors = git_log.groupby('author').count().sort_values('timestamp', ascending=False).head(10)

# Listing contents of 'top_10_authors'
top_10_authors
                       timestamp
author
Linus Torvalds             23361
David S. Miller             9106
Mark Brown                  6802
Takashi Iwai                6209
Al Viro                     6006
H Hartley Sweeten           5938
Ingo Molnar                 5344
Mauro Carvalho Chehab       5204
Arnd Bergmann               4890
Greg Kroah-Hartman          4580
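
As an aside, the same kind of ranking can be obtained with value_counts, which counts and sorts in one step. The sketch below runs on a made-up toy log (the author names and counts are invented for illustration), not on the real dataset.

```python
import pandas as pd

# Toy stand-in for git_log; names and timestamps are made up for illustration.
toy_log = pd.DataFrame({
    'timestamp': [1, 2, 3, 4, 5, 6],
    'author': ['alice', 'bob', 'alice', 'carol', 'alice', 'bob'],
})

# value_counts() counts commits per author and sorts in descending order;
# head(2) keeps the two most active authors.
top_authors = toy_log['author'].value_counts().head(2)
print(top_authors)
```

Unlike the groupby variant, this returns a Series of counts rather than a DataFrame with a timestamp column.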

5. Wrangling the data

For our analysis, we want to visualize the contributions over time. For this, we use the information in the timestamp column to create a time series-based column.

git_log['timestamp'] = pd.to_datetime(git_log['timestamp'], unit='s')

# summarizing the converted timestamp column
git_log[['timestamp']].describe()
                  timestamp
count                699071
unique               668448
top     2008-09-04 05:30:19
freq                     99
first   1970-01-01 00:00:01
last    2037-04-25 08:08:26
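
To illustrate what unit='s' does, and how an implausible raw value turns into an implausible date, here is a small sketch on three hand-picked values. The value 1502826583 is taken from the head() output earlier. The values 1 and 2000000000 are assumptions chosen only to show an early and a future timestamp.

```python
import pandas as pd

# Seconds since the UNIX epoch: a near-zero value, a real commit time, a future value.
raw_seconds = pd.Series([1, 1502826583, 2000000000])

# unit='s' tells pandas that the integers are seconds (not nanoseconds, the default).
converted = pd.to_datetime(raw_seconds, unit='s')
print(converted)
```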

6. Treating wrong timestamps

As we can see from the results above, some contributors had their operating system's time incorrectly set when they committed to the repository. We'll clean up the timestamp column by dropping the rows with the incorrect timestamps.

# determining the first real commit timestamp
first_commit_timestamp = git_log[git_log['author'] == 'Linus Torvalds']['timestamp'].min()

# determining the last sensible commit timestamp
last_commit_timestamp = pd.to_datetime('today')

# filtering out wrong timestamps with a single combined boolean mask
corrected_log = git_log[
    (git_log['timestamp'] >= first_commit_timestamp) &
    (git_log['timestamp'] <= last_commit_timestamp)]

# summarizing the corrected timestamp column
corrected_log['timestamp'].describe()
count                  698569
unique                 667977
top       2008-09-04 05:30:19
freq                       99
first     2005-04-16 22:20:36
last      2017-10-03 12:57:00
Name: timestamp, dtype: object
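
The same range filter can also be written with Series.between, which builds one inclusive boolean mask. The sketch below uses a made-up four-row log and fixed bounds that are assumptions for the example. The notebook itself derives its bounds from the data.

```python
import pandas as pd

# Toy log with one timestamp far too early and one far too late.
toy_log = pd.DataFrame({
    'timestamp': pd.to_datetime(['1970-01-01 00:00:01', '2005-04-16 22:20:36',
                                 '2017-10-03 12:57:00', '2037-04-25 08:08:26']),
    'author': ['a', 'b', 'c', 'd'],
})

# Assumed bounds for this sketch only.
lower = pd.Timestamp('2005-04-16')
upper = pd.Timestamp('2017-12-31')

# between() is inclusive on both ends by default and yields one boolean mask.
in_range = toy_log['timestamp'].between(lower, upper)
cleaned = toy_log[in_range]
print(cleaned)
```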

7. Grouping commits per year

To find out how the development activity has increased over time, we'll group the commits by year and count them up.

# counting the commits per year ('AS' = year start; pandas 2.2+ prefers the alias 'YS')
commits_per_year = corrected_log.groupby(pd.Grouper(key='timestamp', freq='AS')).count()

# Listing the first rows
commits_per_year.head(5)
            author
timestamp
2005-01-01   16229
2006-01-01   29255
2007-01-01   33759
2008-01-01   48847
2009-01-01   52572
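
An alternative way to count per year, which avoids frequency aliases altogether, is to group on the calendar year extracted with the .dt accessor. This is a sketch on a made-up five-row log, not the notebook's own approach. The result is indexed by plain integer years instead of timestamps.

```python
import pandas as pd

# Toy log spanning two years; dates and authors are made up for illustration.
toy_log = pd.DataFrame({
    'timestamp': pd.to_datetime(['2005-05-01', '2005-11-20', '2006-02-14',
                                 '2006-07-04', '2006-12-31']),
    'author': ['a', 'b', 'a', 'c', 'b'],
})

# Group on the calendar year of each timestamp and count the commits per group.
per_year = toy_log.groupby(toy_log['timestamp'].dt.year)['author'].count()
print(per_year)
```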

8. Visualizing the history of Linux

Finally, we'll plot these counts to better see how the development effort on Linux has increased over the last few years.

%matplotlib inline

# plot the data
commits_per_year.plot(kind='line', title='Visual', legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x7fa586a039b0>

9. Conclusion

Thanks to the solid foundation and caretaking of Linus Torvalds, many other developers are now able to contribute to the Linux kernel as well. There is no decrease in development activity in sight!

# determining the year with the most commits
year_with_most_commits = commits_per_year[commits_per_year['author'] == max(commits_per_year['author'])]
year_with_most_commits.index[0]
Timestamp('2016-01-01 00:00:00', freq='AS-JAN')
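
A shorter way to look up the busiest year is idxmax, which returns the index label of the largest value directly. The sketch below uses only the first three yearly counts from the table in section 7, so its answer applies to that subset and not to the full dataset.

```python
import pandas as pd

# First three yearly counts copied from the grouped output in section 7.
counts = pd.Series(
    [16229, 29255, 33759],
    index=pd.to_datetime(['2005-01-01', '2006-01-01', '2007-01-01']),
    name='author',
)

# idxmax() returns the index label (here a Timestamp) of the maximum value.
busiest = counts.idxmax()
print(busiest.year)
```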