Wednesday, August 27, 2014

Visualizing electricity prices with Plotly

We have already mentioned plotly many times (here are other two posts about it) and this time we'll see how to use it in order to build an interactive visualization of the latest data about the domestic electricity prices provided by International Energy Agency (IEA).

In the chart that we are going to make, we will show the prices of the domestic electricity among the countries monitored by IEA in 2013 with a bar chart where each bar shows the electricity price and the fraction of the price represented by the taxes.

First, we import the data (the full data is available here, in this post we'll use only the Table 5.5.1 in cvs format) using pandas:
import pandas as pd
ieaprices = pd.read_csv('iea_prices.csv',
                        na_values=('..','+','-','+/-'))
ieaprices = ieaprices.dropna()
ieaprices.set_index(['Country'],inplace=True)
countries = ieaprices.sort('2013_with_tax').index
Then, we arrange the data in order create a plotly bar chart:
from plotly.graph_objs import Bar,Data,Layout,Figure
from plotly.graph_objs import XAxis,YAxis,Marker,Scatter,Legend

prices_bars = []

# computing the taxes
taxes = ieaprices['2013_with_tax']-ieaprices['2013_no_tax']

# adding the prices to the chart
prices_bars.append(Bar(x=countries.values, 
             y=ieaprices['2013_no_tax'].ix[countries].values,
             marker=Marker(color='#0074D9'),
             name='price without taxes'))

# adding the taxes to the chart
prices_bars.append(Bar(x=countries.values, 
             y=taxes.ix[countries].values,
             marker=Marker(color='#0099D9'),name='taxes'))
And now we are ready to submit the data to the plotly server to render the chart:
import plotly.plotly as py

py.sign_in("SexyUser", "asexykeyforasexyuser")

meadian_line = Scatter(
    x=countries.values,
    y=np.ones(len(countries))*ieaprices['2013_with_tax'].median(),
    marker=Marker(color='rgb(40, 40, 40)'),
    opacity=0.5,
    mode='lines',
    name='Median')

data = Data(prices_bars+[meadian_line])

layout = Layout(
    title='Domestic electricity prices in the IEA in 2013',
    xaxis=XAxis(type='category'),
    yaxis=YAxis(title='Price (Pence per Kwh)'),
    legend=Legend(x=0.0,y=1.0),
    barmode='stack',
    hovermode='closest')

fig = Figure(data=data, layout=layout)

# this line will work only in ipython
# use py.plot() in other environments
plot_url = py.iplot(fig, filename='ieaprices2013') 
The result should look like this:

Looking at the chart we note that, during 2013, the average domestic electricity prices, including taxes, in Denmark and Germany were the highest in the IEA. We also note that in Denmark the fraction of taxes paid is higher than the actual electricity price whereas in Germany the actual electricity price and the taxes are almost the same. Interestingly, USA has the lowest price and the lowest taxation.

This post shows how to create one of the charts commented here, where a more insights about the IEA data are provided.

Wednesday, August 20, 2014

Quick HDF5 with Pandas

HDF5 is a format designed to store large numerical arrays of homogenous type. It cames particularly handy when you need to organize your data models in a hierarchical fashion and you also need a fast way to retrieve the data. Pandas implements a quick and intuitive interface for this format and in this post will shortly introduce how it works.

We can create a HDF5 file using the HDFStore class provided by Pandas:
import numpy as np
from pandas import HDFStore,DataFrame
# create (or open) an hdf5 file and opens in append mode
hdf = HDFStore('storage.h5')
Now we can store a dataset into the file we just created:
df = DataFrame(np.random.rand(5,3), columns=('A','B','C'))
# put the dataset in the storage
hdf.put('d1', df, format='table', data_columns=True)
The structure used to represent the hdf file in Python is a dictionary and we can access to our data using the name of the dataset as key:
print hdf['d1'].shape
(5, 3)
The data in the storage can be manipulated. For example, we can append new data to the dataset we just created:
hdf.append('d1', DataFrame(np.random.rand(5,3), 
           columns=('A','B','C')), 
           format='table', data_columns=True)
hdf.close() # closes the file
There are many ways to open a hdf5 storage, we could use again the constructor of the class HDFStorage, but the function read_hdf makes us also able to query the data:
from pandas import read_hdf
# this query selects the columns A and B
# where the values of A is greather than 0.5
hdf = read_hdf('storage.h5', 'd1',
               where=['A>.5'], columns=['A','B'])
At this point, we have a storage which contains a single dataset. The structure of the storage can be organized using groups. In the following example we add three different datasets to the hdf5 file, two in the same group and another one in a different one:
hdf = HDFStore('storage.h5')
hdf.put('tables/t1', DataFrame(np.random.rand(20,5)))
hdf.put('tables/t2', DataFrame(np.random.rand(10,3)))
hdf.put('new_tables/t1', DataFrame(np.random.rand(15,2)))
Our hdf5 storage now looks like this:
print hdf

File path: storage.h5
/d1             frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[A,B,C])
/new_tables/t1  frame        (shape->[15,2])                                                   
/tables/t1      frame        (shape->[20,5])                                                   
/tables/t2      frame        (shape->[10,3])  
On the left we can see the hierarchy of the groups added to the storage, in the middle we have the type of dataset and on the right there is the list of attributes attached to the dataset. Attributes are pieces of metadata you can stick on objects in the file and the attributes we see here are automatically created by Pandas in order to describe the information required to recover the data from the hdf5 storage system.

Friday, May 23, 2014

Code parallelization with joblib

Recently I've been working on the parallelization of some Python code and I discovered Joblib. It is a library that supports pipelining and offers a good support for parallelization. In this post we will implement a (very naive) paraller matrix by matrix multiplication algorithm to show the parallelization capabilities of this library.
from joblib import Parallel, delayed

def parallel_dot(A,B,n_jobs=2):
    """
     Computes A x B using more CPUs.
     This works only when the number 
     of rows of A and the n_jobs are even.
    """
    parallelizer = Parallel(n_jobs=n_jobs)
    # this iterator returns the functions to execute for each task
    tasks_iterator = ( delayed(np.dot)(A_block,B) 
                      for A_block in np.split(A,n_jobs) )
    result = parallelizer( tasks_iterator )
    # merging the output of the jobs
    return np.vstack(result)
This function spreads the computation across more precesses. The strategy applied to distribute the data is very simple. Each process has the full matrix B and a contiguous block of rows of A, so it can compute a block of rows A*B. In the end, the result of each process is stacked to build final matrix.

Let's compare the parallel version of the algorithm with the sequential one:
A = np.random.randint(0,high=10,size=(1000,1000))
B = np.random.randint(0,high=10,size=(1000,1000))
%time _ = np.dot(A,B)
CPU times: user 13.2 s, sys: 36 ms, total: 13.2 s
Wall time: 13.4 s
%time _ = parallel_dot(A,B,n_jobs=2)
CPU times: user 92 ms, sys: 76 ms, total: 168 ms
Wall time: 8.49 s
Wow, we had a speedup of 1.6X, not bad for a so naive algorithm. It's important to notice that the arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process. Which means that the last time that parallel_dot have been called, the matrix B have been entirely replicated two times in memory. To avoid this problem, we can dump the matrices on the filesystem and pass a reference to the worker to open them as memory map.
import tempfile
import os
from joblib import load, dump

# saving A and B to a local file for memmapping
temp_folder = tempfile.mkdtemp()
filenameA = os.path.join(temp_folder, 'A.mmap')
dump(A, filenameA)
filenameB = os.path.join(temp_folder, 'B.mmap')
dump(A, filenameB)
Now, when parallel_dot(A_memmap,B_memmap,n_jobs=2) is called, both the processes created will use only a reference to the matrix B..