Excel Tip: "Group By, Concatenate"


 

tl;dr: Here are the two key Excel formulas to insert at the top of two new columns and fill down: =IF(A2=A1,D1&", "&C2,C2) and =IF(A2<>A3,"Last","")
CONTEXT: Here is an interesting data cleaning problem. At my workplace, people enter reports into a system. Although the user sees one big box to type their report, the information is stored in separate, sequential lines. When I exported the csv file, I had a series of reports where each individual report was broken up into multiple lines. This reminds me of computing a "group by, sum" query in SQL, but instead of summing, we are concatenating text.

Consider a two-column table where column A has the reference / report ids. In the classic case, column B holds numerical values and you want to summarize them by column A's report ids, e.g. take the sum of column B, grouping by column A ("group by, summarise, sum"). In our case, column B is not numerical but a string, so we want to "group by, summarise, and concatenate" the text. We will show two examples: one in Excel and a faster one using an R script.
KEY QUESTION: Given a two-column list (where one column is the report id and another column is the text), how can I concatenate the text together, grouping them by report id? (A third column may be needed to index and sort the lines if the list is not already sorted.)

DESIRED OUTPUT: A two-column table with the concatenated text

LEVEL OF DIFFICULTY: Easy

ESTIMATED EXCEL AND SCRIPT TIME: less than 4 hours

CATEGORY OF ANALYSIS: Excel tip, data cleaning

PREREQUISITE KNOWLEDGE: basic manipulation in R, Excel formulas and functions

TOOLS AND LIBRARIES: R/R Studio, Excel

VALUE ADDED: A handy function to help concatenate that text and clean up your file!

METHODOLOGY: First, one new column is created, concatenating the current row's text onto the previous row's running concatenation (per report id). Then, a second column is created, again based on report id, that marks whether the row is the last of that report id. If the row is indeed the last of its report id grouping (remember, the report ids are sorted!), then its entry in the concatenated-text column is the concatenation of all the relevant rows. Finally, the file is filtered for the marker. The R script mirrors the logic and steps of the Excel file.

EXCEL EXAMPLE:

First, we have our text:

[Figure: 2_excelfile_first.png]

And the report is separated into multiple lines. Boo-urns!

 

We introduce two columns. At the top of each, we insert the following formulas and fill them down (the first formula builds on column D, so it goes in D2; the second goes in E2): =IF(A2=A1,D1&", "&C2,C2) and =IF(A2<>A3,"Last","")

 

[Figure: 3_Excefile_second.png]

 

And

[Figure: 4_excel_third.png]

 

For each row that carries the "Last" marker, the concatenated text is complete for that reference id. So we filter the second formula's column for the "Last" marker to get the rows of interest.

 

R EXAMPLE

In R, here is our example:

[Figure: 1_R_script_group_by_concatenate.png]

 

rm(list = ls())
library(dplyr)

df <- read.csv("/Users/patrickroncal/Desktop/Group_By.csv")

## re-name your columns
df$col_a <- df$Reference_ID  ## change this to the main identifying file number
df$col_b <- df$Line_Number   ## optional: the sequential line number within each file
df$col_c <- df$Text          ## the actual words

## function
group_by_concatenate <- function(df) {
  df$col_d <- ""
  df$col_e <- ""
  df$col_c <- as.character(df$col_c)
  ## col_d: running concatenation of col_c within each report id
  df[1, "col_d"] <- df[1, "col_c"]
  for (j in 2:nrow(df)) {
    i <- j - 1
    df[j, "col_d"] <- ifelse(df[j, "col_a"] == df[i, "col_a"],
                             paste0(df[i, "col_d"], " ", df[j, "col_c"]),
                             df[j, "col_c"])
  }
  ## col_e: mark the last row of each report id
  df[nrow(df), "col_e"] <- "Last"
  for (j in 2:nrow(df)) {
    i <- j - 1
    if (df[i, "col_a"] != df[j, "col_a"]) df[i, "col_e"] <- "Last"
  }
  ## keep only the completed concatenations
  df <- filter(df, col_e == "Last")
  df$Full_Comment <- df$col_d
  df <- subset(df, select = c("col_a", "Full_Comment"))
  return(df)
}

## run function
df_cleaned <- group_by_concatenate(df)
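As an aside, the same output can be produced without the row-by-row loops. Here is a minimal dplyr sketch, assuming the same Reference_ID / Line_Number / Text columns as above (the toy data below is made up for illustration):

```r
library(dplyr)

## toy data in the same shape as the exported csv
df <- data.frame(Reference_ID = c(1, 1, 2, 2, 2),
                 Line_Number  = c(1, 2, 1, 2, 3),
                 Text = c("first line", "second line", "alpha", "beta", "gamma"),
                 stringsAsFactors = FALSE)

## sort, group by report id, and collapse the text into one string per id
df_cleaned <- df %>%
  arrange(Reference_ID, Line_Number) %>%
  group_by(Reference_ID) %>%
  summarise(Full_Comment = paste(Text, collapse = " "))
```

The arrange step matters: paste(collapse) concatenates rows in whatever order they arrive, so the lines must be sorted within each report id first.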

 

What do you think? Do you find this handy? Are there other ways to achieve the same output?

 

 

Let’s do some graph theory! (part 1)


 

** For those not familiar with graph theory, please note that the term "graph" does not refer to bar graphs, line graphs, etc. A graph is a specific mathematical object made up of nodes/vertices and edges. You can visualize a graph a lot like a network. **

KEY QUESTION: Given a two-column list representing pairs of vertices of an edge, or given an adjacency matrix, how can I visualize them as a (network) graph?

DESIRED OUTPUT: A graph

LEVEL OF DIFFICULTY: Easy

ESTIMATED SCRIPT TIME: less than 4 hours

CATEGORY OF ANALYSIS: Graph theory, social network analysis, data visualization

PREREQUISITE KNOWLEDGE: Graph theory, basic manipulation in R

TOOLS AND LIBRARIES: R/R Studio, igraph

CONTEXT AND DESCRIPTION: You have either a two-column list representing pairs of vertices of an edge or an adjacency matrix, which usually represents some sort of (network) graph.

Maybe a vertex represents a person and an edge represents a relationship, i.e. a social network. Maybe you want to organize the graph as a cycle or as a tree. Maybe you want to perform calculations like betweenness centrality or degree centrality.

Before doing any of that, we will first plot the graph and do some basic visual manipulations. In this example, we will start out with writing our own small network in the R console.

VALUE ADDED: As with most of my introductory examples, this particular script and this particular data set are both very simple, and do not lend themselves to a lot of analysis. This is however the foundational script and building block for more complex analyses to be explored later on. Coupled with other data sets and additional functions, a lot more powerful and insightful analyses can be made later on.

In general, a social network analysis offers many benefits. Many technical and non-technical people can interpret and digest the information in the graph visualization very easily. Most importantly, the vertices in a graph illustrate which players are central or significant. We can even change the size of the nodes to further emphasize which vertex is more significant.

METHODOLOGY: The key idea is to load your dataset and then convert the edges or the adjacency matrix into a graph object. From there, many functions can be applied on your graph object. Again, in this example, we will start out with writing our own small network in the R console.

Here is the script:

 

Figure 1: igraph R script

[Image: igraph script]
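The full script is shown in the image above. As a rough sketch of the approach (the vertex names and edges below are made up for illustration), building and plotting a small directed graph with igraph looks like this:

```r
library(igraph)

## hypothetical edge list: each row is one edge (from -> to)
edges <- data.frame(from = c("A", "A", "B", "C"),
                    to   = c("B", "C", "C", "A"))

## convert the edge list into a graph object;
## try directed = FALSE to see the arrowheads disappear
g <- graph_from_data_frame(edges, directed = TRUE)

## an adjacency matrix works too, via graph_from_adjacency_matrix()

plot(g)
```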

 

And voila! Here is one of the outputted graphs:

[Image: graph output]

 

Notice the direction of the edges and compare that to your script. Notice how the edges change when you set "directed" to FALSE.

This is the first post for our igraph series. Later, we will load in more complicated files and begin to use more of the manipulation functions and calculation functions.

Let’s draw some polygons!


KEY QUESTION: Given an Esri Shapefile, how can I visualize the coordinates as polygons?

DESIRED OUTPUT: A map of polygons

CATEGORY OF ANALYSIS: Geospatial analysis, data visualization

LEVEL OF DIFFICULTY: Easy

ESTIMATED SCRIPT TIME: less than 4 hours

DATA SOURCE: City of Toronto, City Wards, May 2010 (MTM 3 Degree Zone 10, NAD27)

CONTEXT AND DESCRIPTION: You have an Esri shapefile, which usually represents boundaries of some sort. Maybe later you want to see which points lie inside or outside of the polygons. Maybe later you want to make a choropleth / heat map.

Before doing any of that, we need to first plot the boundaries. In this example, we use the City of Toronto’s open dataset of city wards. We plot the coordinates as polygons to see what the boundaries look like.

VALUE ADDED: This particular script and this particular data set are both very simple, and do not lend themselves to a lot of analysis. This is however the foundational script and building block for more complex analyses to be explored later on. Coupled with other data sets and additional functions, a lot more powerful and insightful analyses can be made later on.

In general, geospatial analysis offers many benefits, including a basic plot of boundaries. Many technical and non-technical people can interpret and digest the information on a map very easily. Using polygons can provide many insights. Maybe you have a boundary polygon and a list of addresses, and you need to see which addresses lie inside or outside of the boundary of interest. If you make a choropleth/heat map, you can see the distribution and variation across your polygons in a very colourful manner.

METHODOLOGY: Using R, I programmed a script using the ggplot2 and rgdal libraries. The key ideas are to transform the shapefile into a usable dataframe and then use the ggplot and geom_polygon functions to visualize the boundaries.

Here is the script:

Figure 1: geom_polygon R script

[Image: R script polygon]
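The full script is in the image above. A minimal sketch of the idea follows; note that the dsn path and layer name here are placeholders, not the actual names from the City of Toronto download:

```r
library(rgdal)
library(ggplot2)

## read the shapefile (dsn and layer below are placeholders)
wards <- readOGR(dsn = "path/to/shapefile_folder", layer = "city_wards")

## flatten the spatial object into a plain dataframe that ggplot understands
wards_df <- fortify(wards)

## draw one polygon per ward boundary; group keeps each ring separate
ggplot(wards_df, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_equal()
```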

And voila! Here is one of the outputted maps:

Figure 2: Map of Toronto’s Ward Boundaries

[Image: ward boundary map]

Again, this is a very basic example of geospatial analysis: it plots boundary lines on a map. This R library package offers many features. You can play around with colours, dimensions, shapes, transparencies, legends, etc. Heat maps or "choropleth maps" are also very interesting to see as they show gradients of concentrations.

What are your thoughts? Do you find business value in this kind of visualization? Is the R script easy-to-understand? What can be improved? Leave a comment!

 

SOURCES: Here is a list of sources I used for help and inspiration. They’re great for more detail and additional reading.

Let’s plot some points on maps!


KEY QUESTION: Given a list of addresses, how can I visualize them as points on a map?

DESIRED OUTPUT: A map of the addresses

CATEGORY OF ANALYSIS: Geospatial analysis, data visualization

LEVEL OF DIFFICULTY: Easy

ESTIMATED SCRIPT TIME: less than 4 hours

DATA SOURCE: City of Toronto, Cultural Spaces

 

CONTEXT AND DESCRIPTION: You have a list of addresses. Maybe it is a list of establishments in a city. Maybe it is a list of customers and their addresses. Maybe each address represents an event (e.g. a crime incident) that occurred there. Now you want to see them on a map. Maybe you want to see if there are any patterns. Which areas have high or low concentrations of points? Are the points evenly distributed, sparsely distributed, or clustered? If clustered, what is the shape of the clusters: are the points grouped together, or do they fall along lines?

In this example, we use the City of Toronto’s open dataset of cultural spaces in the city. We plot each address as a point on the map to visualize the distribution of points throughout the city.

VALUE ADDED: Geospatial analysis offers many benefits, including a basic plot of points. Many technical and non-technical people can interpret and digest the information on a map very easily. Patterns in the distribution and clustering of points can provide many insights. If you understand the distribution and clustering of points, you can use this information to answer many questions. Where are my customers coming from? How can I position my advertising in the optimal location? Which areas have too many buildings or not enough buildings? Which areas experience a lot of crime?

METHODOLOGY: Using R/R Studio, I programmed a script using the classic ggmap and ggplot2 libraries. The key ideas are to use the geocode function to assign latitude and longitude values to each address, and then use the geom_point and ggmap functions to visualize the information in an aesthetically pleasing way.

Here is the script:

Figure 1: ggmap R script

[Image: R script]
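The full script is in the figure above. A minimal sketch of the approach follows; the two addresses below are made-up placeholders, whereas the real script reads the Cultural Spaces dataset:

```r
library(ggmap)
library(ggplot2)

## placeholder addresses standing in for the Cultural Spaces csv
addresses <- c("100 Queen St W, Toronto, ON", "317 Dundas St W, Toronto, ON")

## geocode() calls Google's geocoding service and returns lon/lat columns
## (subject to the daily query limit discussed in the EDIT below)
coords <- geocode(addresses)

## pull a basemap of Toronto and layer the points on top
toronto_map <- get_map(location = "Toronto, Ontario", zoom = 11)
ggmap(toronto_map) +
  geom_point(data = coords, aes(x = lon, y = lat), colour = "red", size = 3)
```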

And voila! Here is one of the outputted maps:

Figure 2: Map of Toronto’s Cultural Spaces

[Image: map of cultural spaces]

 

This is a very basic example of geospatial analysis: it plots points on a map. This R library package offers many features. You can play around with colours, dimensions, shapes, transparencies, legends, etc. Heat maps or "choropleth maps" are also very interesting to see as they show gradients of concentrations.

What are your thoughts? Do you find business value in this kind of visualization? Is the R script easy-to-understand? What can be improved? Leave a comment!

 

EDIT: 2017-04-13 – As pointed out by Stephan in the comments, Google has a limit on how much you are permitted to geocode each day. If you go over the limit, you will receive an "OVER_QUERY_LIMIT" error. I believe the threshold is 2,500 records. In the code, there is a "write.csv" line because it is prudent to save the records you have already geocoded. Since geocoding is very time-consuming and the amount you can run each day is finite, we should avoid duplicating work. There is no point in geocoding twice! Each day, you (or a group of your colleagues) can run the script to geocode large batches of records.
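One way to sketch that save-as-you-go idea is below. The file name and column names are assumptions for illustration, not the ones from my script:

```r
library(ggmap)

cache_file <- "geocoded.csv"  ## assumed cache file name
addresses <- c("100 Queen St W, Toronto, ON", "317 Dundas St W, Toronto, ON")

## load whatever was geocoded on previous days (if anything)
done <- if (file.exists(cache_file)) {
  read.csv(cache_file, stringsAsFactors = FALSE)
} else {
  data.frame(address = character(), lon = numeric(), lat = numeric())
}

## only spend today's quota on addresses we have not seen yet
todo <- setdiff(addresses, done$address)
if (length(todo) > 0) {
  new_coords <- geocode(todo)
  done <- rbind(done, data.frame(address = todo, new_coords))
  write.csv(done, cache_file, row.names = FALSE)
}
```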

The .xlsx file from the Toronto website was saved as a .csv before being read into R.
The other warnings arise because Google could not geocode those records.