Goal
This is an example to explore how users can take IP addresses from their weblogs and identify the GeoLocation from where those IPs originate.
Learn
Download sample data:
- Visit the website MaxMind to download the file GeoLite City (CSV / zip)
- Extract the .zip which will contain two .csv files.
- GeoCityBlocks.csv - These are blocks of IP addresses which included location ID.
- GeoCityLocations.csv - These matched latitude and longitude with the location ID. - Next, download the Weblogs.log file here.
This product includes GeoLite data created by MaxMind, available from
http://www.maxmind.com
Load the example file into Datameer
Perform file uploads into your Datameer instance so you can work with the demo data.
- Upload the GeoCityBlocks and GeoCityLocations using a CSV file type.
- Upload the Weblogs.log using Apache Log as the file type.
- The parse pattern is: %h %l %u %t "%r" %>s %b
Open a new Workbook after the three files have been uploaded. Select all three files and click Add Data.
If you want to learn more about uploading data into Datameer, check out our documentation on Uploading Files.
If you want to lean more about how to open data into workbooks, check out our documentation on Workbooks.
Start the example
Join GeoCityBlocks and GeoCityLocations based on locId.
Duplicate the joined sheet in order to clean up some of the data.
Click on the next available empty column to bring up the formula builder. Use the TRUNC formula on GeoCityBlocks.startIpNum.
As viewed in Fx bar: =TRUNC(#GeoCityBlocks.startIpNum) |
The next step is to convert the IP to a numerical representation. Click in the Fx bar to edit the column that was just created and divide it by 256*256.
As viewed in Fx bar =TRUNC(GeoCityBlocks.startIpNum/(256*256)) |
Repeat the process of using the TRUNC function and modifying the formula with GeoCityBlocks.endIpNum
As viewed in Fx bar =TRUNC(GeoCityBlocks.endIpNum/(256*256)) |
How and why to convert an IP to a numerical representation? For example, using the following IP: 10.10.10.10 There are 4 numbers or buckets divided by the decimal place. Each number can range from 0-255. This means that matching an IP to the current data will yield very large data sets. The method to simplify this is to create buckets by dividing the Start and End IP with 256*256. This creates a smaller range for the IPs to match the data based on the first 2 numbers in an IP. |
Now find all possible IP addresses within the start to end range. Click on the next column to bring up the formula builder and select the function EXPLODE_RANGE.
Use the two columns that were just created for the start (GeoCityBlocks_startIpNum) and end (GeoCityBlocks_endIpNum) arguments.
As viewed in Fx bar =EXPLODE_RANGE(#GeoCityBlocks_startIpNum;#GeoCityBlocks_endIpNum) |
Right click the column just created and rename it to IPBucket
Right click sheet1 at the bottom of the page and rename it to GeoJoinEdit
Create a new sheet and click in the first empty column to bring up the formula builder. Select the function IP_TO_NUMBER. The argument used will come from the Weblogs page in the workbook using the remoteHost column. This converts the remote host to a number that will be able to be joined with the GeoCity data.
As viewed in Fx bar =IP_TO_NUMBER(#Weblogs!remoteHost) |
Name this column IPNumber.
Click on the next empty column to bring up the formula builder. Select the TRUNC function and the column IPNumber that was just created as the argument.
As viewed in Fx bar =TRUNC(#IPNumber) |
Now modify the function in the same was as the previous TRUNC functions:
As viewed in Fx bar: =TRUNC(#IPNumber/(256*256)) |
Name this column IPBucket.
Name this sheet WeblogsEdit.
Create a joined sheet between WeblogsEdit and GeoJoinEdit based on IPBucket.
The new joined sheet contains IP number, start and end of bucket, and location.
Name this sheet GeoWeblogJoin.
The last step is to find exact matches for all the locations.
Ensure the page name being worked on is GeoWeblogJoin. Click the filter tool from the tool bar (or select it from the menu bar).
Check the box at the top to have the filtered results displayed on a new sheet.
The first filter is:
WeblogsEdit.IPNumber is greater equals to the GeoJoinEdit.startIPNum column.
Click the plus sign to add a filter.
The second filter is:
WeblogsEdit.IPNumber is less equals to the GeoJoinEdit.endIPNum column.
Click Create Filter at the bottom to finish.
Save the workbook and process it to view the final results.
It is possible to continue to analyze this data by grouping by location and then doing a count to find where the most traffic is coming originating. Try it out!
“Got a question? Have an answer? Join the Datameer Community!”
Comments
0 comments
Please sign in to leave a comment.