Recently, I worked on a project that required some additional information that was a bit troublesome to obtain manually. So I decided to make use of one of the infamous skills for obtaining data: yeah, scraping!
So here is the webpage link.
I know, the web design may be a bit unpleasant to look at, lol. But there's a lot of valuable information there. It's just sad that there is no download button anywhere. Below is the front page of the site.
OK, now back to business. I need to get the following information:
> Kecamatan/Distrik (district name in Indonesia)
> Kode Pos (postal code)
> Kode Wilayah (area code based on the district name)
> Kota/Kabupaten (area type, either a city or county)
> Nama Kota/Kabupaten (city or county name)
> Nama Provinsi (province name; in the US it is similar to a 'state')
So only a handful of columns are needed. Here is the specific webpage I would like to scrape.
If it were only one page, I'd just copy the table, paste the values into Excel, then get rid of the columns I don't want to use. But the site can show at most 1000 rows per page, for a total of 8 pages. So yeah, there are 'still' several pages to go through, and that's why this technique will save a lot of patience when getting the data.
Now let's move to Python. I will import some libraries and send a request to the website.
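A minimal sketch of that setup, using requests for the HTTP call and BeautifulSoup for parsing (the URL below is just a placeholder; use the actual link shown above):

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Placeholder URL: substitute the real first-page link from the site
url = "http://example.com/kode-pos?page=1"

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
```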
Take a look for a pattern, so that we can find the 'key' that we will be able to iterate over. It can be almost any element. I'll explain this further below.
Right-click on the table in the web page, then choose Inspect Element. In this case, I am using <tr> with a specific background colour, set via the 'bgcolor' attribute. The colour is coded '#ccffff' (specifically, <tr bgcolor="#ccffff">). That is where all the data we need lies.
I save the result in the variable trs.
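In BeautifulSoup, that selection looks roughly like this:

```python
# Grab every table row whose background colour is '#ccffff',
# since those are the rows holding the actual data
trs = soup.find_all("tr", {"bgcolor": "#ccffff"})
```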
Now I want to test whether the 'key' I am using actually gets what I want. For the first experiment, I will obtain the data from the first page only. I store all the data in a list, where every row is stored as a dictionary.
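A sketch of that loop, assuming the <td> cells appear in the same order as the columns listed earlier (adjust the indices if the real table differs):

```python
rows = []
for tr in trs:
    # Each row's cells, as plain text
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # The indices below are an assumption; match them against the
    # actual column order of the table before relying on them
    rows.append({
        "Kecamatan/Distrik": cells[0],
        "Kode Pos": cells[1],
        "Kode Wilayah": cells[2],
        "Kota/Kabupaten": cells[3],
        "Nama Kota/Kabupaten": cells[4],
        "Nama Provinsi": cells[5],
    })
```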
Then display the result as a DataFrame, for a proper look.
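With pandas, that's just:

```python
df = pd.DataFrame(rows)
print(df.head())
```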
Here is how it looks.
Yippee!!
Now how can we get the data from all the pages?
Well, easy-peasy. I'll inspect the web page again to get the links to the other pages (this time, I only need the 2nd page through the 8th page). The links lie inside <a> elements with the specific attribute {'class':'tpage'}.
It will get the link elements just like below:
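Something like this, printing the raw <a class="tpage"> elements so we can see what we grabbed:

```python
# The page links live in <a> elements with class="tpage"
pages = soup.find_all("a", {"class": "tpage"})
for a in pages:
    print(a)
```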
Then we can extract just the links (stored in a list), and combine the previous first-page code with another for loop.
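A sketch of the combined loop, reusing the row-parsing code from the first page; it assumes the hrefs are absolute URLs (if they turn out to be relative, prepend the site's base URL):

```python
# Keep only the href of each page link
links = [a["href"] for a in pages]

for link in links:
    resp = requests.get(link)
    page_soup = BeautifulSoup(resp.text, "html.parser")
    # Same row pattern as on the first page
    for tr in page_soup.find_all("tr", {"bgcolor": "#ccffff"}):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append({
            "Kecamatan/Distrik": cells[0],
            "Kode Pos": cells[1],
            "Kode Wilayah": cells[2],
            "Kota/Kabupaten": cells[3],
            "Nama Kota/Kabupaten": cells[4],
            "Nama Provinsi": cells[5],
        })

# Rebuild the DataFrame with all pages included
df = pd.DataFrame(rows)
```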
Last but not least, export everything to a CSV file, and we're good to go!
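For example (the filename here is just my own pick):

```python
# Write all scraped rows to disk; 'kode_pos.csv' is an example name
df.to_csv("kode_pos.csv", index=False)
```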
Happy scraping! <3 <3