Suppose you were given the task of collecting the server names of a website, say www.mysite.com, while generating minimum traffic. While browsing the site, you notice that its main page contains links to many of its services, which are located on different servers. The exercise requires Linux BASH text manipulation to extract all the server names from the website's main page.
Solution
First, download the website's main page to your machine by simply issuing the wget command.
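For example (a minimal sketch; the page is saved here as filename.html, the name used by the commands below):

wget http://www.mysite.com/ -O filename.html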
Let's extract the lines containing the string "href=", which indicates that a line contains an HTTP link.
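For example, run grep against the downloaded page:

grep "href=" filename.html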
This is still a mess. If we split each line using the "/" delimiter, the third field should contain our server name.
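For example, piping the previous output into cut (in a link such as http://www.mysite.com/path, splitting on "/" puts the server name in field 3):

grep "href=" filename.html | cut -d"/" -f3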
In this way we can extract server information from the website's main page. If you want further filtering, type the command:
grep "href=" filename.html | cut -d"/" -f3| grep mysite.com| sort -u
Here "|" is a filter in linux which filters the data as per your requirement.
Now we save the list of server names to a text file by issuing the command:
grep "href=" filename.html | cut -d"/" -f3 | grep mysite.com | sort -u > mysite.txt