A programming & designing blog!

Monday, September 30, 2013

Extract urls from a web page with PHP

Hello there! In this post I am going to show you how to extract all URLs from a web page with PHP. For this purpose I am using PHP's DOMDocument class. It is an awesome tool for working with DOM elements. If you are familiar with JavaScript, you will find many similarities between the two.
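Before fetching a real page, here is a minimal sketch of the DOMDocument calls involved, run against a small in-memory HTML string. The `$html` markup and the `$hrefs` variable name are made up for illustration, and `libxml_use_internal_errors()` is just one way to keep parser warnings quiet. Note how `getElementsByTagName()` and `getAttribute()` mirror the DOM methods of the same name in JavaScript.

```php
<?php
// Hypothetical sample markup, used here instead of a live request.
$html = '<html><body>'
      . '<a href="http://example.com/">Home</a>'
      . '<p>No link here</p>'
      . '<a href="/about">About</a>'
      . '</body></html>';

$dom = new DOMDocument;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Gather the raw href value of every <a> element, in document order.
$hrefs = array();
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

print_r($hrefs);
```

This sketch does no filtering, so relative links such as `/about` are collected as well.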

Let's see how we can extract URLs from a web page with PHP.

// get data from url
$data = file_get_contents('http://wwww.w3bees.com');

$dom = new DOMDocument;
// load html; the @ suppresses warnings that malformed markup would raise
@$dom->loadHTML($data);

// find all a tags
$anchors = $dom->getElementsByTagName('a');

$url = array();

foreach ($anchors as $anchor) {
  // get href value
  $href = $anchor->getAttribute('href');
  // skip any href that does not look like a valid url
  if (!preg_match("/^(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/", $href) ) continue;
  $url[] = $href;