Selectorlib lets you use a YML styled file to specify the selectors for the elements or data that you need to extract from a website. You can use both CSS Selectors, XPaths or both. YML Structure ------------- Lets take a look at this fictional store that sells Pokemon - https://scrapeme.live/shop/ Lets extract Here is a sample YML that SelectorLib accepts as Input .. code:: yaml pokemon: css: li.product multiple: true type: Text children: name: css: h2.woocommerce-loop-product__title type: Text price: css: span.woocommerce-Price-amount type: Text image: css: img.attachment-woocommerce_thumbnail type: Attribute attribute: src url: css: a.woocommerce-LoopProduct-link type: Link Here ``pokemon`` is the main element and the elements - name, price, image and url are inside it and are called the children of the pokemon element. Every element starts with its name and can have these properties - css - xpath - type - children - formatter css (default: Blank) ~~~~~~~~~~~~~~~~~~~~ The css selector for the element. In our example the element called pokemon is in an li with a class product. So its ``li.product``. xpath (default: Blank) ~~~~~~~~~~~~~~~~~~~~~~ The xpath selector for the element. If we were to use xpaths instead of css selectors for the element pokemon above. It would be ``//li[contains(@class,'pokemon')]``. Every element needs either css or xpath selectors. Every element needs either css or xpath selectors. If both xpath and css are defined, xpath takes preference. type (default: Text) ~~~~~~~~~~~~~~~~~~~~ The type defines what kind of extraction needs to happen on the selected element. Here are accepted types Text ^^^^ This type of extraction just extracts all the text content from the selected elements. If you have not specifed a type, Text would be used as default. Attribute ^^^^^^^^^ This type of extraction lets you extract a particular attribute, specified using the ``attribute`` property for the element. This is not usually required when you are selecting using xpaths as you define that easily in an expression as compared to css selectors. eg. ``//img[@src]`` Here is an example that extracts the src attribute of an img element .. code:: yaml image: css: img.attachment-woocommerce_thumbnail type: Attribute attribute: src Link ^^^^ This type is a shortcut for getting the href attribute from any links in the html defined using an ```` tag Example, .. code:: yaml url: css: a.woocommerce-LoopProduct-link type: Link HTML ^^^^ HTML type, just gives you the full HTML content of the element. This is useful when you need the html as is for some custom extraction or checking a few conditions. multiple (default: False) ~~~~~~~~~~~~~~~~~~~~~~~~~ If you need multiple matches on the selector of an element use multiple as true. If you only need to get the first match, use multiple as false or leave it blank. For example, the element pokemon has multiple matches on the same page, so we have set multiple:true in it to get all of them. children (default: Blank) ~~~~~~~~~~~~~~~~~~~~~~~~~ An element can have multiple child elements. In the example above the parent element ``pokemon`` has these "children" - ``name``,\ ``price``,\ ``image``,\ ``url``. Each child element could also more children and can be nested. If an element has children, it's ``type`` property is ignored. format ~~~~~~ You can define custom formatters, and can be used for minor transformations on the extracted data. In Python, these formatters are defined as :: from selectorlib.formatter import Formatter class Price(Formatter): def format(self, text): return text.replace('\\n','').strip() Used in the YAML as .. code:: yaml price: css: span.woocommerce-Price-amount type: Text format: Price And passed to the Extractor while its initialized .. code:: python formatters = Formatter.get_all() Extractor.from_yaml_file('a.yaml', formatters=formatters) Python Example -------------- ``scrapeme_listing_page.yml`` .. code:: yaml pokemon: css: li.product multiple: true type: Text children: name: css: h2.woocommerce-loop-product__title type: Text price: css: span.woocommerce-Price-amount type: Text image: css: img.attachment-woocommerce_thumbnail type: Attribute attribute: src url: css: a.woocommerce-LoopProduct-link type: Link ``extract.py`` .. code:: python import requests from selectorlib import Extractor, Formatter from pprint import pprint import re # Define a formatter for Price class Price(Formatter): def format(self, text): price = re.findall(r'\d+\.\d+',text) if price: return price[0] return None formatters = Formatter.get_all() extractor = Extractor.from_yaml_file('./scrapeme_listing_page.yml',formatters=formatters) #Download the HTML and use Extractor r = requests.get('https://scrapeme.live/shop/') data = extractor.extract(r.text) pprint(data) :: >>> python extract.py :: {'pokemon': [{'image': 'https://scrapeme.live/wp-content/uploads/2018/08/001-350x350.png', 'name': 'Bulbasaur', 'price': '63.00', 'url': 'https://scrapeme.live/shop/Bulbasaur/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/002-350x350.png', 'name': 'Ivysaur', 'price': '87.00', 'url': 'https://scrapeme.live/shop/Ivysaur/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/003-350x350.png', 'name': 'Venusaur', 'price': '105.00', 'url': 'https://scrapeme.live/shop/Venusaur/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/004-350x350.png', 'name': 'Charmander', 'price': '48.00', 'url': 'https://scrapeme.live/shop/Charmander/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/005-350x350.png', 'name': 'Charmeleon', 'price': '165.00', 'url': 'https://scrapeme.live/shop/Charmeleon/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/006-350x350.png', 'name': 'Charizard', 'price': '156.00', 'url': 'https://scrapeme.live/shop/Charizard/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/007-350x350.png', 'name': 'Squirtle', 'price': '130.00', 'url': 'https://scrapeme.live/shop/Squirtle/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/008-350x350.png', 'name': 'Wartortle', 'price': '123.00', 'url': 'https://scrapeme.live/shop/Wartortle/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/009-350x350.png', 'name': 'Blastoise', 'price': '76.00', 'url': 'https://scrapeme.live/shop/Blastoise/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/010-350x350.png', 'name': 'Caterpie', 'price': '73.00', 'url': 'https://scrapeme.live/shop/Caterpie/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/011-350x350.png', 'name': 'Metapod', 'price': '148.00', 'url': 'https://scrapeme.live/shop/Kakuna/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/015-350x350.png', 'name': 'Beedrill', 'price': '168.00', 'url': 'https://scrapeme.live/shop/Beedrill/'}, {'image': 'https://scrapeme.live/wp-content/uploads/2018/08/016-350x350.png', 'name': 'Pidgey', 'price': '159.00', 'url': 'https://scrapeme.live/shop/Pidgey/'}]}