Suppose you're attempting to scrape a slab of HTML that looks a bit like this:
<tr class="oddRow">
  <td> <a href="/ukpga/2018/21/contents/enacted">Domestic Gas and Electricity (Tariff Cap) Act 2018</a> </td>
  <td> <a href="/ukpga/2018/21/contents/enacted">2018 c. 21</a> </td>
  <td>UK Public General Acts</td>
</tr>
<tr>
  <td> <a href="/ukpga/2018/20/contents/enacted">Northern Ireland Budget Act 2018</a> </td>
  <td> <a href="/ukpga/2018/20/contents/enacted">2018 c. 20</a> </td>
The bit you're looking to scrape is contained in the <a> tag that sits as a child of the <td> tag, i.e. Northern Ireland Budget Act 2018.
Now, for all you know, there are going to be <a> elements all over the page, many of which you have no interest in. Because of this, something like stuff = soup.find_all('a') is no good.
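To see the problem concretely, here is a minimal sketch. The page content is made up for illustration (the "Help" link and the surrounding <div> are assumptions, not part of the real page), and it uses the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical page: one stray link outside the table, one link inside a <td>
html = """
<div><a href="/help">Help</a></div>
<table>
  <tr><td><a href="/ukpga/2018/20/contents/enacted">Northern Ireland Budget Act 2018</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all('a') returns every link in the document, wanted or not
stuff = soup.find_all("a")
all_link_text = [a.text for a in stuff]
print(all_link_text)  # ['Help', 'Northern Ireland Budget Act 2018']
```

The stray "Help" link comes back alongside the one you actually care about.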
What you really need to do is limit your scrape to only those <a> tags that have a <td> tag as their parent.
Here's how you do it:
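# Assumes soup is an existing BeautifulSoup object for the page
td = soup.find_all('td')  # Find all the td elements on the page
what_i_want = []
for i in td:
    # Find the <a> tags inside each td (find_all is the current name for findChildren);
    # recursive=True matches at any depth, recursive=False only direct children
    children = i.find_all("a", recursive=True)
    # Iterate over the children, reading the .text attribute of each one
    for child in children:
        what_i_want.append(child.text)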
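If you want to try this end to end, the following is a self-contained sketch rather than a definitive version. It assumes the sample markup from the top of the post wrapped in a <table>, plus a hypothetical stray "Help" link to show what gets filtered out. It differs from the snippet above in two ways: it passes recursive=False so that only direct children of each <td> are matched, and it shows the CSS child selector td > a via soup.select() as a possible alternative.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippet above; the "Help" link is a made-up extra
html = """
<p><a href="/help">Help</a></p>
<table>
  <tr class="oddRow">
    <td><a href="/ukpga/2018/21/contents/enacted">Domestic Gas and Electricity (Tariff Cap) Act 2018</a></td>
    <td><a href="/ukpga/2018/21/contents/enacted">2018 c. 21</a></td>
    <td>UK Public General Acts</td>
  </tr>
  <tr>
    <td><a href="/ukpga/2018/20/contents/enacted">Northern Ireland Budget Act 2018</a></td>
    <td><a href="/ukpga/2018/20/contents/enacted">2018 c. 20</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Walk each <td> and collect only the <a> tags that are its direct children
titles = []
for td in soup.find_all("td"):
    for a in td.find_all("a", recursive=False):
        titles.append(a.get_text(strip=True))

# Alternative: a CSS child selector that matches <a> elements whose parent is a <td>
titles_css = [a.get_text(strip=True) for a in soup.select("td > a")]

print(titles)
```

Both lists hold the link text from the table cells only; the "Help" link is skipped because its parent is a <p>, not a <td>.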