To save all text data from the internal links for each company, you need to:

1. Extract the text data from each link.
2. Store it in a separate file or in a structured output format that keeps clear which company each piece of data belongs to.
3. Handle possible variations in internal links and data structure on different pages.

Here's an approach that scrapes the text from each internal link and saves the collected information for each company in its own text file:

bash
# If you don't have Selenium installed, use:
# (inside a Jupyter/Colab notebook cell, prefix the command with "!")
pip install selenium
python
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


# Initialize the WebDriver with ChromeDriver
def setup_webdriver():
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run without GUI
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    # Set path to ChromeDriver
    service = Service("/path/to/chromedriver")
    return webdriver.Chrome(service=service, options=chrome_options)


# Function to get all internal links from a given URL
def get_internal_links(driver, base_url):
    driver.get(base_url)
    time.sleep(2)  # Allow time for the page to load

    links = driver.find_elements(By.TAG_NAME, "a")
    internal_links = []

    for link in links:
        href = link.get_attribute("href")
        # Keep only links whose href contains the base URL
        if href and base_url in href:
            internal_links.append(href)

    return list(set(internal_links))  # Ensure uniqueness


# Function to scrape text data from given links
def scrape_content(driver, links, output_file):
    # UTF-8 avoids encoding errors on pages with non-ASCII text
    with open(output_file, "w", encoding="utf-8") as file:
        for link in links:
            driver.get(link)
            time.sleep(2)  # Allow time for the page to load

            body = driver.find_element(By.TAG_NAME, "body")
            text_data = body.text

            file.write(f"Link: {link}\n")
            file.write(text_data)
            file.write("\n---\n")  # Separator between data from different links


# Main script
def main():
    # List of company URLs to scrape
    company_urls = [
        {"name": "Company1", "url": "https://example-company1.com"},
        {"name": "Company2", "url": "https://example-company2.com"},
        # Add more companies as needed
    ]

    # Initialize the WebDriver
    driver = setup_webdriver()

    try:
        for company in company_urls:
            company_name = company["name"]
            company_url = company["url"]

            # Get all internal links for the current company
            internal_links = get_internal_links(driver, company_url)

            # Save data to a unique file for each company
            output_file = f"{company_name}_scraped_data.txt"

            # Scrape content from internal links
            scrape_content(driver, internal_links, output_file)
    finally:
        driver.quit()  # Close WebDriver even if a page fails


# Run the main script
if __name__ == "__main__":
    main()
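
The `base_url in href` test in `get_internal_links` is a plain substring match, so it can also accept an external URL that merely embeds the base URL (for example in a query string), and it treats `page` and `page#section` as two different links. Below is a minimal sketch of a stricter check. It assumes that "internal" means "same hostname", which the script above does not state, and the helper name `filter_internal_links` is hypothetical.

```python
from urllib.parse import urldefrag, urlparse


def filter_internal_links(base_url, hrefs):
    """Return sorted, de-duplicated links that share base_url's hostname."""
    base_host = urlparse(base_url).netloc.lower()
    internal = set()
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href yield None
        url, _fragment = urldefrag(href)  # drop "#section" anchors
        parsed = urlparse(url)
        # Assumption: only http(s) links on the exact same host count as internal
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() == base_host:
            internal.add(url)
    return sorted(internal)


links = filter_internal_links(
    "https://example-company1.com",
    [
        "https://example-company1.com/about",
        "https://example-company1.com/about#team",
        "https://other.example/?ref=https://example-company1.com",
        "mailto:info@example-company1.com",
        None,
    ],
)
print(links)
```

With this exact-host comparison, a `www.` variant of the same site would count as external, so the rule may need loosening for sites that mix the two.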
Explanation
Company-Specific Data: The script saves the text data to a unique text file for each company. This structure makes it easier to manage the extracted information.
Extract Internal Links: It gathers all internal links for each company by checking if the base URL is contained within the href attribute.
Scrape Content from Links: For each internal link, it navigates to the page, retrieves the text from the body, and saves it to the appropriate text file.
Handling Variability: Company sites differ in structure, so adapt the script to make it more robust (e.g., wait for slow-loading pages and handle exceptions).
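
For slow-loading pages, Selenium's explicit waits are one alternative to the fixed `time.sleep(2)` pauses. The sketch below assumes that the presence of the `<body>` element is a good enough signal that a page is usable, and that 10 seconds is a reasonable timeout; both are assumptions, and the helper name `get_body_text` is hypothetical. Pages that render their content with JavaScript may need a more specific locator.

```python
def get_body_text(driver, url, timeout=10):
    """Load url and return the <body> text, or None if it never appears."""
    # Imported here so this snippet loads even where Selenium is not
    # installed; in a real script these imports belong at the top of the file.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    try:
        # Poll until <body> is present in the DOM or the timeout expires
        body = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
    except TimeoutException:
        return None  # Let the caller decide whether to skip or retry
    return body.text
```

In `scrape_content`, a call such as `text_data = get_body_text(driver, link)` could stand in for the `driver.get` / `time.sleep` / `find_element` sequence, skipping links for which it returns `None`.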
Considerations
Headless Mode: Running in headless mode means the browser window doesn't open, but Selenium still operates as if it did.
Error Handling: You may want to add try-except blocks around potentially error-prone operations to ensure robustness.
ChromeDriver Path: Ensure the path to ChromeDriver is correct before running the script.
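
The error handling mentioned above could look like the following sketch. It assumes a hypothetical `fetch` callable that takes a URL and returns the page text (with Selenium, this would wrap `driver.get` plus reading `body.text`). The retry count and delay are illustrative values, and catching the broad `Exception` is a simplification; a real script would more likely catch `selenium.common.exceptions.WebDriverException`.

```python
import time


def scrape_with_retries(fetch, links, retries=2, delay=1.0):
    """Call fetch(link) for each link; return (results, failed)."""
    results = {}
    failed = []
    for link in links:
        for attempt in range(retries + 1):
            try:
                results[link] = fetch(link)
                break  # Success, move on to the next link
            except Exception as exc:  # Simplification, see the note above
                print(f"Attempt {attempt + 1} failed for {link}: {exc}")
                if attempt < retries:
                    time.sleep(delay)
        else:
            # The inner loop never hit `break`: every attempt failed
            failed.append(link)
    return results, failed
```

Returning the failed links separately means one broken page does not abort the whole run, and the list can be logged or retried later.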
With this script, you can scrape internal links for each company and save the respective text data in separate files for easier processing and analysis.
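
As an alternative to one text file per company, the collected text could go into a single structured file. This is a minimal sketch assuming the pages have already been scraped into a dict that maps each company name to `{link: text}`. The function name `save_as_json`, the file name, and the record layout are illustrative choices, not part of the script above.

```python
import json


def save_as_json(scraped, output_file):
    """Write {company: {link: text}} as a list of flat records."""
    records = [
        {"company": company, "link": link, "text": text}
        for company, pages in scraped.items()
        for link, text in pages.items()
    ]
    with open(output_file, "w", encoding="utf-8") as file:
        # ensure_ascii=False keeps non-ASCII page text readable in the file
        json.dump(records, file, ensure_ascii=False, indent=2)
    return len(records)


count = save_as_json(
    {"Company1": {"https://example-company1.com/about": "About us"}},
    "scraped_data.json",
)
print(count)
```

Each record carries its company name, so the data stays attributable even after the records are merged, filtered, or loaded into another tool.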
     
 