How to know if Google Sheets IMPORTDATA, IMPORTFEED, IMPORTHTML or IMPORTXML functions are able to get data from a resource hosted on a website?

If the content is added dynamically (using JavaScript), it can’t be imported using Google Sheets built-in functions. Also, if the website’s webmaster has taken certain measures, these functions will not be able to import the data.


To check if the content is added dynamically, using Chrome,

  1. Open the URL of the source data.
  2. Press F12 to open Chrome Developer Tools.
  3. Press Control+Shift+P to open the Command Menu.
  4. Start typing javascript, select Disable JavaScript, and then press Enter to run the command. JavaScript is now disabled.

JavaScript will remain disabled in this tab so long as you have DevTools open.

Reload the page to see whether the content that you want to import is shown. If it is, it can be imported using Google Sheets built-in functions; otherwise it can’t, though it might still be possible by other means of web scraping.
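Another way to reason about the same check, outside the browser: whatever the server returns without running any JavaScript is what the import functions see. A minimal Python sketch (the helper names and sample markup below are made up for illustration):

```python
# Rough equivalent of the "disable JavaScript" check: fetch the raw HTML
# that the server sends (no JavaScript runs here) and look for the value
# you want to import.
from urllib.request import urlopen

def is_static_content(html: str, needle: str) -> bool:
    """Return True if the text appears in the server-delivered markup."""
    return needle in html

def fetch_raw_html(url: str) -> str:
    # urlopen never executes JavaScript, so this is the same view of the
    # page that the Sheets import functions get.
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

# Example with inline markup (no network needed):
static_page = "<table><tr><td>42</td></tr></table>"
print(is_static_content(static_page, "42"))   # True: importable
dynamic_page = "<div id='app'></div><script src='app.js'></script>"
print(is_static_content(dynamic_page, "42"))  # False: rendered by JS
```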

According to Wikipedia,

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

Webmasters can use a robots.txt file to block access to the website. In that case the result will be #N/A Could not fetch url.

The webpage could also be designed to return a custom message instead of the data.
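The robots.txt case can be simulated locally with Python’s urllib.robotparser. The rules below are invented for illustration, and the GoogleDocs user-agent token is an assumption about how Google Sheets identifies itself, not documented behaviour:

```python
# Sketch: check whether a robots.txt file would block a given fetcher.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks Sheets-style fetchers but allows
# everyone else. Fetch the real file from https://example.com/robots.txt
# to test an actual site.
rules = """
User-agent: GoogleDocs
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "GoogleDocs" as the user-agent token is an assumption for this sketch.
print(parser.can_fetch("GoogleDocs", "https://example.com/data.csv"))   # False
print(parser.can_fetch("SomeBrowser", "https://example.com/data.csv"))  # True
```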


IMPORTDATA, IMPORTFEED, IMPORTHTML and IMPORTXML are able to get content from resources hosted on websites that are:

  • Publicly available. This means that the resource doesn’t require authorization or being logged in to any service in order to access it.
  • The content is “static”. This means that if you open the resource using the view-source option of a modern web browser, it is displayed as plain text.
    • NOTE: Chrome’s Inspect tool shows the parsed DOM; in other words, the actual structure/content of the web page, which could be dynamically modified by JavaScript code or browser extensions/plugins.
  • The content has the appropriate structure.
    • IMPORTDATA works with structured content such as CSV or TSV, regardless of the file extension of the resource.
    • IMPORTFEED works with marked-up content such as ATOM/RSS.
    • IMPORTHTML works with marked-up content such as HTML that includes properly marked-up lists or tables.
    • IMPORTXML works with marked-up content such as XML or any of its variants, like XHTML.
  • Google’s servers are not blocked by means of robots.txt or by user agent.
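The point about file extensions can be checked locally: if the payload parses as delimiter-separated values, IMPORTDATA has a chance at it. A small sketch with Python’s csv module, using made-up data:

```python
# A payload served as .txt (or anything else) is still fine for
# IMPORTDATA as long as the content itself is CSV or TSV.
import csv
import io

payload = "name,points\nAlice,10\nBob,7\n"  # could be served as .txt
rows = list(csv.reader(io.StringIO(payload)))
print(rows[0])    # ['name', 'points']
print(len(rows))  # 3

# Tab-separated content works the same way with delimiter="\t":
tsv = "name\tpoints\nAlice\t10\n"
print(list(csv.reader(io.StringIO(tsv), delimiter="\t"))[1])  # ['Alice', '10']
```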

The W3C Markup Validator offers several tools to check whether a resource has been properly marked up.

Regarding CSV, check out Are there known services to validate CSV files?

It’s worth noting that the spreadsheet

  • should have enough room for the imported content; Google Sheets has a limit of 5 million cells per spreadsheet, according to this post a limit of 18,278 columns, and a limit of 50,000 characters per cell, whether entered as a value or as a formula.
  • doesn’t handle large in-cell content well; the “limit” depends on the user’s screen size and resolution, since it’s now possible to zoom in/out.
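The limits above can be turned into a quick back-of-the-envelope check before importing. A sketch with hypothetical figures:

```python
# Check whether an import of a given size fits the Sheets limits quoted
# above. The row/column figures in the examples are made up.
CELL_LIMIT = 5_000_000    # cells per spreadsheet
COLUMN_LIMIT = 18_278     # columns per sheet
CHARS_PER_CELL = 50_000   # characters per cell

def fits(rows: int, cols: int, cells_already_used: int = 0) -> bool:
    """True if the imported range stays within the quoted limits."""
    return (cols <= COLUMN_LIMIT
            and cells_already_used + rows * cols <= CELL_LIMIT)

print(fits(100_000, 40))  # True  (4,000,000 cells)
print(fits(100_000, 60))  # False (6,000,000 cells)
```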


Related

The following question is about a different result, #N/A Could not fetch url

importhtml – Can you filter a column in size order of an imported HTML table? I’ve tried using QUERY with no success

Using the IMPORTHTML below, I’m hoping to order the numerical data of the third column in size order, but I seem to have lost my way with the model I was following. Any help greatly appreciated.

=importhtml("https://fbref.com/en/share/SxTNE","table",0)

=query(IMPORTHTML("https://fbref.com/en/share/SxTNE","table",2),"Select * where Col3='Crdy'")

google sheets – Can you change the importhtml address for the website URL automatically? The website address ends in tomorrow’s date so it changes each day

Using IMPORTHTML I always access tomorrow’s updated data from a website whose webpage address ends with tomorrow’s date. Is there a way of automatically updating that part of the address each day so that it will continue to import tomorrow’s data?

e.g. https://www.poisonfoot.com/30-12-2020/

=importhtml("https://www.poisonfoot.com/30-12-2020/","table",0)


google sheets arrayformula – =IMPORTHTML forces content to date format

I am a beginner with code and formulas here, so I was wondering if anyone could do exactly what was done on this question:

Trying to use Google Sheets importHTML() to import a table. It forces content to a date format

to this table:

https://www.actionnetwork.com/nfl/nfl-against-the-spread-standings

In short, IMPORTHTML changes sports records to date format. Changing the number format to plain text does not work when importing websites, as the format is preset. However, there is a workaround using ARRAYFORMULA and REGEXREPLACE. When you open the link, scroll down to Aurielle’s response for the single-formula version. I have tried to implement the workaround for the NFL record link above, but I can’t quite find the right formula, mostly because, again, I’m a beginner and I don’t understand these formulas. I was wondering if anyone understood this better and could give me the correct formula for this specific table of NFL records.

I would greatly appreciate any help!

google sheets – Reducing amount of IMPORTHTML Requests

Main Question

I’m trying to reduce my IMPORTHTML requests. The main issue is I’m forced to do the process twice.

=INDEX(IMPORTHTML(*Link*,"Table", 1),ROWS(IMPORTHTML(*Link*,"Table", 1)),2)

Is there any way to index the last row of the table without having to reinitialize the table again? The table size is random, between 4 and 12 rows. This wouldn’t be an issue if it weren’t for the fact that I’m using this 5 times (different cells from the imported table) across 282 tables. That’s 1,410 total cells like this, and over 2,820 IMPORTHTML requests…

If I could remove the second request within the formula, by finding the last row some other way or re-referencing the initial request, I could cut the requests in half. 1,410 is still a lot, but it’s faster than 2,820.

Ideally, I’d like to reference the information of a full table within a single cell. Then I could reference the cell whenever I needed the table. Then I’d be down to 282 requests.

If I absolutely have to, I can remove 2 of the columns as the main usage of this chart only needs 3 (Current Team, Career Hits/Strikeout, Career Homeruns/Shutouts). This would reduce it to 1692 requests.
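The “fetch once, reference many times” idea, sketched in plain Python: keep the fetched table in one place and index it as often as needed. In Sheets the analogue would be putting a single IMPORTHTML in a helper range and pointing INDEX at that range instead of repeating the import. The table data below is hypothetical:

```python
# Index an already-fetched table instead of refetching it per lookup.
def last_row(table):
    """Return the last row of an already-fetched table."""
    return table[-1]

# Hypothetical table (the real ones are 4-12 rows), fetched once:
table = [
    ["Player", "Team", "Hits"],
    ["A", "Crabs", "12"],
    ["B", "Crabs", "9"],
]

print(last_row(table))     # ['B', 'Crabs', '9']
print(last_row(table)[2])  # '9'  (third column of the last row)
```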


Specific Information

A Simplified version of the sheet I’m working on.

Here’s what I’m trying to do on each page. I’m sure there are ways to combine some of these into a single step, but I didn’t want my formulas to get so long that I’d be unable to find errors. The original sheet has 20 teams.

  • TeamScraper: Pull the current players on each team and create 2 lists
    (Batters and Pitchers)
  • TeamTranspose: Rotate the list to be vertical.
  • PlayerStack: Take all the players and organize them into 2 lists (Batters and Pitchers). Create 2 more lists with the names reformatted for the LinkBuilder.
  • LinkBuilder: Turn each of the names into a link of the webpage with that player’s stats table.
  • Batters & Pitchers: Pull needed stats from the table relevant to that player.

Don’t mind Baby Triumphant. He’s one of the few cases where the name he’s given in the team list isn’t the one used by the stats website. I’m working on an Exceptions page that the reformatting process will reference.

Edit: While writing this I saw someone asking about too many IMPORTHTMLs. The answerer mentioned a limit of 50 instances of the function per sheet. I find this odd considering my 2,820 requests split between 2 sheets eventually work. It’s just slow. Maybe they mean only 50 requests every few minutes?


google sheets – IMPORTHTML waiting on data

I am fairly new to query languages, but I have been looking around the internet for a solution to this problem without any luck.

I’m trying to import data from leaderboards on specific tracks in TrackMania using the TMX database. I have no problem importing the offline records, but when it comes to the online records the function in Google Sheets returns “Connecting to dedimania, please wait…”. Is it because the data on the site is being updated dynamically? And are there any workarounds to get these records automatically updated in Google Sheets?

=IMPORTHTML("https://tmnforever.tm-exchange.com/main.aspx?action=trackshow&id=7000900#auto";"table";23)
