Download session protected websites using wget and Google Chrome

Downloading websites protected by HTTP authentication with wget is pretty easy. You just add the --user and --password parameters to the command:

wget --user=foo --password=bar URL

However, most websites protect content with a more user-friendly HTML login page, which does not use HTTP authentication but server-side code that checks the user's credentials against a database or authentication server.

In a classic browser-based login, once the user is authenticated, the web server tells the browser to store a cookie containing a session ID. The browser sends this cookie back to the server on every subsequent request to prove that the user has already logged in.

With wget you have to replicate the same procedure: send the credentials in a POST request and store the cookie that the server sends back.

Authenticating in wget

The most basic and common way is to use wget both to authenticate and to download files. In this case you have to look at the HTML login form to get:

  • the input names of the user and password fields
  • the URL of the form action (often the same as the login page)
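
If you prefer not to read the HTML source by hand, the two pieces of information above can be pulled out of a saved copy of the login page. This is only a rough sketch: it uses grep and sed rather than a real HTML parser, it assumes double-quoted attributes, and login.html is a hypothetical file name.

```shell
#!/bin/sh
# Print the action attribute of every <form> tag in the HTML file $1.
form_action() {
    grep -o '<form[^>]*>' "$1" | sed -n 's/.*action="\([^"]*\)".*/\1/p'
}

# Print the name attribute of every <input> tag in the HTML file $1,
# one name per line.
form_fields() {
    grep -o '<input[^>]*>' "$1" | sed -n 's/.*name="\([^"]*\)".*/\1/p'
}

# Usage, after saving the login page to the hypothetical file login.html:
#   form_action login.html
#   form_fields login.html
```

Forms built by JavaScript or spread over several lines per tag will not be picked up by this, so checking the page in the browser is still the fallback.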

Then you have to tell wget to store the resulting cookies for later use. The full command should look something like:

wget --post-data="loginfieldname=user&passwordfieldname=pwd" --save-cookies cookies.txt --keep-session-cookies FORM_ACTION_URL

This will save a cookies.txt file in the current directory that you can use in the following requests with the --load-cookies parameter:

wget --load-cookies=cookies.txt URL

This way you are acting like a browser from a session point of view, and you can use wget to download or mirror the website. Keep in mind, though, that websites can use other techniques to tell browser sessions apart from bots.
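
Putting the login and the download together, a small script could look like the sketch below. Everything site specific here is an assumption for illustration: the example.com URLs, the field names username and password, and the WGET variable, which is only there so that the commands can be printed (WGET=echo) instead of executed.

```shell
#!/bin/sh
# Hypothetical values: replace them with the real form action URL,
# the area of the site to download, and the cookie file to use.
LOGIN_URL="${LOGIN_URL:-https://example.com/login}"
SITE_URL="${SITE_URL:-https://example.com/members/}"
COOKIES="${COOKIES:-cookies.txt}"
WGET="${WGET:-wget}"   # set WGET=echo to print the commands only

# Send the credentials with a POST request and keep the session cookie.
# $1 = user, $2 = password. The field names are assumptions; values that
# contain characters such as & or = must be URL-encoded first.
login() {
    "$WGET" --post-data="username=$1&password=$2" \
        --save-cookies "$COOKIES" --keep-session-cookies \
        -O /dev/null "$LOGIN_URL"
}

# Mirror the protected area, sending the saved cookie with each request.
mirror() {
    "$WGET" --load-cookies "$COOKIES" --mirror --no-parent "$SITE_URL"
}

# Usage:
#   login myuser mypassword && mirror
```

The response to the login request is thrown away with -O /dev/null because only the cookie matters at that step.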

Authenticating in Chrome and passing the cookie to wget

Sometimes it may be difficult to replicate the authentication request correctly in wget, for example because the website uses hidden form variables or other checks to validate the request. You may try to send exactly the same data as the browser would, but a much simpler way is to log in with a browser and then pass the generated cookie to wget.
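
For the harder route of sending the same data as the browser, the hidden variables have to be read from the login page before posting. The following is a sketch under assumptions: the field name csrf_token and the file login.html are hypothetical, and the extraction relies on grep and sed with double-quoted attributes rather than on a real HTML parser.

```shell
#!/bin/sh
# Print the value attribute of the first <input> tag whose name is $2
# in the HTML file $1. Prints nothing if there is no such field.
hidden_value() {
    grep -o "<input[^>]*name=\"$2\"[^>]*>" "$1" \
        | sed -n 's/.*value="\([^"]*\)".*/\1/p' \
        | head -n 1
}

# Usage (URLs and names are placeholders):
#   wget --save-cookies cookies.txt --keep-session-cookies \
#        -O login.html LOGIN_PAGE_URL
#   token=$(hidden_value login.html csrf_token)
#   wget --load-cookies cookies.txt --save-cookies cookies.txt \
#        --keep-session-cookies \
#        --post-data="loginfieldname=user&passwordfieldname=pwd&csrf_token=$token" \
#        FORM_ACTION_URL
```

The cookies are saved already when fetching the login page because such hidden values are often tied to the session that requested the form.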

In the end, it is just a matter of writing the session ID to a text file. To see the cookie content in Chrome (and in Safari, which has very similar developer tools), inspect the page, go to the Resources tab (renamed Application in newer Chrome versions), and look up the cookie values in the Cookies section.

However, the cookie file for wget has to use the format wget expects, which is the Netscape cookies.txt format. There is a very nice extension for Google Chrome called cookie.txt export that does this job for you. Just install it, copy the cookies shown when you click the extension's icon, and paste them into the text file that you will load in wget:

wget --load-cookies=cookies.txt URL
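
If you would rather not install an extension, the file can also be written by hand. As a sketch: each line of the Netscape cookies.txt format holds seven tab-separated fields (domain, include-subdomains flag, path, secure flag, expiry time, name, value). The cookie name PHPSESSID and its value below are hypothetical, and the expiry of 0 is meant to mark a session cookie, which as far as I know is how wget reads it.

```shell
#!/bin/sh
# Write one cookie to a file in the Netscape cookies.txt format.
# $1 = domain, $2 = cookie name, $3 = cookie value, $4 = output file
write_cookie() {
    printf '# Netscape HTTP Cookie File\n' > "$4"
    # fields: domain, include subdomains, path, secure only, expiry, name, value
    printf '%s\tFALSE\t/\tFALSE\t0\t%s\t%s\n' "$1" "$2" "$3" >> "$4"
}

# Usage (copy the real name and value from the browser's cookie list):
#   write_cookie example.com PHPSESSID abc123 cookies.txt
#   wget --load-cookies=cookies.txt URL
```

If the site only accepts the cookie over HTTPS or on a specific path, the secure flag and the path field have to be adjusted to match what the browser shows.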
