[UPDATE] Here is the complete project code

soup('a') returns the complete html tag.

<a href="http://itunes.apple.com/us/store">Buy Music Now</a>

So urlopen fails with the error 'NoneType' object is not callable. You need to extract only the URL (the href attribute).

links=soup.findAll('a',href=True)
for l in links:
    print(l['href'])

You need to validate the URL too. Refer to the following answers.

Again, I would suggest using Python sets instead of lists: you can easily add URLs and omit duplicates.
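A minimal sketch of the set suggestion, written for Python 3 with made-up example.com URLs. The helper name `queue_new_links` is hypothetical and is not part of the project code:

```python
def queue_new_links(tocrawl, crawled, links):
    """Add links to the tocrawl set unless they were already crawled."""
    for link in links:
        if link not in crawled:
            tocrawl.add(link)  # a set silently ignores duplicates
    return tocrawl


tocrawl = {"http://example.com/a"}
crawled = {"http://example.com/"}
found = ["http://example.com/a", "http://example.com/b", "http://example.com/"]

queue_new_links(tocrawl, crawled, found)
print(sorted(tocrawl))
```

Membership tests on a set are also much cheaper than on a list once the crawl grows.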

import re
import httplib
import urllib2
from urlparse import urlparse
import BeautifulSoup

regex = re.compile(
        r'^(?:http|ftp)s?://' # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|' #domain...
        r'localhost|' #localhost...
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})' # ...or ip
        r'(?::\d+)?' # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)

def isValidUrl(url):
    # True when the URL matches the pattern above
    return regex.match(url) is not None

def crawler(SeedUrl):
    tocrawl=[SeedUrl]
    crawled=[]
    while tocrawl:
        page=tocrawl.pop()
        if page in crawled:
            continue  # skip pages that were already fetched
        print 'Crawled:'+page
        pagesource=urllib2.urlopen(page)
        s=pagesource.read()
        soup=BeautifulSoup.BeautifulSoup(s)
        links=soup.findAll('a',href=True)
        for l in links:
            if isValidUrl(l['href']):
                tocrawl.append(l['href'])
        crawled.append(page)
    return crawled
crawler('http://www.princeton.edu/main/')
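The regex above rejects relative links such as `news/`, so they are never crawled. As an alternative to the regex (not the approach used in the answer), here is a hedged Python 3 sketch that resolves each href against the page URL and keeps only http(s) results. The name `normalize_link` is hypothetical:

```python
from urllib.parse import urljoin, urlparse


def normalize_link(base_url, href):
    """Resolve href against base_url; return None unless it is http(s)."""
    absolute = urljoin(base_url, href)
    parts = urlparse(absolute)
    if parts.scheme in ("http", "https") and parts.netloc:
        return absolute
    return None


base = "http://www.princeton.edu/main/"
print(normalize_link(base, "news/"))
print(normalize_link(base, "mailto:someone@example.com"))
```

This only checks the shape of the URL; whether the page actually exists is still decided when it is fetched.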

python 2.7 - simple web crawler - Stack Overflow

python-2.7 beautifulsoup

If you're using requests v2.13 and newer

The user agent should be specified as a field in the header, namely User-Agent.

The simplest way to do what you want is to create a dictionary and specify your headers directly, like so:

import requests

url = 'SOME URL'

headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'youremail@domain.com'  # This is another valid field
}

response = requests.get(url, headers=headers)

Older versions of requests clobbered default headers, so you'd want to do the following to preserve default headers and then add your own to them.

import requests

url = 'SOME URL'

# Get a copy of the default headers that requests would use
headers = requests.utils.default_headers()

# Update the headers with your custom ones
# You don't have to worry about case-sensitivity with
# the dictionary keys, because default_headers uses a custom
# CaseInsensitiveDict implementation within requests' source code.
headers.update(
    {
        'User-Agent': 'My User Agent 1.0',
    }
)

response = requests.get(url, headers=headers)
response.request.headers

The default value is also available as requests.utils.default_user_agent() if you want to just augment that with your own info.

It's not correct. It clobbers the rest of the headers. He should get a copy of defaults from requests.utils.default_user_agent() and update it, and send those.

For convenience, you can open httpbin.org/headers in your browser to see the headers it sends, then reuse them so your query looks like it comes from your browser.

At least in 2.13.0, the headers are not clobbered and the docs just tell you to use the headers kwarg.

web crawler - Sending "User-agent" using Requests library in Python - ...

python web-crawler python-requests

Fastest way to get a list with current directory's files - Python 3

>>> import os
>>> arr = next(os.walk('.'))[2]
>>> arr
['5bs_Turismo1.pdf', '5bs_Turismo1.pptx', 'esperienza.txt']
>>> import os
>>> path = os.getcwd()
>>> arr = []
>>> for files in next(os.walk(path))[2]:
...     arr.append(path + "\\" + files)
...
>>> for files in arr:
...     print(files)
...
F:\_moduli_economia\5bs_Turismo1.pdf
F:\_moduli_economia\5bs_Turismo1.pptx
F:\_moduli_economia\esperienza.txt

Here is a list of what I talked about in this answer:

  • 1.1 - Use of list comprehension to select only txt files
  • 1.2 - Using os.path.isfile to avoid directories in the list
  • 4.1 - python 2.7 - os.walk('.')
  • Example of use of os.walk('.') to count how many files there are in a directory and its subdirectories (for python 3.5 and 2.7)
  • Bonus: search for a type of files and copy them in a dir
>>> import os
>>> arr = os.listdir()
>>> arr
['$RECYCLE.BIN', 'work.txt', '3ebooks.txt', 'documents']
>>> arr_txt = [x for x in os.listdir() if x.endswith(".txt")]
>>> print(arr_txt)
['work.txt', '3ebooks.txt']
import os.path

listOfFiles = [f for f in os.listdir() if os.path.isfile(f)]

print(listOfFiles)

There are only files here

import pathlib

>>> flist = []
>>> for p in pathlib.Path('.').iterdir():
...  if p.is_file():
...   print(p)
...   flist.append(p)
...
error.PNG
exemaker.bat
guiprova.mp3
setup.py
speak_gui2.py
thumb.PNG

If you want to use list comprehension

>>> flist = [p for p in pathlib.Path('.').iterdir() if p.is_file()]

To include all the files in the subdirectories (in this example there are 11 files in the first directory and 3 in a subdirectory), I will use os.walk(), which works well in Python 3.5 and newer versions:

import os
x = [i[2] for i in os.walk('.')]
y=[]
for t in x:
    for f in t:
        y.append(f)
print(y)
# print y # for 2.7 uncomment this and comment the previous line
>>> import os
>>> x = next(os.walk('F://python'))[2] # for the current dir use ('.')
>>> x
['calculator.bat','calculator.py']

When you use next(os.walk('.')), you get the same results as os.listdir(), but as a tuple: the root is the first item, all the folders are in the second item, and all the files are in the third, while os.listdir() returns folders and files in the same list. In both cases (next(os.walk('.')) and os.listdir()) you only look in the current directory, leaving the subdirectories alone (you must iterate over os.walk('.') for that, as we showed before).
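To make the comparison concrete, here is a small Python 3 sketch that builds a throwaway directory (the file names are made up) and shows the three items returned by next(os.walk(...)) next to the single list returned by os.listdir():

```python
import os
import tempfile

# Build a throwaway directory so the example does not depend on your files.
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "documents"))
    for name in ("work.txt", "notes.txt"):
        open(os.path.join(top, name), "w").close()
    # A file in the subdirectory, which next(os.walk(top)) does not list.
    open(os.path.join(top, "documents", "deep.txt"), "w").close()

    root, dirs, files = next(os.walk(top))
    listing = sorted(os.listdir(top))

print(dirs)           # folders only
print(sorted(files))  # files only
print(listing)        # folders and files together
```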

>>> import os
>>> x = [f.name for f in os.scandir() if f.is_file()]
>>> x
['calculator.bat','calculator.py']

Another example with scandir (a slight variation of the one at docs.python.org). This is more efficient than os.listdir. It lists only the files in the current directory, where the script is executed.

>>> import os
>>> with os.scandir() as i:
...  for entry in i:
...   if entry.is_file():
...    print(entry.name)
...
ebookmaker.py
error.PNG
exemaker.bat
guiprova.mp3
setup.py
speakgui4.py
speak_gui2.py
speak_gui3.py
thumb.PNG
>>>
>>> import os
>>> mylist = os.listdir(os.getcwd())
>>> mylist
['$RECYCLE.BIN', 'work.txt', '3ebooks.txt', 'documents']
>>> for f in os.listdir('..'):
...     print f


>>> for f in os.listdir('/'):
...     print f

It's the same as in Python 3 (except the print)

>>> x = os.listdir('F:/python')
>>> for files in x:
...     print files
...
$RECYCLE.BIN
work.txt
3ebooks.txt
documents

4.1 - python 2.7 - os.walk('.')

Let's make an example for python 2.7 with walk (same as python 3).

>>> def getAllFiles(dir):
...     """Get all the files in the dir and subdirs"""
...     allfiles = []
...     for pack in os.walk(dir):
...         for files in pack[2]:
...             # join with the folder path: a bare name is only valid in the cwd
...             if os.path.isfile(os.path.join(pack[0], files)):
...                 allfiles += [files]
...     return allfiles
...
>>> getAllFiles("F://python")
['first.py', 'Modules.txt', 'test4Console.py', 'text4Console.bat', 'tkinter001.py']

In this example, we count the files contained in a directory and all its subdirectories.

import os    

def count(dir, counter=0):
    "returns number of files in dir and subdirs"
    for pack in os.walk(dir):
        for f in pack[2]:
            counter += 1
    return dir + " : " + str(counter) + " files"


print(count("F:\\python"))
>>> import glob
>>> glob.glob("*.txt")
['ale.txt', 'alunni2015.txt', 'assenze.text.txt', 'text2.txt', 'untitled.txt']

A little script that searches all the subdirectories of some directories (I chose the ones whose names start with an underscore), takes all the files of a given type (pdf, pptx, txt, etc.) and copies them into a destination directory. This is useful if you have made a lot of subdirectories and want to look at all the stuff you made (let's say presentations) in one place, without having to recall where you put one file or another. I hope you find it helpful. I used it for my own purposes.

import os
import shutil

destination = "F:\\pptx_copied"
# os.makedirs(destination)


def copyfile(dir, filetype='pptx', counter=0):
    "Searches for pptx (or other) files and copies them"
    for pack in os.walk(dir):
        for f in pack[2]:
            if f.endswith(filetype):
                fullpath = pack[0] + "\\" + f
                print(fullpath)
                shutil.copy(fullpath, destination)
                counter += 1
    if counter > 0:
        print("------------------------")
        print("\t==> Found in: `" + dir + "` : " + str(counter) + " files\n")


for dir in os.listdir():
    # search for folders whose names start with `_`
    if dir[0] == '_':
        # copyfile(dir, filetype='pdf')
        copyfile(dir, filetype='txt')
_compiti18\Compito Contabilit 1\conti.txt
_compiti18\Compito Contabilit 1\modula4.txt
_compiti18\Compito Contabilit 1\moduloa4.txt
_compiti18\ottobre\3acc\compito.txt
_compiti18\ottobre\3acc\compito1530.txt
_compiti18\ottobre\3acc\compito1530_correttore.txt
_compiti18\ottobre\3acc\compito3825.txt
_compiti18\ottobre\3acc\compito3825_correttore.txt
_compiti18\ottobre\3acc\compito6028.txt
------------------------
==> Found in: `_compiti18` : 9 files
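The script above builds paths with a hard-coded Windows separator. A portable variant might look like the following Python 3 sketch; the function name `copy_by_extension` and the directory names are hypothetical, and the demonstration runs on throwaway directories rather than on F:\:

```python
import os
import shutil
import tempfile


def copy_by_extension(source_dir, destination, extension=".txt"):
    """Copy every file under source_dir ending in extension; return the count."""
    os.makedirs(destination, exist_ok=True)
    copied = 0
    for folder, _subdirs, files in os.walk(source_dir):
        for name in files:
            if name.endswith(extension):
                shutil.copy(os.path.join(folder, name), destination)
                copied += 1
    return copied


# Demonstration on throwaway directories (hypothetical names).
with tempfile.TemporaryDirectory() as base:
    src = os.path.join(base, "_compiti")
    os.makedirs(os.path.join(src, "ottobre"))
    for rel in ("conti.txt", os.path.join("ottobre", "compito.txt"), "slides.pptx"):
        open(os.path.join(src, rel), "w").close()

    dest = os.path.join(base, "copied")
    count = copy_by_extension(src, dest)
    result = sorted(os.listdir(dest))

print(count, result)
```

Files with the same name in different subdirectories overwrite each other in the destination, exactly as in the original script, and the destination should live outside the directory being walked.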

You should include the path argument to listdir.

I agree, but there is something I had not noticed: Python 2 requires the argument, while in Python 3 it is optional. It would be great if you improved the answer for both Python versions :)

Ok, I went into Python 2, found the differences, and edited the post.

python - How do I list all files of a directory? - Stack Overflow

python directory

I have had great success with datetimes on GAE.

from datetime import datetime, timedelta
time_start = datetime.now()
time_taken = datetime.now() - time_start

time_taken will be a timedelta. You can compare it against another timedelta that has the duration you are interested in.

ten_seconds = timedelta(seconds=10)
if time_taken > ten_seconds:
    pass  # do something quick here

It sounds like you would be far better served using mapreduce or Task Queues. Both are great fun for dealing with huge numbers of records.

nobranches=TreeNode.all().fetch(100)

This code will only pull 100 records. If you have a full 100, when you are done, you can throw another item on the queue to launch off more.
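The two ideas above (a time budget and re-queuing leftover work) can be combined. This is only a sketch under stated assumptions: `process_batch` and `handle` are hypothetical names, the items are plain Python values rather than datastore records, and actually re-queuing the leftovers is left to the caller:

```python
from datetime import datetime, timedelta


def process_batch(items, handle, budget=timedelta(seconds=10)):
    """Handle items until the time budget is spent; return what is left over."""
    time_start = datetime.now()
    remaining = list(items)
    while remaining:
        if datetime.now() - time_start > budget:
            break  # out of time: the caller can queue `remaining` as a new task
        handle(remaining.pop(0))
    return remaining


done = []
left = process_batch(range(5), done.append)
print(done, left)
```

If `left` is not empty, that is the point where you would throw another item on the queue to continue the work.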

-- Based on comment about needing trees without branches --

I do not see your model up there, but if I were trying to create a list of all of the trees without branches and process them, I would fetch the keys only for trees in blocks of 100 or so. Then I would fetch all of the branches that belong to those trees using an IN query, ordered by the tree key. Scan the list of branches; the first time you find a tree's key, pull that tree's key from the list. When done, you will have a list of "branchless" tree keys. Schedule each one of them for processing.
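The filtering step described above can be sketched in plain Python. This is an in-memory stand-in, not datastore code: the keys are made-up strings, the function name `branchless_tree_keys` is hypothetical, and a set lookup replaces the ordered scan described in the text:

```python
def branchless_tree_keys(tree_keys, branch_tree_keys):
    """Return the tree keys that no branch refers to, in their original order.

    tree_keys: keys of one block of trees (about 100 in the text above).
    branch_tree_keys: for each fetched branch, the key of the tree it belongs to.
    """
    with_branches = set(branch_tree_keys)
    return [key for key in tree_keys if key not in with_branches]


trees = ["oak", "elm", "ash", "yew"]
branches = ["oak", "oak", "ash"]  # each entry is the owning tree's key
print(branchless_tree_keys(trees, branches))
```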

A simpler version is to use MapReduce on the trees. For each tree, find one branch that matches its ID. If you cannot, flag the tree for follow up. By default, this function will pull batches of trees (I think 25) with 8 simultaneous workers. And, it manages the job queues internally so you don't have to worry about timing out.

Thank you, datetime seems to be working better. I can't use nobranches=TreeNode.all().fetch(100), since my for loop after that looks only for nodes whose branches are [] (for tree in nobranches: if tree.branches==[]:). Using fetch(100) would return the same 100 nodes every time, and I want to add to new untouched branches. I wish I could get the nodes without branches in GQL, but this seems to be the only way.

I would say either use map-reduce on the whole beast or create a filtered query that retrieves only the nodes you need. I also noticed that you are retrieving records one by one in add_branches. If you get all of the child_node records in one round trip, it should speed up your function.

python - app engine DeadlineExceededError for cron jobs and task queue...

python google-app-engine cron wikipedia

Instead of parsing the URLs yourself, you can use the urlparse.parse_qs function:

>>> from urlparse import urlparse, parse_qs
>>> URL = 'https://someurl.com/with/query_string?i=main&mode=front&sid=12ab&enc=+Hello'
>>> parsed_url = urlparse(URL)
>>> parse_qs(parsed_url.query)
{'i': ['main'], 'enc': [' Hello'], 'mode': ['front'], 'sid': ['12ab']}
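In Python 3 the same functions live in urllib.parse. A short sketch with the same URL, also showing how to read a single value with a fallback (the key `missing` and the fallback `'default'` are made up for illustration):

```python
from urllib.parse import parse_qs, urlparse

URL = 'https://someurl.com/with/query_string?i=main&mode=front&sid=12ab&enc=+Hello'
params = parse_qs(urlparse(URL).query)

# Each value is a list, because a key may appear more than once.
print(params['sid'][0])
print(params.get('missing', ['default'])[0])
```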

web scraping - Python ValueError: too many values to unpack for crawle...

python web-scraping web-crawler valueerror

import requests

session = requests.Session()
session.headers.update({'User-Agent': 'Custom user agent'})

session.get('https://httpbin.org/headers')

web crawler - Sending "User-agent" using Requests library in Python - ...

python web-crawler python-requests

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider.

The 0.14 release notes mention: "Added CloseSpider exception to manually close spiders (r2691)"

Example as per the docs:

from scrapy.exceptions import CloseSpider

def parse_page(self, response):
  if 'Bandwidth exceeded' in response.body:
    raise CloseSpider('bandwidth_exceeded')

It succeeds in forcing a stop, but not fast enough. It still lets some requests run. I hope Scrapy will provide a better solution in the future.

From my observations, it finishes the requests which were already fired, no?

python - Force my scrapy spider to stop crawling - Stack Overflow

python scrapy

As suggested in comments, please use pytidylib...

import urllib2
from StringIO import StringIO

from BeautifulSoup import BeautifulSoup
from tidylib import tidy_document

html = urllib2.urlopen("http://www.hitmeister.de").read()
tidy, errors = tidy_document(html)
soup = BeautifulSoup(tidy)
print type(soup)
(py26_default)[mpenning@Bucksnort ~]$ python foo.py
<class 'BeautifulSoup.BeautifulSoup'>
(py26_default)[mpenning@Bucksnort ~]$
The errors reported by pytidylib:
line 53 column 1493 - Warning: '<' + '/' + letter not allowed here
line 53 column 1518 - Warning: '<' + '/' + letter not allowed here
line 53 column 1541 - Warning: '<' + '/' + letter not allowed here
line 53 column 1547 - Warning: '<' + '/' + letter not allowed here
line 132 column 239 - Warning: '<' + '/' + letter not allowed here
line 135 column 231 - Warning: '<' + '/' + letter not allowed here
line 434 column 98 - Warning: replacing invalid character code 156
line 453 column 96 - Warning: replacing invalid character code 156
line 780 column 108 - Warning: replacing invalid character code 159
line 991 column 27 - Warning: replacing invalid character code 156
line 1018 column 43 - Warning: '<' + '/' + letter not allowed here
line 1029 column 40 - Warning: '<' + '/' + letter not allowed here
line 1037 column 126 - Warning: '<' + '/' + letter not allowed here
line 1039 column 96 - Warning: '<' + '/' + letter not allowed here
line 1040 column 71 - Warning: '<' + '/' + letter not allowed here
line 1041 column 58 - Warning: '<' + '/' + letter not allowed here
line 1047 column 126 - Warning: '<' + '/' + letter not allowed here
line 1049 column 96 - Warning: '<' + '/' + letter not allowed here
line 1050 column 72 - Warning: '<' + '/' + letter not allowed here
line 1051 column 58 - Warning: '<' + '/' + letter not allowed here
line 1063 column 108 - Warning: '<' + '/' + letter not allowed here
line 1066 column 58 - Warning: '<' + '/' + letter not allowed here
line 1076 column 17 - Warning: <input> element not empty or not closed
line 1121 column 140 - Warning: '<' + '/' + letter not allowed here
line 1202 column 33 - Error: <g:plusone> is not recognized!
line 1202 column 33 - Warning: discarding unexpected <g:plusone>
line 1202 column 88 - Warning: discarding unexpected </g:plusone>
line 1245 column 86 - Warning: replacing invalid character code 130
line 1265 column 33 - Warning: entity "&gt" doesn't end in ';'
line 1345 column 354 - Warning: '<' + '/' + letter not allowed here
line 1361 column 255 - Warning: unescaped & or unknown entity "&_s_icmp"
line 1361 column 562 - Warning: unescaped & or unknown entity "&_s_icmp"
line 1361 column 856 - Warning: unescaped & or unknown entity "&_s_icmp"
line 1397 column 115 - Warning: replacing invalid character code 130
line 1425 column 116 - Warning: replacing invalid character code 130
line 1453 column 115 - Warning: replacing invalid character code 130
line 1481 column 116 - Warning: replacing invalid character code 130
line 1509 column 116 - Warning: replacing invalid character code 130
line 1523 column 251 - Warning: replacing invalid character code 159
line 1524 column 259 - Warning: replacing invalid character code 159
line 1524 column 395 - Warning: replacing invalid character code 159
line 1533 column 151 - Warning: replacing invalid character code 159
line 1537 column 115 - Warning: replacing invalid character code 130
line 1565 column 116 - Warning: replacing invalid character code 130
line 1593 column 116 - Warning: replacing invalid character code 130
line 1621 column 115 - Warning: replacing invalid character code 130
line 1649 column 115 - Warning: replacing invalid character code 130
line 1677 column 115 - Warning: replacing invalid character code 130
line 1705 column 115 - Warning: replacing invalid character code 130
line 1750 column 150 - Warning: replacing invalid character code 130
line 1774 column 150 - Warning: replacing invalid character code 130
line 1798 column 150 - Warning: replacing invalid character code 130
line 1822 column 150 - Warning: replacing invalid character code 130
line 1826 column 78 - Warning: replacing invalid character code 130
line 1854 column 150 - Warning: replacing invalid character code 130
line 1878 column 150 - Warning: replacing invalid character code 130
line 1902 column 150 - Warning: replacing invalid character code 130
line 1926 column 150 - Warning: replacing invalid character code 130
line 1954 column 186 - Warning: unescaped & or unknown entity "&charge"
line 2004 column 100 - Warning: replacing invalid character code 156
line 2033 column 162 - Warning: replacing invalid character code 159
line 21 column 1 - Warning: <meta> proprietary attribute "property"
line 22 column 1 - Warning: <meta> proprietary attribute "property"
line 23 column 1 - Warning: <meta> proprietary attribute "property"
line 29 column 1 - Warning: <meta> proprietary attribute "property"
line 30 column 1 - Warning: <meta> proprietary attribute "property"
line 31 column 1 - Warning: <meta> proprietary attribute "property"
line 412 column 9 - Warning: <body> proprietary attribute "itemscope"
line 412 column 9 - Warning: <body> proprietary attribute "itemtype"
line 1143 column 1 - Warning: <script> inserting "type" attribute
line 1225 column 44 - Warning: <table> lacks "summary" attribute
line 1934 column 9 - Warning: <div> proprietary attribute "name"
line 436 column 41 - Warning: trimming empty <li>
line 446 column 89 - Warning: trimming empty <li>
line 1239 column 33 - Warning: trimming empty <span>
line 1747 column 37 - Warning: trimming empty <span>
line 1771 column 37 - Warning: trimming empty <span>
line 1795 column 37 - Warning: trimming empty <span>
line 1819 column 37 - Warning: trimming empty <span>
line 1851 column 37 - Warning: trimming empty <span>
line 1875 column 37 - Warning: trimming empty <span>
line 1899 column 37 - Warning: trimming empty <span>
line 1923 column 37 - Warning: trimming empty <span>
line 2018 column 49 - Warning: trimming empty <span>
line 2026 column 49 - Warning: trimming empty <span>

Maybe this question is not related to BS, but I am new to Python. Can you please explain this line: tidy, errors = tidy_document(html)?

tidy is the html document cleaned up by pytidylib, and errors are the errors pytidylib found in the original I sent to it.

correct, tidy_document returns two values: a sanitized html string and errors found during the html sanitization.
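The same unpacking works for any function that returns two values. A tiny sketch, where `clean_document` is a hypothetical stand-in and does not reproduce what pytidylib does:

```python
def clean_document(text):
    """Hypothetical stand-in for a function that returns two values."""
    cleaned = text.strip()
    problems = [] if cleaned == text else ["stray whitespace removed"]
    return cleaned, problems  # packed into one tuple


# The tuple is unpacked into two names, left to right.
tidy, errors = clean_document("  <p>hi</p> ")
print(tidy)
print(errors)
```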

Mike Pennington, nice explanation of a feature that confused me about python as a n00b (and one that I quickly came to love)

web crawler - Python BeautifulSoup Error - Stack Overflow

python web-crawler beautifulsoup lxml html5lib

Python 2.x - The Long Version

  • Try to convert strings to Unicode strings as soon as possible in your code

Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode code points and therefore can hold any Unicode code point from across the entire spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5, etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code: it will convert ASCII or re-wrap existing Unicode strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode strings before passing them to Markdown.

Unicode strings can be declared in your code using the u prefix to strings. E.g.

>>> my_u = u'my ünicôdé strįng'
>>> type(my_u)
<type 'unicode'>

Unicode strings may also come from file, databases and network modules. When this happens, you don't need to worry about the encoding.

Conversion from str to Unicode can happen even when you don't explicitly call unicode().

The following scenarios cause UnicodeDecodeError exceptions:

# Explicit conversion without encoding
unicode('€')

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format('€')

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u'The currency is: %s' % '€'

# Append string to Unicode
# Python will try to convert string to Unicode first
u'The currency is: ' + '€'

In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode code point value; it's no coincidence). The correct decode() is invoked and conversion to a Python Unicode is successful:

In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:
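The byte values described above can be checked directly. The sketch below is written for Python 3, where byte strings are spelled b'...', but the decoding rules are the same ones discussed here:

```python
utf8_bytes = b'caf\xc3\xa9'   # "café" encoded as UTF-8: é takes two bytes
cp1252_bytes = b'caf\xe9'     # "café" encoded as Cp1252: é is the single byte 0xE9

# Decoding with the matching codec gives the same text either way.
print(utf8_bytes.decode('utf-8') == cp1252_bytes.decode('cp1252'))

# Decoding with ascii fails on the first byte above 0x7F.
try:
    utf8_bytes.decode('ascii')
except UnicodeDecodeError as exc:
    print(exc.reason)
```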

It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
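A minimal sketch of the sandwich, written for Python 3 (where the text type is str and the encoded type is bytes). The function name `shout` and the choice of upper-casing as the "work" step are made up for illustration:

```python
def shout(raw_bytes, encoding="utf-8"):
    """Decode on the way in, work with text, encode on the way out."""
    text = raw_bytes.decode(encoding)   # bytes -> text at the boundary
    text = text.upper()                 # all processing happens on text
    return text.encode(encoding)        # text -> bytes at the boundary


print(shout(b'Z\xc3\xbcrich'))
```

Upper-casing the raw bytes directly would leave the two bytes of ü untouched, which is why the work happens on decoded text.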

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

u'Zürich'

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as 'UTF-8', you would use:

# encoding: utf-8

This is only necessary when you have non-ASCII in your source code.

Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file; it can't be easily guessed. For example, for a UTF-8 file:

import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
     my_unicode_string = my_file.read()

my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.

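For CSV files, use io.open as above, but pass the opened file to the csv reader from the backports.csv module:

from backports import csv
import io

# read_rows is an illustrative name; yield is only valid inside a function
def read_rows(path):
    with io.open(path, "r", encoding="utf-8") as my_file:
        for row in csv.reader(my_file):
            yield row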

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

For MySQLdb, add to the connection arguments:

charset='utf8',
use_unicode=True

For psycopg2, register the Unicode types:

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.

If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.

Work with Unicodes as you would normal strs.

print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8bit code page.

An incorrectly configured console, such as one with a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.

io.open

Python 3 is no more Unicode capable than Python 2.x is, but the regular str is now a Unicode string and the old str is now bytes.

The default encoding is now UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.

Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.
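A short Python 3 sketch of both points. The file name is reused from the earlier example and the file is created in a throwaway directory; passing encoding= explicitly is a suggestion to avoid depending on the locale, not something the text above requires:

```python
import os
import tempfile

# bytes.decode() with no argument uses UTF-8 in Python 3.
print(b'caf\xc3\xa9'.decode() == 'caf\xe9')

# Passing encoding= explicitly avoids depending on the locale.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "my_utf8_file.txt")
    with open(path, "w", encoding="utf-8") as my_file:
        my_file.write("Zürich")
    with open(path, "r", encoding="utf-8") as my_file:
        text = my_file.read()   # text mode returns str
    with open(path, "rb") as my_file:
        raw = my_file.read()    # binary mode returns the encoded bytes

print(text == "Zürich", raw)
```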

python - How to fix: "UnicodeDecodeError: 'ascii' codec can't decode b...

python python-2.7 chinese-locale
Rectangle 27 264

Python 2.x - The Long Version

  • Try to convert strings to Unicode strings as soon as possible in your code

Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They only hold Unicode point codes and therefore can hold any Unicode point from across the entire spectrum. Strings contain encoded text, beit UTF-8, UTF-16, ISO-8895-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicodes strings to a new Unicode string. The Markdown authors can't know the encoding of the incoming string so will rely on you to decode strings to Unicode strings before passing to Markdown.

Unicode strings can be declared in your code using the u prefix to strings. E.g.

>>> my_u = u'my ünicôdé strįng'
>>> type(my_u)
<type 'unicode'>

Unicode strings may also come from files, databases and network modules. When this happens, you don't need to worry about the encoding.

Conversion from str to Unicode can happen even when you don't explicitly call unicode().

The following scenarios cause UnicodeDecodeError exceptions:

# Explicit conversion without encoding
unicode('€')

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format('€')

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u'The currency is: %s' % '€'

# Append string to Unicode
# Python will try to convert string to Unicode first
u'The currency is: ' + '€'

In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode code point value; it's no coincidence). The correct decode() is invoked and conversion to a Python Unicode is successful:

In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:
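The two cases described above can be reproduced in a few lines. This is a minimal sketch written in Python 3 syntax, where bytes plays the role of Python 2's str; the variable names are only illustrative.

```python
# "café" as it would arrive from a UTF-8 terminal and from a Cp1252 terminal.
utf8_bytes = b"caf\xc3\xa9"    # é takes two bytes in UTF-8
cp1252_bytes = b"caf\xe9"      # é is the single byte 0xE9 in Cp1252

# Decoding with the matching codec recovers the same Unicode string.
from_utf8 = utf8_bytes.decode("utf-8")
from_cp1252 = cp1252_bytes.decode("cp1252")

# Decoding as ASCII fails because 0xC3 is greater than 0x7F.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

Both successful decodes yield the same text, which is the point of decoding early: once the bytes are Unicode, their original encoding no longer matters.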

It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
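As a rough illustration of the sandwich, the sketch below decodes at the boundary, does its work on Unicode text, and encodes again on the way out. It is written in Python 3 syntax, and the function name shout is made up for this example.

```python
def shout(raw_bytes, encoding="utf-8"):
    """Upper-case encoded text using the decode / work / encode pattern."""
    text = raw_bytes.decode(encoding)   # bottom slice: decode incoming bytes
    text = text.upper()                 # filling: work with Unicode only
    return text.encode(encoding)        # top slice: encode on the way out


result = shout(u"z\u00fcrich".encode("utf-8"))
```

The middle of the function never needs to know which encoding the data arrived in; only the two boundary lines do.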

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

u'Zürich'

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as 'UTF-8', you would use:

# encoding: utf-8

This is only necessary when you have non-ASCII in your source code.

Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can't be easily guessed. For example, for a UTF-8 file:

import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    my_unicode_string = my_file.read()

my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.

For CSV files, use the backports csv module in the same way, but pass the opened file to it:

from backports import csv
import io

def read_rows(path):
    # Yield each CSV row as a list of Unicode strings
    with io.open(path, "r", encoding="utf-8") as my_file:
        for row in csv.reader(my_file):
            yield row

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

For MySQL, add to the connection string:

charset='utf8',
use_unicode=True

For PostgreSQL (psycopg2), add:

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

Web pages can be encoded in just about any encoding. The Content-Type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.
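Decoding manually against the charset hint might look like the sketch below. It assumes Python 3 and uses the standard library's email.message.Message to parse the header value; the function name decode_body and the UTF-8 fallback are choices made for this example, not part of any web library.

```python
from email.message import Message


def decode_body(body_bytes, content_type, default="utf-8"):
    """Decode a response body using the charset named in Content-Type."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset parameter,
    # or the fallback when the header carries no charset.
    charset = msg.get_content_charset(default)
    return body_bytes.decode(charset)


latin1_page = decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1")
no_hint_page = decode_body(b"caf\xc3\xa9", "text/html")
```

If the server's hint is wrong or missing and the fallback does not match, the decode raises UnicodeDecodeError or produces mojibake, so the fallback is worth choosing deliberately.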

If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.

Work with Unicodes as you would normal strs.

print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8-bit code page.

An incorrectly configured console, such as one with a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.
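One way to see the effect of PYTHONIOENCODING is to start a child interpreter with the variable set and ask it what encoding it chose for stdout. This is only a sketch, written for Python 3; the helper name child_stdout_encoding is made up for the example.

```python
import os
import subprocess
import sys


def child_stdout_encoding(forced):
    """Run a child Python with PYTHONIOENCODING set and report its stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=forced)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    return out.decode("ascii").strip()


reported = child_stdout_encoding("utf-8")
```

In a shell, the same thing is normally done by exporting the variable before running the script.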

Python 3 is no more Unicode capable than Python 2.x is, but the regular str is now a Unicode string and the old str is now bytes.

The default encoding is now UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.

Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.
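Because the default depends on the locale, passing the encoding explicitly keeps the behaviour the same on every platform. A minimal Python 3 sketch, using a temporary file and a helper name (roundtrip) invented for the example:

```python
import os
import tempfile


def roundtrip(text, encoding="utf-8"):
    """Write text to a temporary file and read it back with an explicit encoding."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as handle:
            handle.write(text)
        # Text mode returns decoded str objects, not bytes.
        with open(path, "r", encoding=encoding) as handle:
            return handle.read()
    finally:
        os.remove(path)


read_back = roundtrip(u"caf\u00e9")
```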

python - How to fix: "UnicodeDecodeError: 'ascii' codec can't decode b...

python python-2.7 chinese-locale
Rectangle 27 1

The problem here is that you're doing a query operation for every link in your document. Since Wikipedia pages can contain a lot of links, this means a lot of queries - and hence, you run out of processing time. This approach is also going to consume your quota at a fantastic rate!

Instead, you should use the page name of the Wikipedia page as the key name of the entity. Then, you can collect up all the links from the document into a list, construct keys from them (which is an entirely local operation), and do a single batch db.get for all of them. Once you've updated and/or created them as appropriate, you can do a batch db.put to store them all to the datastore - reducing your total datastore operations from numlinks*2 to just 2!
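The shape of that change can be sketched without App Engine at all. The FakeDatastore class below is an in-memory stand-in that only counts round trips; it is not the App Engine db API, and the entity layout (a name plus a count) is hypothetical. In real code the two batch calls would be db.get and db.put.

```python
class FakeDatastore(object):
    """In-memory stand-in that counts round trips (NOT the App Engine API)."""

    def __init__(self):
        self.entities = {}
        self.calls = 0

    def get(self, keys):
        self.calls += 1                      # one round trip for all keys
        return [self.entities.get(key) for key in keys]

    def put(self, items):
        self.calls += 1                      # one round trip for all entities
        self.entities.update(items)


def record_links(store, page_names):
    """Create or update one entity per linked page using two batch calls."""
    keys = sorted(set(page_names))           # keys are built locally from page names
    found = store.get(keys)                  # single batch get
    updates = {}
    for key, entity in zip(keys, found):
        count = entity["count"] if entity else 0
        updates[key] = {"name": key, "count": count + 1}
    store.put(updates)                       # single batch put
    return store.calls


store = FakeDatastore()
round_trips = record_links(store, ["Python", "Unicode", "Python"])
```

However many links the page has, the function makes two round trips, which is what keeps the request inside its deadline.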

Agree! For added bonus points, I think you might be able to stick a yield in front of that db.put to allow the put to be asynchronous. (I know you can in mapreduce workers).

Putting yield in front of a normal db.put operation does not magically turn it into an asynchronous operation. What you yield from mapreduce workers is a "special" put operation, and the mapreduce framework itself is built to work efficiently with the generators created by yielding these special operations.

python - app engine DeadlineExceededError for cron jobs and task queue...

python google-app-engine cron wikipedia
Rectangle 27 2007

List - a mutable type

Arguments are passed by assignment. The rationale behind this is twofold:

  • the parameter passed in is actually a reference to an object (but the reference is passed by value)
  • some data types are mutable, but others aren't

Let's try to modify the list that was passed to a method:

def try_to_change_list_contents(the_list):
    print('got', the_list)
    the_list.append('four')
    print('changed to', the_list)

outer_list = ['one', 'two', 'three']

print('before, outer_list =', outer_list)
try_to_change_list_contents(outer_list)
print('after, outer_list =', outer_list)

Output:

before, outer_list = ['one', 'two', 'three']
got ['one', 'two', 'three']
changed to ['one', 'two', 'three', 'four']
after, outer_list = ['one', 'two', 'three', 'four']

Since the parameter passed in is a reference to outer_list, not a copy of it, we can use the mutating list methods to change it and have the changes reflected in the outer scope.

Now let's see what happens when we try to change the reference that was passed in as a parameter:

def try_to_change_list_reference(the_list):
    print('got', the_list)
    the_list = ['and', 'we', 'can', 'not', 'lie']
    print('set to', the_list)

outer_list = ['we', 'like', 'proper', 'English']

print('before, outer_list =', outer_list)
try_to_change_list_reference(outer_list)
print('after, outer_list =', outer_list)

Output:

before, outer_list = ['we', 'like', 'proper', 'English']
got ['we', 'like', 'proper', 'English']
set to ['and', 'we', 'can', 'not', 'lie']
after, outer_list = ['we', 'like', 'proper', 'English']

Since the the_list parameter was passed by value, assigning a new list to it had no effect that the code outside the method could see. The the_list was a copy of the outer_list reference, and we had the_list point to a new list, but there was no way to change where outer_list pointed.
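The difference between mutating and rebinding can also be checked with an identity test. This is a small sketch with made-up function names (mutate and rebind); it only uses the is operator to ask whether two names refer to the same object.

```python
def mutate(the_list):
    the_list.append("four")          # changes the object the caller also sees
    return the_list


def rebind(the_list):
    the_list = ["brand", "new"]      # only the local name now points elsewhere
    return the_list


outer_list = ["one", "two", "three"]
mutated_is_outer = mutate(outer_list) is outer_list   # same object
rebound_is_outer = rebind(outer_list) is outer_list   # a different object
```

After both calls outer_list still refers to the original list object, which now has four items.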

A string is immutable, so there's nothing we can do to change its contents. Let's try to change the reference instead:

def try_to_change_string_reference(the_string):
    print('got', the_string)
    the_string = 'In a kingdom by the sea'
    print('set to', the_string)

outer_string = 'It was many and many a year ago'

print('before, outer_string =', outer_string)
try_to_change_string_reference(outer_string)
print('after, outer_string =', outer_string)

Output:

before, outer_string = It was many and many a year ago
got It was many and many a year ago
set to In a kingdom by the sea
after, outer_string = It was many and many a year ago

Again, since the the_string parameter was passed by value, assigning a new string to it had no effect that the code outside the method could see. The the_string was a copy of the outer_string reference, and we had the_string point to a new string, but there was no way to change where outer_string pointed.

EDIT: It's been noted that this doesn't answer the question that @David originally asked, "Is there something I can do to pass the variable by actual reference?". Let's work on that.

As @Andrea's answer shows, you could return the new value. This doesn't change the way things are passed in, but does let you get the information you want back out:

def return_a_whole_new_string(the_string):
    new_string = something_to_do_with_the_old_string(the_string)
    return new_string

# then you could call it like
my_string = return_a_whole_new_string(my_string)

If you really wanted to avoid using a return value, you could create a class to hold your value and pass it into the function or use an existing class, like a list:

def use_a_wrapper_to_simulate_pass_by_reference(stuff_to_change):
    new_string = something_to_do_with_the_old_string(stuff_to_change[0])
    stuff_to_change[0] = new_string

# then you could call it like
wrapper = [my_string]
use_a_wrapper_to_simulate_pass_by_reference(wrapper)

do_something_with(wrapper[0])
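The other option mentioned above, a class created to hold the value, might look like the following. It is only a sketch: the names Holder and make_upper are invented here, and upper() stands in for whatever the function would really do to the string.

```python
class Holder(object):
    """Tiny mutable container used to simulate pass-by-reference."""

    def __init__(self, value):
        self.value = value


def make_upper(holder):
    # Assigning to an attribute mutates the shared object,
    # so the caller sees the new string through the same holder.
    holder.value = holder.value.upper()


box = Holder("it was many and many a year ago")
make_upper(box)
```

Compared with a one-element list, a named attribute makes the intent slightly more obvious at the call site.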

Then the same is in C, when you pass "by reference" you're actually passing by value the reference... Define "by reference" :P

I'm not sure I understand your terms. I've been out of the C game for a while, but back when I was in it, there was no "pass by reference" - you could pass things, and it was always pass by value, so whatever was in the parameter list was copied. But sometimes the thing was a pointer, which one could follow to the piece of memory (primitive, array, struct, whatever), but you couldn't change the pointer that was copied from the outer scope - when you were done with the function, the original pointer still pointed to the same address. C++ introduced references, which behaved differently.

@Zac Bowling I don't really get how what you're saying is relevant, in a practical sense, to this answer. If a Python newcomer wanted to know about passing by ref/val, then the takeaway from this answer is: 1- You can use the reference that a function receives as its arguments, to modify the 'outside' value of a variable, as long as you don't reassign the parameter to refer to a new object. 2- Assigning to an immutable type will always create a new object, which breaks the reference that you had to the outside variable.

@CamJackson, you need a better example - numbers are also immutable objects in Python. Besides, wouldn't it be true to say that any assignment without subscripting on the left side of the equals sign will rebind the name to a new object, whether it is immutable or not? def Foo(alist): alist = [1,2,3] will not modify the contents of the list from the caller's perspective.

-1. The code shown is good, the explanation as to how is completely wrong. See the answers by DavidCournapeau or DarenThomas for correct explanations as to why.

python - How do I pass a variable by reference? - Stack Overflow

python reference parameter-passing pass-by-reference
Rectangle 27 767

7 in a

You can also consider using a set, but constructing that set from your list may take more time than faster membership testing will save. The only way to be certain is to benchmark well. (this also depends on what operations you require)
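A benchmark along those lines could be as simple as the sketch below, which times a worst-case list lookup, the cost of building the set, and a set lookup. The list size and repeat counts are arbitrary, and the numbers will differ between machines, so no timings are quoted here.

```python
import timeit

a = list(range(100000))
s = set(a)
needle = 99999            # worst case for the list: the last element

# Time 100 membership tests on the list and on the set.
list_time = timeit.timeit(lambda: needle in a, number=100)
set_time = timeit.timeit(lambda: needle in s, number=100)

# Time building the set from the list, which must be paid at least once.
build_time = timeit.timeit(lambda: set(a), number=10) / 10
```

If the list is searched only a handful of times, build_time can outweigh whatever the faster set lookups save.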

But you don't have the index, and getting it will cost you what you saved.

-1. Performance will be horrible for long lists. Using a set or searching on a sorted list using the bisect module is much faster.
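For the sorted-list case, a bisect-based lookup also returns the index. A minimal sketch, assuming the list is already sorted; the helper name index_sorted and the -1 convention for "not found" are choices made for the example.

```python
from bisect import bisect_left


def index_sorted(sorted_list, value):
    """Return the index of value in a sorted list, or -1 if it is absent."""
    i = bisect_left(sorted_list, value)      # binary search, O(log n)
    if i < len(sorted_list) and sorted_list[i] == value:
        return i
    return -1


position = index_sorted([1, 3, 7, 9], 7)
```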

Like: if 7 in a: b = a.index(7)?

@StevenRumbalski: Sets are only an option if you don't need it to be ordered (and hence, have an index). And sets are clearly mentioned in the answer, it just also gives an straightforward answer to the question as OP asked it. I don't think this is worth -1.

Okay, I tried your method in my real code and it takes a bit more time, probably because I need to know the index of the value. With my second method, I check whether it exists and get the index at the same time.

python - Fastest way to check if a value exist in a list - Stack Overf...

python performance list
Rectangle 27 1

The way around it is to redesign your code to fit the AppEngine infrastructure and leverage it to use tons of machines in small batches.

python - app engine DeadlineExceededError for cron jobs and task queue...

python google-app-engine cron wikipedia
Rectangle 27 1

When DeadlineExceededErrors happen, you want the request to eventually succeed if called again. This may require that your crawling state is guaranteed to have made some progress that can be skipped the next time. (Not addressed here)

Parallelized calls can help tremendously.

  • Datastore Query (queries in parallel - asynctools)

Combine the entities being put into a single round-trip call.

# put newNodes and tree at the same time
# (tree is a single entity, so wrap it in a list before concatenating)
db.put(newNodes + [tree])

Pull the TreeNode.gql query out of the loop and into a parallel query tool such as asynctools (http://asynctools.googlecode.com):

if pyNode is not None:

        runner = AsyncMultiTask()
        tasks = []  # (title, task) pairs, so each result keeps its own title
        for child in pyNode:
            title = child.attributes["title"].value
            query = db.GqlQuery("SELECT __key__ FROM TreeNode WHERE name = :1", title)
            task = QueryTask(query, limit=1, client_state=title)
            runner.append(task)
            tasks.append((title, task))

        # kick off the work
        runner.run()

        # peel out the results
        treeNodes = []
        for title, task in tasks:
            task_result = task.get_result() # will raise any exception that occurred for the given query
            treeNodes.append((title, task_result))

        for title, node in treeNodes:
            if node is None:
                # use the title of this query, not the last value of child from the loop above
                newNodes.append(TreeNode(name=title))

            else:
                tree.branches.append(node.key())
        for node in newNodes:
            tree.branches.append(node.key())
            self.log.debug("Node Added: %s" % node.name)

        # put newNodes and tree at the same time (tree is assumed to be a single entity)
        db.put(newNodes + [tree])
        return tree.branches

python - app engine DeadlineExceededError for cron jobs and task queue...

python google-app-engine cron wikipedia
Rectangle 27 1

This happens because one of the pieces contains two or more '=' characters. In that case .split("=") returns a list of three or more elements, which cannot be unpacked into two variables.

You can solve that problem by splitting on at most one '=', which you do by passing an additional parameter to the .split(..) call:

k, v = piece.split("=",1)

But this still does not guarantee that the piece contains an '=' at all.
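If you want to stay with manual splitting, one way to cover the missing '=' case is str.partition, which always returns three parts. The helper name split_pair and the choice of None for a missing value are assumptions for illustration.

```python
def split_pair(piece):
    """Split 'key=value' on the first '='; value is None if there is no '='."""
    key, sep, value = piece.partition("=")
    if not sep:
        # No '=' in the piece at all.
        return key, None
    return key, value

print(split_pair("a=b=c"))  # ('a', 'b=c')
print(split_pair("flag"))   # ('flag', None)
print(split_pair("k="))     # ('k', '')
```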

A more robust option is to let urllib.parse (the urlparse module in Python 2) parse the query string for you:
from urllib.parse import urlparse, parse_qsl

purl = urlparse(url)
quer = parse_qsl(purl.query)

for k,v in quer:
    # ...
    pass

Now that we have decoded the query string into a list of key-value tuples, we can process them separately. I would advise building the URL with urllib as well.
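As a sketch of what building the URL with urllib could look like (Python 3; the helper name set_query_param and the example URL are made up for illustration):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def set_query_param(url, key, value):
    """Return url with the query parameter key set to value."""
    parts = urlparse(url)
    # Keep every existing parameter except the one being replaced.
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != key]
    params.append((key, value))
    # urlencode takes care of escaping, e.g. '=' inside a value becomes %3D.
    return urlunparse(parts._replace(query=urlencode(params)))

print(set_query_param("http://example.com/search?q=a%3Db&page=1", "page", "2"))
# http://example.com/search?q=a%3Db&page=2
```

Because the encoding and decoding are both done by the library, values that contain '=' or '&' round-trip correctly.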

web scraping - Python ValueError: too many values to unpack for crawle...

python web-scraping web-crawler valueerror