
What is the difference between __str__ and __repr__ in Python?

__str__ (read as "dunder (double-underscore) string") and __repr__ (read as "dunder-repper" (for "representation")) are both special methods that return strings based on the state of the object.

One should first write a __repr__ that allows you to reinstantiate an equivalent object from the string it returns, e.g. by passing it to eval or by typing it character-for-character into a Python shell.

At any time later, one can write a __str__ for a user-readable string representation of the instance, when one believes it to be necessary.

If you print an object, or pass it to format, str.format, or str, then the __str__ method is called if it is defined; otherwise, __repr__ is used.

The __repr__ method is called by the built-in function repr, and it is what is echoed by your Python shell when it evaluates an expression that returns an object.

Since it provides a backup for __str__, if you can only write one, start with __repr__.
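
For example, here's a minimal sketch of that fallback in action (Point is a made-up class for illustration, not anything from the question):

>>> class Point(object):
...     def __init__(self, x, y):
...         self.x, self.y = x, y
...     def __repr__(self):
...         return 'Point({0!r}, {1!r})'.format(self.x, self.y)
...
>>> p = Point(1, 2)
>>> p                # the shell echoes __repr__
Point(1, 2)
>>> print(p)         # no __str__ defined, so print falls back to __repr__
Point(1, 2)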

Here's the built-in help on repr:

repr(...)
    repr(object) -> string

    Return the canonical string representation of the object.
    For most object types, eval(repr(object)) == object.

That is, for most objects, if you type in what is printed by repr, you should be able to create an equivalent object.
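
For example, the round trip works out of the box for built-in types:

>>> x = [1, 2, 3]
>>> repr(x)
'[1, 2, 3]'
>>> eval(repr(x)) == x
True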

But this is not the default implementation. The default object __repr__ is (CPython source) something like:

def __repr__(self):
    return '<{0}.{1} object at {2}>'.format(
      self.__module__, type(self).__name__, hex(id(self)))

That means by default you'll print the module the object is from, the class name, and the hexadecimal representation of its location in memory - for example:

<__main__.Foo object at 0x7f80665abdd0>

This information isn't very useful, but there's no way to derive an accurate canonical representation of an arbitrary instance by default, and it's better than nothing: at least it tells us how to uniquely identify the object in memory.

Let's look at how useful it can be, using the Python shell and datetime objects. First we need to import the datetime module:

import datetime

If we call datetime.now in the shell, we'll see everything we need to recreate an equivalent datetime object. This is created by the datetime __repr__:

>>> datetime.datetime.now()
datetime.datetime(2015, 1, 24, 20, 5, 36, 491180)

If we print a datetime object, we see a nice human readable (in fact, ISO) format. This is implemented by datetime's __str__:

>>> print(datetime.datetime.now())
2015-01-24 20:05:44.977951

It is a simple matter to recreate the object we lost (because we didn't assign it to a variable) by copying and pasting from the __repr__ output; printing it then gives the same human-readable output as the other object:

>>> the_past = datetime.datetime(2015, 1, 24, 20, 5, 36, 491180)
>>> print(the_past)
2015-01-24 20:05:36.491180

As you're developing, you'll want to be able to reproduce objects in the same state, if possible. This, for example, is how the datetime object defines __repr__ (Python source). It is fairly complex, because of all of the attributes needed to reproduce such an object:

def __repr__(self):
    """Convert to formal string, for repr()."""
    L = [self._year, self._month, self._day, # These are never zero
         self._hour, self._minute, self._second, self._microsecond]
    if L[-1] == 0:
        del L[-1]
    if L[-1] == 0:
        del L[-1]
    s = ", ".join(map(str, L))
    s = "%s(%s)" % ('datetime.' + self.__class__.__name__, s)
    if self._tzinfo is not None:
        assert s[-1:] == ")"
        s = s[:-1] + ", tzinfo=%r" % self._tzinfo + ")"
    return s

If you want your object to have a more human readable representation, you can implement __str__ next. Here's how the datetime object (Python source) implements __str__, which it easily does because it already has a function to display it in ISO format:

def __str__(self):
    "Convert to string, for str()."
    return self.isoformat(sep=' ')
__repr__ = __str__

Setting __repr__ = __str__ is silly - __repr__ is a fallback for __str__, and a __repr__, written for developer use in debugging, should be written before a __str__.

You need a __str__ only when you need a textual representation of the object.

Define __repr__ for objects you write so you and other developers have a reproducible example when using it as you develop. Define __str__ when you need a human readable string representation of it.

Difference between __str__ and __repr__ in Python - Stack Overflow

In short, what you want to do is not very difficult, as long as you understand the differences between C++ and Python and let each language handle its own side of those differences. The method I have found easiest and safest is to use Python ctypes to define a Python class wrapper for your C++ class, and to define an extern "C" wrapper to bridge your C++ class to the Python class.

The advantages of this approach are that Python can handle all of the memory management, reference counting, etc., while C++ can handle all of the type conversion and error handling. Also, if there are any future changes to the Python C API, you will not need to worry about them; instead you can just focus on what is important: your code.

Compared to wrapping the C++ class directly with the Python C API, this is way, way easier! And this method does not require anything not included with the C++ and Python standard libraries.

Below you will find an arbitrary example, put together mainly from other Stack Overflow posts (cited in the Python wrapper), that I created when I was trying to figure out how to interface Python and C++. The code is heavily commented with details on how each portion is implemented. It is one way to do it.

"""
My C++ & Python ctypes test class.  The following Stack Overflow URLs
either answered my questions as I figured this out, inspired code ideas,
or were just downright informative.  However, there were other useful
pages here and there that I did not record links for.

http://stackoverflow.com/questions/1615813/how-to-use-c-classes-with-ctypes
http://stackoverflow.com/questions/17244756/python-ctypes-wraping-c-class-with-operators
http://stackoverflow.com/questions/19198872/how-do-i-return-objects-from-a-c-function-with-ctypes
"""

# Define imports.
from ctypes import cdll, c_int, c_void_p, c_char_p

# Load the shared library.
lib = cdll.LoadLibrary("MyClass.dll")

# Explicitly define the return types and argument types.
# This helps both clarity and troubleshooting.  Note that
# a 'c_void_p' is passed in the place of the C++ object.
# The object passed by the void pointer will be handled in
# the C++ code itself.
#
# Each one of the below calls is a C function call contained
# within the external shared library.
lib.createClass.restype = c_void_p
lib.deleteClass.argtypes = [c_void_p]

# Note: callAdd takes three pointers (the two operands and the
# result object), matching the C function's signature below.
lib.callAdd.argtypes = [c_void_p, c_void_p, c_void_p]
lib.callAdd.restype  = c_int

lib.callGetID.argtypes = [c_void_p]
lib.callGetID.restype  = c_char_p

lib.callGetValue.argtypes = [c_void_p]
lib.callGetValue.restype  = c_int

lib.callSetID.argtypes = [c_void_p, c_char_p]
lib.callSetID.restype  = c_int

lib.callSetValue.argtypes = [c_void_p, c_int]
lib.callSetValue.restype  = c_int


class MyClass(object):
    """A Python class which wraps around a C++ object.
    The Python class will handle the memory management
    of the C++ object.

    Note that only the default constructor is called for
    the C++ object within the __init__ method.  Once the
    object is created, any instance-specific values are
    set through library function calls.
    """

    def __init__(self, id_str=""):
        """Initialize the C++ class using the default constructor.

        Python strings must be converted to a string of bytes.
        'UTF-8' is used to specify the encoding of the bytes to
        preserve any Unicode characters.  NOTE: this can make
        for unintended side effects in the C++ code.
        """
        self.obj = lib.createClass()

        if id_str != "":
            lib.callSetID(self.obj, bytes(id_str, 'UTF-8'))

    def __del__(self):
        """Allow Python to call the C++ object's destructor."""
        return lib.deleteClass(self.obj)

    def add(self, other):
        """Call the C++ object's 'add' method, returning a new
        instance of MyClass holding the result; self.add(other).
        The result object 'r' is created first so that its C++
        object can receive the sum through the third pointer
        argument of callAdd.
        """
        r = MyClass()
        lib.callAdd(self.obj, other.obj, r.obj)
        return r

    def getID(self):
        """Return the C++ object's ID.
        C char strings must also be converted to Python strings.
        'UTF-8' is the specified format for conversion to
        preserve any Unicode characters.
        """
        return str(lib.callGetID(self.obj), 'utf-8')

    def getValue(self):
        """Return the C++ object's Value."""
        return lib.callGetValue(self.obj)

    def setID(self, id_str):
        """Set the C++ object's ID string.
        Remember that Python strings must be converted to
        C-style char strings.
        """
        return lib.callSetID(self.obj, bytes(id_str, 'utf-8'))

    def setValue(self, n):
        """Set the C++ object's value."""
        return lib.callSetValue(self.obj, n)


if __name__ == "__main__":
    x = MyClass("id_a")
    y = MyClass("id_b")
    z = x.add(y)

    z.setID("id_c")

    print("x.getID = {0}".format(x.getID()))
    print("x.getValue = {0}".format(x.getValue()))
    print()
    print("y.getID = {0}".format(y.getID()))
    print("y.getValue = {0}".format(y.getValue()))
    print()
    print("z.getID = {0}".format(z.getID()))
    print("z.getValue = {0}".format(z.getValue()))

The C++ class & extern C wrapper:

#include <iostream>
#include <new>
#include <string>
using namespace std;

// Manually compile with:
// g++ -O0 -g3 -Wall -c -fmessage-length=0 -o MyClass.o MyClass.cpp
// g++ -shared -o MyClass.dll "MyClass.o"

// Check to see if the platform is a Windows OS.  Note that
// _WIN32 applies to both a 32 bit or 64 bit environment.
// So there is no need to check for _WIN64.
#ifdef _WIN32
// On Windows platforms, declare any function meant to be
// called from an external program with dllexport so that
// it is exported from the shared library.  Otherwise a
// DEF file must be defined to get the correct behaviour.
// (much harder!)
#define DLLEXPORT __declspec(dllexport)
#endif

#ifndef DLLEXPORT
#define DLLEXPORT
#endif

class MyClass {
    // A C++ class solely used to define an object to test
    // Python ctypes compatibility.  In reality this would
    // most likely be implemented as a wrapper around
    // another C++ object, defining a compatible interface
    // between C++ and Python.

public:
    MyClass() : val(42), id("1234567890") {};
    // Notice the next constructor is never called.
    MyClass(string str) : val(42), id(str) {};
    ~MyClass(){};

    int add(const MyClass* b, MyClass* c) {

        // Do not allow exceptions to be thrown.  Instead catch
        // them and tell Python about them, using some sort of
        // error code convention, shared between the C++ code
        // and the Python code.

        try {
            c->val = val + b->val;

            return 0;

        /*
        } catch(ExceptionName e) {
            // Return a specific integer to identify
            // a specific exception was thrown.
            return -99
        */

        } catch(...) {
            // Return an error code to identify if
            // an unknown exception was thrown.
            return -1;
        } // end try
    }; // end method

    string getID() { return id; };
    int getValue() { return val; };

    void setID(string str) { id = str; };
    void setValue(int n) { val = n; };

private:
    int val;
    string id;
}; // end class

extern "C" {
    // All function calls that Python makes need to be made to
    // "C" code in order to avoid C++ name mangling.  A side
    // effect of this is that overloaded C++ constructors must
    // use a separate function call for each constructor that
    // is to be used.  Alternatively a single constructor can
    // be used instead, and then setters can be used to specify
    // any of an object instance specific values.  Which is
    // what was implemented here.

    DLLEXPORT void * createClass(void) {
        // Inside of function call C++ code can still be used.
        return new(std::nothrow) MyClass;
    } // end function

    DLLEXPORT void deleteClass (void *ptr) {
         delete static_cast<MyClass *>(ptr);
    } // end function

    DLLEXPORT int callAdd(void *a, void *b, void *c) {

        // Do not allow exceptions to be thrown.  Instead catch
        // them and tell Python about them.

        try {
            MyClass * x = static_cast<MyClass *>(a);
            MyClass * y = static_cast<MyClass *>(b);
            MyClass * z = static_cast<MyClass *>(c);

            return x->add(y, z);

        /*
        } catch(ExceptionName e) {
            // Return a specific integer to identify
            // a specific exception was thrown.
            return -99
        */

        } catch(...) {
            // Return an error code to identify if
            // an unknown exception was thrown.
            return -1;
        } // end try
    } // end function

    DLLEXPORT const char* callGetID(void *ptr) {

        try {
            MyClass * ref = static_cast<MyClass *>(ptr);

            // A std::string must be converted to its "C" equivalent.
            // Copy it into a static string first: returning c_str()
            // of a function-local string would hand Python a dangling
            // pointer once the local is destroyed at function exit.
            static string temp;
            temp = ref->getID();
            return temp.c_str();

        } catch(...) {
            // Return an error code to identify if
            // an unknown exception was thrown.
            return "-1";
        } // end try
    } // end function

    DLLEXPORT int callGetValue(void *ptr) {

        try {
            MyClass * ref = static_cast<MyClass *>(ptr);
            return ref->getValue();

        } catch(...) {
            // Return an error code to identify if
            // an unknown exception was thrown.
            return -1;
        } // end try
    } // end function

    DLLEXPORT int callSetID(void *ptr, char *str) {

        try {
            MyClass * ref = static_cast<MyClass *>(ptr);

            ref->setID(str);

            return 0;

        } catch(...) {
            // Return an error code to identify if
            // an unknown exception was thrown.
            return -1;
        } // end try
    } // end function

    DLLEXPORT int callSetValue(void *ptr, int n) {

        try {
            MyClass * ref = static_cast<MyClass *>(ptr);

            ref->setValue(n);

            return 0;

        } catch(...) {
            // Return an error code to identify if
            // an unknown exception was thrown.
            return -1;
        } // end try
    } // end function

} // end extern

Note: Trog, unfortunately I do not have a high enough reputation to post comments yet, as I am new to Stack Overflow; otherwise I would have asked first whether Python ctypes was available in your embedded Python environment. In fact, this is my first post.

My compiler is Borland C++ Builder 6, so we're talking about the C++98 standard. Boost stopped supporting this compiler some time ago, so that's why I don't want to introduce Boost. There is actually an article on the web about how to easily extend the Python interpreter with a class, but I forgot where it was. I'll search for it and post it here, as I have a few questions regarding the particular approach proposed there.

How to expose C++ class to Python without using Boost? - Stack Overflow

On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn't hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.

So basically trying to read lines in binary mode is much more difficult because I'm not guaranteed that the EOL character is \n or \r\n or something else?

Difference between parsing a text file in r and rb mode - Stack Overflow

This depends a little bit on what version of Python you're using. In Python 2, Chris Drappier's answer applies.

In Python 3, it's a different (and more consistent) story: in text mode ('r'), Python will parse the file according to the text encoding you give it (or, if you don't give one, a platform-dependent default), and read() will give you a str. In binary ('rb') mode, Python does not assume that the file contains things that can reasonably be parsed as characters, and read() gives you a bytes object.

Also, in Python 3, universal newlines (translating between '\n' and platform-specific newline conventions, so you don't have to care about them) are available for text-mode files on any platform, not just Windows.
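
A short sketch of both points under Python 3 (demo.txt is a made-up file, written here with raw CRLF bytes first):

>>> with open('demo.txt', 'wb') as f:
...     f.write(b'line1\r\nline2\r\n')
...
14
>>> open('demo.txt', 'r').read()    # text mode: a str, newlines normalized
'line1\nline2\n'
>>> open('demo.txt', 'rb').read()   # binary mode: a bytes object, untouched
b'line1\r\nline2\r\n'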

for py3, will reading in text mode automatically try to detect what type of encoding it is? I imagine having to detect encoding is quite a challenge with a bytes object.

@Keikoku Detecting encoding based on a stream alone, without any metadata, is impossible - think about the various encodings that are ASCII + use the 8th bit for information rather than parity; they all share 255 valid one-byte sequences, but only half of them (the ASCII half) represent the same character in each. Python's default isn't to guess it; it's a session-wide default encoding, spelled sys.getdefaultencoding(). On my Py3 install, it's UTF-8, but you can't rely on that always being the case.
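
For reference, checking that session-wide default is a one-liner (the exact value varies by install):

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'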

Difference between parsing a text file in r and rb mode - Stack Overflow

In connection with a project to build an analytics toolbox for our Network Ops guys, I built one of these about two months ago. My employer has no problem if I open-source it, so if anyone is interested I can put it up on my GitHub repo. I assume it's most useful to this group if I build an R package. I won't be able to do that straight away, though, because I need to research the docs on package building with non-R code (it might be as simple as tossing the Python bytecode files in /exec along with a suitable Python runtime, but I have no idea).

I was actually surprised that I needed to undertake a project of this sort. There are at least several excellent open-source and free log file parsers/viewers (including the excellent Webalyzer and AWStats), but neither parses server error logs (parsing server access logs is the primary use case for both).

If you are not familiar with error logs or with the difference between them and access logs: in sum, Apache servers (likewise nginx and IIS) record two distinct logs and store them to disk by default next to each other in the same directory. On Mac OS X, that directory is in /var, just below root:

$> pwd
   /var/log/apache2

$> ls
   access_log   error_log

For network diagnostics, error logs are often far more useful than the access logs. They also happen to be significantly more difficult to process, because of the unstructured nature of the data in many of the fields and, more significantly, because the data file you are left with after parsing is an irregular time series--you might have multiple entries keyed to a single timestamp, with the next entry three seconds later, and so forth.

I wanted an app into which I could toss raw error logs (of any size, but usually several hundred MB at a time) and have something useful come out the other end--which in this case had to be some pre-packaged analytics and also a data cube available inside R for command-line analytics. Given this, I coded the raw-log parser in Python, while the processor (e.g., gridding the parser output to create a regular time series) and all analytics and data visualization I coded in R.
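
The parser itself isn't shown in this answer, but a minimal sketch of that kind of Python parsing step might look like the following (the regex and field names are illustrative assumptions based on the classic Apache 2.2 error_log format, not the actual code):

import re

# One entry per line, e.g.:
# [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied ...
LINE_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<level>[^\]]+)\] "
    r"(?:\[client (?P<client>[^\]]+)\] )?"
    r"(?P<message>.*)")

def parse_error_log(path):
    """Yield a dict of fields for every parseable line."""
    with open(path) as f:
        for line in f:
            m = LINE_RE.match(line)
            if m:
                yield m.groupdict()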

I have been building analytics tools for a long time, but only in the past four years have I been using R. So my first impression--immediately upon parsing a raw log file and loading the data frame in R--was what a pleasure R is to work with and how well suited it is to tasks of this sort. A few welcome surprises:

A couple of examples of the last bullet:

# what are the most common issues that cause an error to be logged?

err_order = function(df){
    # Tabulate the error-cause descriptions
    t0 = xtabs(~Issue_Descr, df)
    m = cbind(names(t0), t0)
    rownames(m) = NULL
    colnames(m) = c("Cause", "Count")
    # Sort by count, descending
    x = as.numeric(m[,2])
    ndx = order(x, decreasing=TRUE)
    m = m[ndx,]
    m1 = data.frame(Cause=m[,1], Count=as.numeric(m[,2]),
                    CountAsProp=100*as.numeric(m[,2])/dim(df)[1])
    # Keep only causes accounting for at least 1% of all entries
    subset(m1, CountAsProp >= 1.)
}

# calling this function, passing in a data frame, returns something like:


                             Cause  Count  CountAsProp
1  'connect to unix://var/ failed'    200         40.0
2   'object buffered to temp file'    185         37.0
3             'connection refused'     94         18.8

Doug, that sounds lovely. I can try to help with the R packaging -- e.g., Python scripts are a non-issue, as other packages come with their own Perl (cf. gdata, which uses a Perl package to read xls files) or Java jars (several packages).

Logfile analysis in R? - Stack Overflow

I have an approach which I think is interesting and a bit different from the rest. The main difference in my approach, compared to some of the others, is in how the image segmentation step is performed--I used the DBSCAN clustering algorithm from Python's scikit-learn; it's optimized for finding somewhat amorphous shapes that may not necessarily have a single clear centroid.

At the top level, my approach is fairly simple and can be broken down into about 3 steps. First I apply a threshold (or actually, the logical "or" of two separate and distinct thresholds). As with many of the other answers, I assumed that the Christmas tree would be one of the brighter objects in the scene, so the first threshold is just a simple monochrome brightness test; any pixels with values above 220 on a 0-255 scale (where black is 0 and white is 255) are saved to a binary black-and-white image. The second threshold tries to look for red and yellow lights, which are particularly prominent in the trees in the upper left and lower right of the six images, and stand out well against the blue-green background which is prevalent in most of the photos. I convert the rgb image to hsv space, and require that the hue is either less than 0.2 on a 0.0-1.0 scale (corresponding roughly to the border between yellow and green) or greater than 0.95 (corresponding to the border between purple and red) and additionally I require bright, saturated colors: saturation and value must both be above 0.7. The results of the two threshold procedures are logically "or"-ed together, and the resulting matrix of black-and-white binary images is shown below:

You can clearly see that each image has one large cluster of pixels roughly corresponding to the location of each tree, plus a few of the images also have some other small clusters corresponding either to lights in the windows of some of the buildings, or to a background scene on the horizon. The next step is to get the computer to recognize that these are separate clusters, and label each pixel correctly with a cluster membership ID number.

For this task I chose DBSCAN. There is a pretty good visual comparison of how DBSCAN typically behaves, relative to other clustering algorithms, available here. As I said earlier, it does well with amorphous shapes. The output of DBSCAN, with each cluster plotted in a different color, is shown here:

There are a few things to be aware of when looking at this result. First is that DBSCAN requires the user to set a "proximity" parameter in order to regulate its behavior, which effectively controls how separated a pair of points must be in order for the algorithm to declare a new separate cluster rather than agglomerating a test point onto an already pre-existing cluster. I set this value to be 0.04 times the size along the diagonal of each image. Since the images vary in size from roughly VGA up to about HD 1080, this type of scale-relative definition is critical.
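
To make the scale-relative setting concrete, here is a small sketch of the arithmetic for one hypothetical image size (the actual computation appears inside findtree below):

from math import sqrt

height, width = 1080, 1920       # one hypothetical full-HD image
proxthresh = 0.04                # fraction of the image diagonal
eps = proxthresh * sqrt(height**2 + width**2)
print(eps)                       # ~88.1 pixels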

Another point worth noting is that the DBSCAN algorithm as it is implemented in scikit-learn has memory limits which are fairly challenging for some of the larger images in this sample. Therefore, for a few of the larger images, I actually had to "decimate" (i.e., retain only every 3rd or 4th pixel and drop the others) each cluster in order to stay within this limit. As a result of this culling process, the remaining individual sparse pixels are difficult to see on some of the larger images. Therefore, for display purposes only, the color-coded pixels in the above images have been effectively "dilated" just slightly so that they stand out better. It's purely a cosmetic operation for the sake of the narrative; although there are comments mentioning this dilation in my code, rest assured that it has nothing to do with any calculations that actually matter.

Once the clusters are identified and labeled, the third and final step is easy: I simply take the largest cluster in each image (in this case, I chose to measure "size" in terms of the total number of member pixels, although one could have just as easily instead used some type of metric that gauges physical extent) and compute the convex hull for that cluster. The convex hull then becomes the tree border. The six convex hulls computed via this method are shown below in red:
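
As a standalone illustration of that last step, here is a small sketch of the scipy convex hull call on random points (in the answer, the points are instead the pixel coordinates of the largest DBSCAN cluster):

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.RandomState(0)
points = rng.rand(100, 2)        # stand-in for the largest cluster's pixels
hull = ConvexHull(points)
# hull.simplices indexes the point pairs forming the boundary segments
border = [points[simplex] for simplex in hull.simplices]
print("{0} border segments".format(len(border)))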

The source code is written for Python 2.7.6 and it depends on numpy, scipy, matplotlib and scikit-learn. I've divided it into two parts. The first part is responsible for the actual image processing:

from PIL import Image
import numpy as np
import scipy as sp
import matplotlib.colors as colors
from sklearn.cluster import DBSCAN
from math import ceil, sqrt

"""
Inputs:

    rgbimg:         [M,N,3] numpy array containing (uint, 0-255) color image

    hueleftthr:     Scalar constant to select maximum allowed hue in the
                    yellow-green region

    huerightthr:    Scalar constant to select minimum allowed hue in the
                    blue-purple region

    satthr:         Scalar constant to select minimum allowed saturation

    valthr:         Scalar constant to select minimum allowed value

    monothr:        Scalar constant to select minimum allowed monochrome
                    brightness

    maxpoints:      Scalar constant maximum number of pixels to forward to
                    the DBSCAN clustering algorithm

    proxthresh:     Proximity threshold to use for DBSCAN, as a fraction of
                    the diagonal size of the image

Outputs:

    borderseg:      [K,2,2] Nested list containing K pairs of x- and y- pixel
                    values for drawing the tree border

    X:              [P,2] List of pixels that passed the threshold step

    labels:         [Q,2] List of cluster labels for points in Xslice (see
                    below)

    Xslice:         [Q,2] Reduced list of pixels to be passed to DBSCAN

"""

def findtree(rgbimg, hueleftthr=0.2, huerightthr=0.95, satthr=0.7, 
             valthr=0.7, monothr=220, maxpoints=5000, proxthresh=0.04):

    # Convert rgb image to monochrome for the brightness threshold
    gryimg = np.asarray(Image.fromarray(rgbimg).convert('L'))
    # Convert rgb image (uint, 0-255) to hsv (float, 0.0-1.0)
    hsvimg = colors.rgb_to_hsv(rgbimg.astype(float)/255)

    # Initialize binary thresholded image
    binimg = np.zeros((rgbimg.shape[0], rgbimg.shape[1]))
    # Find pixels with hue<0.2 or hue>0.95 (red or yellow) and saturation/value
    # both greater than 0.7 (saturated and bright)--tends to coincide with
    # ornamental lights on trees in some of the images
    boolidx = np.logical_and(
                np.logical_and(
                  np.logical_or((hsvimg[:,:,0] < hueleftthr),
                                (hsvimg[:,:,0] > huerightthr)),
                                (hsvimg[:,:,1] > satthr)),
                                (hsvimg[:,:,2] > valthr))
    # Find pixels that meet hsv criterion
    binimg[np.where(boolidx)] = 255
    # Add pixels that meet grayscale brightness criterion
    binimg[np.where(gryimg > monothr)] = 255

    # Prepare thresholded points for DBSCAN clustering algorithm
    X = np.transpose(np.where(binimg == 255))
    Xslice = X
    nsample = len(Xslice)
    if nsample > maxpoints:
        # Make sure number of points does not exceed DBSCAN maximum capacity
        Xslice = X[range(0,nsample,int(ceil(float(nsample)/maxpoints)))]

    # Translate DBSCAN proximity threshold to units of pixels and run DBSCAN
    pixproxthr = proxthresh * sqrt(binimg.shape[0]**2 + binimg.shape[1]**2)
    db = DBSCAN(eps=pixproxthr, min_samples=10).fit(Xslice)
    labels = db.labels_.astype(int)

    # Find the largest cluster (i.e., with most points) and obtain convex hull   
    unique_labels = set(labels)
    maxclustpt = 0
    for k in unique_labels:
        class_members = [index[0] for index in np.argwhere(labels == k)]
        if len(class_members) > maxclustpt:
            points = Xslice[class_members]
            hull = sp.spatial.ConvexHull(points)
            maxclustpt = len(class_members)
            borderseg = [[points[simplex,0], points[simplex,1]] for simplex
                          in hull.simplices]

    return borderseg, X, labels, Xslice

and the second part is a user-level script which calls the first file and generates all of the plots above:

#!/usr/bin/env python

from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from findtree import findtree

# Image files to process
fname = ['nmzwj.png', 'aVZhC.png', '2K9EF.png',
         'YowlH.png', '2y4o5.png', 'FWhSP.png']

# Initialize figures
fgsz = (16,7)        
figthresh = plt.figure(figsize=fgsz, facecolor='w')
figclust  = plt.figure(figsize=fgsz, facecolor='w')
figcltwo  = plt.figure(figsize=fgsz, facecolor='w')
figborder = plt.figure(figsize=fgsz, facecolor='w')
figthresh.canvas.set_window_title('Thresholded HSV and Monochrome Brightness')
figclust.canvas.set_window_title('DBSCAN Clusters (Raw Pixel Output)')
figcltwo.canvas.set_window_title('DBSCAN Clusters (Slightly Dilated for Display)')
figborder.canvas.set_window_title('Trees with Borders')

for ii, name in zip(range(len(fname)), fname):
    # Open the file and convert to rgb image
    rgbimg = np.asarray(Image.open(name))

    # Get the tree borders as well as a bunch of other intermediate values
    # that will be used to illustrate how the algorithm works
    borderseg, X, labels, Xslice = findtree(rgbimg)

    # Display thresholded images
    axthresh = figthresh.add_subplot(2,3,ii+1)
    axthresh.set_xticks([])
    axthresh.set_yticks([])
    binimg = np.zeros((rgbimg.shape[0], rgbimg.shape[1]))
    for v, h in X:
        binimg[v,h] = 255
    axthresh.imshow(binimg, interpolation='nearest', cmap='Greys')

    # Display color-coded clusters
    axclust = figclust.add_subplot(2,3,ii+1) # Raw version
    axclust.set_xticks([])
    axclust.set_yticks([])
    axcltwo = figcltwo.add_subplot(2,3,ii+1) # Dilated slightly for display only
    axcltwo.set_xticks([])
    axcltwo.set_yticks([])
    axcltwo.imshow(binimg, interpolation='nearest', cmap='Greys')
    clustimg = np.ones(rgbimg.shape)    
    unique_labels = set(labels)
    # Generate a unique color for each cluster 
    plcol = cm.rainbow_r(np.linspace(0, 1, len(unique_labels)))
    for lbl, pix in zip(labels, Xslice):
        for col, unqlbl in zip(plcol, unique_labels):
            if lbl == unqlbl:
                # Cluster label of -1 indicates no cluster membership;
                # override default color with black
                if lbl == -1:
                    col = [0.0, 0.0, 0.0, 1.0]
                # Raw version
                for ij in range(3):
                    clustimg[pix[0],pix[1],ij] = col[ij]
                # Dilated just for display
                axcltwo.plot(pix[1], pix[0], 'o', markerfacecolor=col, 
                    markersize=1, markeredgecolor=col)
    axclust.imshow(clustimg)
    axcltwo.set_xlim(0, binimg.shape[1]-1)
    axcltwo.set_ylim(binimg.shape[0], -1)

    # Plot original images with red borders around the trees
    axborder = figborder.add_subplot(2,3,ii+1)
    axborder.set_axis_off()
    axborder.imshow(rgbimg, interpolation='nearest')
    for vseg, hseg in borderseg:
        axborder.plot(hseg, vseg, 'r-', lw=3)
    axborder.set_xlim(0, binimg.shape[1]-1)
    axborder.set_ylim(binimg.shape[0], -1)

plt.show()

@lennon310 's solution is clustering. (k-means)

@user3054997: That's true, but take a close look at his final result--his bounding contours are highly convoluted, and three of the figures (3, 5, and 6, by his own admission) also have lots of extra tiny little disconnected contours inside of the main boundary. We may have both used a clustering algorithm at some point during our procedure, but the final result, and the path we took to get to it, aren't even remotely similar!!! FWIW, I'd argue that my result is the better one of the two, because my boundary shapes much more closely resemble what an actual human being would probably draw.

@stachyra I also thought about this approach before proposing my simpler ones. I think this has a great potential to be extended and generalized to produce good results in other cases also. You could experiment with neural nets for clustering. Something like a SOM or neural gas would do excellent work. Nevertheless, great proposal and thumbs up from me!

@Faust & Ryan Carlson: thanks, guys! Yes, I agree that the upvote system, while it works well for adjudicating between 2 or 3 short answers all submitted within a few hours of each other, has serious biases when it comes to contests with long answers that play out over extended periods of time. For one thing, early submissions begin accumulating upvotes before later ones are even available for public review. And if answers are all lengthy, then as soon as one establishes a modest lead, there is often a "bandwagon effect" as people only upvote the first one without bothering to read the rest.

@stachyra great news friend! Warmest congrats and may this mark a beginning for your new year!

How to detect a Christmas Tree? - Stack Overflow
Rectangle 27 1

A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.

So in Python 3.x

bytes (b'...' literals) = a sequence of octets (integers between 0 and 255)

The b isn't really part of the data; it only shows up when the bytes object is printed. This should not cause any problem anywhere.

>>> print(out)
b'hello,Python!'
>>> out.decode('utf-8')
'hello,Python!'
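
To see the same round trip without TensorFlow in the picture, here's a minimal sketch (plain Python 3; the string is just the one from the question):

s = 'hello,Python!'
b = s.encode('utf-8')     # str -> bytes
print(b)                  # b'hello,Python!' -- the b marks a bytes object
print(b.decode('utf-8'))  # hello,Python! -- back to str, no b in sight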

Why the result prints b'hello,Python!' ,when I use tensorflow? - Stack...

python tensorflow
Rectangle 27 2

For clarification and to answer Agostino's comment/question (I don't have sufficient reputation to comment so bear with me stating this as an answer...):

In Python 2, no line-end modification happens in either text or binary mode. As has been stated before, Chris Drappier's answer applies to Python 2 (note that its link nowadays points to the 3.x Python docs, but Chris' quoted text is of course from the Python 2 input and output tutorial).

So no, it is not true that opening a file in text mode with Python 2 on non-Windows does any line end modification:

0 $ cat data.txt 
line1
line2
line3
0 $ file data.txt 
data.txt: ASCII text, with CRLF line terminators
0 $ python2.7 -c 'f = open("data.txt"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "r"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "rb"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']

It is however possible to open the file in universal newline mode in Python 2, which does perform exactly this line-end modification:

0 $ python2.7 -c 'f = open("data.txt", "rU"); print f.readlines()'
['line1\n', 'line2\n', 'line3\n']

(the universal newline mode specifier is deprecated as of Python 3.x)

On Python 3, on the other hand, platform-specific line ends do get normalized to '\n' when reading a file in text mode, and '\n' gets converted to the current platform's default line end when writing in text mode (in addition to the bytes<->unicode<->bytes decoding/encoding going on in text mode). E.g. reading a DOS/Windows CRLF-line-ended file on Linux will normalize the line ends to '\n'.

Python 3's open function has a newline parameter to control that if required (docs.python.org/3/library/functions.html#open): "newline controls how universal newlines mode works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. It works as follows: When reading input from the stream, if newline is None, universal newlines mode is enabled"
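
Here's a minimal sketch of both behaviors (Python 3, standard library only; the file name is made up for the demo):

with open('demo.txt', 'wb') as f:
    f.write(b'line1\r\nline2\r\n')        # write DOS/Windows line endings

with open('demo.txt') as f:               # text mode, newline=None
    print(f.readlines())                  # ['line1\n', 'line2\n'] -- normalized

with open('demo.txt', newline='') as f:   # newline='' disables translation
    print(f.readlines())                  # ['line1\r\n', 'line2\r\n'] -- untouched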

python - Difference between parsing a text file in r and rb mode - Sta...

python file-io text-parsing
Rectangle 27 5

from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):

    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

First we need a cross join with merge, then remove rows with the same values in city_x and city_y by boolean indexing:

import pandas as pd

df['tmp'] = 1
df = pd.merge(df,df,on='tmp')
df = df[df.city_x != df.city_y]
print (df)
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566
df['dist'] = df.apply(lambda row: haversine(row['lng_x'], 
                                            row['lat_x'], 
                                            row['lng_y'], 
                                            row['lat_y']), axis=1)
df = df[df.dist < 500]
print (df)
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y        dist
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566   27.215704
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534  255.223782
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053   27.215704
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534  242.464120
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053  255.223782
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566  242.464120

And last, create a list or get the size with groupby:

df1 = df.groupby('city_x')['city_y'].apply(list)
print (df1)
city_x
Berlin     [Potsdam, Hamburg]
Hamburg     [Berlin, Potsdam]
Potsdam     [Berlin, Hamburg]
Name: city_y, dtype: object

df2 = df.groupby('city_x')['city_y'].size()
print (df2)
city_x
Berlin     2
Hamburg    2
Potsdam    2
dtype: int64
NumPy haversine solution:
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

df['tmp'] = 1
df = pd.merge(df,df,on='tmp')
df = df[df.city_x != df.city_y]
#print (df)

df['dist'] = haversine_np(df['lng_x'],df['lat_x'],df['lng_y'],df['lat_y'])
print (df)
    city_x     lat_x     lng_x  tmp   city_y     lat_y     lng_y        dist
1   Berlin  52.52437  13.41053    1  Potsdam  52.39886  13.06566   27.198616
2   Berlin  52.52437  13.41053    1  Hamburg  53.57532  10.01534  255.063541
3  Potsdam  52.39886  13.06566    1   Berlin  52.52437  13.41053   27.198616
5  Potsdam  52.39886  13.06566    1  Hamburg  53.57532  10.01534  242.311890
6  Hamburg  53.57532  10.01534    1   Berlin  52.52437  13.41053  255.063541
7  Hamburg  53.57532  10.01534    1  Potsdam  52.39886  13.06566  242.311890

Nice starting point. But imagine the dataframe contains millions of locations. Is there a way to apply the 500km-haversine during the join? So that not each item is joined with every other item but only with those who are in the desired range.

I think that is the problem, because we don't know the distance before applying the haversine function. :(

where did you get the radians variable from? inside your haversine function? ah, found it in math.radians

It's from the math module:

from math import radians, cos, sin, asin, sqrt

I realized that there is a conditional join in PySpark but unfortunately not in Pandas. I will try that on Monday as soon as I have access to a Spark cluster. Let you know then!
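
For the scaling question above, one way to avoid the full cross join (not part of the original answer; this sketch assumes scikit-learn is available and reuses the example's city/lat/lng columns) is a BallTree with the haversine metric, which queries neighbors within a radius directly:

import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

df = pd.DataFrame({'city': ['Berlin', 'Potsdam', 'Hamburg'],
                   'lat': [52.52437, 52.39886, 53.57532],
                   'lng': [13.41053, 13.06566, 10.01534]})

coords = np.radians(df[['lat', 'lng']].values)  # haversine metric wants radians
tree = BallTree(coords, metric='haversine')
# query_radius takes the radius in radians: distance / earth radius
ind = tree.query_radius(coords, r=500.0 / 6371.0)
# drop each city from its own neighbour list
result = {df['city'].iloc[i]: [df['city'].iloc[j] for j in row if j != i]
          for i, row in enumerate(ind)}
print(result)  # {'Berlin': ['Potsdam', 'Hamburg'], ...}

This avoids materializing every pair, so it should scale far better for millions of points than the full merge.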

python - Pandas Dataframe: join items in range based on their geo coor...

python pandas latitude-longitude haversine
Rectangle 27 4

The safe way to alter a file (with the exception of appending, which can be safely done in-place) is to copy it with modification to a new file, remove the old one, rename the new like the old. This is the one solid way to avoid catastrophic errors and data loss. Depending on the platform, the step to "remove old, rename new" can be atomic, but that's hard in Windows and not all that crucial.

So I'd simply do that -- in one big gulp, unless the file is horribly huge (gigabyte-plus):

import os

with open(filename, 'rb') as f:
  data = f.read()
with open(newfilename, 'wb') as f:
  f.write(data.replace('\r\r\n', '\r\n'))
os.unlink(filename)
os.rename(newfilename, filename)

The problems with your code stem from confusion between binary and text mode -- you can't properly "read a line" from a file opened in binary mode, for example.

Edit: in Python 3.1 we need to deal with bytes instances here, not strings, since the files are binary. So, per the docs, the write call must become

f.write(data.replace(b'\r\r\n', b'\r\n'))

those b prefixes tell Python we're dealing with bytes, not strings.
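
Putting the pieces together for Python 3, here's a sketch (the file names are placeholders, as in the original):

import os

filename = 'data.bin'              # placeholder name
newfilename = filename + '.tmp'    # placeholder name

with open(filename, 'rb') as f:
    data = f.read()                            # bytes: the file is binary
with open(newfilename, 'wb') as f:
    f.write(data.replace(b'\r\r\n', b'\r\n'))  # bytes patterns for bytes data
os.unlink(filename)
os.rename(newfilename, filename)

On Python 3.3+, the last two lines can be collapsed into os.replace(newfilename, filename), which overwrites the destination in one call, even on Windows.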

I just tried this but getting this error: "TypeError: expected an object with the buffer interface" on this line: "f.write(data.replace('\r\r\n', '\r\n'))"

@TMC, you should have mentioned you are using Python3 ;)

Ah, Python 3.1 -- I noticed it just now in your question's body (there's a specific tag for it, since so often the proper answers differ drastically between the 2.5/2.6 that almost everybody is using, and the newer 3.1). The solution is at: docs.python.org/3.1/library/ -- let me edit the answer to clarify.

@gnibbler, he did (in a parenthesis hiding at the end of the first paragraph), just not prominently enough for me to notice (i.e. ideally as a tag ;-). I've now edited the answer to show the tiny change needed for Python 3 purposes.

Awesome, with that edit it worked. Thanks!

Using Python to replace "\r\r\n" with "\r\n" in a binary file - Stack ...

python-3.x
Rectangle 27 5

I think this is important to consider for cross-platform execution, i.e. as a CYA. :)

On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn't hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
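
A minimal sketch of why this matters (the file name is made up for the demo):

payload = b'\x00\x01\n\x02'        # binary data that happens to contain \n
with open('blob.bin', 'wb') as f:  # 'b': bytes pass through unmodified
    f.write(payload)
with open('blob.bin', 'rb') as f:
    assert f.read() == payload     # exact round trip on every platform
# In Python 2 text mode on Windows, writing this payload would turn the
# embedded \n into \r\n on disk, corrupting it -- the warning quoted above.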

python open built-in function: difference between modes a, a+, w, w+, ...

python
Rectangle 27 4

from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

center_point = [{'lat': -7.7940023, 'lng': 110.3656535}]
test_point = [{'lat': -7.79457, 'lng': 110.36563}]

lat1 = center_point[0]['lat']
lon1 = center_point[0]['lng']
lat2 = test_point[0]['lat']
lon2 = test_point[0]['lng']

radius = 1.00 # in kilometer

a = haversine(lon1, lat1, lon2, lat2)

print('Distance (km) : ', a)
if a <= radius:
    print('Inside the area')
else:
    print('Outside the area')

location - How to check if coordinate inside certain area Python - Sta...

python location coordinates latitude-longitude area
Rectangle 27 172

Here's a Python version:

from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r
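
For example, using the Berlin and Hamburg coordinates that appear elsewhere on this page (note that longitude comes before latitude in the argument list):

# Berlin (lat 52.52437, lng 13.41053) to Hamburg (lat 53.57532, lng 10.01534)
print(haversine(13.41053, 52.52437, 10.01534, 53.57532))  # ~255.2 km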

Could use math.radians() function instead of multiplying by pi/180 - same effect, but a bit more self-documenting.

You can, but if you say import math then you have to specify math.pi, math.sin etc. With from math import * you get direct access to all the module contents. Check out "namespaces" in a python tutorial (such as docs.python.org/tutorial/modules.html)

How come you use atan2(sqrt(a), sqrt(1-a)) instead of just asin(sqrt(a))? Is atan2 more accurate in this case?

should be float division to cover really rare corner case of dlat|dlon being integers: a = sin(dlat/2.)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.)**2

Good point, I hadn't thought of that. But in this case it's OK, since radians returns a float: radians(degrees(1)) gives 1.0.

Haversine Formula in Python (Bearing and Distance between two GPS poin...

python gps distance haversine bearing