
Web analysis framework

“A science is as mature as its measurement tools.” (Louis Pasteur)

Chapter 2 highlighted success factors of multi-channel retailing and emphasized the importance of privacy protection on the Internet. The results motivate our further work on success measurement in Web retailing and on the protection of consumer privacy.

This chapter introduces an analysis framework for measuring online success on multi-channel and Internet-only sites. Our analysis framework will propose five categories of business analyses that aim at measuring notions of online success.

The analysis results are particularly useful for customer relationship management and personalization, which will be discussed in more detail in Chapter 5.

This chapter is organized as follows. Section 3.1 presents the data used to test our analysis framework. Section 3.2 introduces a terminology of business analyses and presents the five analysis categories that constitute our analysis framework. Section 3.3 presents a set of service analyses for Web sites of multi-channel retailers. Based on a systematic distinction of service options in multi-channel and Internet-only retailing, we derive analyses measuring online consumers’ service preferences in multi-channel retailing. Section 3.4 proposes a set of Web analyses measuring conversion success. We formalize existing conversion metrics that have so far been described only informally, and we propose new metrics measuring conversion success in a multi-channel context. Section 3.5 extends the analysis of purchase sessions by using a clustering approach, which provides detailed insight into customers’ usage patterns. The analysis is based on a combination of Web usage and Web user data. Section 3.6 presents analyses for consumer segmentation based on demographic and order data and proposes segmentation approaches indicating a customer’s value to a company. Concentration indices are introduced and an index of customer value is presented. Section 3.7 presents an approach to measuring success on information Web sites. A mining template for modeling behavioral strategies as sequences of tasks is introduced.

The proposed analyses of Sections 3.3-3.6 are applied to Web user and usage data from the multi-channel retailer presented in Section 2.3.1. Results of Section 3.7 are presented based on data from an information Web site.

3.1  Data

This section presents the data used for the empirical testing of the Web analysis framework for Internet-only and multi-channel Web sites.



Web site owners can collect two types of consumer data: actively divulged Web user data and passively transmitted Web usage data. Consumers actively divulge user data when they send information to a Web site for billing purposes, to register or request information. Visitors passively transmit usage data by leaving traces registered with the Web site server.

Data from two Web sites were used to calculate, by way of example, the metrics and analytics proposed in our Web analysis framework. For the multi-channel retailer introduced in Section 2.3.1 we analyzed 92,467 sessions taken from a period of 21 days in 2002, and transaction information of 13,653 customers who conducted 14,957 online purchases over a period of 8 months in 2001/02. From an information Web site, we analyzed a reference set of 27,647 user sessions.

Sections 3.1.1 and 3.1.2 present the structure and terminology of Web user and usage data. The data model of the multi-channel retailer is also presented.

3.1.1  Web usage data

Server logging is based on a protocol component that registers requests to a World Wide Web (WWW) server. These server requests can be initiated by a user who visits a Web site consisting of many Web pages. Each Web page is composed of constituent objects such as body text, images or video files, each of which counts as a hit when invoked. Thus, each page a user views (page view) comprises many hits at the server. A clickstream is a time-ordered list of page views. A user session is the set of a user’s server requests to one or more Web servers. Sessions are also referred to as visits [Monticino, 1998].

A standard format for logging server requests has been established by the World Wide Web Consortium [W3C, 1995].

The following log entry, taken from the multi-channel retailer’s Web server, exemplifies the main parts of an access log in the Extended Log File Format.

Figure 3-1: Simplified log entry from the cooperation partner

The first part of the log entry is the remote host address (Internet Protocol (IP) address), which can be used to identify a visitor’s computer or device. The IP address is 32 bits long and is written in dotted decimal notation, in which each of its four bytes (8 bits each) is shown as a decimal number. It can be translated to a domain name via the Domain Name System (DNS). The first part of the IP address identifies the user’s network (e.g. 141.20 is the network of the Computer Science department at Humboldt Universität zu Berlin) and may reveal information about the network owner. In this example, the last two bytes of the IP address specify the host (end-system) within the network; host addresses are assigned (uniquely or dynamically) to a computer.

The DNS can be used to determine a user’s geographic location [Lamm, et al., 1996]. Software vendors claim that they can link IP addresses to geographic locations with an accuracy of 98% for country, 70% for regional, and 65% for city level [Melissa Data, 2004; Olsen, 2000]. A source of inaccuracy for geographic localization is the use of proxy servers, which only reveal the location of the proxy server but not the location of the user.

The log file also contains the remote login name and user authentication of the user if the site requires logins to access a Web server. Moreover, it contains the date and time of a user request, the file name (e.g. of a Web page, picture, document), the number of bytes transferred and the method the client used to retrieve a file from the server (typically GET). The HyperText Transfer Protocol (HTTP) response code (status code) indicates the success or failure of the file transfer. The referrer indicates the Uniform Resource Locator (URL) of the previous page request and the user agent indicates the browser type and version the client claims to be using. If a site offers active program components, information about a user’s JavaScript availability, installed plug-ins or screen resolution can be collected.
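The byte structure of the address can be illustrated with a minimal Python sketch. The function name `split_ipv4` and the 16-bit network prefix are assumptions made for this illustration; the prefix merely mirrors the two-byte 141.20 example and is not a general rule.

```python
import ipaddress


def split_ipv4(address, prefix_length=16):
    """Return (network part, host part) of an IPv4 address as integers.

    prefix_length=16 is an assumption that matches the two-byte network
    prefix of the 141.20 example; other networks use other prefix lengths.
    """
    value = int(ipaddress.IPv4Address(address))   # the full 32-bit number
    host_bits = 32 - prefix_length
    network_part = value >> host_bits             # leading prefix_length bits
    host_part = value & ((1 << host_bits) - 1)    # remaining host bits
    return network_part, host_part


# Each byte of the dotted decimal form is one 8-bit number.
octets = list(ipaddress.IPv4Address("141.20.102.189").packed)
print(octets)  # [141, 20, 102, 189]

network_part, host_part = split_ipv4("141.20.102.189")
print(network_part, host_part)
```

The address used here is the one that appears in the session sample of Table 3-1.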

When the user leaves name, address or other identifying information on a Web site (e.g. in registration or purchase forms) a unique identification can be assigned to the log file to combine personal information and the respective clickstream.

Session identifications (session ids), cookies or IP addresses can be used to identify and reconstruct a user session. The process of reconstructing the activity log into sessions is referred to as sessionizing [Berendt, et al., 2001].

Cookies are small text files stored on a user’s hard drive and can be used to recognize users in later sessions. Session ids can be stored in transient cookies, which are kept only for the duration of a single session, or embedded in the URL. However, users can delete cookies. A recent study claims that 55% of all cookies become unusable each month [Fiutak, 2004]. Further, the use of cookies can have privacy implications, which will be discussed in Section 5.3.

Table 3-1 shows a simplified session sample from the multi-channel retailer’s Web site. Sessions were determined by the use of session ids, which are available in the log file.



Table 3-1: Session sample from the multi-channel retailer

141.20.102.189 - - [04/Jun/2002:14:36:12 +0200] "GET Home HTTP/1.0 SessionID bhApYI6N" 200 6500 "http://www.google.de/search?q=e-shop" "Java1.2.2" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

141.20.102.189 - - [04/Jun/2002:14:37:24 +0200] "GET Browse_Catalog Catalog ID 7n66hz3 HTTP/1.0 SessionID bhApYI6N" 200 759 "http://www.e-shop.de/home" "Java1.2.2" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

141.20.102.189 - - [04/Jun/2002:14:42:54 +0200] "GET View_Product Product ID 19453 HTTP/1.0 SessionID bhApYI6N " 200 759 "http://www.e-shop.de/Browse_Catalog CatalogID 7n66hz3" "Java1.2.2" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

141.20.102.189 - - [04/Jun/2002:14:53:21 +0200] "GET BasketForm PaymentTransactionID 3dNC4KHg PlacedOrderID 3d4rEKHgFoT http://www.e-shop.de/ViewBasket PaymentTransactionID 3dNC4KHg HTTP/1.0 SessionID bhApYI6N" 200 7258 "-" "Java1.2.2" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

The time stamps of subsequent page requests can be used to derive users’ view times per page. Some requested files of the multi-channel retailer contain a catalog_id indicating a specific catalog category, a product_id indicating a product, a transaction_id representing the invocation of the transaction phase, an order_id denoting a purchase and a payment_id indicating the chosen payment method.
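As an illustration of how such entries can be grouped into sessions and how the time between requests can be derived, consider the following minimal Python sketch. The regular expressions and the function names (`parse_entry`, `sessionize`, `seconds_between_requests`) are assumptions tailored to the simplified entries of Table 3-1; they are not the retailer's actual preprocessing.

```python
import re
from collections import defaultdict
from datetime import datetime

# Pattern for the simplified entries of Table 3-1 (an assumption; real
# logs may order or name their fields differently).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)
SESSION_PATTERN = re.compile(r'SessionID (\S+)')


def parse_entry(line):
    """Return (session_id, timestamp, request), or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    session = SESSION_PATTERN.search(match.group("request"))
    if session is None:
        return None
    # %b expects English month abbreviations (default C locale).
    timestamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return session.group(1), timestamp, match.group("request")


def sessionize(lines):
    """Group parsed requests by session id, ordered by time stamp."""
    sessions = defaultdict(list)
    for line in lines:
        entry = parse_entry(line)
        if entry is not None:
            session_id, timestamp, request = entry
            sessions[session_id].append((timestamp, request))
    for requests in sessions.values():
        requests.sort(key=lambda item: item[0])
    return dict(sessions)


def seconds_between_requests(requests):
    """Time differences between consecutive requests of one session."""
    return [
        (later[0] - earlier[0]).total_seconds()
        for earlier, later in zip(requests, requests[1:])
    ]


sample = [
    '141.20.102.189 - - [04/Jun/2002:14:36:12 +0200] "GET Home HTTP/1.0 SessionID bhApYI6N" 200 6500 "http://www.google.de/search?q=e-shop" "Java1.2.2" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"',
    '141.20.102.189 - - [04/Jun/2002:14:37:24 +0200] "GET Browse_Catalog Catalog ID 7n66hz3 HTTP/1.0 SessionID bhApYI6N" 200 759 "http://www.e-shop.de/home" "Java1.2.2" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"',
]
sessions = sessionize(sample)
gaps = seconds_between_requests(sessions["bhApYI6N"])
print(gaps)  # [72.0]
```

The time differences are only an upper bound on actual view times, because caching and idle periods are not visible in the log.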

Before the data is stored for analysis purposes, the typical data cleaning steps in Web mining such as robot removal need to be performed. We abstained from analyzing page view times as reconstructing view times is subject to significant inaccuracies [Berendt, et al., 2001].

Several technical problems may complicate the use and processing of log files. In particular, caching, the use of proxy servers, dynamic IP addresses and the use of a device by several people pose a challenge to session reconstruction and user identification [Berendt, et al., 2001; Büchner, et al., 1999; Cooley, et al., 1999; Spiliopoulou, et al., 2003; Wilde, 2003].

For the analysis of user behavior it is beneficial to codify page requests as session vectors. Given a set of $n$ pageviews, $P = \{p_1, p_2, \ldots, p_n\}$, and a set of $m$ user sessions, $S = \{s_1, s_2, \ldots, s_m\}$, where each $s_i \in S$ is a subset of $P$, each user session $s$ can be regarded as a vector over the $n$-dimensional space of pageviews. The session vector is given by $\vec{s} = \langle w^s_{p_1}, w^s_{p_2}, \ldots, w^s_{p_n} \rangle$, where $w(p^s_i)$ is the weight associated with pageview $p^s_i$ in the session $s$, representing its significance, and where each $w^s_{p_j} = w(p^s_i)$ for some $i \in \{1, \ldots, n\}$ in case $p_j$ appears in the session $s$, and $w^s_{p_j} = 0$ otherwise. Usually, but not exclusively, the weight is based on the number of pages visited or on page view time [Dai and Mobasher, 2003]. Thus, conceptually, the set of all user sessions can be viewed as an $m \times n$ session-pageview matrix.
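The construction of the session-pageview matrix can be sketched in a few lines of Python. The function name, the default binary weight (1 if a pageview occurs in the session, 0 otherwise) and the second sample session are assumptions for illustration; the page names are taken from Table 3-1.

```python
def session_pageview_matrix(pageviews, sessions, weight=None):
    """Build the m x n matrix whose rows are session vectors.

    pageviews: list of n distinct pageview identifiers (fixes the column order).
    sessions:  list of m sessions, each a list of requested pageviews.
    weight:    optional function (pageview, session) -> number; the default
               binary weight is an assumption made for this sketch.
    """
    if weight is None:
        weight = lambda page, session: 1
    matrix = []
    for session in sessions:
        visited = set(session)
        row = [weight(page, session) if page in visited else 0
               for page in pageviews]
        matrix.append(row)
    return matrix


pages = ["Home", "Browse_Catalog", "View_Product", "BasketForm"]
sessions = [
    ["Home", "Browse_Catalog", "View_Product", "BasketForm"],
    ["Home", "View_Product", "View_Product"],   # hypothetical second session
]

binary = session_pageview_matrix(pages, sessions)
counts = session_pageview_matrix(pages, sessions,
                                 weight=lambda page, session: session.count(page))
print(binary)  # [[1, 1, 1, 1], [1, 0, 1, 0]]
print(counts)  # [[1, 1, 1, 1], [1, 0, 2, 0]]
```

Passing a different `weight` function, for instance one based on view time, changes the entries but not the shape of the matrix.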

3.1.2  Web user data

The multi-channel retailer uses a relational database schema to store billing information. The following list represents a simplified view of the retailer’s data schema, including preprocessed and sessionized Web log data (cf. table session). The company’s full database consists of more than 30 tables and 200 attributes. The list presents those entities and relationships that were used to test the main parts of our analysis framework in Chapter 3 and for the discussion of privacy problems in Chapter 4.

Table 3-2: User data schema

customer (customer_id, geo_id, credit_rating, first_name, surname, title, gender, date_of_birth)

address (address_id, customer_id, geo_id, country_code, street, street_number, street_number_supplement, customer_zip_code, town, recipient_address, post_office_box, phone_number, e-mail_address)

order (order_id, customer_id, session_id, store_id, product_id, status, invoice_value, currency, order_date, order_time, delivery_type, payment_method, credit_card_no, customer_card_no, status_change)

product (product_id, category_id, product_name, product_weight, product_size, price, cost)

product_category (category_id, category_name)

return (return_id, order_id, store_id, return_date, return_value, return_address)

store (store_id, geo_id, store_country_code, store_street_name, store_street_number, store_zip_code, store_town)

session (session_id, order_id, ip_location, access_time, browser_type, status_code, referrer)

page (page_id, concept_id, session_id, page_name, page_content)

page_concept (concept_id, concept_name, concept_content)

belongs_to (page_id, concept_id)

contains (session_id, page_id)

location_zip (geo_id, micro_id, zip_code, longitude_zip_code, latitude_zip_code)

microgeography (micro_id, detail_type, detail_value)

characterizes (micro_id, geo_id)



Foreign keys establish relationships between tables and are depicted as dotted attributes in the presented data schema. Log data in the table session could be linked to attributes in the table customer via a unique order_id when a user made an online purchase. If a site uses cookies, the attribute cookie_id would be stored in the table session.
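The link between log data and customer data via order_id can be sketched with Python's built-in sqlite3 module. The tables are reduced to a few attributes of Table 3-2, and all rows are invented toy data for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, surname TEXT);
    -- "order" is a reserved word in SQL and therefore quoted.
    CREATE TABLE "order" (order_id TEXT PRIMARY KEY, customer_id INTEGER,
                          delivery_type TEXT);
    CREATE TABLE session (session_id TEXT PRIMARY KEY, order_id TEXT,
                          referrer TEXT);
    -- Invented toy rows, not data from the retailer.
    INSERT INTO customer VALUES (1, 'Example');
    INSERT INTO "order" VALUES ('o-1', 1, 'in-store pickup');
    INSERT INTO session VALUES ('s-1', 'o-1', 'http://www.google.de/');
    INSERT INTO session VALUES ('s-2', NULL, '-');
""")

# Only sessions that ended in a purchase carry an order_id and can
# therefore be linked to a customer.
rows = conn.execute("""
    SELECT s.session_id, c.customer_id, o.delivery_type
    FROM session AS s
    JOIN "order" AS o ON o.order_id = s.order_id
    JOIN customer AS c ON c.customer_id = o.customer_id
""").fetchall()
print(rows)  # [('s-1', 1, 'in-store pickup')]
```

Session s-2 has no order_id and drops out of the inner join, which mirrors the fact that sessions without a purchase cannot be linked to customer records in this way.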

Third-party data sources can be added to extend a retailer’s database with additional consumer profile information. We acquired demographic data from Deutsche Post Direkt [Deutsche Post Direkt GmbH, 2004] that matches zip codes and geographic coordinates. Thus, the table location_zip could be added.

Demographic and sociographic information can be linked to customer addresses. The column detail_type in the table microgeography includes the attributes that could be added via the geo_id (e.g. zip code). Data such as political orientation, car type, family structure, cultural background, status, spending capacity, household size, creditworthiness, age, preferred anonymity level, marketing affinity, product affinity, preferred order medium or preferred communication media can be purchased from external sources [Deutsche Post Direkt GmbH, 2004]. Due to changes in demography and lifestyles, however, the accuracy and timeliness of microgeographic data are limited [Weichert, 2004].

Multi-channel retailers can link data from other sales channels to further enrich customer data. For example, shopping cards are often used to collect and link data from multiple sales points and may allow the detailed tracking of a customer’s shopping history.

The entity-relationship model for the multi-channel retailer is depicted in Figure 3-2.



Figure 3-2: Entity relationship model of the multi-channel retailer

3.2  Framework categories

Related work has used the following terms for measuring notions of online success: “Web traffic measurements” [Malacinski, et al., 2001], “e-metrics” [Cutler and Sterne, 2000], “operational metrics” [Srivastava, et al., 2002], “metrics for Web merchandising” [Lee, et al., 2001], “visit related measures” [Moe and Fader, 2000], “CRM analytics” [SAP AG, 2001] and “Web log metrics” [Kohavi and Parekh, 2003]. Further terms of Web measurement have been introduced in Beal [2003], Bensberg [2001], Schwickert [2001] and Weigend [2003].

We use the following terminology in our framework: Web metrics are specific numbers or ratios assigned to a particular attribute (e.g. objects, events). Measurement techniques that cannot be expressed as a single number – e.g. distributions, association rules, or clusters – are referred to as analytics. The latter term is also used by many vendors of Web mining software [KDNuggets, 2005]. The term Web analyses is used as a superordinate label of both Web metrics and Web analytics.

Our analysis framework consists of five groups of Web analyses as depicted in Figure 3-3.



Figure 3-3: Framework categories

The five analysis categories address the notion of online success from different perspectives:

The complete list of 82 metrics and analytics in the five analysis categories, their definitions, required data attributes and formalizations are depicted in Table 7-3 of the Appendix. All analyses are time-referenced. Sections 3.3-3.7 will present a selection of these analyses and apply them to Web data from a multi-channel retailer and an information site.

Basic statistical aggregations of Web logs (e.g. visits per day, distribution of user agents, most frequently visited Web pages, etc.) have not been integrated in our analysis framework as these analyses are offered by standard shareware tools [KDNuggets, 2005].

Moreover, product metrics and analytics are not presented in this thesis. Top-selling products and their position on a Web site are tracked routinely [Kohavi, 2004]. For example, market basket analysis is a common type of product data analysis that determines what products sell well together. A well-known algorithm for market basket analysis is the Apriori algorithm, which finds frequent itemsets in data [Agrawal, et al., 1993]. Linden et al. [2003] describe the recommendation algorithm used by the Internet retailer Amazon.com Inc.

Analyses calculating promotion or campaign success and cost-related analyses are also not included in the framework.

The proposed success analyses are particularly useful in the context of customer relationship management (CRM) [cf. Cutler and Sterne, 2000], Web site usability [cf. Kohavi and Parekh, 2003; Shneiderman and Plaisant, 2004; Spiliopoulou, et al., 2002] and Web site personalization [cf. Kobsa, et al., 2001].

3.3  Multi-channel service analyses

This section presents metrics and analytics measuring consumers’ service preferences for Internet-only and multi-channel retailers. Service offerings are considered one of the most important advantages of multi-channel over Internet-only retailers [Goersch, 2003; Omwando, 2002; USA Today, 2003]. A systematic analysis of service options in multi-channel retailing is presented in Section 3.3.1. The purchase decision process is introduced to point out multi-channel-specific service advantages. The current service mix of the 50 largest multi-channel retailers is presented in Section 3.3.2. The knowledge about the multi-channel service mix is used to define a set of service analyses in Section 3.3.3 and respective service metrics in Section 3.3.4. The metrics and analytics are applied to Web data from the multi-channel retailer. Section 3.3.5 concludes the discussion of service preferences with a presentation of results from an online survey.

3.3.1  The multi-channel service mix

The purchase decision process is a well-known model that conceptualizes consumer choice as a number of predictable sequences of behavioral tasks in making purchases [Alba, et al., 1997; Engel, et al., 1968; Goersch, 2003; Howard and Sheth, 1969; Nicosia, 1966; Otto and Chung, 2000].

Figure 3-4 depicts an integrated view on the purchasing phases, which points out the main differences between Internet-only and multi-channel service offerings on Web sites.

Dotted arrows indicate the sales path at pure Internet retail sites. Continuous arrows indicate phase transitions at multi-channel retail sites where online customers can deviate from the Internet sales path and switch to traditional offline channels or back.

Figure 3-4: The purchase decision process at multi-channel and pure Internet retail sites

The names, number of tasks and labels of the purchasing process vary in the literature. The models differ mainly in the phases they emphasize or in their stress on specific cognitive aspects [Goersch, 2003]. Related models in an online context are the customer life cycle [Cutler and Sterne, 2000] and the customer buying process [Lee, et al., 2001], which will be discussed in more detail in Section 3.4.

The phases of the purchase decision process are used to systematically point out service advantages of multi-channel retail Web sites:

  1. Acquisition (awareness) describes the phase where a consumer is attracted to a retailer’s value proposition. In an online context, a click on the Web site would characterize the acquisition phase. An advantage of multi-channel Web sites is that consumers could be attracted to visit the Web site from physical stores (e.g. by using Internet terminals in stores).
  2. During the information (persuasion) phase, visitors collect information about products and services and prepare their purchasing decision. In a multi-channel context, consumers could combine the advantages of online and offline information search. They can sample products in store after searching online, which may reduce the impediment of missing sensory cues on the Internet [Rosen and Howard, 2000]. Moreover, multi-channel Web sites may support store-based search by displaying information about physical stores (e.g. opening hours, shop locations or product availability).
  3. The first step of the settlement phase begins when a customer enters the order process. In an online context, the check-out of the shopping cart or input of customer data would characterize the commencement of the settlement phase.
  4. In the payment phase, the customer initiates the payment of her order. Multi-channel retail sites can offer an additional payment option to their customers: customers may pay cash in-store after having ordered online.
  5. Multi-channel Web sites can also offer more delivery options than pure Internet retailers. Online customers may pick up products in-store, which allows immediate gratification and avoids being present during the time of delivery. Some companies already offer special counters in stores where Internet orders can be picked up without waiting times.
  6. During the after-sales phase, multi-channel retailers can provide an additional service to their customers: defective or unsatisfactory orders may be returned in physical stores, which could be more convenient than returns by mail. Multi-channel Web sites may also offer additional assistance (e.g. maintenance, installations) executed by personnel from nearby physical stores.

3.3.2  Site services in multi-channel retailing

The analysis of multi-channel characteristics in the customer purchasing process facilitates the identification of five additional service options that can be offered on multi-channel retail sites:

We observed the availability of these service options at the world’s 50 largest e-retailers in 2002 [Gallo and McAlister, 2003]. 43 of these e-retailers operate multiple distribution channels; seven are pure Internet players. Of the 43 multi-channel retailers, 30 operate physical stores and 13 primarily operate direct distribution channels such as catalogs, TV or call centers. In Table 3-3 we give an overview of the present service mix at the 30 retailers that operate physical stores and a Web site:

Table 3-3: Online service mix at the 30 largest multi-channel retailers (as of November 2003)

The analysis indicates that many retailers do not offer the full multi-channel service spectrum. The most common service combination includes store locator pages and in-store returns of online orders. All multi-channel retailers in the sample offer store locator pages and about two-thirds offer in-store returns. At eleven companies online customers can check store inventory and/or special offers in physical stores. At five companies customers can pick up online orders in physical stores. Three companies offer the full multi-channel service spectrum including payment in-store after an order has been placed online.

Whereas returning goods from online purchases to a physical store is a typical service option at many multi-channel retailers, picking up goods or checking stock in a particular store is not yet as common.

A retailer’s choice of a particular service mix may depend on several parameters. A large store network seems to be a requirement for in-store pick-ups. Retailers offering the full multi-channel service spectrum operate a nationwide retail network. Moreover, differences between online and offline pricing present a challenge to multi-channel integration. Local discounts at stores could confuse online customers when they pick up orders and notice a lower in-store price. A study found that one-third of multi-channel retailers offered different online and offline prices in 2001 [Shern, 2001]. Some of the multi-channel retailers in our sample announced on their Web site that any discounts in-store also apply to online orders on the day of pickup (e.g. Circuit City Inc.). Delivery cost is a further decision parameter that needs to be considered in multi-channel retailing. The avoidance of shipping cost is one of the most important reasons for online customers to pick up orders (cf. Section 3.3.5). Customers of online retailers offering low shipping costs or free-of-charge delivery may have fewer incentives to use an in-store pickup service. Cost for order management and additional personnel could be a further reason why many multi-channel retailers have not yet fully integrated online and offline services.

As this brief discussion has demonstrated, a retailer’s decision to offer multi-channel services is influenced by many organizational parameters. An in-depth discussion of these parameters is not within the scope of this work.

3.3.3  Service analytics

Our analysis of the service mix constitutes the basis for the definition of a set of service analyses measuring consumer service preferences in multi-channel retailing. The analyses are applied to data from the multi-channel retailer, which offers an integrated service spectrum in the sense of Table 3-3 except that a search function for in-store inventory is not yet implemented on the Web site. Online customers can pay online by credit card, directly at a physical store or by cash on delivery. Online orders are delivered directly to the customer or can be picked up at a store. Returns can be handled either by mail or at a physical store. Visitors can locate the nearest store online.

We analyzed data from 13,653 customers who made 14,957 transactions over a period of 8 months in 2001/02.

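The service analytics are presented as association rules, which depict relationships among items based on their patterns of co-occurrence across transactions [Agrawal, et al., 1993]:

Let $I = \{I_1, \ldots, I_n\}$ be a set of discrete entities (items) and $D = \{t_1, \ldots, t_k\}$ a set of transactions in a database, with $t_j \subseteq I$ for every transaction $t_j$. Then $X \Rightarrow Y$ is an association rule with $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \emptyset$.

$X \Rightarrow Y$ has support $s$ if $s\%$ of the transactions in $D$ contain $X \cup Y$.

The rule confidence $c$ is defined as the share of the transactions containing $X$ that also contain $Y$:

$$c(X \Rightarrow Y) = \frac{s(X \cup Y)}{s(X)}$$

where $s(Z)$ denotes the share of transactions in $D$ that contain all items of $Z$.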

The presentation of service preferences as association rules provides two benefits: first, the Web analyst can easily identify the most important service rules and second, the frequency of occurrences between service offerings can be depicted concisely.
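Support and confidence can be computed directly from a list of transactions, as the following minimal Python sketch shows. It uses the standard definition of confidence as support of the combined itemset divided by support of the antecedent; the function names and the four toy transactions are assumptions for illustration and are not taken from the retailer's data.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(X together with Y) / support(X); None if X never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return None
    combined = set(antecedent) | set(consequent)
    return support(combined, transactions) / base


# Hypothetical toy transactions, each a set of chosen service options.
transactions = [
    frozenset({"online payment", "direct delivery"}),
    frozenset({"online payment", "direct delivery"}),
    frozenset({"online payment", "in-store pickup"}),
    frozenset({"in-store payment", "in-store pickup"}),
]

s = support({"online payment", "direct delivery"}, transactions)
c = confidence({"online payment"}, {"direct delivery"}, transactions)
print(s)  # 0.5
print(c)
```

In the toy data, two of four transactions combine online payment with direct delivery (support 0.5), and two of the three online payers chose direct delivery (confidence 2/3).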

3.3.3.1 Payment and delivery preferences

The first set of association rules describes the associations between customers’ payment and delivery preferences (cf. steps 4 and 5 of the customer purchasing process in Figure 3-4):

(1) Online payment ⇒ Direct delivery (s=0.27, c=0.97)

(2) Online payment ⇒ In-store pickup (s=0.02, c=0.03)

(3) Cash on delivery ⇒ Direct delivery (s=0.02, c=0.06)

(4) In-store payment ⇒ In-store pickup (s=0.69, c=0.94)

The first rule is read as follows: if a customer chose online payment using a credit card, she also chose direct delivery with 97% frequency. This rule could be identified in 27% of the transactions. Thus, 3,686 orders were delivered directly. Surprisingly, in 69% of the transactions, customers placed an order online but chose to pay for and pick up their order at a physical store (rule 4). Several surveys confirm that observation, albeit with a lower support factor [Swerdlow, et al., 2002; Tedeschi, 2001]. In 27% of the transactions, customers chose the combination of online payment and direct delivery, which is the typical service combination offered at pure Internet retailers. Only very few customers combined online payment and in-store pickup (rule 2). Moreover, only a few customers paid cash on delivery (rule 3). For comparison, in Germany, 64% of e-commerce offers are purchased on account, 36% by payment on delivery, 26% by direct debit and 23% by credit card [Schneemann, 2003].

As a conclusion, most online customers collect information and place orders on the multi-channel site but prefer physical stores for pickup and payment. Less than one-third of the customers in the sample are “pure” online users who chose direct delivery and online payment.



3.3.3.2  Return preferences

Moreover, we analyzed customers’ return preferences at the multi-channel retailer. 10% of all online orders were returned within eight months. The association rules (5) and (6) represent customers’ return preferences (cf. steps 5 and 6 of the customer purchase process presented in Figure 3-4):

(5) Return ⇒ In-store (s=0.06, c=0.87)

(6) Return ⇒ Mail-in (s=0.04, c=0.13)

The findings indicate a strong preference for in-store returns (87%). Though returns were offered free of charge, only 13% of all returned orders were mailed back. The customers who returned orders by mail had also chosen online payment and direct delivery when they placed their order. A consumer survey found similar results: 83% of online buyers would prefer to return online purchases at stores [Jupiter Research Corporation, 2001].

A reason for the preference of in-store returns could be the convenience of personal assistance and the handling of packaging in-store. Moreover, replacement or guarantee issues can be discussed in person in-store. The offer to return online orders at a physical store seems to be a successful service strategy that is offered by two-thirds of the largest multi-channel retailers (cf. Section 3.3.2).

3.3.3.3 Repeat customers’ service preferences

The last set of association rules describes the migration behavior of repeat customers’ delivery and payment preferences. Migration measures the number of customers who switched their delivery or payment preferences in at least one transaction after their first one. The number of repeat customers amounts to 10% of all customers over a time period of eight months. Only 9% of repeat customers changed delivery terms after their first transaction. None of the customers switched their transaction preferences more than once.

(7) Direct delivery ⇒ In-store pickup (in ≥ 1 of the following transactions) (s=0.001, c=0.15)

(8) Direct delivery ⇒ Direct delivery (in every following transaction) (s=0.003, c=0.85)

(9) In-store pickup ⇒ Direct delivery (in ≥ 1 of the following transactions) (s=0.001, c=0.10)

(10) In-store pickup ⇒ In-store pickup (in every following transaction) (s=0.004, c=0.90)

The support for repeat customers who switched to in-store pickup (rule 7) was equal to the support for customers who switched to direct delivery (rule 9) in at least one of the transactions following the first one.

As payment and delivery preferences are closely coupled (cf. rules (1)-(4)), the support and confidence values for payment migration between online payment and payment in-store were equivalent to rules (7)-(10) in our sample.

Rule (9) could be interpreted as an indicator of trust in the online shop: if an online customer first picks up or pays for a product in-store and then switches to direct delivery or online payment, the consumer may have developed trust in the reliability of the retailer’s direct delivery and online payment.
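The migration measure behind rules (7)-(10) can be sketched as follows in Python. The function name, the data layout (one chronologically ordered list of delivery types per customer) and the toy histories are assumptions for illustration.

```python
def migration_confidence(histories, first_choice, other_choice):
    """Share of repeat customers who started with first_choice and used
    other_choice in at least one later transaction.

    histories: dict customer_id -> list of delivery types in order of purchase.
    Returns None if no repeat customer started with first_choice.
    """
    starters = [
        history for history in histories.values()
        if len(history) > 1 and history[0] == first_choice  # repeat customers only
    ]
    if not starters:
        return None
    switched = sum(1 for history in starters if other_choice in history[1:])
    return switched / len(starters)


# Hypothetical toy histories, not data from the retailer.
histories = {
    "c1": ["direct", "direct"],
    "c2": ["direct", "pickup"],
    "c3": ["pickup", "pickup", "pickup"],
    "c4": ["direct"],   # single purchase: not a repeat customer
}

to_pickup = migration_confidence(histories, "direct", "pickup")
to_direct = migration_confidence(histories, "pickup", "direct")
print(to_pickup, to_direct)  # 0.5 0.0
```

Customer c4 is ignored because migration is only defined for customers with at least one transaction after their first.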

3.3.4  Service metrics

The service rules of Section 3.3.3 can be transformed into service metrics that are simple to calculate and can be easily used for comparisons over time and between Web sites. Table 3-4 presents a list of multi-channel-specific service metrics and their results that can be derived from the association rules presented in Section 3.3.3.

Table 3-4: Multi-channel service metrics

Multi-channel service metric          Result
In-store payment rate                 69%
Online payment rate                   29%
Cash-on-delivery payment rate         2%
In-store payment migration rate       15%
Online payment migration rate         10%
Deliveries-to-stores rate             71%
In-store delivery migration rate      15%
Direct delivery migration rate        10%
Returns-to-stores rate                87%

The in-store payment rate measures the number of online customers who paid in-store and is equivalent to the support factor of association rule (4). The online payment rate measures the number of online payers and is equivalent to the sum of the support factors of rules (1) and (2). The cash-on-delivery payment rate is the support factor of rule (3). The deliveries-to-stores rate measures how many customers preferred to pick up their [page 50↓]online orders at physical stores. It is the sum of the support factors of association rules (2) and (4). The returns-to-stores rate measures how many buyers returned products in physical stores. It is the confidence factor of rule (5).

The in-store delivery migration rate measures the share of repeat customers who switched from direct delivery to in-store pickup in at least one of their subsequent transactions. It is equal to the confidence factor of association rule (7). The direct delivery migration rate is equivalent to the confidence factor of rule (9). The payment migration rates are calculated analogously to the delivery migration rates.
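The following minimal sketch illustrates how such support-style and confidence-style rates could be computed from order records. The record layout, the field values and the exact antecedent of the migration rule (a repeat customer whose first order used a given delivery mode) are assumptions made for illustration; they are not taken from the rule definitions in Section 3.3.3.

```python
from collections import defaultdict

# Hypothetical order records: (customer_id, order_no, payment, delivery).
orders = [
    ("c1", 1, "in_store", "pickup"),
    ("c1", 2, "online", "direct"),
    ("c2", 1, "online", "direct"),
    ("c2", 2, "in_store", "pickup"),
    ("c3", 1, "in_store", "pickup"),
    ("c4", 1, "online", "pickup"),
    ("c5", 1, "cod", "direct"),
    ("c6", 1, "online", "direct"),
    ("c6", 2, "online", "direct"),
]


def share(records, predicate):
    """Support-style rate: fraction of records that satisfy the predicate."""
    return sum(1 for r in records if predicate(r)) / len(records)


def migration_rate(records, start, target):
    """Confidence-style rate (assumed reading): among repeat customers whose
    first order used delivery mode `start`, the fraction with at least one
    later order using delivery mode `target`."""
    deliveries = defaultdict(list)
    for cust, _no, _payment, delivery in sorted(records, key=lambda r: (r[0], r[1])):
        deliveries[cust].append(delivery)
    eligible = [d for d in deliveries.values() if len(d) > 1 and d[0] == start]
    if not eligible:
        return 0.0
    return sum(1 for d in eligible if target in d[1:]) / len(eligible)


in_store_payment_rate = share(orders, lambda r: r[2] == "in_store")
deliveries_to_stores_rate = share(orders, lambda r: r[3] == "pickup")
in_store_delivery_migration = migration_rate(orders, "direct", "pickup")
direct_delivery_migration = migration_rate(orders, "pickup", "direct")

print(in_store_payment_rate, deliveries_to_stores_rate)
print(in_store_delivery_migration, direct_delivery_migration)
```

Because the rates are plain ratios over the same record set, they can be recomputed per period or per site for the comparisons mentioned above.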

3.3.5  Survey results

To round off our analysis of service preferences, we conducted an online survey on the multi-channel Web site to inquire about the reasons for the surprisingly high number of in-store pickups. Consumer comments from a previous survey7 were consulted to define seven answer options to the question “If you have decided to pick up an online order at the retailer, what were the reasons?”. This question was attached to the online questionnaire described in Chapter 2. 1048 visitors checked 3505 answer fields. The results are depicted in Figure 7-2 of the Appendix.

The survey results show that shipping costs are the most important reason for customers to pick up orders. The retailer’s shipping costs are 4.95 euros and thus below the German average for domestic postal ground shipping of consumer electronics. Costs are waived for orders of 100 euros or more. The retailer offers standard delivery times of about three days.

The second most important reason to pick up orders in physical stores was the need to look at the product in person and the desire for direct communication. Half of the users prefer to look at a product before they accept it, and 41% want to see that a product is not damaged. Delivery convenience and online payment risks are also significant reasons to pick up orders in-store: 26% claim they are usually not at home during delivery times, and 20% pick up orders to avoid the lag time of shipping. 19% find online payment too risky.

3.3.6 Summary and implications

The multi-channel service mix at the top 30 multi-channel e-retailers in 2002 has been [page 51↓]analyzed. The results demonstrate that Web sites increasingly extend multi-channel services to their customers. In particular, in-store returns and a store locator are typical service options at large multi-channel retailers. However, the analysis has shown that many companies do not yet fully exploit the potential of multi-channel service integration, although the analysis of consumer preferences demonstrated a clear demand for such services.

In order to measure these service preferences, a group of service analyses has been presented. The results indicate that consumers have a strong preference for in-store pickup, payment and return.

The presented service analytics and metrics have important implications for business decision making. For example, if a large percentage of users prefers to examine and to pick up products in store, it may be worthwhile to further expand the store network.

3.4  Conversion analyses

This section focuses on the analysis of Web usage behavior and presents a set of Web analyses measuring fine-grained conversion metrics for Internet-only and multi-channel retailers.

Conversion – defined as the proportion of visits that end with a purchase – is a well-known notion of online success. The online conversion rate for US retailers increased from 2.2% in 2000 to 3.1% in 2001 [BCG and Shop.Org, 2002]. However, only 2–3% of user sessions are captured in this success metric, whereas 97–98% of session data stem from visitors who looked at information on the Web site but did not engage in an online transaction. The session data from this latter group may nevertheless provide useful insights into alternative success events on Web sites. Moreover, a single conversion rate is not sufficient for measuring the success of multi-channel Web sites: in a multi-channel context, conversion success may not be visible directly in the Web logs, e.g., if visitors collect information online but purchase offline. Thus, more fine-grained conversion metrics need to be developed.

In Section 3.4.1, we introduce the customer life cycle of Cutler and Sterne [2000] and the micro-conversion rates of Lee et al. [2001] and derive a formal model measuring conversion success in Internet retailing. We will refer to techniques from Web usage mining, which is the application of data mining techniques to discover interesting Web usage patterns [Baldi, et al., 2003; Cooley, et al., 1999; Han and Kamber, 2000; Kosala and Blockeel, 2000; Spiliopoulou and Faulstich, 1998; Srivastava, et al., 2000].

Section 3.4.3 presents new conversion success metrics: a class of concept conversion [page 52↓] rates, and the offline conversion rate, that provide a fine-grained view on consumers’ conversion behavior. In order to calculate these metrics, a taxonomy of site concepts for the multi-channel retailer has been developed.

In Section 3.4.4, we calculate the conversion metrics and discuss our results. Recommendations for site improvement are derived.

3.4.1  Conversion success metrics

The processes whereby a visitor becomes a customer (cf. Section 3.3.1) have been described for an online retail context in related work: on a macro level, the processes of moving along the customer life cycle [Cutler and Sterne, 2000]; on a micro level, the processes of moving along the customer buying process [Lee, et al., 2001]. In each of these models, distinct stages (and user groups who are defined by having “reached” those stages) follow upon one another. In [Berthon, et al., 1996], the purchase process is modeled by distinguishing, within the set of all site users, the “short-term visitors” from the “active investigators”. Some of the latter eventually become “customers”. Metrics are proposed to measure how many site users reach these advanced stages. However, to find out why short-term visitors may not have become active investigators, or active investigators may not have become customers, it is necessary to consider the visited pages with respect to their potential for further action. Criteria for classifying pages accordingly can be based on merchandizing purpose [Lee, et al., 2001] or, more generally, on service-based concept hierarchies [Spiliopoulou and Pohle, 2001]. The paths taken to goal pages, their lengths in particular, have been integrated as a further aspect of efficiency control [Spiliopoulou and Berendt, 2001].

3.4.2 An integrated framework for conversion success

As a first step towards a model of conversion success measurement, an integrated scheme for formalizing both the life-cycle metrics of Cutler and Sterne [2000] and the micro-conversion rates of Lee et al. [2001] has been proposed. Figure 3-5 illustrates the stages and processes of these models. The figure should be read as follows: the letter at a node identifies a set of people defined with reference to the site’s goal.8 The subscript $T$ [page 53↓]is omitted in the figure to enhance clarity. By the actions performed in $T$, each individual moves from being an element of the set at one node to being an element of one of the sets at the children of that node. For example, all “suspects” $S_T$ (i.e., people who have become aware of the site and are visiting it [Berthon, et al., 1996]) are either “acquired” and become “prospects” $P_T$ (i.e., people who show interest by some kind of active participation, cf. the “active investigators” of Berthon et al. [1996]) or not. In the latter case, they belong to the set $nP_T$. Children of a node partition the set of their parent node, e.g., $P_T \cap nP_T = \emptyset$ and $P_T \cup nP_T = S_T$. Figure 3-5 (a) shows the stages and transitions involved in the life-cycle metrics of Cutler and Sterne [2000], and (b) shows an alternative partitioning of the set $C_T$ of customers in (a). That is, it is possible that $U1_T \cap C1_T \neq \emptyset$, $U1_T \cap CA_T \neq \emptyset$, $U1_T \cap CR_T \neq \emptyset$ and $UR_T \cap C1_T \neq \emptyset$, $UR_T \cap CA_T \neq \emptyset$, $UR_T \cap CR_T \neq \emptyset$. Figure 3-5 (c) shows a more fine-grained representation of the stages of the customer buying cycle that make up the steps that convert a prospect into a customer.9

Figure 3-5: (a), (b): Stages and transitions in the customer life cycle, and (c) in the customer buying cycle

In Table 3-5, we propose formalizations of the metrics associated with the transitions of [page 54↓]the customer life cycle [Cutler and Sterne, 2000] in Figure 3-5 (a) and (b), and we express the micro-conversion rates of Lee et al. [2001] (Figure 3-5 (c)) in the same framework. This representation assumes that to become a customer, one must follow the canonical sequence shown in Figure 3-5 (c).

The last column of Table 3-5 points out data requirements for calculating the metrics. If the rate of visits that lead to active participation is of more interest than numbers of individual customers, session IDs suffice, and acquisition can be measured as the number of visits with URL requests that indicate active participation, divided by the number of all visits, in T. Conversion and abandonment can be measured analogously, cf. Spiliopoulou and Berendt (2001) and Spiliopoulou and Pohle (2001) for examples. Measures like retention or attrition, of course, rely on the personal identity of the customer and therefore require at least cookie data as (quasi-)unique customer identifiers. Reach requires marketing data about the number of Internet users and the overall size of the target market.


[page 55↓]

Table 3-5: Metrics for e-business: life-cycle metrics and micro-conversion rates

| Life Cycle Metrics | Metrics Definition | Data Requirements |
|---|---|---|
| Reach | $S_T / W_T$ | M |
| Acquisition | $P_T / S_T$ | C (SI) |
| Conversion | $C_T / P_T$ | C (SI) |
| Retention | $CR_T / C_T$ | C and/or TA |
| Loyalty | $UR_T / C_T$ | C |
| Abandonment | $C_{b,T} / P_T$ | C (SI) |
| Attrition | $CA_T / C_T$ | TA, M |
| Churn | | TA, M |
| Micro-Conversion Rates | | |
| Look-to-click | $M2_T / M1_T$ | C (SI) |
| Click-to-basket | $M3_T / M2_T$ | C (SI) |
| Basket-to-buy | $M4_T / M3_T$ | C (SI) |
| Look-to-buy | $M4_T / M1_T$ | C (SI) |

M = marketing, C = cookies, SI = session IDs, TA = transaction
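As an illustration of the session-based variant of these metrics, the sketch below computes the four micro-conversion rates as ratios of set sizes. The toy sessions and the interpretation of the stages M1 to M4 (product impression, click-through, basket placement, purchase) are assumptions for this example; the only property taken from the text is that the stages are passed in a canonical sequence, so that the stage sets are nested.

```python
# Hypothetical data: session ID -> deepest buying-cycle stage reached.
# Assumed stage meanings: 1 = product impression, 2 = click-through,
# 3 = basket placement, 4 = purchase.
deepest_stage = {
    "s1": 1, "s2": 2, "s3": 4, "s4": 3,
    "s5": 1, "s6": 2, "s7": 4, "s8": 1,
}


def stage_set(k):
    """Sessions that reached at least stage k (canonical order => nested sets)."""
    return {sid for sid, stage in deepest_stage.items() if stage >= k}


def rate(numerator, denominator):
    """Ratio of set sizes; 0.0 if the denominator set is empty."""
    return len(numerator) / len(denominator) if denominator else 0.0


M = {k: stage_set(k) for k in (1, 2, 3, 4)}

look_to_click = rate(M[2], M[1])
click_to_basket = rate(M[3], M[2])
basket_to_buy = rate(M[4], M[3])
look_to_buy = rate(M[4], M[1])

print(look_to_click, click_to_basket, basket_to_buy, look_to_buy)
```

Under the nesting assumption, look-to-buy equals the product of the three stepwise rates, which is a convenient sanity check when the rates are computed from real logs.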

3.4.3  New conversion metrics

The formalization of the micro-conversion rates of Lee et al. [2001] presents two problems:

Problem 1. Although these metrics are useful for determining specific site events, the four conversion rates proposed by Lee et al. do not examine users’ information behavior in more detail, such as a user’s clickstream from a catalog page to a product page. In particular, they lack a definition of conversion in the context of multi-channel retailing.

Problem 2. The proposed conversions have been defined on the basis of sessions that reach the next phase in the buying process or not. However, they do not consider volume-based conversion (how many pages representing one phase have been visited relative to those representing another phase).

Our approach addresses these two issues. First, we use an OLAP-style analysis to address problem (1). We suggest a general formalization of fine-grained conversion rates that can be used on different Web sites. We develop and use a concept hierarchy to achieve a more aggregate view of the data, and we extend the classification of pages by merchandizing purpose to also measure cross-channel affinity. We then investigate session modeling in order to address problem (2), using feature vectors that indicate either whether a concept has been visited in a session or not, or how many times it has been visited. We use sessions instead of users as our basic unit of analysis because our focus is on the micro level of individual online interaction processes, rather than on the macro level of how a person moves along the customer life cycle. Session-based analysis has been shown to be useful for a number of applications such as recommender systems [Sarwar, et al., 2000] and personalization systems [Kobsa, et al., 2001; Mobasher, et al., 2002]. Moreover, session-based data collection (or the reconstruction of sessions from IP+agent) presents fewer privacy problems than cookie-based data collection, which will be [page 56↓]discussed in more detail in Chapter 5. Furthermore, cookies can be deleted, which impedes a re-identification of users [Fiutak, 2004]. However, the use of session IDs assumes that each session originated from a different user, which need not be true.

3.4.3.1  Multi-channel site taxonomy

For incorporating domain knowledge in the log analysis, we built a concept hierarchy as a model of the business purpose underlying the multi-channel Web site introduced in Section 2.3.1.

A concept hierarchy, also known as taxonomy, generalizes concrete objects into more abstract concepts [Berendt and Spiliopoulou, 2000; Pohle and Spiliopoulou, 2002; Spiliopoulou, 2000]. The development of concept hierarchies requires the mapping of user activities into generic user tasks. This procedure provides two main benefits: first, previous knowledge about a site’s business objectives can be integrated in the analysis process. Second, the data are much easier to interpret by the analyst, e.g., statistical analysis can be performed on product group rather than product level.

The mapping of site components to concepts is traditionally performed prior to the statistical analysis of the data. The establishment of a concept hierarchy cannot be automated, since the site semantics depend on the goals of the Web site and the objectives of the institution owning it. E-commerce sites usually have well-structured Web content, including predefined metadata or a database schema [Lynch and Horton, 2001; Shneiderman, 2000; van Duyne, et al., 2002].

Our classification covers the types of services that typically constitute a large multi-channel retail site. It extends the usual classification of the purchase decision process (cf. Figure 3-4) by a more fine-grained concept view that includes the service, offline information, information catalog and information product concept. The following concepts are included in the taxonomy:

  1. acquisition (home): all Web pages that are semantically related to the initial acquisition of a visitor (e.g., the home page).
  2. information catalog (infcat): pages providing an overview of product categories. This concept could be further differentiated with a number of sibling nodes describing the Web retailer’s product categories.
  3. information product (infprod): pages displaying information about a specific product. infprod is a child of infcat.[page 57↓]
  4. service: general company information, registration, games and other trust-building information.
  5. transaction: all transaction pages before an actual purchase, starting with a customer entering the order process, check-out of shopping cart, input of customer data, payment and delivery preferences.
  6. purchase: pages indicating the completion of the transaction process such as the invocation of an order confirmation page.
  7. offline: all pages related to any offline information: store locator (pages for finding physical stores in one’s neighborhood), information about offline services, or specific offline referrers.10

Figure 3-6 depicts the site taxonomy that was used for the analysis.11 Each of the 760,535 page requests that remained after data preprocessing was mapped onto a concept from the hierarchy.

Based on this categorization of pages, we propose concept conversion rates as ratios of page impressions between two concepts. Ideally, high transition rates between adjacent phases should be achieved.
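A minimal sketch of such a mapping from page requests to the seven concepts is given below. The URL patterns are purely hypothetical, since the URL scheme of the analyzed site is not described here; in practice the patterns would be derived manually from the site structure, as noted above.

```python
import re
from collections import Counter

# Hypothetical URL patterns (assumption). The first matching pattern wins,
# so more specific patterns are listed before more general ones.
CONCEPT_PATTERNS = [
    ("purchase", re.compile(r"^/order/confirmation")),
    ("transaction", re.compile(r"^/(cart|checkout|order)")),
    ("offline", re.compile(r"^/(stores|storelocator)")),
    ("infprod", re.compile(r"^/catalog/[^/]+/[^/]+")),
    ("infcat", re.compile(r"^/catalog")),
    ("service", re.compile(r"^/(about|register|help|games)")),
    ("home", re.compile(r"^/(index\.html)?$")),
]


def concept_of(url_path):
    """Return the concept for a requested path, or None if no pattern matches."""
    for concept, pattern in CONCEPT_PATTERNS:
        if pattern.search(url_path):
            return concept
    return None


requests = [
    "/",
    "/catalog/tv",
    "/catalog/tv/model-123",
    "/storelocator?zip=10115",
    "/checkout/payment",
    "/order/confirmation/991",
]
page_impressions = Counter(concept_of(path) for path in requests)
print(page_impressions)
```

Requests that map to `None` can be logged separately so that unclassified pages are reviewed rather than silently dropped.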


[page 58↓]

Figure 3-6: Site taxonomy

3.4.3.2  Conversion rates and visit rates

Sessionized data can be analyzed in a number of ways. A session is usually treated as a bag of visited pages or visited page concepts, as a set, or as a sequence. Here, we will focus on analyses of bags or sets, which are useful for applications like market basket analysis and recommendation systems based on analyzing pages that were accessed together in users’ previous sessions [Cutler and Sterne, 2000; Perkowitz and Etzioni, 1998; Zaiane, et al., 1998]. Each session $s$ from $S$, the set of all sessions, can then be represented as a feature vector (cf. Section 3.1.1 for a formal definition) with each component $s[c]$, $c = 1, \ldots, 7$, indicating either the number of visits to the respective concept 1–7 (bag) or, in a dichotomized fashion, whether or not that concept was visited in the session (set). In the following, we will refer to the first method as weighted-concept and to the second as dichotomized-concept, with $s_w[c] \in \mathbb{N}_0$ and $s_d[c] \in \{0, 1\}$. In addition to concepts 1–7, $s_d[0]$ denotes the visit to “any” concept, i.e., $s_d[0] = 1$ for every session $s \in S$.

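We first define the dichotomized-concept conversion rate from concept $c_i$ to concept $c_j$ as

$$\frac{|\{s \in S \mid s_d[c_i] = 1 \wedge s_d[c_j] = 1\}|}{|\{s \in S \mid s_d[c_i] = 1\}|}. \qquad (1)$$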


[page 59↓]

This notation shows that the conversion rate can also be read as the confidence of the association rule $c_i \to c_j$.

Two cases can be distinguished. The first assumes that a visit to concept $c_j$ is only possible after a visit to concept $c_i$. In this case, equation (1) can be simplified. Abbreviate the set in the denominator as $S_i$, and define $S_j$, $S_{i \& j}$ analogously. Then, because $S_j \subseteq S_i$ and hence $S_{i \& j} = S_j$, the conversion rate reduces to

$$\frac{|S_j|}{|S_i|}.$$

Examples are the conversion rates shown in Table 3-5. In this fashion, one can also address the question whether a visit accessed a particular concept $c_i$ at all. This gives rise to total conversion rates $c_0$ to $c_i$, which means that the denominator becomes $|S_0| = |S|$. We specify this for the offline concept. Let $S_{offline} \subseteq S$ be the set of sessions that visit the offline concept at least once, i.e., $S_{offline} = \{s \in S \mid s_d[\mathit{offline}] = 1\}$. Then we define the offline conversion rate as $|S_{offline}| / |S|$. We add a second case, which concerns two concepts that need not necessarily occur in the order $i$, $j$. An example is the prodinf_to_service conversion rate that we will investigate in the next section. Furthermore, we extend this analysis by a set of volume-based metrics. We define the weighted-concept visit rate from concept $c_i$ to concept $c_j$ as

$$\frac{\sum_{s \in S} s_w[c_j]}{\sum_{s \in S} s_w[c_i]}. \qquad (2)$$

While this cannot directly be broken down to the number of concept visits occurring within the same sessions (and thus does not describe the conversion of one visitor from being in one subgroup of $S$ to being in another subgroup), it is a useful indicator of the different concepts’ relative importance throughout the whole log. The idea of using a concept hierarchy for analysis can be extended by further partitioning these sets. For example, we investigated the set of sessions that visit the store locator, $SLV$, and the set of sessions that exit via the store locator, $SLE$. Both are dichotomized-concept notions, and $SLE \subseteq SLV \subseteq S_{offline}$. Finer-grained offline conversion rates can be calculated using these sets.
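The two kinds of rates can be sketched in a few lines of code. The toy sessions are hypothetical, and the visit rate is implemented under the assumption that it is the ratio of the total visit counts of the two concepts over the whole log, in line with the volume-based reading given above.

```python
CONCEPTS = ["home", "infcat", "infprod", "service", "transaction", "purchase", "offline"]

# Hypothetical sessions as bags of visited concepts (one entry per page request).
sessions = [
    ["home", "infcat", "infprod", "infprod", "transaction", "purchase"],
    ["home", "infcat", "infcat"],
    ["infprod", "service"],
    ["home", "offline"],
]


def weighted(session):
    """Weighted-concept vector s_w: number of visits to each concept."""
    return {c: session.count(c) for c in CONCEPTS}


def dichotomized(session):
    """Dichotomized-concept vector s_d: 1 if the concept was visited, else 0."""
    return {c: int(c in session) for c in CONCEPTS}


def conversion_rate(all_sessions, ci, cj):
    """Confidence of the rule ci -> cj over dichotomized vectors."""
    visited_i = [v for v in map(dichotomized, all_sessions) if v[ci] == 1]
    if not visited_i:
        return 0.0
    return sum(v[cj] for v in visited_i) / len(visited_i)


def visit_rate(all_sessions, ci, cj):
    """Assumed reading: total visits to cj divided by total visits to ci."""
    vectors = [weighted(s) for s in all_sessions]
    total_i = sum(v[ci] for v in vectors)
    return sum(v[cj] for v in vectors) / total_i if total_i else 0.0


def offline_conversion_rate(all_sessions):
    """Share of sessions that visit the offline concept at least once."""
    return sum(dichotomized(s)["offline"] for s in all_sessions) / len(all_sessions)


home_to_infcat = conversion_rate(sessions, "home", "infcat")
infcat_to_infprod_visits = visit_rate(sessions, "infcat", "infprod")
ocr = offline_conversion_rate(sessions)
print(home_to_infcat, infcat_to_infprod_visits, ocr)
```

Restricting `sessions` to a subset, for example purchase sessions only, yields the same rates for other base sets.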


[page 60↓]

Visits to concepts and conversion rates not only produce numbers for eventual success measurement. They can also be used to gain insights into online users’ behavior, in particular if different groups of users are compared. In Section 3.4.4, we illustrate how the computation of concept visit frequencies and conversion rates can help to understand the use of a multi-channel Web site not only within the set of all sessions S as in equations (1) and (2), but also in other base sets.

3.4.4  Conversion metrics results

We modeled the visits in terms of the concepts introduced in Section 3.4.3.1 and computed the conversion rates defined in Section 3.4.3.2.

We first compared two groups of sessions: the set of all sessions S and the set of all purchase sessions C. Moreover, we differentiate between two multi-channel-specific session groups: within the set of purchase sessions, we compare the two groups with the different delivery choices pick-up in store and direct delivery. We use delivery choice as an exemplary feature of multi-channel affinity because Section 3.3 has shown that delivery services are one of the most important service advantages of multi-channel retailers over pure Internet merchants. The purchase behavior of these groups is particularly interesting as one group uses the direct delivery option preferred by traditional Internet shoppers whereas the other demonstrates a multi-channel affinity.

The first group is obtained from the Web logs, and the other three groups are obtained by (a) combining Web log data with the transaction back-end data, and (b) classification according to the values of the relevant attributes (purchase: yes/no, delivery choice: direct delivery/pick-up in store).

Figure 3-7 (a) shows the numbers of page impressions on the various concepts in the set of all sessions S and Figure 3-7 (b) the set of all purchase sessions.


[page 61↓]

Figure 3-7: (a) all sessions and (b) purchase sessions: normalized numbers of weighted and dichotomized concept visits per session

The upper bars show the average number of visits, in one session, to each of the 7 concepts, and the lower bars show the proportion of sessions that have visited each of the 7 concepts at least once. For example, the infcat concept was visited, on average, 1.44 times per session, but in fact only 52.5% of all sessions visited this concept at all. Visit rates correspond to the relative widths of the “weighted” bars. This normalization was done to allow the best possible comparison between usage behavior in the four groups of sessions we investigated (cf. Figure 3-7 and Figure 3-8).

The findings from this analysis suggest that not all sessions include the home concept. Some visitors follow links from affiliate sites that often lead directly to the infprod concept. As expected, most hits occur in the information phase, where users explore product information before they eventually visit service-related sites, purchase a product or leave the site. One-fourth of all user sessions visited the offline concept at least once. The conversion rates are based on single-session conversion from one concept to another, but they lack the volume information. Especially in a multi-channel context, the information on volume combined with the offline conversion could indicate that the site serves information needs and increases the interest in offline sales. Low visit rates indicate that one should look at data on a more detailed level to identify inefficiencies within certain site concepts. Figure 3-7 (b) shows the normalized numbers of page impressions on the various concepts in the set of all purchase sessions C. The purchase concept is not shown because it is, by definition, always visited.

The comparison with the group of all sessions indicates that users who decide to initiate a [page 62↓]purchase do this on the basis of a much more extensive interaction with the site. In particular, the total number of catalog and product information pages visited is much higher, on average, in a purchase session. Not surprisingly, nearly every purchase was preceded by a visit to a product information page. Service was used more often in purchase sessions. Offline pages were also visited by more than 50% of the user sessions.

Figure 3-8 shows the purchase sessions with direct delivery and pick-up preferences. The results are based on a sample of 621 transaction records that have been linked to the respective Web-usage records. Session IDs were used to link the purchase sessions and transaction records (cf. Section 3.1). 326 users preferred direct delivery, whereas 295 preferred pick-up in store.

Figure 3-8: (a) Direct delivery purchase sessions and (b) pick up purchase sessions: normalized numbers of weighted and dichotomized concept visits per session

The 326 sessions with direct delivery preference differed in their navigation behavior from the 295 sessions with pick-up in store preference. Figure 3-8 (a) and Figure 3-8 (b) illustrate the two subgroups’ concept visits. The figures show that the behavior is generally very similar, in particular when one looks at the dichotomized concepts. However, there are two key differences. Nearly all people with pick-up preference looked at offline concepts: they located the nearest shop. In contrast, for customers who chose direct-delivery, the service concept was very important; most probably serving a trust-building function.

The concept conversion rates summarizing this comparison between all four session groups are shown in Table 3-6.


[page 63↓]

Table 3-6: Selected conversion rates in the four sets of sessions

| Base set | H→IC | IC→IP | IP→TA | TA→S | OCR |
|---|---|---|---|---|---|
| all | 0.75 | 0.5 | 0.06 | 0.23 | 0.23 |
| purchase | 0.8 | 0.95 | 0.98 | 0.77 | 0.56 |
| direct delivery | 0.82 | 0.96 | 0.97 | 0.89 | 0.16 |
| store pick-up | 0.78 | 0.93 | 0.99 | 0.64 | 0.997 |

H = home, IC = infcat, IP = infprod, TA = transaction, S = service, OCR = offline conversion rate
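As a plausibility check on Table 3-6, and assuming that the purchase row is computed over the same 621 linked purchase sessions as the two delivery subgroups (which is not stated explicitly), the offline conversion rate of the purchase group should equal the size-weighted mean of the subgroup rates. With rounded table values:

```latex
\mathrm{OCR}_{\text{purchase}}
  \approx \frac{326 \cdot 0.16 + 295 \cdot 0.997}{621}
  = \frac{52.16 + 294.115}{621}
  \approx 0.56
```

The result agrees with the tabulated value of 0.56 up to rounding.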

We also investigated the store locator visits in more detail. We found that, in the set of all sessions, 13% of all user sessions included at least one invocation of the store locator concept (SLV = 13%). This number demonstrates the importance of the multi-channel concept. For more than 6% of the sessions, pages belonging to the store locator were used as the exit page (SLE = 6%). This indicates a group of visitors that collects information online before locating the nearest store. The store locator was also a concept with a high percentage of one-click visitors (12.5%). The behavior pattern of one-click visitors on the store locator is interesting as it indicates shoppers who are solely interested in finding the nearest retail store. Thus, they use the Web as a type of “yellow pages”.

3.4.5 Summary and implications

On the Web, unlike in a physical store, it is feasible and economical to measure conversion at a much finer level of detail; the inspection of path-dependent conversion rates may therefore yield valuable insights into a retailer’s success in funneling consumers through a Web site before a purchase is made. From a marketing point of view, the proposed metrics provide site managers with arguments why a Web site contributes significantly to a retailer’s overall success even though this might not be reflected in actual Web sales figures. Fine-grained conversion rates allow the analyst to determine bottlenecks in the buying process, and the newly introduced offline conversion rate is an indicator for the site’s success in inducing offline sales.12 The overview of Web metrics, their requirements and potential uses provides site analysts with a platform to efficiently determine [page 64↓]conversion success.

In the case of the multi-channel retailer, the results indicate that (a) purchase sessions have a much “broader funnel” than the average session, i.e., the large majority of users in purchase sessions proceed from each step to the subsequent one. (b) For sites with high percentages of direct delivery preferences, it is very important to maintain helpful service pages. (c) The analysis has shown that offline pages in general, and the store locator in particular, are highly relevant for transactions, particularly for customers with a preference for pick-up in store. We found that nearly one-fourth of all Web site visitors in our sample accessed the offline concept, which indicates the importance of physical stores to a Web site. (d) Lastly, our results indicate that not all visitors accessed the site via the home concept. Thus, the Web site should further analyze how visitors access and browse the site in order to identify the most profitable referrers and navigation paths.

3.5  Session cluster analyses

This section proposes a set of Web analyses that groups online visitors according to their interests, as evidenced by their browsing behavior. The results are useful to determine and segment users’ browsing behavior in order to improve site design and to derive information about a site’s success in attracting specific groups of visitors.

We distinguish three types of clustering approaches depending on the data used:

Single-session clustering. Different clustering techniques have been applied to user sessions: k-means [Mobasher, et al., 2002; Shahabi, et al., 1997], hierarchical clustering using concept hierarchies to describe visited pages [Fu, et al., 1999], or more encompassing descriptions to create user profiles [Heer and Chi, 2002; Mobasher, et al., 2000].

Multi-session clustering. By taking the set (or sequence) of all accesses associated with one cookie instead of the set (or sequence) of all accesses within one session, the basic unit of analysis again becomes the user. It can be expected that knowledge about multiple sessions of single users on the same site could lead to a number of valuable insights; every follow-up session of a single user could be used to confirm users’ interest in that information section. However, a repeat visit could also mean that information was not found. Furthermore, cookies reidentify visitors, not individuals. The predictive value of such information should therefore not be overestimated.

Transaction clustering. By adding demographic data about a user as further variables to the feature vector defined by that user’s navigation, further insights could be gained. Promising candidates for an analysis of multi-channel behavior include transaction [page 65↓]preferences (offline pick up, online payment, returns to stores, etc., cf. Section 3.3.2), or demographic data such as income. The combined analysis can provide useful insights into consumer preferences, as the example in the following section demonstrates.

3.5.1  Transaction clusters

We analyzed session clusters for the two transaction groups of online customers, one preferring direct delivery, the other pick-up in store (cf. Section 3.4.4). By again investigating their visits to the different concepts, we derive information about specific user profiles. Using k-means, we clustered the two groups of purchase sessions that have a preference for direct delivery and pick-up in store.
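For illustration, a minimal k-means sketch (Lloyd's algorithm with squared Euclidean distance) on weighted-concept session vectors is shown below. It is not the implementation used for the analysis: the toy vectors, the random initialization, the seed and the choice of k = 2 are assumptions made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (cluster centers, cluster labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each session to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical weighted-concept vectors:
# (home, infcat, offline, infprod, service, transaction) visits per session.
session_vectors = [
    [2, 2, 0, 3, 3, 2],
    [1, 2, 0, 3, 2, 2],
    [2, 3, 0, 4, 3, 2],
    [4, 6, 19, 8, 12, 4],
    [4, 5, 18, 8, 11, 4],
    [5, 6, 20, 9, 12, 4],
]

centers, labels = kmeans(session_vectors, k=2)
for center in centers:
    print([round(x, 1) for x in center])
```

The printed cluster centers correspond to the kind of per-concept averages reported for each cluster; because k-means is sensitive to outliers and to initialization, results on real data should be checked across several seeds.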

We obtained five clusters for each group, as shown in Table 3-7 (a) and (b).

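Table 3-7: Cluster centers of weighted-concept purchase sessions with (a) direct delivery preference and (b) pick-up in store preference

(a)

| Cluster | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Home | 2 | 1 | 2 | 2 | 2 |
| Infocat | 7 | 2 | 4 | 23 | 16 |
| Offinfo | 3 | 0 | 1 | 2 | 0 |
| Infprod | 6 | 3 | 12 | 21 | 5 |
| Service | 10 | 3 | 2 | 4 | 4 |
| Transact | 6 | 2 | 3 | 4 | 4 |
| Number of cases | 29 | 188 | 45 | 15 | 37 |

(b)

| Cluster | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Home | 1 | 4 | 18 | 1 | 4 |
| Infocat | 22 | 30 | 6 | 1 | 6 |
| Offinfo | 1 | 7 | 1 | 5 | 19 |
| Infprod | 1 | 27 | 5 | 22 | 8 |
| Service | 5 | 4 | 0 | 0 | 12 |
| Transact | 3 | 7 | 3 | 3 | 4 |
| Number of cases | 25 | 15 | 147 | 40 | 55 |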

Table 3-7 (a) shows visitors who chose direct delivery. They tend to be “true online users” (all clusters tend to rarely visit the offline concept). They fall into five subgroups: the largest group (cluster 2) tends to visit all other concepts except offline information. The number of page impressions is small. Groups 3, 4 and 5 tend to visit the semantically related concepts infcat and infprod and can be characterized as typical information seekers [Moe, 2001]. A small group (cluster 1) focuses on service-related information and exhibits the highest number of page impressions in this cluster group. The results are highly significant with p < 0.0001. Twelve sessions have been eliminated due to outlier sensitivity in k-means.


[page 66↓]

Table 3-7 (b) shows visitors who picked up their purchase in-store. They tend to be “true multi-channel users” (nearly always visiting the offline concept). The largest subgroup (cluster 3) takes advantage of all the site’s information offers and visits the offline concept at least once. A smaller subgroup (cluster 5) appears to arrive with prior knowledge of the intended product choice; these users do not need to consult the catalog or refer to service pages extensively but move directly to the service, offline and transaction concepts. This may be interpreted as showing that these users combine the wish for a fast transaction process (online) with the reassurance that, because they will pick up the product in-store, problems that may surface can be solved then. Clusters 1, 2 and 4 all focus on the concepts infcat and infprod before they move to the transaction concept. The results are highly significant with p < 0.0001, with the exception of the home concept (p < 0.15).

Similarities in the information behavior exist between cluster group 1 (pick-up) and group 2 (direct delivery). Cluster 1 in group 2 and cluster 5 in group 1 look at many catalog sites before moving to the transaction process; cluster 2 in group 2 and cluster 4 in group 1 both intensively explore information catalog and product information pages; cluster 4 in group 2 and cluster 3 in group 1 primarily look at product information.

3.5.2 Summary and implications

The presented clustering method demonstrated how user groups can be segmented based on Web usage data and how Web user data can further enrich the analysis. The analysis found several session clusters exhibiting a distinctive interest in offline information. These clusters indicate groups of site visitors that use traditional channels for purchases. The analyses are useful for Web marketing [Moe, 2001] and for Web applications such as recommendation engines or personalization systems that require a model of user behavior, which will be discussed in more detail in Chapter 5. Site managers can also use the analysis results to make the online presence more appealing to the most profitable target groups. For example, site managers could improve the links between Web pages that are visited together. Our transaction clusters support the identification of those sets of pages that may lead to a purchase.

3.6  Demographic and order analyses

This section of our analysis framework will present a set of Web analyses for customer segmentation based on demographic and order characteristics.

Section 3.6.1 calculates the distance-to-store metric, which measures the distance between customers’ zip code locations and the nearest store of the retailer, and compares it with purchase proclivity. The results can be useful to determine a Web site’s success in attracting new online customers, to determine places for new shop openings [page 67↓] and to investigate cross-channel effects between online and offline sales channels.

The second set of analyses focuses on the question of a customer’s value to a company. Section 3.6.2 introduces the revenue concentration and the Gini coefficient, which analyze the cumulative revenue generated by a cumulative proportion of customers. Section 3.6.3 introduces an index of customer value, which is based on the purchase variables frequency, recency and monetary value.

The analyses are calculated based on transaction data from the multi-channel retailer and on demographic data that has been acquired from Deutsche Post Direkt [Deutsche Post Direkt GmbH, 2004].

3.6.1  Distance-to-store distribution

This section investigates whether the distance from an online customer’s zip code location to the nearest physical shop has an influence on purchase proclivity. Two outcomes appear plausible: people who live farther away from a shop may have the same probability of becoming an online customer (easily substituting online purchases for visits to physical stores), or they may have a lower proclivity to purchase online (possibly because of a lack of trust in an online-only retailer). A third, though unexpected, option is that they may have a higher proclivity to purchase online. To answer these questions, a data set of online customers with home addresses distributed across the country is needed.

Our sample of 13,653 online customers was spread over an area of approximately 80,000 square kilometers (km²). Data was acquired that links each zip code area to a longitude/latitude value. The zip code areas covered $\bar{x}_{av}$ = 43 km² on average, with values ranging from 2 to 200 km². For most countries, geographical data is also available on a more fine-grained basis, such as on street and household level. However, for the purpose of a first approximation and demonstration of the measuring technique, five-digit zip code data was regarded as sufficient to match geographic coordinates with a customer’s location.

We investigated this question by analyzing the larger sample of 13,653 customer records. Distance to the nearest store was calculated under the following assumptions: (a) customer, shop, and population are located at the center of their respective zip code areas; (b) home address and shipping address are identical [page 68↓] (negligible error13); and (c) the online purchasing probability is equally distributed among the population.

We then calculated minimal distances between customer zip code and shop zip code14. The mean distance was $\bar{x}_{min}$ = 10.01 km with a standard deviation of $s_{min}$ = 9.32 km. For the number of customers per zip code area, it was found that $\bar{x}_{cus}$ = 2.98 with $s_{cus}$ = 2.81.

The mean population density for zip code areas was $\bar{x}_{pop}$ = 12,469 with a standard deviation of $s_{pop}$ = 58,891. Then the correlation was measured between the number of customers from each zip code area – normalized with the respective population density in each zip code area – and their distance to the next shop.

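Thus, let x be the number of online customers divided by the number of inhabitants in a given zip code area, y the distance from that area to the next store, and n the number of zip code areas. The distance-to-store correlation $r_{dst}$ can then be calculated as

$$r_{dst} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left(n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right)\left(n \sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right)}}.$$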
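The calculation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for the study: the great-circle (haversine) distance between zip code centroids is our assumption, since the text does not name the distance formula, and the product-moment correlation is likewise assumed. The function names and the record layout (`centroid`, `customers`, `inhabitants`) are hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude points."""
    radius = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def nearest_store_km(centroid, stores):
    """Minimal distance from a zip code centroid to any store centroid."""
    return min(haversine_km(centroid[0], centroid[1], s[0], s[1]) for s in stores)

def pearson_r(xs, ys):
    """Product-moment correlation of two equally long sequences."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

def distance_to_store_correlation(areas, stores):
    """areas: dicts with 'centroid' (lat, lon), 'customers' and 'inhabitants'."""
    # x: customers normalized by the number of inhabitants of the zip code area
    xs = [a["customers"] / a["inhabitants"] for a in areas]
    # y: distance from the area's centroid to the nearest store
    ys = [nearest_store_km(a["centroid"], stores) for a in areas]
    return pearson_r(xs, ys)
```

A negative return value of `distance_to_store_correlation` would correspond to the pattern reported in this section, in which areas farther from a store contain fewer customers per inhabitant.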

Figure 3-9 shows that the larger the distance of a region to the nearest shop, the fewer customers this region contains.


[page 69↓]

Figure 3-9: Histogram displaying the number of online customers and distance to store

We found a weak correlation of r = -0.3 with p < 0.001. This result could be an artifact if regions that are farther away from a shop (e.g., rural regions) simply contain fewer residents. However, in comparison, the relationship between population density in a zip code area and the distance to the next shop is so weak (r = 0.01; p < 0.001) as to be practically meaningless. That is, the presence of a physical store in one’s vicinity appears to heighten the probability of shopping online with that company. What effects does the vicinity of a store have, then, on transaction preferences? There is indeed evidence of the expected relationship: customers from the all-customers sample who picked up their purchases in-store (n = 9073) lived, on average, 7.87 km from the nearest branch, while those who chose direct delivery (n = 4580) lived, on average, 12.15 km away. This relation was also mirrored in our online sample (average distance of direct-delivery customers from the nearest shop, n = 621: 13.01 km). Delivery preference, in turn, can be linked to Web usage behavior, as we have seen above. The geographic distribution of stores and customers is depicted in Figure 7-3 of the Appendix.

The results are consistent with [Kohavi, 2003], who found that people who live farther away from retail stores spend more on average and account for most of the online revenues. Our results are also consistent with the findings of the multivariate analysis of user perceptions in Chapter 2, where online consumers’ trust in an e-shop was influenced by the perceived size and reputation of a retailer’s physical presence.

Summing up, a Web site must cater to the needs of those prospects who need to rely on direct delivery, in particular by providing adequate information about the company, the products and transaction terms in its service pages. Besides this rather evident conclusion, a site could use the geographical findings as an indicator for the site’s [page 70↓] success in attracting new customers through the Web. Consumers who live far away from the next shop are less exposed to physical stores and more likely to purchase online. Finally, the findings could be used to determine places for new shop openings in order to utilize the observed cross-channel effects between the Internet and a dense store network. Combined with information about the offline conversion rate, it may encourage companies to further integrate their online and offline marketing.

3.6.2  Concentration indices

A Web retailer must generate revenue to be successful. Thus, one of the most important segmentation criteria is the revenue contribution of customers. A Web site should cater in particular to the needs of those customers who generate the highest revenue.

In order to find out if there is a group of customers with a high revenue contribution, the Lorenz curve can be drawn, which is a useful method to depict, calculate and compare the revenue concentration in a customer sample. The Lorenz curve is defined as the function of the cumulative proportion of ordered individuals in subsets mapped onto the corresponding cumulative proportion of their size [Lorenz, 1905].

Given a sample of n customers ordered by ascending revenue, with $r_i$ the revenue of the i-th customer, the Lorenz curve can be expressed as

$$L\left(\frac{k}{n}\right) = \frac{\sum_{i=1}^{k} r_i}{\sum_{i=1}^{n} r_i}, \quad k = 0, 1, \dots, n.$$

In the case of the multi-channel retailer, the Lorenz curve revealed that 20% of the retailer’s customers generate 60% of the revenues. Though the often-cited Pareto rule that 20% of customers typically generate 80% of revenue [Koch, 1998] could not be fully confirmed, a tendency towards revenue concentration could be observed.

The Gini coefficient is a summary statistic of the Lorenz curve and a measure of inequality in a population. The Gini coefficient G is defined as

$$G = 1 - \sum_{i=0}^{n-1} \left(\partial Y_{i+1} + \partial Y_i\right)\left(\partial X_{i+1} - \partial X_i\right),$$

where $\partial X_i$ and $\partial Y_i$ are the cumulative percentages of the population variable $X_i$ and of the income (or revenue) variable $Y_i$, with $\partial X_0 = \partial Y_0 = 0$, and n is the number of observations. G ranges from a minimum value of zero (total equality) to a theoretical maximum of one (total inequality). In the sample of 13,653 online customers at the multi-channel retailer, the Gini coefficient was G = 0.41.
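Both concentration measures can be computed directly from a list of per-customer revenues. The following sketch assumes the standard cumulative-share construction of the Lorenz curve and a trapezoid-based summation for the Gini coefficient; the function names are hypothetical and the code is an illustration only, not the implementation used to obtain the figures reported above.

```python
def lorenz_points(revenues):
    """Cumulative (customer share, revenue share) points, starting at (0, 0)."""
    ordered = sorted(revenues)  # ascending revenue
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for k, revenue in enumerate(ordered, start=1):
        running += revenue
        points.append((k / n, running / total))
    return points

def gini(revenues):
    """One minus twice the area under the Lorenz curve (trapezoid rule)."""
    pts = lorenz_points(revenues)
    summed = sum((y1 + y0) * (x1 - x0)
                 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1.0 - summed

def top_share(revenues, fraction):
    """Revenue share generated by the top `fraction` of customers."""
    ordered = sorted(revenues, reverse=True)
    k = max(1, round(fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)
```

For a finite sample of n customers, this construction reaches at most 1 - 1/n, which approaches the theoretical maximum of one only for large samples. A statement such as "20% of customers generate 60% of revenue" corresponds to `top_share(revenues, 0.2)` returning 0.6.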

3.6.3  Recency, frequency, monetary value

The question arises whether revenue is a reliable indicator of a customer’s value to the company. Is a one-time customer who spends a lot in a single transaction more [page 71↓] valuable than a customer who spends less but more frequently on a long-term basis? Further purchase characteristics need to be examined to segment customers according to their value to a company. A typical index for determining customer value is based on three variables: the time of the most recent purchase (recency), the number of orders placed (frequency) and the amount of money spent15 (monetary value) within a specific time frame [Miglautsch, 2000].16

In order to calculate the index, the following scores have been assigned to the three purchase characteristics:

Table 3-8: Recency, frequency and monetary value scores

Score | Recency of last purchase | Frequency of purchases | Monetary value

1 | > 6 months ago | one per year | < 200 euros

2 | 3 to 6 months ago | 2-3 per year | 200-600 euros

3 | < 3 months ago | > 3 per year | > 600 euros

Customers were grouped according to their purchase characteristics. In total, 27 segments (3x3x3) were generated from the score combinations. Each segment is identified by a three-digit score code giving the recency, frequency and monetary value scores in that order. For example, the segment with the score code 132 contains all customers whose last purchase took place more than six months ago, who purchased more than three times per year, and whose total purchase value was between 200 and 600 euros.

Customers with the same scores in all three categories were grouped, and the results are depicted in Figure 3-10. The abscissa is partitioned into the 27 segments; the ordinate shows the number of customers that belong to each segment.
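The scoring and grouping can be illustrated with a short sketch. It assumes that the score code lists recency, frequency and monetary value in that order, and it resolves the boundary cases that the score table leaves open (for example exactly three months, or between one and two orders per year) by our own choice. The function names and the customer record layout are hypothetical.

```python
from collections import Counter
from datetime import date

def recency_score(last_purchase, today):
    months = (today - last_purchase).days / 30.44  # average month length (assumption)
    if months > 6:
        return 1
    if months >= 3:
        return 2
    return 3

def frequency_score(orders_per_year):
    if orders_per_year > 3:
        return 3
    if orders_per_year >= 2:
        return 2
    return 1

def monetary_score(total_euros):
    if total_euros > 600:
        return 3
    if total_euros >= 200:
        return 2
    return 1

def rfm_code(last_purchase, orders_per_year, total_euros, today):
    """Three-digit code: recency, frequency, monetary value (assumed order)."""
    return "%d%d%d" % (recency_score(last_purchase, today),
                       frequency_score(orders_per_year),
                       monetary_score(total_euros))

def segment_counts(customers, today):
    """Number of customers per score code; at most 27 distinct codes occur."""
    return Counter(rfm_code(c["last_purchase"], c["orders_per_year"],
                            c["total_euros"], today)
                   for c in customers)
```

The counts returned by `segment_counts` correspond to the bar heights of a histogram over the 27 segments.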


[page 72↓]

Figure 3-10: Recency, frequency, monetary value distribution for 13,653 customers

Segments 113, 211, 112 and 311 contain the most records. These segments rank lowest (1) in at least two variables. Only very few customers rank highest (333) in all three variables. The retailer should subsequently focus its business efforts on the needs of those segments with the highest scores in all three variables. One should note that the data sample of 13,653 customers in this analysis includes purchases from a time period of just eight months. The results will be different for longer time periods. Within the given time frame, the mean transaction amount per order was 672 euros, the mean number of purchases per customer, 1.14, and the mean interpurchase time between two consecutive orders of the same customer, 156 days.

The presented analysis is popular for customer segmentation due to its simplicity. Criticism concerns the creation of equal bins [Miglautsch, 2000]. More fundamental criticism aims at the variables used to determine customer value. Reinartz and Kumar [2003] compared transactions from more than 11,992 households at a catalog retailer over a three-year period and found that scoring approaches resulted in an overinvestment in advertising cost for lapsed customers.

3.6.4 Summary and implications

We demonstrated how users can be further segmented according to demographic and order characteristics.

The distance-to-store analysis, which indicates the site’s success in attracting new customers through the Web, has been calculated. The findings could be used to determine places for new shop openings in order to utilize the observed cross-channel effects between the Internet and a dense store network. Moreover, the correlation [page 73↓] provides insight into the potential relevance (and potential explanatory value) for different service choices in multi-channel retailing.

The concentration indices provide a better understanding of the customers’ revenue contribution to a company’s business success. A customer value index has been suggested that measures the value contribution of distinct customer segments.

The results can also be useful for recommendation and personalization systems [Kobsa, et al., 2001; Sarwar, et al., 2000].

3.7  User typology analyses

This last section of analyses within our framework will introduce a method of pattern discovery that allows the identification of user typologies expressed as browsing strategies. This notion of success is particularly useful for information Web sites where a site’s goal is to attract specific types of online visitors and to keep them returning to the site.

Section 3.7.1 discusses the notion of success for an information site. Section 3.7.2 introduces how behavioral strategies can be modeled on Web usage data. Section 3.7.3 discusses how these strategies can be expressed in a Web mining language. Section 3.7.4 describes the information Web site, and Section 3.7.5 introduces a concept hierarchy for that site. Section 3.7.6 demonstrates how a specific behavioral strategy could be tested against Web usage logs from the information Web site. Section 3.7.7 presents the results and discusses the discovered patterns.

3.7.1  Success for an information site

The analyses presented in the previous sections consider user behavior in the context of Web merchandizing. However, the Internet contains an abundance of non-merchandizing sites, on which similar behavior should be expected. On an information site, objectives of the interaction may be the retrieval of pages on a subject of interest: the enrollment in a course, the identification of an appropriate partner or the application for a job. Thus, success may have different meanings depending on the site’s goals. Events such as filling out a registration or application form, downloading information, ordering a newsletter, using a product configuration tool, signing a contract or contacting a physical person may define conversion success in a non-merchandizing context. This section introduces a method for determining success on an information Web site.

In the following, we apply a Web analysis methodology to the Web log data of a non-merchandizing site. The data owner belongs to the category of organizations that use the [page 74↓] Web mainly as a contact point, where visitors are motivated to seek face-to-face contact. This category encompasses sites of sophisticated services, including Application Service Providers (ASPs), insurance companies and consultancies, as well as companies offering personalized customer support. In the absence of cookie identifiers, sessions were determined heuristically [Berendt, et al., 2001; Berendt and Spiliopoulou, 2000; Cooley, et al., 1999], specifying 30 minutes as the threshold for viewing a single page of a session. After cleaning and preprocessing, the server log contained 27,647 user sessions.
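A page-stay heuristic of this kind can be sketched as follows. The sketch assumes that a visitor is identified by some key derived from the log (for instance IP address plus user agent, which is our assumption and not stated in the text) and that a new session starts whenever more than 30 minutes pass between two consecutive requests of the same visitor. The function name and record layout are hypothetical.

```python
from datetime import datetime, timedelta

PAGE_STAY_LIMIT = timedelta(minutes=30)

def sessionize(requests, limit=PAGE_STAY_LIMIT):
    """Split (visitor_key, timestamp, url) records into sessions.

    Returns a list of (visitor_key, [urls]) pairs, ordered by visitor
    and time. A gap longer than `limit` closes the current session.
    """
    sessions = []
    latest = {}  # visitor_key -> (timestamp of last request, pages of open session)
    for visitor, timestamp, url in sorted(requests, key=lambda r: (r[0], r[1])):
        previous = latest.get(visitor)
        if previous is None or timestamp - previous[0] > limit:
            pages = []                      # start a new session
            sessions.append((visitor, pages))
        else:
            pages = previous[1]             # continue the open session
        pages.append(url)
        latest[visitor] = (timestamp, pages)
    return sessions
```

Robot and administrator requests would have to be filtered before this step, as described for the case site.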

3.7.2  Modeling strategies as sequences of tasks

The process of becoming a customer has been described for e-commerce sites in Section 3.3.1 where the purchase process has been used as a model for site design and for the interpretation of the behavior of potential customers. This task-oriented view on browsing behavior can be useful in the context of information Web sites, too.

More generally, we define a “strategy” as a sequence of tasks, beginning at a start-task, ending at a target-task that corresponds to the fulfillment of the objective of the interaction, and containing an arbitrary number of intermediate tasks.

Hence, if we observe the set of conceivable tasks in an application as a set of symbols S, a strategy is a regular expression involving at least two symbols from S (the start-task and the target-task) and, optionally, a number of wildcards. Borrowing from the conventions on regular expressions upon strings, we propose the following notation for the representation of strategies:

A strategy is a sequence of symbols from the set of tasks S, optionally interleaved with an arbitrary number of associated wildcards.

A wildcard has the form [n;m], where n is a non-negative integer, m is a non-negative integer or a symbol denoting infinity, and n ≤ m.

A wildcard [n;m] appears as suffix to a task or a parenthesized subsequence of symbols, indicating that this task or subsequence should occur at least n and at most m times.

The first and the last element of a strategy and of any subsequence suffixed by a wildcard are tasks from S, i.e. they may not be wildcards.

The first task or subsequence of a strategy may be prefixed by a special symbol # indicating that this task is the very first occurring in data records conforming to the strategy.


[page 75↓]

Similarly to string matching for regular expressions, a strategy is matched against sequences of events from the dataset. In Web usage mining, these sequences are user sessions derived from the Web server log [Cooley, et al., 1999].
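Because the notation borrows from regular expressions, its semantics can be illustrated by translating a strategy into an ordinary regular expression over a session's task sequence. The sketch below is our own illustration and not part of WUM or MINT; the classes `Task` and `Group` and the functions `to_regex` and `conforms` are hypothetical names, and task names are assumed not to contain a semicolon.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class Task:
    name: str                # a symbol from the task set S
    lo: int = 1              # wildcard lower bound n
    hi: Optional[int] = 1    # wildcard upper bound m; None stands for infinity

@dataclass
class Group:
    items: List[Union[Task, "Group"]]  # parenthesized subsequence
    lo: int = 1
    hi: Optional[int] = 1

def _quantifier(lo, hi):
    if lo < 0 or (hi is not None and lo > hi):
        raise ValueError("a wildcard [n;m] requires 0 <= n <= m")
    return "{%d,%s}" % (lo, "" if hi is None else hi)

def _render(element):
    if isinstance(element, Task):
        body = re.escape(element.name) + ";"   # each task is terminated by ';'
    else:
        body = "".join(_render(e) for e in element.items)
    return "(?:%s)%s" % (body, _quantifier(element.lo, element.hi))

def to_regex(strategy, first_in_session=False):
    """Compile a strategy; first_in_session plays the role of the # prefix."""
    start = "^" if first_in_session else "(?:^|;)"  # match only at task boundaries
    return re.compile(start + "".join(_render(e) for e in strategy))

def conforms(session, pattern):
    """True if the list of task names contains a match for the strategy."""
    return pattern.search(";".join(session) + ";") is not None
```

For example, a strategy that starts at `Home` and then repeats a block of one or more `BackgroundInfo` tasks followed by one or more `DetailInfo` tasks could be written as `[Task("Home"), Group([Task("BackgroundInfo", 1, None), Task("DetailInfo", 1, None)], 1, None)]` and compiled with `first_in_session=True`.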

3.7.3  Expressing strategies in a mining language

The specification of a strategy according to the notation used in the previous section is appropriate for strategy generation. However, in order to discover patterns adhering to an anticipated strategy, we must formalize the strategy in a mining language.

Our method of pattern discovery uses the specification of the behavioral strategy itself as guidance to the analysis software. Findings from cluster analysis or association rule mining (cf. Section 3.3.3 or Section 3.5.1) can be used as guidance for the strategy specification.

Hence, the challenge lies in modeling the behavioral strategies of users in such a way that they can be tested against Web usage data.

For this purpose, we use the Web mining language MINT of WUM (Web Utilization Miner) [Spiliopoulou, 1999; Spiliopoulou and Faulstich, 1999].

In MINT, a strategy is mapped onto a template. A template is similar to a regular expression, comprised of variables and wildcards. A task that should appear in a strategy corresponds to a bound variable. A wildcard in a strategy is directly mapped into a wildcard of the template. The constraints for the first and last elements of a strategy are also valid for templates.

During data mining, templates are matched against groups of sessions: a session matches a template if it contains all tasks contained in the template in the appropriate order and, further, satisfies all constraints posed by the template. In the context of strategy evaluation, strategies express the expected behavior of users, while sessions reflect the actual behavior recorded in the Web server log. Thus, a session is “conformant” to a strategy if and only if it matches the template expressing the strategy.

3.7.4  An informational Web site

The Web site of the case provides information material and contact points on several services. Visitors access the site to be informed about the company, its mission and profile, its product portfolio, its credentials, partners and reference customers. Conversion corresponds to the initiative of the visitor to contact or be contacted by the company. In some sites, the execution of a “Contact” task is a unique event during a session: the user provides her contact data, so that a meeting can be arranged. In other sites, [page 76↓] including the one at hand, a contact task may be the acquisition of information material on a given product or the registration for an event organized by the company. In such a case, a contact task may be executed multiple times, once per product or event of interest. Hence, a session may contain multiple “Contact” task invocations.

Its users include potential members, actual members, institutional partners, personnel and press. For the purposes of the analysis, we have concentrated on the behavior of potential members and have removed all sessions that could be identified as belonging to actual members or personnel, as well as visits of robots, archivers and administration services, which are identified by their IP address. Invocations of components of each individual page (tables, images, script invocations) were coerced into a single page view by a site expert.

3.7.5  Task-based site taxonomy

Figure 3-11 shows the task-based taxonomy of the Web site. The service pages provide primarily information for existing customers, including services and responsible contact persons. The research pages contain information about important projects and relevant reports. Of special interest for our study is the branch under marketing/public relations (PR). Here we aggregated all pages primarily dedicated to marketing purposes. Information pages under acquisition contain detailed information of programmes offered by the organization. Pages providing online registration forms, detailed contact data or downloads of application material were summarized under registration.

For the given information site, the conventional customer purchase process must be replaced by a reasonable sequence of tasks modeled in the concept hierarchy. In our example, “Conversion” corresponds to the establishment of a contact, i.e. to the execution of a “Contact” task according to Figure 3-11.

Figure 3-11 also shows how each concept was assigned to one of the three phases of the online information process: background information, detail information and contact. The registration pages were assigned to the contact phase, while the acquisition-related information pages were mapped onto the detail information phase. All remaining pages were treated as providing background information.
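The assignment of pages to concepts and of concepts to tasks can be illustrated with a small lookup. The URL prefixes below are purely hypothetical, since the site's actual paths are not given in the text, and treating the start page as a separate “Home” task is our assumption based on the strategy notation; only the concept-to-phase rules follow the description above.

```python
# Hypothetical URL prefixes; the real site's paths are not given in the text.
CONCEPT_BY_PREFIX = {
    "/service/": "service",
    "/research/": "research",
    "/pr/": "marketing/PR",
    "/programmes/": "acquisition",
    "/registration/": "registration",
}

# Registration pages belong to the contact phase, acquisition pages to the
# detail information phase; every other concept is background information.
TASK_BY_CONCEPT = {
    "registration": "Contact",
    "acquisition": "DetailInfo",
}

HOME_URLS = ("/", "/index.html")  # assumed start pages

def task_for(url):
    """Map a page URL to the task it represents in the taxonomy."""
    if url in HOME_URLS:
        return "Home"
    for prefix, concept in CONCEPT_BY_PREFIX.items():
        if url.startswith(prefix):
            return TASK_BY_CONCEPT.get(concept, "BackgroundInfo")
    return "BackgroundInfo"

def to_task_sequence(urls):
    """Translate a session's page views into a sequence of tasks."""
    return [task_for(url) for url in urls]
```

Sessions translated this way consist of task names only, which is the level at which strategies are specified.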


[page 77↓]

Figure 3-11: Task-oriented taxonomy of the information site

3.7.6  Mining queries for template matching

This section shows an example of how a behavioral strategy – namely the knowledge building strategy proposed by Moe [Moe, 2001] – could be tested against Web usage logs. The study of background information, corresponding to the invocation of the “BackgroundInfo” task in the taxonomy above, is expected to characterize the knowledge builders [Moe, 2001]. These users prefer to get the complete picture of the company, to check the mission and verify the trustworthiness of the institution, before deciding to establish a contact. Background information may be acquired before or after executing a “DetailInfo” task. As the behavior of these users cannot be traced beyond a single session, we have concentrated on a subgroup of knowledge builders, namely those that acquire enough information about the company and establish a contact within the same session. It should be noted that Moe’s model cannot be applied in its entirety, because it contains strategies that are only relevant for e-commerce sites.

According to the task-oriented taxonomy of Figure 3-11, the knowledge building strategy has the form:

# Home (BackgroundInfo[1;n] DetailInfo[1;n])[1;n]

We use the mining language MINT to extract the pattern for the templates of the knowledge-building strategy. MINT is an SQL-like mining language for the specification of templates and of constraints upon them. The full syntax of MINT is presented in [Spiliopoulou and Faulstich, 1998].


[page 78↓]

Table 3-9: Strategy specification in MINT

SELECT t

FROM NODE AS x y z w, TEMPLATE # x y * w * z AS t

WHERE x.url = "Home" AND y.url = "BackgroundInfo"

AND wildcard.w.url = "BackgroundInfo"

AND w.url = "DetailInfo"

AND wildcard.z.url ENDSWITH "Info" AND z.url = "Contact"

The template expresses the strategy as a sequence of variables and wildcards. The constraints on x, y, w and z bind the variables. The constraints prefixed with wildcard restrict the contents of the wildcards.

3.7.7  Results and analysis of the discovered patterns

The navigation pattern returned a group of paths. In our case, these are the routes from “BackgroundInfo” to “DetailInfo” and then to the invocation of the “Contact” task. All these paths consist of “BackgroundInfo” and “DetailInfo” tasks. However, one visitor may have invoked “DetailInfo” after asking for “BackgroundInfo” once, while another may have requested “BackgroundInfo” ten times beforehand. Moreover, each path has been entered by a number of visitors, some of whom have followed it to the end, while others have abandoned it.

The invocation of detailed information indicates a serious interest in the offered product or service. Hence, we split the pattern of this strategy into two components, one until the first invocation of “DetailInfo” and one thereafter. The statistics of the first component of the knowledge-building strategy are shown in Figure 3-12. The horizontal axis represents steps, i.e. task invocations. At each step, a number of users ask for detailed information and thus proceed to the second component of the strategy. These users are represented in the cumulative curve labeled “DetailInfo”. The vertical axis shows that of the 6,641 visitors that entered this strategy, about 14% (896 visitors) entered the second component. The remaining ones are depicted in the cumulative curve labeled “Exit”: they did not necessarily abandon the site, but their subsequent behavior no longer corresponds to the knowledge-building strategy. The curve labeled “BackgroundInfo” represents the visitors that ask for further background information. All curves saturate fast, i.e. most users invoke only a few tasks.
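The three curves of the first component can be derived from task sequences with a simple count per step. The sketch below is an illustration under our own assumptions: a session is taken to enter the strategy if it starts with “Home” followed by “BackgroundInfo”, sessions are lists of task names, and the function name is hypothetical. It does not reproduce the output format of the mining software.

```python
def first_component_curves(sessions, max_steps):
    """Count, per step, how visitors leave or continue the first component.

    Returns (entered, detail, exited, background) where `detail` and
    `exited` are cumulative over steps and `background` is per step.
    """
    detail = [0] * max_steps      # cumulative: reached "DetailInfo" by this step
    exited = [0] * max_steps      # cumulative: left the strategy by this step
    background = [0] * max_steps  # per step: asked for further background info
    entered = 0
    for tasks in sessions:
        if tasks[:2] != ["Home", "BackgroundInfo"]:
            continue              # session does not enter the strategy
        entered += 1
        for step in range(max_steps):
            position = 2 + step
            task = tasks[position] if position < len(tasks) else None
            if task == "BackgroundInfo":
                background[step] += 1
                continue
            # Any other task, or the end of the session, ends the first component.
            curve = detail if task == "DetailInfo" else exited
            for later in range(step, max_steps):
                curve[later] += 1
            break
    return entered, detail, exited, background
```

Dividing the last value of `detail` by `entered` gives the share of visitors that proceed to the second component.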

The statistics of the second component are shown in Figure 3-13. After the first invocation [page 79↓] of “DetailInfo”, the 800 visitors that entered the second component acquired detailed or background information, aggregated into the curve “Info” that covers invocations of both tasks. The “Contact”-curve and the “Exit”-curve are again cumulative. The former shows that 10% of these visitors establish contact, and that they do so after a small number of information acquisition tasks. This implies that many information acquisition tasks do not increase the confidence of contact establishment. The large number of users represented by the “Exit”-curve indicates that the strategy does not represent all users. Hence, further tasks should be modeled and more strategies should be investigated.

Figure 3-12: Knowledge-Building Strategy until the first invocation of “Detail Info”

Figure 3-13: Knowledge-Building Strategy until contact establishment

3.7.8 Summary and implications

We addressed the issue of analyzing Web site usage according to the anticipated goals of site visitors. For this purpose, we have presented an approach for the modeling of user activities as tasks in pursuit of a goal, and of sequences of tasks as strategies to achieve this goal. Our framework allows for the description of navigation strategies as anticipated [page 80↓] in marketing literature. However, our model is not limited to Web merchandizing. We have demonstrated the applicability of our approach by analyzing the behavior of two types of visitors on an information Web site.

The modeling of goal-oriented navigation strategies is a non-automatable task. However, the specification of appropriate constructs for the formulation of strategies is essential. The current framework and the mining language we use for the discovery of patterns adhering to a strategy are a first step in this direction.

3.8 Conclusion

Five groups of Web analyses have been presented that constitute our analysis framework. The group of service analyses in Section 3.3 is beneficial for multi-channel retailers in order to determine consumers’ delivery, payment and return preferences. The conversion metrics in Section 3.4 analyze consumers’ navigation behavior on a fine-grained level. The offline conversion rate can be used as an indicator for the site’s success in inducing offline sales. The clustering method presented in Section 3.5 is useful to improve Web site navigation and to identify navigation patterns of online buyers. Section 3.6 presented order and demographic analyses that group users according to demographic and order characteristics. The distance-to-store metric has been defined that indicates the site’s success in attracting new customers. The proposed customer value indices provide a first insight into a customer’s value contribution. The analysis of user typologies in Section 3.7 modeled user activities as sequences of tasks. The method allows searching for specific user navigation patterns in the Web log.

The results of our Web analysis framework should be compared over time. A comparison is beneficial for tracking how modifications of Web site design, product and service offerings or advertising may influence the analysis results and Web site success respectively. Moreover, a company can use the results to identify trends and patterns over time in order to predict future demand in Web site content, services and products.

The clustering results of Section 3.5, the order and demographic characteristics of Section 3.6 and the user typologies of Section 3.7 are particularly useful for user modeling in personalization systems, which will be discussed in more detail in Chapter 5. The importance of Web mining for personalization has been described in related work [Mobasher, et al., 2000; Mulvenna, et al., 2000; Perkowitz and Etzioni, 2000; Spiliopoulou, 2000]. Personalization systems need to acquire a certain amount of information about users’ interests, behavior, demographics and actions before they work efficiently. As multi-channel retailers can collect consumer information from several distribution channels, personalization can be particularly beneficial for these retailers.


[page 81↓]

3.9  Limitations

This chapter has used data samples from a retail and an information site to test the metrics and analytics of our analysis framework. However, site-specific parameters could limit the generalizability of the results. It could be that the specific structure of the Web sites or the products and services offered have an impact on the analysis results. Thus, if a company wants to compare its performance with other sites, site-specific criteria need to be included in the discussion of the results. As our sample of customers and Web logs is relatively large it could be used for comparisons with other sites.

For the development of conversion metrics and the modeling of user search strategies we referred to the purchase decision process, which is a well-known model of consumer purchasing behavior. However, decision processes could be more complex in reality, which may not be captured by the proposed analyses.

The list of 82 metrics and analytics (cf. Table 7-3) is a selection of analyses that covers important aspects of success measurement and customer relationship management for Web sites. It was considered useful by experts and the Web site owners. Of course, the selection is not exhaustive and can be further expanded.


Footnotes and Endnotes

3 With shopping cards customers can earn bonus points for each purchase, which can be redeemed in the form of discounts and/or other incentives. Though data from shopping cards is valuable for marketing, there is a potential bias because cardholders may have a stronger brand loyalty than the average customer.

4 E.g. unique visitors, page views, operating system, average time spent on pages, entry and exit pages, number of clicks or country code, search terms, referrers, server load, request errors, etc.

5 Only those retailers with a large number of stores were counted as retailers operating physical stores.

6 A recent study found that 78 percent of retailers offer in-store returns of online purchases (Shop.org 6.0).

7 The survey was placed on the Web site in 2002 [Teltzrow and Berendt, 2003]. 4267 respondents gave open-text answers to the question “What do you like/dislike about this Web site?”. 345 answers addressed multi-channel services.

8 For simplicity, we assume that in a given time period T, there is only one goal, or several which can be aggregated into one goal along concept hierarchies. This framework treats all users who visit a given class of pages as equal. It may be argued that this represents a simplified description of the complex goal-setting and decision-making processes that users go through when navigating a site. However, this simplification is justified by the purposes of a business-related outcome analysis.

9 Note that conversion, abandonment, etc. are defined relative to the site’s goal, so “customer” in the general case means “person who reached the site’s goal”, and “abandonment” means “abandoning a task on the site whose completion constitutes the site’s goal”.

10 Offline referrers are visits from referring URLs that are uniquely linked to offline stores, such as hits from affiliated stores that provide specific URLs to the main Web site.

11 More fine-grained taxonomies have been developed. However, the depicted aggregation suffices for the purposes of our analyses.

12 It could be supplemented by retailers who track the number of visitors who come into a physical shop with a printout from the Web site.

13 Shipping and billing address were identical for 94% of the customers with delivery preferences. More than two-thirds of the customers specified that their billing address is their home address. One-third refused to provide this information. Most of the customers who preferred to pick up orders chose the store closest to their contact address.

14 MIN [D(km) = ARCCOS (SIN (Latitude CustomerZIP * PI / 180) * SIN (Latitude ShopZIP * PI / 180) + COS (Latitude CustomerZIP * PI / 180) * COS (Latitude ShopZIP * PI / 180) * COS ((Longitude ShopZIP - Longitude CustomerZIP) * PI / 180)) * 6370 (= earth radius in km)]
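The distance computation of footnote 14 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the study: it applies the spherical law of cosines with the longitude difference in the last cosine term and the earth radius of 6370 km given in the footnote. The function names, the representation of ZIP code locations as (latitude, longitude) pairs in degrees, and the reading of MIN as the minimum over all shops are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6370.0  # earth radius as given in footnote 14


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance by the spherical law of cosines (degrees in, km out)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    # Clamp to [-1, 1]: rounding can push the value just outside acos's domain.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle) * EARTH_RADIUS_KM


def nearest_shop_distance_km(customer, shops):
    """Minimum distance from a customer's ZIP location to any shop's ZIP location.

    customer: (latitude, longitude) pair; shops: non-empty list of such pairs
    (hypothetical input format, chosen for this sketch).
    """
    lat_c, lon_c = customer
    return min(distance_km(lat_c, lon_c, lat_s, lon_s) for lat_s, lon_s in shops)
```

A call such as `nearest_shop_distance_km((52.52, 13.40), [(48.14, 11.58), (53.55, 9.99)])` returns the distance in km to the closer of the two hypothetical shop locations.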

15 Often profitability is used instead of revenue.

16 Recency and frequency have also been used in the context of Web site visitors [Cutler and Sterne, 2000]: visit recency measures the time of the most recent visit, and visit frequency measures the number of visits in a time frame.
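The two visitor measures of footnote 16 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the representation of a visitor's history as a list of visit timestamps, and the restriction of both measures to a closed time frame [start, end] are choices made for this example and are not taken from [Cutler and Sterne, 2000].

```python
from datetime import datetime


def visit_recency_and_frequency(visits, start, end):
    """Return (time of the most recent visit, number of visits) within [start, end].

    visits: list of datetime objects for one visitor (hypothetical input format).
    Recency is None if the visitor has no visit in the time frame.
    """
    in_window = [v for v in visits if start <= v <= end]
    recency = max(in_window) if in_window else None
    frequency = len(in_window)
    return recency, frequency
```

Depending on the analysis, recency could instead be reported as the time elapsed since the most recent visit; the sketch returns the timestamp itself, following the wording of the footnote.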



© The compilation and presentation of the content of this publication, as well as its electronic processing, are protected by copyright. Any use not expressly permitted by copyright law requires prior consent. This applies in particular to reproduction, adaptation, and storage and processing in electronic systems.
DiML DTD Version 4.0. Certified document server of the Humboldt-Universität zu Berlin. HTML generated: 14.08.2006